
transcribe_wav - json with timings - subtitles #43

Closed
marcello-pietrobon opened this issue May 31, 2021 · 1 comment

Comments

@marcello-pietrobon

Hello,
first of all a big thank you for your amazing project, which is of the kind 'you saved my life' :)

I'm trying to adapt your code for the transcribe_wav command, using the Kaldi acoustic model type, in order to extract a subtitle file.
I tend to believe Kaldi is the fastest choice and the one with the lowest WER for this job when there is no GPU card on my PC. Do you agree?

Judging from the output I get from transcribe_wav it seems to me that the only thing I'd need is just to have a json output with the timing (start, end) of each spoken word of the transcribed audio file.
Ideally I would have a forced aligner, and in fact I tried to adapt the code from the gentle project:
https://github.com/lowerquality/gentle/blob/master/align.py

Maybe this feature, or something close to it, is already available by changing some options in the kaldi_cmd used in _transcribe_wav_nnet3(); I just don't know.
I've tried to reuse part of gentle's code, but it looks like I would probably need to adapt some C++ code (like the gentle\ext\m3.cc application that gentle uses for this job). Of course I don't want to go that far before asking.

Any suggestions on what I should do, or a workaround?

Thank you,
Marcello

@synesthesiam
Owner

Hi @marcello-pietrobon, thank you for the kind words. I'm glad that voice2json has been able to help you. :)

I would agree that Kaldi is the best choice for now. I'm in the process of upgrading my DeepSpeech code to 0.9.3, so that may make the choice more complicated in the future (in a good way).

Have you tried the transcribe-stream command? If you run it like this:

$ voice2json transcribe-stream --event-sink /dev/stdout

you'll see timing messages that tell you when the voice command started and stopped (in seconds since the audio began). Those timings, plus the transcript that follows, should hopefully be what you're looking for.
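Once you have a start time, a stop time, and a transcript, turning them into a subtitle cue is mostly a formatting exercise. Below is a minimal Python sketch of that last step; the event format itself varies, so the start/stop values here are placeholders you would fill in from the actual messages, not something voice2json emits in exactly this shape:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, start, end, text):
    """Build one SRT subtitle block from start/end times in seconds."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

# Hypothetical values: start/stop seconds taken from the event messages,
# text taken from the transcript that follows them.
print(srt_cue(1, 1.25, 3.8, "turn on the living room lamp"))
```

This only gives per-utterance timing (one cue per voice command), not per-word timing; for word-level timestamps you would still need a forced aligner like gentle.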

Also note that transcribe-stream takes raw audio instead of WAV, so you'll need to do something like:

$ sox input.wav -r 16000 -b 16 -c 1 -e signed-integer -t raw - | voice2json transcribe-stream --audio-source - ...
