
transcribe_wav - json with timings - subtitles #43

Closed
marcello-pietrobon opened this issue May 31, 2021 · 1 comment

Comments

@marcello-pietrobon

Hello,
first of all a big thank you for your amazing project, which is of the kind 'you saved my life' :)

I'm trying to adapt your code for the transcribe_wav command, using the Kaldi acoustic model type, in order to extract a subtitle file.
I tend to believe Kaldi is the fastest choice and the one with the lowest WER for this job when there is no GPU card on my PC. Do you agree?

Judging from the output I get from transcribe_wav it seems to me that the only thing I'd need is just to have a json output with the timing (start, end) of each spoken word of the transcribed audio file.
Ideally I would have a forced aligner, and in fact I tried to adapt the code from the gentle project:
https://github.com/lowerquality/gentle/blob/master/align.py

Maybe this feature, or something close to it, is already available by changing some options in the kaldi_cmd used in _transcribe_wav_nnet3(); I just don't know.
I've tried to reuse part of gentle's code, but it looks like I would probably need to adapt some C++ code (like the gentle\ext\m3.cc application that gentle uses for this job). Of course I don't want to go that far before asking.

Any suggestions on what I should do, or a workaround?

Thank you,
Marcello

@synesthesiam
Owner

Hi @marcello-pietrobon, thank you for the kind words. I'm glad that voice2json has been able to help you. :)

I would agree that Kaldi is the best choice for now. I'm in the process of upgrading my DeepSpeech code to 0.9.3, so that may make the choice more complicated in the future (in a good way).

Have you tried the transcribe-stream command? If you run it like this:

$ voice2json transcribe-stream --event-sink /dev/stdout

you'll see timing messages that tell you when the voice command started and stopped (in seconds since the audio began). Those timings, plus the transcript that follows, should hopefully be what you're looking for.
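Once you have a start time, a stop time, and a transcript, turning them into a subtitle cue is mostly a formatting exercise. Below is a minimal Python sketch of that last step; the event format itself varies, so the start/stop values here are placeholders you would fill in from the actual messages, not something voice2json emits in exactly this shape:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, start, end, text):
    """Build one SRT subtitle block from start/end times in seconds."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

# Hypothetical values: start/stop seconds taken from the event messages,
# text taken from the transcript that follows them.
print(srt_cue(1, 1.25, 3.8, "turn on the living room lamp"))
```

This only gives per-utterance timing (one cue per voice command), not per-word timing; for word-level timestamps you would still need a forced aligner like gentle.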

Also note that transcribe-stream takes raw audio instead of WAV, so you'll need to do something like:

$ sox input.wav -r 16000 -b 16 -c 1 -e signed-integer -t raw - | voice2json transcribe-stream --audio-source - ...
