Hello,
First of all, a big thank you for your amazing project, which is of the 'you saved my life' kind :)
I'm trying to adapt your code for the transcribe_wav command, using the Kaldi acoustic model type, in order to extract a subtitle file.
I tend to believe Kaldi is the fastest choice, and the one with the lowest WER for this job, when there is no GPU card in the PC. Do you agree?
Judging from the output I get from transcribe_wav, it seems that the only thing I need is JSON output with the timing (start, end) of each spoken word in the transcribed audio file.
Best would be to have an aligner, and in fact I tried to adapt the code from the gentle project: https://github.com/lowerquality/gentle/blob/master/align.py
Maybe this feature, or something close to it, is already available by changing some options in the kaldi_cmd used in _transcribe_wav_nnet3(); I just don't know.
I've tried to reuse part of the gentle code, but I see that I would probably need to adapt some C++ code (like the gentle\ext\m3.cc application code that gentle uses for this job), and of course I don't want to go that far before asking.
Any suggestions on what I should do, or a workaround?
Thank you,
Marcello
Hi @marcello-pietrobon, thank you for the kind words. I'm glad that voice2json has been able to help you. :)
I would agree that Kaldi is the best choice for now. I'm in the process of upgrading my DeepSpeech code to 0.9.3, so that may make the choice more complicated in the future (in a good way).
Have you tried the transcribe-stream command? If you run it like this:
You'll see the timing messages that tell you when the voice command has started and stopped (in seconds since the audio began). That plus the following transcript should hopefully be what you're looking for.
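Once start and stop times (in seconds) are paired with a transcript, turning them into subtitle entries is mostly a formatting step. Here is a minimal sketch of that step, assuming the timings have already been collected into `(start, end, text)` tuples. That tuple layout and the names `format_timestamp` and `to_srt` are hypothetical, chosen for illustration; they are not part of voice2json or its output format.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from an iterable of (start, end, text) tuples.

    The tuple layout is an assumption made for this sketch; adapt it
    to whatever structure the timing messages are parsed into.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text}\n"
        )
    # SRT entries are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    example = [(0.0, 1.25, "turn on the light"), (2.0, 3.5, "what time is it")]
    print(to_srt(example))
```

This gives one subtitle entry per detected voice command rather than per word; word-level timing would still need an aligner or word-level output from the decoder.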
Don't forget too that transcribe-stream takes raw audio instead of WAV, so you'll need to do something like: