my adventures with GpWhisper: log to files the different commands ? #122
Comments
@teto Would really appreciate a short guide on how to get local whisper working for english transcription on your device! |
I would rather make the plugin more approachable with things such as #125, more logging, etc. The merge of the multiple providers support is very good news. Now it depends on my available time :'( and merges. |
Is there a way i can help with the logging part? |
@mecattaf that would be awesome. We talked a bit about it at #125 (comment) . It's important to discuss the implementation with @Robitx first. |
@mecattaf what's not working for you? main works nicely for me, and some of the configuration is setup dependent. We can add checks to verify, for instance, that the generated mp3 is not empty. More logging should help though, so maybe add a PR that logs at crucial times, for instance when running external commands. |
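To illustrate the "log when running external commands" idea, here is a minimal sketch in Python (gp.nvim itself is Lua, so this only shows the shape of the idea, not the plugin's code). The names `make_logger` and `run_logged` and the logger name are made up for this example.

```python
import logging
import shlex
import subprocess


def make_logger(log_path):
    """Create a logger that appends timestamped lines to log_path."""
    logger = logging.getLogger("gp_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def run_logged(logger, argv):
    """Run an external command and log the command line, exit code and stderr."""
    logger.info("running: %s", shlex.join(argv))
    result = subprocess.run(argv, capture_output=True, text=True)
    logger.info("exit code: %d", result.returncode)
    if result.stderr.strip():
        logger.warning("stderr: %s", result.stderr.strip())
    return result
```

With something like this wrapped around the recording, conversion and upload steps, a failing step would show up in the log file with its exact command line and stderr instead of requiring ad hoc prints.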
@teto could you be so kind to share your config file? Or perhaps a short guide on how to set it up? Many thanks! |
@Robitx suggested in Discord to share these config examples in the repo wiki. This would be a great start. |
as you can see I have nothing specific anymore, the commented code is remnants from before the multiple providers MR was merged: https://github.com/teto/home/blob/577d3a6cbb37ad874b601bc0af73e7162486d479/config/nvim/lua/teto/gp/setup.lua#L24 |
Just to back up some of my thoughts concerning whisper. TLDR: The cross-platform audio world sucks.

SoX is the only cross-platform candidate usable for recording, but an unknown number of people were hitting latency issues that cut off the beginning/end of the recording. That led me to eventually introduce ffmpeg with avfoundation for macOS and arecord for Linux, which are prioritized over SoX for recording if found. Yes, ffmpeg could potentially replace SoX completely, but the cross-platform mess of possible incantations based on available input devices is something I'd like to avoid (https://ffmpeg.org/ffmpeg-devices.html#Input-Devices).

Whisper (at least through the OpenAI API) is limited to 25MB input files with mostly proprietary formats (mp3, mp4, mpeg, mpga, m4a, wav, and webm). Wav is around 10MB of mono recording per minute, mp3 is around 1MB per minute, which means compression for any non-trivial length of audio and dealing with SoX potentially missing mp3 support (at least NixOS and Ubuntu both have this problem). I haven't tested/looked up if and how transcription speed depends on the format - we might be able to avoid mp3 for simple whisper instructions like GpWhisperRewrite. But for use cases such as dictating something for transcription, wav is unusable without some splitting mechanism, which would again complicate things.

Then there is the question of what to use for running whisper locally. The Whisper model is small enough that I might consider bundling some cross-platform solution directly with the Gp plugin. Ollama can't be expected anytime soon (ollama/ollama#1168), but there are other candidates such as https://github.com/fedirz/faster-whisper-server. The
and setting into conf:
basically works already (although slow, first call timeouts since it pulls the model), and it uses customized models so the currently hard coded
|
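To make the size arithmetic in the comment above concrete, here is a small Python sketch (not part of gp.nvim). It only uses the 25MB limit and the rough per-minute sizes quoted above; the function name is made up for illustration and the real sizes depend on sample rate and bitrate.

```python
# Figures quoted in the comment above; both are rough approximations.
OPENAI_WHISPER_LIMIT_MB = 25.0
APPROX_MB_PER_MINUTE = {"wav": 10.0, "mp3": 1.0}


def max_minutes(fmt, limit_mb=OPENAI_WHISPER_LIMIT_MB):
    """Approximate longest recording (in minutes) that fits under the limit."""
    return limit_mb / APPROX_MB_PER_MINUTE[fmt]


if __name__ == "__main__":
    for fmt in APPROX_MB_PER_MINUTE:
        print(f"{fmt}: about {max_minutes(fmt):.1f} minutes before hitting the limit")
```

With those figures a wav recording hits the limit after roughly two and a half minutes, while mp3 leaves room for about 25 minutes, which is why dictation-length audio needs either compression or splitting.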
@Robitx do you think this should be done in a separate plugin? |
@mecattaf I don't think there is a need to separate it, I've added GpWhisper exactly because voice control/dictation is important for me too. Concerning the splitting mechanism - SoX itself can do it; the trouble is the silence thresholds, which differ from device to device and often with the time of day on the same device.
Notes: |
Got it! Three things I really like from this project: https://github.com/mkiol/dsnote
I do not like the Speech Note ui, the gp.nvim experience is second to none imo. Hopefully we can get the best of both worlds :) |
@mecattaf Gp uses rudimentary threshold detection already: it reads the RMS level ("average loudness", for example -10dB) and multiplies it by some constant (RMS*1.75 = -17.5 dB). Everything below that is considered silence, and audio staying under the threshold for a specified duration causes a split (lines 3370 to 3379 in 0e7a4a2).
But there is a lot of SoX magic not utilized yet, since I didn't have time to play with it. For example the compand effect, which could get voice/silence to a predictable level before splitting and make it easier for whisper to process.
I'll try to spend some time on it during weekend. |
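The threshold rule described above can be sketched in a few lines of Python. This is an illustration only, not the code Gp runs (Gp delegates the actual detection to SoX): the per-frame dB levels, the function names and the frame-count parameter are assumptions made for the example; only the 1.75 factor and the -10dB figure come from the comment.

```python
def silence_threshold_db(rms_db, factor=1.75):
    """Scale the measured RMS level (in dB, negative) to get the silence threshold."""
    return rms_db * factor


def find_split_points(levels_db, threshold_db, min_silent_frames):
    """Return frame indices where a silent run reaches min_silent_frames.

    levels_db is a hypothetical list of per-frame loudness values in dB.
    A frame counts as silent when its level is below threshold_db; each
    silent run yields at most one split point.
    """
    splits = []
    run = 0
    for i, level in enumerate(levels_db):
        if level < threshold_db:
            run += 1
            if run == min_silent_frames:
                splits.append(i)
        else:
            run = 0
    return splits


if __name__ == "__main__":
    threshold = silence_threshold_db(-10.0)
    frames = [-10, -12, -30, -35, -40, -11, -30, -9]
    print(threshold, find_split_points(frames, threshold, min_silent_frames=2))
```

The sketch also shows why the constant is fragile: the threshold is derived from the recording's own RMS, so a noisier room or a different mic shifts both the RMS and what counts as silence.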
@teto I think it would be useful to compile all the info related to offline whisper somewhere. Should this be a self-contained markdown file, or a separate repo? I would like to |
this is up to whisper's project IMO
This feels like the hundreds of tutorials on how to run LLMs locally. Maybe just link one of those with some comments? I think the wiki is the most appropriate place to do so. Go ahead, but gp.nvim != whisper, so link the whisper doc when appropriate rather than duplicating it with the risk of it getting outdated |
@mecattaf sorry I haven't got around to it yet, spent the last two week(s/ends) cleaning up the code base and squashing some bug reports. |
Now that I have my GPU used by localai, I wanted to try whisper locally via `:GpWhisper` after installing sox, and I got a not very helpful:

I had installed sox because checkhealth asked for it:
Note that the mp3 check is invalid, as `sox -h | grep -i mp3` did return mp3, but there seems to be a distinction between writing and reading mp3: https://bugs.launchpad.net/ubuntu/+source/sox/+bug/223783

I am on nix and I had to install `(sox.override({enableLame = true;}))` for sox to be able to generate mp3.

In order to debug my setup, I `print`-ed stuff; it would be nice if gp.nvim could log some of its operations to a file instead. I don't like plenary much but it has some facilities. With package managers like https://github.com/nvim-neorocks/rocks.nvim/ , it should become more tractable to use dependencies in the future.

So anyway GpWhisper was trying to run:
So I found out that rec.wav did not exist/was empty. Checking the size of the recording could help diagnose a failed recording.
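A size check like the one suggested here could look roughly as follows. This is a Python sketch for illustration (the plugin is Lua); `diagnose_recording` and its `min_bytes` parameter are hypothetical names, not something gp.nvim provides.

```python
import os


def diagnose_recording(path, min_bytes=1):
    """Return a human-readable problem description, or None if the file looks usable.

    min_bytes is an assumed knob: callers can raise it to also reject
    files that contain little more than a header.
    """
    if not os.path.exists(path):
        return f"{path}: file does not exist (did the recorder start at all?)"
    size = os.path.getsize(path)
    if size < min_bytes:
        return f"{path}: file is only {size} bytes (nothing was recorded?)"
    return None
```

Running such a check right after recording, and again after the mp3 conversion, would point at the failing step directly instead of surfacing later as an opaque transcription error.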
Then I had to split the command to find the issue. Turns out that the conversion to mp3 failed because of what I mentioned earlier: my version of sox listed mp3 in `sox -h` but it was not able to generate mp3 until I enabled the "lame" library.

So now it works (yeah \o/) but initially I wanted to try it locally, so I changed the hardcoded endpoint to point at my local localai endpoint
.. " --max-time 20 http://localhost:11111/v1/audio/transcriptions -s "
and it works so fast it's scary (with an RTX3060, so not that fancy).
My first attempt was in my native language != English and the result was garbage ^^
Maybe the default `whisper_language = "en"` could be chosen via the locale instead? But I nitpick.
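Picking the language from the locale could work roughly like the Python sketch below. It is only an illustration of the idea, not gp.nvim behavior: the function name is made up, and it assumes POSIX-style locale variables such as `LANG=fr_FR.UTF-8`, falling back to "en" when nothing usable is set.

```python
import os


def language_from_locale(env=None, default="en"):
    """Derive a short language code from POSIX locale variables.

    The first non-empty variable among LC_ALL, LC_MESSAGES and LANG decides.
    Values such as "C" or "POSIX" carry no language, so the default is used.
    """
    env = os.environ if env is None else env
    for var in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(var)
        if value:
            # "fr_FR.UTF-8@euro" -> "fr"
            code = value.split(".")[0].split("@")[0].split("_")[0].lower()
            if code.isalpha() and len(code) in (2, 3):
                return code
            return default
    return default
```

An explicit `whisper_language` setting should still win over this guess, since people often dictate in a language other than their system locale.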
Took me a few (2?) hours to get there so I'll pause for now :)
My USB mic needed some custom config that I am listing more for my future self than for the maintainers (sry ^^'):
The help/doc of arecord is not great, so from there it was not clear how to specify the device.
I found the answer here: https://unix.stackexchange.com/questions/360192/alsa-error-channel-count-2-not-available-for-playback-invalid-argument : `plughw` accepts more options than `hw` it seems, and in the end