Chapter unchaptered videos locally to existing audio #68
Hey @mrfragger, good to hear from you.
Ollama and a language model are involved, but that is not the full story. If you run …
People have asked for it. I have just been …
It's already possible in theory, as you can use yt2doc as a library by …
Yeah, thanks Shun Liang... I'm especially interested in #4, generating chapter titles. Gonna try to do that locally somehow, or wait till you overdose on jazz. I got Ollama up and running, although it really beats up my Mac M1 with 8 GB RAM. #3 eventually too, as document chunking or paragraph segmentation based on topic would be ideal. In the meantime, do you know how to iterate with enumerate in Python, perhaps?
Is my voice loud enough? Some people are saying it's low. Okay. All right. Like I said, I'm going to disable the chat. Okay. If anyone has a question, you can message one of the hosts.
Basically this separates all the sentences and seems to do a decent enough job. I want to print, say, 8 sentences and then \n\n to start a new paragraph. I did get overly excited about some code (below) because it was doing paragraph segmentation based on semantic similarity, but alas it output the paragraphs in random order. That's fine if one is just trying to get a quick synopsis or overview of a chapter and perhaps write their own chapter title, but not if you wish to retain the original order of the text. Also, you need to define how many clusters to chunk into.
…
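For the fixed-size version, here's a minimal sketch of the enumerate idea: start a new paragraph every 8 sentences, preserving the original order. (The sentence split below is a crude regex; swap in whatever splitter produced the text above, and chapter001.txt is just an example filename.)

```python
import re

def group_sentences(sentences, per_paragraph=8):
    paragraphs = []
    for i, sentence in enumerate(sentences):
        if i % per_paragraph == 0:
            paragraphs.append([])          # start a new paragraph every 8 sentences
        paragraphs[-1].append(sentence)
    return "\n\n".join(" ".join(p) for p in paragraphs)

text = open("chapter001.txt").read()
# crude split: break after ., ! or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
print(group_sentences(sentences))
```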
Thanks for this package. Can you add -f 139 (to download m4a audio only)? That's what I always do; then I transcribe with whisper.cpp and make a chaptered Opus audiobook.
Everything works fine now, but for YouTube playlists that are 30+ videos long and don't have chapter names, I'd like to generate those chapters automatically if possible.
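Until there's a flag for it, one workaround is to grab the audio yourself through yt-dlp's Python API. A sketch (format 139 is YouTube's low-bitrate m4a stream and isn't present on every video, hence the fallback; the URL is a placeholder):

```python
from yt_dlp import YoutubeDL

# "139" = ~48 kbps m4a; fall back to the best available m4a audio if it's missing
opts = {"format": "139/bestaudio[ext=m4a]", "outtmpl": "%(title)s.%(ext)s"}
with YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])
```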
First I figured out how to split VTT subs into chapters based on the YouTube timestamps:
00:00:00 001 Lesson 1 [1:36:45]
01:36:45 002 Lesson 2 [1:34:52]
03:11:37 003 Lesson 3 [1:21:54]
04:33:31 004 Lesson 4 [1:24:22]
05:57:53 005 Lesson 5 [1:47:18]
07:45:10 006 Lesson 6 [1:41:01]
09:26:11 007 Lesson 7 [1:41:30]
11:07:42 008 Lesson 8 [1:23:23]
12:31:04 009 Lesson 9 [1:34:03]
14:05:07 010 Lesson 10 [1:37:04]
15:42:11 011 Lesson 11 [1:33:52]
17:16:04 012 Lesson 12 [1:20:45]
18:36:48 013 Lesson 13 [1:46:06]
up to Lesson 31
Use mpv with a chapters script for mpv to write the YouTube chapter timestamp list to a text file.
First, change any mm:ss.ms timestamps to hh:mm:ss.ms:
gsed -i -E 's/(^[0-9]{2}:[0-9]{2}\.[0-9]{3} --> )([0-9]{2}:[0-9]{2}\.[0-9]{3})/00:\100:\2/g' some.vtt
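The same normalization in Python, for anyone without GNU sed on the Mac (a sketch; the dots are escaped so only the literal millisecond separator matches):

```python
import re

cue = re.compile(r"^(\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}\.\d{3})", re.MULTILINE)

with open("some.vtt") as f:
    text = f.read()
# prepend the missing hour field to both timestamps of each cue line
with open("some.vtt", "w") as f:
    f.write(cue.sub(r"00:\1 --> 00:\2", text))
```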
Put in the chapter times, adding +1 second, as the subtitles might be 1 second ahead. (They can also be 1 second behind; I tried handling that too, but 00 minus 1 = -1, which then has to wrap to 59 and become [59,00,01], and it gets too complicated.)
awk '{print $1}' *.chapters.txt | awk -F ":" '{ printf "!!!!/^"$1":"$2":"$3","$3+1"#### "}' | sed -E -e 's/([0-9]{2},[0-9]{2})/[\1]/g' -e 's/:([0-9]{2},)([0-9])/:[\10\2]/g' -e "s/!!!!/'/g" -e "s/####/\/'/g" > chaptimes
cat chaptimes to get the text... I tried $(cat "$chaptimes") and < chaptimes and things like that, but it just shows "invalid pattern" without saying where. (By the time the shell expands $(cat chaptimes), the single quotes stored in the file are literal characters rather than shell quoting, so the patterns reach csplit mangled.) To see exactly where csplit errors, paste the timestamp patterns in directly instead.
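A sketch of a way around both problems: build the argument list in Python and call gcsplit via subprocess, so shell quoting never enters the picture, and compute the +1 second as a full timestamp so the minute rollover that "gets too complicated" is handled. (One assumption here: gcsplit accepting \( \| \) alternation in its patterns, a GNU extension that GNU sed and grep support.)

```python
import subprocess

def plus_one(ts):
    # +1 second with proper rollover (e.g. 01:36:59 -> 01:37:00)
    h, m, s = map(int, ts.split(":"))
    t = h * 3600 + m * 60 + s + 1
    return f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"

with open("some.chapters.txt") as f:           # first column is HH:MM:SS
    starts = [line.split()[0] for line in f if line.strip()]

args = ["gcsplit", "--prefix", "chapter", "--suffix-format=%03d.vtt", "some.vtt"]
args += [rf"/^\({ts}\|{plus_one(ts)}\)/" for ts in starts]
subprocess.run(args, check=True)
```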
gcsplit --prefix chapter --suffix-format="%03d.vtt" some.vtt '/^00:00:[00,01]/' '/^01:36:[45,46]/' '/^03:11:[37,38]/' '/^04:33:[31,32]/' '/^05:57:[53,54]/' '/^07:45:[10,11]/' '/^09:26:[11,12]/' '/^11:07:[42,43]/' '/^12:31:[04,05]/' '/^14:05:[07,08]/' '/^15:42:[11,12]/' '/^17:16:[04,05]/' '/^18:36:[48,49]/' '/^20:22:[54,55]/' '/^22:06:[45,46]/' '/^23:48:[25,26]/' '/^25:05:[41,42]/' '/^26:30:[10,11]/' '/^27:28:[32,33]/' '/^28:49:[20,21]/' '/^30:20:[17,18]/' '/^31:36:[20,21]/' '/^32:58:[30,31]/' '/^34:19:[29,30]/' '/^35:22:[33,34]/' '/^36:29:[26,27]/' '/^37:52:[00,01]/' '/^39:14:[00,01]/' '/^40:17:[16,17]/' '/^41:41:[04,05]/' '/^42:55:[55,56]/'
This gave an invalid pattern at 36:29:[26,27] because neither of those seconds exists in the file... so change it to allow 1 second less: 36:29:[25,26,27]. (Note these brackets are character classes: [26,27] matches a single character that is 2, 6, 7, or a comma, not the two-digit seconds 26 or 27, so the patterns are looser than they look.)
gcsplit --prefix chapter --suffix-format="%03d.vtt" play.vtt '/^00:00:[00,01]/' '/^01:36:[45,46]/' '/^03:11:[37,38]/' '/^04:33:[31,32]/' '/^05:57:[53,54]/' '/^07:45:[10,11]/' '/^09:26:[11,12]/' '/^11:07:[42,43]/' '/^12:31:[04,05]/' '/^14:05:[07,08]/' '/^15:42:[11,12]/' '/^17:16:[04,05]/' '/^18:36:[48,49]/' '/^20:22:[54,55]/' '/^22:06:[45,46]/' '/^23:48:[25,26]/' '/^25:05:[41,42]/' '/^26:30:[10,11]/' '/^27:28:[32,33]/' '/^28:49:[20,21]/' '/^30:20:[17,18]/' '/^31:36:[20,21]/' '/^32:58:[30,31]/' '/^34:19:[29,30]/' '/^35:22:[33,34]/' '/^36:29:[25,26,27]/' '/^37:52:[00,01]/' '/^39:14:[00,01]/' '/^40:17:[16,17]/' '/^41:41:[04,05]/' '/^42:55:[55,56]/'
Now it outputs chapter001.vtt, chapter002.vtt, chapter003.vtt, and so on up to chapter031.vtt.
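Alternatively, the pattern fudging disappears entirely if you split the VTT directly by comparing each cue's start time against the chapter start times. A sketch, assuming the normalized hh:mm:ss.mmm cues from the gsed step and the same chapters file:

```python
import re

def to_seconds(ts):
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

with open("some.chapters.txt") as f:            # first column is HH:MM:SS
    starts = [to_seconds(line.split()[0]) for line in f if line.strip()]

with open("play.vtt") as f:
    cues = f.read().split("\n\n")               # VTT cues are blank-line separated

chapters = [[] for _ in starts]
current = 0
for cue in cues:
    m = re.search(r"(\d{2}:\d{2}:\d{2}\.\d{3}) -->", cue)
    if not m:
        continue                                 # skip the WEBVTT header block
    # advance to whichever chapter this cue's start time falls into
    while current + 1 < len(starts) and to_seconds(m.group(1)) >= starts[current + 1]:
        current += 1
    chapters[current].append(cue)

for i, chunk in enumerate(chapters, start=1):
    with open(f"chapter{i:03d}.vtt", "w") as out:
        out.write("WEBVTT\n\n" + "\n\n".join(chunk) + "\n")
```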
Now convert the VTT subs to plain text:
for f in *.vtt; do cat "$f" | sed -e '/WEBVTT/d' -e '/-->/d' | awk '!seen[$0]++' | awk 1 ORS=' ' > "${f%.*}".txt ; done
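The same conversion in Python, which sidesteps the GNU-vs-BSD tool differences on the Mac (a sketch using the third-party webvtt-py package; the set does the same dedup as awk's !seen[$0]++):

```python
import glob
import webvtt  # pip install webvtt-py

for path in glob.glob("chapter*.vtt"):
    seen, kept = set(), []
    for caption in webvtt.read(path):
        for line in caption.text.splitlines():
            if line and line not in seen:    # drop repeated rolling-caption lines
                seen.add(line)
                kept.append(line)
    with open(path.rsplit(".", 1)[0] + ".txt", "w") as out:
        out.write(" ".join(kept) + "\n")
```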
So now, what can you do with the text? Divide it into paragraphs. I tried many tools but couldn't get any to work on the Mac, so I just use the first online tool that comes up for "split text into paragraphs"; I split at 11 sentences. (Splitting isn't actually needed for the analysis, though.)
I then tried using Llama 3.2 to write a title from key takeaways and keywords. It was pretty bad.
http://www.writewords.org.uk/phrase_count.asp
Setting the number of words per phrase to 3 or 4 seems to give a good overview of a chapter when trying to figure out a chapter title.
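That phrase counting is easy to do locally too; a sketch with collections.Counter over 3- and 4-word phrases (chapter001.txt is again just an example filename):

```python
import re
from collections import Counter

words = re.findall(r"[a-z']+", open("chapter001.txt").read().lower())

for n in (3, 4):
    phrases = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    print(f"-- most common {n}-word phrases --")
    for phrase, count in phrases.most_common(10):
        print(f"{count:4d}  {phrase}")
```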
So this leads to yt2doc... how does it automatically add chapter headers using Ollama and one of those three models? I installed Qwen 2.5, but how do I prompt it to generate a title from the chapter text? Can yt2doc work locally, without inputting a YouTube URL, on existing audio/video files?
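For the prompting part, here's a sketch of the general idea with the ollama Python package; this is not necessarily how yt2doc does it, and it assumes `ollama serve` is running with qwen2.5 already pulled:

```python
import ollama  # pip install ollama

chapter_text = open("chapter001.txt").read()[:4000]  # keep the prompt small

response = ollama.chat(
    model="qwen2.5",
    messages=[{
        "role": "user",
        "content": "Write a short, descriptive chapter title (at most 8 words) "
                   "for this transcript. Reply with the title only.\n\n" + chapter_text,
    }],
)
print(response["message"]["content"])
```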
Ultimately, I'd love to be able to use yt2doc on existing content that's already been downloaded and transcribed; basically, just use the post-processing functions. A lot of the Python stuff is way over my head.