Chapter unchaptered videos locally to existing audio #68

Open
mrfragger opened this issue Dec 12, 2024 · 4 comments
Labels
enhancement (New feature or request), question (Further information is requested)

Comments

@mrfragger

mrfragger commented Dec 12, 2024

Thanks for this package. Can you add -f 139 (to download m4a audio only)? That's what I always do, then I transcribe with whisper.cpp and make a chaptered Opus audiobook.

Everything works fine now, but for YouTube playlists that are 30+ videos long and don't have chapter names, I'd like to generate those names automatically if possible.

First, I figured out how to split VTT subs into chapters based on YouTube timestamps:
00:00:00 001 Lesson 1 [1:36:45]
01:36:45 002 Lesson 2 [1:34:52]
03:11:37 003 Lesson 3 [1:21:54]
04:33:31 004 Lesson 4 [1:24:22]
05:57:53 005 Lesson 5 [1:47:18]
07:45:10 006 Lesson 6 [1:41:01]
09:26:11 007 Lesson 7 [1:41:30]
11:07:42 008 Lesson 8 [1:23:23]
12:31:04 009 Lesson 9 [1:34:03]
14:05:07 010 Lesson 10 [1:37:04]
15:42:11 011 Lesson 11 [1:33:52]
17:16:04 012 Lesson 12 [1:20:45]
18:36:48 013 Lesson 13 [1:46:06]
up to Lesson 31
Use mpv with a chapters script for mpv to write the YouTube timestamp list to a text file.

First, change any mm:ss.ms timestamps to hh:mm:ss.ms:
gsed -i -E 's/(^[0-9]{2}:[0-9]{2}\.[0-9]{3} --> )([0-9]{2}:[0-9]{2}\.[0-9]{3})/00:\100:\2/g' some.vtt

Build the chapter-time patterns adding +1 second, as the subtitle cues might start 1 second after the chapter time. (I also considered 1 second behind, but 00 minus 1 = -1, which would have to wrap to 59 and produce [59,00,01]; it gets too complicated.)
awk '{print $1}' *.chapters.txt | awk -F ":" '{ printf "!!!!/^"$1":"$2":"$3","$3+1"#### "}' | sed -E -e 's/([0-9]{2},[0-9]{2})/[\1]/g' -e 's/:([0-9]{2},)([0-9]{1})/:[\10\2]/g' -e "s/!!!!/'/g" -e "s|####|/'|g" > chaptimes

cat chaptimes to get the text. I tried $(cat "$chaptimes"), < chaptimes, and the like, but csplit just shows "invalid pattern" without saying where it is. To see exactly where csplit errors, paste the timestamp patterns in directly instead.

gcsplit --prefix chapter --suffix-format="%03d.vtt" some.vtt '/^00:00:[00,01]/' '/^01:36:[45,46]/' '/^03:11:[37,38]/' '/^04:33:[31,32]/' '/^05:57:[53,54]/' '/^07:45:[10,11]/' '/^09:26:[11,12]/' '/^11:07:[42,43]/' '/^12:31:[04,05]/' '/^14:05:[07,08]/' '/^15:42:[11,12]/' '/^17:16:[04,05]/' '/^18:36:[48,49]/' '/^20:22:[54,55]/' '/^22:06:[45,46]/' '/^23:48:[25,26]/' '/^25:05:[41,42]/' '/^26:30:[10,11]/' '/^27:28:[32,33]/' '/^28:49:[20,21]/' '/^30:20:[17,18]/' '/^31:36:[20,21]/' '/^32:58:[30,31]/' '/^34:19:[29,30]/' '/^35:22:[33,34]/' '/^36:29:[26,27]/' '/^37:52:[00,01]/' '/^39:14:[00,01]/' '/^40:17:[16,17]/' '/^41:41:[04,05]/' '/^42:55:[55,56]/'

This gave an invalid pattern at 36:29:[26,27] because neither of those seconds exists in the subs, so I changed it to also include one second less: 36:29:[25,26,27].

gcsplit --prefix chapter --suffix-format="%03d.vtt" play.vtt '/^00:00:[00,01]/' '/^01:36:[45,46]/' '/^03:11:[37,38]/' '/^04:33:[31,32]/' '/^05:57:[53,54]/' '/^07:45:[10,11]/' '/^09:26:[11,12]/' '/^11:07:[42,43]/' '/^12:31:[04,05]/' '/^14:05:[07,08]/' '/^15:42:[11,12]/' '/^17:16:[04,05]/' '/^18:36:[48,49]/' '/^20:22:[54,55]/' '/^22:06:[45,46]/' '/^23:48:[25,26]/' '/^25:05:[41,42]/' '/^26:30:[10,11]/' '/^27:28:[32,33]/' '/^28:49:[20,21]/' '/^30:20:[17,18]/' '/^31:36:[20,21]/' '/^32:58:[30,31]/' '/^34:19:[29,30]/' '/^35:22:[33,34]/' '/^36:29:[25,26,27]/' '/^37:52:[00,01]/' '/^39:14:[00,01]/' '/^40:17:[16,17]/' '/^41:41:[04,05]/' '/^42:55:[55,56]/'

Now it outputs chapter001.vtt, chapter002.vtt, chapter003.vtt, etc., up to chapter031.vtt.
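
For comparison, here is a minimal Python sketch of the same splitting idea that avoids the ±1-second patterns entirely: parse the chapter start times, then move to the next output file at the first cue whose start time reaches the next chapter. This is not part of yt2doc; the file names play.chapters.txt and play.vtt are assumptions matching the listing above.

# Minimal sketch: split a VTT into per-chapter files using a chapters list
# like the one above. Not from yt2doc; file names are assumptions.
import re

def to_seconds(ts: str) -> float:
    """Convert hh:mm:ss(.mmm), mm:ss, or ss to seconds."""
    parts = [float(p) for p in ts.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0.0)  # pad mm:ss or ss up to hh:mm:ss
    h, m, s = parts
    return h * 3600 + m * 60 + s

# Chapter start times, taken from the first column of the chapters file.
chapter_starts = []
with open("play.chapters.txt") as f:
    for line in f:
        if line.strip():
            chapter_starts.append(to_seconds(line.split()[0]))

with open("play.vtt") as f:
    vtt_lines = f.readlines()

chunks = [[] for _ in chapter_starts]
current = 0
for line in vtt_lines:
    if line.strip() == "WEBVTT":
        continue  # the header is re-added to every chunk below
    m = re.match(r"(\d{2}:\d{2}:\d{2}\.\d{3}) -->", line)
    if m:
        start = to_seconds(m.group(1))
        # Advance to the next chapter once a cue starts at or after its start time.
        while current + 1 < len(chapter_starts) and start >= chapter_starts[current + 1]:
            current += 1
    chunks[current].append(line)

for i, chunk in enumerate(chunks, start=1):
    with open(f"chapter{i:03d}.vtt", "w") as out:
        out.write("WEBVTT\n\n" + "".join(chunk))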

Now convert the VTT subs to plain text:
for f in *.vtt; do sed -e '/WEBVTT/d' -e '/-->/d' "$f" | awk '!seen[$0]++' | awk 1 ORS=' ' > "${f%.*}".txt ; done

So now, what can you do with the text? Divide it into paragraphs. I tried many approaches but couldn't get any to work on the Mac, so I just use an online tool (the first search match for "split text into paragraphs"); I split at 11 sentences. (Splitting isn't actually needed for analysis, though.)

I then tried using Llama 3.2 to write a title from key takeaways and keywords. It was pretty bad.

http://www.writewords.org.uk/phrase_count.asp
Setting the number of words in a phrase to 3 or 4 seems to give a good overview of a chapter, to help figure out a chapter title.
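
The same phrase counting can also be done locally with a few lines of Python; a minimal sketch (the chapter001.txt name refers to the files produced by the splitting step above, and the phrase sizes are just the 3- and 4-word settings mentioned):

# Count the most frequent 3- and 4-word phrases in a chapter's text file.
# Purely illustrative; the file name is an assumption.
import re
from collections import Counter

def top_phrases(text, sizes=(3, 4), top=10):
    """Return the `top` most common phrases of the given word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(
        " ".join(words[i:i + n])
        for n in sizes
        for i in range(len(words) - n + 1)
    )
    return counts.most_common(top)

with open("chapter001.txt") as f:
    for phrase, count in top_phrases(f.read()):
        print(f"{count:4d}  {phrase}")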

So this leads to yt2doc: how does it automatically add chapter headers using Ollama and one of those three models? I installed Qwen 2.5, but how do I prompt it to generate a title from the chapter text? Can yt2doc work locally, without inputting a YouTube URL, just based on existing audio/video files?

Ultimately I'd love, if possible, to use yt2doc on existing content that's already been downloaded and transcribed; basically, just use the post-processing functions. A lot of the Python stuff is way over my head.

@shun-liang
Owner

shun-liang commented Dec 12, 2024

Hey @mrfragger good to hear from you.

how does it automatically add chapter headers from Ollama and one of those 3 models?

Ollama and a language model are involved, but that is not the full story.

If you run yt2doc with the --segment-unchaptered flag and give it the URL of a YouTube video that isn't chaptered, it will:

  1. Transcribe the video to text with Whisper. The output from Whisper is a list of text segments; each segment has a start time, an end time, and text content (it looks kind of like this). There are no line breaks in the segments, and on occasion there isn't even punctuation, which is quite messy.
  2. The Whisper segments' text contents are then concatenated into a single string, which is fed to https://github.com/segment-any-text/wtpsplit. wtpsplit breaks the text into an array of paragraphs, each of which is an array of sentences, by semantics rather than rules. (See here.)
  3. The paragraphs are then "topic segmented" by an LLM (Qwen, Gemma, etc.), which can be hosted by Ollama or something else. The topic segmentation is pretty much feeding the paragraphs to the LLM and asking it which paragraphs change topic from their previous ones. Due to the small context window of those models (they have to be small enough to run on a laptop with modest memory), I end up running a sliding window of paragraphs with window size 8 and truncating each paragraph to its first 6 sentences. That's how we get chapter boundaries from an unchaptered video (see the sketch after this list).
  4. For each chapter, we generate a title with the LLM.
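
Roughly, steps 3 and 4 look like the sketch below. This is not yt2doc's actual code or prompts; the prompts, the non-overlapping window, the helper names, and the model tag qwen2.5 are assumptions for illustration, and it assumes Ollama is running locally on its default port.

# Sketch of sliding-window topic segmentation and title generation via Ollama.
# Not yt2doc's implementation; prompts, window handling, and model name are
# illustrative assumptions.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "qwen2.5"

def ask(prompt: str) -> str:
    """Send one prompt to Ollama and return the model's text response."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

def find_topic_changes(paragraphs, window=8, max_sentences=6):
    """Slide a window over paragraphs (each a list of sentences) and ask the
    LLM which paragraphs start a new topic. Returns paragraph indices."""
    boundaries = set()
    for start in range(0, len(paragraphs), window):
        chunk = paragraphs[start:start + window]
        # Truncate each paragraph to its first few sentences to fit the context window.
        truncated = [" ".join(p[:max_sentences]) for p in chunk]
        numbered = "\n\n".join(f"[{i}] {t}" for i, t in enumerate(truncated))
        prompt = (
            "Below are numbered paragraphs from a transcript. List the numbers "
            "of the paragraphs that start a new topic, as a JSON array of "
            "integers.\n\n" + numbered
        )
        try:
            changes = json.loads(ask(prompt))
        except ValueError:
            changes = []  # the model did not return clean JSON; skip this window
        boundaries.update(start + i for i in changes if isinstance(i, int))
    return sorted(boundaries)

def title_for_chapter(chapter_text: str) -> str:
    """Ask the LLM for a short chapter title."""
    return ask("Write a short title (under 10 words) for this chapter. "
               "Reply with the title only.\n\n" + chapter_text)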

@shun-liang
Owner

shun-liang commented Dec 12, 2024

Can yt2doc work locally without inputting a yt url just based on existing audio / video files?

People have asked for it. I have just been lazy, handling too much jazz in life. It's on the to-do list.

@shun-liang
Owner

Basically use the post-processing functions.

It's already possible in theory, as you can use yt2doc as a library (pip install yt2doc) and import the formatting module. It's just not documented, those functions aren't exposed as a standalone CLI, and you would have to write Python code.

@shun-liang added the enhancement (New feature or request) and question (Further information is requested) labels on Dec 13, 2024
@mrfragger
Author

Yeah, thanks Shun Liang. I'm especially interested in that #4, generating chapter titles. I'm gonna try to do that locally somehow, or wait till you overdose on jazz. I got Ollama up and running, although it really beats up my Mac M1 with 8 GB RAM. #3 eventually, as document chunking or paragraph segmentation based on topic would be ideal. In the meantime, do you know how to iterate with enumerate in Python, perhaps?

# -*- coding: utf-8 -*-
import re

def split_sentences_punctuation(text):
    """
    Splits text into sentences using punctuation marks.
    
    Parameters:
    text (str): The input text to be split.
    
    Returns:
    list: A list of sentences.
    """
    # Regular expression to split sentences based on punctuation marks
    sentences = re.split(r'(?<=[.!?]) +', text)
    return sentences

# Sample text
text = '''

Is my voice loud enough? Some people are saying it's low. Okay. All right. Like I said, I'm going to disable the chat. Okay. If anyone has a question, you can message one of the hosts.
'''

sentences = split_sentences_punctuation(text)


for i, sentence in enumerate(sentences):
    print(f"{sentence}")

Basically this separates all the sentences and seems to do a decent enough job. I want to print, say, 8 sentences and then \n\n for a new paragraph.

I've tried things like
next(sentence)
print(f"{sentence}")
next(sentence)
print(f"{sentence}")
and stuff like that, but no luck.
That online split-into-paragraphs tool works easily enough, but I don't want to be doing thousands of open-file, copy-and-paste, copy-and-paste, save cycles. It has to be an automated process.
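
For what it's worth, a minimal, self-contained sketch of one way to automate that: reuse the punctuation-based splitter and join every 8 sentences into a paragraph separated by a blank line (the paragraph size of 8 is an arbitrary choice).

# Group sentences into fixed-size paragraphs separated by blank lines.
import re

def split_sentences_punctuation(text):
    """Split text into sentences on ., !, ? followed by spaces."""
    return re.split(r'(?<=[.!?]) +', text)

def group_into_paragraphs(sentences, size=8):
    """Join every `size` consecutive sentences into one paragraph."""
    paragraphs = [
        " ".join(sentences[i:i + size])
        for i in range(0, len(sentences), size)
    ]
    return "\n\n".join(paragraphs)

text = "Is my voice loud enough? Some people are saying it's low. Okay. All right."
print(group_into_paragraphs(split_sentences_punctuation(text), size=8))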

I did get overly excited about this code (below) because it was doing paragraph segmentation based on semantic similarity, but alas, it output the paragraphs in random order relative to the source. I mean, it's good if one is just trying to get a quick synopsis or overview of a chapter and perhaps write one's own chapter title, but not good if you wish to retain the original document order of the text. Also, you need to define how many clusters to chunk into.

# -*- coding: utf-8 -*-
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.cluster import KMeans
import numpy as np

def embed_sentences(sentences):
    """
    Embed sentences using the Universal Sentence Encoder.
    
    Parameters:
    sentences (list): A list of sentences to be embedded.
    
    Returns:
    np.array: An array of sentence embeddings.
    """
    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    embeddings = embed(sentences)
    return np.array(embeddings)

def semantic_chunk(sentences, num_clusters):
    """
    Perform semantic chunking by clustering sentences based on their embeddings.
    
    Parameters:
    sentences (list): A list of sentences to be chunked.
    num_clusters (int): The number of clusters to form.
    
    Returns:
    list: A list of clusters, each containing similar sentences.
    """
    # Embed the sentences
    embeddings = embed_sentences(sentences)
    
    # Perform KMeans clustering
    kmeans = KMeans(n_clusters=num_clusters)
    # kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    kmeans.fit(embeddings)
    
    # Group sentences by clusters
    clusters = [[] for _ in range(num_clusters)]
    for i, label in enumerate(kmeans.labels_):
        clusters[label].append(sentences[i])
    
    return clusters

# Sample text
text = '''
Some sample text with a bunch of sentences.
'''

# Split text into sentences
sentences = text.split('. ')
sentences[-1] = sentences[-1].rstrip('.')

# Perform semantic chunking (the cluster count cannot exceed the number of sentences)
num_clusters = min(200, len(sentences))
clusters = semantic_chunk(sentences, num_clusters)

# Print the clusters
for i, cluster in enumerate(clusters):
    # print(f"PARAGRAPHBREAK {i+1}")
    print(f"\n")
    for sentence in cluster:
        print(f"{sentence}.", end =" ")
    print()
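
One way to keep the original document order with the same clustering (not from yt2doc): label every sentence with its KMeans cluster, then walk the sentences in document order and start a new paragraph wherever the label changes. A minimal sketch reusing embed_sentences, sentences, and num_clusters from the code above:

from sklearn.cluster import KMeans

def chunk_in_document_order(sentences, num_clusters):
    """Keep sentences in their original order and start a new paragraph
    wherever the KMeans cluster label changes between neighbouring sentences."""
    embeddings = embed_sentences(sentences)  # reuses embed_sentences() above
    labels = KMeans(n_clusters=num_clusters).fit_predict(embeddings)
    paragraphs, current = [], [sentences[0]]
    for prev, cur, sentence in zip(labels, labels[1:], sentences[1:]):
        if cur != prev:
            paragraphs.append(". ".join(current) + ".")
            current = []
        current.append(sentence)
    paragraphs.append(". ".join(current) + ".")
    return paragraphs

for paragraph in chunk_in_document_order(sentences, num_clusters):
    print(paragraph, end="\n\n")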
