
2023 05 09


Trying to set up proper generation benchmarking. Something's not right with KMeans: somehow the prediction for a KICK HIHAT CLAP HIHAT sequence is labels 0 1 1 2, where 0 is the kick, 1 the hihat and 2 the clap... This doesn't make sense!

Same results with spectral clustering...

Something about feature extraction is wrong here, perhaps? It seemed that the crude features, i.e. energy and zero-crossing rate, worked best for percussive input, but maybe they are prone to error?

After detokenising the predicted labels for the first 8 frames of the original sequence, the sequencing feels right, but the labels are switched around, giving: KICK HIHAT KICK CLAP

Okay, I was passing frames to stream.get_frame the entire time, not the labelled_frames dictionary...
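
For reference, roughly the tokenise/detokenise round trip being benchmarked - a minimal sketch, where the frames, the feature choice and the labelled_frames handling are stand-ins rather than the actual stream / get_frame code:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in frames and crude features (energy, zero-crossing rate);
# the real code extracts features from the audio stream per frame.
rng = np.random.default_rng(0)
frames = [rng.normal(size=2048) for _ in range(4)]
features = np.array(
    [[np.sum(f ** 2), np.mean(np.abs(np.diff(np.sign(f)))) / 2] for f in frames]
)

# Tokenise: one cluster label per frame.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
labels = kmeans.predict(features)

# Group frames by label so detokenising looks frames up by label,
# not by position (the bug above was passing `frames` here instead).
labelled_frames = {}
for label, frame in zip(labels, frames):
    labelled_frames.setdefault(int(label), []).append(frame)

# Detokenise: for each predicted label, pull a frame from that cluster.
reconstruction = [labelled_frames[int(label)][0] for label in labels]
```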

It seems that extracting features (except the crude ones, which don't have such parameters) with the default frame and hop size, i.e. splitting each frame into sub-frames, loses information and makes the clustering wonky. The sub-frame values would later be averaged, but it seemingly works better in this case if each feature is calculated once over the entire frame. Perhaps windowing has something to do with it?
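
What the two settings look like in librosa - a sketch, assuming each frame is a 1-D array of samples; the frame length and the centroid as the example feature are just illustrative:

```python
import numpy as np
import librosa

# One frame of audio (length is illustrative).
frame = np.random.default_rng(0).normal(size=4096).astype(np.float32)
sr = 22050

# Default n_fft / hop_length: librosa splits the frame into sub-frames and
# returns one centroid per sub-frame, which then has to be averaged.
per_subframe = librosa.feature.spectral_centroid(y=frame, sr=sr)  # shape (1, n_subframes)
averaged = per_subframe.mean(axis=1)

# Calculated once over the entire frame: make the FFT window span the frame.
per_frame = librosa.feature.spectral_centroid(
    y=frame, sr=sr, n_fft=len(frame), hop_length=len(frame), center=False
)  # shape (1, 1)
```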

It's improved a little, but spectral features seem to work best for drums - they were the only case where extracting them and clustering made it possible to tokenise and detokenise a sequence without loss of information. MFCCs were a bit more lossy, though they preserved some regularity of the input; crude features worked the worst and seemed half-random.

Combinations could be tried.

Crude + spectral works just fine. Spectral + MFCCs: same story, but a bit more expensive to calculate. Crude + MFCCs is pretty much the same as MFCCs alone. All of the above together works fine too.
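
Roughly how the combinations are put together per frame - a sketch, where the frame_features helper and the particular features are illustrative, not the project's actual code:

```python
import numpy as np
import librosa

def frame_features(frame, sr=22050):
    """Concatenate crude + spectral + MFCC features for a single frame."""
    n = len(frame)
    once = dict(n_fft=n, hop_length=n, center=False)  # one value per frame
    crude = np.array([
        np.sum(frame ** 2),                                          # energy
        librosa.feature.zero_crossing_rate(
            frame, frame_length=n, hop_length=n, center=False).item(),
    ])
    spectral = np.array([
        librosa.feature.spectral_centroid(y=frame, sr=sr, **once).item(),
        librosa.feature.spectral_bandwidth(y=frame, sr=sr, **once).item(),
    ])
    mfccs = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=13, **once).ravel()
    # Any of the combinations is just a concatenation of these vectors.
    return np.concatenate([crude, spectral, mfccs])
```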

The new problem is that longer files and frames are becoming very heavy to compute...

Works really badly with real-life data - notebook's barely holding together when running. Will try UMAP to see if it improves the situation at all.
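
What trying UMAP amounts to here - a sketch, with a stand-in feature matrix and untuned parameter values:

```python
import numpy as np
import umap
from sklearn.cluster import KMeans

# Stand-in feature matrix: one row per frame.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 30))

# Project the features down to a few dimensions before clustering.
reducer = umap.UMAP(n_components=5, n_neighbors=15, random_state=0)
embedded = reducer.fit_transform(features)  # (n_frames, 5)

labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embedded)
```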

UMAP improves the situation ever so slightly; in the long term, it's probably best to go down the autoencoding route. An LSTM autoencoder could predict the features and the closest matching frame could be picked off a k-d tree, simple as.
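
The k-d tree half of that idea is simple enough - a sketch, where the predicted feature vector is faked instead of coming from an actual LSTM autoencoder:

```python
import numpy as np
from scipy.spatial import cKDTree

# Features of all known frames, one row per frame (stand-in values).
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(1000, 16))
tree = cKDTree(frame_features)

# In the real thing this prediction would come from the autoencoder.
predicted = rng.normal(size=16)

# Index of the closest matching frame, to be played back in its place.
_, nearest_index = tree.query(predicted)
```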

Yep, seems like with simple stuff, i.e. the pulsex example, UMAP does improve things a little bit.

The default n_fft and hop_length for feature extraction work fine with some real-world percussive examples (pulsex), so there might be no need for the heavier calculation?

I thought the librosa beat detection worked only so-so, but after some rudimentary improvements it works pretty okay. Nicely enough, it seems to improve the training metrics - I don't know how or why. It would be nice to add some backtracking to the first detected beat and segment from there on.
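
Roughly what that backtracking could look like with librosa's onset_backtrack - a sketch, with the file path as a placeholder:

```python
import librosa

y, sr = librosa.load("example.wav")  # placeholder path

onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo, beats = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)

# Backtrack each detected beat to the preceding local minimum of onset
# strength, then segment from the first backtracked beat onwards.
beats = librosa.onset.onset_backtrack(beats, onset_env)
start = librosa.frames_to_samples(beats[0]) if len(beats) else 0
segment = y[start:]
```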

Importantly: I chucked all of the notebook files out of the repo, as all the caching and updating them through the VSCode Git extension made the IDE pretty much unusable. Might switch over to Jupyter Lab for good, even though the lack of autocomplete hurts. Still, today's 100 relaunches and reboots caused by the VSCode Jupyter kernel freezing on me have made me desperate. Maybe just desperate enough.
