Do you have plans to release a causal version? #6
I tried some English test cases, and the results were roughly as follows. Are these results consistent with yours? They are not very good. |
Which configuration (e.g., bps) did you use? |
For the results in test.zip, I set posthoc_bottleneck to false. I then set posthoc_bottleneck to true and tried three different configurations: 1x46656_400bps, 2x15625_700bps, and 1x729_1000bps. These settings performed better; the new results are in test2.zip. I initially thought that setting posthoc_bottleneck to false would give the highest bps and therefore the best performance. Why isn't that the case? Thank you for your reply. |
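For orientation, a minimal sketch of how these presets might be toggled at inference time. The `StableCodec` wrapper, the `encode`/`decode`/`set_posthoc_bottleneck` names, and the preset string follow my reading of the repo's README-style usage and should be treated as assumptions, not the authoritative API:

```python
import torch
import torchaudio
from stable_codec import StableCodec  # assumed import path

# Assumed wrapper and method names -- verify against the repo's README.
model = StableCodec(
    model_config_path="model_config.json",
    ckpt_path="model.ckpt",
    device=torch.device("cuda"),
)

# posthoc_bottleneck=False: tokens come straight from the FSQ levels the model was trained with.
latents, tokens = model.encode("input.wav", posthoc_bottleneck=False)
recon = model.decode(tokens, posthoc_bottleneck=False)

# posthoc_bottleneck=True with one of the presets discussed above.
model.set_posthoc_bottleneck("2x15625_700bps")
latents, tokens = model.encode("input.wav", posthoc_bottleneck=True)
recon = model.decode(tokens, posthoc_bottleneck=True)

torchaudio.save("decoded.wav", recon.squeeze(0).cpu(), model.sample_rate)
```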
I'll check if we get the same results here. A few questions:
|
When I set posthoc_bottleneck=false, my code is as follows:
The version with posthoc_bottleneck=True is as follows:
|
Thanks for sharing your code @hertz-pj. This is what the output should sound like for this example. It's not perfect, but it's better. However, I can indeed reproduce the results that you're getting. There seem to be two factors in play:
|
Thank you very much for your reply. I haven't fully understood the principle of dithered_fsq yet, and I will study it further. I currently want to train a model with fps=50/80, and I want to use the original fsq instead of dithered_fsq. The main part of the configuration I'm using is as follows:
Is setting the sliding window to [63, 64] problematic? Would using dithered_fsq directly make a significant difference compared to the original fsq? I will share the final experimental results once I get them. |
The original FSQ differs in two main ways from dithered_fsq:
If you don't care about either of those aspects, you can use the original FSQ. |
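For reference, here is a minimal sketch of plain (undithered) FSQ as described in the original FSQ paper: each latent channel is bounded, scaled to a fixed number of levels, and rounded with a straight-through estimator. The level counts are placeholders, and this is not the repo's dithered_fsq implementation:

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    """Plain FSQ: bound each channel, round it to `levels[i]` values, and pass
    gradients straight through the rounding. Level counts here are placeholders."""
    levels_t = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels_t - 1) / 2
    bounded = torch.tanh(z) * half        # squash each channel into [-half, half]
    quantized = torch.round(bounded)      # snap to the integer grid (implicit codebook)
    # Straight-through estimator: the forward pass uses the rounded values,
    # the backward pass sees the identity w.r.t. the pre-rounding values.
    return bounded + (quantized - bounded).detach()

# Example: six channels with five levels each gives an implicit codebook of 5**6 entries.
z = torch.randn(2, 100, 6, requires_grad=True)
codes = fsq_quantize(z, levels=[5, 5, 5, 5, 5, 5])
```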
BTW - if you want a pseudo-causal version of the model, you can set |
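To make the sliding-window and causality discussion concrete, here is a hedged sketch of building an attention mask with an asymmetric window. Reading [63, 64] as [past, future] context sizes is my assumption, and this is not the repo's actual masking code; a pseudo-causal variant would simply drop the lookahead:

```python
import torch

def sliding_window_mask(seq_len: int, past: int, future: int) -> torch.Tensor:
    """Boolean mask where position i may attend to positions in [i - past, i + future]."""
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]        # rel[i, j] = j - i
    return (rel >= -past) & (rel <= future)  # True = attention allowed

window_mask = sliding_window_mask(256, past=63, future=64)  # the [63, 64] case above
causal_mask = sliding_window_mask(256, past=63, future=0)   # no lookahead (pseudo-causal)
```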
Should I change these configurations in the dataset configuration? Additionally, it seems that I cannot follow the configuration method from stable-audio-tools directly, because you rewrote the WebDatasetDataLoader method. |
Those dataset params look fine to me. Are you training with CTC? If not, you can just use the training scripts from stable-audio-tools directly and use the original dataloader. |
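For completeness, roughly what a local-directory dataset config for the stock stable-audio-tools dataloader looks like; the exact keys are from my recollection of that repo's documentation, so treat them as assumptions and check its docs:

```python
# Sketch of a stable-audio-tools dataset config, expressed as a Python dict.
# Key names ("dataset_type", "datasets", "random_crop") are assumptions to verify.
dataset_config = {
    "dataset_type": "audio_dir",
    "datasets": [
        {"id": "my_audio", "path": "/path/to/audio/"}
    ],
    "random_crop": True,
}
```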
The WavLM perceptual loss plays a crucial role in this work, as evidenced by the significant reduction in mel loss from 1.18 to 0.86. It appears that this part is not included in the public code. Additionally, the fine-tuned perceptual-loss model underwent 150k more training steps than the pre-trained model. Should we continue training the model without perceptual loss for an additional 150k steps before comparing the two? |
The WavLM perceptual loss is integrated into
That would be another viable approach, yes. Learning rate schedulers mean that this is still an imperfect comparison, however. |
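For readers who want to approximate this, a hedged sketch of a WavLM feature-matching loss built on the Hugging Face transformers WavLM model; the checkpoint choice, layer selection, and plain L1 averaging are my assumptions rather than the exact loss used here:

```python
import torch
import torch.nn.functional as F
from transformers import WavLMModel

class WavLMPerceptualLoss(torch.nn.Module):
    """L1 feature-matching loss between reconstructed and reference audio,
    computed on the hidden states of a frozen WavLM (illustrative setup)."""

    def __init__(self, model_name: str = "microsoft/wavlm-base-plus"):
        super().__init__()
        self.wavlm = WavLMModel.from_pretrained(model_name)
        self.wavlm.eval()
        for p in self.wavlm.parameters():
            p.requires_grad_(False)

    def forward(self, recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # recon, target: (batch, samples) mono waveforms at WavLM's expected 16 kHz.
        with torch.no_grad():
            tgt = self.wavlm(target, output_hidden_states=True).hidden_states
        rec = self.wavlm(recon, output_hidden_states=True).hidden_states
        # Average L1 distance across all transformer layers' hidden states.
        return sum(F.l1_loss(r, t) for r, t in zip(rec, tgt)) / len(rec)
```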
Thank you very much to the author for open-sourcing the code and weights.
Do you have any plans to open-source the causal version?