Do you have plans to release a causal version? #6

Open
hertz-pj opened this issue Jan 12, 2025 · 13 comments

Comments

@hertz-pj

Thank you very much for open-sourcing the code and weights.
Do you have any plans to open-source a causal version?

@hertz-pj
Author

hertz-pj commented Jan 12, 2025

I tried some English samples, and the results were roughly as follows. Are these results consistent with yours? The quality is not very good.
test.zip

@liuxubo717
Contributor

Which configuration (e.g., bps) did you use?

@hertz-pj
Author

For the results in test.zip, I set posthoc_bottleneck to False.

I then reset posthoc_bottleneck to True and tried three different parameters: 1x46656_400bps, 2x15625_700bps, and 1x729_1000bps. The outcomes with these settings showed better performance. The new results can be found in test2.zip.

I initially thought that setting posthoc_bottleneck to false would result in the highest bps and therefore the best performance. Why isn't that the case?

Thank you for your reply.
test2.zip

@julian-parker
Collaborator

I'll check if we get the same results here. A few questions:

  • How are you running inference? Through the StableCodec object?
  • Is flash attention installed and working?
  • What GPU are you using?

@hertz-pj
Author

When I set posthoc_bottleneck=False, my code is as follows:

import torch
import torchaudio
from stable_codec import StableCodec

model = StableCodec(
    model_config_path="./stable-codec-speech-16k/model_config.json",
    ckpt_path="./stable-codec-speech-16k/model.ckpt",  # optional, can be None
    device=torch.device("cuda")
)

model.set_posthoc_bottleneck("1x46656_400bps")
audiopath = "./1.wav"

# Encode and decode without the post-hoc bottleneck.
latents, tokens = model.encode(audiopath, posthoc_bottleneck=False)
decoded_audio = model.decode(tokens, posthoc_bottleneck=False)

torchaudio.save("decoded.wav", decoded_audio.squeeze(0).cpu(), model.sample_rate)

The version with posthoc_bottleneck=True is as follows:

import torch
import torchaudio
from stable_codec import StableCodec


model = StableCodec(
    model_config_path="./stable-codec-speech-16k/model_config.json",
    ckpt_path="./stable-codec-speech-16k/model.ckpt",  # optional, can be None
    device=torch.device("cuda")
)

model.set_posthoc_bottleneck("1x46656_400bps")
audiopath = "./1.wav"

# Encode and decode with the post-hoc bottleneck enabled.
latents, tokens = model.encode(audiopath, posthoc_bottleneck=True)
decoded_audio = model.decode(tokens, posthoc_bottleneck=True)

torchaudio.save("46656_400bps.wav", decoded_audio.squeeze(0).cpu(), model.sample_rate)

@julian-parker
Collaborator

Thanks for sharing your code @hertz-pj.

This is what the output should sound like for this example. It's not perfect, but it's better:
decoded.wav.zip

However, I can indeed reproduce the results that you're getting. There seem to be two factors in play:

  • The utterance you shared has heavy dynamic range compression applied, which means it gets attenuated quite heavily when normalized to -20 LUFS. In this situation it's better to use normalize=False in the encoder (see the sketch after this list).
  • There may be a bug in the bottleneck code which causes a further small degradation. I'm investigating this at the moment.
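
For reference, a minimal sketch of how the normalize suggestion might be applied to the snippet above, assuming encode accepts the flag directly (the exact argument name and placement are an assumption on my part, so please check the stable_codec API):

# Sketch (assumed API): skip the -20 LUFS loudness normalization when encoding.
latents, tokens = model.encode(audiopath, posthoc_bottleneck=True, normalize=False)
decoded_audio = model.decode(tokens, posthoc_bottleneck=True)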

@hertz-pj
Author

Thank you very much for your reply. I haven't fully understood the principle of dithered_fsq yet, and I will study it further.

I currently want to train a model with fps=50/80, and I want to use the original fsq instead of dithered_fsq. The main part of the configuration I'm using is as follows:

{
    "model_type": "autoencoder",
    "sample_size": 48000,
    "sample_rate": 16000,
    "audio_channels": 1,
    "model": {
        "pretransform": {
            "type": "patched",
            "enable_grad": true,
            "config": {
                "patch_size": 100,
                "channels": 1
            }
        },
        "encoder": {
            "type": "taae",
            "config": {
                "in_channels": 100,
                "channels": 1024,
                "c_mults": [1, 1, 1],
                "strides": [1, 1, 2],
                "latent_dim": 6,
                "transformer_depths": [8, 8, 12],
                "use_snake": false,
                "checkpointing": false,
                "conformer": false,
                "layer_scale": true,
                "sliding_window": [63,64]
            }
        },
        "decoder": {
            "type": "taae",
            "config": {
                "out_channels": 100,
                "channels": 1024,
                "c_mults": [1, 1, 1],
                "strides": [1, 1, 2],
                "latent_dim": 6,
                "transformer_depths": [8, 8, 12],
                "use_snake": false,
                "checkpointing": false,
                "conformer": false,
                "layer_scale": true,
                "sliding_window": [63,64]
            }
        },
        "bottleneck": {
            "type": "dithered_fsq",
            "config": {
                "dim": 6,
                "levels": 17,
                "dither_inference": false,
                "num_codebooks": 1,
                "noise_dropout": 0.5
            }
        }
    }
}

Is setting the sliding window to [63, 64] problematic? Would using dithered_fsq directly make a significant difference compared to the original fsq?

I will share the final experimental results once I get them.

@julian-parker
Collaborator

sliding_window needs to be less than or equal to the shortest sequence seen during training. Your shortest sequence should be 240 (48000 / 100 / 2), so [63, 64] should work fine.

The original FSQ differs in two main ways compared to dithered_fsq:

  • It doesn't use noise during training, only straight-through gradients.
  • It distributes the quantized levels slightly differently (and non-symmetrically) in the FSQ space. This prevents the use of the post-hoc residual tricks that we use.

If you don't care about either of those aspects, you can use the original FSQ.
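
To make the first difference concrete, here is a toy illustration (not the actual stable-audio-tools implementation; the bounding, level placement, and the exact semantics of noise_dropout are assumptions, so treat it as a sketch only):

import torch

def fsq(z, levels=17):
    # Plain FSQ (toy): bound the latent, round to the nearest level,
    # and pass gradients straight through the rounding.
    z = torch.tanh(z) * (levels // 2)
    z_q = torch.round(z)
    return z + (z_q - z).detach()

def dithered_fsq(z, levels=17, noise_dropout=0.5, training=True):
    # Dithered FSQ (toy): during training, rounding is sometimes replaced by
    # additive uniform dither of +/- half a quantization step.
    z = torch.tanh(z) * (levels // 2)
    if training and torch.rand(()).item() >= noise_dropout:
        return z + (torch.rand_like(z) - 0.5)
    z_q = torch.round(z)
    return z + (z_q - z).detach()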

@julian-parker
Collaborator

BTW - if you want a pseudo-causal version of the model, you can set sliding_window to [64, 0]. This gives a length-64 causal sliding window. Your convolutions will still not be causal, however.
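
If it helps, here is a rough sketch of what such a window means as an attention mask, assuming the two numbers are the backward and forward extents of the window (illustration only, not the actual stable-audio-tools code):

import torch

def sliding_window_mask(seq_len, lookback=64, lookahead=0):
    # Boolean mask where position i may attend to position j when
    # i - lookback < j <= i + lookahead. With lookahead=0 each frame sees only
    # itself and the previous lookback-1 frames, i.e. a causal window.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i + lookahead) & (j > i - lookback)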

@hertz-pj
Author

    return WebDatasetDataLoader(
        wds_configs,
        sample_rate=sample_rate,
        sample_size=sample_size,
        batch_size=batch_size,
        remove_silence=dataset_config.get("remove_silence", False),
        silence_threshold=dataset_config.get("silence_threshold", [0.01, 0.5]),
        max_silence_duration=dataset_config.get("max_silence_duration", 0.25),
        random_crop=dataset_config.get("random_crop", True),
        volume_norm=dataset_config.get("volume_norm", False),
        volume_norm_param=dataset_config.get("volume_norm_param", [-16, 2]),
        num_workers=num_workers,
        persistent_workers=True,
        pin_memory=True,
        force_channels=force_channels,
        epoch_steps=dataset_config.get("epoch_steps", 2000),
        pre_encoded=dataset_config.get("pre_encoded", False),
        resampled_shards=dataset_config.get("resampled_shards", True),
        force_align_text=dataset_config.get("force_align_text", False)
    ).data_loader

Should I change these configurations in the dataset configuration?

Additionally, it seems that I cannot directly follow the configuration approach from stable-audio-tools, because you rewrote the WebDatasetDataLoader method.

@julian-parker
Collaborator

Those dataset params look fine to me.

Are you training with CTC? If not, you can just use the training scripts from stable-audio-tools directly, with the original dataloader.

@hertz-pj
Author

The WavLM perceptual loss plays a crucial role in this work, as evidenced by the significant reduction in mel loss from 1.18 to 0.86. It appears that this part is not included in the public code.

Additionally, the fine-tuned perceptual loss model underwent 150k more training steps compared to the pre-trained model. Should we continue training the model without perceptual loss for an additional 150k steps before making a comparison between the two?

@julian-parker
Collaborator

julian-parker commented Jan 14, 2025

The WavLM perceptual loss is integrated into stable-audio-tools; you just need to add it to your training config.

            "hubert": {
                "weights": {
                  "hubert": 0.25
                },
                "config": {
                  "feature_ids": [-1],
                  "model_name": "WAV2VEC2_LARGE_LV60K"
                }
              }

> Additionally, the fine-tuned perceptual loss model underwent 150k more training steps compared to the pre-trained model. Should we continue training the model without perceptual loss for an additional 150k steps before making a comparison between the two?

That would be another viable approach, yes. Learning rate schedulers mean that this is still an imperfect comparison, however.
