Do you have plans to release a causal version? #6

Open
hertz-pj opened this issue Jan 12, 2025 · 13 comments

Comments

@hertz-pj

Thank you very much for open-sourcing the code and weights.
Do you have any plans to open-source a causal version?

@hertz-pj
Author

hertz-pj commented Jan 12, 2025

I tried some English samples, and the results were roughly as follows. Are these results consistent with yours? The quality is not very good.
test.zip

@liuxubo717
Contributor

Which configuration (e.g., bps) did you use?

@hertz-pj
Author

For the results in test.zip, I set posthoc_bottleneck to False.

I then reset posthoc_bottleneck to True and tried three different parameters: 1x46656_400bps, 2x15625_700bps, and 1x729_1000bps. The outcomes with these settings showed better performance. The new results can be found in test2.zip.

I initially thought that setting posthoc_bottleneck to false would result in the highest bps and therefore the best performance. Why isn't that the case?

Thank you for your reply.
test2.zip

@julian-parker
Collaborator

I'll check if we get the same results here. A few questions:

  • How are you running inference? Through the StableCodec object?
  • Is flash attention installed and working?
  • What GPU are you using?

@hertz-pj
Author

When I set posthoc_bottleneck=False, my code is as follows:

import torch
import torchaudio
from stable_codec import StableCodec

model = StableCodec(
    model_config_path="./stable-codec-speech-16k/model_config.json",
    ckpt_path="./stable-codec-speech-16k/model.ckpt",  # optional, can be None
    device=torch.device("cuda")
)

model.set_posthoc_bottleneck("1x46656_400bps")
audiopath = "./1.wav"

# Encode and decode without the post-hoc bottleneck.
latents, tokens = model.encode(audiopath, posthoc_bottleneck=False)
decoded_audio = model.decode(tokens, posthoc_bottleneck=False)

torchaudio.save("decoded.wav", decoded_audio.squeeze(0).cpu(), model.sample_rate)

The version with posthoc_bottleneck=True is as follows:

import torch
import torchaudio
from stable_codec import StableCodec


model = StableCodec(
    model_config_path="./stable-codec-speech-16k/model_config.json",
    ckpt_path="./stable-codec-speech-16k/model.ckpt",  # optional, can be None
    device=torch.device("cuda")
)

model.set_posthoc_bottleneck("1x46656_400bps")
audiopath = "./1.wav"

# Encode and decode with the post-hoc bottleneck enabled.
latents, tokens = model.encode(audiopath, posthoc_bottleneck=True)
decoded_audio = model.decode(tokens, posthoc_bottleneck=True)

torchaudio.save("46656_400bps.wav", decoded_audio.squeeze(0).cpu(), model.sample_rate)

@julian-parker
Collaborator

Thanks for sharing your code @hertz-pj.

This is what the output should sound like for this example. It's not perfect, but it's better:
decoded.wav.zip

However, I can indeed reproduce the results that you're getting. There seem to be two factors in play:

  • The utterance you shared has heavy dynamic range compression applied, which means it gets attenuated quite heavily when normalized to -20 LUFS. In this situation it's better to use normalize=False in the encoder (see the sketch after this list).
  • There may be a bug in the bottleneck code which causes a further small degradation. I'm investigating this at the moment.
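
For reference, a minimal sketch of how the normalize suggestion might be applied to the snippet above, assuming encode accepts the flag directly (the exact argument name and placement are an assumption on my part, so please check the stable_codec API):

# Sketch (assumed API): skip the -20 LUFS loudness normalization when encoding.
latents, tokens = model.encode(audiopath, posthoc_bottleneck=True, normalize=False)
decoded_audio = model.decode(tokens, posthoc_bottleneck=True)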

@hertz-pj
Author

Thank you very much for your reply. I haven't fully understood the principle of dithered_fsq yet, and I will study it further.

I currently want to train a model with fps=50/80, and I want to use the original fsq instead of dithered_fsq. The main part of the configuration I'm using is as follows:

{
    "model_type": "autoencoder",
    "sample_size": 48000,
    "sample_rate": 16000,
    "audio_channels": 1,
    "model": {
        "pretransform": {
            "type": "patched",
            "enable_grad": true,
            "config": {
                "patch_size": 100,
                "channels": 1
            }
        },
        "encoder": {
            "type": "taae",
            "config": {
                "in_channels": 100,
                "channels": 1024,
                "c_mults": [1, 1, 1],
                "strides": [1, 1, 2],
                "latent_dim": 6,
                "transformer_depths": [8, 8, 12],
                "use_snake": false,
                "checkpointing": false,
                "conformer": false,
                "layer_scale": true,
                "sliding_window": [63,64]
            }
        },
        "decoder": {
            "type": "taae",
            "config": {
                "out_channels": 100,
                "channels": 1024,
                "c_mults": [1, 1, 1],
                "strides": [1, 1, 2],
                "latent_dim": 6,
                "transformer_depths": [8, 8, 12],
                "use_snake": false,
                "checkpointing": false,
                "conformer": false,
                "layer_scale": true,
                "sliding_window": [63,64]
            }
        },
        "bottleneck": {
            "type": "dithered_fsq",
            "config": {
                "dim": 6,
                "levels": 17,
                "dither_inference": false,
                "num_codebooks": 1,
                "noise_dropout": 0.5
            }
        }
    }
}

Is setting the sliding window to [63, 64] problematic? Would using dithered_fsq directly make a significant difference compared to the original fsq?

I will share the final experimental results once I get them.

@julian-parker
Collaborator

sliding_window needs to be less than or equal to the shortest sequence seen during training. Your shortest sequence should be 240 (48000 / 100 / 2), so [63, 64] should work fine.

The original FSQ differs in two main ways compared to dithered_fsq:

  • It doesn't use noise during training, only straight-through gradients.
  • It distributes the quantized levels slightly differently (and non-symmetrically) in the FSQ space. This prevents the use of the post-hoc residual tricks that we use.

If you don't care about either of those aspects, you can use the original FSQ.
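
To make the first difference concrete, here is a toy illustration (not the actual stable-audio-tools implementation; the bounding, level placement, and the exact semantics of noise_dropout are assumptions, so treat it as a sketch only):

import torch

def fsq(z, levels=17):
    # Plain FSQ (toy): bound the latent, round to the nearest level,
    # and pass gradients straight through the rounding.
    z = torch.tanh(z) * (levels // 2)
    z_q = torch.round(z)
    return z + (z_q - z).detach()

def dithered_fsq(z, levels=17, noise_dropout=0.5, training=True):
    # Dithered FSQ (toy): during training, rounding is sometimes replaced by
    # additive uniform dither of +/- half a quantization step.
    z = torch.tanh(z) * (levels // 2)
    if training and torch.rand(()).item() >= noise_dropout:
        return z + (torch.rand_like(z) - 0.5)
    z_q = torch.round(z)
    return z + (z_q - z).detach()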

@julian-parker
Collaborator

BTW - if you want a pseudo-causal version of the model, you can set sliding_window to [64, 0]. This gives a length-64 causal sliding window. Your convolutions will still not be causal, however.
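
If it helps, here is a rough sketch of what such a window means as an attention mask, assuming the two numbers are the backward and forward extents of the window (illustration only, not the actual stable-audio-tools code):

import torch

def sliding_window_mask(seq_len, lookback=64, lookahead=0):
    # Boolean mask where position i may attend to position j when
    # i - lookback < j <= i + lookahead. With lookahead=0 each frame sees only
    # itself and the previous lookback-1 frames, i.e. a causal window.
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i + lookahead) & (j > i - lookback)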

@hertz-pj
Author

    return WebDatasetDataLoader(
        wds_configs,
        sample_rate=sample_rate,
        sample_size=sample_size,
        batch_size=batch_size,
        remove_silence=dataset_config.get("remove_silence", False),
        silence_threshold=dataset_config.get("silence_threshold", [0.01, 0.5]),
        max_silence_duration=dataset_config.get("max_silence_duration", 0.25),
        random_crop=dataset_config.get("random_crop", True),
        volume_norm=dataset_config.get("volume_norm", False),
        volume_norm_param=dataset_config.get("volume_norm_param", [-16, 2]),
        num_workers=num_workers,
        persistent_workers=True,
        pin_memory=True,
        force_channels=force_channels,
        epoch_steps=dataset_config.get("epoch_steps", 2000),
        pre_encoded=dataset_config.get("pre_encoded", False),
        resampled_shards=dataset_config.get("resampled_shards", True),
        force_align_text=dataset_config.get("force_align_text", False)
    ).data_loader

Should I change these configurations in the dataset configuration?

Additionally, it seems that I cannot directly follow the configuration approach from stable-audio-tools, because you rewrote the WebDatasetDataLoader method.

@julian-parker
Collaborator

Those dataset params look fine to me.

Are you training with CTC? If not, you can just use the training scripts from stable-audio-tools directly, with the original dataloader.

@hertz-pj
Author

The WavLM perceptual loss plays a crucial role in this work, as evidenced by the significant reduction in mel loss from 1.18 to 0.86. It appears that this part is not included in the public code.

Additionally, the fine-tuned perceptual loss model underwent 150k more training steps compared to the pre-trained model. Should we continue training the model without perceptual loss for an additional 150k steps before making a comparison between the two?

@julian-parker
Collaborator

julian-parker commented Jan 14, 2025

The WavLM perceptual loss is integrated into stable-audio-tools; you just need to add it to your training config.

            "hubert": {
                "weights": {
                  "hubert": 0.25
                },
                "config": {
                  "feature_ids": [-1],
                  "model_name": "WAV2VEC2_LARGE_LV60K"
                }
              }

> Additionally, the fine-tuned perceptual loss model underwent 150k more training steps compared to the pre-trained model. Should we continue training the model without perceptual loss for an additional 150k steps before making a comparison between the two?

That would be another viable approach, yes. Learning rate schedulers mean that this is still an imperfect comparison, however.
