Baseline system for Language-based Audio Retrieval (Task 6B) in DCASE 2023 Challenge

Language-based Audio Retrieval in DCASE 2023 Challenge

This repository provides the baseline system for Language-based Audio Retrieval (Task 6B) in the DCASE 2023 Challenge.

2023/03/20 Update: Training checkpoints for the baseline system and its audio encoder are available on Zenodo: DOI.

Language-based Audio Retrieval

Baseline Retrieval System

- Audio Encoder                   # fine-tuned PANNs, i.e., CNN14
- Text Encoder                    # pretrained Sentence-BERT, i.e., all-mpnet-base-v2
- Contrastive Learning Objective  # InfoNCE loss
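
The contrastive objective listed above can be illustrated with a small, dependency-free sketch. It assumes a symmetric InfoNCE loss over a batch of paired audio and text embeddings, dot-product similarity on L2-normalized vectors, and a temperature of 0.07. These choices and all function names are illustrative assumptions; the repository's own loss lives in utils/criterion_utils.py and may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def info_nce(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where audio_embs[i] pairs with text_embs[i].

    The temperature is an assumed value, not taken from the repository's conf.yaml.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    n = len(audio)
    # logits[i][j]: temperature-scaled similarity between audio i and text j.
    logits = [[sum(x * y for x, y in zip(a, t)) / temperature for t in text]
              for a in audio]
    # Audio-to-text: each row should concentrate its mass on the diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio: the same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


if __name__ == "__main__":
    matched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(f"matched={matched:.4f} swapped={swapped:.4f}")
```

Correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that pulls matching audio and captions together during training.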

Quick Start

This codebase is developed with Python 3.9 and PyTorch 1.13.0.

  1. Check out the source code and install the required Python packages:
git clone https://github.com/xieh97/dcase2023-audio-retrieval.git
cd dcase2023-audio-retrieval
pip install -r requirements.txt
  2. Download the Clotho dataset and arrange it as follows:
Clotho
├─ clotho_captions_development.csv
├─ clotho_captions_validation.csv
├─ clotho_captions_evaluation.csv
├─ development
│   └─...(3839 wavs)
├─ validation
│   └─...(1045 wavs)
└─ evaluation
    └─...(1045 wavs)
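
Before pre-processing, it can help to confirm that the download matches the tree above. The following is a minimal sketch using only the standard library; the function name `check_clotho_layout` and the `Clotho` root path are assumptions for illustration and are not part of this repository.

```python
from pathlib import Path

# Expected layout, copied from the tree above: split name -> number of wav files.
EXPECTED_WAVS = {"development": 3839, "validation": 1045, "evaluation": 1045}


def check_clotho_layout(root):
    """Return a list of human-readable problems; an empty list means the layout matches."""
    root = Path(root)
    problems = []
    for split, expected in EXPECTED_WAVS.items():
        csv_path = root / f"clotho_captions_{split}.csv"
        if not csv_path.is_file():
            problems.append(f"missing caption file: {csv_path.name}")
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split}")
            continue
        found = sum(1 for _ in split_dir.glob("*.wav"))
        if found != expected:
            problems.append(f"{split}: expected {expected} wavs, found {found}")
    return problems


if __name__ == "__main__":
    # "Clotho" is the directory name used in the tree above; adjust the path as needed.
    for line in check_clotho_layout("Clotho") or ["layout OK"]:
        print(line)
```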
  3. Pre-process audio and caption data:
preprocessing
├─ audio_logmel.py              # extract log-mel energies from audio clips
├─ clotho_dataset.py            # process audio captions, generate fids and cids
├─ sbert_embeddings.py          # generate sentence embeddings using Sentence-BERT (all-mpnet-base-v2)
└─ cnn14_transfer.py            # transfer pretrained CNN14 (Cnn14_mAP=0.431.pth)
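
As a rough illustration of the caption-processing step, the sketch below reads a caption CSV, normalizes each caption, and assigns a file id (fid) and a caption id (cid) to every audio-caption pair. It assumes a `file_name` column followed by `caption_*` columns, and it uses plain running integers for the ids. The actual scheme in clotho_dataset.py may differ, and the helper names here are hypothetical.

```python
import csv
import io
import re


def clean_caption(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def index_captions(csv_file):
    """Read a caption CSV and return one record per (audio file, caption) pair.

    Assumes a 'file_name' column plus one or more 'caption_*' columns.
    fid and cid are running integers here, which is an illustrative choice.
    """
    records = []
    cid = 0
    for fid, row in enumerate(csv.DictReader(csv_file)):
        for key, value in row.items():
            # Skip non-caption columns and empty or missing caption cells.
            if key and key.startswith("caption_") and value:
                records.append({"fid": fid, "cid": cid,
                                "file_name": row["file_name"],
                                "caption": clean_caption(value)})
                cid += 1
    return records


if __name__ == "__main__":
    sample = io.StringIO(
        "file_name,caption_1,caption_2\n"
        'rain.wav,"Rain falls, softly.",Water drips on a roof\n'
    )
    for record in index_captions(sample):
        print(record)
```

Replacing punctuation with a space rather than deleting it keeps words such as "falls,softly" from merging, at the cost of splitting contractions.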
  4. Train the baseline system:
models
├─ core.py                      # dual-encoder framework
├─ audio_encoders.py            # audio encoders
└─ text_encoders.py             # text encoders

utils
├─ criterion_utils.py           # loss functions
├─ data_utils.py                # PyTorch dataset classes
└─ model_utils.py               # model.train(), model.eval(), etc.

conf.yaml                       # experimental settings
main.py                         # main()
  5. Calculate retrieval metrics:
postprocessing
├─ xmodal_scores.py             # calculate audio-text scores
└─ xmodal_retrieval.py          # calculate mAP, R@1, R@5, R@10, etc.
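
To make the metrics concrete, here is a minimal sketch of recall at k (R@k) and mean average precision for text-to-audio retrieval. It assumes a caption-by-audio score matrix and exactly one relevant audio clip per caption, in which case average precision reduces to the reciprocal rank. The function names, the cutoff of 10, and the tie-handling rule are assumptions; xmodal_retrieval.py may compute these differently.

```python
def rank_of_target(scores, target):
    """1-based rank of the target item when items are sorted by descending score."""
    # Count items scoring strictly higher than the target; ties favour the target.
    return 1 + sum(1 for s in scores if s > scores[target])


def retrieval_metrics(score_matrix, targets, ks=(1, 5, 10), map_cutoff=10):
    """Compute R@k and mAP@cutoff for text-to-audio retrieval.

    score_matrix[q][a] is the score of audio a for caption q, and targets[q]
    is the index of the single relevant audio for caption q (an assumption
    that fits one-clip-per-caption data).
    """
    ranks = [rank_of_target(row, t) for row, t in zip(score_matrix, targets)]
    n = len(ranks)
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    # With one relevant item per query, average precision reduces to 1 / rank.
    metrics[f"mAP@{map_cutoff}"] = sum(1.0 / r for r in ranks if r <= map_cutoff) / n
    return metrics


if __name__ == "__main__":
    scores = [
        [0.9, 0.1, 0.3],  # caption 0: target audio 0 is ranked first
        [0.2, 0.4, 0.8],  # caption 1: target audio 1 is ranked second
        [0.5, 0.6, 0.1],  # caption 2: target audio 2 is ranked third
    ]
    print(retrieval_metrics(scores, targets=[0, 1, 2]))
```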

Examples

  1. Code example for using the pretrained audio encoder:
example
├─ audio_encoder.py             # code example for audio encoder
├─ example.wav                  # audio segment example
└─ audio_encoder.pth            # audio encoder checkpoint (https://doi.org/10.5281/zenodo.7752975)
