- Digital Signal Processing Course Project (EE2015).
- Audio Classification using Mel Spectrograms and Convolutional Neural Networks.
- Finished: 09/07/2023.
- The dataset includes audio files of 5 classes: `cailuong`, `catru`, `chauvan`, `cheo`, `hatxam`.
- Each class includes 500 wav files with a length of about 30 s each.
- Vietnam Traditional Music (5 genres): https://www.kaggle.com/datasets/homata123/vntm-for-building-model-5-genres
- Download the dataset, create a folder named `rawdata` in the project's folder, and arrange the dataset as shown below (a quick layout check follows the tree):

```
├── model_images
├── notebook
├── rawdata
│   ├── cailuong
│   │   ├── Cailuong000.wav
│   │   ├── Cailuong001.wav
│   │   ├── Cailuong002.wav
│   │   ├── ...
│   ├── catru
│   │   ├── Catru000.wav
│   │   ├── Catru001.wav
│   │   ├── Catru002.wav
│   │   ├── ...
│   ├── chauvan
│   │   ├── Chauvan000.wav
│   │   ├── Chauvan001.wav
│   │   ├── Chauvan002.wav
│   │   ├── ...
│   ├── cheo
│   │   ├── Cheo000.wav
│   │   ├── Cheo001.wav
│   │   ├── Cheo002.wav
│   │   ├── ...
│   ├── hatxam
│   │   ├── Hatxam000.wav
│   │   ├── Hatxam001.wav
│   │   ├── Hatxam002.wav
│   │   ├── ...
├── test_audio
...
```
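A quick way to confirm the layout is a minimal sketch like the one below, assuming the five class folders sit directly under `rawdata/`:

```python
# Minimal sanity check for the rawdata layout described above.
import os

RAW_DIR = "rawdata"
CLASSES = ["cailuong", "catru", "chauvan", "cheo", "hatxam"]

for cls in CLASSES:
    class_dir = os.path.join(RAW_DIR, cls)
    wav_files = [f for f in os.listdir(class_dir) if f.lower().endswith(".wav")]
    print(f"{cls}: {len(wav_files)} wav files")  # expect about 500 per class
```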
The project's workflow is illustrated in the figure below:
Audio feature extraction is a necessary step in audio signal processing, which is a subfield of signal processing. Different features capture different aspects of sound. Here are some signal-domain features (see the sketch after this list):

- **Time domain:** extracted from the waveform of the raw audio: zero-crossing rate, amplitude envelope, RMS energy, ...
- **Frequency domain:** signals are generally converted from the time domain to the frequency domain using the Fourier transform: band energy ratio, spectral centroid, spectral flux, ...
- **Time-frequency representation:** obtained by applying the Short-Time Fourier Transform (STFT) to the time-domain waveform: spectrogram, Mel-spectrogram, constant-Q transform, ...
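A minimal sketch of the three feature families using librosa; the file name is just a placeholder for any wav in the dataset:

```python
import librosa
import numpy as np

# Placeholder file; any 30 s wav from the dataset works.
y, sr = librosa.load("rawdata/cheo/Cheo000.wav", duration=30.0)

# Time domain: zero-crossing rate and RMS energy per frame.
zcr = librosa.feature.zero_crossing_rate(y)
rms = librosa.feature.rms(y=y)

# Frequency domain: spectral centroid of each STFT frame.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

# Time-frequency: STFT magnitude spectrogram.
spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

print(zcr.shape, rms.shape, centroid.shape, spec.shape)
```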
In this repo, we extract Mel-spectrogram images from the audio files of the dataset and feed them to a CNN model as an image classification task.
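A minimal sketch of the extraction step, assuming common librosa defaults; the exact parameters and paths live in `config.py` and `processing.py`:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder input file; processing.py loops over the whole dataset.
y, sr = librosa.load("rawdata/catru/Catru000.wav", duration=30.0)

# Mel-scaled power spectrogram, converted to dB for better contrast.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Save the spectrogram as a bare image for the CNN
# (assumes the mel-images/ folder already exists).
fig, ax = plt.subplots()
librosa.display.specshow(mel_db, sr=sr, hop_length=512, ax=ax)
ax.axis("off")
fig.savefig("mel-images/Catru000.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```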
We propose 3 models using the extracted Mel-spectrograms as input images. For each image, the output vector gives the probabilities of the 5 classes.
In the inference phase, we propose to use late fusion of probabilities, referred to as PROD fusion. Consider the predicted probabilities of each model as $\mathbf{p}_s = (p_{s1}, p_{s2}, \ldots, p_{s5})$, where $s = 1, 2, 3$ indexes the models. PROD fusion multiplies the class probabilities element-wise across the models:

$$p_c = \prod_{s=1}^{3} p_{sc}, \qquad c = 1, \ldots, 5.$$

Finally, the predicted label is $\hat{y} = \arg\max_{c} p_c$.
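A numpy sketch of this fusion rule; the three vectors stand in for the softmax outputs of model1, model2, and model3 on one image:

```python
import numpy as np

# Illustrative softmax outputs for one spectrogram (5 classes each).
p1 = np.array([0.60, 0.10, 0.10, 0.10, 0.10])
p2 = np.array([0.40, 0.30, 0.10, 0.10, 0.10])
p3 = np.array([0.50, 0.20, 0.10, 0.10, 0.10])

fused = p1 * p2 * p3           # element-wise PROD fusion
label = int(np.argmax(fused))  # argmax is unaffected by rescaling
print(fused, label)
```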
To run the code of this project, please follow these steps:

- Install the required libraries and dependencies (an install command is sketched below): `numpy`, `librosa`, `tensorflow`, `matplotlib`, `pydub`, `sklearn`, `seaborn`.
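A typical install command for the list above (on PyPI, `sklearn` ships as `scikit-learn`; `streamlit` is added here for the app step further down):

```
pip install numpy librosa tensorflow matplotlib pydub scikit-learn seaborn streamlit
```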
- Note! To avoid errors locally when using `pydub.AudioSegment`, it's better to download ffmpeg and add it to your environment variables. Tutorial here: https://phoenixnap.com/kb/ffmpeg-windows
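If editing the PATH variable is inconvenient, pydub can also be pointed at the binary directly; the paths below are placeholders for wherever ffmpeg was unpacked and whichever file you want to decode:

```python
from pydub import AudioSegment

# Placeholder path to a local ffmpeg binary (Windows-style example).
AudioSegment.converter = "C:/ffmpeg/bin/ffmpeg.exe"

# Subsequent decodes go through that binary (placeholder file name).
audio = AudioSegment.from_file("test_audio/example.wav")
```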
- Config your own parameters in `config.py`. Directory configs are available and compatible with the project's folder structure; hence, it's not recommended to change them (a hypothetical example is sketched below).
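The parameter names below are hypothetical, shown only to illustrate the kind of values a config file like this typically holds; the real `config.py` defines its own names:

```python
# Hypothetical config sketch; not the project's actual config.py.
RAW_DIR = "rawdata"        # input wav folders, one per class
MEL_DIR = "mel-images"     # output Mel-spectrogram images
DATASET_DIR = "dataset"    # train/val/test image folders

SAMPLE_RATE = 22050        # audio sampling rate used for loading
N_FFT = 2048               # STFT window size
HOP_LENGTH = 512           # STFT hop size
N_MELS = 128               # number of Mel bands
```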
- Run `processing.py`. After running, the `mel-images` folder contains all the Mel-spectrogram images extracted from the 5 classes, and the `dataset` folder contains train/val/test folders of images of the 5 classes. Constructing the dataset is completed.
- At `build/train_model.py`, change the `model_index` to 1, 2, or 3 at the last line to train model1, model2, or model3. Then, run this file. After running, the best model's `.h5` file will be saved in the `model` folder. Training is completed (a reload sketch is shown below).
- Run the Streamlit app at `app/app.py`, upload your new audios, and get predictions. The audios uploaded in the app will be saved in the `audio_from_user` folder. Run the app using this command:

```
streamlit run app/app.py
```
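The app's code is not reproduced here, but a minimal upload flow of this kind usually looks like the hypothetical sketch below (not the project's actual `app.py`):

```python
import os
import streamlit as st

# Hypothetical upload flow; the real app.py also runs the models.
uploaded = st.file_uploader("Upload a wav file", type=["wav"])
if uploaded is not None:
    save_path = os.path.join("audio_from_user", uploaded.name)
    with open(save_path, "wb") as f:
        f.write(uploaded.getbuffer())
    st.write(f"Saved to {save_path}")
```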
[1] Vietnam Traditional Music (5 genres), https://www.kaggle.com/datasets/homata123/vntm-for-building-model-5-genres.
[2] Librosa Library, https://librosa.org/doc/latest/index.html
[3] TensorFlow, https://www.tensorflow.org/
[4] Chu Ba Thanh, Trinh Van Loan, Dao Thi Le Thuy, Automatic Identification of Some Vietnamese Folk Songs Cheo and Quanho Using Deep Neural Networks, https://vjs.ac.vn/index.php/jcc/article/view/15961
[5] Valerio Velardo - The Sound of AI, https://www.youtube.com/@ValerioVelardoTheSoundofAI
[6] Dipti Joshi, Jyoti Pareek, Pushkar Ambatkar, Comparative Study of MFCC and Mel Spectrogram for Raga Classification Using CNN, https://indjst.org/articles/comparative-study-of-mfcc-and-mel-spectrogram-for-raga-classification-using-cnn
[7] Loris Nanni et al., Ensemble of Convolutional Neural Networks to Improve Animal Audio Classification, https://asmp-eurasipjournals.springeropen.com/articles/10.1186/s13636-020-00175-3