diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 60223f4da..a1cb2d701 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -215,6 +215,8 @@ title: File names and splits - local: datasets-manual-configuration title: Manual Configuration + - local: datasets-audio + title: Audio Dataset - local: datasets-image title: Image Dataset - local: spaces diff --git a/docs/hub/datasets-audio.md b/docs/hub/datasets-audio.md new file mode 100644 index 000000000..80c519629 --- /dev/null +++ b/docs/hub/datasets-audio.md @@ -0,0 +1,218 @@ +# Audio Dataset + +This guide will show you how to configure your dataset repository with audio files. You can find accompanying examples of repositories in this [Audio datasets examples collection](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607). + +A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub. + +--- + +Additional information about your audio files - such as transcriptions - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`). + +Alternatively, audio files can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format. + +## Only audio files + +If your dataset only consists of one column with audio, you can simply store your audio files at the root: + +```plaintext +my_dataset_repository/ +├── 1.wav +├── 2.wav +├── 3.wav +└── 4.wav +``` + +or in a subdirectory: + +```plaintext +my_dataset_repository/ +└── audio + ├── 1.wav + ├── 2.wav + ├── 3.wav + └── 4.wav +``` + +Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including AIFF, FLAC, MP3, OGG and WAV. + +```plaintext +my_dataset_repository/ +└── audio + ├── 1.aiff + ├── 2.ogg + ├── 3.mp3 + └── 4.flac +``` + +If you have several splits, you can put your audio files into directories named accordingly: + +```plaintext +my_dataset_repository/ +├── train +│   ├── 1.wav +│   └── 2.wav +└── test + ├── 3.wav + └── 4.wav +``` + +See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits. + +## Additional columns + +If there is additional information you'd like to include about your dataset, like the transcription, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different audio tasks like [text-to-speech](https://huggingface.co/tasks/text-to-speech) or [automatic speech recognition](https://huggingface.co/tasks/automatic-speech-recognition). 
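Once the metadata file is in place, the repository (laid out as shown just below) can be loaded directly with the `datasets` library. A minimal sketch, where `username/my_dataset_repository` is a placeholder repo id:

```python
from datasets import load_dataset

# The metadata.csv columns (e.g. "animal") are loaded automatically
# alongside the "audio" column.
dataset = load_dataset("username/my_dataset_repository")

print(dataset["train"][0])
```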
```plaintext
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── metadata.csv
```

Your `metadata.csv` file must have a `file_name` column which links audio files with their metadata:

```csv
file_name,animal
1.wav,cat
2.wav,cat
3.wav,dog
4.wav,dog
```

You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`:

```jsonl
{"file_name": "1.wav", "animal": "cat"}
{"file_name": "2.wav", "animal": "cat"}
{"file_name": "3.wav", "animal": "dog"}
{"file_name": "4.wav", "animal": "dog"}
```

## Relative paths

The metadata file must be located either in the same directory as the audio files it links to, or in any parent directory, as in this example:

```plaintext
my_dataset_repository/
└── test
    ├── audio
    │   ├── 1.wav
    │   ├── 2.wav
    │   ├── 3.wav
    │   └── 4.wav
    └── metadata.csv
```

In this case, the `file_name` column must be a full relative path to the audio files, not just the filename:

```csv
file_name,animal
audio/1.wav,cat
audio/2.wav,cat
audio/3.wav,dog
audio/4.wav,dog
```

The metadata file cannot be put in a subdirectory of the directory containing the audio files.

In this example, the `test` directory sets the name of the test split. See [File names and splits](./datasets-file-names-and-splits) for more information.

## Audio classification

For audio classification datasets, you can also use a simple setup: use directories to name the audio classes. Store your audio files in a directory structure like:

```plaintext
my_dataset_repository/
├── cat
│   ├── 1.wav
│   └── 2.wav
└── dog
    ├── 3.wav
    └── 4.wav
```

The dataset created with this structure contains two columns: `audio` and `label` (with values `cat` and `dog`).

You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information):

```plaintext
my_dataset_repository/
├── test
│   ├── cat
│   │   └── 2.wav
│   └── dog
│       └── 4.wav
└── train
    ├── cat
    │   └── 1.wav
    └── dog
        └── 3.wav
```

You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header:

```yaml
configs:
  - config_name: default # Name of the dataset subset, if applicable.
    drop_labels: true
```

## Large scale datasets

### WebDataset format

The [WebDataset](./datasets-webdataset) format is well suited for large-scale audio datasets (see [AlienKevin/sbs_cantonese](https://huggingface.co/datasets/AlienKevin/sbs_cantonese) for an example).
It consists of TAR archives containing audio files and their metadata and is optimized for streaming. It is useful if you have a large number of audio files and want streaming data loaders for large-scale training.

```plaintext
my_dataset_repository/
├── train-0000.tar
├── train-0001.tar
├── ...
└── train-1023.tar
```

To make a WebDataset TAR archive, create a directory containing the audio files and metadata files to be archived, and create the TAR archive using e.g. the `tar` command.
Archives are generally kept to around 1GB each.
Make sure each audio file and its metadata file share the same file prefix, for example:

```plaintext
train-0000/
├── 000.flac
├── 000.json
├── 001.flac
├── 001.json
├── ...
├── 999.flac
└── 999.json
```
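If you prefer to script the archiving step, Python's standard `tarfile` module works as well. A minimal sketch, assuming the pairs above live in a local `train-0000/` directory:

```python
import os
import tarfile

# Add files in sorted order so each audio/metadata pair stays adjacent
# in the archive, which is what WebDataset readers expect when streaming.
with tarfile.open("train-0000.tar", "w") as tar:
    for name in sorted(os.listdir("train-0000")):
        tar.add(os.path.join("train-0000", name), arcname=name)
```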
Note that for user convenience and to enable the [Dataset Viewer](./datasets-viewer), every dataset hosted on the Hub is automatically converted to Parquet format, up to the first 5GB.
Read more about it in the [Parquet format](./datasets-viewer#access-the-parquet-files) documentation.

### Parquet format

Instead of uploading the audio files and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file.
This is useful if you have a large number of audio files, if you want to embed multiple audio columns, or if you want to store additional information about the audio in the same file.
Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

```plaintext
my_dataset_repository/
└── train.parquet
```

Audio columns are of type _struct_, with a binary field `"bytes"` for the audio data and a string field `"path"` for the audio file name or path.
You should specify the feature types of the columns directly in YAML in the README header, for example:

```yaml
dataset_info:
  features:
    - name: audio
      dtype: audio
    - name: caption
      dtype: string
```

Alternatively, Parquet files with Audio data can be created using the `datasets` library by setting the column type to `Audio()` and using the `.to_parquet(...)` method or `.push_to_hub(...)`. You can find a guide on loading audio datasets in `datasets` [here](../datasets/audio_load).
diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md
index 9a3faaa00..a60009ace 100644
--- a/docs/hub/datasets-data-files-configuration.md
+++ b/docs/hub/datasets-data-files-configuration.md
@@ -36,4 +36,4 @@ And if your images/audio files have metadata (e.g. captions, bounding boxes, tra
 We provide two guides that you can check out:
 
 - [How to create an image dataset](./datasets-image) ([example datasets](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65))
-- [How to create an audio dataset](/docs/datasets/audio_dataset)
+- [How to create an audio dataset](./datasets-audio) ([example datasets](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607))
diff --git a/docs/hub/datasets-image.md b/docs/hub/datasets-image.md
index b6d222880..be2874439 100644
--- a/docs/hub/datasets-image.md
+++ b/docs/hub/datasets-image.md
@@ -59,7 +59,7 @@ See [File names and splits](./datasets-file-names-and-splits) for more informati
 
 ## Additional columns
 
-If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection.
+If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like [text captioning](https://huggingface.co/tasks/image-to-text) or [object detection](https://huggingface.co/tasks/object-detection).
 
 ```
 my_dataset_repository/
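As a companion to the Parquet section of the new `docs/hub/datasets-audio.md` above: a minimal sketch of the `datasets` workflow it describes, assuming local files `1.wav` through `4.wav` and made-up captions:

```python
from datasets import Audio, Dataset

# Build a dataset from local audio paths plus captions, cast the column
# to the Audio feature type, then export a single Parquet file.
ds = Dataset.from_dict({
    "audio": ["1.wav", "2.wav", "3.wav", "4.wav"],
    "caption": ["cat", "cat", "dog", "dog"],
}).cast_column("audio", Audio())

ds.to_parquet("train.parquet")
# or upload directly: ds.push_to_hub("username/my_dataset_repository")
```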