update documentation for image datasets (#1148)
* update documentation for image datasets

* Update docs/hub/datasets-image.md

Co-authored-by: Pedro Cuenca <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Pedro Cuenca <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Pedro Cuenca <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Pedro Cuenca <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Pedro Cuenca <[email protected]>

* remove the mention to the dataset scripts

* link to the datasets doc for the scripts

* grammarly

* Update docs/hub/datasets-image.md

Co-authored-by: Mario Šaško <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Mario Šaško <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Mario Šaško <[email protected]>

* use the tree cli to improve the clarity

* add a sequence about raw bytes in parquet

* don't tell that the image files must be in the same directory

* corrections by @polinaeterna

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Polina Kazakova <[email protected]>

* minor fix + rewrite about metadata location and file_name

* fix tree structure

* fix (hopefully the last one!)

---------

Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Mario Šaško <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Polina Kazakova <[email protected]>
5 people authored Dec 4, 2023
1 parent 680d07e commit 19fc552
Showing 5 changed files with 169 additions and 3 deletions.
2 changes: 2 additions & 0 deletions docs/hub/_toctree.yml
@@ -164,6 +164,8 @@
title: File names and splits
- local: datasets-manual-configuration
title: Manual Configuration
- local: datasets-image
title: Image Dataset
- local: spaces
title: Spaces
isExpanded: true
2 changes: 1 addition & 1 deletion docs/hub/datasets-data-files-configuration.md
@@ -25,5 +25,5 @@ And if your images/audio files have metadata (e.g. captions, bounding boxes, tra

We provide two guides that you can check out:

- [How to create an image dataset](https://huggingface.co/docs/datasets/image_dataset)
- [How to create an image dataset](./datasets-image)
- [How to create an audio dataset](https://huggingface.co/docs/datasets/audio_dataset)
2 changes: 1 addition & 1 deletion docs/hub/datasets-download-stats.md
@@ -4,5 +4,5 @@

The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads. This means that:

* The download count is the same regardless of whether the data is directly stored on the Hub repo or if the repository has a script to load the data from an external source.
* The download count is the same regardless of whether the data is directly stored on the Hub repo or if the repository has a [script](https://huggingface.co/docs/datasets/dataset_script) to load the data from an external source.
* If a user manually downloads the data using tools like `wget` or the Hub's user interface (UI), those downloads will not be included in the download count.
164 changes: 164 additions & 0 deletions docs/hub/datasets-image.md
@@ -0,0 +1,164 @@
# Image Dataset

This guide will show you how to configure your dataset repository with image files. You can find accompanying example repositories in this [Image datasets examples collection](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65).

A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub. Any additional information about your dataset - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`).

## Only images

If your dataset consists of a single column of images, you can store your image files at the root of the repository:

```
my_dataset_repository/
├── 1.jpg
├── 2.jpg
├── 3.jpg
└── 4.jpg
```

or in a subdirectory:

```
my_dataset_repository/
└── images
├── 1.jpg
├── 2.jpg
├── 3.jpg
└── 4.jpg
```

Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including PNG, JPEG, TIFF and WebP.

```
my_dataset_repository/
└── images
├── 1.jpg
├── 2.png
├── 3.tiff
└── 4.webp
```
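
Once the repository is on the Hub, it can be loaded with the `datasets` library. Here is a minimal sketch, assuming a hypothetical repository id `username/my_dataset_repository`:

```python
from datasets import load_dataset

# The images are exposed as a single "image" column and decoded
# as PIL images on access; the default split is "train".
dataset = load_dataset("username/my_dataset_repository", split="train")
print(dataset[0]["image"])
```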

If you have several splits, you can put your images into directories named accordingly:

```
my_dataset_repository/
├── train
│   ├── 1.jpg
│   └── 2.jpg
└── test
├── 3.jpg
└── 4.jpg
```

See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits.
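
When the dataset is loaded, each directory maps to a split of the same name, for example (repository id hypothetical):

```python
from datasets import load_dataset

# The "train" and "test" directories become splits of the same name.
test_set = load_dataset("username/my_dataset_repository", split="test")
print(test_set.num_rows)  # 2 with the example layout above
```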

## Additional columns

If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection.

```
my_dataset_repository/
└── train
├── 1.jpg
├── 2.jpg
├── 3.jpg
├── 4.jpg
└── metadata.csv
```

Your `metadata.csv` file must have a `file_name` column that links image files to their metadata:

```csv
file_name,text
1.jpg,a drawing of a green pokemon with red eyes
2.jpg,a green and yellow toy with a red nose
3.jpg,a red and white ball with an angry look on its face
4.jpg,a cartoon ball with a smile on its face
```

You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`:

```jsonl
{"file_name": "1.jpg","text": "a drawing of a green pokemon with red eyes"}
{"file_name": "2.jpg","text": "a green and yellow toy with a red nose"}
{"file_name": "3.jpg","text": "a red and white ball with an angry look on its face"}
{"file_name": "4.jpg","text": "a cartoon ball with a smile on it's face"}
```
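
If you prefer to generate the metadata file programmatically, a short sketch like this one (reusing the file names and captions from the example above) writes a valid `metadata.jsonl`:

```python
import json

captions = {
    "1.jpg": "a drawing of a green pokemon with red eyes",
    "2.jpg": "a green and yellow toy with a red nose",
    "3.jpg": "a red and white ball with an angry look on its face",
    "4.jpg": "a cartoon ball with a smile on its face",
}

# One JSON object per line, each with the mandatory "file_name" key.
with open("train/metadata.jsonl", "w") as f:
    for file_name, text in captions.items():
        f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")
```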

## Relative paths

The metadata file must be located either in the same directory as the images it links to, or in any parent directory, as in this example:

```
my_dataset_repository/
└── train
├── images
│   ├── 1.jpg
│   ├── 2.jpg
│   ├── 3.jpg
│   └── 4.jpg
└── metadata.csv
```

In this case, the `file_name` column must contain the full relative path to the images, not just the file name:

```csv
file_name,text
images/1.jpg,a drawing of a green pokemon with red eyes
images/2.jpg,a green and yellow toy with a red nose
images/3.jpg,a red and white ball with an angry look on its face
images/4.jpg,a cartoon ball with a smile on its face
```

The metadata file cannot be placed in a subdirectory of the directory containing the images.

## Image classification

For image classification datasets, you can also use a simple setup: name the directories after the image classes. Store your image files in a structure like:

```
my_dataset_repository/
├── green
│   ├── 1.jpg
│   └── 2.jpg
└── red
├── 3.jpg
└── 4.jpg
```

The dataset created with this structure contains two columns: `image` and `label` (with values `green` and `red`).

You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information):

```
my_dataset_repository/
├── test
│   ├── green
│   │   └── 2.jpg
│   └── red
│   └── 4.jpg
└── train
├── green
│   └── 1.jpg
└── red
└── 3.jpg
```
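
When such a repository is loaded, the directory names are encoded as a `ClassLabel` feature. A quick check, with a hypothetical repository id:

```python
from datasets import load_dataset

dataset = load_dataset("username/my_dataset_repository", split="train")
print(dataset.features["label"])  # ClassLabel(names=['green', 'red'])
print(dataset[0]["label"])        # 0, i.e. "green"
```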

You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header:

```yaml
configs:
- config_name: default
drop_labels: true
```
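
With this configuration, the loaded dataset keeps only the `image` column; a quick way to verify (repository id hypothetical):

```python
from datasets import load_dataset

dataset = load_dataset("username/my_dataset_repository", split="train")
print(dataset.column_names)  # ['image'], no auto-generated "label" column
```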

## Parquet format

Instead of uploading the images and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file. This is useful if you have a large number of images, if you want to embed multiple image columns, or if you want to store additional information about the images in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

```
my_dataset_repository/
└── train.parquet
```
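
One way to produce such a file is to let the `datasets` library embed the image bytes for you. A minimal sketch (file names and captions here are placeholders):

```python
from datasets import Dataset, Image

# Build a dataset from image paths, cast the column so the files are
# decoded as images, and write a self-contained Parquet file.
dataset = Dataset.from_dict({
    "image": ["1.jpg", "2.jpg"],
    "text": ["a drawing of a green pokemon", "a green and yellow toy"],
}).cast_column("image", Image())
dataset.to_parquet("train.parquet")  # the image data is embedded in the file
```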

Note that, for user convenience, every dataset hosted on the Hub is automatically converted to Parquet format. Read more about it in the [Parquet format](./datasets-viewer#access-the-parquet-files) documentation.
2 changes: 1 addition & 1 deletion docs/hub/datasets-overview.md
@@ -4,7 +4,7 @@

The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a [Dataset Viewer](./datasets-viewer) to showcase the data.

Each dataset is a [Git repository](./repositories), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure will ensure that the dataset page on the Hub will have a Viewer.
Each dataset is a [Git repository](./repositories) that contains the data required to generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure will ensure that the dataset page on the Hub will have a Viewer.

## Search for datasets

