update documentation for image datasets (#1148)
* update documentation for image datasets

* Update docs/hub/datasets-image.md

Co-authored-by: Pedro Cuenca <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Pedro Cuenca <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Pedro Cuenca <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Pedro Cuenca <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Pedro Cuenca <[email protected]>

* remove the mention to the dataset scripts

* link to the datasets doc for the scripts

* grammarly

* Update docs/hub/datasets-image.md

Co-authored-by: Mario Šaško <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Mario Šaško <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Mario Šaško <[email protected]>

* use the tree cli to improve the clarity

* add a sequence about raw bytes in parquet

* don't tell that the image files must be in the same directory

* corrections by @polinaeterna

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Quentin Lhoest <[email protected]>

* Update docs/hub/datasets-image.md

Co-authored-by: Polina Kazakova <[email protected]>

* minor fix + rewrite about metadata location and file_name

* fix tree structure

* fix (hopefully the last one!)

---------

Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Mario Šaško <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Polina Kazakova <[email protected]>
5 people authored Dec 4, 2023
1 parent 680d07e commit 19fc552
Showing 5 changed files with 169 additions and 3 deletions.
2 changes: 2 additions & 0 deletions docs/hub/_toctree.yml
@@ -164,6 +164,8 @@
title: File names and splits
- local: datasets-manual-configuration
title: Manual Configuration
- local: datasets-image
title: Image Dataset
- local: spaces
title: Spaces
isExpanded: true
2 changes: 1 addition & 1 deletion docs/hub/datasets-data-files-configuration.md
@@ -25,5 +25,5 @@ And if your images/audio files have metadata (e.g. captions, bounding boxes, tra

We provide two guides that you can check out:

- [How to create an image dataset](https://huggingface.co/docs/datasets/image_dataset)
- [How to create an image dataset](./datasets-image)
- [How to create an audio dataset](https://huggingface.co/docs/datasets/audio_dataset)
2 changes: 1 addition & 1 deletion docs/hub/datasets-download-stats.md
@@ -4,5 +4,5 @@

The Hub provides download stats for all datasets loadable via the `datasets` library. To determine the number of downloads, the Hub counts every time `load_dataset` is called in Python, excluding Hugging Face's CI tooling on GitHub. No information is sent from the user, and no additional calls are made for this. The count is done server-side as we serve files for downloads. This means that:

* The download count is the same regardless of whether the data is directly stored on the Hub repo or if the repository has a script to load the data from an external source.
* The download count is the same regardless of whether the data is directly stored on the Hub repo or if the repository has a [script](https://huggingface.co/docs/datasets/dataset_script) to load the data from an external source.
* If a user manually downloads the data using tools like `wget` or the Hub's user interface (UI), those downloads will not be included in the download count.
164 changes: 164 additions & 0 deletions docs/hub/datasets-image.md
@@ -0,0 +1,164 @@
# Image Dataset

This guide will show you how to configure your dataset repository with image files. You can find accompanying example repositories in this [Image datasets examples collection](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65).

A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub. Any additional information about your dataset - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`).

## Only images

If your dataset consists of a single column of images, you can store your image files at the root of the repository:

```
my_dataset_repository/
├── 1.jpg
├── 2.jpg
├── 3.jpg
└── 4.jpg
```

or in a subdirectory:

```
my_dataset_repository/
└── images
├── 1.jpg
├── 2.jpg
├── 3.jpg
└── 4.jpg
```

Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including PNG, JPEG, TIFF and WebP.

```
my_dataset_repository/
└── images
├── 1.jpg
├── 2.png
├── 3.tiff
└── 4.webp
```
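
Once the repository is on the Hub, it can be loaded with the `datasets` library. Here is a minimal sketch, assuming a hypothetical repository id `username/my_dataset_repository`:

```python
from datasets import load_dataset

# The images are exposed as a single "image" column and decoded
# as PIL images on access; the default split is "train".
dataset = load_dataset("username/my_dataset_repository", split="train")
print(dataset[0]["image"])
```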

If you have several splits, you can put your images into directories named accordingly:

```
my_dataset_repository/
├── train
│   ├── 1.jpg
│   └── 2.jpg
└── test
├── 3.jpg
└── 4.jpg
```

See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits.
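
When the dataset is loaded, each directory maps to a split of the same name, for example (repository id hypothetical):

```python
from datasets import load_dataset

# The "train" and "test" directories become splits of the same name.
test_set = load_dataset("username/my_dataset_repository", split="test")
print(test_set.num_rows)  # 2 with the example layout above
```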

## Additional columns

If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection.

```
my_dataset_repository/
└── train
├── 1.jpg
├── 2.jpg
├── 3.jpg
├── 4.jpg
└── metadata.csv
```

Your `metadata.csv` file must have a `file_name` column that links image files to their metadata:

```csv
file_name,text
1.jpg,a drawing of a green pokemon with red eyes
2.jpg,a green and yellow toy with a red nose
3.jpg,a red and white ball with an angry look on its face
4.jpg,a cartoon ball with a smile on its face
```

You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`:

```jsonl
{"file_name": "1.jpg","text": "a drawing of a green pokemon with red eyes"}
{"file_name": "2.jpg","text": "a green and yellow toy with a red nose"}
{"file_name": "3.jpg","text": "a red and white ball with an angry look on its face"}
{"file_name": "4.jpg","text": "a cartoon ball with a smile on it's face"}
```
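
If you prefer to generate the metadata file programmatically, a short sketch like this one (reusing the file names and captions from the example above) writes a valid `metadata.jsonl`:

```python
import json

captions = {
    "1.jpg": "a drawing of a green pokemon with red eyes",
    "2.jpg": "a green and yellow toy with a red nose",
    "3.jpg": "a red and white ball with an angry look on its face",
    "4.jpg": "a cartoon ball with a smile on its face",
}

# One JSON object per line, each with the mandatory "file_name" key.
with open("train/metadata.jsonl", "w") as f:
    for file_name, text in captions.items():
        f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")
```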

## Relative paths

The metadata file must be located either in the same directory as the images it links to, or in any parent directory, as in this example:

```
my_dataset_repository/
└── train
├── images
│   ├── 1.jpg
│   ├── 2.jpg
│   ├── 3.jpg
│   └── 4.jpg
└── metadata.csv
```

In this case, the `file_name` column must contain the full relative path to the images, not just the file name:

```csv
file_name,text
images/1.jpg,a drawing of a green pokemon with red eyes
images/2.jpg,a green and yellow toy with a red nose
images/3.jpg,a red and white ball with an angry look on its face
images/4.jpg,a cartoon ball with a smile on its face
```

The metadata file cannot be placed in a subdirectory of the directory containing the images.

## Image classification

For image classification datasets, you can also use a simple setup: name the directories after the image classes. Store your image files in a structure like:

```
my_dataset_repository/
├── green
│   ├── 1.jpg
│   └── 2.jpg
└── red
├── 3.jpg
└── 4.jpg
```

The dataset created with this structure contains two columns: `image` and `label` (with values `green` and `red`).

You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information):

```
my_dataset_repository/
├── test
│   ├── green
│   │   └── 2.jpg
│   └── red
│   └── 4.jpg
└── train
├── green
│   └── 1.jpg
└── red
└── 3.jpg
```
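
When such a repository is loaded, the directory names are encoded as a `ClassLabel` feature. A quick check, with a hypothetical repository id:

```python
from datasets import load_dataset

dataset = load_dataset("username/my_dataset_repository", split="train")
print(dataset.features["label"])  # ClassLabel(names=['green', 'red'])
print(dataset[0]["label"])        # 0, i.e. "green"
```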

You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header:

```yaml
configs:
- config_name: default
drop_labels: true
```
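
With this configuration, the loaded dataset keeps only the `image` column; a quick way to verify (repository id hypothetical):

```python
from datasets import load_dataset

dataset = load_dataset("username/my_dataset_repository", split="train")
print(dataset.column_names)  # ['image'], no auto-generated "label" column
```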

## Parquet format

Instead of uploading the images and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file. This is useful if you have a large number of images, if you want to embed multiple image columns, or if you want to store additional information about the images in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

```
my_dataset_repository/
└── train.parquet
```
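
One way to produce such a file is to let the `datasets` library embed the image bytes for you. A minimal sketch (file names and captions here are placeholders):

```python
from datasets import Dataset, Image

# Build a dataset from image paths, cast the column so the files are
# decoded as images, and write a self-contained Parquet file.
dataset = Dataset.from_dict({
    "image": ["1.jpg", "2.jpg"],
    "text": ["a drawing of a green pokemon", "a green and yellow toy"],
}).cast_column("image", Image())
dataset.to_parquet("train.parquet")  # the image data is embedded in the file
```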

Note that, for user convenience, every dataset hosted on the Hub is automatically converted to Parquet format. Read more about it in the [Parquet format](./datasets-viewer#access-the-parquet-files) documentation.
2 changes: 1 addition & 1 deletion docs/hub/datasets-overview.md
@@ -4,7 +4,7 @@

The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a [Dataset Viewer](./datasets-viewer) to showcase the data.

Each dataset is a [Git repository](./repositories), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure will ensure that the dataset page on the Hub will have a Viewer.
Each dataset is a [Git repository](./repositories) that contains the data required to generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure will ensure that the dataset page on the Hub will have a Viewer.

## Search for datasets

