Commit
update documentation for image datasets (#1148)
* update documentation for image datasets
* Update docs/hub/datasets-image.md Co-authored-by: Pedro Cuenca
* Update docs/hub/datasets-image.md Co-authored-by: Pedro Cuenca
* Update docs/hub/datasets-image.md Co-authored-by: Pedro Cuenca
* Update docs/hub/datasets-image.md Co-authored-by: Pedro Cuenca
* Update docs/hub/datasets-image.md Co-authored-by: Pedro Cuenca
* remove the mention to the dataset scripts
* link to the datasets doc for the scripts
* grammarly
* Update docs/hub/datasets-image.md Co-authored-by: Mario Šaško
* Update docs/hub/datasets-image.md Co-authored-by: Mario Šaško
* Update docs/hub/datasets-image.md Co-authored-by: Mario Šaško
* use the tree cli to improve the clarity
* add a sequence about raw bytes in parquet
* don't tell that the image files must be in the same directory
* corrections by @polinaeterna
* Update docs/hub/datasets-image.md Co-authored-by: Quentin Lhoest
* Update docs/hub/datasets-image.md Co-authored-by: Quentin Lhoest
* Update docs/hub/datasets-image.md Co-authored-by: Quentin Lhoest
* Update docs/hub/datasets-image.md Co-authored-by: Quentin Lhoest
* Update docs/hub/datasets-image.md Co-authored-by: Quentin Lhoest
* Update docs/hub/datasets-image.md Co-authored-by: Quentin Lhoest
* Update docs/hub/datasets-image.md Co-authored-by: Quentin Lhoest
* Update docs/hub/datasets-image.md Co-authored-by: Quentin Lhoest
* Update docs/hub/datasets-image.md Co-authored-by: Polina Kazakova
* minor fix + rewrite about metadata location and file_name
* fix tree structure
* fix (hopefully the last one!)

---------

Co-authored-by: Pedro Cuenca
Co-authored-by: Mario Šaško
Co-authored-by: Quentin Lhoest
Co-authored-by: Polina Kazakova
1 parent 680d07e, commit 19fc552
Showing 5 changed files with 169 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,164 @@ | ||
# Image Dataset | ||
|
||
This guide will show you how to configure your dataset repository with image files. You can find accompanying examples of repositories in this [Image datasets examples collection](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65). | ||
|
||
A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub. Any additional information about your dataset - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`). | ||
|
||
## Only images | ||
|
||
If your dataset only consists of one column with images, you can simply store your image files at the root: | ||
|
||
``` | ||
my_dataset_repository/ | ||
├── 1.jpg | ||
├── 2.jpg | ||
├── 3.jpg | ||
└── 4.jpg | ||
``` | ||
|
||
or in a subdirectory: | ||
|
||
``` | ||
my_dataset_repository/ | ||
└── images | ||
├── 1.jpg | ||
├── 2.jpg | ||
├── 3.jpg | ||
└── 4.jpg | ||
``` | ||
|
||
Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including PNG, JPEG, TIFF and WebP. | ||
|
||
``` | ||
my_dataset_repository/ | ||
└── images | ||
├── 1.jpg | ||
├── 2.png | ||
├── 3.tiff | ||
└── 4.webp | ||
``` | ||
|
||
If you have several splits, you can put your images into directories named accordingly: | ||
|
||
``` | ||
my_dataset_repository/ | ||
├── train | ||
│ ├── 1.jpg | ||
│ └── 2.jpg | ||
└── test | ||
├── 3.jpg | ||
└── 4.jpg | ||
``` | ||
|
||
See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits. | ||
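
As a rough illustration of the directory-per-split layout above, the sketch below groups repository paths by their first path component. It is not the Hub's actual resolution logic, which supports more naming patterns (see the linked page). The function name `group_by_split` is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_split(paths):
    """Group image paths by their top-level directory (assumed to be the split name)."""
    splits = defaultdict(list)
    for path in map(PurePosixPath, paths):
        # Assumption: the first path component names the split, e.g. "train" or "test".
        splits[path.parts[0]].append(path.name)
    return dict(splits)


splits = group_by_split(["train/1.jpg", "train/2.jpg", "test/3.jpg", "test/4.jpg"])
print(splits)
```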
|
||
## Additional columns | ||
|
||
If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like image captioning or object detection. | ||
|
||
``` | ||
my_dataset_repository/ | ||
└── train | ||
├── 1.jpg | ||
├── 2.jpg | ||
├── 3.jpg | ||
├── 4.jpg | ||
└── metadata.csv | ||
``` | ||
|
||
Your `metadata.csv` file must have a `file_name` column which links image files with their metadata: | ||
|
||
```csv | ||
file_name,text | ||
1.jpg,a drawing of a green pokemon with red eyes | ||
2.jpg,a green and yellow toy with a red nose | ||
3.jpg,a red and white ball with an angry look on its face | ||
4.jpg,a cartoon ball with a smile on its face | ||
``` | ||
|
||
You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`: | ||
|
||
```jsonl | ||
{"file_name": "1.jpg","text": "a drawing of a green pokemon with red eyes"} | ||
{"file_name": "2.jpg","text": "a green and yellow toy with a red nose"} | ||
{"file_name": "3.jpg","text": "a red and white ball with an angry look on its face"} | ||
{"file_name": "4.jpg","text": "a cartoon ball with a smile on its face"} | ||
``` | ||
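
If your captions already live in a Python dictionary, a metadata file in either format can be generated with the standard library alone. The following is a minimal sketch: the helper `write_metadata` and its `fmt` parameter are hypothetical names, and the demo writes into a temporary directory rather than a real repository.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_metadata(directory: Path, captions: dict[str, str], fmt: str = "csv") -> Path:
    """Write metadata.csv or metadata.jsonl into `directory` and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    rows = [{"file_name": name, "text": text} for name, text in captions.items()]
    if fmt == "csv":
        path = directory / "metadata.csv"
        # newline="" lets the csv module control line endings and quoting.
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["file_name", "text"])
            writer.writeheader()
            writer.writerows(rows)
    elif fmt == "jsonl":
        path = directory / "metadata.jsonl"
        # JSON Lines: one JSON object per line.
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    else:
        raise ValueError(f"unsupported format: {fmt!r}")
    return path


captions = {
    "1.jpg": "a drawing of a green pokemon with red eyes",
    "2.jpg": "a green and yellow toy with a red nose",
}
with tempfile.TemporaryDirectory() as tmp:
    csv_lines = write_metadata(Path(tmp) / "train", captions).read_text(encoding="utf-8").splitlines()
    jsonl_lines = write_metadata(Path(tmp) / "train", captions, fmt="jsonl").read_text(encoding="utf-8").splitlines()
```

The demo writes both files into the same folder only for illustration; in a real repository you would keep one metadata file per image directory.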
|
||
## Relative paths | ||
|
||
The metadata file must be located either in the same directory as the images it links to, or in any parent directory, as in this example: | ||
|
||
``` | ||
my_dataset_repository/ | ||
└── train | ||
├── images | ||
│ ├── 1.jpg | ||
│ ├── 2.jpg | ||
│ ├── 3.jpg | ||
│ └── 4.jpg | ||
└── metadata.csv | ||
``` | ||
|
||
In this case, the `file_name` column must contain the full path to each image relative to the metadata file, not just the filename: | ||
|
||
```csv | ||
file_name,text | ||
images/1.jpg,a drawing of a green pokemon with red eyes | ||
images/2.jpg,a green and yellow toy with a red nose | ||
images/3.jpg,a red and white ball with an angry look on its face | ||
images/4.jpg,a cartoon ball with a smile on its face | ||
``` | ||
|
||
The metadata file cannot be placed in a subdirectory of the directory that contains the images. | ||
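
Before uploading, it can help to check locally that every `file_name` entry resolves to an existing file in or below the metadata file's directory. The sketch below is one way to do that with the standard library. The helper name `missing_images` is hypothetical, and the check mirrors the rules described in this section rather than the Hub's own validation.

```python
import csv
import tempfile
from pathlib import Path


def missing_images(metadata_csv: Path) -> list[str]:
    """Return file_name entries that do not resolve to a file at or below the metadata directory."""
    base = metadata_csv.parent.resolve()
    missing = []
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            target = (base / row["file_name"]).resolve()
            # Paths are interpreted relative to the directory holding metadata.csv.
            if not (target.is_file() and target.is_relative_to(base)):
                missing.append(row["file_name"])
    return missing


# Demo on a throwaway tree: train/images/{1,2}.jpg plus train/metadata.csv.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train"
    (train / "images").mkdir(parents=True)
    for name in ("1.jpg", "2.jpg"):
        (train / "images" / name).touch()
    (train / "metadata.csv").write_text(
        "file_name,text\nimages/1.jpg,first\nimages/2.jpg,second\n2.jpg,bare name\n",
        encoding="utf-8",
    )
    problems = missing_images(train / "metadata.csv")
```

Here the last row uses a bare filename, so it is reported: `train/2.jpg` does not exist, only `train/images/2.jpg` does.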
|
||
## Image classification | ||
|
||
For image classification datasets, you can also use a simple setup: use directories to name the image classes. Store your image files in a directory structure like: | ||
|
||
``` | ||
my_dataset_repository/ | ||
├── green | ||
│ ├── 1.jpg | ||
│ └── 2.jpg | ||
└── red | ||
├── 3.jpg | ||
└── 4.jpg | ||
``` | ||
|
||
The dataset created with this structure contains two columns: `image` and `label` (with values `green` and `red`). | ||
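
To make the mapping from directories to labels concrete, here is a small sketch that derives an `image`/`label` pair from each path. It only illustrates the convention and is not the Hub's implementation. The function name `label_rows` and the extension set are assumptions chosen for this example.

```python
from pathlib import PurePosixPath

# Assumption: a small subset of image extensions, for illustration only.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tiff", ".webp"}


def label_rows(paths):
    """Build one row per image, using the parent directory name as the label."""
    rows = []
    for path in map(PurePosixPath, paths):
        if path.suffix.lower() in IMAGE_EXTENSIONS:
            rows.append({"image": str(path), "label": path.parent.name})
    return rows


rows = label_rows(["green/1.jpg", "green/2.jpg", "red/3.jpg", "red/4.jpg", "README.md"])
labels = sorted({row["label"] for row in rows})
```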
|
||
You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information): | ||
|
||
``` | ||
my_dataset_repository/ | ||
├── test | ||
│ ├── green | ||
│ │ └── 2.jpg | ||
│ └── red | ||
│ └── 4.jpg | ||
└── train | ||
├── green | ||
│ └── 1.jpg | ||
└── red | ||
└── 3.jpg | ||
``` | ||
|
||
You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header: | ||
|
||
```yaml | ||
configs: | ||
- config_name: default | ||
drop_labels: true | ||
``` | ||
## Parquet format | ||
Instead of uploading the images and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file. This is useful if you have a large number of images, if you want to embed multiple image columns, or if you want to store additional information about the images in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV. | ||
``` | ||
my_dataset_repository/ | ||
└── train.parquet | ||
``` | ||
|
||
Note that, for user convenience, every dataset hosted on the Hub is automatically converted to Parquet format. Read more about it in the [Parquet format](./datasets-viewer#access-the-parquet-files) documentation. |