Show image that introduces concept of splits/configs (#1215)
* fix broken links

* remove redundant h2 title

* add splits gif

* add pros of Parquet format

* Update docs/hub/datasets-data-files-configuration.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/hub/datasets-data-files-configuration.md

Co-authored-by: Quentin Lhoest <[email protected]>

* reduce the length of the Tip

* add parquet-converter profile image

---------

Co-authored-by: Steven Liu <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
3 people authored Feb 12, 2024
1 parent 141b461 commit af6134e
Showing 5 changed files with 25 additions and 4 deletions.
docs/hub/api.md: 1 addition & 1 deletion
@@ -258,7 +258,7 @@ This is equivalent to `huggingface_hub.whoami()`.

Use Collections to group repositories from the Hub (Models, Datasets, Spaces and Papers) on a dedicated page.

- You can learn more about it in the Collections [guide](./collections.md). Collections can also be managed using the Python client (see [guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/collections)).
+ You can learn more about it in the Collections [guide](./collections). Collections can also be managed using the Python client (see [guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/collections)).
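For instance, creating a collection and adding a dataset to it from the Python client looks roughly like this minimal sketch (the collection title and item id are illustrative; `create_collection` and `add_collection_item` come from `huggingface_hub`):

```python
from huggingface_hub import create_collection, add_collection_item

# Create a collection (requires an authenticated token, e.g. via `huggingface-cli login`)
collection = create_collection(
    title="Speech datasets",  # illustrative title
    description="Datasets I use for ASR experiments",
)

# Add an existing repo to the collection; item_type can be
# "model", "dataset", "space", or "paper"
add_collection_item(
    collection.slug,
    item_id="mozilla-foundation/common_voice_11_0",  # illustrative item
    item_type="dataset",
)
```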

### POST /api/collections

docs/hub/datasets-cards.md: 1 addition & 1 deletion
@@ -4,7 +4,7 @@

Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.

- You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration.md) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.
+ You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.

## Dataset card metadata

docs/hub/datasets-data-files-configuration.md: 6 additions & 0 deletions
@@ -5,6 +5,12 @@ There are no constraints on how to structure dataset repositories.
However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly.
Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`.
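As a quick sketch of what that buys you: with `train.csv` and `test.csv` at the root of a (hypothetical) repository, the `datasets` library resolves both splits automatically:

```python
from datasets import load_dataset

# Both splits are inferred from the file names train.csv / test.csv
ds = load_dataset("username/my-dataset")  # hypothetical repo id
print(ds)
# DatasetDict({
#     train: Dataset(...),
#     test: Dataset(...)
# })
```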

## What are splits and configurations?

Machine learning datasets typically have splits and may also have configurations. A dataset is generally made of _splits_ (e.g. `train` and `test`) that are used at different stages of model training and evaluation. A _configuration_ is a sub-dataset contained within a larger dataset. Configurations are especially common in multilingual speech datasets, where there may be a different configuration for each language. If you're interested in learning more about splits and configurations, check out the [Splits and configurations](https://huggingface.co/docs/datasets-server/configs_and_splits) guide!

![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)
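To make the distinction concrete, here is a hedged sketch of loading a single configuration and split with the `datasets` library (the repository id and the `fr` configuration are hypothetical):

```python
from datasets import get_dataset_config_names, load_dataset

repo_id = "username/multilingual-speech"  # hypothetical multilingual dataset

# List the available configurations, e.g. one per language
print(get_dataset_config_names(repo_id))  # e.g. ['en', 'fr', ...]

# Load only the train split of the French configuration
ds = load_dataset(repo_id, "fr", split="train")
```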

## File names and splits

To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation.
docs/hub/datasets-manual-configuration.md: 0 additions & 2 deletions
@@ -6,8 +6,6 @@ A dataset with a supported structure and [file formats](./datasets-adding#file-f

It is also possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files).

- ## Define your splits and subsets in YAML

## Splits

If you have multiple files and want to define which file goes into which split, you can use YAML at the top of your `README.md`.
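As a hedged sketch of one way to do this from Python (the repo id and file paths are hypothetical; the `configs` mapping follows the YAML syntax this page describes), you can write the YAML block into the dataset card and push it with `huggingface_hub`:

```python
import io
from huggingface_hub import upload_file

# Dataset card whose YAML block assigns each file to a split
readme = """---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.csv
  - split: test
    path: data/test.csv
---
# My dataset
"""

upload_file(
    path_or_fileobj=io.BytesIO(readme.encode()),
    path_in_repo="README.md",
    repo_id="username/my-dataset",  # hypothetical dataset repo
    repo_type="dataset",
)
```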
docs/hub/datasets-viewer.md: 17 additions & 0 deletions
@@ -37,6 +37,23 @@ In this case, an informational message lets you know that the Viewer is partial.

To power the dataset viewer, the first 5GB of every dataset are auto-converted to the Parquet format (unless the dataset is already in Parquet). In the dataset viewer (for example, see [`datasets/glue`](https://huggingface.co/datasets/glue)), you can click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Please refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset's Parquet files with libraries such as Polars, Pandas, or DuckDB.

<Tip>

Parquet is a columnar storage format optimized for querying and processing large datasets. It is a popular choice for big data analytics and is widely used in data processing and machine learning. You can learn more about its advantages in the <a href="https://huggingface.co/docs/datasets-server/parquet">documentation</a>.

</Tip>
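As a small illustration of what the columnar layout enables (file name and column names below are hypothetical):

```python
import pandas as pd

# Parquet stores data column by column, so a reader can fetch only
# the columns a query needs instead of scanning entire rows
df = pd.read_parquet("data.parquet", columns=["sentence", "label"])
```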

### Conversion bot

When you create a new dataset, the [`parquet-converter` bot](https://huggingface.co/parquet-converter) notifies you once it has converted the dataset to Parquet. The [discussion](./repositories-pull-requests-discussions) it opens in the repository provides details about the Parquet format and links to the Parquet files.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parquet-converter-profile-light.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/parquet-converter-profile-dark.png"/>
</div>

### Programmatic access

You can also access the list of Parquet files programmatically using the [Hub API](./api#get-apidatasetsrepoidparquet); for example, the endpoint [`https://huggingface.co/api/datasets/glue/parquet`](https://huggingface.co/api/datasets/glue/parquet) lists the Parquet files of the `glue` dataset.
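A hedged sketch of that flow, assuming the endpoint's response maps each config to its splits and their Parquet file URLs (as for `glue` below):

```python
import pandas as pd
import requests

# List the auto-converted Parquet files of the glue dataset
files = requests.get("https://huggingface.co/api/datasets/glue/parquet").json()

# Assumed response shape: {config: {split: [file URLs]}}
url = files["cola"]["train"][0]

# pandas can read Parquet straight from a URL (remote reads need fsspec installed)
df = pd.read_parquet(url)
print(df.head())
```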

## Dataset preview
