Show image that introduces concept of splits/configs #1215

Merged
merged 9 commits into from Feb 12, 2024
Changes from 5 commits
2 changes: 1 addition & 1 deletion docs/hub/api.md
@@ -258,7 +258,7 @@ This is equivalent to `huggingface_hub.whoami()`.

Use Collections to group repositories from the Hub (Models, Datasets, Spaces and Papers) on a dedicated page.

You can learn more about it in the Collections [guide](./collections.md). Collections can also be managed using the Python client (see [guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/collections)).
You can learn more about it in the Collections [guide](./collections). Collections can also be managed using the Python client (see [guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/collections)).

### POST /api/collections

2 changes: 1 addition & 1 deletion docs/hub/datasets-cards.md
@@ -4,7 +4,7 @@

Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.

You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration.md) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.
You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.

## Dataset card metadata

6 changes: 6 additions & 0 deletions docs/hub/datasets-data-files-configuration.md
@@ -5,6 +5,12 @@ There are no constraints on how to structure dataset repositories.
However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly.
Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`.
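For instance, a minimal sketch of such a repository (file names are illustrative) would give the dataset `train` and `test` splits:

```
my_dataset_repository/
├── README.md
├── train.csv
└── test.csv
```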

## What are splits and configurations?

Machine learning datasets typically have splits and may also have configurations. A _split_ is a subset of the dataset, like `train` and `test`, used during different stages of training and evaluating a model. A _configuration_ is a sub-dataset contained within a larger dataset. Configurations are especially common in multilingual speech datasets, where there may be a different configuration for each language. If you're interested in learning more about splits and configurations, check out the [Splits and configurations](https://huggingface.co/docs/datasets-server/configs_and_splits) guide!
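As a quick, hypothetical illustration with the 🤗 Datasets library (the repository id and configuration name below are placeholders, not a real dataset), a configuration is selected by name and a split with the `split` argument:

```python
from datasets import load_dataset

# Load the "fr" configuration of a hypothetical multilingual dataset,
# keeping only its "train" split.
ds = load_dataset("my-org/multilingual-speech", "fr", split="train")
print(ds)
```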

![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)

## File names and splits

To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation.
2 changes: 0 additions & 2 deletions docs/hub/datasets-manual-configuration.md
@@ -6,8 +6,6 @@ A dataset with a supported structure and [file formats](./datasets-adding#file-f

It is also possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files).

## Define your splits and subsets in YAML

## Splits

If you have multiple files and want to define which file goes into which split, you can use YAML at the top of your README.md.
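As a rough sketch (the file names are illustrative; the exact fields are covered in the rest of this page), the YAML block at the top of `README.md` could look like:

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.csv"
  - split: test
    path: "test.csv"
---
```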
15 changes: 15 additions & 0 deletions docs/hub/datasets-viewer.md
@@ -37,6 +37,21 @@ In this case, an informational message lets you know that the Viewer is partial.

To power the dataset viewer, the first 5GB of every dataset are auto-converted to the Parquet format (unless it was already a Parquet dataset). In the dataset viewer (for example, see [`datasets/glue`](https://huggingface.co/datasets/glue)), you can click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Please refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset Parquet files with libraries such as Polars, Pandas, or DuckDB.

<Tip>

Parquet is a columnar storage format optimized for querying and processing large datasets. It is a popular choice for big data processing and analytics, and is widely used in machine learning.

Its structure allows for efficient data reading and querying:
<ul>
<li>only the necessary columns are read from disk (projection pushdown); there is no need to read the entire file, which reduces the memory required to work with Parquet data.</li>
<li>entire row groups are skipped if the statistics stored in their metadata do not match the data of interest (automatic filtering).</li>
<li>the data is compressed, which reduces the amount of data that needs to be stored and transferred.</li>
</ul>

You can learn more about the advantages associated with this format in the <a href="https://huggingface.co/docs/datasets-server/parquet">documentation</a>.

</Tip>
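As a small, hypothetical sketch of projection pushdown (the file URL and column name below are placeholders), Pandas can materialize just one column of a Parquet file instead of the whole table:

```python
import pandas as pd

# Only the "sentence" column ends up in the DataFrame;
# the URL and column name are placeholders for a real Parquet file.
df = pd.read_parquet(
    "https://huggingface.co/datasets/.../file.parquet",
    columns=["sentence"],
)
print(df.head())
```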

You can also access the list of Parquet files programmatically using the [Hub API](./api#get-apidatasetsrepoidparquet); for example, the endpoint [`https://huggingface.co/api/datasets/glue/parquet`](https://huggingface.co/api/datasets/glue/parquet) lists the Parquet files of the `glue` dataset.
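A minimal sketch of calling that endpoint with `requests` (the exact JSON layout may differ, but it is expected to map each configuration to its splits and their Parquet file URLs):

```python
import requests

# List the auto-converted Parquet files for the glue dataset.
response = requests.get("https://huggingface.co/api/datasets/glue/parquet")
response.raise_for_status()

# Expected shape: {config: {split: [parquet_file_urls, ...], ...}, ...}
parquet_files = response.json()
print(parquet_files)
```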

## Dataset preview