From 71e605d74024ce06affdd9b38290072dc9464ded Mon Sep 17 00:00:00 2001 From: Sylvain Lesage Date: Fri, 9 Feb 2024 11:19:18 +0000 Subject: [PATCH 1/8] fix broken links --- docs/hub/api.md | 2 +- docs/hub/datasets-cards.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/hub/api.md b/docs/hub/api.md index 405f07f64..41ba79934 100644 --- a/docs/hub/api.md +++ b/docs/hub/api.md @@ -258,7 +258,7 @@ This is equivalent to `huggingface_hub.whoami()`. Use Collections to group repositories from the Hub (Models, Datasets, Spaces and Papers) on a dedicated page. -You can learn more about it in the Collections [guide](./collections.md). Collections can also be managed using the Python client (see [guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/collections)). +You can learn more about it in the Collections [guide](./collections). Collections can also be managed using the Python client (see [guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/collections)). ### POST /api/collections diff --git a/docs/hub/datasets-cards.md b/docs/hub/datasets-cards.md index 834055afc..fcd0f13e6 100644 --- a/docs/hub/datasets-cards.md +++ b/docs/hub/datasets-cards.md @@ -4,7 +4,7 @@ Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used. -You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration.md) options. Tags are defined in a YAML metadata section at the top of the `README.md` file. +You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration) options. Tags are defined in a YAML metadata section at the top of the `README.md` file. ## Dataset card metadata From f703503b92dd4b7151d6db2f11c0567507e6b1d5 Mon Sep 17 00:00:00 2001 From: Sylvain Lesage Date: Fri, 9 Feb 2024 11:23:15 +0000 Subject: [PATCH 2/8] remove redundant h2 title --- docs/hub/datasets-manual-configuration.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/hub/datasets-manual-configuration.md b/docs/hub/datasets-manual-configuration.md index 28586cd7f..678e705a4 100644 --- a/docs/hub/datasets-manual-configuration.md +++ b/docs/hub/datasets-manual-configuration.md @@ -6,8 +6,6 @@ A dataset with a supported structure and [file formats](./datasets-adding#file-f It is also possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files). -## Define your splits and subsets in YAML - ## Splits If you have multiple files and want to define which file goes into which split, you can use YAML at the top of your README.md. 
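The `## Splits` section above ends by pointing at a YAML block at the top of `README.md`. As a minimal sketch of what that block looks like (the `configs` syntax follows the Hub's manual configuration format; the `train.csv` and `test.csv` file names are illustrative):

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.csv"
  - split: test
    path: "test.csv"
---
```

Each `split` entry maps a split name to the data files that belong to it, so the Dataset Viewer and the `datasets` library can resolve the splits without guessing from file names.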
From b69919331ffa6f10c10857f6e6955d0f16972706 Mon Sep 17 00:00:00 2001 From: Sylvain Lesage Date: Fri, 9 Feb 2024 11:27:28 +0000 Subject: [PATCH 3/8] add splits gif --- docs/hub/datasets-data-files-configuration.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md index 6675202aa..d31c9198e 100644 --- a/docs/hub/datasets-data-files-configuration.md +++ b/docs/hub/datasets-data-files-configuration.md @@ -5,6 +5,12 @@ There are no constraints on how to structure dataset repositories. However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`. +## What are splits and configurations? + +Machine learning datasets typically have splits and may also have configurations. A _split_ is a subset of the dataset, like `train` and `test`, that are used during different stages of training and evaluating a model. A _configuration_ is a sub-dataset contained within a larger dataset. Configurations are especially common in multilingual speech datasets where there may be a different configuration for each language. If you're interested in learning more about splits and configurations, check out the [conceptual guide on "Splits and configurations"](https://huggingface.co/docs/datasets-server/configs_and_splits)! + +![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif) + ## File names and splits To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation. From 353d9ca50517f4fa5fc5997bd17b7a9f3341e01d Mon Sep 17 00:00:00 2001 From: Sylvain Lesage Date: Fri, 9 Feb 2024 11:35:43 +0000 Subject: [PATCH 4/8] add pros of Parquet format --- docs/hub/datasets-viewer.md | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/docs/hub/datasets-viewer.md b/docs/hub/datasets-viewer.md index ddb6aede9..cb68ddf3d 100644 --- a/docs/hub/datasets-viewer.md +++ b/docs/hub/datasets-viewer.md @@ -37,6 +37,21 @@ In this case, an informational message lets you know that the Viewer is partial. To power the dataset viewer, the first 5GB of every dataset are auto-converted to the Parquet format (unless it was already a Parquet dataset). In the dataset viewer (for example, see [`datasets/glue`](https://huggingface.co/datasets/glue)), you can click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Please, refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset parquet files with libraries such as Polars, Pandas or DuckDB. + + +Parquet is a columnar storage format optimized for querying and processing large datasets. Parquet is a popular choice for big data processing and analytics and is widely used for data processing and machine learning. + +Its structure allows for efficient data reading and querying: +
+
+- only the necessary columns are read from disk (projection pushdown); no need to read the entire file. This reduces the memory requirement for working with Parquet data.
+- entire row groups are skipped if the statistics stored in its metadata do not match the data of interest (automatic filtering).
+- the data is compressed, which reduces the amount of data that needs to be stored and transferred.
+
+You can learn more about the advantages associated with this format in the documentation.
+
+ You can also access the list of Parquet files programmatically using the [Hub API](./api#get-apidatasetsrepoidparquet); for example, endpoint [`https://huggingface.co/api/datasets/glue/parquet`](https://huggingface.co/api/datasets/glue/parquet) lists the parquet files of the glue dataset. ## Dataset preview From f4589765eee613d62b6432236af2399549ddec36 Mon Sep 17 00:00:00 2001 From: Sylvain Lesage Date: Fri, 9 Feb 2024 17:50:37 +0100 Subject: [PATCH 5/8] Update docs/hub/datasets-data-files-configuration.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/hub/datasets-data-files-configuration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md index d31c9198e..ac09bb9f2 100644 --- a/docs/hub/datasets-data-files-configuration.md +++ b/docs/hub/datasets-data-files-configuration.md @@ -7,7 +7,7 @@ Often it is as simple as naming your data files according to their split names, ## What are splits and configurations? -Machine learning datasets typically have splits and may also have configurations. A _split_ is a subset of the dataset, like `train` and `test`, that are used during different stages of training and evaluating a model. A _configuration_ is a sub-dataset contained within a larger dataset. Configurations are especially common in multilingual speech datasets where there may be a different configuration for each language. If you're interested in learning more about splits and configurations, check out the [conceptual guide on "Splits and configurations"](https://huggingface.co/docs/datasets-server/configs_and_splits)! +Machine learning datasets typically have splits and may also have configurations. A _split_ is a subset of the dataset, like `train` and `test`, that are used during different stages of training and evaluating a model. A _configuration_ is a sub-dataset contained within a larger dataset. Configurations are especially common in multilingual speech datasets where there may be a different configuration for each language. If you're interested in learning more about splits and configurations, check out the [Splits and configurations](https://huggingface.co/docs/datasets-server/configs_and_splits) guide! ![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif) From 1ae4b730d8a9cd7ca27ccf66582d4e43ba4db73d Mon Sep 17 00:00:00 2001 From: Sylvain Lesage Date: Mon, 12 Feb 2024 10:42:21 +0100 Subject: [PATCH 6/8] Update docs/hub/datasets-data-files-configuration.md Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> --- docs/hub/datasets-data-files-configuration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md index ac09bb9f2..20c5b3055 100644 --- a/docs/hub/datasets-data-files-configuration.md +++ b/docs/hub/datasets-data-files-configuration.md @@ -7,7 +7,7 @@ Often it is as simple as naming your data files according to their split names, ## What are splits and configurations? -Machine learning datasets typically have splits and may also have configurations. A _split_ is a subset of the dataset, like `train` and `test`, that are used during different stages of training and evaluating a model. A _configuration_ is a sub-dataset contained within a larger dataset. 
Configurations are especially common in multilingual speech datasets where there may be a different configuration for each language. If you're interested in learning more about splits and configurations, check out the [Splits and configurations](https://huggingface.co/docs/datasets-server/configs_and_splits) guide! +Machine learning datasets typically have splits and may also have configurations. A dataset is generally made of _splits_ (e.g. `train` and `test`) that are used during different stages of training and evaluating a model. A _configuration_ is a sub-dataset contained within a larger dataset. Configurations are especially common in multilingual speech datasets where there may be a different configuration for each language. If you're interested in learning more about splits and configurations, check out the [Splits and configurations](https://huggingface.co/docs/datasets-server/configs_and_splits) guide! ![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif) From df8cd1141dba102e3b72a71c8b973fa78435ac20 Mon Sep 17 00:00:00 2001 From: Sylvain Lesage Date: Mon, 12 Feb 2024 10:32:54 +0000 Subject: [PATCH 7/8] reduce the length of the Tip --- docs/hub/datasets-viewer.md | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/docs/hub/datasets-viewer.md b/docs/hub/datasets-viewer.md index cb68ddf3d..06c0f93a0 100644 --- a/docs/hub/datasets-viewer.md +++ b/docs/hub/datasets-viewer.md @@ -39,16 +39,7 @@ To power the dataset viewer, the first 5GB of every dataset are auto-converted t -Parquet is a columnar storage format optimized for querying and processing large datasets. Parquet is a popular choice for big data processing and analytics and is widely used for data processing and machine learning. - -Its structure allows for efficient data reading and querying: -
-
-- only the necessary columns are read from disk (projection pushdown); no need to read the entire file. This reduces the memory requirement for working with Parquet data.
-- entire row groups are skipped if the statistics stored in its metadata do not match the data of interest (automatic filtering).
-- the data is compressed, which reduces the amount of data that needs to be stored and transferred.
-
-You can learn more about the advantages associated with this format in the documentation.
+Parquet is a columnar storage format optimized for querying and processing large datasets. Parquet is a popular choice for big data processing and analytics and is widely used for data processing and machine learning. You can learn more about the advantages associated with this format in the documentation.
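The shortened Tip above condenses three mechanics (projection pushdown, row-group skipping via statistics, compression) that can be observed directly. A minimal sketch with `pyarrow`, assuming a local `data.parquet` file; the docs mention Polars, Pandas and DuckDB as alternatives:

```python
import pyarrow.parquet as pq

# projection pushdown: read only the needed columns, not the entire file
table = pq.read_table("data.parquet", columns=["label"])

# row-group statistics: the min/max values stored in the file metadata let
# readers skip row groups that cannot match a filter (automatic filtering)
parquet_file = pq.ParquetFile("data.parquet")
stats = parquet_file.metadata.row_group(0).column(0).statistics
print(stats.min, stats.max)

# the data is compressed on disk; the codec is recorded per column chunk
print(parquet_file.metadata.row_group(0).column(0).compression)
```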
From 1a6c1a4a5f131e804bfa0d1a5e6f26489ec3b3f7 Mon Sep 17 00:00:00 2001 From: Sylvain Lesage Date: Mon, 12 Feb 2024 10:45:09 +0000 Subject: [PATCH 8/8] add parquet-converter profile image --- docs/hub/datasets-viewer.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/docs/hub/datasets-viewer.md b/docs/hub/datasets-viewer.md index 06c0f93a0..f3832135f 100644 --- a/docs/hub/datasets-viewer.md +++ b/docs/hub/datasets-viewer.md @@ -43,6 +43,17 @@ Parquet is a columnar storage format optimized for querying and processing large +### Conversion bot + +When you create a new dataset, the [`parquet-converter` bot](https://huggingface.co/parquet-converter) notifies you once it converts the dataset to Parquet. The [discussion](./repositories-pull-requests-discussions) it opens in the repository provides details about the Parquet format and links to the Parquet files. + +
+<!-- parquet-converter profile image -->
+ +### Programmatic access + You can also access the list of Parquet files programmatically using the [Hub API](./api#get-apidatasetsrepoidparquet); for example, endpoint [`https://huggingface.co/api/datasets/glue/parquet`](https://huggingface.co/api/datasets/glue/parquet) lists the parquet files of the glue dataset. ## Dataset preview
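As a usage sketch for the endpoint above; the response shape (config names mapping to split names mapping to Parquet file URLs) is an assumption to check against the Hub API reference:

```python
import requests

# list the auto-converted Parquet files of the glue dataset
response = requests.get("https://huggingface.co/api/datasets/glue/parquet")
response.raise_for_status()
files = response.json()  # assumed shape: {config: {split: [url, ...]}}

for config, splits in files.items():
    for split, urls in splits.items():
        print(config, split, len(urls), "parquet file(s)")
```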