Show image that introduces concept of splits/configs #1215

Merged
merged 9 commits into from Feb 12, 2024
Changes from 5 commits
2 changes: 1 addition & 1 deletion docs/hub/api.md
@@ -258,7 +258,7 @@ This is equivalent to `huggingface_hub.whoami()`.

Use Collections to group repositories from the Hub (Models, Datasets, Spaces and Papers) on a dedicated page.

You can learn more about it in the Collections [guide](./collections.md). Collections can also be managed using the Python client (see [guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/collections)).
You can learn more about it in the Collections [guide](./collections). Collections can also be managed using the Python client (see [guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/collections)).

### POST /api/collections

2 changes: 1 addition & 1 deletion docs/hub/datasets-cards.md
@@ -4,7 +4,7 @@

Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.

You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration.md) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.
You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.

## Dataset card metadata

6 changes: 6 additions & 0 deletions docs/hub/datasets-data-files-configuration.md
@@ -5,6 +5,12 @@ There are no constraints on how to structure dataset repositories.
However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly.
Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`.
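For instance, a minimal sketch of such a repository (file names are illustrative) would give the dataset `train` and `test` splits:

```
my_dataset_repository/
├── README.md
├── train.csv
└── test.csv
```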

## What are splits and configurations?

Machine learning datasets typically have splits and may also have configurations. A _split_ is a subset of the dataset, like `train` and `test`, used during different stages of training and evaluating a model. A _configuration_ is a sub-dataset contained within a larger dataset. Configurations are especially common in multilingual speech datasets, where there may be a different configuration for each language. If you're interested in learning more about splits and configurations, check out the [Splits and configurations](https://huggingface.co/docs/datasets-server/configs_and_splits) guide!
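As a quick, hypothetical illustration with the 🤗 Datasets library (the repository id and configuration name below are placeholders, not a real dataset), a configuration is selected by name and a split with the `split` argument:

```python
from datasets import load_dataset

# Load the "fr" configuration of a hypothetical multilingual dataset,
# keeping only its "train" split.
ds = load_dataset("my-org/multilingual-speech", "fr", split="train")
print(ds)
```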

![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)

## File names and splits

To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation.
2 changes: 0 additions & 2 deletions docs/hub/datasets-manual-configuration.md
@@ -6,8 +6,6 @@ A dataset with a supported structure and [file formats](./datasets-adding#file-f

It is also possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files).

## Define your splits and subsets in YAML

## Splits

If you have multiple files and want to define which file goes into which split, you can use YAML at the top of your README.md.
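As a rough sketch (the file names are illustrative; the exact fields are covered in the rest of this page), the YAML block at the top of `README.md` could look like:

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.csv"
  - split: test
    path: "test.csv"
---
```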
15 changes: 15 additions & 0 deletions docs/hub/datasets-viewer.md
@@ -37,6 +37,21 @@ In this case, an informational message lets you know that the Viewer is partial.

To power the dataset viewer, the first 5GB of every dataset are auto-converted to the Parquet format (unless it was already a Parquet dataset). In the dataset viewer (for example, see [`datasets/glue`](https://huggingface.co/datasets/glue)), you can click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Please refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset Parquet files with libraries such as Polars, Pandas, or DuckDB.

<Tip>

Parquet is a columnar storage format optimized for querying and processing large datasets. It is a popular choice for big data processing and analytics, and is widely used in machine learning.

Its structure allows for efficient data reading and querying:
<ul>
<li>only the necessary columns are read from disk (projection pushdown); there is no need to read the entire file, which reduces the memory required to work with Parquet data.</li>
<li>entire row groups are skipped if the statistics stored in their metadata do not match the data of interest (automatic filtering).</li>
<li>the data is compressed, which reduces the amount of data that needs to be stored and transferred.</li>
</ul>

You can learn more about the advantages associated with this format in the <a href="https://huggingface.co/docs/datasets-server/parquet">documentation</a>.

</Tip>
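As a small, hypothetical sketch of projection pushdown (the file URL and column name below are placeholders), Pandas can materialize just one column of a Parquet file instead of the whole table:

```python
import pandas as pd

# Only the "sentence" column ends up in the DataFrame;
# the URL and column name are placeholders for a real Parquet file.
df = pd.read_parquet(
    "https://huggingface.co/datasets/.../file.parquet",
    columns=["sentence"],
)
print(df.head())
```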

You can also access the list of Parquet files programmatically using the [Hub API](./api#get-apidatasetsrepoidparquet); for example, the endpoint [`https://huggingface.co/api/datasets/glue/parquet`](https://huggingface.co/api/datasets/glue/parquet) lists the Parquet files of the `glue` dataset.
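A minimal sketch of calling that endpoint with `requests` (the exact JSON layout may differ, but it is expected to map each configuration to its splits and their Parquet file URLs):

```python
import requests

# List the auto-converted Parquet files for the glue dataset.
response = requests.get("https://huggingface.co/api/datasets/glue/parquet")
response.raise_for_status()

# Expected shape: {config: {split: [parquet_file_urls, ...], ...}, ...}
parquet_files = response.json()
print(parquet_files)
```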

## Dataset preview