diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md
index b56f5f340..88ddd7767 100644
--- a/docs/hub/datasets-adding.md
+++ b/docs/hub/datasets-adding.md
@@ -94,6 +94,10 @@ Image and audio files can also have additional metadata files, see the [Data fil
 
 You may want to convert your files to these formats to benefit from all the Hub features. Other formats and structures may not be recognized by the Hub.
 
+### Which file format should I use?
+
+For most types of datasets, Parquet is the recommended format because of its efficient compression and rich typing, and because a wide variety of tools support it with optimized reads and batched operations. Alternatively, CSV or JSON Lines/JSON can be used for tabular data (prefer JSON Lines for nested data). Although these formats are easier to parse than Parquet, they are not recommended for datasets larger than several GBs. For image and audio datasets, uploading raw files is the most practical option for most use cases, since it makes individual files easy to access. For streaming large-scale image and audio datasets, [WebDataset](https://github.com/webdataset/webdataset) should be preferred over raw image and audio files to avoid the overhead of accessing individual files. For more general use cases involving analytics, data filtering, or metadata parsing, however, Parquet is the recommended option for large-scale image and audio datasets.
+
 ### Dataset Viewer
 
 The [Dataset Viewer](./datasets-viewer) is useful to know how the data actually looks like before you download it.