Add section to explain which format to use for dataset files (#1239)
* Add section to explain which format to use for dataset files

* Update docs/hub/datasets-adding.md

Co-authored-by: Quentin Lhoest <[email protected]>

mariosasko and lhoestq authored Mar 13, 2024
1 parent f243599 commit f49e966
Showing 1 changed file with 4 additions and 0 deletions.
4 changes: 4 additions & 0 deletions docs/hub/datasets-adding.md
@@ -94,6 +94,10 @@ Image and audio files can also have additional metadata files, see the [Data fil
You may want to convert your files to these formats to benefit from all the Hub features.
Other formats and structures may not be recognized by the Hub.

### Which file format should I use?

For most types of datasets, Parquet is the recommended format because of its efficient compression, rich typing, and broad tool support with optimized reads and batched operations. Alternatively, CSV or JSON Lines/JSON can be used for tabular data (prefer JSON Lines for nested data). Although these formats are easier to parse than Parquet, they are not recommended for data larger than several GBs. For image and audio datasets, uploading raw files is the most practical choice for most use cases, since individual files are easy to access. When streaming large-scale image and audio datasets, [WebDataset](https://github.com/webdataset/webdataset) should be preferred over raw image and audio files to avoid the overhead of accessing individual files. However, for more general use cases involving analytics, data filtering, or metadata parsing, Parquet is the recommended option for large-scale image and audio datasets.
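
As a rough illustration of the JSON Lines option for nested data, the sketch below writes one JSON object per line and reads the file back one record at a time, using only the Python standard library. The records, field names, and the `train.jsonl` file name are hypothetical and chosen only for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records with a nested "meta" field (illustrative only).
records = [
    {"id": 1, "text": "hello", "meta": {"lang": "en", "tags": ["greeting"]}},
    {"id": 2, "text": "bonjour", "meta": {"lang": "fr", "tags": []}},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Yield records one at a time, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Write to a temporary directory so the example leaves no files behind
# in the working directory.
out_dir = Path(tempfile.mkdtemp())
jsonl_path = out_dir / "train.jsonl"
write_jsonl(jsonl_path, records)

# Nested dicts and lists survive the round trip unchanged.
loaded = list(read_jsonl(jsonl_path))
```

If pandas and pyarrow happen to be installed, one possible way to convert such a file to Parquet is `pandas.read_json(jsonl_path, lines=True).to_parquet("train.parquet")`; this is only one of several tools that can do the conversion.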

### Dataset Viewer

The [Dataset Viewer](./datasets-viewer) lets you see what the data actually looks like before you download it.