Documenting FiftyOne integration (#1302)
* documenting FiftyOne integration

* removing autocomplete

* minor enhancements

* change py --> python

* add colab notebook

* add collections
jacobmarks authored Jun 4, 2024
1 parent b8ca768 commit 02806fc
Showing 3 changed files with 152 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/hub/_toctree.yml
@@ -176,6 +176,8 @@
title: Combine datasets and export
- local: datasets-duckdb-vector-similarity-search
title: Perform vector similarity search
- local: datasets-fiftyone
title: FiftyOne
- local: datasets-pandas
title: Pandas
- local: datasets-webdataset
149 changes: 149 additions & 0 deletions docs/hub/datasets-fiftyone.md
@@ -0,0 +1,149 @@
# FiftyOne

FiftyOne is the leading open-source toolkit for curating, visualizing, and
managing unstructured visual data. The library streamlines data-centric
workflows, from finding low-confidence predictions to identifying poor-quality
samples and uncovering hidden patterns in your data. It supports all sorts of
visual data, from images and videos to PDFs, point clouds, and meshes.

Whereas tabular data formats like pandas DataFrames or Parquet files consist
of rows and columns, FiftyOne datasets are considerably more flexible,
accommodating object detections, keypoints, polylines, and other label types,
as well as custom schemas.

FiftyOne is integrated with the Hugging Face Hub, so you can load and share
FiftyOne datasets directly from the Hub.

🚀 Try the FiftyOne 🤝 Hugging Face Integration in [Colab](https://colab.research.google.com/drive/1l0kzfbJ2wtUw1EGS1tq1PJYoWenMlihp?usp=sharing)!

## Prerequisites

First [login with your Hugging Face account](../huggingface_hub/quick-start#login):

```bash
huggingface-cli login
```

Make sure you have `fiftyone>=0.24.0` installed:

```bash
pip install -U fiftyone
```
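
If you're not sure which version you have, a quick check from Python (assuming a standard install) is:

```python
import fiftyone as fo

print(fo.__version__)  ## should be 0.24.0 or later
```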

## Loading Visual Datasets from the Hub

With `load_from_hub()` from FiftyOne's Hugging Face utils, you can load:

- Any FiftyOne dataset uploaded to the Hub
- Most image-based datasets stored in Parquet files (the standard format for datasets uploaded to the Hub via the `datasets` library)

### Loading FiftyOne datasets from the Hub

Any dataset pushed to the Hub in one of FiftyOne’s [supported common formats](https://docs.voxel51.com/user_guide/dataset_creation/datasets.html#supported-import-formats)
should have all of the necessary configuration info in its dataset repo on the
Hub, so you can load it simply by specifying its `repo_id`. For example, to
load the [VisDrone detection dataset](https://huggingface.co/datasets/Voxel51/VisDrone2019-DET),
all you need is:

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

## load from the hub
dataset = load_from_hub("Voxel51/VisDrone2019-DET")

## visualize in app
session = fo.launch_app(dataset)
```

![FiftyOne VisDrone dataset](https://cdn-uploads.huggingface.co/production/uploads/63127e2495407887cb79c5ea/0eKxe_GSsBjt8wMjT9qaI.jpeg)

You can [customize the download process](https://docs.voxel51.com/integrations/huggingface.html#configuring-the-download-process), including the number of samples to
download, the name of the created dataset object, whether or not it is persisted
to disk, and more!
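
For instance, here is a sketch of a customized download — the parameter names (`max_samples`, `name`, `persistent`) follow the configuration docs linked above:

```python
from fiftyone.utils.huggingface import load_from_hub

## download only 100 samples, give the dataset a custom name,
## and persist it to disk so it survives between sessions
dataset = load_from_hub(
    "Voxel51/VisDrone2019-DET",
    max_samples=100,
    name="visdrone-mini",
    persistent=True,
)
```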

You can list all the available FiftyOne datasets on the Hub using:

```python
from huggingface_hub import HfApi
api = HfApi()
api.list_datasets(tags="fiftyone")
```
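
Each result is a `DatasetInfo` object from `huggingface_hub`, so you can, for example, print the repo IDs of the first few matches:

```python
for ds in api.list_datasets(tags="fiftyone", limit=5):
    print(ds.id)
```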

### Loading Parquet Datasets from the Hub with FiftyOne

You can also use the `load_from_hub()` function to load datasets from Parquet
files. Type conversions are handled for you and images are downloaded from URLs
if necessary.

With this functionality, [you can load](https://docs.voxel51.com/integrations/huggingface.html#basic-examples) any of the following:

- [FiftyOne-Compatible Image Classification Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-classification-datasets-665dfd51020d8b66a56c9b6f), like [Food101](https://huggingface.co/datasets/food101) and [ImageNet-Sketch](https://huggingface.co/datasets/imagenet_sketch)
- [FiftyOne-Compatible Object Detection Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-object-detection-datasets-665e0279c94ae552c7159a2b) like [CPPE-5](https://huggingface.co/datasets/cppe-5) and [WIDER FACE](https://huggingface.co/datasets/wider_face)
- [FiftyOne-Compatible Segmentation Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-segmentation-datasets-665e15b6ddb96a4d7226a380) like [SceneParse150](https://huggingface.co/datasets/scene_parse_150) and [Sidewalk Semantic](https://huggingface.co/datasets/segments/sidewalk-semantic)
- [FiftyOne-Compatible Image Captioning Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-captioning-datasets-665e16e29350244c06084505) like [COYO-700M](https://huggingface.co/datasets/kakaobrain/coyo-700m) and [New Yorker Caption Contest](https://huggingface.co/datasets/jmhessel/newyorker_caption_contest)
- [FiftyOne-Compatible Visual Question-Answering Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-vqa-datasets-665e16424ecc8a718156248a) like [TextVQA](https://huggingface.co/datasets/textvqa) and [ScienceQA](https://huggingface.co/datasets/derek-thomas/ScienceQA)

And many more!

As a simple example, we can load the first 1,000 samples from the
[WikiArt dataset](https://huggingface.co/datasets/huggan/wikiart) into FiftyOne with:

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "huggan/wikiart",  ## repo_id
    format="parquet",  ## for Parquet format
    classification_fields=["artist", "style", "genre"],  ## columns to treat as classification labels
    max_samples=1000,  ## number of samples to load
    name="wikiart",  ## name of the dataset in FiftyOne
)
```
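
As with any FiftyOne dataset, you can then explore the result in the App:

```python
session = fo.launch_app(dataset)
```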

![WikiArt Dataset](https://cdn-uploads.huggingface.co/production/uploads/63127e2495407887cb79c5ea/PCqCvTlNTG5SLtcK5fwuQ.jpeg)

## Pushing FiftyOne Datasets to the Hub

Pushing a dataset to the hub is as simple as:

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone.utils.huggingface import push_to_hub

## load example dataset
dataset = foz.load_zoo_dataset("quickstart")

## push to hub
push_to_hub(dataset, "my-hf-dataset")
```

When you call `push_to_hub()`, the dataset will be uploaded to a repo with the
specified name under your username, and the repo will be created if necessary. A [Dataset Card](./datasets-cards) will automatically be generated and populated with instructions for loading the dataset from the Hub. You can even upload a thumbnail image or GIF to appear on the Dataset Card via the `preview_path` argument.

Here’s an example using several of these optional arguments, which would upload the first three samples of FiftyOne's [Quickstart Video](https://docs.voxel51.com/user_guide/dataset_zoo/datasets.html#quickstart-video) dataset to the private repo `username/my-quickstart-video-dataset` with tags, an MIT license, a description, and a preview image:

```python
import fiftyone.zoo as foz
from fiftyone.utils.huggingface import push_to_hub

dataset = foz.load_zoo_dataset("quickstart-video", max_samples=3)

push_to_hub(
    dataset,
    "my-quickstart-video-dataset",
    tags=["video", "tracking"],
    license="mit",
    description="A dataset of video samples for tracking tasks",
    private=True,
    preview_path="<path/to/preview.png>",
)
```
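
Once pushed, the dataset can be loaded back with `load_from_hub()` — a sketch assuming the hypothetical repo `username/my-quickstart-video-dataset` from above:

```python
from fiftyone.utils.huggingface import load_from_hub

## replace "username" with your Hugging Face username
dataset = load_from_hub("username/my-quickstart-video-dataset")
```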

## 📚 Resources

- [🚀 Code-Along Colab Notebook](https://colab.research.google.com/drive/1l0kzfbJ2wtUw1EGS1tq1PJYoWenMlihp?usp=sharing)
- [🗺️ User Guide for FiftyOne Datasets](https://docs.voxel51.com/user_guide/using_datasets.html#)
- [🤗 FiftyOne 🤝 Hub Integration Docs](https://docs.voxel51.com/integrations/huggingface.html#huggingface-hub)
- [🤗 FiftyOne 🤝 Transformers Integration Docs](https://docs.voxel51.com/integrations/huggingface.html#transformers-library)
- [🧩 FiftyOne Hugging Face Hub Plugin](https://github.com/voxel51/fiftyone-huggingface-plugins)
1 change: 1 addition & 0 deletions docs/hub/datasets-libraries.md
@@ -11,5 +11,6 @@ The table below summarizes the supported libraries and their level of integration
| [Dask](./datasets-dask) | Parallel and distributed computing library that scales the existing Python and PyData ecosystem. |||
| [Datasets](./datasets-usage) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). |||
| [DuckDB](./datasets-duckdb) | In-process SQL OLAP database management system. |||
| [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. |||
| [Pandas](./datasets-pandas) | Python data analysis toolkit. |||
| [WebDataset](./datasets-webdataset) | Library to write I/O pipelines for large datasets. |||
