From 2bb69f0487db2cedd1e6ddd9113afc874694af68 Mon Sep 17 00:00:00 2001 From: Andrea Francis Soria Jimenez Date: Mon, 27 May 2024 16:47:53 -0400 Subject: [PATCH] Datasets: Adding doc for DuckDB CLI integration (#1297) * Adding doc for duckdb cli integration * Apply code review suggestions * Apply suggestions from code review Co-authored-by: Sylvain Lesage Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Apply code review suggestions * Apply suggestions from code review Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> * Fix statistics output * Adding ref for other APIs * Add more information about when to use read_parquet --------- Co-authored-by: Sylvain Lesage Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> --- docs/hub/_toctree.yml | 11 ++ docs/hub/datasets-duckdb-auth.md | 46 +++++ .../hub/datasets-duckdb-combine-and-export.md | 97 +++++++++++ docs/hub/datasets-duckdb-select.md | 164 ++++++++++++++++++ docs/hub/datasets-duckdb-sql.md | 159 +++++++++++++++++ ...atasets-duckdb-vector-similarity-search.md | 63 +++++++ docs/hub/datasets-duckdb.md | 78 ++++++--- 7 files changed, 594 insertions(+), 24 deletions(-) create mode 100644 docs/hub/datasets-duckdb-auth.md create mode 100644 docs/hub/datasets-duckdb-combine-and-export.md create mode 100644 docs/hub/datasets-duckdb-select.md create mode 100644 docs/hub/datasets-duckdb-sql.md create mode 100644 docs/hub/datasets-duckdb-vector-similarity-search.md diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 3be6eeb3b..4b5409cb4 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -165,6 +165,17 @@ title: Datasets - local: datasets-duckdb title: DuckDB + sections: + - local: datasets-duckdb-auth + title: Authentication for private and gated datasets + - local: datasets-duckdb-select + title: Query datasets + - local: datasets-duckdb-sql + title: Perform SQL operations + - local: datasets-duckdb-combine-and-export + title: Combine datasets and export + - local: datasets-duckdb-vector-similarity-search + title: Perform vector similarity search - local: datasets-pandas title: Pandas - local: datasets-webdataset diff --git a/docs/hub/datasets-duckdb-auth.md b/docs/hub/datasets-duckdb-auth.md new file mode 100644 index 000000000..4f3a3aeea --- /dev/null +++ b/docs/hub/datasets-duckdb-auth.md @@ -0,0 +1,46 @@ +# Authentication for private and gated datasets + +To access private or gated datasets, you need to configure your Hugging Face Token in the DuckDB Secrets Manager. + +Visit [Hugging Face Settings - Tokens](https://huggingface.co/settings/tokens) to obtain your access token. + +DuckDB supports two providers for managing secrets: + +- `CONFIG`: Requires the user to pass all configuration information into the CREATE SECRET statement. +- `CREDENTIAL_CHAIN`: Automatically tries to fetch credentials. For the Hugging Face token, it will try to get it from `~/.cache/huggingface/token`. + +For more information about DuckDB Secrets visit the [Secrets Manager](https://duckdb.org/docs/configuration/secrets_manager.html) guide. + +## Creating a secret with `CONFIG` provider + +To create a secret using the CONFIG provider, use the following command: + +```bash +CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN 'your_hf_token'); +``` + +Replace `your_hf_token` with your actual Hugging Face token. 
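+
+Once the secret is created, DuckDB uses it automatically when reading `hf://` paths, so any ordinary query on a dataset you have access to should now work. As a quick check (a minimal sketch — `your-username/your-private-dataset` is a placeholder for a private or gated dataset you can access):
+
+```bash
+SELECT * FROM 'hf://datasets/your-username/your-private-dataset/train.parquet' LIMIT 3;
+```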
+
+## Creating a secret with `CREDENTIAL_CHAIN` provider
+
+To create a secret using the CREDENTIAL_CHAIN provider, use the following command:
+
+```bash
+CREATE SECRET hf_token (TYPE HUGGINGFACE, PROVIDER credential_chain);
+```
+
+This command automatically retrieves the stored token from `~/.cache/huggingface/token`.
+
+For the token to be stored there, you first need to [login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:
+
+```bash
+huggingface-cli login
+```
+
+Alternatively, you can set your Hugging Face token as an environment variable:
+
+```bash
+export HF_TOKEN="hf_xxxxxxxxxxxxx"
+```
+
+For more information on authentication, see the [Hugging Face authentication](https://huggingface.co/docs/huggingface_hub/main/en/quick-start#authentication) documentation.
diff --git a/docs/hub/datasets-duckdb-combine-and-export.md b/docs/hub/datasets-duckdb-combine-and-export.md
new file mode 100644
index 000000000..50240371a
--- /dev/null
+++ b/docs/hub/datasets-duckdb-combine-and-export.md
@@ -0,0 +1,97 @@
+# Combine datasets and export
+
+In this section, we'll demonstrate how to combine two datasets and export the result. The first dataset is in CSV format, and the second is in Parquet format. Let's start by examining our datasets.
+
+The first will be [TheFusion21/PokemonCards](https://huggingface.co/datasets/TheFusion21/PokemonCards):
+
+```bash
+FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' LIMIT 3;
+┌─────────┬──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┬───────┬─────────────────┐
+│   id    │      image_url       │                                                                 caption                                                                 │    name    │  hp   │    set_name     │
+│ varchar │       varchar        │                                                                 varchar                                                                 │  varchar   │ int64 │     varchar     │
+├─────────┼──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┼───────┼─────────────────┤
+│ pl3-1   │ https://images.pok…  │ A Basic, SP Pokemon Card of type Darkness with the title Absol G and 70 HP of rarity Rare Holo from the set Supreme Victors. 
It has … │ Absol G │ 70 │ Supreme Victors │ +│ ex12-1 │ https://images.pok… │ A Stage 1 Pokemon Card of type Colorless with the title Aerodactyl and 70 HP of rarity Rare Holo evolved from Mysterious Fossil from … │ Aerodactyl │ 70 │ Legend Maker │ +│ xy5-1 │ https://images.pok… │ A Basic Pokemon Card of type Grass with the title Weedle and 50 HP of rarity Common from the set Primal Clash and the flavor text: It… │ Weedle │ 50 │ Primal Clash │ +└─────────┴──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┴───────┴─────────────────┘ +``` + +And the second one will be [wanghaofan/pokemon-wiki-captions](https://huggingface.co/datasets/wanghaofan/pokemon-wiki-captions): + +```bash +FROM 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' LIMIT 3; + +┌──────────────────────┬───────────┬──────────┬──────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐ +│ image │ name_en │ name_zh │ text_en │ text_zh │ +│ struct(bytes blob,… │ varchar │ varchar │ varchar │ varchar │ +├──────────────────────┼───────────┼──────────┼──────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤ +│ {'bytes': \x89PNG\… │ abomasnow │ 暴雪王 │ Grass attributes,Blizzard King standing on two feet, with … │ 草属性,双脚站立的暴雪王,全身白色的绒毛,淡紫色的眼睛,几缕长条装的毛皮盖着它的嘴巴 │ +│ {'bytes': \x89PNG\… │ abra │ 凯西 │ Super power attributes, the whole body is yellow, the head… │ 超能力属性,通体黄色,头部外形类似狐狸,尖尖鼻子,手和脚上都有三个指头,长尾巴末端带着一个褐色圆环 │ +│ {'bytes': \x89PNG\… │ absol │ 阿勃梭鲁 │ Evil attribute, with white hair, blue-gray part without ha… │ 恶属性,有白色毛发,没毛发的部分是蓝灰色,头右边类似弓的角,红色眼睛 │ +└──────────────────────┴───────────┴──────────┴──────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┘ + +``` + +Now, let's try to combine these two datasets by joining on the `name` column: + +```bash +SELECT a.image_url + , a.caption AS card_caption + , a.name + , a.hp + , b.text_en as wiki_caption +FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a +JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b +ON LOWER(a.name) = b.name_en +LIMIT 3; + +┌──────────────────────┬──────────────────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ +│ image_url │ card_caption │ name │ hp │ wiki_caption │ +│ varchar │ varchar │ varchar │ int64 │ varchar │ +├──────────────────────┼──────────────────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ +│ https://images.pok… │ A Stage 1 Pokemon … │ Aerodactyl │ 70 │ A Pokémon with rock attributes, gray body, blue pupils, purple inner wings, two sharp claws on the wings, jagged teeth, and an arrow-like … │ +│ https://images.pok… │ A Basic Pokemon Ca… │ Weedle │ 50 │ Insect-like, caterpillar-like in appearance, with a khaki-yellow body, seven pairs of pink gastropods, a pink nose, a sharp poisonous need… │ +│ https://images.pok… │ A Basic Pokemon Ca… │ Caterpie │ 50 │ Insect attributes, caterpillar appearance, green back, white abdomen, Y-shaped red antennae 
on the head, yellow spindle-shaped tail, two p… │
+└──────────────────────┴──────────────────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+
+```
+
+We can export the result to a Parquet file using the `COPY` command:
+
+```bash
+COPY (SELECT a.image_url
+    , a.caption AS card_caption
+    , a.name
+    , a.hp
+    , b.text_en as wiki_caption
+FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a
+JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b
+ON LOWER(a.name) = b.name_en)
+TO 'output.parquet' (FORMAT PARQUET);
+```
+
+Let's validate the new Parquet file:
+
+```bash
+SELECT COUNT(*) FROM 'output.parquet';
+
+┌──────────────┐
+│ count_star() │
+│    int64     │
+├──────────────┤
+│     9460     │
+└──────────────┘
+
+```
+
+<Tip>
+
+You can also export to [CSV](https://duckdb.org/docs/guides/file_formats/csv_export), [Excel](https://duckdb.org/docs/guides/file_formats/excel_export) and [JSON](https://duckdb.org/docs/guides/file_formats/json_export) formats.
+
+</Tip>
+
+Finally, let's push the resulting dataset to the Hub. You can use the Hub UI or the `huggingface_hub` client library to upload your Parquet file; see more information [here](./datasets-adding).
+
+And that's it! You've successfully combined two datasets, exported the result, and uploaded it to the Hugging Face Hub.
diff --git a/docs/hub/datasets-duckdb-select.md b/docs/hub/datasets-duckdb-select.md
new file mode 100644
index 000000000..d050ba8d3
--- /dev/null
+++ b/docs/hub/datasets-duckdb-select.md
@@ -0,0 +1,164 @@
+# Query datasets
+
+Querying datasets is a fundamental step in data analysis. Here, we'll guide you through querying datasets using various methods.
+
+There are [several ways](https://duckdb.org/docs/data/parquet/overview.html) to select your data.
+
+Using the `FROM` syntax:
+```bash
+FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' SELECT city, country, region LIMIT 3;
+
+┌────────────────┬─────────────┬───────────────┐
+│      city      │   country   │    region     │
+│    varchar     │   varchar   │    varchar    │
+├────────────────┼─────────────┼───────────────┤
+│ Kabul          │ Afghanistan │ Southern Asia │
+│ Kandahar       │ Afghanistan │ Southern Asia │
+│ Mazar-e Sharif │ Afghanistan │ Southern Asia │
+└────────────────┴─────────────┴───────────────┘
+
+```
+
+Using the `SELECT` and `FROM` syntax:
+
+```bash
+SELECT city, country, region FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' USING SAMPLE 3;
+
+┌──────────┬─────────┬────────────────┐
+│   city   │ country │     region     │
+│ varchar  │ varchar │    varchar     │
+├──────────┼─────────┼────────────────┤
+│ Wenzhou  │ China   │ Eastern Asia   │
+│ Valdez   │ Ecuador │ South America  │
+│ Aplahoue │ Benin   │ Western Africa │
+└──────────┴─────────┴────────────────┘
+
+```
+
+Count all JSONL files matching a glob pattern:
+
+```bash
+SELECT COUNT(*) FROM 'hf://datasets/jamescalam/world-cities-geo/*.jsonl';
+
+┌──────────────┐
+│ count_star() │
+│    int64     │
+├──────────────┤
+│     9083     │
+└──────────────┘
+
+```
+
+You can also query Parquet files using the `read_parquet` function (or its alias `parquet_scan`). This function, along with other [parameters](https://duckdb.org/docs/data/parquet/overview.html#parameters), provides flexibility in handling Parquet files, especially if they don't have a `.parquet` extension. Let's explore these functions using the auto-converted Parquet files from the same dataset.
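+
+`read_parquet` also accepts an explicit list of file paths, which is handy when the files you want to read don't share a common glob pattern (a minimal sketch — the single-element list below is just for illustration):
+
+```bash
+SELECT * FROM read_parquet(['hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet']) LIMIT 3;
+```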
+ +Select using [read_parquet](https://duckdb.org/docs/guides/file_formats/query_parquet.html) function: + +```bash +SELECT * FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet') LIMIT 3; + +┌────────────────┬─────────────┬───────────────┬───────────┬────────────┬────────────┬────────────────────┬───────────────────┬────────────────────┐ +│ city │ country │ region │ continent │ latitude │ longitude │ x │ y │ z │ +│ varchar │ varchar │ varchar │ varchar │ double │ double │ double │ double │ double │ +├────────────────┼─────────────┼───────────────┼───────────┼────────────┼────────────┼────────────────────┼───────────────────┼────────────────────┤ +│ Kabul │ Afghanistan │ Southern Asia │ Asia │ 34.5166667 │ 69.1833344 │ 1865.546409629258 │ 4906.785732164055 │ 3610.1012966606136 │ +│ Kandahar │ Afghanistan │ Southern Asia │ Asia │ 31.61 │ 65.6999969 │ 2232.782351694877 │ 4945.064042683584 │ 3339.261233224765 │ +│ Mazar-e Sharif │ Afghanistan │ Southern Asia │ Asia │ 36.7069444 │ 67.1122208 │ 1986.5057687360124 │ 4705.51748048584 │ 3808.088900172991 │ +└────────────────┴─────────────┴───────────────┴───────────┴────────────┴────────────┴────────────────────┴───────────────────┴────────────────────┘ + +``` + +Read all files that match a glob pattern and include a filename column specifying which file each row came from: + +```bash +SELECT city, country, filename FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet', filename = true) LIMIT 3; + +┌────────────────┬─────────────┬───────────────────────────────────────────────────────────────────────────────┐ +│ city │ country │ filename │ +│ varchar │ varchar │ varchar │ +├────────────────┼─────────────┼───────────────────────────────────────────────────────────────────────────────┤ +│ Kabul │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ +│ Kandahar │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ +│ Mazar-e Sharif │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ +└────────────────┴─────────────┴───────────────────────────────────────────────────────────────────────────────┘ + +``` + +## Get metadata and schema + +The [parquet_metadata](https://duckdb.org/docs/data/parquet/metadata.html) function can be used to query the metadata contained within a Parquet file. 
+ +```bash +SELECT * FROM parquet_metadata('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet'); + +┌───────────────────────────────────────────────────────────────────────────────┬──────────────┬────────────────────┬─────────────┐ +│ file_name │ row_group_id │ row_group_num_rows │ compression │ +│ varchar │ int64 │ int64 │ varchar │ +├───────────────────────────────────────────────────────────────────────────────┼──────────────┼────────────────────┼─────────────┤ +│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ 0 │ 1000 │ SNAPPY │ +│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ 0 │ 1000 │ SNAPPY │ +│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │ 0 │ 1000 │ SNAPPY │ +└───────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────────┘ + +``` + +Fetch the column names and column types: + +```bash +DESCRIBE SELECT * FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet'; + +┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐ +│ column_name │ column_type │ null │ key │ default │ extra │ +│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ +├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤ +│ city │ VARCHAR │ YES │ │ │ │ +│ country │ VARCHAR │ YES │ │ │ │ +│ region │ VARCHAR │ YES │ │ │ │ +│ continent │ VARCHAR │ YES │ │ │ │ +│ latitude │ DOUBLE │ YES │ │ │ │ +│ longitude │ DOUBLE │ YES │ │ │ │ +│ x │ DOUBLE │ YES │ │ │ │ +│ y │ DOUBLE │ YES │ │ │ │ +│ z │ DOUBLE │ YES │ │ │ │ +└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘ + +``` + +Fetch the internal schema (excluding the file name): + +```bash +SELECT * EXCLUDE (file_name) FROM parquet_schema('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet'); + +┌───────────┬────────────┬─────────────┬─────────────────┬──────────────┬────────────────┬───────┬───────────┬──────────┬──────────────┐ +│ name │ type │ type_length │ repetition_type │ num_children │ converted_type │ scale │ precision │ field_id │ logical_type │ +│ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │ int64 │ int64 │ int64 │ varchar │ +├───────────┼────────────┼─────────────┼─────────────────┼──────────────┼────────────────┼───────┼───────────┼──────────┼──────────────┤ +│ schema │ │ │ REQUIRED │ 9 │ │ │ │ │ │ +│ city │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │ +│ country │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │ +│ region │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │ +│ continent │ BYTE_ARRAY │ │ OPTIONAL │ │ UTF8 │ │ │ │ StringType() │ +│ latitude │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │ +│ longitude │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │ +│ x │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │ +│ y │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │ +│ z │ DOUBLE │ │ OPTIONAL │ │ │ │ │ │ │ +├───────────┴────────────┴─────────────┴─────────────────┴──────────────┴────────────────┴───────┴───────────┴──────────┴──────────────┤ + +``` + +## Get statistics + +The `SUMMARIZE` command can be used to get various aggregates over a query (min, max, approx_unique, avg, std, q25, q50, q75, count). It returns these statistics along with the column name, column type, and the percentage of NULL values. 
+ +```bash +SUMMARIZE SELECT latitude, longitude FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet'; + +┌─────────────┬─────────────┬──────────────┬─────────────┬───────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬───────┬─────────────────┐ +│ column_name │ column_type │ min │ max │ approx_unique │ avg │ std │ q25 │ q50 │ q75 │ count │ null_percentage │ +│ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │ varchar │ varchar │ varchar │ varchar │ int64 │ decimal(9,2) │ +├─────────────┼─────────────┼──────────────┼─────────────┼───────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼───────┼─────────────────┤ +│ latitude │ DOUBLE │ -54.8 │ 67.8557214 │ 7324 │ 22.5004568364307 │ 26.770454684690925 │ 6.089858461951687 │ 29.321258648324747 │ 44.90191158328915 │ 9083 │ 0.00 │ +│ longitude │ DOUBLE │ -175.2166595 │ 179.3833313 │ 7802 │ 14.699333721953098 │ 63.93672742608224 │ -6.877990418604821 │ 19.12963979385393 │ 43.873513093419966 │ 9083 │ 0.00 │ +└─────────────┴─────────────┴──────────────┴─────────────┴───────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴───────┴─────────────────┘ + +``` diff --git a/docs/hub/datasets-duckdb-sql.md b/docs/hub/datasets-duckdb-sql.md new file mode 100644 index 000000000..8cbab28d7 --- /dev/null +++ b/docs/hub/datasets-duckdb-sql.md @@ -0,0 +1,159 @@ +# Perform SQL operations + +Performing SQL operations with DuckDB opens up a world of possibilities for querying datasets efficiently. Let's dive into some examples showcasing the power of DuckDB functions. + +For our demonstration, we'll explore a fascinating dataset. The [MMLU](https://huggingface.co/datasets/cais/mmlu) dataset is a multitask test containing multiple-choice questions spanning various knowledge domains. + +To preview the dataset, let's select a sample of 3 rows: + +```bash +FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3; + +┌──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┐ +│ question │ subject │ choices │ answer │ +│ varchar │ varchar │ varchar[] │ int64 │ +├──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤ +│ The model of light… │ conceptual_physics │ [wave model, particle model, Both of these, Neither of these] │ 1 │ +│ A person who is lo… │ professional_psych… │ [his/her life scripts., his/her own feelings, attitudes, and beliefs., the emotional reactions and behaviors of the people he/she is interacting with.… │ 1 │ +│ The thermic effect… │ nutrition │ [is substantially higher for carbohydrate than for protein, is accompanied by a slight decrease in body core temperature., is partly related to sympat… │ 2 │ +└──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘ + +``` + +This command retrieves a random sample of 3 rows from the dataset for us to examine. + +Let's start by examining the schema of our dataset. 
The following table outlines the structure of our dataset:
+
+```bash
+DESCRIBE FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3;
+┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
+│ column_name │ column_type │  null   │   key   │ default │  extra  │
+│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
+├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
+│ question    │ VARCHAR     │ YES     │         │         │         │
+│ subject     │ VARCHAR     │ YES     │         │         │         │
+│ choices     │ VARCHAR[]   │ YES     │         │         │         │
+│ answer      │ BIGINT      │ YES     │         │         │         │
+└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
+
+```
+Next, let's check whether there are any duplicated records in our dataset (a duplicate is any record appearing more than once):
+
+```bash
+SELECT *,
+       COUNT(*) AS counts
+FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
+GROUP BY ALL
+HAVING counts > 1;
+
+┌──────────┬─────────┬───────────┬────────┬────────┐
+│ question │ subject │  choices  │ answer │ counts │
+│ varchar  │ varchar │ varchar[] │ int64  │ int64  │
+├──────────┴─────────┴───────────┴────────┴────────┤
+│                      0 rows                       │
+└───────────────────────────────────────────────────┘
+
+```
+
+Fortunately, our dataset doesn't contain any duplicate records.
+
+Let's see the proportion of questions per subject, displayed as a bar chart:
+
+```bash
+SELECT
+    subject,
+    COUNT(*) AS counts,
+    BAR(COUNT(*), 0, (SELECT COUNT(*) FROM 'hf://datasets/cais/mmlu/all/test-*.parquet')) AS percentage
+FROM
+    'hf://datasets/cais/mmlu/all/test-*.parquet'
+GROUP BY
+    subject
+ORDER BY
+    counts DESC;
+
+┌──────────────────────────────┬────────┬────────────────────────────────────────────────────────────────────────────────┐
+│           subject            │ counts │                                   percentage                                   │
+│           varchar            │ int64  │                                    varchar                                     │
+├──────────────────────────────┼────────┼────────────────────────────────────────────────────────────────────────────────┤
+│ professional_law             │   1534 │ ████████▋                                                                      │
+│ moral_scenarios              │    895 │ █████                                                                          │
+│ miscellaneous                │    783 │ ████▍                                                                          │
+│ professional_psychology      │    612 │ ███▍                                                                           │
+│ high_school_psychology       │    545 │ ███                                                                            │
+│ high_school_macroeconomics   │    390 │ ██▏                                                                            │
+│ elementary_mathematics       │    378 │ ██▏                                                                            │
+│ moral_disputes               │    346 │ █▉                                                                             │
+├──────────────────────────────┴────────┴────────────────────────────────────────────────────────────────────────────────┤
+│ 57 rows (8 shown)                                                                                             3 columns │
+└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+
+```
+
+Now, let's prepare a subset of the dataset containing questions related to **nutrition** and create a mapping of questions to correct answers.
+Notice that we can get the correct answer from the **choices** column, using the **answer** column as an index. Keep in mind that the **answer** column is 0-based while DuckDB list indexing is 1-based, so we need to add 1 to the index.
+
+```bash
+SELECT *
+FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
+WHERE subject = 'nutrition' LIMIT 3;
+
+┌──────────────────────┬───────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┐
+│       question       │  subject  │                                                                       choices                                                                       │ answer │
+│       varchar        │  varchar  │                                                                      varchar[]                                                                      │ int64  │
+├──────────────────────┼───────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
+│ Which foods tend t…  │ nutrition │ [Meat, Confectionary, Fruits and vegetables, Potatoes]                                                                                              │      2 │
+│ In which one of th…  │ nutrition │ [If the incidence rate of the disease falls., If survival time with the disease increases., If recovery of the disease is faster., If the population in which the… │      1 │
+│ Which of the follo…  │ nutrition │ [The flavonoid class comprises flavonoids and isoflavonoids., The digestibility and bioavailability of isoflavones in soya food products are not changed by proce… │      0 │
+└──────────────────────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘
+
+```
+
+```bash
+SELECT question,
+       choices[answer + 1] AS correct_answer
+FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
+WHERE subject = 'nutrition' LIMIT 3;
+
+┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────┐
+│                                                                question                                                                │                        correct_answer                         │
+│                                                                varchar                                                                 │                            varchar                            │
+├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────┤
+│ Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)?\n                                              │ Fruits and vegetables                                         │
+│ In which one of the following circumstances will the prevalence of a disease in the population increase, all else being constant?\n   │ If survival time with the disease increases.                  │
+│ Which of the following statements is correct?\n                                                                                        │ The flavonoid class comprises flavonoids and isoflavonoids.   │
+└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────┘
+
+```
+
+To tidy up the results, let's remove the newline characters at the end of the questions and, as a safety check, filter out any empty answers:
+
+```bash
+SELECT regexp_replace(question, '\n', '') AS question,
+       choices[answer + 1] AS correct_answer
+FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
+WHERE subject = 'nutrition' AND LENGTH(correct_answer) > 0 LIMIT 3;
+
+┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────┐
+│                                                              question                                                              │                        correct_answer                         │
+│                                                               varchar                                                              │                            varchar                            │
+├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────┤
+│ Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)?                                            │ Fruits and vegetables                                         │
+│ In which one of the following circumstances will the prevalence of a disease in the population increase, all else being constant?  │ If survival time with the disease increases.                  │
+│ Which of the following statements is correct?                                                                                      │ The flavonoid class comprises flavonoids and isoflavonoids.   │
+└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────┘
+
+```
+
+Finally, let's highlight some of the DuckDB functions used in this section:
+- `DESCRIBE`, returns the table schema.
+- `USING SAMPLE`, randomly selects a subset of rows from a dataset.
+- `BAR`, draws a band whose width is proportional to (x - min) and equal to `width` characters when x = max. `width` defaults to 80.
+- `list[index]`, extracts the list element at the given index. Note that DuckDB list indexing is 1-based.
+- `regexp_replace`, if the string contains the regexp pattern, replaces the matching part with the replacement.
+- `LENGTH`, gets the number of characters in the string.
+
+<Tip>
+
+There are plenty of useful functions available in DuckDB's [SQL functions overview](https://duckdb.org/docs/sql/functions/overview). The best part is that you can use them directly on Hugging Face datasets.
+
+</Tip>
diff --git a/docs/hub/datasets-duckdb-vector-similarity-search.md b/docs/hub/datasets-duckdb-vector-similarity-search.md
new file mode 100644
index 000000000..ef6aed390
--- /dev/null
+++ b/docs/hub/datasets-duckdb-vector-similarity-search.md
@@ -0,0 +1,63 @@
+# Perform vector similarity search
+
+The Fixed-Length Arrays feature was added in DuckDB version 0.10.0. This lets you use vector embeddings in DuckDB tables, making your data analysis even more powerful.
+
+Additionally, the `array_cosine_similarity` function was introduced. This function measures the cosine of the angle between two vectors, indicating their similarity. A value of 1 means they’re perfectly aligned, 0 means they’re perpendicular, and -1 means they’re completely opposite.
+
+In this section, we'll explore how to use this function to perform similarity searches with DuckDB.
+
+We will use the [asoria/awesome-chatgpt-prompts-embeddings](https://huggingface.co/datasets/asoria/awesome-chatgpt-prompts-embeddings) dataset.
+
+First, let's preview a few records from the dataset:
+
+```bash
+FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' SELECT act, prompt, len(embedding) as embed_len LIMIT 3;
+
+┌──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────┐
+│         act          │                                                                                   prompt                                                                                   │ embed_len │
+│       varchar        │                                                                                  varchar                                                                                   │   int64   │
+├──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────┤
+│ Linux Terminal       │ I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output insid… │       384 │
+│ English Translator…  │ I want you to act as an English translator, spelling corrector and improver. I will speak to you in any language and you will detect the language, translate it and answer… │       384 │
+│ `position` Intervi…  │ I want you to act as an interviewer. I will be the candidate and you will ask me the interview questions for the `position` position. 
I want you to only reply as the inte… │ 384 │ +└──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────┘ + +``` + +Next, let's choose an embedding to use for the similarity search: + +```bash +FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' SELECT embedding WHERE act = 'Linux Terminal'; + +┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐ +│ embedding │ +│ float[] │ +├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ +│ [-0.020781303, -0.029143505, -0.0660217, -0.00932716, -0.02601602, -0.011426172, 0.06627567, 0.11941507, 0.0013917526, 0.012889079, 0.053234346, -0.07380514, 0.04871567, -0.043601237, -0.0025319182, 0.0448… │ +└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘ + +``` + +Now, let's use the selected embedding to find similar records: + + +```bash +SELECT act, + prompt, + array_cosine_similarity(embedding::float[384], (SELECT embedding FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' WHERE act = 'Linux Terminal')::float[384]) AS similarity +FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' +ORDER BY similarity DESC +LIMIT 3; + +┌──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┐ +│ act │ prompt │ similarity │ +│ varchar │ varchar │ float │ +├──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┤ +│ Linux Terminal │ I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output insi… │ 1.0 │ +│ JavaScript Console │ I want you to act as a javascript console. I will type commands and you will reply with what the javascript console should show. I want you to only reply with the termin… │ 0.7599728 │ +│ R programming Inte… │ I want you to act as a R interpreter. I'll type commands and you'll reply with what the terminal should show. I want you to only reply with the terminal output inside on… │ 0.7303775 │ +└──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┘ + +``` + +That's it! You have successfully performed a vector similarity search using DuckDB. diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md index 38769c707..30b0d3b87 100644 --- a/docs/hub/datasets-duckdb.md +++ b/docs/hub/datasets-duckdb.md @@ -1,43 +1,73 @@ # DuckDB [DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system. 
-Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub:
+You can use the Hugging Face paths (`hf://`) to access data on the Hub:
 
-First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:
+The [DuckDB CLI](https://duckdb.org/docs/api/cli/overview.html) (Command Line Interface) is a single, dependency-free executable.
+There are also other APIs available for running DuckDB, including Python, C++, Go, Java, Rust, and more. For additional details, visit their [clients](https://duckdb.org/docs/api/overview.html) page.
 
-```
-huggingface-cli login
+<Tip>
+
+For installation details, visit the [installation page](https://duckdb.org/docs/installation).
+
+</Tip>
+
+Starting from version `v0.10.3`, the DuckDB CLI includes native support for accessing datasets on the Hugging Face Hub via URLs with the `hf://` scheme. Here are some features you can leverage with this powerful tool:
+
+- Query public datasets and your own gated and private datasets
+- Analyze datasets and perform SQL operations
+- Combine datasets and export them to different formats
+- Conduct vector similarity search on embedding datasets
+- Implement full-text search on datasets
+
+For a complete list of DuckDB features, visit the DuckDB [documentation](https://duckdb.org/docs/).
+
+To start the CLI, execute the following command in the installation folder:
+
+```bash
+./duckdb
 ```
 
-Then you can [Create a dataset repository](../huggingface_hub/quick-start#create-a-repository), for example using:
+## Forging the Hugging Face URL
 
-```python
-from huggingface_hub import HfApi
+To access Hugging Face datasets, use the following URL format:
 
-HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
+```plaintext
+hf://datasets/{my-username}/{my-dataset}/{path_to_file}
 ```
 
-Finally, you can use [Hugging Face paths]([Hugging Face paths](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system#integrations)) in DuckDB:
+- **my-username**, the user or organization of the dataset, e.g. `ibm`
+- **my-dataset**, the dataset name, e.g. `duorc`
+- **path_to_file**, the file path, which supports glob patterns, e.g. `**/*.parquet` to query all Parquet files
 
-```python
->>> from huggingface_hub import HfFileSystem
->>> import duckdb
+<Tip>
 
->>> fs = HfFileSystem()
->>> duckdb.register_filesystem(fs)
->>> duckdb.sql("COPY tbl TO 'hf://datasets/username/my_dataset/data.parquet' (FORMAT PARQUET);")
+You can query auto-converted Parquet files using the `@~parquet` branch, which corresponds to the `refs/convert/parquet` revision. For more details, refer to the documentation at https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet.
+
+To reference the `refs/convert/parquet` revision of a dataset, use the following syntax:
+
+```plaintext
+hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file}
+```
+
+Here is a sample URL following the above syntax:
+
+```plaintext
+hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/0000.parquet
+```
+
+</Tip>
 
-This creates a file `data.parquet` in the dataset repository `username/my_dataset` containing your dataset in Parquet format.
-You can reload it later:
 
-```python
->>> from huggingface_hub import HfFileSystem
->>> import duckdb
 
->>> fs = HfFileSystem()
->>> duckdb.register_filesystem(fs)
->>> df = duckdb.query("SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;").df()
+Let's start with a quick demo querying the first rows of a dataset:
+
+```sql
+FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
 ```
 
-To have more information on the Hugging Face paths and how they are implemented, please refer to the [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system).
+Or using traditional SQL syntax:
+
+```sql
+SELECT * FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
+```
+
+In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets.
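+
+Before moving on, here is one more quick query that exercises the `@~parquet` revision syntax from the Tip above (a sketch reusing the sample URL shown earlier — the reported count depends on the current conversion):
+
+```sql
+SELECT COUNT(*) FROM 'hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/0000.parquet';
+```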