Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add sql console section #1434

Merged
merged 18 commits into from
Sep 27, 2024
Merged
Show file tree
Hide file tree
Changes from 13 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
node_modules/
__pycache__/
.vscode/
.idea/

Expand Down
2 changes: 2 additions & 0 deletions docs/hub/_toctree.yml
Original file line number Diff line number Diff line change
Expand Up @@ -208,6 +208,8 @@
title: Configure the Dataset Viewer
- local: datasets-viewer-embed
title: Embed the Dataset Viewer in a webpage
- local: datasets-viewer-sql-console
title: "SQL Console"
- local: datasets-download-stats
title: Datasets Download Stats
- local: datasets-data-files-configuration
Expand Down
130 changes: 130 additions & 0 deletions docs/hub/datasets-viewer-sql-console.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
# SQL Console: Query Hugging Face datasets in your browser

You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the dataset page by clicking on the **SQL Console** badge.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/sql-console-histogram.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/sql-console-histogram-dark.png"/>
</div>

<p class="text-sm text-center italic">
To learn more about the SQL Console, see the <a href="https://huggingface.co/blog/sql-console" target="_blank" rel="noopener noreferrer">SQL Console blog post</a>.
</p>


Through the SQL Console, you can:

- Run [DuckDB SQL queries](https://duckdb.org/docs/sql/query_syntax/select) on the dataset (_checkout [SQL Snippets](https://huggingface.co/spaces/cfahlgren1/sql-snippets) for useful queries_)
- Share results of the query with others via a link (_check out [this example](https://huggingface.co/datasets/gretelai/synthetic-gsm8k-reflection-405b?sql_console=true&sql=FROM+histogram%28%0A++train%2C%0A++topic%2C%0A++bin_count+%3A%3D+10%0A%29)_)
- Download the results of the query to a parquet file
- Embed the results of the query in your own webpage using an iframe

cfahlgren1 marked this conversation as resolved.
Show resolved Hide resolved
<Tip>
You can also use the DuckDB CLI to query the dataset via the `hf://` protocol. See the <a href="https://huggingface.co/docs/hub/en/datasets-duckdb" target="_blank" rel="noopener noreferrer">DuckDB CLI documentation</a> for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI.
cfahlgren1 marked this conversation as resolved.
Show resolved Hide resolved
</Tip>


## Examples

### Leakage Detection
cfahlgren1 marked this conversation as resolved.
Show resolved Hide resolved

Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/leakage-detection.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/leakage-detection-dark.png"/>
</div>
cfahlgren1 marked this conversation as resolved.
Show resolved Hide resolved

<p class="text-sm text-center italic">
Learn more about leakage detection <a href="https://huggingface.co/blog/lbourdois/lle">here</a>.
</p>

```sql
WITH
overlapping_rows AS (
SELECT COALESCE(
(SELECT COUNT(*) AS overlap_count
FROM train
INTERSECT
SELECT COUNT(*) AS overlap_count
FROM test),
0
) AS overlap_count
),
total_unique_rows AS (
SELECT COUNT(*) AS total_count
FROM (
SELECT * FROM train
UNION
SELECT * FROM test
) combined
)
SELECT
overlap_count,
total_count,
CASE
WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count)
ELSE 0
END AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
```

### Filtering

The SQL Console makes filtering datasets really easily. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length of at least 10, you can use the following query:
cfahlgren1 marked this conversation as resolved.
Show resolved Hide resolved

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/bar-struct-length.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/bar-struct-length-dark.png"/>
</div>

In the query, we can use the `len` function to get the length of the `reasoning_chains` column and the `bar` function to create a bar chart of the reasoning lengths.

```sql
select len(reasoning_chains) as reason_len, bar(reason_len, 0, 100), *
from train
where reason_len > 10
order by reason_len desc
cfahlgren1 marked this conversation as resolved.
Show resolved Hide resolved
```

The [bar](https://duckdb.org/docs/sql/functions/char.html#barx-min-max-width) function is a neat built-in DuckDB function that creates a bar chart of the reasoning lengths.

### Histogram

Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values.

For example, to plot a histogram of the `reason_len` column in the `SkunkworksAI/reasoning-0.01` dataset, you can use the following query:

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/histogram-simple.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/histogram-simple-dark.png"/>
</div>
<p class="text-sm text-center italic">
Learn more about the `histogram` function and parameters <a href="https://cfahlgren1-sql-snippets.hf.space/histogram" target="_blank" rel="noopener noreferrer">here</a>.
</p>

```sql
from histogram(train, len(reasoning_chains))
cfahlgren1 marked this conversation as resolved.
Show resolved Hide resolved
```

### Regex Matching

One of the most powerful features of DuckDB is the deep support for regular expressions. You can use the `regexp` function to match patterns in your data.

Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the `SkunkworksAI/reasoning-0.01` dataset for instructions that contain markdown code blocks.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/regex-matching-markdown-code.png"/>
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/sql_console/regex-matching-markdown-code-dark.png"/>
</div>
<p class="text-sm text-center italic">
Learn more about the DuckDB regex functions <a href="https://duckdb.org/docs/sql/functions/regular_expressions.html" target="_blank" rel="noopener noreferrer">here</a>.
</p>


```sql
SELECT *
FROM train
WHERE regexp_matches(instruction, '```[a-z]*\n')
limit 100
```
5 changes: 5 additions & 0 deletions docs/hub/datasets-viewer.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,11 @@ Similarly, if you select one class from a categorical column, it will show only

You can search for a word in the dataset by typing it in the search bar at the top of the table. The search is case-insensitive and will match any row containing the word. The text is searched in the columns of `string`, even if the values are nested in a dictionary or a list.

## Run SQL queries on the dataset

You can run SQL queries on the dataset in the browser using the SQL Console. This feature also leverages our [auto-conversion to Parquet](datasets-viewer#access-the-parquet-files).
For more information see our guide on [SQL Console](./datasets-viewer-sql-console).

## Share a specific row

You can share a specific row by clicking on it, and then copying the URL in the address bar of your browser. For example https://huggingface.co/datasets/nyu-mll/glue/viewer/mrpc/test?p=2&row=241 will open the dataset viewer on the MRPC dataset, on the test split, and on the 241st row.
Expand Down
Loading