API endpoints for gated dataset access requests #7364

jerome-white · 2025-01-09T06:21:20Z

Feature request

I would like a programmatic way of requesting access to gated datasets. The current solution to gain access forces me to visit a website and physically click an "agreement" button (as per the documentation).

An ideal approach would be HF API download methods that negotiate access on my behalf based on information from my CLI login and/or token. I realise that may be naive given the various types of access semantics available to dataset authors (automatic versus manual approval, for example) and complexities it might add to existing methods, but something along those lines would be nice.

Perhaps using the *_access_request methods available to dataset authors can be a precedent; see reject_access_request for example.
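For context, the owner-side workflow looks roughly like the following. This is a minimal sketch, assuming the `list_pending_access_requests`, `accept_access_request`, and `reject_access_request` methods of `huggingface_hub.HfApi`; the `triage_requests` helper and the `allowed` set are hypothetical names used only for illustration.

```python
def triage_requests(api, repo_id, allowed):
    """Accept pending requests from users in `allowed`; reject the rest.

    `api` is assumed to behave like huggingface_hub.HfApi.
    Returns a pair of username lists: (accepted, rejected).
    """
    accepted, rejected = [], []
    # Each pending request is assumed to expose a `username` attribute.
    for request in api.list_pending_access_requests(repo_id, repo_type="dataset"):
        user = request.username
        if user in allowed:
            api.accept_access_request(repo_id, user, repo_type="dataset")
            accepted.append(user)
        else:
            api.reject_access_request(repo_id, user, repo_type="dataset")
            rejected.append(user)
    return accepted, rejected
```

A dataset owner would call this with `HfApi()` from `huggingface_hub` as the `api` argument. What is missing is the mirror image: a call a dataset user can make to submit the request in the first place.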

Motivation

When trying to download files from a gated dataset, I'm met with a GatedRepoError and instructed to visit the repository's website to gain access:

Cannot access gated repo for url https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3.1-70B-Instruct-details/resolve/main/meta-llama__Meta-Llama-3.1-70B-Instruct/samples_leaderboard_math_precalculus_hard_2024-07-19T18-47-29.522341.jsonl.
Access to dataset open-llm-leaderboard/meta-llama__Meta-Llama-3.1-70B-Instruct-details is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3.1-70B-Instruct-details to ask for access.

This makes task automation extremely difficult. For example, I'm interested in studying sample-level responses of models on the LLM leaderboard -- how they answered particular questions on a given evaluation framework. As I come across more and more participants that gate their data, it's becoming unwieldy to continue my work (there are over 2,000 participants, so in the worst case that's the number of website visits I'd need to manually undertake).
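Short of a real API, the most a script can do today is catch the error and collect the pages that still need a manual visit. A minimal sketch, assuming `hf_hub_download` and `GatedRepoError` from `huggingface_hub`, and assuming the error text keeps the "Visit ... to ask for access" wording shown above; `access_page` and `find_gated` are hypothetical helper names.

```python
import re

# Matches the "Visit <url> to ask for access" sentence in the error text.
ACCESS_URL = re.compile(r"Visit (https://huggingface\.co/\S+) to ask for access")


def access_page(message):
    """Return the access-request URL found in an error message, or None."""
    match = ACCESS_URL.search(message)
    return match.group(1) if match else None


def find_gated(files):
    """Try each (repo_id, filename) pair; map gated repo ids to their access pages.

    Any error other than GatedRepoError is left to propagate.
    """
    # Imported here so the URL helper above works without the library installed.
    from huggingface_hub import hf_hub_download
    from huggingface_hub.utils import GatedRepoError

    pending = {}
    for repo_id, filename in files:
        try:
            hf_hub_download(repo_id, filename, repo_type="dataset")
        except GatedRepoError as err:
            pending[repo_id] = access_page(str(err))
    return pending
```

This only produces a to-do list of pages to visit; it does not request access, which is the gap this feature request is about.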

One approach is to use Selenium to react to the GatedRepoError, but that seems like overkill, and a potential violation of the HF terms of service (?).

As mentioned in the previous section, there seems to be an API for gated dataset owners to manage access requests, and thus some appetite for allowing automated management of gating. This feature request is to extend that to dataset users.

Your contribution

Whether I can help depends on a few things; one being the complexity of the underlying gated access design. If this feature request is accepted I am open to being involved in discussions and testing, and even development under the right time-outcome tradeoff.

jerome-white · 2025-01-09T10:52:38Z

Looks like a similar feature request was made to the HF Hub team. Is handling this at the Hub level more appropriate?

(As an aside, I've gotten the HTTP-based solution proposed in that forum to work for simple cases.)

julien-c · 2025-01-09T11:13:08Z

yes i think @Wauplin's comment on that thread is still what we recommend

jerome-white added the enhancement New feature or request label Jan 9, 2025

jerome-white closed this as completed Jan 9, 2025

jerome-white closed this as not planned Jan 9, 2025
