API endpoints for gated dataset access requests #7364

Closed
jerome-white opened this issue Jan 9, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@jerome-white

Feature request

I would like a programmatic way of requesting access to gated datasets. The current solution to gain access forces me to visit a website and physically click an "agreement" button (as per the documentation).

An ideal approach would be HF API download methods that negotiate access on my behalf based on information from my CLI login and/or token. I realise that may be naive given the various types of access semantics available to dataset authors (automatic versus manual approval, for example) and complexities it might add to existing methods, but something along those lines would be nice.

Perhaps using the *_access_request methods available to dataset authors can be a precedent; see reject_access_request for example.
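For context, the owner-side precedent could be sketched roughly as follows. This is only an illustration, not a proposed design: it assumes `api` is an `HfApi` instance from `huggingface_hub` whose token has owner rights on the repository, and the helper names `pending_usernames` and `reject_pending` are hypothetical.

```python
from typing import Any, Iterable, List


def pending_usernames(api: Any, repo_id: str) -> List[str]:
    """Return the usernames with a pending access request on a gated dataset.

    `api` is assumed to be a `huggingface_hub.HfApi` instance.
    """
    requests = api.list_pending_access_requests(repo_id, repo_type="dataset")
    return [req.username for req in requests]


def reject_pending(api: Any, repo_id: str, usernames: Iterable[str]) -> int:
    """Reject the pending requests of the given users; return how many were rejected."""
    count = 0
    for user in usernames:
        api.reject_access_request(repo_id, user, repo_type="dataset")
        count += 1
    return count
```

A user-side counterpart would presumably look similar, with a single "request access" call in place of the accept/reject pair; no such method is assumed to exist here.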

Motivation

When trying to download files from a gated dataset, I'm met with a GatedRepoError and instructed to visit the repository's website to gain access:

Cannot access gated repo for url https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3.1-70B-Instruct-details/resolve/main/meta-llama__Meta-Llama-3.1-70B-Instruct/samples_leaderboard_math_precalculus_hard_2024-07-19T18-47-29.522341.jsonl.
Access to dataset open-llm-leaderboard/meta-llama__Meta-Llama-3.1-70B-Instruct-details is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3.1-70B-Instruct-details to ask for access.

This makes task automation extremely difficult. For example, I'm interested in studying sample-level responses of models on the LLM leaderboard -- how they answered particular questions on a given evaluation framework. As I come across more and more participants that gate their data, it's becoming unwieldy to continue my work (there are over 2,000 participants, so in the worst case that's the number of website visits I'd need to manually undertake).
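To illustrate the current workaround, here is a minimal sketch that at least batches the manual visits: it tries each repository and collects the "ask for access" URL from the error text quoted above. The function name `collect_access_urls` and the `download` callable are hypothetical; in practice `download` would wrap something like `hf_hub_download`, and the broad `except` would be narrowed to `GatedRepoError`. The regular expression assumes the message wording stays as shown.

```python
import re
from typing import Callable, Iterable, List

# Matches the tail of the error text quoted above:
# "... Visit <url> to ask for access."
ACCESS_URL = re.compile(r"Visit (https://\S+) to ask for access")


def collect_access_urls(
    repo_ids: Iterable[str], download: Callable[[str], object]
) -> List[str]:
    """Try each repo and return the sorted, unique access-request URLs
    of the ones that fail with a gating error."""
    urls = set()
    for repo_id in repo_ids:
        try:
            download(repo_id)
        except Exception as err:  # in practice, narrow this to GatedRepoError
            match = ACCESS_URL.search(str(err))
            if match is None:
                raise  # not a gating error, so do not hide it
            urls.add(match.group(1))
    return sorted(urls)
```

This still leaves one browser visit per gated repository; it only avoids discovering them one failure at a time.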

One approach is to use Selenium to react to the GatedRepoError, but that seems like overkill, and a potential violation of the HF terms of service (?).

As mentioned in the previous section, there seems to be an API for gated dataset owners to manage access requests, and thus some appetite for allowing automated management of gating. This feature request is to extend that to dataset users.

Your contribution

Whether I can help depends on a few things, one being the complexity of the underlying gated access design. If this feature request is accepted, I am open to being involved in discussions and testing, and even development under the right time-outcome tradeoff.

@jerome-white jerome-white added the enhancement New feature or request label Jan 9, 2025
@jerome-white
Author

Looks like a similar feature request was made to the HF Hub team. Is handling this at the Hub level more appropriate?

(As an aside, I've gotten the HTTP-based solution proposed in that forum to work for simple cases.)

@julien-c

This comment was marked as off-topic.

@julien-c
Member

julien-c commented Jan 9, 2025

yes i think @Wauplin's comment on that thread is still what we recommend

@jerome-white closed this as not planned on Jan 9, 2025