Training data #15

Open
christophschuhmann opened this issue Jan 21, 2022 · 2 comments

Comments

@christophschuhmann

I would like to know on what ruCLIP was trained.
We, LAION, have around 6B as-yet-unreleased image-text pairs, filtered with CLIP and mCLIP. Many of them are also Russian. :)

If you'd like access, let me know.

Christoph Schuhmann
www.laion.ai

@shonenkov
Contributor

@christophschuhmann Hello! Your LAION dataset is incredible. As a researcher, I would be interested in working with the Russian-language part of it.

ruCLIP was trained on datasets from open sources, datasets from the Sberbank ecosystem, and sample datasets translated using neural networks. We collected about 240M pairs, of which only 100M are in "native" Russian. The data turned out to be quite noisy, but it definitely contains a useful signal for ruCLIP.

My colleague Andrey Kuznetsov sent you an e-mail at [email protected]. Could you discuss the terms and conditions for using your dataset with him? We would be very grateful for your help.

@christophschuhmann
Author

Nice to hear from you. I have not received an email yet at [email protected].
Maybe it got caught in a spam filter. Could he send it again to [email protected]?

Waiting to hear from you :)
