Training data #15

christophschuhmann · 2022-01-21T10:44:43Z

I would like to know on what ruCLIP was trained.
We, LAION, have around 6B as-yet-unreleased image-text pairs, filtered with CLIP and mCLIP. Many of them are also Russian. :)

If you'd like access, let me know.

Christoph Schuhmann
www.laion.ai

shonenkov · 2022-01-24T15:53:51Z

@christophschuhmann Hello! Your dataset LAION is incredible. As a researcher, I would be interested in working with the Russian-language part of your dataset.

ruCLIP was trained on datasets from open sources, datasets of the Sberbank ecosystem, and sample datasets translated using neural networks. We collected about 240M pairs, with only 100M in "native" Russian. The data turned out quite noisy, but the signal for ruCLIP is definitely there.

My colleague Andrey Kuznetsov sent you an e-mail [email protected]. Could you discuss the conditions and rules for using your dataset with him? We would be very grateful for your help.

christophschuhmann · 2022-01-24T16:43:06Z

Nice to hear from you. I have not received an email yet at [email protected].
Maybe it got caught in a spam filter. Could he send it again to [email protected]?

Waiting to hear from you :)
