
fix: added annotations for training data #1742

Open
wants to merge 7 commits into
base: main
Conversation

@KennethEnevoldsen (Contributor) commented Jan 9, 2025

Added training data annotations for a variety of models.

This was quite hard to do, I must say, so something might be wrong. A review would be great, especially of the sentence-embedding training data, since that would resolve a lot of downstream cases.

We should probably tag relevant model authors here as well.

@Muennighoff I am unsure whether the NQ test split on mteb corresponds to the train/dev split of Natural Questions. Can you also take a look at the StackExchange cases?

If the split wasn't annotated, I assumed that the model was trained on the full dataset (including test).

Addresses #1720
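For context, a training-data annotation of this kind could look roughly like the following. This is a hypothetical sketch: the variable name, the mapping shape (task name to trained splits), and the helper function are assumptions for illustration, not necessarily the exact schema used in mteb's `ModelMeta`.

```python
# Hypothetical sketch of a training-data annotation for one model.
# The shape (MTEB task name -> list of trained splits) is an assumption,
# not necessarily mteb's actual schema.
training_datasets = {
    "NQ": ["train"],       # trained only on the train split
    "MSMARCO": ["train"],
    # Split not documented by the authors: assume the full dataset,
    # including the test split, as described above.
    "StackExchangeClusteringP2P": ["train", "test"],
}

def trained_splits(task_name: str) -> list[str]:
    """Return the splits of `task_name` the model was trained on (empty if none)."""
    return training_datasets.get(task_name, [])
```

With a mapping like this, downstream tooling can check whether a given task's splits overlap with a model's training data.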

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

@Samoed (Collaborator) left a comment

Awesome!

mteb/models/openai_models.py
@Muennighoff (Contributor)

Re: NQ — I think it is probably a subset of the dev split (filtering out queries without an answer, queries with a table as the answer, and queries with conflicting Wikipedia pages).

@KennethEnevoldsen (Contributor, Author) commented Jan 10, 2025

Updated a few cases.

I am unsure whether "StackExchangeClusteringP2P" counts as included in the training data if you train on "flax-sentence-embeddings/stackexchange_xml".

If so, most sentence-transformers models would be non-zero-shot. One solution would be to remove StackExchange from the MMTEB ("beta") benchmarks.

This kinda comes down to how we define zero-shot.

So far I have done:

Zero Shot
A model is considered zero-shot if it is not trained on other splits of the dataset used to derive the task.
E.g., if a model is trained on Natural Questions, it cannot be considered zero-shot on benchmarks containing the task "NQ" which is derived from Natural Questions.
This definition creates a few edge cases. For instance, multiple models are typically trained on Wikipedia title and body pairs, but we do not define this as leakage on, e.g., "WikipediaRetrievalMultilingual" and "WikiClusteringP2P" as these datasets are not based on title-body pairs.
Distilled, further fine-tuned, or otherwise derivative models inherit the training datasets of their parent models.
Based on community feedback and research findings, this definition could change in the future.
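The rule above can be sketched as a small check. This is an illustrative sketch with made-up function names, not mteb's actual implementation:

```python
def inherited_training_tasks(own_tasks, parent_tasks=()):
    """Derivative models (distilled, further fine-tuned, ...) inherit
    the training datasets of their parent models."""
    return set(own_tasks) | set(parent_tasks)

def is_zero_shot(training_tasks, benchmark_tasks):
    """A model is zero-shot on a benchmark if none of the benchmark's
    tasks are derived from a dataset the model was trained on."""
    return not (set(training_tasks) & set(benchmark_tasks))
```

For example, a model trained on "NQ" is not zero-shot on any benchmark containing "NQ", and a distilled student of that model inherits the same restriction.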

@x-tabdeveloping (Collaborator)

Just for your information, we will also need to rewrite the annotations I added to use task names, as I thought the dataset paths would have to be added to the models' metadata.
