
fix: added annotations for training data #1742

Open
wants to merge 7 commits into
base: main
Conversation

@KennethEnevoldsen (Contributor) commented Jan 9, 2025

Added training data annotations for a variety of models.

This was quite hard to do, I must say, so something might be wrong. A review would be great, especially of the sentence-embedding training data, since that would resolve a lot of downstream cases.

We should probably tag relevant model authors here as well.

@Muennighoff I am unsure whether the NQ test split on mteb corresponds to the train/dev split of Natural Questions. Can you also take a look at the StackExchange cases?

If the split wasn't annotated, I assumed that the model was trained on the full dataset (including test).

Addresses #1720
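For context, a training-data annotation of this kind could look roughly like the following. This is a hypothetical sketch: the variable name, the mapping shape (task name to trained splits), and the helper function are assumptions for illustration, not necessarily the exact schema used in mteb's `ModelMeta`.

```python
# Hypothetical sketch of a training-data annotation for one model.
# The shape (MTEB task name -> list of trained splits) is an assumption,
# not necessarily mteb's actual schema.
training_datasets = {
    "NQ": ["train"],       # trained only on the train split
    "MSMARCO": ["train"],
    # Split not documented by the authors: assume the full dataset,
    # including the test split, as described above.
    "StackExchangeClusteringP2P": ["train", "test"],
}

def trained_splits(task_name: str) -> list[str]:
    """Return the splits of `task_name` the model was trained on (empty if none)."""
    return training_datasets.get(task_name, [])
```

With a mapping like this, downstream tooling can check whether a given task's splits overlap with a model's training data.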

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

@Samoed (Collaborator) left a comment

Awesome!

mteb/models/openai_models.py
@Muennighoff (Contributor)

Re: NQ — I think it is probably a subset of the dev split (filtering out queries without an answer, queries with a table as the answer, and queries with conflicting Wikipedia pages).

@KennethEnevoldsen (Contributor, Author) commented Jan 10, 2025

Updated a few cases.

I am unsure whether "StackExchangeClusteringP2P" counts as included in the training data if you train on "flax-sentence-embeddings/stackexchange_xml".

If so, most sentence-transformers models would be non-zero-shot. One solution would be to remove StackExchange from the MMTEB ("beta") benchmarks.

This kinda comes down to how we define zero-shot.

So far I have done:

Zero Shot
A model is considered zero-shot if it is not trained on other splits of the dataset used to derive the task.
E.g., if a model is trained on Natural Questions, it cannot be considered zero-shot on benchmarks containing the task "NQ" which is derived from Natural Questions.
This definition creates a few edge cases. For instance, multiple models are typically trained on Wikipedia title and body pairs, but we do not define this as leakage on, e.g., "WikipediaRetrievalMultilingual" and "WikiClusteringP2P" as these datasets are not based on title-body pairs.
Distilled, further fine-tuned, or otherwise derivative models inherit the training datasets of their parent models.
Based on community feedback and research findings, this definition could change in the future.
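The rule above can be sketched as a small check. This is an illustrative sketch with made-up function names, not mteb's actual implementation:

```python
def inherited_training_tasks(own_tasks, parent_tasks=()):
    """Derivative models (distilled, further fine-tuned, ...) inherit
    the training datasets of their parent models."""
    return set(own_tasks) | set(parent_tasks)

def is_zero_shot(training_tasks, benchmark_tasks):
    """A model is zero-shot on a benchmark if none of the benchmark's
    tasks are derived from a dataset the model was trained on."""
    return not (set(training_tasks) & set(benchmark_tasks))
```

For example, a model trained on "NQ" is not zero-shot on any benchmark containing "NQ", and a distilled student of that model inherits the same restriction.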

@x-tabdeveloping (Collaborator)

Just for your information, we will also need to rewrite the annotations I added to use task names, as I thought the dataset paths would have to be added to the models' metadata.
