Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MITIE as library #1105

Merged
merged 5 commits into from
Jan 15, 2025
Merged

Add MITIE as library #1105

merged 5 commits into from
Jan 15, 2025

Conversation

tymzar
Copy link
Contributor

@tymzar tymzar commented Jan 14, 2025

This PR ensures download stats work for the MITIE models.

I would much appreciate it if you would consider the following change!

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @tymzar , thanks for opening this PR! I've left some comments below regarding the download counter. I based my review on the repo listed here: https://huggingface.co/models?other=mitie

@@ -448,6 +448,12 @@ export const MODEL_LIBRARIES_UI_ELEMENTS = {
repoName: "mindspore",
repoUrl: "https://github.com/mindspore-ai/mindspore",
},
mitie:{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this entry below between "mesh-anything" and "ml-agents" to preserve alphabetical order? Thanks! 🙏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Np, will do!

prettyLabel: "MITIE",
repoName: "MITIE",
repoUrl: "https://github.com/mit-nlp/MITIE",
countDownloads: "path_extension:dat OR path_filename:substrings OR path_filename:top_words",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a file that could be considered as "the file that is always downloaded when a user wants to instantiate the model" ? The problem with counting downloads on all .dat files is that someone downloading the whole repo (happening quite often) would count as 5 downloads. I would recommend counting downloads only on a single .dat file, preferably the one where you are sure users will download.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw in the example notebook that one can save the model using ner.save_to_disk("./output/ner_model.dat"). Would this ner_model.dat be the perfect candidate to count downloads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wauplin, what do you think about such a query? This way, we will account for all the embedding modes, and there will be a required prefix for all the models trained on top of them.

(path_extension:dat OR path_extension:txt)
AND
(path_filename:mitie_model*
    OR path_filename:substring_set
    OR path_filename:top_word_counts
    OR path_filename:total_word_feature_extractor
    OR path_filename:word_vects
    OR path_filename:substrings
    OR path_filename:top_words
    OR path_filename:word_morph_feature_extractor)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we do that, a user downloading the whole repo will be counted 8 times in the download counter. This is not what we want otherwise the download counter gets completely biased. It would be preferable to have something like

Suggested change
countDownloads: "path_extension:dat OR path_filename:substrings OR path_filename:top_words",
countDownloads: `path:"ner_model.dat"`

which will encourage everyone using MITIE to upload their model with the same naming convention. Usually libraries implements a simple push_to_hub method that handles the upload (and therefore file naming) the exact same way for everyone. And then you can implement a from_pretrained to load a NER model from a MITIE repo. Naming convention is important in this case so that all MITIE repos can be loaded the same way, which is beneficial for us to count downloads appropriately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need these from_pretrained / push_to_hub methods to be implemented right now, just to define the naming convention first and document it in the MITIE repo on Github

Copy link
Contributor Author

@tymzar tymzar Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wauplin I would agree, but in this case, the library allows for training a few NLP models. Almost all of them are built on top of the total_word_feature_extractor.dat, and we can establish a common prefix for mitie models as mitie_model* (ex. mitie_model_ner_model.dat). Does that sound right?

Suggested change
countDownloads: "path_extension:dat OR path_filename:substrings OR path_filename:top_words",
countDownloads: "(path_extension:dat) AND (path_filename:mitie_model* OR path_filename:total_word_feature_extractor)",

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost all of them are built on top of the total_word_feature_extractor.dat

In that case

Suggested change
countDownloads: "path_extension:dat OR path_filename:substrings OR path_filename:top_words",
countDownloads: `path:"total_word_feature_extractor.dat"`,

should be enough, right?
Otherwise, having the convention that all models start with mitie_model is fine as well. Or you could also say that a mitie model is always saved as xxx.mitie file (which is under the hood a .dat file which explicitly for mitie library). This way, no need for a wildcard * which we try to avoid for performance reasons (there is only one in open_clip and I'm actually not sure it works as expected...).

Copy link
Contributor Author

@tymzar tymzar Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Wauplin Gotcha. I just checked, and the library supports loading models with different extensions. I just pushed your suggestion, Tysm!

@tymzar tymzar force-pushed the main branch 2 times, most recently from 4506f89 to 0f1d251 Compare January 15, 2025 12:16
Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! Time to merge now. Thanks for the iterations 🤗

packages/tasks/src/model-libraries.ts Outdated Show resolved Hide resolved
@Wauplin Wauplin merged commit 1a20326 into huggingface:main Jan 15, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants