-
Notifications
You must be signed in to change notification settings - Fork 269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MITIE as library #1105
Add MITIE as library #1105
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tymzar , thanks for opening this PR! I've left some comments below regarding the download counter. I based my review on the repo listed here: https://huggingface.co/models?other=mitie
@@ -448,6 +448,12 @@ export const MODEL_LIBRARIES_UI_ELEMENTS = { | |||
repoName: "mindspore", | |||
repoUrl: "https://github.com/mindspore-ai/mindspore", | |||
}, | |||
mitie:{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this entry below between "mesh-anything" and "ml-agents" to preserve alphabetical order? Thanks! 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Np, will do!
prettyLabel: "MITIE", | ||
repoName: "MITIE", | ||
repoUrl: "https://github.com/mit-nlp/MITIE", | ||
countDownloads: "path_extension:dat OR path_filename:substrings OR path_filename:top_words", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a file that could be considered as "the file that is always downloaded when a user wants to instantiate the model" ? The problem with counting downloads on all .dat
files is that someone downloading the whole repo (happening quite often) would count as 5 downloads. I would recommend counting downloads only on a single .dat
file, preferably the one where you are sure users will download.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw in the example notebook that one can save the model using ner.save_to_disk("./output/ner_model.dat")
. Would this ner_model.dat
be the perfect candidate to count downloads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wauplin, what do you think about such a query? This way, we will account for all the embedding modes, and there will be a required prefix for all the models trained on top of them.
(path_extension:dat OR path_extension:txt)
AND
(path_filename:mitie_model*
OR path_filename:substring_set
OR path_filename:top_word_counts
OR path_filename:total_word_feature_extractor
OR path_filename:word_vects
OR path_filename:substrings
OR path_filename:top_words
OR path_filename:word_morph_feature_extractor)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we do that, a user downloading the whole repo will be counted 8 times in the download counter. This is not what we want otherwise the download counter gets completely biased. It would be preferable to have something like
countDownloads: "path_extension:dat OR path_filename:substrings OR path_filename:top_words", | |
countDownloads: `path:"ner_model.dat"` |
which will encourage everyone using MITIE to upload their model with the same naming convention. Usually libraries implements a simple push_to_hub
method that handles the upload (and therefore file naming) the exact same way for everyone. And then you can implement a from_pretrained
to load a NER model from a MITIE repo. Naming convention is important in this case so that all MITIE repos can be loaded the same way, which is beneficial for us to count downloads appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need these from_pretrained
/ push_to_hub
methods to be implemented right now, just to define the naming convention first and document it in the MITIE repo on Github
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wauplin I would agree, but in this case, the library allows for training a few NLP models. Almost all of them are built on top of the total_word_feature_extractor.dat, and we can establish a common prefix for mitie models as mitie_model* (ex. mitie_model_ner_model.dat). Does that sound right?
countDownloads: "path_extension:dat OR path_filename:substrings OR path_filename:top_words", | |
countDownloads: "(path_extension:dat) AND (path_filename:mitie_model* OR path_filename:total_word_feature_extractor)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost all of them are built on top of the total_word_feature_extractor.dat
In that case
countDownloads: "path_extension:dat OR path_filename:substrings OR path_filename:top_words", | |
countDownloads: `path:"total_word_feature_extractor.dat"`, |
should be enough, right?
Otherwise, having the convention that all models start with mitie_model
is fine as well. Or you could also say that a mitie model is always saved as xxx.mitie
file (which is under the hood a .dat file which explicitly for mitie library). This way, no need for a wildcard *
which we try to avoid for performance reasons (there is only one in open_clip
and I'm actually not sure it works as expected...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Wauplin Gotcha. I just checked, and the library supports loading models with different extensions. I just pushed your suggestion, Tysm!
4506f89
to
0f1d251
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect! Time to merge now. Thanks for the iterations 🤗
This PR ensures download stats work for the MITIE models.
I would much appreciate it if you would consider the following change!