The Language Identification transforms serves as a simple exemplar to demonstrate the development
of a simple 1:1 transform.
Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.
This transform will identify language of each text with confidence score with fasttext language identification model. ref
The set of dictionary keys holding LangIdentificationTransform configuration for values are as follows:
Key name | Default | Description |
---|---|---|
model_credential | unset | specifies the credential you use to get model. This will be huggingface token. Guide to get huggingface token |
model_kind | unset | specifies what kind of model you want to use for language identification. Currently, only fasttext is available. |
model_url | unset | specifies url that model locates. For fasttext, this will be repo nme of the model, like facebook/fasttext-language-identification |
content_column_name | contents |
specifies name of the column containing documents |
output_lang_column_name | lang |
specifies name of the output column to hold predicted language code |
output_score_column_name | score |
specifies name of the output column to hold score of prediction |
The following command line arguments are available in addition to the options provided by the python launcher options.
--lang_id_model_credential LANG_ID_MODEL_CREDENTIAL the credential you use to get model. This will be huggingface token.
--lang_id_model_kind LANG_ID_MODEL_KIND what kind of model you want to use for language identification. Currently, only `fasttext` is available.
--lang_id_model_url LANG_ID_MODEL_URL url that model locates. For fasttext, this will be repo name of the model, like `facebook/fasttext-language-identification`
--lang_id_content_column_name LANG_ID_CONTENT_COLUMN_NAME A name of the column containing documents
--lang_id_output_lang_column_name LANG_ID_OUTPUT_LANG_COLUMN_NAME Column name to store identified language
--lang_id_output_score_column_name LANG_ID_OUTPUT_SCORE_COLUMN_NAME Column name to store the score of language identification
These correspond to the configuration keys described above.
Here is a sample notebook
For M1 Mac user, if you see following error during make command, error: command '/usr/bin/clang' failed with exit code 1
, you should follow this step
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.
Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.
This project wraps the language identification transform with a Ray runtime.
Language Identification configuration and command line options are the same as for the base python transform.
In addition to those available to the transform as defined here, the set of ray launcher options are available.
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.