diff --git a/transforms/language/doc_quality/python/README.md b/transforms/language/doc_quality/python/README.md
index 38421f34f..f3944cdc0 100644
--- a/transforms/language/doc_quality/python/README.md
+++ b/transforms/language/doc_quality/python/README.md
@@ -1,13 +1,21 @@
 # Document Quality Transform
+
 Please see the set of
 [transform project conventions](../../../README.md#transform-project-conventions)
 for details on general project conventions, transform configuration,
 testing and IDE set up.
 
-## Summary
-This transform will calculate and annotate several metrics related to document, which are usuful to see the quality of document.
+## Description
+This transform will calculate and annotate several metrics related to document, which are useful to see the quality of document.
+Text is the type of data this transform operates on.
+
+### Input
 
-In this transform, following metrics will be included:
+| input column name | data type | description |
+|-|-|-|
+| the one specified in _doc_content_column_ configuration | string | text whose quality will be calculated by this transform |
+
+### Output columns annotated by this transform
 
 | output column name | data type | description | supported language |
 |-|-|-|-|
@@ -27,7 +35,7 @@ In this transform, following metrics will be included:
 
 You can see more detailed backgrounds of some columns in [Deepmind's Gopher paper](https://arxiv.org/pdf/2112.11446.pdf)
 
-## Configuration and command line Options
+## Configuration
 
 The set of dictionary keys holding [DocQualityTransform](src/doc_quality_transform.py)
 configuration for values are as follows:
@@ -36,13 +44,19 @@ configuration for values are as follows:
 * _doc_content_column_ - specifies column name that contains document text. By default, "contents" is used.
 * _bad_word_filepath_ - specifies a path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words.
 
-## Running
+Example
+```
+{
+    text_lang_key: "en",
+    doc_content_column_key: "contents",
+    bad_word_filepath_key: os.path.join(basedir, "ldnoobw", "en"),
+}
+```
+
+## Usage
 
 ### Launched Command Line Options
-When running the transform with the Ray launcher (i.e. TransformLauncher),
-the following command line arguments are available in addition to
-the options provided by
-the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).
+The following command line arguments are available
 ```
  --docq_text_lang DOCQ_TEXT_LANG language used in the text content. By default, "en" is used.
  --docq_doc_content_column DOCQ_DOC_CONTENT_COLUMN column name that contain document text. By default, "contents" is used.
@@ -70,6 +84,9 @@ ls output
 ```
 To see results of the transform.
 
+### Code example
+
+TBD (link to the notebook will be provided)
 
 ### Transforming data using the transform image
 
@@ -77,7 +94,27 @@ To use the transform image to transform your data, please refer to the
 [running images quickstart](../../../../doc/quick-start/run-transform-image.md),
 substituting the name of this transform image and runtime as appropriate.
 
+## Testing
+
+Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md)
+
+Currently we have:
+- [Unit test](test/test_doc_quality_python.py)
+- [Integration test](test/test_doc_quality.py)
+
+
+## Further Resource
+
+- For those who want to learn C4 heuristic rules
+  - https://arxiv.org/pdf/1910.10683.pdf
+- For those who want to learn Gopher statistics
+  - https://arxiv.org/pdf/2112.11446.pdf
+- For those who want to see the source of badwords used by default
+  - https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
+
+
+## Consideration
 
-## Troubleshooting guide
+### Troubleshooting guide
 
 For M1 Mac user, if you see following error during make command, `error: command '/usr/bin/clang' failed with exit code 1`, you may better follow [this step](https://freeman.vc/notes/installing-fasttext-on-an-m1-mac)
\ No newline at end of file