Update doc to follow template in issue #753
Signed-off-by: Constantin M Adam <[email protected]>
cmadam committed Nov 26, 2024
1 parent 5a018e6 commit 6538218
Showing 4 changed files with 105 additions and 49 deletions.
38 changes: 9 additions & 29 deletions transforms/universal/doc_id/README.md
@@ -1,33 +1,13 @@
# Doc ID Transform

The Document ID transform assigns to each document in a dataset a unique identifier, including an integer ID and a
content hash, which can later be used by the exact dedup and fuzzy dedup transforms to identify and remove duplicate
documents. Per the set of [transform project conventions](../../README.md#transform-project-conventions), the following
runtimes are available:

* [python](python/README.md) - enables running the base python transform in a Python runtime
* [ray](ray/README.md) - enables running the base python transform in a Ray runtime
* [spark](spark/README.md) - enables running a spark-based transform in a Spark runtime
* [kfp](kfp_ray/README.md) - enables running the ray docker image in a kubernetes cluster using a generated `yaml` file

Please check [here](python/README.md) for a more detailed description of this transform.
89 changes: 77 additions & 12 deletions transforms/universal/doc_id/python/README.md
@@ -1,21 +1,41 @@
# Document ID Python Annotator

Please see the set of [transform project conventions](../../../README.md) for details on general project conventions,
transform configuration, testing and IDE set up.

## Contributors
- Boris Lublinsky ([email protected])

## Description

This transform assigns unique identifiers to the documents in a dataset and supports the following annotations to the
original data:
* **Adding a Document Hash** to each document: the unique hash-based ID is generated with
`hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To store this hash in the data, specify the desired column name
using the `hash_column` parameter.
* **Adding an Integer Document ID** to each document: the integer ID is unique across all rows and tables processed by
the `transform()` method. To store this ID in the data, specify the desired column name using the `int_id_column`
parameter.

Document IDs are essential for tracking annotations linked to specific documents. They are also required for processes
like [fuzzy deduplication](../fdedup), which depend on the presence of integer IDs. If your dataset lacks document ID
columns, this transform can be used to generate them.

## Input Columns Used by This Transform

| Input Column Name | Data Type | Description |
|------------------------------------------------------|-----------|----------------------------------|
| Column specified by the _doc_column_ config arg      | str       | Column that stores document text |

## Output Columns Annotated by This Transform
| Output Column Name | Data Type | Description |
|--------------------|-----------|---------------------------------------------|
| hash_column | str | Unique hash assigned to each document |
| int_id_column | uint64 | Unique integer ID assigned to each document |

## Configuration and Command Line Options

The set of dictionary keys defining the [DocIDTransform](src/doc_id_transform_base.py)
configuration are as follows:

* _doc_column_ - specifies the name of the column containing the document (required for ID generation)
@@ -25,7 +45,7 @@ configuration for values are as follows:

At least one of _hash_column_ or _int_id_column_ must be specified.
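For illustration, a configuration dictionary satisfying these requirements might look like the sketch below. The key
names follow the parameter names documented above; the actual keys used by the launcher may carry a transform-specific
prefix, and the column values are hypothetical:

```python
# Illustrative Doc ID configuration; key names follow the documented
# parameters (doc_column, hash_column, int_id_column).
config = {
    "doc_column": "contents",       # input column holding the document text
    "hash_column": "doc_hash",      # output column for the sha256 hash
    "int_id_column": "doc_int_id",  # output column for the unique integer ID
}

# At least one of hash_column / int_id_column must be specified.
assert "hash_column" in config or "int_id_column" in config
```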

## Usage

### Launched Command Line Options
When running the transform with the python launcher (i.e. PythonTransformLauncher),
@@ -43,7 +63,52 @@ the following command line arguments are available in addition to
```
These correspond to the configuration keys described above.

## Building

A [docker file](Dockerfile) is provided for building the docker image. You can use

```shell
make build
```

### Running the samples
To run the samples, use the following `make` targets:

* `run-cli-sample` - runs src/doc_id_transform_python.py using command line args
* `run-local-sample` - runs src/doc_id_local_python.py

These targets will activate the virtual environment and set up any configuration needed.
Use the `-n` option of `make` to see the detail of what is done to run the sample.

For example,
```shell
make run-cli-sample
...
```
Then
```shell
ls output
```
to see the results of the transform.

### Code example

TBD
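In the meantime, here is a minimal standard-library sketch of the annotation logic described above. It is not the
transform's actual API; the `annotate` helper, the `contents` field, and the default column names are illustrative
only. The key properties it demonstrates are the documented sha256 hash and an integer ID that stays unique across
multiple tables:

```python
import hashlib
from itertools import count

def annotate(docs, ids, hash_column="hash_column", int_id_column="int_id_column"):
    """Annotate each document dict with a sha256 content hash and a unique integer ID."""
    out = []
    for doc in docs:
        text = doc["contents"]
        out.append({
            **doc,
            # Hash computed exactly as documented for this transform.
            hash_column: hashlib.sha256(text.encode("utf-8")).hexdigest(),
            # The counter is shared across calls, so IDs stay unique across tables.
            int_id_column: next(ids),
        })
    return out

ids = count()
table1 = annotate([{"contents": "hello"}, {"contents": "world"}], ids)
table2 = annotate([{"contents": "hello"}], ids)
```

Note that identical documents receive identical hashes (which is what enables exact dedup), while integer IDs keep
increasing across tables.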

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.

## Testing

Testing follows [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md).

Currently we have:
- [Unit test](test/test_doc_id_python.py)
- [Integration test](test/test_doc_id.py)
10 changes: 9 additions & 1 deletion transforms/universal/doc_id/ray/README.md
@@ -1,10 +1,18 @@
# Document ID Ray Annotator

Please see the set of
[transform project conventions](../../../README.md)
for details on general project conventions, transform configuration,
testing and IDE set up.

## Summary
This project wraps the Document ID transform with a Ray runtime.

## Configuration and command line Options
Document ID configuration and command line options are the same as for the base python
transform.


## Building

A [docker file](Dockerfile) is provided for building the docker image. You can use `make build`.
17 changes: 10 additions & 7 deletions transforms/universal/doc_id/spark/README.md
@@ -6,23 +6,26 @@ testing and IDE set up.

## Summary

This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the
[monotonically_increasing_id](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html)
pyspark function to generate the unique integer IDs. As described in the documentation of this function:
> The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
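To make the "monotonic but not consecutive" behavior concrete, here is a small standard-library sketch of the ID
scheme the Spark documentation describes for this function (the current implementation places the partition ID in the
upper 31 bits and the per-partition record number in the lower 33 bits); `monotonic_ids` is an illustrative helper,
not part of this transform:

```python
def monotonic_ids(partition_sizes):
    """Mimic monotonically_increasing_id: partition index in the upper 31 bits,
    row number within the partition in the lower 33 bits."""
    ids = []
    for part, n_rows in enumerate(partition_sizes):
        for row in range(n_rows):
            ids.append((part << 33) | row)
    return ids

# Two partitions with 3 and 2 rows: IDs jump at the partition boundary.
ids = monotonic_ids([3, 2])
```

The resulting IDs are strictly increasing and unique, but the second partition starts at `2**33`, so they are not
consecutive.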
## Configuration and command line Options

The set of dictionary keys holding the [DocIdTransform](src/doc_id_transform.py) configuration are as follows:

* _doc_id_column_name_ - specifies the name of the DataFrame column that holds the generated document IDs.

## Running
You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the
`test1.parquet` file in [test input data](test-data/input) to an `output` directory. The directory will contain both
the new annotated `test1.parquet` file and the `metadata.json` file.

### Launched Command Line Options
When running the transform with the Spark launcher (i.e. SparkTransformLauncher), the following command line arguments
are available in addition to the options provided by the
[python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).

```
--doc_id_column_name DOC_ID_COLUMN_NAME
```
