Update doc to follow template in issue #753
Signed-off-by: Constantin M Adam <[email protected]>
Showing 4 changed files with 105 additions and 49 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,33 +1,13 @@ | ||
# Doc ID Transform | ||
|
||
The Document ID transforms adds a document identification (unique integers and content hashes), which later can be | ||
used in de-duplication operations, per the set of | ||
[transform project conventions](../../README.md#transform-project-conventions) | ||
the following runtimes are available: | ||
The Document ID transform assigns to each document in a dataset a unique identifier, including an integer ID and a | ||
content hash, which can later be used by the exact dedup and fuzzy dedup transforms to identify and remove duplicate | ||
documents. Per the set of [transform project conventions](../../README.md#transform-project-conventions), the following | ||
runtimes are available: | ||
|
||
* [pythom](python/README.md) - enables the running of the base python transformation | ||
in a Python runtime | ||
* [ray](ray/README.md) - enables the running of the base python transformation | ||
in a Ray runtime | ||
* [spark](spark/README.md) - enables the running of a spark-based transformation | ||
in a Spark runtime. | ||
* [kfp](kfp_ray/README.md) - enables running the ray docker image | ||
in a kubernetes cluster using a generated `yaml` file. | ||
|
||
## Summary | ||
|
||
This transform annotates documents with document "ids". | ||
It supports the following transformations of the original data: | ||
* Adding document hash: this enables the addition of a document hash-based id to the data. | ||
The hash is calculated with `hashlib.sha256(doc.encode("utf-8")).hexdigest()`. | ||
To enable this annotation, set `hash_column` to the name of the column, | ||
where you want to store it. | ||
* Adding integer document id: this allows the addition of an integer document id to the data that | ||
is unique across all rows in all tables provided to the `transform()` method. | ||
To enable this annotation, set `int_id_column` to the name of the column, where you want | ||
to store it. | ||
|
||
Document IDs are generally useful for tracking annotations to specific documents. Additionally | ||
[fuzzy deduping](../fdedup) relies on integer IDs to be present. If your dataset does not have | ||
document ID column(s), you can use this transform to create ones. | ||
* [python](python/README.md) - enables running the base python transform in a Python runtime | ||
* [ray](ray/README.md) - enables running the base python transform in a Ray runtime | ||
* [spark](spark/README.md) - enables running a spark-based transform in a Spark runtime | ||
* [kfp](kfp_ray/README.md) - enables running the ray docker image in a Kubernetes cluster using a generated `yaml` file | ||
|
||
Please check [here](python/README.md) for a more detailed description of this transform. |
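
To illustrate why a content hash is useful for exact deduplication, here is a minimal sketch. It reuses the `hashlib.sha256(doc.encode("utf-8")).hexdigest()` expression quoted in the Python README of this commit; the `doc_hash` and `exact_dedup` helpers and the sample documents are hypothetical and are not part of the repository's dedup transforms.

```python
import hashlib


def doc_hash(doc: str) -> str:
    # Same hashing expression as quoted in the transform's README.
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()


def exact_dedup(docs):
    # Hypothetical helper: keep the first document seen for each hash.
    seen = set()
    unique = []
    for doc in docs:
        h = doc_hash(doc)
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique


docs = ["alpha", "beta", "alpha"]
unique_docs = exact_dedup(docs)
```

Identical text always produces the same hash, so comparing hashes is enough to detect exact duplicates without comparing full document bodies.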
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,21 +1,41 @@ | ||
# Document ID Python Annotator | ||
|
||
Please see the set of | ||
[transform project conventions](../../../README.md) | ||
for details on general project conventions, transform configuration, | ||
testing and IDE set up. | ||
Please see the set of [transform project conventions](../../../README.md) for details on general project conventions, | ||
transform configuration, testing and IDE set up. | ||
|
||
## Building | ||
## Contributors | ||
- Boris Lublinsky ([email protected]) | ||
|
||
A [docker file](Dockerfile) that can be used for building docker image. You can use | ||
## Description | ||
|
||
```shell | ||
make build | ||
``` | ||
This transform assigns unique identifiers to the documents in a dataset and supports the following annotations to the | ||
original data: | ||
* **Adding a Document Hash** to each document. The unique hash-based ID is generated using | ||
`hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To store this hash in the data, specify the desired column name using | ||
the `hash_column` parameter. | ||
* **Adding an Integer Document ID** to each document. The integer ID is unique across all rows and tables processed by | ||
the `transform()` method. To store this ID in the data, specify the desired column name using the `int_id_column` | ||
parameter. | ||
|
||
Document IDs are essential for tracking annotations linked to specific documents. They are also required for processes | ||
like [fuzzy deduplication](../fdedup), which depend on the presence of integer IDs. If your dataset lacks document ID | ||
columns, this transform can be used to generate them. | ||
|
||
## Input Columns Used by This Transform | ||
|
||
| Input Column Name | Data Type | Description | | ||
|------------------------------------------------------|-----------|----------------------------------| | ||
| Column specified by the _contents_column_ config arg | str | Column that stores document text | | ||
|
||
## Output Columns Annotated by This Transform | ||
| Output Column Name | Data Type | Description | | ||
|--------------------|-----------|---------------------------------------------| | ||
| hash_column | str | Unique hash assigned to each document | | ||
| int_id_column | uint64 | Unique integer ID assigned to each document | | ||
|
||
## Configuration and command line Options | ||
## Configuration and Command Line Options | ||
|
||
The set of dictionary keys defined in [DocIDTransform](src/doc_id_transform_ray.py) | ||
The set of dictionary keys defined in [DocIDTransform](src/doc_id_transform_base.py) | ||
configuration for values are as follows: | ||
|
||
* _doc_column_ - specifies name of the column containing the document (required for ID generation) | ||
|
@@ -25,7 +45,7 @@ configuration for values are as follows: | |
|
||
At least one of _hash_column_ or _int_id_column_ must be specified. | ||
|
||
## Running | ||
## Usage | ||
|
||
### Launched Command Line Options | ||
When running the transform with the Ray launcher (i.e. TransformLauncher), | ||
|
@@ -43,7 +63,52 @@ the following command line arguments are available in addition to | |
``` | ||
These correspond to the configuration keys described above. | ||
|
||
To use the transform image to transform your data, please refer to the | ||
[running images quickstart](../../../../doc/quick-start/run-transform-image.md), | ||
substituting the name of this transform image and runtime as appropriate. | ||
|
||
## Building | ||
|
||
A [docker file](Dockerfile) can be used for building the docker image. You can use | ||
|
||
```shell | ||
make build | ||
``` | ||
|
||
### Running the samples | ||
To run the samples, use the following `make` targets: | ||
|
||
* `run-cli-sample` - runs src/doc_id_transform_python.py using command line args | ||
* `run-local-sample` - runs src/doc_id_local_python.py | ||
|
||
These targets will activate the virtual environment and set up any configuration needed. | ||
Use the `-n` option of `make` to see the detail of what is done to run the sample. | ||
|
||
For example, | ||
```shell | ||
make run-cli-sample | ||
... | ||
``` | ||
Then | ||
```shell | ||
ls output | ||
``` | ||
to see the results of the transform. | ||
|
||
### Code example | ||
|
||
TBD | ||
|
||
### Transforming data using the transform image | ||
|
||
To use the transform image to transform your data, please refer to the | ||
[running images quickstart](../../../../doc/quick-start/run-transform-image.md), | ||
substituting the name of this transform image and runtime as appropriate. | ||
|
||
## Testing | ||
|
||
Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md) | ||
|
||
Currently we have: | ||
- [Unit test](test/test_doc_id_python.py) | ||
- [Integration test](test/test_doc_id.py) |
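
The annotation behaviour described in the Python README above can be sketched roughly as follows. This is a standalone illustration that uses plain dictionaries instead of the project's table classes; the `annotate` function, its `start_id` argument, and the sample column names are assumptions made for the example and do not reflect the actual `DocIDTransform` implementation. Only the hashing expression and the rule that at least one of `hash_column` or `int_id_column` must be set come from the README.

```python
import hashlib


def annotate(rows, doc_column, hash_column=None, int_id_column=None, start_id=0):
    # Hypothetical helper, not the project's API.
    # Mirrors the documented rule: at least one output column is required.
    if hash_column is None and int_id_column is None:
        raise ValueError("At least one of hash_column or int_id_column must be specified")
    annotated = []
    next_id = start_id
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        if hash_column is not None:
            text = row[doc_column]
            new_row[hash_column] = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if int_id_column is not None:
            new_row[int_id_column] = next_id
            next_id += 1
        annotated.append(new_row)
    # Return the next free ID so that IDs stay unique across tables.
    return annotated, next_id


table_a = [{"contents": "first doc"}, {"contents": "second doc"}]
table_b = [{"contents": "third doc"}]

out_a, next_id = annotate(table_a, "contents", "doc_hash", "doc_id")
out_b, next_id = annotate(table_b, "contents", "doc_hash", "doc_id", start_id=next_id)
```

Carrying the counter from one call to the next is one simple way to keep integer IDs unique across several tables; a distributed runtime would need a shared ID source, which this sketch does not attempt to model.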