Update doc to follow template in issue #753
Signed-off-by: Constantin M Adam <[email protected]>
cmadam committed Nov 26, 2024
1 parent 5a018e6 commit 6538218
Showing 4 changed files with 105 additions and 49 deletions.
38 changes: 9 additions & 29 deletions transforms/universal/doc_id/README.md
@@ -1,33 +1,13 @@
# Doc ID Transform

The Document ID transform assigns to each document in a dataset a unique identifier, including an integer ID and a
content hash, which can later be used by the exact dedup and fuzzy dedup transforms to identify and remove duplicate
documents. Per the set of [transform project conventions](../../README.md#transform-project-conventions), the following
runtimes are available:

* [python](python/README.md) - enables running the base python transform in a Python runtime
* [ray](ray/README.md) - enables running the base python transform in a Ray runtime
* [spark](spark/README.md) - enables running a spark-based transform in a Spark runtime
* [kfp](kfp_ray/README.md) - enables running the ray docker image in a kubernetes cluster using a generated `yaml` file

Please check [here](python/README.md) for a more detailed description of this transform.
89 changes: 77 additions & 12 deletions transforms/universal/doc_id/python/README.md
@@ -1,21 +1,41 @@
# Document ID Python Annotator

Please see the set of [transform project conventions](../../../README.md) for details on general project conventions,
transform configuration, testing and IDE set up.

## Contributors
- Boris Lublinsky ([email protected])

## Description

This transform assigns unique identifiers to the documents in a dataset and supports the following annotations to the
original data:
* **Adding a Document Hash** to each document: the unique hash-based ID is generated with
`hashlib.sha256(doc.encode("utf-8")).hexdigest()`. To store this hash in the data, specify the desired column name
using the `hash_column` parameter.
* **Adding an Integer Document ID** to each document: the integer ID is unique across all rows and tables processed by
the `transform()` method. To store this ID in the data, specify the desired column name using the `int_id_column`
parameter.

Document IDs are essential for tracking annotations linked to specific documents. They are also required for processes
like [fuzzy deduplication](../fdedup), which depend on the presence of integer IDs. If your dataset lacks document ID
columns, this transform can be used to generate them.

## Input Columns Used by This Transform

| Input Column Name | Data Type | Description |
|------------------------------------------------------|-----------|----------------------------------|
| Column specified by the _doc_column_ config arg      | str       | Column that stores document text |

## Output Columns Annotated by This Transform
| Output Column Name | Data Type | Description |
|--------------------|-----------|---------------------------------------------|
| hash_column | str | Unique hash assigned to each document |
| int_id_column | uint64 | Unique integer ID assigned to each document |

## Configuration and Command Line Options

The set of dictionary keys defining the [DocIDTransform](src/doc_id_transform_base.py)
configuration are as follows:

* _doc_column_ - specifies the name of the column containing the document (required for ID generation)
@@ -25,7 +45,7 @@ configuration for values are as follows:

At least one of _hash_column_ or _int_id_column_ must be specified.
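For illustration, a configuration dictionary satisfying these requirements might look like the sketch below. The key
names follow the parameter names documented above; the actual keys used by the launcher may carry a transform-specific
prefix, and the column values are hypothetical:

```python
# Illustrative Doc ID configuration; key names follow the documented
# parameters (doc_column, hash_column, int_id_column).
config = {
    "doc_column": "contents",       # input column holding the document text
    "hash_column": "doc_hash",      # output column for the sha256 hash
    "int_id_column": "doc_int_id",  # output column for the unique integer ID
}

# At least one of hash_column / int_id_column must be specified.
assert "hash_column" in config or "int_id_column" in config
```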

## Usage

### Launched Command Line Options
When running the transform with the python launcher (i.e. PythonTransformLauncher),
@@ -43,7 +63,52 @@ the following command line arguments are available in addition to
```
These correspond to the configuration keys described above.

## Building

A [docker file](Dockerfile) is provided for building the docker image. You can use

```shell
make build
```

### Running the samples
To run the samples, use the following `make` targets:

* `run-cli-sample` - runs src/doc_id_transform_python.py using command line args
* `run-local-sample` - runs src/doc_id_local_python.py

These targets will activate the virtual environment and set up any configuration needed.
Use the `-n` option of `make` to see the detail of what is done to run the sample.

For example,
```shell
make run-cli-sample
...
```
Then
```shell
ls output
```
to see the results of the transform.

### Code example

TBD
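In the meantime, here is a minimal standard-library sketch of the annotation logic described above. It is not the
transform's actual API; the `annotate` helper, the `contents` field, and the default column names are illustrative
only. The key properties it demonstrates are the documented sha256 hash and an integer ID that stays unique across
multiple tables:

```python
import hashlib
from itertools import count

def annotate(docs, ids, hash_column="hash_column", int_id_column="int_id_column"):
    """Annotate each document dict with a sha256 content hash and a unique integer ID."""
    out = []
    for doc in docs:
        text = doc["contents"]
        out.append({
            **doc,
            # Hash computed exactly as documented for this transform.
            hash_column: hashlib.sha256(text.encode("utf-8")).hexdigest(),
            # The counter is shared across calls, so IDs stay unique across tables.
            int_id_column: next(ids),
        })
    return out

ids = count()
table1 = annotate([{"contents": "hello"}, {"contents": "world"}], ids)
table2 = annotate([{"contents": "hello"}], ids)
```

Note that identical documents receive identical hashes (which is what enables exact dedup), while integer IDs keep
increasing across tables.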

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.

## Testing

Testing follows [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md).

Currently we have:
- [Unit test](test/test_doc_id_python.py)
- [Integration test](test/test_doc_id.py)
10 changes: 9 additions & 1 deletion transforms/universal/doc_id/ray/README.md
@@ -1,10 +1,18 @@
# Document ID Ray Annotator

Please see the set of
[transform project conventions](../../../README.md)
for details on general project conventions, transform configuration,
testing and IDE set up.

## Summary
This project wraps the Document ID transform with a Ray runtime.

## Configuration and command line Options
Document ID configuration and command line options are the same as for the base python
transform.


## Building

A [docker file](Dockerfile) is provided for building the docker image. You can use `make build`.
17 changes: 10 additions & 7 deletions transforms/universal/doc_id/spark/README.md
@@ -6,23 +6,26 @@ testing and IDE set up.

## Summary

This transform assigns a unique integer ID to each row in a Spark DataFrame. It relies on the
[monotonically_increasing_id](https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html)
pyspark function to generate the unique integer IDs. As described in the documentation of this function:
> The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
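To make the "monotonic but not consecutive" behavior concrete, here is a small standard-library sketch of the ID
scheme the Spark documentation describes for this function (the current implementation places the partition ID in the
upper 31 bits and the per-partition record number in the lower 33 bits); `monotonic_ids` is an illustrative helper,
not part of this transform:

```python
def monotonic_ids(partition_sizes):
    """Mimic monotonically_increasing_id: partition index in the upper 31 bits,
    row number within the partition in the lower 33 bits."""
    ids = []
    for part, n_rows in enumerate(partition_sizes):
        for row in range(n_rows):
            ids.append((part << 33) | row)
    return ids

# Two partitions with 3 and 2 rows: IDs jump at the partition boundary.
ids = monotonic_ids([3, 2])
```

The resulting IDs are strictly increasing and unique, but the second partition starts at `2**33`, so they are not
consecutive.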
## Configuration and command line Options

The set of dictionary keys holding the [DocIdTransform](src/doc_id_transform.py) configuration are as follows:

* _doc_id_column_name_ - specifies the name of the DataFrame column that holds the generated document IDs.

## Running
You can run the [doc_id_local.py](src/doc_id_local_spark.py) (spark-based implementation) to transform the
`test1.parquet` file in [test input data](test-data/input) to an `output` directory. The directory will contain both
the new annotated `test1.parquet` file and the `metadata.json` file.

### Launched Command Line Options
When running the transform with the Spark launcher (i.e. SparkTransformLauncher), the following command line arguments
are available in addition to the options provided by the
[python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).

```
--doc_id_column_name DOC_ID_COLUMN_NAME
```
