Commit
Merge pull request #895 from matouma/pii-named-modules
Refactoring pii_redactor as its own dpk_ named module
Showing 39 changed files with 731 additions and 560 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
REPOROOT=../../.. | ||
# Use make help, to see the available rules | ||
include $(REPOROOT)/transforms/.make.cicd.targets | ||
|
||
# | ||
# This is intended to be included across the Makefiles provided within | ||
# a given transform's directory tree, so must use compatible syntax. | ||
# | ||
################################################################################ | ||
# This defines the name of the transform and is used to match against | ||
# expected files and is used to define the transform's image name. | ||
TRANSFORM_NAME=$(shell basename `pwd`) | ||
|
||
################################################################################ | ||
|
||
|
||
publish: | ||
@echo "Skip... do nothing! pushing CI/CD over a cliff with OSError on text_encoder " |
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,13 +1,105 @@ | ||
|
||
|
||
# PII Redactor Transform | ||
|
||
* [python](python/README.md) - provides the base python-based transformation | ||
implementation. | ||
* [ray](ray/README.md) - enables the running of the base python transformation | ||
in a Ray runtime | ||
* [kfp](kfp_ray/README.md) - enables running the ray docker image | ||
in a kubernetes cluster using a generated `yaml` file. | ||
<!-- Consider commenting out since we do not have a spark transform for this. | ||
* [spark](spark/README.md) - enables the running of a spark-based transformation | ||
in a Spark runtime. | ||
--> | ||
This transform redacts Personally Identifiable Information (PII) from the input data. | ||
|
||
The transform leverages the [Microsoft Presidio SDK](https://microsoft.github.io/presidio/) for PII detection and uses the Flair recognizer for entity recognition. | ||
|
||
|
||
## Contributors | ||
|
||
- Sowmya.L.R ([email protected]) | ||
|
||
|
||
### Supported Entities | ||
|
||
The transform detects the following PII entities by default: | ||
- **PERSON**: Names of individuals | ||
- **EMAIL_ADDRESS**: Email addresses | ||
- **ORGANIZATION**: Names of organizations | ||
- **DATE_TIME**: Dates and times | ||
- **PHONE_NUMBER**: Phone numbers | ||
- **CREDIT_CARD**: Credit card numbers | ||
|
||
To change which entities are detected, pass the required entities with the **--pii_redactor_entities** argument. | ||
For the full list of supported entity types, see [Entities](https://microsoft.github.io/presidio/supported_entities/). | ||
|
||
### Redaction Techniques | ||
|
||
Two redaction techniques are supported: | ||
- **replace**: Replaces detected PII with a placeholder (default) | ||
- **redact**: Removes the detected PII from the text | ||
|
||
You can choose the redaction technique by passing it as an argument parameter (**--pii_redactor_operator**). | ||
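
The difference between the two operators can be illustrated with a minimal sketch. It is not the transform's actual implementation: `apply_operator` and the `Span` tuple are hypothetical names, and the character spans are supplied by hand rather than produced by the PII analyzer.

```python
from typing import List, Tuple

# (start, end, entity_type) character spans. In the real transform these would
# come from the PII analyzer; here they are written by hand for illustration.
Span = Tuple[int, int, str]


def apply_operator(text: str, spans: List[Span], operator: str = "replace") -> str:
    """Hypothetical helper: rewrite `text` according to the chosen operator."""
    if operator not in ("replace", "redact"):
        raise ValueError(f"unsupported operator: {operator}")
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end, entity_type in sorted(spans, reverse=True):
        # replace: substitute a placeholder; redact: remove the text entirely.
        placeholder = f"<{entity_type}>" if operator == "replace" else ""
        text = text[:start] + placeholder + text[end:]
    return text


spans = [(11, 19, "PERSON")]
print(apply_operator("My name is John Doe", spans, "replace"))
print(apply_operator("My name is John Doe", spans, "redact"))
```

With `replace` the detected span becomes a `<PERSON>` placeholder; with `redact` the span is dropped and the surrounding text is left untouched.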
|
||
## Input and Output | ||
|
||
### Input | ||
|
||
The input data should be a `pyarrow.Table` with a column containing the text on which PII detection and redaction will be applied. By default, this column is named `contents`. | ||
|
||
**Example Input Table Structure:** Table 1: Sample input to the PII redactor transform | ||
|
||
| contents | doc_id | | ||
|---------------------|--------| | ||
| My name is John Doe | doc001 | | ||
| I work at apple | doc002 | | ||
|
||
|
||
### Output | ||
|
||
The output table includes the original columns plus two additional columns: one holding the redacted text (named `new_contents` in the examples; the name is configurable) | ||
and `detected_pii`, which lists the types of PII entities detected in each document. | ||
|
||
**Example Output Table Structure for replace operator:** | ||
|
||
| contents | doc_id | new_contents | detected_pii | | ||
|---------------------|--------|--------------------------|------------------| | ||
| My name is John Doe | doc001 | My name is `<PERSON>` | `[PERSON]` | | ||
| I work at apple | doc002 | I work at `<ORGANIZATION>` | `[ORGANIZATION]` | | ||
|
||
When the `redact` operator is chosen, the output looks like this: | ||
|
||
**Example Output Table Structure for redact operator** | ||
|
||
| contents | doc_id | new_contents | detected_pii | | ||
|---------------------|--------|--------------------------|------------------| | ||
| My name is John Doe | doc001 | My name is | `[PERSON]` | | ||
| I work at apple | doc002 | I work at | `[ORGANIZATION]` | | ||
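
The shape of the output table can be sketched with plain Python containers standing in for the table. This is an illustration only: `append_redaction_columns` is a hypothetical helper, and the column-oriented dict is an assumed stand-in for the real table type.

```python
from typing import Dict, List

# Column name -> list of values, one entry per row.
Columns = Dict[str, list]


def append_redaction_columns(
    table: Columns,
    redacted: List[str],
    detected: List[List[str]],
    transformed_column: str = "new_contents",
) -> Columns:
    """Hypothetical helper: keep the original columns and add the two new ones."""
    n_rows = len(table["contents"])
    if len(redacted) != n_rows or len(detected) != n_rows:
        raise ValueError("one redacted text and one entity list are needed per row")
    out = dict(table)  # shallow copy; the input columns are kept as they are
    out[transformed_column] = redacted
    out["detected_pii"] = detected
    return out


table = {
    "contents": ["My name is John Doe", "I work at apple"],
    "doc_id": ["doc001", "doc002"],
}
result = append_redaction_columns(
    table,
    redacted=["My name is <PERSON>", "I work at <ORGANIZATION>"],
    detected=[["PERSON"], ["ORGANIZATION"]],
)
print(list(result))
```

Passing a different `transformed_column` changes only the name of the redacted-text column; `contents` itself is never overwritten.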
|
||
### Launched Command Line Options | ||
The following command line arguments are available in addition to | ||
the options provided by | ||
the [python launcher](../../../data-processing-lib/doc/python-launcher-options.md). | ||
|
||
``` | ||
--pii_redactor_entities PII_ENTITIES | ||
List of PII entities to be captured, for example: ["PERSON", "EMAIL_ADDRESS"] | ||
--pii_redactor_operator REDACTOR_OPERATOR | ||
Redaction technique to apply: replace (default) or redact | ||
--pii_redactor_transformed_contents PII_TRANSFORMED_CONTENT_COLUMN_NAME | ||
Name of the column to which the transformed contents will be written. This argument is required. | ||
--pii_redactor_score_threshold SCORE_THRESHOLD | ||
Minimum confidence score required for a detected entity to be considered a match. | ||
Provide a value above 0.6 | ||
``` | ||
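
As a rough sketch of how these four options could be parsed and how the score threshold could be applied, consider the following. The option names come from the list above; everything else is an assumption made for illustration, including parsing the entity list as JSON, the threshold default of 0.6, and the `build_parser` and `keep_confident` helpers, none of which are taken from the transform's code.

```python
import argparse
import json
from typing import Dict, List

# Default entities as listed in this README.
DEFAULT_ENTITIES = [
    "PERSON", "EMAIL_ADDRESS", "ORGANIZATION",
    "DATE_TIME", "PHONE_NUMBER", "CREDIT_CARD",
]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(description="PII redactor options (sketch)")
    # Assumption: the list is passed as a JSON string such as '["PERSON"]'.
    parser.add_argument("--pii_redactor_entities", type=json.loads,
                        default=DEFAULT_ENTITIES)
    parser.add_argument("--pii_redactor_operator",
                        choices=["replace", "redact"], default="replace")
    parser.add_argument("--pii_redactor_transformed_contents", required=True)
    # Assumption: 0.6 is used as the default purely for illustration.
    parser.add_argument("--pii_redactor_score_threshold", type=float, default=0.6)
    return parser


def keep_confident(detections: List[Dict], threshold: float) -> List[Dict]:
    """Keep only detections whose score meets the minimum confidence."""
    return [d for d in detections if d["score"] >= threshold]


args = build_parser().parse_args([
    "--pii_redactor_entities", '["PERSON", "EMAIL_ADDRESS"]',
    "--pii_redactor_transformed_contents", "new_contents",
    "--pii_redactor_score_threshold", "0.7",
])
detections = [
    {"entity_type": "PERSON", "score": 0.85},
    {"entity_type": "DATE_TIME", "score": 0.4},
]
print(args.pii_redactor_operator)
print(keep_confident(detections, args.pii_redactor_score_threshold))
```

A higher threshold trades recall for precision: fewer spans are redacted, but those that are redacted are more likely to be genuine PII.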
## PII Redactor Ray Transform | ||
Please see the set of | ||
[transform project conventions](../../README.md#transform-project-conventions) | ||
for details on general project conventions, transform configuration, | ||
testing and IDE set up. | ||
|
||
## Summary | ||
This project wraps the pii redactor transform with a Ray runtime. | ||
|
||
### Launched Command Line Options | ||
In addition to those available to the transform as defined here, | ||
the set of | ||
[ray launcher options](../../../data-processing-lib/doc/ray-launcher-options.md) are available. | ||
|
||
### Transforming data using the transform image | ||
|
||
To use the transform image to transform your data, please refer to the | ||
[running images quickstart](../../../doc/quick-start/run-transform-image.md), | ||
substituting the name of this transform image and runtime as appropriate. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
from .transform import * |
File renamed without changes.
File renamed without changes.
Empty file.