Merge pull request #895 from matouma/pii-named-modules
Refactoring pii_redactor as its own dpk_ named module
touma-I authored Jan 16, 2025
2 parents dbb0817 + f57a35a commit 725fdf6
Showing 39 changed files with 731 additions and 560 deletions.
2 changes: 2 additions & 0 deletions transforms/README-list.md
@@ -41,6 +41,8 @@ Note: This list includes the transforms that were part of the release starting w

## Release notes:

### 1.0.0.a5
Added Pii Redactor
### 1.0.0.a4
Added missing ray implementation for lang_id, doc_quality, tokenization and filter
Added ray notebooks for lang id, Doc Quality, tokenization, and Filter
@@ -9,6 +9,7 @@ RUN pip install --no-cache-dir pytest
RUN useradd -ms /bin/bash dpk
USER dpk
WORKDIR /home/dpk

ARG DPK_WHEEL_FILE_NAME

# Copy and install data processing libraries
@@ -18,20 +19,9 @@ RUN pip install data-processing-dist/${DPK_WHEEL_FILE_NAME}

# END OF STEPS destined for a data-prep-kit base image

COPY --chown=dpk:root src/ src/
COPY --chown=dpk:root pyproject.toml pyproject.toml
COPY --chown=dpk:root dpk_pii_redactor/ dpk_pii_redactor/
COPY --chown=dpk:root requirements.txt requirements.txt
RUN pip install --no-cache-dir -e .

# copy transform main() entry point to the image
COPY ./src/pii_redactor_transform_python.py .

# copy some of the samples in
COPY ./src/pii_redactor_local.py local/

# copy test
COPY test/ test/
COPY test-data/ test-data/
RUN pip install -r requirements.txt

# Set environment
ENV PYTHONPATH /home/dpk
@@ -1,4 +1,5 @@
ARG BASE_IMAGE=docker.io/rayproject/ray:2.24.0-py310

FROM ${BASE_IMAGE}

# see https://docs.openshift.com/container-platform/4.17/openshift_images/create-images.html#use-uid_create-images
@@ -10,33 +11,20 @@ RUN pip install --upgrade --no-cache-dir pip

# install pytest
RUN pip install --no-cache-dir pytest
ARG PIP_INSTALL_EXTRA_ARGS
ARG DPK_WHEEL_FILE_NAME

# Copy and install data processing libraries
# These are expected to be placed in the docker context before this is run (see the make image).
COPY --chmod=775 --chown=ray:root data-processing-dist data-processing-dist
RUN pip install data-processing-dist/${DPK_WHEEL_FILE_NAME}[ray]

## Copy the python version of the transform
COPY --chmod=775 --chown=ray:root python-transform/ python-transform/
RUN cd python-transform && pip install --no-cache-dir -e .

#COPY requirements.txt requirements.txt
#RUN pip install --no-cache-dir -r requirements.txt

COPY --chmod=775 --chown=ray:root src/ src/
COPY --chmod=775 --chown=ray:root pyproject.toml pyproject.toml
RUN pip install --no-cache-dir -e .

# copy the main() entry point to the image
COPY ./src/pii_redactor_transform_ray.py .

# copy some of the samples in
COPY ./src/pii_redactor_local_ray.py local/
COPY --chown=ray:users dpk_pii_redactor/ dpk_pii_redactor/
COPY --chown=ray:users requirements.txt requirements.txt
RUN pip install -r requirements.txt

# copy test
COPY test/ test/
COPY test-data/ test-data/
# Grant non-root users the necessary permissions to the ray directory
RUN chmod 755 /home/ray

# Set environment
ENV PYTHONPATH /home/ray
18 changes: 18 additions & 0 deletions transforms/language/pii_redactor/Makefile
@@ -0,0 +1,18 @@
REPOROOT=../../..
# Use make help, to see the available rules
include $(REPOROOT)/transforms/.make.cicd.targets

#
# This is intended to be included across the Makefiles provided within
# a given transform's directory tree, so must use compatible syntax.
#
################################################################################
# This defines the name of the transform and is used to match against
# expected files and is used to define the transform's image name.
TRANSFORM_NAME=$(shell basename `pwd`)

################################################################################


publish:
@echo "Skip... do nothing! pushing CI/CD over a cliff with OSError on text_encoder "
79 changes: 0 additions & 79 deletions transforms/language/pii_redactor/Makefile.disable

This file was deleted.

112 changes: 102 additions & 10 deletions transforms/language/pii_redactor/README.md
@@ -1,13 +1,105 @@


# PII Redactor Transform

* [python](python/README.md) - provides the base python-based transformation
implementation.
* [ray](ray/README.md) - enables the running of the base python transformation
in a Ray runtime
* [kfp](kfp_ray/README.md) - enables running the ray docker image
in a kubernetes cluster using a generated `yaml` file.
<!-- Consider commenting out since we do not have a spark transform for this.
* [spark](spark/README.md) - enables the running of a spark-based transformation
in a Spark runtime.
-->
This transform redacts Personally Identifiable Information (PII) from the input data.

The transform leverages the [Microsoft Presidio SDK](https://microsoft.github.io/presidio/) for PII detection and uses the Flair recognizer for entity recognition.


## Contributors

- Sowmya.L.R ([email protected])


### Supported Entities

The transform detects the following PII entities by default:
- **PERSON**: Names of individuals
- **EMAIL_ADDRESS**: Email addresses
- **ORGANIZATION**: Names of organizations
- **DATE_TIME**: Dates and times
- **PHONE_NUMBER**: Phone numbers
- **CREDIT_CARD**: Credit card numbers

You can configure which entities are detected by passing them as an argument parameter (**--pii_redactor_entities**).
For the full list of supported entity types, see [Entities](https://microsoft.github.io/presidio/supported_entities/).

### Redaction Techniques

Two redaction techniques are supported:
- **replace**: Replaces detected PII with a placeholder (default)
- **redact**: Removes the detected PII from the text

You can choose the redaction technique by passing it as an argument parameter (**--pii_redactor_operator**).
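
The difference between the two operators can be illustrated with a minimal, self-contained sketch. It does not use Presidio or the transform's own code: the `Span` type and the `apply_operator` helper are hypothetical names introduced here, and the character offsets are supplied by hand where a real run would take them from the analyzer.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """A detected PII span (hypothetical structure, for illustration only)."""

    start: int   # index of the first character of the match
    end: int     # index one past the last character of the match
    entity: str  # entity type, e.g. "PERSON"


def apply_operator(text: str, spans: List[Span], operator: str = "replace") -> str:
    """Apply a 'replace' or 'redact' style operator to the given spans."""
    if operator not in ("replace", "redact"):
        raise ValueError(f"unsupported operator: {operator}")
    # Work from right to left so earlier offsets stay valid after each edit.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        # replace: substitute a placeholder; redact: remove the text entirely.
        filler = f"<{span.entity}>" if operator == "replace" else ""
        text = text[: span.start] + filler + text[span.end :]
    return text


text = "My name is John Doe"
spans = [Span(start=11, end=19, entity="PERSON")]  # offsets chosen by hand

replaced = apply_operator(text, spans, "replace")
redacted = apply_operator(text, spans, "redact")
print(repr(replaced))
print(repr(redacted))
```

With `replace` the name becomes a `<PERSON>` placeholder, while `redact` leaves nothing in its place, so the surrounding whitespace is kept as it was.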

## Input and Output

### Input

The input data should be a `pyarrow.Table` with a column containing the text on which PII detection and redaction will be applied. By default, this column is named `contents`.

**Example Input Table Structure:** Table 1: Sample input to the pii redactor transform

| contents | doc_id |
|---------------------|--------|
| My name is John Doe | doc001 |
| I work at apple | doc002 |


### Output

The output table includes the original columns plus two additional columns: a column with the redacted text (named `new_contents` in the examples; the name is configurable) and a `detected_pii` column listing the types of PII entities detected in that document.

**Example Output Table Structure for replace operator:**

| contents | doc_id | new_contents | detected_pii |
|---------------------|--------|--------------------------|------------------|
| My name is John Doe | doc001 | My name is `<PERSON>` | `[PERSON]` |
| I work at apple | doc002 | I work at `<ORGANIZATION>` | `[ORGANIZATION]` |

When the `redact` operator is chosen, the output looks like this:

**Example Output Table Structure for redact operator:**

| contents | doc_id | new_contents | detected_pii |
|---------------------|--------|--------------------------|------------------|
| My name is John Doe | doc001 | My name is | `[PERSON]` |
| I work at apple | doc002 | I work at | `[ORGANIZATION]` |

### Launched Command Line Options
The following command line arguments are available in addition to
the options provided by
the [python launcher](../../../data-processing-lib/doc/python-launcher-options.md).

```
  --pii_redactor_entities PII_ENTITIES
                        List of PII entities to be captured, for example: ["PERSON", "EMAIL"]
  --pii_redactor_operator REDACTOR_OPERATOR
                        Two redaction techniques are supported - replace (default), redact
  --pii_redactor_transformed_contents PII_TRANSFORMED_CONTENT_COLUMN_NAME
                        Name of the column to which the transformed contents will be added. This is a required argument.
  --pii_redactor_score_threshold SCORE_THRESHOLD
                        Minimum confidence score required for an entity to be considered a match.
                        Provide a value above 0.6
```
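
As a rough illustration of how these four options fit together, the sketch below parses them with the standard-library `argparse`. It is not the transform's actual argument handling: `build_parser` is a hypothetical helper, parsing the entity list with `ast.literal_eval` is an assumption, and the `0.6` threshold default is assumed only from the help text above.

```python
import argparse
import ast


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented pii_redactor options."""
    parser = argparse.ArgumentParser(description="Sketch of the pii_redactor options")
    parser.add_argument(
        "--pii_redactor_entities",
        type=ast.literal_eval,  # assumption: the list is passed as a Python literal
        default=["PERSON", "EMAIL_ADDRESS", "ORGANIZATION",
                 "DATE_TIME", "PHONE_NUMBER", "CREDIT_CARD"],
        help="list of PII entities to be captured",
    )
    parser.add_argument(
        "--pii_redactor_operator",
        choices=["replace", "redact"],
        default="replace",
        help="redaction technique",
    )
    parser.add_argument(
        "--pii_redactor_transformed_contents",
        required=True,
        help="column name for the transformed contents",
    )
    parser.add_argument(
        "--pii_redactor_score_threshold",
        type=float,
        default=0.6,  # assumption: default inferred from the help text
        help="minimum confidence score for a match",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_parser().parse_args([
    "--pii_redactor_entities", '["PERSON", "EMAIL_ADDRESS"]',
    "--pii_redactor_transformed_contents", "new_contents",
])
print(args.pii_redactor_entities, args.pii_redactor_operator)
```

Only the transformed-contents column name is mandatory here; the other options fall back to the defaults described above.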
## PII Redactor Ray Transform
Please see the set of
[transform project conventions](../../README.md#transform-project-conventions)
for details on general project conventions, transform configuration,
testing and IDE set up.

## Summary
This project wraps the pii redactor transform with a Ray runtime.

### Launched Command Line Options
In addition to those available to the transform as defined here,
the set of
[ray launcher options](../../../data-processing-lib/doc/ray-launcher-options.md) are available.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
@@ -0,0 +1 @@
from .transform import *
@@ -12,7 +12,7 @@
import os

from data_processing.data_access import DataAccessLocal
from pii_redactor_transform import (
from dpk_pii_redactor.transform import (
PIIRedactorTransform,
doc_transformed_contents_key,
supported_entities_key,
@@ -15,8 +15,8 @@

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from pii_redactor_transform import doc_transformed_contents_cli_param
from pii_redactor_transform_python import PIIRedactorPythonTransformConfiguration
from dpk_pii_redactor.transform import doc_transformed_contents_cli_param
from dpk_pii_redactor.transform_python import PIIRedactorPythonTransformConfiguration


# create parameters
@@ -12,7 +12,7 @@
import logging

import spacy
from flair_recognizer import FlairRecognizer
from dpk_pii_redactor.flair_recognizer import FlairRecognizer
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider

Empty file.
@@ -15,7 +15,7 @@

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
from pii_redactor_transform_ray import PIIRedactorRayTransformConfiguration
from dpk_pii_redactor.ray.transform import PIIRedactorRayTransformConfiguration


# create parameters
@@ -15,7 +15,7 @@

from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
from pii_redactor_transform_ray import PIIRedactorRayTransformConfiguration
from dpk_pii_redactor.ray.transform import PIIRedactorRayTransformConfiguration


print(os.environ)