Merge pull request #394 from IBM/image-data-mount
Document processing of local data using python transform image
daw3rd authored Jul 11, 2024
2 parents 79e930f + 8a09eb8 commit a7d3c42
Showing 26 changed files with 314 additions and 14 deletions.
48 changes: 35 additions & 13 deletions doc/quick-start/run-transform-image.md
@@ -26,10 +26,10 @@ quay.io/dataprep1/data-prep-kit/noop-python latest aac55fa
Or, you can use the pre-built images (latest, or 0.2.1 or later tags)
on quay.io found at [https://quay.io/user/dataprep1](https://quay.io/user/dataprep1).

### Local Data
Images built in this repository
include directories for mounting input and output directories.
Those directories are `/home/dpk/input` and `/home/dpk/output`.
### Local Data - Python Runtime
To use an image to process local data, we mount the host
input and output directories into the container. Any mount
point can be used, but we will use `/input` and `/output`.
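
Before mounting, it can help to confirm that the host directories exist. When `-v` is given a host path that does not exist, Docker creates it as an empty directory (typically owned by root), which can hide a mistyped input path. The following is a minimal sketch of such a check; the function name `prepare_mount_dirs` is hypothetical and not part of this repository.

```shell
# Hypothetical pre-flight helper (an illustration, not part of data-prep-kit).
# Fails if the host input directory is missing; creates the host output
# directory if needed. The body runs in a subshell so its variables stay local.
prepare_mount_dirs() (
    host_in="$1"    # host directory holding the input data
    host_out="$2"   # host directory that will receive the output
    if [ ! -d "$host_in" ]; then
        echo "input directory not found: $host_in" >&2
        exit 1
    fi
    mkdir -p "$host_out"
)

# Usage (paths are examples only):
#   prepare_mount_dirs /home/me/input /home/me/output && docker run ...
```

Creating the output directory yourself keeps it owned by your user, so the results are readable without elevated permissions.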

To process data in the `/home/me/input` directory and write it
to the `/home/me/output` directory, we mount these directories into
@@ -38,13 +38,13 @@ So for example, using the locally built `noop` transform:

```shell
docker run --rm \
-v /home/me/input:/home/dpk/input \
-v /home/me/output:/home/dpk/output \
-v /home/me/input:/input \
-v /home/me/output:/output \
noop-python:latest \
python noop_transform_python.py \
--data_local_config "{ \
'input_folder' : '/home/dpk/input', \
'output_folder' : '/home/dpk/output' \
'input_folder' : '/input', \
'output_folder' : '/output' \
}"

```
@@ -53,18 +53,40 @@ To run the quay.io located transform instead, substitute
for `noop-python:latest`, as follows:
```shell
docker run --rm \
-v /home/me/input:/home/dpk/input \
-v /home/me/output:/home/dpk/output \
-v /home/me/input:/input \
-v /home/me/output:/output \
quay.io/dataprep1/data-prep-kit/noop-python:latest \
python noop_transform_python.py \
--data_local_config "{ \
'input_folder' : '/home/dpk/input', \
'output_folder' : '/home/dpk/output' \
'input_folder' : '/input', \
'output_folder' : '/output' \
}"

```
### Local Data - Ray Runtime
To use the ray runtime, we must:
1. Switch to using the ray-based image `noop-ray:latest`
2. Use the ray runtime python main() defined in `noop_transform_ray.py`

### S3-located Data
For example, using the quay.io image:
```shell
docker run --rm \
-v /home/me/input:/input \
-v /home/me/output:/output \
quay.io/dataprep1/data-prep-kit/noop-ray:latest \
python noop_transform_ray.py \
--data_local_config "{ \
'input_folder' : '/input', \
'output_folder' : '/output' \
}"

```
This is functionally equivalent to the python runtime, but additional
configuration can be provided (see the
[ray launcher args](../../data-processing-lib/doc/ray-launcher-options.md)
for details).
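
The python and ray commands above differ only in the image name and the script name. As a rough sketch, the pattern can be captured in a small helper that prints the command rather than running it. The function `dpk_run_cmd` and its arguments are hypothetical, and it assumes the `<transform>-<runtime>` image and `<transform>_transform_<runtime>.py` script naming seen in the `noop` examples holds for the transform you use.

```shell
# Hypothetical helper (an illustration, not part of data-prep-kit): prints the
# docker command for a transform and runtime. Assumes the naming pattern of the
# noop examples. The body runs in a subshell so its variables stay local.
dpk_run_cmd() (
    transform="$1"       # e.g. noop
    runtime="$2"         # python or ray
    host_in="$3"         # host input directory
    host_out="$4"        # host output directory
    prefix="${5:-}"      # optional registry prefix, ending in "/"
    tag="${6:-latest}"   # optional image tag
    image="${prefix}${transform}-${runtime}:${tag}"
    config="{ 'input_folder' : '/input', 'output_folder' : '/output' }"
    printf 'docker run --rm -v %s:/input -v %s:/output %s python %s_transform_%s.py --data_local_config "%s"\n' \
        "$host_in" "$host_out" "$image" "$transform" "$runtime" "$config"
)

# Locally built python image:
dpk_run_cmd noop python /home/me/input /home/me/output
# Pre-built ray image from quay.io:
dpk_run_cmd noop ray /home/me/input /home/me/output quay.io/dataprep1/data-prep-kit/
```

Printing the command first makes it easy to inspect before running it. Host paths containing spaces would need extra quoting, which this sketch does not handle.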

### S3-located Data - Python Runtime
When processing data located in S3 buckets, one can use the same image
and specify different `--data_s3_*` configuration as follows:

21 changes: 21 additions & 0 deletions transforms/code/code2parquet/python/README.md
@@ -123,3 +123,24 @@ To see results of the transform.
---------------------------------


### Transforming local data

Beginning with version 0.2.1, most/all python transform images are built with directories for mounting local data for processing.
Those directories are `/home/dpk/input` and `/home/dpk/output`.

After using `make image` to build the transform image, you can process the data
in the `/home/me/input` directory and place it in the `/home/me/output` directory, for example, using the 0.2.1 tagged image as follows:

```shell
docker run --rm -v /home/me/input:/home/dpk/input -v /home/me/output:/home/dpk/output code2parquet-python:0.2.1 \
python code2parquet_transform_python.py --data_local_config "{ 'input_folder' : '/home/dpk/input', 'output_folder' : '/home/dpk/output'}"
```

You may also use the pre-built images on quay.io using `quay.io/dataprep1/data-prep-kit/code2parquet-python:0.2.1` as the image name.


### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/code/code2parquet/ray/README.md
@@ -41,3 +41,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
22 changes: 22 additions & 0 deletions transforms/code/code_quality/python/README.md
@@ -65,3 +65,25 @@ Then
ls output
```
To see results of the transform.

### Transforming local data

Beginning with version 0.2.1, most/all python transform images are built with directories for mounting local data for processing.
Those directories are `/home/dpk/input` and `/home/dpk/output`.

After using `make image` to build the transform image, you can process the data
in the `/home/me/input` directory and place it in the `/home/me/output` directory, for example, using the 0.2.1 tagged image as follows:

```shell
docker run --rm -v /home/me/input:/home/dpk/input -v /home/me/output:/home/dpk/output code_quality-python:0.2.1 \
python code_quality_transform_python.py --data_local_config "{ 'input_folder' : '/home/dpk/input', 'output_folder' : '/home/dpk/output'}"
```

You may also use the pre-built images on quay.io using `quay.io/dataprep1/data-prep-kit/code_quality-python:0.2.1` as the image name.


### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/code/code_quality/ray/README.md
@@ -43,3 +43,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
22 changes: 22 additions & 0 deletions transforms/code/malware/python/README.md
@@ -101,3 +101,25 @@ Then
ls output
```
To see results of the transform.

### Transforming local data

Beginning with version 0.2.1, most/all python transform images are built with directories for mounting local data for processing.
Those directories are `/home/dpk/input` and `/home/dpk/output`.

After using `make image` to build the transform image, you can process the data
in the `/home/me/input` directory and place it in the `/home/me/output` directory, for example, using the 0.2.1 tagged image as follows:

```shell
docker run --rm -v /home/me/input:/home/dpk/input -v /home/me/output:/home/dpk/output malware-python:0.2.1 \
python malware_transform_python.py --data_local_config "{ 'input_folder' : '/home/dpk/input', 'output_folder' : '/home/dpk/output'}"
```

You may also use the pre-built images on quay.io using `quay.io/dataprep1/data-prep-kit/malware-python:0.2.1` as the image name.


### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/code/malware/ray/README.md
@@ -39,3 +39,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
22 changes: 22 additions & 0 deletions transforms/code/proglang_select/python/README.md
@@ -72,3 +72,25 @@ Then
ls output
```
To see results of the transform.

### Transforming local data

Beginning with version 0.2.1, most/all python transform images are built with directories for mounting local data for processing.
Those directories are `/home/dpk/input` and `/home/dpk/output`.

After using `make image` to build the transform image, you can process the data
in the `/home/me/input` directory and place it in the `/home/me/output` directory, for example, using the 0.2.1 tagged image as follows:

```shell
docker run --rm -v /home/me/input:/home/dpk/input -v /home/me/output:/home/dpk/output proglang_select-python:0.2.1 \
python proglang_select_transform_python.py --data_local_config "{ 'input_folder' : '/home/dpk/input', 'output_folder' : '/home/dpk/output'}"
```

You may also use the pre-built images on quay.io using `quay.io/dataprep1/data-prep-kit/proglang_select-python:0.2.1` as the image name.


### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/code/proglang_select/ray/README.md
@@ -40,3 +40,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
25 changes: 24 additions & 1 deletion transforms/language/lang_id/python/README.md
@@ -53,4 +53,27 @@ To see results of the transform.

## Troubleshooting guide

For M1 Mac users: if you see the following error during the make command, `error: command '/usr/bin/clang' failed with exit code 1`, follow [this guide](https://freeman.vc/notes/installing-fasttext-on-an-m1-mac).


### Transforming local data

Beginning with version 0.2.1, most/all python transform images are built with directories for mounting local data for processing.
Those directories are `/home/dpk/input` and `/home/dpk/output`.

After using `make image` to build the transform image, you can process the data
in the `/home/me/input` directory and place it in the `/home/me/output` directory, for example, using the 0.2.1 tagged image as follows:

```shell
docker run --rm -v /home/me/input:/home/dpk/input -v /home/me/output:/home/dpk/output lang_id-python:0.2.1 \
python lang_id_transform_python.py --data_local_config "{ 'input_folder' : '/home/dpk/input', 'output_folder' : '/home/dpk/output'}"
```

You may also use the pre-built images on quay.io using `quay.io/dataprep1/data-prep-kit/lang_id-python:0.2.1` as the image name.


### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/language/lang_id/ray/README.md
@@ -42,3 +42,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/universal/doc_id/ray/README.md
@@ -83,3 +83,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/universal/doc_id/spark/README.md
@@ -51,3 +51,9 @@ To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLeve
The metadata generated by the Spark `doc_id` transform contains the following statistics:
* `total_docs_count`, `total_columns_count`: total number of documents (rows), and columns in the input table, before the `doc_id` transform ran
* `docs_after_doc_id`, `columns_after_doc_id`: total number of documents (rows), and columns in the output table, after the `doc_id` transform ran

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/universal/ededup/ray/README.md
@@ -89,3 +89,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/universal/fdedup/ray/README.md
@@ -184,3 +184,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/universal/filter/python/README.md
@@ -277,3 +277,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/universal/filter/ray/README.md
@@ -42,3 +42,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/universal/filter/spark/README.md
@@ -239,3 +239,9 @@ the options provided by the [spark launcher](../../../../data-processing-lib/doc
logical operator (AND or OR) that joins filter criteria
```

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
22 changes: 22 additions & 0 deletions transforms/universal/noop/python/README.md
@@ -55,3 +55,25 @@ Then
ls output
```
To see results of the transform.

### Transforming local data

Beginning with version 0.2.1, most/all python transform images are built with directories for mounting local data for processing.
Those directories are `/home/dpk/input` and `/home/dpk/output`.

After using `make image` to build the transform image, you can process the data
in the `/home/me/input` directory and place it in the `/home/me/output` directory, for example, using the 0.2.1 tagged image as follows:

```shell
docker run --rm -v /home/me/input:/home/dpk/input -v /home/me/output:/home/dpk/output noop-python:0.2.1 \
python noop_transform_python.py --data_local_config "{ 'input_folder' : '/home/dpk/input', 'output_folder' : '/home/dpk/output'}"
```

You may also use the pre-built images on quay.io using `quay.io/dataprep1/data-prep-kit/noop-python:0.2.1` as the image name.


### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/universal/noop/ray/README.md
@@ -41,3 +41,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/universal/noop/spark/README.md
@@ -41,3 +41,9 @@ ls output
```
To see results of the transform.


### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.
6 changes: 6 additions & 0 deletions transforms/universal/profiler/ray/README.md
@@ -80,3 +80,9 @@ Then
ls output
```
To see results of the transform.

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.