From 52b5a0dfac5ede129ea0758bb452615a29a57a5c Mon Sep 17 00:00:00 2001
From: blublinsky
Date: Wed, 27 Nov 2024 12:01:09 +0000
Subject: [PATCH] add documentation

---
 .../doc/pipelined_transform.md        | 42 +++++++++++++++++++
 data-processing-lib/doc/transforms.md |  5 ++-
 2 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100644 data-processing-lib/doc/pipelined_transform.md

diff --git a/data-processing-lib/doc/pipelined_transform.md b/data-processing-lib/doc/pipelined_transform.md
new file mode 100644
index 000000000..5fb55cffd
--- /dev/null
+++ b/data-processing-lib/doc/pipelined_transform.md
@@ -0,0 +1,42 @@
+# Pipelined transform
+
+Typical DPK usage is a sequential invocation of individual transforms, each processing all of the input data and
+creating its own output. Such execution is very convenient, as it produces all of the intermediate data, which
+can be useful, especially during debugging.
+
+That said, this approach creates a lot of intermediate data and performs a lot of reads and writes, which can
+significantly slow down processing, especially in the case of large data sets.
+
+To overcome this drawback, DPK introduces a new type of transform - the pipeline transform. A pipeline transform
+(somewhat similar to an [sklearn pipeline](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html))
+is both a transform, meaning that it processes one file at a time, and a pipeline, meaning that each file is
+processed by a set of individual transforms, which pass the data between them as byte arrays in memory.
+
+## Creating a pipeline transform
+
+Creating a pipeline transform requires a runtime-specific transform configuration leveraging
+[PipelineTransformConfiguration](../python/src/data_processing/transform/pipeline_transform_configuration.py).
+Examples of such configurations can be found here:
+
+* [Python](../../transforms/universal/noop/python/src/noop_pipeline_transform_python.py)
+* [Ray](../../transforms/universal/noop/ray/src/noop_pipeline_transform_ray.py)
+* [Spark](../../transforms/universal/noop/spark/src/noop_pipeline_transform_spark.py)
+
+These are very simple examples using a pipeline that contains a single transform.
+
+More complex examples, defining a pipeline of two transforms (Resize and NOOP), can be found in
+[Python](../python/src/data_processing/test_support/transform/pipeline_transform.py) and
+[Ray](../ray/src/data_processing_ray/test_support/transform/pipeline_transform.py).
+
+***Note:*** a limitation of the pipeline transform is that all participating transforms have to be different;
+the same transform cannot be included twice.
+
+## Running a pipeline transform
+
+Similar to `ordinary` transforms, a pipeline transform can be invoked using a launcher, but in this case the
+parameters have to include the parameters of all participating transforms. The base class
+[AbstractPipelineTransform](../python/src/data_processing/transform/pipeline_transform.py) will initialize
+all participating transforms based on these parameters.
+
+***Note:*** as per the DPK convention, the parameters of every transform are prefixed with the transform name,
+which ensures that every transform always receives its appropriate parameters.
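+
+To make the parameter convention concrete, here is a condensed sketch of launching the two-transform (Resize
+followed by NOOP) pipeline mentioned above with the Python launcher. The configuration class name and the exact
+parameter names are assumptions made for illustration; consult the linked sources for the precise API.
+
+```python
+import sys
+
+from data_processing.runtime.pure_python import PythonTransformLauncher
+from data_processing.utils import ParamsUtils
+
+# hypothetical configuration class for the Resize + NOOP pipeline; see the
+# test_support pipeline_transform.py examples linked above for real definitions
+from pipeline_transform import ResizeNOOPPipelineTransformConfiguration
+
+params = {
+    # standard launcher parameters (no transform prefix)
+    "data_local_config": ParamsUtils.convert_to_ast(
+        {"input_folder": "test-data/input", "output_folder": "output"}
+    ),
+    # parameter for the Resize transform, prefixed with "resize_"
+    "resize_max_rows_per_table": 300,
+    # parameter for the NOOP transform, prefixed with "noop_"
+    "noop_sleep_sec": 1,
+}
+
+if __name__ == "__main__":
+    # emulate the command line that the launcher parses
+    sys.argv = ParamsUtils.dict_to_req(d=params)
+    launcher = PythonTransformLauncher(runtime_config=ResizeNOOPPipelineTransformConfiguration())
+    launcher.launch()
+```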
diff --git a/data-processing-lib/doc/transforms.md b/data-processing-lib/doc/transforms.md
index fc3509ba3..3d5169c90 100644
--- a/data-processing-lib/doc/transforms.md
+++ b/data-processing-lib/doc/transforms.md
@@ -9,9 +9,12 @@ There are currently two types of transforms defined in DPK:
 * [AbstractBinaryTransform](../python/src/data_processing/transform/binary_transform.py) which is a base
 class for all data transforms. Data transforms convert a file of data producing zero or more data files
-and metadata. A specific class of the binary transform is
+and metadata. Specific classes of the binary transform are
 [AbstractTableTransform](../python/src/data_processing/transform/table_transform.py) that consumes and produces
 data files containing [pyarrow tables](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html)
+and [AbstractPipelineTransform](../python/src/data_processing/transform/pipeline_transform.py) that creates
+pipelined execution of one or more transforms. For more information on pipelined transforms refer to
+[this document](pipelined_transform.md). A minimal table transform sketch is shown after this list.
 * [AbstractFolderTransform](../python/src/data_processing/transform/folder_transform.py) which is a base class
 consuming a folder (that can contain an arbitrary set of files, that need to be processed together) and produces
 zero or more data files and metadata.
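+
+As a quick illustration of the table transform contract described above, here is a minimal sketch. It assumes
+the `transform` signature defined in [table_transform.py](../python/src/data_processing/transform/table_transform.py);
+the class name and metadata are purely illustrative.
+
+```python
+from typing import Any
+
+import pyarrow as pa
+
+from data_processing.transform import AbstractTableTransform
+
+
+class PassThroughTableTransform(AbstractTableTransform):
+    """Illustrative transform that returns the input table unchanged."""
+
+    def __init__(self, config: dict[str, Any]):
+        # config carries this transform's parameters
+        super().__init__(config)
+
+    def transform(self, table: pa.Table, file_name: str = None) -> tuple[list[pa.Table], dict[str, Any]]:
+        # produce zero or more output tables plus a metadata dictionary
+        return [table], {"nrows": table.num_rows}
+```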