Merge pull request #408 from IBM/Documentation-Updates
Documentation updates
daw3rd authored Jul 15, 2024
2 parents c334038 + 6a11550 commit d4e3b50
Showing 4 changed files with 34 additions and 45 deletions.
71 changes: 29 additions & 42 deletions README.md
@@ -19,14 +19,14 @@ As the variety of use cases grows, so does the need to support:
- New ways of transforming the data to optimize the performance of the resulting LLMs for each specific use case.
- A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale

Data Prep Kit offers implementations of commonly needed data transformations, called *modules*, for both Code and Language modalities.
Data Prep Kit offers implementations of commonly needed data preparation steps, called *modules* or *transforms*, for both Code and Language modalities.
The goal is to offer high-level APIs for developers to quickly get started in working with their data, without needing expertise in the underlying runtimes and frameworks.

## 📝 Table of Contents
- [About](#about)
- [Quick Start](doc/quick-start/quick-start.md)
- [Transform Framework](data-processing-lib/doc/overview.md)
- [Pipeline Automation](#pipeline)
- [Data Processing Modules](#modules)
- [Data Processing Framework](#data-proc-lib)
- [Repository Use and Navigation](doc/repo.md)
- [How to Contribute](CONTRIBUTING.md)
- [Acknowledgments](#acknowledgement)
@@ -42,11 +42,26 @@ Eventually, Data Prep Kit will offer consistent APIs and configurations across t
1. Python runtime
2. Ray runtime (local and distributed)
3. Spark runtime (local and distributed)
4. [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)
4. Kubeflow Pipelines (local and distributed, wrapping Ray)

The current matrix for the combination of modules and supported runtimes is shown in the table below.
Contributors are welcome to add new modules as well as add runtime support for existing modules!
Features of the toolkit:

- It aims to accelerate unstructured data prep for the "long tail" of LLM use cases.
- It offers a growing set of [module](/transforms) implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing.
- It provides a growing set of [sample data processing pipelines](/examples) that can be used for real enterprise use cases.
- It provides the [Data processing library](data-processing-lib/ray) to enable contribution of new custom modules targeting new use cases.
- It uses [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp/doc/simple_transform_pipeline.md).

Data modalities supported:

* Code - support for code datasets as downloaded .zip files of GitHub repositories converted to
[parquet](https://arrow.apache.org/docs/python/parquet.html) files.
* Language - support for natural language datasets; like the code transforms, these operate on parquet files.

Support for additional data modalities is expected in the future, and contributions adding support for additional data formats are welcome!

## Data Preparation Modules <a name = "modules"></a>
The matrix below shows the combination of modules and supported runtimes. All modules can be accessed [here](/transforms) and can be combined to form data processing pipelines, as shown in the [examples](/examples) folder; a minimal sketch of such a pipeline follows the table below.

|Modules | Python-only | Ray | Spark | KFP on Ray |
|------------------------------ |------------------|------------------|------------------|------------------------|
@@ -63,33 +78,11 @@ Contributors are welcome to add new modules as well as add runtime support for e
|Profiler | |:white_check_mark:| |:white_check_mark: |
|Tokenizer |:white_check_mark:|:white_check_mark:| |:white_check_mark: |
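
To make the idea of combining modules concrete, here is a minimal sketch of a two-step pipeline over a single parquet file using [pyarrow](https://arrow.apache.org/docs/python/); the step functions, the `contents` column name, and the file paths are illustrative assumptions only, not modules shipped by the toolkit.

```python
# Hypothetical two-step pipeline over one parquet file; the step functions,
# the "contents" column, and the paths are placeholders, not toolkit modules.
import pyarrow.compute as pc
import pyarrow.parquet as pq

def add_doc_length(table):
    """Annotator-style step: add a character-count column, keep every row."""
    return table.append_column("num_chars", pc.utf8_length(table["contents"]))

def drop_empty_docs(table):
    """Filter-style step: remove rows whose documents are empty."""
    return table.filter(pc.greater(table["num_chars"], 0))

table = pq.read_table("input.parquet")
for step in (add_doc_length, drop_empty_docs):   # run the steps in order
    table = step(table)
pq.write_table(table, "output.parquet")
print(f"kept {table.num_rows} rows")
```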

Contributors are welcome to add new modules as well as add runtime support for existing modules!


Features of the toolkit:

- It aims to accelerate unstructured data prep for the "long tail" of LLM use cases.
- It offers a growing set of module implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing.
- It provides a growing set of sample pipelines developed for real enterprise use cases.
- It provides the [Data processing library](data-processing-lib/ray) to enable contribution of new custom modules targeting new use cases.
- It uses [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp/doc/simple_transform_pipeline.md).

Data modalities supported:

* Code - support for code datasets as downloaded .zip files of GitHub repositories converted to
[parquet](https://arrow.apache.org/docs/python/parquet.html) files.
* Language - Future releases will provide transforms specific to natural language, and like the code transformations, will operate on parquet files.

Support for additional data modalities is expected in the future.

### Data Processing Library
A Python-based library that has ready-to-use transforms that can be supported across a variety of runtimes.
We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
Every parquet file follows a set
[schema](tools/ingest2parquet/).
Data is converted from raw form (e.g., zip files for GitHub repositories) to parquet files by the
[code2parquet](/transforms/code/code2parquet)
tool that also adds the necessary fields in the schema.
A user can then use one or more of the [available transforms](transforms) to process their data.
## Data Processing Framework <a name = "data-proc-lib"></a>
At the core of the framework is a data processing library that provides a systematic way to implement the data processing modules. The library is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
Every parquet file follows a set [schema](/transforms/code/code2parquet/python/README.md). A user can apply one or more of the transforms (or modules) discussed above to process their data.
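
For reference, the schema and row count of any of these parquet files can be inspected directly with pyarrow; the file name below is a placeholder.

```python
# Inspect the schema and size of a parquet file; "sample.parquet" is a placeholder path.
import pyarrow.parquet as pq

pf = pq.ParquetFile("sample.parquet")
print(pf.schema_arrow)        # column names and types, i.e. the schema the file follows
print(pf.metadata.num_rows)   # number of documents (rows) stored in the file
```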

#### Transform Design
A transform can follow one of two patterns: annotator or filter.
@@ -107,7 +100,8 @@ or [Spark](https://spark.apache.org) wrappers are provided, to readily scale out
A generalized workflow is shown [here](doc/data-processing.md).
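
The distinction between the two patterns can be sketched with a pair of toy functions over a pyarrow Table: an annotator adds information (a new column) while keeping every row, whereas a filter keeps the schema but drops rows. These are illustrations of the concepts only, not the library's transform interface, and the `contents` column name is an assumption.

```python
# Toy illustrations of the annotator and filter patterns; not the library's actual API.
import pyarrow as pa
import pyarrow.compute as pc

def annotator(table: pa.Table) -> pa.Table:
    """Add a boolean column marking documents that mention a license; rows are preserved."""
    return table.append_column("has_license", pc.match_substring(table["contents"], "License"))

def filter_transform(table: pa.Table) -> pa.Table:
    """Drop rows flagged by the annotator as lacking a license; the schema is unchanged."""
    return table.filter(table["has_license"])
```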

#### Bring Your Own Transform
One can add new transforms by bringing in Python-based processing logic and using the Data Processing Library to build and contribute transforms.
One can add new transforms by bringing in Python-based processing logic and using the Data Processing Library to build and contribute them. We provide an [example transform](/transforms/universal/noop) that can serve as a template for new, simple transforms.

More details on the data processing library are [here](data-processing-lib/doc/overview.md).
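
In outline, a custom transform is a class whose transform method takes a pyarrow Table and returns a list of output Tables plus a metadata dictionary. The skeleton below is a loose sketch of that shape, modeled on the noop example; the base class, configuration handling, and exact method signature should be taken from the [noop sources](/transforms/universal/noop), not from this sketch.

```python
# A loose sketch of a custom transform modeled on the noop template; the real library's
# base class, configuration plumbing, and method signature may differ from this shape.
from typing import Any

import pyarrow as pa
import pyarrow.compute as pc


class MyFilterTransform:
    """Filter-style transform: drops documents shorter than a configured minimum length."""

    def __init__(self, config: dict | None = None):
        # "min_chars" and "content_column" are assumed configuration keys for illustration.
        cfg = config or {}
        self.min_chars = cfg.get("min_chars", 10)
        self.column = cfg.get("content_column", "contents")

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict[str, Any]]:
        mask = pc.greater_equal(pc.utf8_length(table[self.column]), self.min_chars)
        out = table.filter(mask)
        return [out], {"rows_in": table.num_rows, "rows_out": out.num_rows}
```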

#### Automation
@@ -120,21 +114,14 @@ for creating and managing the Ray cluster and [KubeRay API server](https://githu
to interact with the KubeRay operator. An additional [framework](kfp/kfp_support_lib) along with several
[kfp components](kfp/kfp_ray_components) is used to simplify the pipeline implementation.
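
For orientation, a single-step KFP (v1) pipeline has roughly the shape sketched below; the component file name and its parameter names are placeholders, and the ready-made components referenced above live under [kfp/kfp_ray_components](kfp/kfp_ray_components).

```python
# Rough sketch of a one-step Kubeflow Pipelines (v1) definition; the component file
# and its parameter names are placeholders, not the components shipped in this repo.
import kfp
import kfp.dsl as dsl

# Load a (hypothetical) component definition describing one transform step.
transform_op = kfp.components.load_component_from_file("transform_component.yaml")

@dsl.pipeline(name="single-transform", description="Run one transform step on a cluster")
def single_transform_pipeline(input_path: str, output_path: str):
    transform_op(input_folder=input_path, output_folder=output_path)

if __name__ == "__main__":
    # Compile the pipeline to a YAML package that can be uploaded to a KFP instance.
    kfp.compiler.Compiler().compile(single_transform_pipeline, "single_transform_pipeline.yaml")
```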

## Automate a Pipeline<a name="pipeline"></a>
Data preprocessing can be automated by running transforms as a Kubeflow Pipeline (KFP).
The project facilitates the creation of a local [Kind cluster](https://kind.sigs.k8s.io/) with all the required
software and test data, or deployment of required software on an existing cluster.
See [Set up a Kubernetes cluster for KFP execution](kfp/doc/setup.md).

A simple transform pipeline [tutorial](kfp/doc/simple_transform_pipeline.md) explains the pipeline creation and execution.
In addition, if you want to combine several transforms in a single pipeline, see the [multi-step pipeline](kfp/doc/multi_transform_pipeline.md) documentation.

When you finish working with the cluster and want to clean up or destroy it, see [clean up the cluster](../kfp/doc/setup.md#cleanup).


## &#x2B50; Acknowledgements <a name = "acknowledgement"></a>
Thanks to the [BigCode Project](https://github.com/bigcode-project), which served as the source for borrowing the code quality metrics.
## Acknowledgements <a name = "acknowledgement"></a>
Thanks to the [BigCode Project](https://github.com/bigcode-project), which served as the source for a few of the code quality metrics.



4 changes: 3 additions & 1 deletion doc/quick-start/new-transform-inside.md
@@ -1,2 +1,4 @@
# Creating a New Transform in this Repository
WIP - but please see [transform conventions](../../transforms/README.md).

An easy way is to replicate the [noop transform](transforms/universal/noop), rename the classes to suit your needs, and add your business logic in [noop_transform.py](../../transforms/universal/noop/python/src/noop_transform.py).
Please see the [tutorials](../transform-tutorial-examples.md) for more details.
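
Before wiring new logic into the framework, it can be sanity-checked on a small in-memory table; the function and column name below are placeholders for whatever you add to your copy of noop_transform.py.

```python
# Quick local sanity check of new business logic on a tiny in-memory table;
# the function and the "contents" column are placeholders for your own code.
import pyarrow as pa
import pyarrow.compute as pc

def my_business_logic(table: pa.Table) -> pa.Table:
    """Stand-in for the logic you would add to your copy of noop_transform.py."""
    return table.append_column("has_todo", pc.match_substring(table["contents"], "TODO"))

table = pa.table({"contents": ["# TODO: fix", "print('hello')"]})
print(my_business_logic(table).to_pylist())
```
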
2 changes: 1 addition & 1 deletion doc/quick-start/quick-start.md
@@ -4,7 +4,7 @@ Here we provided short examples of various uses of the Data Prep Kit.
## Running transforms

* Notebooks
* [Various](../../examples/notebooks/README.md) - many notebook examples for code and language
* [Example data processing pipelines](../../examples/notebooks/README.md) - use these to quickly process your data; the notebook structure lets a user select or de-select transforms and change the order of processing as desired.
* Command line
* [Using a docker image](run-transform-image.md) - runs a transform in a docker transform image
* [Using a virtual environment](run-transform-venv.md) - runs a transform on the local host
2 changes: 1 addition & 1 deletion examples/notebooks/README.md
@@ -1,4 +1,4 @@
# Data Prep Kit Examples

* [Code](code) - shows ingestion and processing of github zip downloads.
* [Code](code)
* [Language](language) - coming soon.
