Merge pull request #408 from IBM/Documentation-Updates
Documentation updates
daw3rd authored Jul 15, 2024
2 parents c334038 + 6a11550 commit d4e3b50
Showing 4 changed files with 34 additions and 45 deletions.
71 changes: 29 additions & 42 deletions README.md
@@ -19,14 +19,14 @@ As the variety of use cases grows, so does the need to support:
- New ways of transforming the data to optimize the performance of the resulting LLMs for each specific use case.
- A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale

Data Prep Kit offers implementations of commonly needed data transformations, called *modules*, for both Code and Language modalities.
Data Prep Kit offers implementations of commonly needed data preparation steps, called *modules* or *transforms*, for both Code and Language modalities.
The goal is to offer high-level APIs for developers to quickly get started in working with their data, without needing expertise in the underlying runtimes and frameworks.

## 📝 Table of Contents
- [About](#about)
- [Quick Start](doc/quick-start/quick-start.md)
- [Transform Framework](data-processing-lib/doc/overview.md)
- [Pipeline Automation](#pipeline)
- [Data Processing Modules](#modules)
- [Data Processing Framework](#data-proc-lib)
- [Repository Use and Navigation](doc/repo.md)
- [How to Contribute](CONTRIBUTING.md)
- [Acknowledgments](#acknowledgement)
@@ -42,11 +42,26 @@ Eventually, Data Prep Kit will offer consistent APIs and configurations across t
1. Python runtime
2. Ray runtime (local and distributed)
3. Spark runtime (local and distributed)
4. [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) (local and distributed, wrapping Ray)
4. Kubeflow Pipelines (local and distributed, wrapping Ray)

The current matrix for the combination of modules and supported runtimes is shown in the table below.
Contributors are welcome to add new modules as well as add runtime support for existing modules!
Features of the toolkit:

- It aims to accelerate unstructured data prep for the "long tail" of LLM use cases.
- It offers a growing set of [module](/transforms) implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing.
- It provides a growing set of [sample data processing pipelines](/examples) that can be used for real enterprise use cases.
- It provides the [Data processing library](data-processing-lib/ray) to enable contribution of new custom modules targeting new use cases.
- It uses [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp/doc/simple_transform_pipeline.md).

Data modalities supported:

* Code - support for code datasets as downloaded .zip files of GitHub repositories converted to
[parquet](https://arrow.apache.org/docs/python/parquet.html) files.
* Language - support for natural language datasets; like the code transforms, these operate on parquet files.

Support for additional data modalities is expected in the future, and contributions adding support for additional data formats are welcome!

## Data Preparation Modules <a name = "modules"></a>
The matrix below shows the combination of modules and supported runtimes. All modules can be accessed [here](/transforms) and can be combined to form data processing pipelines, as shown in the [examples](/examples) folder; a minimal sketch of such a pipeline follows the table below.

|Modules | Python-only | Ray | Spark | KFP on Ray |
|------------------------------ |------------------|------------------|------------------|------------------------|
@@ -63,33 +78,11 @@ Contributors are welcome to add new modules as well as add runtime support for e
|Profiler | |:white_check_mark:| |:white_check_mark: |
|Tokenizer |:white_check_mark:|:white_check_mark:| |:white_check_mark: |
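
To make the idea of combining modules concrete, here is a minimal sketch of a two-step pipeline over a single parquet file using [pyarrow](https://arrow.apache.org/docs/python/); the step functions, the `contents` column name, and the file paths are illustrative assumptions only, not modules shipped by the toolkit.

```python
# Hypothetical two-step pipeline over one parquet file; the step functions,
# the "contents" column, and the paths are placeholders, not toolkit modules.
import pyarrow.compute as pc
import pyarrow.parquet as pq

def add_doc_length(table):
    """Annotator-style step: add a character-count column, keep every row."""
    return table.append_column("num_chars", pc.utf8_length(table["contents"]))

def drop_empty_docs(table):
    """Filter-style step: remove rows whose documents are empty."""
    return table.filter(pc.greater(table["num_chars"], 0))

table = pq.read_table("input.parquet")
for step in (add_doc_length, drop_empty_docs):   # run the steps in order
    table = step(table)
pq.write_table(table, "output.parquet")
print(f"kept {table.num_rows} rows")
```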

Contributors are welcome to add new modules as well as add runtime support for existing modules!


Features of the toolkit:

- It aims to accelerate unstructured data prep for the "long tail" of LLM use cases.
- It offers a growing set of module implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing.
- It provides a growing set of sample pipelines developed for real enterprise use cases.
- It provides the [Data processing library](data-processing-lib/ray) to enable contribution of new custom modules targeting new use cases.
- It uses [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/)-based [workflow automation](kfp/doc/simple_transform_pipeline.md).

Data modalities supported:

* Code - support for code datasets as downloaded .zip files of GitHub repositories converted to
[parquet](https://arrow.apache.org/docs/python/parquet.html) files.
* Language - Future releases will provide transforms specific to natural language, and like the code transformations, will operate on parquet files.

Support for additional data modalities is expected in the future.

### Data Processing Library
A Python-based library that has ready-to-use transforms that can be supported across a variety of runtimes.
We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
Every parquet file follows a set
[schema](tools/ingest2parquet/).
Data is converted from raw form (e.g., zip files for GitHub repositories) to parquet files by the
[code2parquet](/transforms/code/code2parquet)
tool that also adds the necessary fields in the schema.
A user can then use one or more of the [available transforms](transforms) to process their data.
## Data Processing Framework <a name = "data-proc-lib"></a>
At the core of the framework is a data processing library that provides a systematic way to implement the data processing modules. The library is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. We use the popular [parquet](https://arrow.apache.org/docs/python/parquet.html) format to store the data (code or language).
Every parquet file follows a set [schema](/transforms/code/code2parquet/python/README.md). A user can apply one or more of the transforms (or modules) discussed above to process their data.
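
For reference, the schema and row count of any of these parquet files can be inspected directly with pyarrow; the file name below is a placeholder.

```python
# Inspect the schema and size of a parquet file; "sample.parquet" is a placeholder path.
import pyarrow.parquet as pq

pf = pq.ParquetFile("sample.parquet")
print(pf.schema_arrow)        # column names and types, i.e. the schema the file follows
print(pf.metadata.num_rows)   # number of documents (rows) stored in the file
```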

#### Transform Design
A transform can follow one of two patterns: annotator or filter.
@@ -107,7 +100,8 @@ or [Spark](https://spark.apache.org) wrappers are provided, to readily scale out
A generalized workflow is shown [here](doc/data-processing.md).
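
The distinction between the two patterns can be sketched with a pair of toy functions over a pyarrow Table: an annotator adds information (a new column) while keeping every row, whereas a filter keeps the schema but drops rows. These are illustrations of the concepts only, not the library's transform interface, and the `contents` column name is an assumption.

```python
# Toy illustrations of the annotator and filter patterns; not the library's actual API.
import pyarrow as pa
import pyarrow.compute as pc

def annotator(table: pa.Table) -> pa.Table:
    """Add a boolean column marking documents that mention a license; rows are preserved."""
    return table.append_column("has_license", pc.match_substring(table["contents"], "License"))

def filter_transform(table: pa.Table) -> pa.Table:
    """Drop rows flagged by the annotator as lacking a license; the schema is unchanged."""
    return table.filter(table["has_license"])
```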

#### Bring Your Own Transform
One can add new transforms by bringing in Python-based processing logic and using the Data Processing Library to build and contribute transforms.
One can add new transforms by bringing in Python-based processing logic and using the Data Processing Library to build and contribute them. We provide an [example transform](/transforms/universal/noop) that can serve as a template for new, simple transforms.

More details on the data processing library are [here](data-processing-lib/doc/overview.md).
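
In outline, a custom transform is a class whose transform method takes a pyarrow Table and returns a list of output Tables plus a metadata dictionary. The skeleton below is a loose sketch of that shape, modeled on the noop example; the base class, configuration handling, and exact method signature should be taken from the [noop sources](/transforms/universal/noop), not from this sketch.

```python
# A loose sketch of a custom transform modeled on the noop template; the real library's
# base class, configuration plumbing, and method signature may differ from this shape.
from typing import Any

import pyarrow as pa
import pyarrow.compute as pc


class MyFilterTransform:
    """Filter-style transform: drops documents shorter than a configured minimum length."""

    def __init__(self, config: dict | None = None):
        # "min_chars" and "content_column" are assumed configuration keys for illustration.
        cfg = config or {}
        self.min_chars = cfg.get("min_chars", 10)
        self.column = cfg.get("content_column", "contents")

    def transform(self, table: pa.Table) -> tuple[list[pa.Table], dict[str, Any]]:
        mask = pc.greater_equal(pc.utf8_length(table[self.column]), self.min_chars)
        out = table.filter(mask)
        return [out], {"rows_in": table.num_rows, "rows_out": out.num_rows}
```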

#### Automation
@@ -120,21 +114,14 @@ for creating and managing the Ray cluster and [KubeRay API server](https://githu
to interact with the KubeRay operator. An additional [framework](kfp/kfp_support_lib) along with several
[kfp components](kfp/kfp_ray_components) is used to simplify the pipeline implementation.
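
For orientation, a single-step KFP (v1) pipeline has roughly the shape sketched below; the component file name and its parameter names are placeholders, and the ready-made components referenced above live under [kfp/kfp_ray_components](kfp/kfp_ray_components).

```python
# Rough sketch of a one-step Kubeflow Pipelines (v1) definition; the component file
# and its parameter names are placeholders, not the components shipped in this repo.
import kfp
import kfp.dsl as dsl

# Load a (hypothetical) component definition describing one transform step.
transform_op = kfp.components.load_component_from_file("transform_component.yaml")

@dsl.pipeline(name="single-transform", description="Run one transform step on a cluster")
def single_transform_pipeline(input_path: str, output_path: str):
    transform_op(input_folder=input_path, output_folder=output_path)

if __name__ == "__main__":
    # Compile the pipeline to a YAML package that can be uploaded to a KFP instance.
    kfp.compiler.Compiler().compile(single_transform_pipeline, "single_transform_pipeline.yaml")
```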

## Automate a Pipeline<a name="pipeline"></a>
Data preprocessing can be automated by running transforms as a Kubeflow Pipeline (KFP).
The project facilitates the creation of a local [Kind cluster](https://kind.sigs.k8s.io/) with all the required
software and test data, or deployment of required software on an existing cluster.
See [Set up a Kubernetes cluster for KFP execution](kfp/doc/setup.md).

A simple transform pipeline [tutorial](kfp/doc/simple_transform_pipeline.md) explains the pipeline creation and execution.
In addition, if you want to combine several transforms in a single pipeline, see the [multi-step pipeline](kfp/doc/multi_transform_pipeline.md) documentation.

When you finish working with the cluster and want to clean up or destroy it, see [clean up the cluster](../kfp/doc/setup.md#cleanup).


## &#x2B50; Acknowledgements <a name = "acknowledgement"></a>
Thanks to the [BigCode Project](https://github.com/bigcode-project), which served as the source for borrowing the code quality metrics.
## Acknowledgements <a name = "acknowledgement"></a>
Thanks to the [BigCode Project](https://github.com/bigcode-project), which served as the source for a few of the code quality metrics.



4 changes: 3 additions & 1 deletion doc/quick-start/new-transform-inside.md
@@ -1,2 +1,4 @@
# Creating a New Transform in this Repository
WIP - but please see [transform conventions](../../transforms/README.md).

An easy way is to replicate the [noop transform](transforms/universal/noop), rename the classes to suit your needs, and add your business logic in [noop_transform.py](../../transforms/universal/noop/python/src/noop_transform.py).
Please see the [tutorials](../transform-tutorial-examples.md) for more details.
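
Before wiring new logic into the framework, it can be sanity-checked on a small in-memory table; the function and column name below are placeholders for whatever you add to your copy of noop_transform.py.

```python
# Quick local sanity check of new business logic on a tiny in-memory table;
# the function and the "contents" column are placeholders for your own code.
import pyarrow as pa
import pyarrow.compute as pc

def my_business_logic(table: pa.Table) -> pa.Table:
    """Stand-in for the logic you would add to your copy of noop_transform.py."""
    return table.append_column("has_todo", pc.match_substring(table["contents"], "TODO"))

table = pa.table({"contents": ["# TODO: fix", "print('hello')"]})
print(my_business_logic(table).to_pylist())
```
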
2 changes: 1 addition & 1 deletion doc/quick-start/quick-start.md
@@ -4,7 +4,7 @@ Here we provided short examples of various uses of the Data Prep Kit.
## Running transforms

* Notebooks
* [Various](../../examples/notebooks/README.md) - many notebook examples for code and language
* [Example data processing pipelines](../../examples/notebooks/README.md) - use these to quickly process your data; the notebook structure lets a user select or de-select transforms and change the order of processing as desired.
* Command line
* [Using a docker image](run-transform-image.md) - runs a transform in a docker transform image
* [Using a virtual environment](run-transform-venv.md) - runs a transform on the local host
2 changes: 1 addition & 1 deletion examples/notebooks/README.md
@@ -1,4 +1,4 @@
# Data Prep Kit Examples

* [Code](code) - shows ingestion and processing of github zip downloads.
* [Code](code)
* [Language](language) - coming soon.
