Merge pull request #415 from IBM/add-data-access-readme-1

Added documentation on how to use data-access-factory
IBM · Jul 17, 2024 · b738ef9 · b738ef9
2 parents 736055c + 4bd762d
commit b738ef9
Showing 2 changed files with 98 additions and 0 deletions.
diff --git a/data-processing-lib/doc/data-access-factory.md b/data-processing-lib/doc/data-access-factory.md
@@ -0,0 +1,97 @@
+# Data Access Factory 
+
+
+## Introduction
+[Data Access Factory (DAF)](../python/src/data_processing/data_access/data_access_factory.py) provides a mechanism to create 
+[DataAccess](../python/src/data_processing/data_access/data_access.py) 
+implementations that support
+the processing of input data files and the expected destination
+of the processed files.
+The `DataAccessFactory` is most often configured using command line arguments
+to specify the type of `DataAccess` instance to create
+(see the `--data_*` options [here](python-launcher-options.md)).
+Currently, it supports
+[DataAccessLocal](../python/src/data_processing/data_access/data_access_local.py)
+and 
+[DataAccessS3](../python/src/data_processing/data_access/data_access_s3.py)
+implementations.
+
+You can use DAF and the resulting `DataAccess` implementation in your transform logic to
+read and write additional files, such as log or metadata files.
+
+This document explains how to initialize and use DAF to write a file using a `DataAccess` instance. 
+
+## Data Access 
+Each Data Access implementation supports the notion of processing a
+set of input files to produce a set of output files, generally in a 1:1 mapping, 
+although this is not strictly required.
+With this in mind, the following functionality is provided: 
+* Input file identification by 
+    * input folder 
+    * sub-directory selection (aka data sets)
+    * file extension
+    * maximum count
+    * random sampling
+* Output file identification (for a given input)
+* Checkpointing - determines the set of input files that still need processing 
+(i.e., those that do not have corresponding output files).
+* Reading and writing of files.
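The checkpointing idea in the list above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `files_to_process` is a hypothetical helper, and it assumes a 1:1 mapping in which each output file keeps the input's path relative to the input folder.

```python
from pathlib import PurePosixPath


def files_to_process(input_files, output_files, input_folder, output_folder):
    """Return the input files that have no corresponding output file.

    Hypothetical helper: assumes each output keeps the input's path
    relative to the input folder (a 1:1 mapping).
    """
    existing = set(output_files)
    pending = []
    for path in input_files:
        # Map the input path onto the location its output is expected at.
        relative = PurePosixPath(path).relative_to(input_folder)
        expected = str(PurePosixPath(output_folder) / relative)
        if expected not in existing:
            pending.append(path)
    return pending


inputs = ["in/a.parquet", "in/ds1/b.parquet", "in/ds1/c.parquet"]
outputs = ["out/a.parquet", "out/ds1/c.parquet"]
pending = files_to_process(inputs, outputs, "in", "out")
print(pending)  # only in/ds1/b.parquet lacks an output
```

A real `DataAccess` implementation would obtain both file listings from local storage or S3 rather than from in-memory lists.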
+
+Each transform runtime uses a DataAccessFactory to create a DataAccess instance which
+is then used to identify and process the target input data.
+Transforms may use this runtime instance or create their own `DataAccessFactory`,
+for example when they need to read or write other files in other locations.
+
+## Creating a DAF instance
+
+```python
+from data_processing.data_access import DataAccessFactory
+daf = DataAccessFactory("myprefix_", False)
+```
+The first parameter, `cli_arg_prefix`, is the prefix used to look up parameter names;
+here, the factory looks for names starting with `myprefix_`. Generally the prefix used is specific to the
+transform.
+
+## Preparing and setting parameters 
+```python
+from argparse import Namespace
+
+s3_cred = {
+    "access_key": "XXXX",
+    "secret_key": "XXX",
+    "url": "https://s3.XXX",
+}
+
+s3_conf = {
+    "input_folder": "<COS Location of input>",
+    "output_folder": "cos-optimal-llm-pile/somekey",
+}
+
+args = Namespace(
+    myprefix_s3_cred=s3_cred,
+    myprefix_s3_config=s3_conf,
+)
+assert daf.apply_input_params(args)
+
+```
+`apply_input_params` extracts and uses the parameters in `args` whose names start with the
+prefix `myprefix_` (`myprefix_s3_cred` and `myprefix_s3_config` in this example).
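To make the prefix-based selection concrete, here is a minimal sketch of how such a lookup could work. It is not the code inside `apply_input_params`; `extract_prefixed` is a hypothetical helper, and `other_option` is an invented attribute included only to show that unrelated names are ignored.

```python
from argparse import Namespace


def extract_prefixed(args: Namespace, prefix: str) -> dict:
    """Collect attributes whose names start with `prefix`, with the prefix removed."""
    return {
        name[len(prefix):]: value
        for name, value in vars(args).items()
        if name.startswith(prefix)
    }


args = Namespace(
    myprefix_s3_cred={"access_key": "XXXX"},
    myprefix_s3_config={"input_folder": "in", "output_folder": "out"},
    other_option=1,  # hypothetical attribute that belongs to another component
)
selected = extract_prefixed(args, "myprefix_")
print(sorted(selected))
```

Using a transform-specific prefix lets several factories read from the same `Namespace` without their settings colliding.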
+
+The above is equivalent to passing the following on the command line to a runtime launcher:
+```shell
+... --myprefix_s3_cred '{ "access_key": "XXXX", "secret_key": "XXX", "url": "https://s3.XXX" }' \
+    --myprefix_s3_config '{ "input_folder": "<COS Location of input>", "output_folder": "cos-optimal-llm-pile/somekey" }'
+```
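The relationship between the command-line form and the `Namespace` form can be illustrated with the standard library alone. The sketch below assumes the dictionary-valued options are parsed as JSON with `json.loads`; the actual launcher may parse them differently, so treat the parser setup as an assumption.

```python
import argparse
import json

# Assumption: each option carries a JSON object that is decoded into a dict.
parser = argparse.ArgumentParser()
parser.add_argument("--myprefix_s3_cred", type=json.loads, default=None)
parser.add_argument("--myprefix_s3_config", type=json.loads, default=None)

argv = [
    "--myprefix_s3_cred",
    '{"access_key": "XXXX", "secret_key": "XXX", "url": "https://s3.XXX"}',
    "--myprefix_s3_config",
    '{"input_folder": "<COS Location of input>", "output_folder": "cos-optimal-llm-pile/somekey"}',
]
args = parser.parse_args(argv)

# args is now a Namespace holding two dicts, like the one built by hand earlier.
print(type(args.myprefix_s3_cred).__name__, type(args.myprefix_s3_config).__name__)
```

Note that both values must be well-formed JSON objects with explicit keys; a value without its key (for example a bare output path) fails to parse.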
+
+## Create DataAccess and write file 
+```python
+data_access = daf.create_data_access()
+data_access.save_file("data/report.log", "success")
+```
+
+The call to `create_data_access` creates the `DataAccess` instance (`DataAccessS3` in this case).
+`save_file` then writes a new file at `data/report.log` with the content `success`.
+
+When writing a transform, the `DataAccessFactory` is generally created in the
+transform's configuration class and passed to the transform's initializer by the runtime. 
+See [this section](transform-external-resources.md) on accessing external resources for details.
diff --git a/data-processing-lib/doc/overview.md b/data-processing-lib/doc/overview.md
@@ -45,6 +45,7 @@ To learn more consider the following:
 * [Transform Exceptions](transform-exceptions.md)
 * [Transform Runtimes](transform-runtimes.md)
 * [Transform Examples](transform-tutorial-examples.md)
+* [Data Access Factory](data-access-factory.md)
 * [Testing Transforms](transform-testing.md)
 * [Utilities](transformer-utilities.md)
 * [Architecture Deep Dive](architecture.md)