Merge pull request #415 from IBM/add-data-access-readme-1
Added documentation on how to use data-access-factory
daw3rd authored Jul 17, 2024
2 parents 736055c + 4bd762d commit b738ef9
Showing 2 changed files with 98 additions and 0 deletions.
97 changes: 97 additions & 0 deletions data-processing-lib/doc/data-access-factory.md
@@ -0,0 +1,97 @@
# Data Access Factory


## Introduction
[Data Access Factory (DAF)](../python/src/data_processing/data_access/data_access_factory.py) provides a mechanism to create
[DataAccess](../python/src/data_processing/data_access/data_access.py)
implementations that support
the processing of input data files and the expected destination
of the processed files.
The `DataAccessFactory` is most often configured using command line arguments
to specify the type of `DataAccess` instance to create
(see the `--data_*` options [here](python-launcher-options.md)).
Currently, it supports
[DataAccessLocal](../python/src/data_processing/data_access/data_access_local.py)
and
[DataAccessS3](../python/src/data_processing/data_access/data_access_s3.py)
implementations.

You can use DAF and the resulting DataAccess implementation in your transform logic to
read and write extra file(s), for example, write log or metadata files.

This document explains how to initialize and use DAF to write a file using a `DataAccess` instance.

## Data Access
Each Data Access implementation supports the notion of processing a
set of input files to produce a set of output files, generally in a 1:1 mapping,
although this is not strictly required.
With this in mind, the following functions are provided:
* Input file identification by
    * input folder
    * sub-directory selection (aka data sets)
    * file extension
    * maximum count
    * random sampling
* Output file identification (for a given input)
* Checkpointing - determines the set of input files that need processing
(i.e. which do not have corresponding output files).
* Reading and writing of files.
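
The checkpointing idea in the list above can be illustrated without the library. The following is a minimal sketch using only the standard library; the helper `files_to_process` is hypothetical and assumes a 1:1 mapping where an output file has the same name as its input. The actual `DataAccess` implementations may decide this differently.

```python
import tempfile
from pathlib import Path


def files_to_process(input_dir: Path, output_dir: Path, extension: str = ".parquet") -> list[str]:
    """Hypothetical helper: names of input files with no same-named output file."""
    # Names of files that already exist in the output folder
    done = {p.name for p in output_dir.glob(f"*{extension}")}
    # Keep only inputs with the requested extension that are not done yet
    return sorted(p.name for p in input_dir.glob(f"*{extension}") if p.name not in done)


with tempfile.TemporaryDirectory() as tmp:
    inp, out = Path(tmp, "input"), Path(tmp, "output")
    inp.mkdir()
    out.mkdir()
    # Two candidate inputs plus one file with a non-matching extension
    for name in ("a.parquet", "b.parquet", "notes.txt"):
        (inp / name).touch()
    # Pretend a.parquet was already processed in an earlier run
    (out / "a.parquet").touch()
    pending = files_to_process(inp, out)
    print(pending)  # ['b.parquet']
```

Only `b.parquet` remains, because `a.parquet` already has an output and `notes.txt` is filtered out by extension.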

Each transform runtime uses a DataAccessFactory to create a DataAccess instance which
is then used to identify and process the target input data.
Transforms may use this runtime instance or create their own DataAccessFactory.
The latter might be needed when reading or writing other files from or to other locations.

## Creating DAF instance

```python
from data_processing.data_access import DataAccessFactory
daf = DataAccessFactory("myprefix_", False)
```
The first parameter, `cli_arg_prefix`, is the prefix used to look up parameter names;
here, the factory looks for names starting with `myprefix_`. Generally the prefix is specific to the
transform.

## Preparing and setting parameters
```python
from argparse import Namespace

# S3 credentials (placeholder values)
s3_cred = {
    "access_key": "XXXX",
    "secret_key": "XXX",
    "url": "https://s3.XXX",
}

# Input and output locations
s3_conf = {
    "input_folder": "<COS Location of input>",
    "output_folder": "cos-optimal-llm-pile/somekey",
}

args = Namespace(
myprefix_s3_cred=s3_cred,
myprefix_s3_config=s3_conf,
)
assert daf.apply_input_params(args)

```
`apply_input_params` extracts and uses the parameters in `args` whose names start with the
prefix `myprefix_` (`myprefix_s3_cred` and `myprefix_s3_config` in this example).
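
To make the prefix convention concrete, here is a minimal sketch of prefix-based selection from a `Namespace`. The helper `extract_prefixed` is hypothetical and is not part of `DataAccessFactory`; it only illustrates the idea that parameters outside the prefix are ignored.

```python
from argparse import Namespace


def extract_prefixed(args: Namespace, prefix: str) -> dict:
    """Hypothetical helper: collect attributes whose names start with prefix, prefix stripped."""
    return {k[len(prefix):]: v for k, v in vars(args).items() if k.startswith(prefix)}


example_args = Namespace(
    myprefix_s3_cred={"access_key": "XXXX"},
    myprefix_s3_config={"input_folder": "in", "output_folder": "out"},
    other_option=42,  # does not start with the prefix, so it is ignored
)
selected = extract_prefixed(example_args, "myprefix_")
print(sorted(selected))  # ['s3_config', 's3_cred']
```

Using a transform-specific prefix in this way lets several factories read from the same argument set without their parameters colliding.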

The above is equivalent to passing the following on the command line to a runtime launcher:
```shell
... --myprefix_s3_cred '{ "access_key": "XXXX", "secret_key": "XXX", "url": "https://s3.XXX" }' \
    --myprefix_s3_config '{ "input_folder": "<COS Location of input>", "output_folder": "cos-optimal-llm-pile/somekey" }'
```
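
As a rough illustration of that equivalence, the sketch below parses JSON-formatted option values into dictionaries with plain `argparse`, producing a `Namespace` shaped like the one built by hand earlier. This is an assumption about the general approach, not the launcher's actual parsing code, and the folder values are placeholders.

```python
import argparse
import json

# Each option value is a JSON string that is converted to a dict while parsing
parser = argparse.ArgumentParser()
parser.add_argument("--myprefix_s3_cred", type=json.loads, default=None)
parser.add_argument("--myprefix_s3_config", type=json.loads, default=None)

# Simulate the command line instead of reading sys.argv
parsed = parser.parse_args([
    "--myprefix_s3_cred",
    '{"access_key": "XXXX", "secret_key": "XXX", "url": "https://s3.XXX"}',
    "--myprefix_s3_config",
    '{"input_folder": "in", "output_folder": "out"}',
])
print(parsed.myprefix_s3_config["output_folder"])  # out
```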

## Create DataAccess and write file
```python
data_access = daf.create_data_access()
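# The path is relative to the storage configured above.
# Note: depending on the library version, save_file may expect bytes;
# if so, pass "success".encode("utf-8") instead of the string.
data_access.save_file("data/report.log", "success")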
```

The call to `create_data_access` creates the `DataAccess` instance (`DataAccessS3` in this case).
`save_file` then writes a new file at `data/report.log` with the content `success`.

When writing a transform, the `DataAccessFactory` is generally created in the
transform's configuration class and passed to the transform's initializer by the runtime.
See [this section](transform-external-resources.md) on accessing external resources for details.
1 change: 1 addition & 0 deletions data-processing-lib/doc/overview.md
@@ -45,6 +45,7 @@ To learn more consider the following:
* [Transform Exceptions](transform-exceptions.md)
* [Transform Runtimes](transform-runtimes.md)
* [Transform Examples](transform-tutorial-examples.md)
* [Data Access Factory](data-access-factory.md)
* [Testing Transforms](transform-testing.md)
* [Utilities](transformer-utilities.md)
* [Architecture Deep Dive](architecture.md)