Commit

update to version 3 (#70)
mshadbolt authored Aug 15, 2024
1 parent e57ca01 commit c532299
Showing 7 changed files with 40 additions and 14 deletions.
Binary file removed dist/clinical_ETL-2.2.0-py3-none-any.whl
Binary file not shown.
Binary file removed dist/clinical_ETL-2.2.0.tar.gz
Binary file not shown.
Binary file added dist/clinical_ETL-3.0.0-py3-none-any.whl
Binary file not shown.
Binary file added dist/clinical_etl-3.0.0.tar.gz
Binary file not shown.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -3,7 +3,7 @@ requires = ["setuptools >= 61.0"]
build-backend = "setuptools.build_meta"

[project]
version = "2.2.1"
version = "3.0.0"
name = "clinical_ETL"
dependencies = [
"pandas>=2.1.0",
48 changes: 36 additions & 12 deletions src/clinical_ETL.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
Metadata-Version: 2.1
Name: clinical_ETL
Version: 2.2.0
Version: 3.0.0
Summary: ETL module for transforming clinical CSV data into properly-formatted packets for ingest into Katsu
Project-URL: Repository, https://github.com/CanDIG/clinical_ETL_code
Requires-Python: >=3.10
@@ -50,6 +50,7 @@ Set up and activate a [virtual environment](https://docs.python.org/3/tutorial/v
python -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
```

[See here for Windows instructions](https://realpython.com/python-virtual-environments-a-primer/)

Clone this repo and enter the repo directory
@@ -63,38 +64,49 @@ Install the repo's requirements in your virtual environment
pip install -r requirements.txt
```

>[!NOTE]
> If Python can't find the `clinical_etl` module when running `CSVConvert`, install the dependency manually:
> ```
> pip install -e clinical_ETL_code/
> ```

Before running the script, you will need your input files: clinical data in a tabular format (`xlsx`/`csv`) that can be read into the program, and a cohort directory containing the files that define the schema and mapping configurations.

### Input file/s format

The input for `CSVConvert` is either a single xlsx file, a single csv, or a directory of csvs that contain your clinical data. If providing a spreadsheet, there can be multiple sheets (usually one for each sub-schema). Examples of how csvs may look can be found in [test_data/raw_data](test_data/raw_data).
The input for `CSVConvert` is either a single xlsx file, a single csv, or a directory of csvs that contain your clinical data. If providing a spreadsheet, there can be multiple sheets (usually one for each sub-schema). Examples of how csvs may look can be found in [tests/raw_data](tests/raw_data).

All rows must contain identifiers that allow linkage between the objects in the schema, for example, a row that describes a Treatment must have a link to the Donor / Patient id for that Treatment.
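
As a rough illustration of the linkage requirement, the sketch below checks that every row of a treatment table refers to a donor id that exists in the donor table. The column names (`submitter_donor_id`, `submitter_treatment_id`), the sample rows and the helper `find_unlinked_rows` are assumptions made for this example and are not taken from this repository.

```python
import csv
import io

# Hypothetical sample data; the column names are assumptions for illustration.
DONORS_CSV = """submitter_donor_id,gender
DONOR_1,Woman
DONOR_2,Man
"""

TREATMENTS_CSV = """submitter_treatment_id,submitter_donor_id
TR_1,DONOR_1
TR_2,DONOR_3
"""


def find_unlinked_rows(parent_csv, child_csv, key):
    """Return child rows whose `key` value is missing from the parent table."""
    parent_ids = {row[key] for row in csv.DictReader(io.StringIO(parent_csv))}
    return [
        row
        for row in csv.DictReader(io.StringIO(child_csv))
        if row[key] not in parent_ids
    ]


unlinked = find_unlinked_rows(DONORS_CSV, TREATMENTS_CSV, "submitter_donor_id")
for row in unlinked:
    print(f"No donor found for treatment {row['submitter_treatment_id']}")
```

Running a check like this before conversion can surface orphaned rows early, before they appear as validation errors.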

Data should be [tidy](https://r4ds.had.co.nz/tidy-data.html), with each variable in a separate column, each row representing an observation, and a single data entry in each cell. In the case of fields that can accept an array of values, the values within a cell should be delimited such that a mapping function can accurately return an array of permissible values.
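
To make the point about delimited cells concrete, here is a minimal sketch of how a cell holding several values might be split into an array. The `|` delimiter and the helper name `split_cell` are assumptions for illustration; the mapping functions shipped with this repository may behave differently.

```python
def split_cell(cell, delimiter="|"):
    """Split a delimited cell into a list of stripped, non-empty values."""
    if cell is None:
        return []
    return [part.strip() for part in cell.split(delimiter) if part.strip()]


# A single cell containing three values separated by the assumed delimiter.
values = split_cell("Fatigue | Nausea|Pain")
print(values)
```

Whatever delimiter you choose, it should not occur inside any individual value, otherwise the split will produce entries that are not permissible values.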

If you are working with exports from RedCap, the sample files in the [`sample_inputs/redcap_example`](sample_inputs/redcap_example) folder may be helpful.

### Setting up a cohort directory

For each dataset (cohort) that you want to convert, create a directory outside of this repository. For CanDIG devs, this will be in the private `data` repository. This cohort directory should contain the same files as shown in the `sample_inputs` directory, which are:
For each dataset (cohort) that you want to convert, create a directory outside of this repository. For CanDIG devs, this will be in the private `data` repository. This cohort directory should contain the same files as shown in the [`sample_inputs/generic_example`](sample_inputs/generic_example) directory, which are:

* a [`manifest.yml`](#Manifest-file) file with configuration settings for the mapping and schema validation
* a [mapping template](#Mapping-template) csv that lists custom mappings for each field (based on `moh_template.csv`)
* (if needed) One or more python files that implement any cohort-specific mapping functions (See [mapping functions](mapping_functions.md) for detailed information)

Example files showing how to convert a large single csv export, such as those exported from a RedCap database, can be found in [`sample_inputs/redcap_example`](sample_inputs/redcap_example).

> [!IMPORTANT]
> If you are placing this directory under version control and the cohort is not sample / synthetic data, do not place raw or processed data files in this directory, to avoid any possibility of committing protected data.

#### Manifest file
The `manifest.yml` file contains settings for the cohort mapping. There is a sample file in [`sample_inputs/manifest.yml`](sample_inputs/manifest.yml) with documentation and example inputs. The fields are:
The `manifest.yml` file contains settings for the cohort mapping. There is a sample file in [`sample_inputs/generic_example/manifest.yml`](sample_inputs/generic_example/manifest.yml) with documentation and example inputs. The fields are:

| field | description |
|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| description | A brief description of what mapping task this manifest is being used for |
| mapping | the mapping template csv file that lists the mappings for each field based on `moh_template.csv`, assumed to be in the same directory as the `manifest.yml` file |
| identifier | the unique identifier for the donor or root node |
| schema | a URL to the openapi schema file |
| schema_class | The name of the class in the schema used as the model for creating the map.json. Currently supported: `MoHSchema` - for clinical MoH data and `GenomicSchema` for creating a genomic ingest linking file. |
| schema_class | The name of the class in the schema used as the model for creating the map.json. Currently supported: `MoHSchemaV2` and `MoHSchemaV3` - for clinical MoH data and `GenomicSchema` for creating a genomic ingest linking file. |
| reference_date | a reference date used to calculate date intervals, formatted as a mapping entry for the mapping template |
| date_format | Specify the format of the dates in your input data. Use any combination of the characters `DMY` to specify the order (e.g. `DMY`, `MDY`, `YMD`, etc). |
| functions | A list of one or more filenames containing additional mapping functions, can be omitted if not needed. Assumed to be in the same directory as the `manifest.yml` file |
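
The fields above can be sanity-checked before running a conversion. The sketch below works on a manifest that has already been loaded into a Python dictionary. It assumes that every field except `functions` is required, which the table does not state outright, and all example values (file names, URL, identifier) are hypothetical placeholders. `check_manifest` is not part of this package.

```python
# Assumption: every field except `functions` is treated as required.
REQUIRED_FIELDS = [
    "description",
    "mapping",
    "identifier",
    "schema",
    "schema_class",
    "reference_date",
    "date_format",
]


def check_manifest(manifest):
    """Return a list of human-readable problems found in a manifest dict."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in manifest
    ]
    date_format = manifest.get("date_format")
    # `date_format` should use each of D, M and Y exactly once, in any order.
    if date_format is not None and sorted(str(date_format)) != ["D", "M", "Y"]:
        problems.append(f"date_format must be an ordering of D, M, Y: {date_format}")
    return problems


# Hypothetical manifest contents, standing in for a parsed manifest.yml.
example_manifest = {
    "description": "Example cohort mapping",
    "mapping": "example_mapping.csv",
    "identifier": "example_donor_id",
    "schema": "https://example.org/schema.yml",
    "schema_class": "MoHSchemaV3",
    "reference_date": "<mapping entry>",
    "date_format": "DMY",
}

print(check_manifest(example_manifest))
```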

#### Mapping template
@@ -128,6 +140,7 @@ usage: generate_schema.py [-h] --url URL [--out OUT]
options:
-h, --help show this help message and exit
--url URL URL to openAPI schema file (raw github link)
--schema Name of schema class. Default is MoHSchemaV3
--out OUT name of output file; csv extension will be added. Default is template
```
</details>
@@ -139,7 +152,7 @@ CSVConvert requires two inputs:
2. a path to a `manifest.yml`, in a directory that also contains the other files defined in [Setting up a cohort directory](#Setting-up-a-cohort-directory)

```
$ python src/clinical_etl/CSVConvert.py -h
python src/clinical_etl/CSVConvert.py -h
usage: CSVConvert.py [-h] --input INPUT --manifest MANIFEST [--test] [--verbose] [--index] [--minify]

options:
Expand All @@ -164,16 +177,16 @@ The main output `<INPUT_DIR>_map.json` and optional output`<INPUT_DIR>_indexed.j

Validation will automatically be run after the conversion is complete. Any validation errors or warnings will be reported both on the command line and as part of the `<INPUT_DIR>_map.json` file.

>[!NOTE]
> If Python can't find the `clinical_etl` module when running `CSVConvert`, install the dependency manually:
> ```
> pip install -e clinical_ETL_code/
> ```

#### Format of the output files

`<INPUT_DIR>_map.json` is the main output and contains the results of the mapping, conversion and validation as well as summary statistics.

The mapping and transformation result is found in the `"donors"` key.

Arrays of validation warnings and errors are found in `validation_warnings` & `validation_errors`.

Summary statistics about the completeness of the objects against the schema are in the `statistics` key.

A summarised example of the output is below:

```json
@@ -207,6 +220,17 @@ A summarised example of the output is below:
}
}
```

The mapping and transformation result is found in the `"donors"` key.

Arrays of validation warnings and errors are found in `validation_warnings` & `validation_errors`.

Summary statistics about the completeness of the objects against the schema are in the `statistics` key. You can create a readable CSV table
of the summary statistics by running `completeness_table.py`. The table will be saved in `<INPUT_DIR>_completeness.csv`.
```
python src/clinical_etl/completeness_table.py --input <INPUT_DIR>_map.json
```

`<INPUT_DIR>_validation_results.json` contains all validation warnings and errors.

`<INPUT_DIR>_indexed.json` contains information about how the ETL is looking up the mappings and can be useful for debugging. It is only generated if the `--index` argument is specified when CSVConvert is run. Note: This file can be very large if the input data is large.
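
For a quick look at a conversion result without opening the full file, the top-level keys described above can be counted with a few lines of Python. This is only a sketch: it assumes `donors`, `validation_warnings` and `validation_errors` are JSON arrays, and it uses a small in-memory document in place of a real `<INPUT_DIR>_map.json`. The helper `summarise_map` is not part of this package.

```python
import json


def summarise_map(map_data):
    """Count donors, validation warnings and validation errors in a parsed map file."""
    return {
        "donors": len(map_data.get("donors", [])),
        "validation_warnings": len(map_data.get("validation_warnings", [])),
        "validation_errors": len(map_data.get("validation_errors", [])),
    }


# Hypothetical stand-in for the contents of <INPUT_DIR>_map.json.
example_text = """
{
  "donors": [{}, {}],
  "validation_warnings": ["example warning"],
  "validation_errors": [],
  "statistics": {}
}
"""

summary = summarise_map(json.loads(example_text))
print(summary)
```

To use it on a real output file, replace the in-memory string with `json.load` on the opened file.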
4 changes: 3 additions & 1 deletion src/clinical_ETL.egg-info/SOURCES.txt
@@ -9,12 +9,14 @@ src/clinical_ETL.egg-info/requires.txt
src/clinical_ETL.egg-info/top_level.txt
src/clinical_etl/CSVConvert.py
src/clinical_etl/__init__.py
src/clinical_etl/completeness_table.py
src/clinical_etl/create_test_mapping.py
src/clinical_etl/generate_mapping_docs.py
src/clinical_etl/generate_schema.py
src/clinical_etl/genomicschema.py
src/clinical_etl/mappings.py
src/clinical_etl/mohschema.py
src/clinical_etl/mohschemav2.py
src/clinical_etl/mohschemav3.py
src/clinical_etl/schema.py
src/clinical_etl/validate_coverage.py
tests/test_data_ingest.py
