Commit 2e24452: improve readability of README

mshadbolt committed Nov 8, 2023
1 parent a3c968b commit 2e24452
Showing 1 changed file with 38 additions and 27 deletions: README.md

## Running from the command line

Most of the heavy lifting is done in the [`CSVConvert.py`](CSVConvert.py) script. See sections below for setting up the inputs. This script:
* reads a file (.xlsx or .csv) or a directory of files (csv)
* reads a [template file](#mapping-template) that contains a list of fields and (if needed) a mapping function
* for each field for each patient, applies the mapping function to transform the raw data into valid model data
* exports the data into one or more JSON files appropriate for ingest
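The steps above can be sketched in miniature as follows; all names here are illustrative and are not `CSVConvert.py`'s actual internals:

```python
# Minimal sketch of the conversion flow: raw rows in, model data out.
# The template maps each target field to a mapping function.

def uppercase(value):
    """Example mapping function: normalize a raw value."""
    return value.upper()

# target field -> mapping function (as listed in the template csv)
template = {"Donor.gender": uppercase}

# raw rows as read from an input sheet
raw_rows = [{"Donor.gender": "female"}, {"Donor.gender": "male"}]

# for each field of each row, apply the mapping function
converted = [
    {field: func(row[field]) for field, func in template.items()}
    for row in raw_rows
]
# converted == [{'Donor.gender': 'FEMALE'}, {'Donor.gender': 'MALE'}]
```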

Validation will automatically be run after the conversion is complete.

## Input file format

The input for `CSVConvert` is either a single .xlsx file, a single .csv file, or a directory of .csv files. If providing a spreadsheet, there can be multiple sheets (usually one for each sub-schema).

All rows must contain identifiers that allow linkage to the containing schema, for example, a row that describes a Treatment must have a link to the Donor / Patient id for that Treatment.

Data should be [tidy](https://r4ds.had.co.nz/tidy-data.html), with each variable in a separate column, each row representing an observation, and a single data entry in each cell. In the case of fields that can accept an array of values, the values within a cell should be delimited such that a mapping function can accurately return an array of permissible values.
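As a sketch, a mapping function for such a field might split the cell on a delimiter; the pipe delimiter and function name here are assumptions, not part of the tool:

```python
# Hypothetical mapping function that turns a delimited cell into an array.
# The "|" delimiter is an assumption; use whatever delimiter your data uses.

def pipe_delimited(cell):
    """'Chemotherapy|Radiation' -> ['Chemotherapy', 'Radiation']"""
    if cell is None or str(cell).strip() == "":
        return []
    return [v.strip() for v in str(cell).split("|")]

pipe_delimited("Chemotherapy | Radiation")  # -> ['Chemotherapy', 'Radiation']
pipe_delimited("")                          # -> []
```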

Depending on the format of your raw data, you may need to write an additional tidying script to pre-process it. For example, the `ingest_redcap_data.py` script converts the export format from REDCap into a set of input csvs for `CSVConvert`.

## Setting up a cohort directory

For each dataset (cohort) that you want to convert, create a directory outside of this repository. For CanDIG devs, this will be in the private `data` repository. This cohort directory should contain the same elements as shown in the `sample_inputs` directory, which are:

* a `manifest.yml` file with settings for the mapping
* the template csv that lists custom mappings for each field (based on `moh_template.csv`)
* (if needed) a python file that implements any cohort-specific mapping functions

> [!IMPORTANT]
> If you are placing this directory under version control and the cohort is not sample / synthetic data, do not place raw or processed data files in this directory, to avoid any possibility of committing protected data.

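A cohort directory might therefore look like this (the file names are illustrative; only `manifest.yml` is a fixed name):

```
new_cohort/
├── manifest.yml       # settings for the mapping
├── new_cohort.csv     # template with custom mappings, based on moh_template.csv
└── new_cohort.py      # cohort-specific mapping functions (optional)
```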
## Manifest file
The `manifest.yml` file contains settings for the cohort mapping. There is a sample file in `sample_inputs/manifest.yml` with documentation. The fields are:

```yaml
description: # A brief description of what mapping task this manifest is being used for
mapping: # the mapping template csv file that lists the mappings for each field based on `moh_template.csv`, assumed to be in the same directory as the manifest.yml file
identifier: # the unique identifier for the donor
schema: # a URL to the openapi schema file
functions: # One or more filenames containing additional mapping functions, can be omitted if not needed
  - # name of one or more python files with the set of mapping functions to be used in addition to the core set of functions specified in mappings.py. Assumed to be in the same directory as the manifest.yml file
```
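To illustrate how the relative entries are interpreted, the sketch below resolves them against the manifest's directory. This is not the tool's actual loader, and appending `.py` to the function file names is an assumption:

```python
import os

# Illustrative only: `mapping` and `functions` entries are assumed to be
# relative to the directory containing manifest.yml.

def resolve_manifest_paths(manifest, manifest_dir):
    resolved = dict(manifest)
    resolved["mapping"] = os.path.join(manifest_dir, manifest["mapping"])
    resolved["functions"] = [
        os.path.join(manifest_dir, name + ".py")
        for name in manifest.get("functions", [])
    ]
    return resolved
```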

## Mapping template

You'll need to create a mapping template that defines the mapping between the fields in your input files and the fields in the target schema. It also defines what mapping functions (if any) should be used to transform the input data into the required format to pass validation under the target schema.
Each line in the mapping template is composed of comma-separated values with two components. The first value is an `element` or field from the target schema and the second value contains a suggested `mapping method` or function to map a field from an input sheet to the identified `element`. Each `element` shows the full object linking path to each field required by the model. These values should not be edited.

If you're generating a mapping for the current MoH model, you can use the pre-generated [`moh_template.csv`](moh_template.csv) file. This file is modified from the auto-generated template to update a few fields that require specific handling.

You will need to edit the `mapping method` column in the following ways:
1. Replace the generic sheet names (e.g. `DONOR_SHEET`, `SAMPLE_REGISTRATIONS_SHEET`) with the sheet names you are using as your input to `CSVConvert.py`
2. Replace suggested field names with the relevant field/column names in your input sheets, if they differ

If the field does not map in the same way as the suggested mapping function, you will also need to:

3. Choose a different existing [mapping function](mappings.py) or write a new function that does the required transformation. See the [mapping instructions](mapping_functions.md) for detailed documentation on writing your own mapping functions.
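For example, a cohort-specific mapping function might look like the sketch below. The function name, signature, and value sets are illustrative assumptions; see the [mapping instructions](mapping_functions.md) for the actual conventions:

```python
# Hypothetical cohort-specific mapping function, e.g. defined in a
# new_cohort.py file referenced from the manifest's `functions` list.

SEX_TO_GENDER = {"M": "Man", "F": "Woman"}

def sex_to_gender(data_value):
    """Map a single-letter sex code in the raw data to a permissible
    gender value in the target schema; returns None if unmappable."""
    return SEX_TO_GENDER.get(str(data_value).strip().upper())

sex_to_gender("f")  # -> 'Woman'
```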
> [!NOTE]
> * Do not edit, delete, or re-order the template lines, except to adjust the sheet name, mapping function and field name in the `mapping method` column.
> * Fields not requiring mapping can be commented out with a `#` at the start of the line.
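For illustration only, edited template lines might look like the following. The sheet name `Donors`, field names, and function call are hypothetical, and the exact `element` paths and `mapping method` syntax should be taken from `moh_template.csv` itself:

```
DONOR.INDEX.gender, {sex_to_gender(Donors.Sex)}
#DONOR.INDEX.cause_of_death,
```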

<details>
<summary>Generating a template from a different schema</summary>
```
options:
```
</details>
## Testing
Continuous integration testing for this repository is implemented through pytest and GitHub Actions, which run whenever commits are pushed. Build results can be found at [this repository's GitHub Actions page](https://github.com/CanDIG/clinical_ETL_code/actions/workflows/test.yml).

To run tests manually, run `pytest` from the command line.
### When tests fail...
<details>
<summary>"Compare moh_template.csv" fails</summary>

### You changed the `moh_template.csv` file:

To fix this, you'll need to update the diffs file. Run `bash update_moh_template.sh` and commit the changes that are generated for `test_data/moh_diffs.txt`.

### You did not change the `moh_template.csv` file:

There have probably been MoH model changes in katsu.

Run the `update_moh_template.sh` script to see what's changed in `test_data/moh_diffs.txt`. Update `moh_template.csv` to reconcile any differences, then re-run `update_moh_template.sh`. Commit any changes in both `moh_template.csv` and `test_data/moh_diffs.txt`.
</details>
## Validating the mapping
You can validate the generated json mapping file against the MoH data model. The validation will compare the mapping to the json schema used to generate the template, as well as other known requirements and data conditions specified in the MoH data model.
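As a toy illustration of the idea only (the repository's validation is far more thorough, checking the full schema plus MoH-specific conditions):

```python
# Toy sketch: report required schema fields missing from a generated record.
# Function and field names here are illustrative, not the tool's actual API.

def missing_required(record, schema):
    return [f for f in schema.get("required", []) if f not in record]

schema = {"required": ["submitter_donor_id", "gender"]}
missing_required({"submitter_donor_id": "DONOR_1"}, schema)  # -> ['gender']
```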
