Merge release v2.1.0 into stable branch #60

Merged
merged 51 commits into from
Mar 14, 2024
0af78af
update packages
daisieh Dec 19, 2023
7c84690
applymap is deprecated
daisieh Dec 19, 2023
0208763
build files
daisieh Dec 19, 2023
01c9541
Update test.yml
daisieh Dec 19, 2023
f23bd22
Merge pull request #47 from CanDIG/daisieh/updates
daisieh Dec 19, 2023
5ade787
compile result for load_manifest
daisieh Jan 23, 2024
e7153f4
add some prints
daisieh Jan 23, 2024
9193cf4
stub for validation
daisieh Jan 24, 2024
1f7aaf6
don't hard-code main result key
daisieh Jan 23, 2024
808715c
another key to not hard-code
daisieh Jan 24, 2024
d29cf0e
programatically use schema class based on manifest
daisieh Jan 24, 2024
54f33ba
use a default dateparser
daisieh Jan 23, 2024
daba925
add reference date and date interval calculations
daisieh Jan 23, 2024
5f25983
if date_intervals are used for birth and death, use month_intervals
daisieh Jan 23, 2024
ec8db08
update templates to use date_interval
daisieh Jan 23, 2024
e35f769
temporarily don't count interval errors
daisieh Jan 24, 2024
7c535b3
check to see if the identifier exists on the sheet
daisieh Jan 24, 2024
bebb169
add new manifest param to README
mshadbolt Jan 24, 2024
665a810
added reference_date to manifest docs
mshadbolt Jan 25, 2024
c898c2e
Merge pull request #48 from CanDIG/daisieh/genomic
daisieh Jan 25, 2024
897799f
Merge branch 'develop' into daisieh/date-intervals
mshadbolt Jan 25, 2024
65ae481
pass date_resolution into earliest_date mapping
daisieh Jan 26, 2024
1661036
add date_resolution to the templates
daisieh Jan 26, 2024
c001f36
add date_resolution to required_fields for Donor
daisieh Jan 26, 2024
1a75e3f
add test values for Donor.date_resolution
daisieh Jan 26, 2024
da05b60
sheet should be consistently the first param's sheet
daisieh Jan 26, 2024
c9ff51b
add docstrings
mshadbolt Jan 26, 2024
ac20357
update mapping docs
mshadbolt Jan 26, 2024
e76c1a1
raise an exception if there's no reference_date
daisieh Jan 26, 2024
5889ecf
docstrings
daisieh Jan 26, 2024
5b74358
Merge pull request #49 from CanDIG/daisieh/date-intervals
daisieh Jan 26, 2024
c4a1624
Module path changes to allow running standalone (not as an included m…
DavidBrownlee Feb 9, 2024
96d6249
module import path fixed for validate_coverage.py as well.
DavidBrownlee Feb 12, 2024
c90eca0
Updte sample_inputs to work as standalone (not imported library).
DavidBrownlee Feb 19, 2024
e25e0cf
Added documentation clarity for writing and using custom functions.
DavidBrownlee Feb 19, 2024
eb23190
Merge pull request #50 from CanDIG/david/runStandAlone
daisieh Feb 23, 2024
f247ef7
merge
daisieh Feb 23, 2024
7637349
Validate that date intervals are integers.
DavidBrownlee Feb 15, 2024
bdb3ccf
month_interval is now always calculated. day_interval is optional.
DavidBrownlee Feb 16, 2024
07e4960
merge
daisieh Feb 23, 2024
189c36f
Minor correction to sample_inputs/moh_template.csv.
DavidBrownlee Feb 16, 2024
c0aaeac
Merge pull request #51 from CanDIG/david/acceptDateIntervals
daisieh Feb 23, 2024
bc91757
change float() to floating() in mapping functions
lilyyangyi301 Feb 29, 2024
0870368
revert
lilyyangyi301 Feb 29, 2024
f02fa8c
change float() to floating() in mapping functions
lilyyangyi301 Feb 29, 2024
a9ba7c7
Merge pull request #54 from CanDIG/lilyyang/fix_floating_mapping_func…
lilyyangyi301 Feb 29, 2024
815cbfc
DIG-1150: Template update and documentation for DateIntervals (#53)
mshadbolt Feb 29, 2024
8a01ff9
Add index and minify command line arguments (#55)
mshadbolt Mar 4, 2024
cb4269e
update docs for new args (#56)
mshadbolt Mar 4, 2024
1513b0d
add main and have CSVConvert as CLI script (#57)
mshadbolt Mar 6, 2024
b3b4816
fix int method, more details in warning (#58)
mshadbolt Mar 6, 2024
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -8,7 +8,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10"]
python-version: ["3.12"]

steps:
- uses: actions/checkout@v2
3 changes: 2 additions & 1 deletion .gitignore
@@ -7,4 +7,5 @@ __pycache__/*
.venv/
_local
.idea
.~lock*
.~lock*
build/
32 changes: 18 additions & 14 deletions README.md
@@ -68,13 +68,15 @@ For each dataset (cohort) that you want to convert, create a directory outside o
#### Manifest file
The `manifest.yml` file contains settings for the cohort mapping. There is a sample file in [`sample_inputs/manifest.yml`](sample_inputs/manifest.yml) with documentation and example inputs. The fields are:

| field | description |
|-------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| description | A brief description of what mapping task this manifest is being used for |
| mapping | the mapping template csv file that lists the mappings for each field based on `moh_template.csv`, assumed to be in the same directory as the `manifest.yml` file |
| identifier | the unique identifier for the donor or root node |
| schema | a URL to the openapi schema file |
| functions | A list of one or more filenames containing additional mapping functions, can be omitted if not needed. Assumed to be in the same directory as the `manifest.yml` file |
| field | description |
|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| description | A brief description of what mapping task this manifest is being used for |
| mapping | the mapping template csv file that lists the mappings for each field based on `moh_template.csv`, assumed to be in the same directory as the `manifest.yml` file |
| identifier | the unique identifier for the donor or root node |
| schema | a URL to the openapi schema file |
| schema_class | The name of the class in the schema used as the model for creating the map.json. Currently supported: `MoHSchema` - for clinical MoH data and `GenomicSchema` for creating a genomic ingest linking file. |
| reference_date | a reference date used to calculate date intervals, formatted as a mapping entry for the mapping template |
| functions | A list of one or more filenames containing additional mapping functions, can be omitted if not needed. Assumed to be in the same directory as the `manifest.yml` file |
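
As a rough illustration of the fields in the table above, the sketch below shows a manifest as the Python dict a YAML parser might produce from `manifest.yml`. The file name, identifier, and schema URL are placeholders invented for this example, and treating every field except `functions` as required is an assumption rather than documented behaviour. The `reference_date` value reuses the example given in `mapping_functions.md`.

```python
# Hypothetical manifest contents, written as the dict a YAML parser would return.
# Values marked "placeholder" are invented for illustration only.
manifest = {
    "description": "Mapping for an example cohort",
    "mapping": "example_template.csv",            # placeholder file name
    "identifier": "donor_id",                     # placeholder identifier field
    "schema": "https://example.org/schema.yml",   # placeholder URL
    "schema_class": "MoHSchema",
    "reference_date": "earliest_date(Donor.date_resolution, PrimaryDiagnosis.date_of_diagnosis)",
    "functions": ["new_cohort"],
}

# Field names as listed in the README table.
DOCUMENTED_FIELDS = [
    "description", "mapping", "identifier", "schema",
    "schema_class", "reference_date", "functions",
]


def missing_fields(manifest, optional=("functions",)):
    """List documented fields absent from the manifest.

    Assumption: only `functions` may be omitted; the README does not
    state which of the other fields are strictly required.
    """
    return [f for f in DOCUMENTED_FIELDS
            if f not in manifest and f not in optional]


print(missing_fields(manifest))
```

A check like this is only a convenience for catching typos in field names before running the conversion.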

#### Mapping template

@@ -119,25 +121,27 @@ CSVConvert requires two inputs:

```
$ python src/clinical_etl/CSVConvert.py -h
usage: CSVConvert.py [-h] [--input INPUT] [--manifest manifest_file] [--test] [--verbose]
usage: CSVConvert.py [-h] --input INPUT --manifest MANIFEST [--test] [--verbose] [--index] [--minify]

options:
-h, --help show this help message and exit
--input INPUT Path to either an xlsx file or a directory of csv files for ingest
--manifest MANIFEST Path to a manifest file describing the mapping.
--manifest MANIFEST Path to a manifest file describing the mapping. See README for more information
--test Use exact template specified in manifest: do not remove extra lines
--verbose, --v Print extra information, useful for debugging and understanding how the code runs.

--test allows you to add extra lines to your manifest's template file that will be populated in the mapped schema. NOTE: this mapped schema will likely not be a valid mohpacket: it should be used only for debugging.
--index, --i Output 'indexed' file, useful for debugging and seeing relationships.
--minify Remove white space and line breaks from json outputs to reduce file size. Less readable for humans.
```

* `--test` allows you to add extra lines to your manifest's template file that will be populated in the mapped schema. NOTE: this mapped schema will likely not be a valid mohpacket: it should be used only for debugging.

Example usage:

```
python src/clinical_etl/CSVConvert.py --input test_data/raw_data --manifest test_data/manifest.yml
```

The output packets `<INPUT_DIR>_map.json` and `<INPUT_DIR>_indexed.json` will be in the parent of the `INPUT` directory / file. In the example above, this would be in the `test_data` directory.
The main output `<INPUT_DIR>_map.json` and the optional output `<INPUT_DIR>_indexed.json` will be in the parent of the `INPUT` directory / file. In the example above, this would be in the `test_data` directory.

Validation will automatically be run after the conversion is complete. Any validation errors or warnings will be reported both on the command line and as part of the `<INPUT_DIR>_map.json` file.

@@ -193,7 +197,7 @@ A summarised example of the output is below:
}
```

`<INPUT_DIR>_indexed.json` contains information about how the ETL is looking up the mappings and can be useful for debugging.
`<INPUT_DIR>_indexed.json` contains information about how the ETL is looking up the mappings and can be useful for debugging. It is only generated if the `--index` argument is specified when CSVConvert is run. Note: This file can be very large if the input data is large.
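
To make the effect of `--minify` concrete, here is a minimal sketch of the difference between indented and minified JSON using Python's standard `json` module. The `packet` dict and its keys are made up for the example, and whether CSVConvert uses exactly these `json.dumps` arguments is an assumption.

```python
import json

# Hypothetical output packet; the keys are illustrative only.
packet = {"donors": [{"submitter_donor_id": "DONOR_1", "is_deceased": False}]}

# Human-readable form: indentation and line breaks.
pretty = json.dumps(packet, indent=4)

# Minified form: no line breaks, no spaces after separators.
minified = json.dumps(packet, separators=(",", ":"))

# Both forms decode to the same data; only the file size differs.
assert json.loads(pretty) == json.loads(minified)
print(len(pretty), len(minified))
```

The minified string carries the same data, so it can be ingested in the same way as the indented file.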

## Testing

@@ -223,7 +227,7 @@ You can validate the generated json mapping file against the MoH data model. The

```
$ python src/clinical_etl/validate_coverage.py -h
validate_coverage.py [-h] [--input map.json] [--manifest MAPPING]
usage: validate_coverage.py [-h] --json JSON [--verbose]

options:
-h, --help show this help message and exit
Binary file modified dist/clinical_ETL-2.0.0-py3-none-any.whl
Binary file not shown.
78 changes: 67 additions & 11 deletions mapping_functions.md
@@ -58,23 +58,45 @@ If your schema requires more complex mapping calculations, you can define an ind

In addition to mapping column names, you can also transform the values inside the cells to make them align with the schema. We've already seen the simplest case - the `single_val` function takes a single value for the named field and returns it (and should only be used when you expect one single value).

The standard functions are defined in `mappings.py`. They include functions for handling single values, list values, dates, and booleans.
The standard functions are defined in `mappings.py`. They include functions for handling single values, list values, dates, and booleans.

Many functions take one or more `data_values` arguments as input. These are a dictionary representing how the CSVConvert script parses each cell of the input data. It is a dictionary of the format `{<field>:{<OBJECT_SHEET>: <value>}}`, e.g. `{'date_of_birth': {'Donor': '6 Jan 1954'}}`.
Many functions take one or more `data_values` arguments as input. These are a dictionary representing how the CSVConvert script parses each cell of the input data. It is a dictionary of the format `{<field>:{<OBJECT_SHEET>: <value>}}`, e.g. `{'date_of_birth': {'Donor': '6 Jan 1954'}}`.

A detailed index of all standard functions can be viewed below in the [Standard functions index](#Standard-functions-index).
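
The `data_values` format described above can be walked with two nested loops. The helper below is a hypothetical sketch and not part of `mappings.py`. It flattens a `data_values` dict into `(field, sheet, value)` tuples, which can be handy when writing or debugging a custom function.

```python
def unpack_data_values(data_values):
    """Flatten {<field>: {<sheet>: <value>}} into (field, sheet, value) tuples.

    Hypothetical helper for illustration; not part of mappings.py.
    """
    rows = []
    for field, sheets in data_values.items():
        for sheet, value in sheets.items():
            rows.append((field, sheet, value))
    return rows


# The example dict from the text above.
example = {"date_of_birth": {"Donor": "6 Jan 1954"}}
print(unpack_data_values(example))
```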

### Dealing with Dates

As of version 2.1 of the [MoHCCN Data Model](https://www.marathonofhopecancercentres.ca/docs/default-source/policies-and-guidelines/clinical-data-model-v2.1/mohccn-clinical-data-model-release-notes_sep2023.pdf?Status=Master&sfvrsn=19ece028_3), dates need to be converted into date intervals relative to the earliest date of diagnosis. Support for this has been incorporated into clinical_ETL_code v.2.0.0. In order to convert dates to date intervals, a `reference_date` must be provided in the `manifest.yml`. This can be an absolute date, or a function to calculate a date based on the input date, e.g. `earliest_date(Donor.date_resolution, PrimaryDiagnosis.date_of_diagnosis)`. In the mapping csv, the in-built `date_interval()` mapping function can be used to calculate the appropriate date interval information for any date-type field. e.g.:

```commandline
DONOR.INDEX.date_of_birth, {date_interval(Donor.date_of_birth)}
```

If input data has pre-calculated date intervals as integers, the `int_to_date_interval_json()` function can be used to transform the integer into the required DateInterval json object. e.g.:

```commandline
DONOR.INDEX.date_of_death, {int_to_date_interval_json(Donor.date_of_death)}
```
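
For intuition about what a date interval contains, the following sketch computes a `month_interval` and, at day resolution, a `day_interval` relative to a reference date. It is not the implementation in `mappings.py`. The function name, the `"day"`/`"month"` resolution values, and the rule for counting whole months (truncating toward zero) are all assumptions made for this example.

```python
from datetime import date


def sketch_date_interval(target, reference, date_resolution="day"):
    """Illustrative only: interval of `target` relative to `reference`.

    Assumptions (not confirmed by the source): resolution is "day" or
    "month", and partial months are truncated toward zero.
    """
    # Signed number of days between the two dates.
    day_interval = (target - reference).days

    # Calendar-month difference, then drop any incomplete month.
    months = (target.year - reference.year) * 12 + (target.month - reference.month)
    if target >= reference and target.day < reference.day:
        months -= 1
    elif target < reference and target.day > reference.day:
        months += 1

    # month_interval is always present; day_interval only at day resolution.
    result = {"month_interval": months}
    if date_resolution == "day":
        result["day_interval"] = day_interval
    return result


print(sketch_date_interval(date(2020, 3, 10), date(2020, 1, 15)))
print(sketch_date_interval(date(2020, 3, 10), date(2020, 1, 15), "month"))
```

Dates before the reference date produce negative intervals under this rule.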

## Writing your own custom functions

If the data cannot be transformed with one of the standard functions, you can define your own. In your data directory (the one that contains `manifest.yml`) create a python file (let's assume you called it `new_cohort.py`) and add the name of that file as the `mapping` entry in the manifest.
If the data cannot be transformed with one of the standard functions, you can define your own. In your data directory (the one that contains `manifest.yml`) create a python file (let's assume you called it `new_cohort.py`) and add the name of that file as the `functions` entry in the manifest (without the .py extension).

Following the format in the generic `mappings.py`, write your own functions in your python file for how to translate the data. To specify a custom mapping function in the template:
In your data directory (the one that contains `manifest.yml`) create a python file (let's assume you called it `new_cohort.py`) and add the name of that file as a .yml list after `functions` in the manifest. For example:
```
functions:
- new_cohort
```

`DONOR.INDEX.primary_diagnoses.INDEX.basis_of_diagnosis,{new_cohort.custom_function(DATA_SHEET.field_name)}`
Following the format in the generic `mappings.py`, write your own functions in your python file to translate the data.

To use a custom mapping function in the template, you must specify the file and function using dot-separated notation:

DONOR.INDEX.primary_diagnoses.INDEX.basis_of_diagnosis,{**new_cohort.custom_function**(DATA_SHEET.field_name)}

Examples:

To map input values to output values (in case your data capture used different values than the model):
Map input values to output values (in case your data capture used different values than the model):

```
def sex(data_value):
@@ -125,6 +147,7 @@ represents the following JSON dict:
# Standard Functions Index

<!--- documentation below this line is generated automatically by running generate_mapping_docs.py --->

Module mappings
===============

@@ -140,9 +163,10 @@ Functions

Returns:
A boolean based on the input,
`False` if value is in ["No", "no", "False", "false"]
`None` if value is in [`None`, "nan", "NaN", "NAN"]
`True` otherwise
`False` if value is in ["No", "no", "N", "n", "False", "false", "F", "f"]
`True` if value is in ["Yes", "yes", "Y", "y", "True", "true", "T", "t"]
None if value is in [`None`, "nan", "NaN", "NAN"]
None otherwise
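
The updated return rules can be restated as a small lookup. This is a sketch of the documented behaviour only: it takes a raw cell value instead of a `data_values` dict, and the function name is hypothetical. It is not the code in `mappings.py`.

```python
# Accepted spellings, as listed in the updated docstring.
FALSE_VALUES = ("No", "no", "N", "n", "False", "false", "F", "f")
TRUE_VALUES = ("Yes", "yes", "Y", "y", "True", "true", "T", "t")


def sketch_boolean(value):
    """Illustrative restatement of the documented rules (hypothetical name)."""
    if value in FALSE_VALUES:
        return False
    if value in TRUE_VALUES:
        return True
    # None, "nan", "NaN", "NAN" and any unrecognised value map to None.
    return None


print(sketch_boolean("y"), sketch_boolean("F"), sketch_boolean("maybe"))
```

Note the behaviour change in this diff: unrecognised values previously mapped to `True` and now map to `None`.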


`concat_vals(data_values)`
@@ -167,6 +191,27 @@ Functions
a list of dates in YYYY-MM format or None if blank/empty/unparseable


`date_interval(data_values)`
: Calculates a date interval from a given date relative to the reference date specified in the manifest.

Args:
data_values: a values dict with a date

Returns:
A dictionary with calculated month_interval and optionally a day_interval depending on the specified
date_resolution.


`earliest_date(data_values)`
: Calculates the earliest date from a set of dates

Args:
data_values: A values dict of dates of diagnosis and date_resolution

Returns:
A dictionary containing the earliest date (`offset`) as a date object and the provided `date_resolution`


`flat_list_val(data_values)`
: Take a list mapping and break up any stringified lists into multiple values in the list.

@@ -177,8 +222,9 @@ Functions
Returns:
A parsed list of items in the list, e.g. ['a', 'b', 'c']


`float(data_values)`

`floating(data_values)`

: Convert a value to a float.

Args:
@@ -210,6 +256,16 @@ Functions
{"field": <identifier_field>,"sheet_name": <sheet_name>,"values": [<identifiers>]}


`int_to_date_interval_json(data_values)`
: Converts an integer date interval into JSON format.

Args:
data_values: a values dict with an integer.

Returns:
A dictionary with a calculated month_interval and optionally a day_interval depending on the specified date_resolution in the donor file.


`integer(data_values)`
: Convert a value to an integer.
