diff --git a/dist/clinical_ETL-2.2.0-py3-none-any.whl b/dist/clinical_ETL-2.2.0-py3-none-any.whl deleted file mode 100644 index c4de356..0000000 Binary files a/dist/clinical_ETL-2.2.0-py3-none-any.whl and /dev/null differ diff --git a/dist/clinical_ETL-2.2.0.tar.gz b/dist/clinical_ETL-2.2.0.tar.gz deleted file mode 100644 index 4033b05..0000000 Binary files a/dist/clinical_ETL-2.2.0.tar.gz and /dev/null differ diff --git a/dist/clinical_ETL-3.0.0-py3-none-any.whl b/dist/clinical_ETL-3.0.0-py3-none-any.whl new file mode 100644 index 0000000..4deea7b Binary files /dev/null and b/dist/clinical_ETL-3.0.0-py3-none-any.whl differ diff --git a/dist/clinical_etl-3.0.0.tar.gz b/dist/clinical_etl-3.0.0.tar.gz new file mode 100644 index 0000000..40aa0c0 Binary files /dev/null and b/dist/clinical_etl-3.0.0.tar.gz differ diff --git a/pyproject.toml b/pyproject.toml index 87b46f0..7091809 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -3,7 +3,7 @@ requires = ["setuptools >= 61.0"] build-backend = "setuptools.build_meta" [project] -version = "2.2.1" +version = "3.0.0" name = "clinical_ETL" dependencies = [ "pandas>=2.1.0", diff --git a/src/clinical_ETL.egg-info/PKG-INFO b/src/clinical_ETL.egg-info/PKG-INFO index e47670e..2bcdf5d 100644 --- a/src/clinical_ETL.egg-info/PKG-INFO +++ b/src/clinical_ETL.egg-info/PKG-INFO @@ -1,6 +1,6 @@ Metadata-Version: 2.1 Name: clinical_ETL -Version: 2.2.0 +Version: 3.0.0 Summary: ETL module for transforming clinical CSV data into properly-formatted packets for ingest into Katsu Project-URL: Repository, https://github.com/CanDIG/clinical_ETL_code Requires-Python: >=3.10 @@ -50,6 +50,7 @@ Set up and activate a [virtual environment](https://docs.python.org/3/tutorial/v python -m venv /path/to/new/virtual/environment source /path/to/new/virtual/environment/bin/activate ``` + [See here for Windows instructions](https://realpython.com/python-virtual-environments-a-primer/) Clone this repo and enter the repo directory @@ -63,29 +64,39 @@ Install the 
repo's requirements in your virtual environment pip install -r requirements.txt ``` +>[!NOTE] +> If Python can't find the `clinical_etl` module when running `CSVConvert`, install the dependency manually: +> ``` +> pip install -e clinical_ETL_code/ +> ``` + Before running the script, you will need to have your input files: clinical data in a tabular format (`xlsx`/`csv`) that can be read into the program, and a cohort directory containing the files that define the schema and mapping configurations. ### Input file/s format -The input for `CSVConvert` is either a single xlsx file, a single csv, or a directory of csvs that contain your clinical data. If providing a spreadsheet, there can be multiple sheets (usually one for each sub-schema). Examples of how csvs may look can be found in [test_data/raw_data](test_data/raw_data). +The input for `CSVConvert` is either a single xlsx file, a single csv, or a directory of csvs that contain your clinical data. If providing a spreadsheet, there can be multiple sheets (usually one for each sub-schema). Examples of how csvs may look can be found in [tests/raw_data](tests/raw_data). All rows must contain identifiers that allow linkage between the objects in the schema; for example, a row that describes a Treatment must have a link to the Donor / Patient id for that Treatment. Data should be [tidy](https://r4ds.had.co.nz/tidy-data.html), with each variable in a separate column, each row representing an observation, and a single data entry in each cell. In the case of fields that can accept an array of values, the values within a cell should be delimited such that a mapping function can accurately return an array of permissible values. +If you are working with exports from RedCap, the sample files in the [`sample_inputs/redcap_example`](sample_inputs/redcap_example) folder may be helpful. + ### Setting up a cohort directory -For each dataset (cohort) that you want to convert, create a directory outside of this repository. 
For CanDIG devs, this will be in the private `data` repository. This cohort directory should contain the same files as shown in the `sample_inputs` directory, which are: +For each dataset (cohort) that you want to convert, create a directory outside of this repository. For CanDIG devs, this will be in the private `data` repository. This cohort directory should contain the same files as shown in the [`sample_inputs/generic_example`](sample_inputs/generic_example) directory, which are: * a [`manifest.yml`](#Manifest-file) file with configuration settings for the mapping and schema validation * a [mapping template](#Mapping-template) csv that lists custom mappings for each field (based on `moh_template.csv`) * (if needed) one or more python files that implement any cohort-specific mapping functions (see [mapping functions](mapping_functions.md) for detailed information) +Example files showing how to convert a large single csv export, such as those exported from a RedCap database, can be found in [`sample_inputs/redcap_example`](sample_inputs/redcap_example). + > [!IMPORTANT] > If you are placing this directory under version control and the cohort is not sample / synthetic data, do not place raw or processed data files in this directory, to avoid any possibility of committing protected data. #### Manifest file -The `manifest.yml` file contains settings for the cohort mapping. There is a sample file in [`sample_inputs/manifest.yml`](sample_inputs/manifest.yml) with documentation and example inputs. The fields are: +The `manifest.yml` file contains settings for the cohort mapping. There is a sample file in [`sample_inputs/generic_example/manifest.yml`](sample_inputs/generic_example/manifest.yml) with documentation and example inputs. 
The fields are: | field | description | |---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| @@ -93,8 +104,9 @@ The `manifest.yml` file contains settings for the cohort mapping. There is a sam | mapping | the mapping template csv file that lists the mappings for each field based on `moh_template.csv`, assumed to be in the same directory as the `manifest.yml` file | | identifier | the unique identifier for the donor or root node | | schema | a URL to the openapi schema file | -| schema_class | The name of the class in the schema used as the model for creating the map.json. Currently supported: `MoHSchema` - for clinical MoH data and `GenomicSchema` for creating a genomic ingest linking file. | +| schema_class | The name of the class in the schema used as the model for creating the map.json. Currently supported: `MoHSchemaV2` and `MoHSchemaV3` for clinical MoH data, and `GenomicSchema` for creating a genomic ingest linking file. | | reference_date | a reference date used to calculate date intervals, formatted as a mapping entry for the mapping template | +| date_format | the format of the dates in your input data; use any combination of the characters `DMY` to specify the order (e.g. `DMY`, `MDY`, `YMD`, etc.) | | functions | a list of one or more filenames containing additional mapping functions; can be omitted if not needed. Assumed to be in the same directory as the `manifest.yml` file | #### Mapping template @@ -128,6 +140,7 @@ usage: generate_schema.py [-h] --url URL [--out OUT] options: -h, --help show this help message and exit --url URL URL to openAPI schema file (raw github link) + --schema Name of schema class. Default is MoHSchemaV3 --out OUT name of output file; csv extension will be added. Default is template ``` @@ -139,7 +152,7 @@ CSVConvert requires two inputs: 2. 
a path to a `manifest.yml`, in a directory that also contains the other files defined in [Setting up a cohort directory](#Setting-up-a-cohort-directory) ``` -$ python src/clinical_etl/CSVConvert.py -h +python src/clinical_etl/CSVConvert.py -h usage: CSVConvert.py [-h] --input INPUT --manifest MANIFEST [--test] [--verbose] [--index] [--minify] options: @@ -164,16 +177,16 @@ The main output `_map.json` and optional output`_indexed.j Validation will automatically be run after the conversion is complete. Any validation errors or warnings will be reported both on the command line and as part of the `_map.json` file. +>[!NOTE] +> If Python can't find the `clinical_etl` module when running `CSVConvert`, install the dependency manually: +> ``` +> pip install -e clinical_ETL_code/ +> ``` + #### Format of the output files `_map.json` is the main output and contains the results of the mapping, conversion and validation as well as summary statistics. -The mapping and transformation result is found in the `"donors"` key. - -Arrays of validation warnings and errors are found in `validation_warnings` & `validation_errors`. - -Summary statistics about the completeness of the objects against the schema are in the `statistics` key. - A summarised example of the output is below: ```json @@ -207,6 +220,17 @@ A summarised example of the output is below: } } ``` + +The mapping and transformation result is found in the `"donors"` key. + +Arrays of validation warnings and errors are found in `validation_warnings` & `validation_errors`. + +Summary statistics about the completeness of the objects against the schema are in the `statistics` key. You can create a readable CSV table +of the summary statistics by running `completeness_table.py`. The table will be saved in `_completeness.csv`. ``` +python src/clinical_etl/completeness_table.py --input _map.json ``` + `_validation_results.json` contains all validation warnings and errors. 
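Because the `_map.json` output is plain JSON, the documented top-level keys can also be inspected programmatically. A minimal sketch in Python — the `mapped` dictionary below is a hypothetical stand-in for a loaded `_map.json` file, and `summarise` is an illustrative helper, not part of this package:

```python
# Hypothetical stand-in for a loaded _map.json payload; in practice you would
# obtain this dict via json.load() on the CSVConvert output file. The keys
# mirror the documented structure: "donors", "validation_errors",
# "validation_warnings", and "statistics".
mapped = {
    "donors": [{"submitter_donor_id": "DONOR_1"}],
    "validation_errors": [],
    "validation_warnings": ["DONOR_1: example warning"],
    "statistics": {},  # completeness statistics, contents omitted here
}

def summarise(result: dict) -> str:
    """Return a one-line summary of the mapping and validation results."""
    return (
        f"{len(result['donors'])} donor(s), "
        f"{len(result['validation_errors'])} validation error(s), "
        f"{len(result['validation_warnings'])} warning(s)"
    )

print(summarise(mapped))
```

Running this prints `1 donor(s), 0 validation error(s), 1 warning(s)`; the same pattern applies to a real output file loaded with `json.load`.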
`_indexed.json` contains information about how the ETL looks up the mappings and can be useful for debugging. It is only generated if the `--index` argument is specified when CSVConvert is run. Note: this file can be very large if the input data is large. diff --git a/src/clinical_ETL.egg-info/SOURCES.txt b/src/clinical_ETL.egg-info/SOURCES.txt index fcd2d7a..4b8027f 100644 --- a/src/clinical_ETL.egg-info/SOURCES.txt +++ b/src/clinical_ETL.egg-info/SOURCES.txt @@ -9,12 +9,14 @@ src/clinical_ETL.egg-info/requires.txt src/clinical_ETL.egg-info/top_level.txt src/clinical_etl/CSVConvert.py src/clinical_etl/__init__.py +src/clinical_etl/completeness_table.py src/clinical_etl/create_test_mapping.py src/clinical_etl/generate_mapping_docs.py src/clinical_etl/generate_schema.py src/clinical_etl/genomicschema.py src/clinical_etl/mappings.py -src/clinical_etl/mohschema.py +src/clinical_etl/mohschemav2.py +src/clinical_etl/mohschemav3.py src/clinical_etl/schema.py src/clinical_etl/validate_coverage.py tests/test_data_ingest.py