update to version 3 (#70)

CanDIG · Aug 15, 2024 · c532299
1 parent e57ca01
commit c532299

Showing 7 changed files with 40 additions and 14 deletions.
diff --git a/dist/clinical_ETL-2.2.0-py3-none-any.whl b/dist/clinical_ETL-2.2.0-py3-none-any.whl
diff --git a/dist/clinical_ETL-2.2.0.tar.gz b/dist/clinical_ETL-2.2.0.tar.gz
diff --git a/dist/clinical_ETL-3.0.0-py3-none-any.whl b/dist/clinical_ETL-3.0.0-py3-none-any.whl
diff --git a/dist/clinical_etl-3.0.0.tar.gz b/dist/clinical_etl-3.0.0.tar.gz
diff --git a/pyproject.toml b/pyproject.toml
@@ -3,7 +3,7 @@ requires = ["setuptools >= 61.0"]
 build-backend = "setuptools.build_meta"
 
 [project]
-version = "2.2.1"
+version = "3.0.0"
 name = "clinical_ETL"
 dependencies = [
     "pandas>=2.1.0",

diff --git a/src/clinical_ETL.egg-info/PKG-INFO b/src/clinical_ETL.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: clinical_ETL
-Version: 2.2.0
+Version: 3.0.0
 Summary: ETL module for transforming clinical CSV data into properly-formatted packets for ingest into Katsu
 Project-URL: Repository, https://github.com/CanDIG/clinical_ETL_code
 Requires-Python: >=3.10
@@ -50,6 +50,7 @@ Set up and activate a [virtual environment](https://docs.python.org/3/tutorial/v
 python -m venv /path/to/new/virtual/environment
 source /path/to/new/virtual/environment/bin/activate
 ```
+
 [See here for Windows instructions](https://realpython.com/python-virtual-environments-a-primer/)
 
 Clone this repo and enter the repo directory
@@ -63,38 +64,49 @@ Install the repo's requirements in your virtual environment
 pip install -r requirements.txt
 ```
 
+>[!NOTE]
+> If Python can't find the `clinical_etl` module when running `CSVConvert`, install the dependency manually:
+> ```
+> pip install -e clinical_ETL_code/
+> ```
+
 Before running the script, you will need your input files: clinical data in a tabular format (`xlsx`/`csv`) that can be read into the program, and a cohort directory containing the files that define the schema and mapping configurations.
 
 ### Input file/s format
 
-The input for `CSVConvert` is either a single xlsx file, a single csv, or a directory of csvs that contain your clinical data. If providing a spreadsheet, there can be multiple sheets (usually one for each sub-schema). Examples of how csvs may look can be found in [test_data/raw_data](test_data/raw_data).
+The input for `CSVConvert` is either a single xlsx file, a single csv, or a directory of csvs that contain your clinical data. If providing a spreadsheet, there can be multiple sheets (usually one for each sub-schema). Examples of how csvs may look can be found in [tests/raw_data](tests/raw_data).
 
 All rows must contain identifiers that allow linkage between the objects in the schema, for example, a row that describes a Treatment must have a link to the Donor / Patient id for that Treatment.
 
 Data should be [tidy](https://r4ds.had.co.nz/tidy-data.html), with each variable in a separate column, each row representing an observation, and a single data entry in each cell. In the case of fields that can accept an array of values, the values within a cell should be delimited such that a mapping function can accurately return an array of permissible values.
 
+If you are working with exports from REDCap, the sample files in the [`sample_inputs/redcap_example`](sample_inputs/redcap_example) folder may be helpful.
+
 ### Setting up a cohort directory
 
-For each dataset (cohort) that you want to convert, create a directory outside of this repository. For CanDIG devs, this will be in the private `data` repository. This cohort directory should contain the same files as shown in the `sample_inputs` directory, which are:
+For each dataset (cohort) that you want to convert, create a directory outside of this repository. For CanDIG devs, this will be in the private `data` repository. This cohort directory should contain the same files as shown in the [`sample_inputs/generic_example`](sample_inputs/generic_example) directory, which are:
 
 * a [`manifest.yml`](#Manifest-file) file with configuration settings for the mapping and schema validation
 * a [mapping template](#Mapping-template) csv that lists custom mappings for each field (based on `moh_template.csv`)
 * (if needed) One or more python files that implement any cohort-specific mapping functions (See [mapping functions](mapping_functions.md) for detailed information)
 
+Example files showing how to convert a large single-csv export, such as one exported from a REDCap database, can be found in [`sample_inputs/redcap_example`](sample_inputs/redcap_example).
+
 > [!IMPORTANT]
 > If you are placing this directory under version control and the cohort is not sample / synthetic data, do not place raw or processed data files in this directory, to avoid any possibility of committing protected data.
 
 #### Manifest file
-The `manifest.yml` file contains settings for the cohort mapping. There is a sample file in [`sample_inputs/manifest.yml`](sample_inputs/manifest.yml) with documentation and example inputs. The fields are:
+The `manifest.yml` file contains settings for the cohort mapping. There is a sample file in [`sample_inputs/generic_example/manifest.yml`](sample_inputs/generic_example/manifest.yml) with documentation and example inputs. The fields are:
 
 | field         | description                                                                                                                                                                                               |
 |---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | description   | A brief description of what mapping task this manifest is being used for                                                                                                                                  |
 | mapping       | the mapping template csv file that lists the mappings for each field based on `moh_template.csv`, assumed to be in the same directory as the `manifest.yml` file                                          |
 | identifier    | the unique identifier for the donor or root node                                                                                                                                                          |
 | schema        | a URL to the openapi schema file                                                                                                                                                                          |
-| schema_class  | The name of the class in the schema used as the model for creating the map.json. Currently supported: `MoHSchema` - for clinical MoH data and `GenomicSchema` for creating a genomic ingest linking file. |
+| schema_class  | The name of the class in the schema used as the model for creating the map.json. Currently supported: `MoHSchemaV2` and `MoHSchemaV3` for clinical MoH data, and `GenomicSchema` for creating a genomic ingest linking file. |
 | reference_date | a reference date used to calculate date intervals, formatted as a mapping entry for the mapping template                                                                                                 |
+| date_format | Specify the format of the dates in your input data. Use any combination of the characters `DMY` to specify the order (e.g. `DMY`, `MDY`, `YMD`, etc).                                                                                    |
 | functions     | A list of one or more filenames containing additional mapping functions, can be omitted if not needed. Assumed to be in the same directory as the `manifest.yml` file                                     |
 
 #### Mapping template
@@ -128,6 +140,7 @@ usage: generate_schema.py [-h] --url URL [--out OUT]
 options:
   -h, --help  show this help message and exit
   --url URL   URL to openAPI schema file (raw github link)
+  --schema    Name of schema class. Default is MoHSchemaV3
   --out OUT   name of output file; csv extension will be added. Default is template
 ```
 </details>
@@ -139,7 +152,7 @@ CSVConvert requires two inputs:
 2. a path to a `manifest.yml`, in a directory that also contains the other files defined in [Setting up a cohort directory](#Setting-up-a-cohort-directory)
 
 ```
-$ python src/clinical_etl/CSVConvert.py -h
+python src/clinical_etl/CSVConvert.py -h
 usage: CSVConvert.py [-h] --input INPUT --manifest MANIFEST [--test] [--verbose] [--index] [--minify]
 
 options:
@@ -164,16 +177,16 @@ The main output `<INPUT_DIR>_map.json` and optional output`<INPUT_DIR>_indexed.j
 
 Validation will automatically be run after the conversion is complete. Any validation errors or warnings will be reported both on the command line and as part of the `<INPUT_DIR>_map.json` file.
 
+>[!NOTE]
+> If Python can't find the `clinical_etl` module when running `CSVConvert`, install the dependency manually:
+> ```
+> pip install -e clinical_ETL_code/
+> ```
+
 #### Format of the output files
 
 `<INPUT_DIR>_map.json` is the main output and contains the results of the mapping, conversion and validation as well as summary statistics.
 
-The mapping and transformation result is found in the `"donors"` key.
-
-Arrays of validation warnings and errors are found in `validation_warnings` & `validation_errors`.
-
-Summary statistics about the completeness of the objects against the schema are in the `statistics` key.
-
 A summarised example of the output is below:
 
 ```json
@@ -207,6 +220,17 @@ A summarised example of the output is below:
     }
 }
 ```
+
+The mapping and transformation result is found in the `"donors"` key.
+
+Arrays of validation warnings and errors are found in `validation_warnings` & `validation_errors`.
+
+Summary statistics about the completeness of the objects against the schema are in the `statistics` key. You can create a readable CSV table
+of the summary statistics by running `completeness_table.py`. The table will be saved in `<INPUT_DIR>_completeness.csv`.
+```
+python src/clinical_etl/completeness_table.py --input <INPUT_DIR>_map.json
+```
+
 `<INPUT_DIR>_validation_results.json` contains all validation warnings and errors.
 
 `<INPUT_DIR>_indexed.json` contains information about how the ETL is looking up the mappings and can be useful for debugging. It is only generated if the `--index` argument is specified when CSVConvert is run. Note: This file can be very large if the input data is large.
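
As a quick illustration of the output layout described above, the sketch below counts the entries under the documented keys of a loaded `<INPUT_DIR>_map.json`. It is not part of this commit or of `clinical_etl`: the helper name `summarize_map` and the sample values are hypothetical, and it assumes `donors`, `validation_warnings` and `validation_errors` each hold a JSON array.

```python
import json
from typing import Any


def summarize_map(result: dict[str, Any]) -> dict[str, int]:
    """Count donors, validation warnings and validation errors in a loaded map.json."""
    # Key names are taken from the README text; .get() with an empty default
    # keeps this sketch tolerant of a missing key. Assumes each value is a list.
    return {
        "donors": len(result.get("donors", [])),
        "validation_warnings": len(result.get("validation_warnings", [])),
        "validation_errors": len(result.get("validation_errors", [])),
    }


# In-memory stand-in for reading <INPUT_DIR>_map.json from disk.
# All values below are invented purely for illustration.
example = json.loads("""
{
  "donors": [{"example_field": "value"}],
  "validation_warnings": ["example warning"],
  "validation_errors": [],
  "statistics": {}
}
""")

summary = summarize_map(example)
print(summary)
```

For a real run, the same function could be applied to the dictionary returned by `json.load` on the generated file; a non-zero `validation_errors` count would be a reasonable signal to inspect `<INPUT_DIR>_validation_results.json`.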

diff --git a/src/clinical_ETL.egg-info/SOURCES.txt b/src/clinical_ETL.egg-info/SOURCES.txt
@@ -9,12 +9,14 @@ src/clinical_ETL.egg-info/requires.txt
 src/clinical_ETL.egg-info/top_level.txt
 src/clinical_etl/CSVConvert.py
 src/clinical_etl/__init__.py
+src/clinical_etl/completeness_table.py
 src/clinical_etl/create_test_mapping.py
 src/clinical_etl/generate_mapping_docs.py
 src/clinical_etl/generate_schema.py
 src/clinical_etl/genomicschema.py
 src/clinical_etl/mappings.py
-src/clinical_etl/mohschema.py
+src/clinical_etl/mohschemav2.py
+src/clinical_etl/mohschemav3.py
 src/clinical_etl/schema.py
 src/clinical_etl/validate_coverage.py
 tests/test_data_ingest.py