Build a Genomics Data Lake on AWS

This repo contains the code referenced in the AWS blog post "Build a Genomics data lake on AWS".

ETL - contains the transformation scripts and the CloudFormation template that spins up the EMR cluster.

EMRGenomics.py - Lambda function that is triggered by the CloudFormation template to create the EMR cluster that processes VCFs.

EventEMRGenomics.py - Event trigger Lambda function

emr_config.json - JSON file with EMR configuration for this example. This file can be edited to change EMR configuration parameters.

vcfToParquetTransform.py - PySpark script that performs the VCF to Parquet transformation using the Hail API. This can be customized to perform any specific transformation steps required.

genomics_datalake_emr.template - CloudFormation template that can be deployed in your account for the solution.

1000Genomes.ipynb - Python notebook with sample queries
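To give a rough idea of the kind of transformation vcfToParquetTransform.py performs, here is a minimal sketch of a VCF to Parquet conversion with Hail. It is not the script from this repo: the function names `sanitize_column` and `vcf_to_parquet`, the `GRCh37` default, and the choice to export only variant-level (row) fields are all assumptions made for illustration.

```python
import re


def sanitize_column(name: str) -> str:
    """Replace characters that are awkward in SQL column names with underscores."""
    return re.sub(r"[^0-9A-Za-z_]", "_", name)


def vcf_to_parquet(vcf_path: str, parquet_path: str,
                   reference_genome: str = "GRCh37") -> None:
    """Read a VCF with Hail and write its variant-level fields as Parquet.

    Hypothetical helper; the reference genome default is an assumption.
    """
    # Imported here so sanitize_column can be used without Hail installed.
    import hail as hl

    # Load the VCF into a Hail MatrixTable.
    # force_bgz=True treats .gz input as block-gzipped.
    mt = hl.import_vcf(vcf_path, reference_genome=reference_genome, force_bgz=True)

    # Keep only the row (variant) fields and flatten nested structs into columns.
    df = mt.rows().to_spark(flatten=True)

    # Flattened names look like "locus.contig"; rename them for easier querying.
    df = df.toDF(*[sanitize_column(c) for c in df.columns])

    # Write the Spark DataFrame as Parquet, replacing any existing output.
    df.write.mode("overwrite").parquet(parquet_path)
```

On a cluster with Hail and Spark available, this could be called as `vcf_to_parquet("s3://my-bucket/vcf/sample.vcf.gz", "s3://my-bucket/parquet/variants/")`, where both paths are placeholders. Per-sample genotype data would need `mt.entries()` instead of `mt.rows()`, at a much larger output size.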

For instructions on how to create the Glue data catalog tables for 1000 Genomes on the Registry of Open Data, please check the DataLakeAsCode repo at https://github.com/aws-samples/data-lake-as-code/tree/roda#readme. The repo also has CloudFormation templates for ClinVar and gnomAD.