Build a Genomics Data Lake on AWS

This repo contains the code referenced in the AWS blog post "Build a Genomics data lake on AWS".

ETL - contains the transformation scripts and the CloudFormation template to spin up the EMR cluster

EMRGenomics.py - Lambda function that is triggered by the CloudFormation template to create an EMR cluster that processes VCF files.

EventEMRGenomics.py - Event trigger Lambda function

emr_config.json - JSON file with EMR configuration for this example. This file can be edited to change EMR configuration parameters.

vcfToParquetTransform.py - PySpark script that performs the VCF to Parquet transformation using the Hail API. This can be customized to perform any specific transformation steps required.

genomics_datalake_emr.template - CloudFormation template that can be deployed in your account for the solution.

1000Genomes.ipynb - Python notebook with sample queries
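
As noted above, emr_config.json can be edited to change EMR configuration parameters. If you would rather script such changes than edit the file by hand, the following is a minimal sketch of one way to do it. The keys shown (Name, Instances, InstanceCount, MasterInstanceType) and the helper apply_overrides are assumptions for illustration only; check the actual emr_config.json in the ETL folder for the real structure.

```python
import copy
import json


def apply_overrides(config, overrides):
    """Return a copy of config with overrides merged in, recursing into nested dicts."""
    merged = copy.deepcopy(config)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical stand-in for the contents of emr_config.json; the real keys may differ.
SAMPLE_CONFIG = json.loads("""
{
  "Name": "genomics-emr",
  "Instances": {"InstanceCount": 3, "MasterInstanceType": "m5.xlarge"}
}
""")

if __name__ == "__main__":
    # Change only the instance count and leave every other setting untouched.
    updated = apply_overrides(SAMPLE_CONFIG, {"Instances": {"InstanceCount": 5}})
    print(json.dumps(updated, indent=2))
```

To use this against the real file, replace SAMPLE_CONFIG with json.load on an open handle to emr_config.json and write the result back with json.dump. The original dictionary is never modified, so the unedited configuration remains available for comparison.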

For instructions on how to create the Glue data catalog tables for 1000 Genomes on the Registry of Open Data, please check the DataLakeAsCode repo at https://github.com/aws-samples/data-lake-as-code/tree/roda#readme. The repo also has CloudFormation templates for ClinVar and gnomAD.
