This repository contains a comprehensive pipeline script for germline mutation analysis using the Genome Analysis Toolkit (GATK4) HaplotypeCaller. The pipeline includes steps for variant calling, filtering, and annotation, demonstrating a complete bioinformatics workflow from downloading reference files and reads, to identifying and annotating germline mutations.
The pipeline is specifically designed for germline mutation calling in human or similar organisms. It includes:
- Downloading reference genome and known variants
- Aligning reads to a reference genome
- Marking duplicates and recalibrating base quality scores
- Calling germline variants (SNPs and INDELs) with GATK's HaplotypeCaller
- Filtering and annotating variants with Funcotator
This pipeline is ideal for researchers and bioinformaticians working on germline variant discovery.
This pipeline requires the following tools to be installed and accessible in your `$PATH`:
- bwa for sequence alignment
- samtools for SAM/BAM file handling
- GATK4 for germline variant calling, filtering, and annotation
- wget for downloading files (replaceable with other download tools)
To install these tools, consult their official documentation for installation instructions and dependencies.
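A quick way to confirm the prerequisites is to look each tool up on `$PATH` before starting. The sketch below is not part of `pipeline.sh`; the function name `check_tools` is hypothetical, and it assumes GATK4 is launched through its usual `gatk` wrapper script.

```shell
#!/bin/sh
# Report every required tool that cannot be found on $PATH.
# Returns 0 when all tools are present and 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "Missing required tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (assumes GATK4 is exposed as `gatk`):
# check_tools bwa samtools gatk wget || exit 1
```

Checking all tools in one pass, rather than stopping at the first failure, lists everything that still needs installing.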
- Clone this repository and change into its directory: `git clone https://github.com/yourusername/germline-mutation-analysis-pipeline.git && cd germline-mutation-analysis-pipeline`
- Ensure `bwa`, `samtools`, GATK4, and `wget` are installed.
- Download any required data sources for GATK Funcotator annotation. Update paths in the script if needed.
To run the pipeline:
- Make the script executable: `chmod +x pipeline.sh`
- Run the script: `./pipeline.sh`
The script automatically downloads the necessary files (reference genome, known variants, and example reads) if they are not already present, so make sure you have an internet connection for the first run.
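The "download only if not already present" behaviour can be sketched as a small helper around `wget`. This is an illustration under assumptions, not the code in `pipeline.sh`: the function name `fetch_if_missing` and the `.part` suffix are hypothetical, and the URL in the example is a placeholder.

```shell
#!/bin/sh
# Download URL to DEST unless DEST already exists and is non-empty.
# The file is written to DEST.part first so that an interrupted download
# is not mistaken for a complete file on the next run.
fetch_if_missing() {
    url=$1
    dest=$2
    if [ -s "$dest" ]; then
        echo "Already present, skipping: $dest"
        return 0
    fi
    mkdir -p "$(dirname "$dest")" &&
    wget -O "$dest.part" "$url" &&
    mv "$dest.part" "$dest"
}

# Example call with a placeholder URL:
# fetch_if_missing "https://example.org/path/to/genome.fa" "reference/genome.fa"
```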
This script is divided into sections for clarity. Below is a breakdown of each part.
The script downloads:
- Reference genome (FASTA file and associated index and dictionary files)
- Known sites for base quality score recalibration (BQSR) like dbSNP and known indels
- Sample input reads (paired-end FASTQ files for demonstration)
- GATK Funcotator data sources for annotation
- Align Reads: Uses `bwa mem` to align input reads to the reference genome.
- Convert and Sort: Converts SAM to BAM, sorts, and indexes the BAM file.
- Mark Duplicates: Identifies and marks duplicate reads using GATK.
- Base Quality Score Recalibration (BQSR): Recalibrates base quality scores using known sites of variation.
- Germline Variant Calling: Calls raw germline variants (SNPs and INDELs) using GATK HaplotypeCaller.
- Filter Variants: Applies quality filters to SNPs and INDELs to reduce false positives.
- Select Passed Variants: Extracts variants that passed filtering criteria for further analysis.
- Annotate with Funcotator: Annotates filtered variants using GATK Funcotator.
- Export to Table: Outputs key fields from annotated SNPs to a tab-delimited table.
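The steps above can be sketched as a handful of shell functions. This is a minimal outline under assumptions, not a copy of `pipeline.sh`: the default paths, sample name, thread count, `hg38` reference version, and intermediate file names (`dedup_reads.bam`, `recal_reads.bam`, `raw_snps.vcf`, `analysis-ready-snps.vcf`) are placeholders. The SNP thresholds follow GATK's generic hard-filtering guidance and may differ from the values the script uses. INDELs would follow the same pattern with `--select-type-to-include INDEL` and their own thresholds.

```shell
#!/bin/sh
# Placeholder defaults; override them from the environment as needed.
ref="${ref:-reference/genome.fa}"
known_sites="${known_sites:-reference/known_sites.vcf}"
reads_dir="${reads_dir:-reads}"
results="${results:-results}"
funcotator_data="${funcotator_data:-funcotator_data}"
sample="${sample:-sample1}"

align_and_sort() {
    # GATK requires read groups, so one is attached during alignment.
    bwa mem -t 4 -R "@RG\tID:${sample}\tPL:ILLUMINA\tSM:${sample}" "$ref" \
        "$reads_dir/${sample}_1.fastq" "$reads_dir/${sample}_2.fastq" \
        > "$results/aligned_reads.sam" &&
    # samtools sort accepts SAM input; -O bam makes the output format explicit.
    samtools sort -O bam -o "$results/sorted_reads.bam" "$results/aligned_reads.sam" &&
    samtools index "$results/sorted_reads.bam"
}

dedup_and_recalibrate() {
    gatk MarkDuplicates -I "$results/sorted_reads.bam" \
        -O "$results/dedup_reads.bam" -M "$results/dup_metrics.txt" &&
    # Build the recalibration model from known variant sites, then apply it.
    gatk BaseRecalibrator -R "$ref" -I "$results/dedup_reads.bam" \
        --known-sites "$known_sites" -O "$results/recal_data.table" &&
    gatk ApplyBQSR -R "$ref" -I "$results/dedup_reads.bam" \
        --bqsr-recal-file "$results/recal_data.table" \
        -O "$results/recal_reads.bam"
}

call_variants() {
    gatk HaplotypeCaller -R "$ref" -I "$results/recal_reads.bam" \
        -O "$results/raw_variants.vcf"
}

filter_snps() {
    gatk SelectVariants -R "$ref" -V "$results/raw_variants.vcf" \
        --select-type-to-include SNP -O "$results/raw_snps.vcf" &&
    # Records failing an expression are tagged in FILTER, not removed.
    gatk VariantFiltration -R "$ref" -V "$results/raw_snps.vcf" \
        --filter-name "QD2" --filter-expression "QD < 2.0" \
        --filter-name "FS60" --filter-expression "FS > 60.0" \
        --filter-name "MQ40" --filter-expression "MQ < 40.0" \
        --filter-name "SOR3" --filter-expression "SOR > 3.0" \
        -O "$results/filtered_snps.vcf" &&
    # Keep only the records that passed every filter.
    gatk SelectVariants --exclude-filtered -V "$results/filtered_snps.vcf" \
        -O "$results/analysis-ready-snps.vcf"
}

annotate_and_export() {
    gatk Funcotator --variant "$results/analysis-ready-snps.vcf" \
        --reference "$ref" --ref-version hg38 \
        --data-sources-path "$funcotator_data" \
        --output "$results/analysis-ready-snps-filteredGT-functotated.vcf" \
        --output-file-format VCF &&
    gatk VariantsToTable \
        -V "$results/analysis-ready-snps-filteredGT-functotated.vcf" \
        -F CHROM -F POS -F REF -F ALT -F FUNCOTATION \
        -O "$results/output_snps.table"
}

run_pipeline() {
    mkdir -p "$results" && align_and_sort && dedup_and_recalibrate &&
    call_variants && filter_snps && annotate_and_export
}
```

Calling `run_pipeline` runs the stages in order and stops at the first one that fails, because each stage is chained with `&&`.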
Results are stored in the specified output directory (`output_directory/results`). Key output files include:
- `aligned_reads.sam` and `sorted_reads.bam`: Aligned read files
- `raw_variants.vcf`: Initial germline variant calls
- `filtered_snps.vcf` and `filtered_indels.vcf`: Quality-filtered SNPs and INDELs
- `analysis-ready-snps-filteredGT-functotated.vcf`: SNPs annotated with Funcotator
- `output_snps.table`: Tab-delimited table with annotated SNPs
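As a quick sanity check on the VCF outputs, the number of variant records can be counted by skipping the header lines, which all start with `#`. The helper name `count_variants` is hypothetical, and the example path assumes the output directory named above.

```shell
#!/bin/sh
# Count variant records in an uncompressed VCF.
# Header lines start with '#'; every other line is one record.
count_variants() {
    grep -vc '^#' "$1"
}

# Example call:
# count_variants output_directory/results/raw_variants.vcf
```

Comparing the counts for `raw_variants.vcf` and the filtered files gives a rough sense of how much was split out or flagged at each stage. Note that filtered VCFs still contain the failing records, marked in the FILTER column.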
- Adjust Paths: Paths in the script (`ref`, `reads_dir`, `results`, `funcotator_data`) are customizable based on your file organization.
- Download URLs: Placeholder URLs for genome resources should be replaced with actual resource URLs (e.g., from UCSC or NCBI).
- Runtime: Processing times depend on data size and computational resources.
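When pointing the script at your own data, it can help to set the four path variables in one place and confirm they resolve before a long run. The sketch below assumes the variable names listed above. The default locations under `$HOME/genomics` and the function name `check_paths` are hypothetical.

```shell
#!/bin/sh
# Hypothetical defaults; each can be overridden from the environment,
# e.g. `ref=/data/hg38.fa ./pipeline.sh`.
ref="${ref:-$HOME/genomics/reference/genome.fa}"
reads_dir="${reads_dir:-$HOME/genomics/reads}"
results="${results:-$HOME/genomics/results}"
funcotator_data="${funcotator_data:-$HOME/genomics/funcotator_data}"

# Print a message for every configured input that does not exist.
# Returns 0 when all inputs are present and 1 otherwise.
# `results` is not checked because it is created on demand.
check_paths() {
    status=0
    [ -f "$ref" ] || { echo "Reference FASTA not found: $ref" >&2; status=1; }
    [ -d "$reads_dir" ] || { echo "Reads directory not found: $reads_dir" >&2; status=1; }
    [ -d "$funcotator_data" ] || { echo "Funcotator data not found: $funcotator_data" >&2; status=1; }
    return "$status"
}
```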
If you encounter any issues or have questions, feel free to open an issue or contact the repository maintainer.
This pipeline script is for educational and demonstrative purposes. Contributions to improve the script’s efficiency or add new features are welcome!