Germline Mutation Analysis Pipeline with GATK4 HaplotypeCaller

This repository contains a pipeline script for germline mutation analysis with the Genome Analysis Toolkit (GATK4) HaplotypeCaller. It covers the full workflow: downloading the reference files and reads, aligning and preprocessing the reads, calling variants, and filtering and annotating the calls.

Pipeline Overview

The pipeline is designed for germline variant calling in human samples, or in other organisms for which a reference genome and known-variant sets are available. It includes:

Downloading reference genome and known variants
Aligning reads to a reference genome
Marking duplicates and recalibrating base quality scores
Calling germline variants (SNPs and INDELs) with GATK's HaplotypeCaller
Filtering variants and annotating them with Funcotator

This pipeline is ideal for researchers and bioinformaticians working on germline variant discovery.

Requirements

This pipeline requires the following tools to be installed and accessible in your $PATH:

bwa for sequence alignment
samtools for SAM/BAM file handling
GATK4 for germline variant calling, filtering, and annotation
wget for downloading files (replaceable with other download tools)

To install these tools and their dependencies, consult the official documentation of each tool.
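
Before running the pipeline, it can help to confirm that every tool is reachable. The following is a minimal sketch, assuming GATK4 is installed with its usual `gatk` wrapper on the `$PATH`; the function name `missing_tools` is illustrative and not part of the pipeline script.

```shell
#!/usr/bin/env bash
# Print the name of every tool that cannot be found on $PATH.
# missing_tools is a hypothetical helper, not part of the pipeline script.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v returns non-zero when the tool is absent
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool list taken from the Requirements section above.
missing=$(missing_tools bwa samtools gatk wget)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
else
    echo "All required tools found."
fi
```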

Setup

Clone this repository.

git clone https://github.com/yourusername/germline-mutation-analysis-pipeline.git
cd germline-mutation-analysis-pipeline

Ensure bwa, samtools, GATK4, and wget are installed.
Download any required data sources for GATK Funcotator annotation. Update paths in the script if needed.

Usage

To run the pipeline:

Make the script executable (the script in this repository is named pipeline-WGS-GATK.sh):
```
chmod +x pipeline-WGS-GATK.sh
```
Run the script:
```
./pipeline-WGS-GATK.sh
```

The script automatically downloads the necessary files (reference genome, known variants, and example reads) if they are not already present, so make sure you have an internet connection for the first run.
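
The download-if-missing behaviour described above can be sketched roughly as follows. This is an illustration, not the script's actual code: `fetch_if_missing` is a hypothetical helper, and the URL and destination are supplied by the caller.

```shell
#!/usr/bin/env bash
# Download a file only when a non-empty copy is not already present.
# fetch_if_missing is a hypothetical helper; it is not taken from the pipeline script.
fetch_if_missing() {
    local url=$1 dest=$2
    if [ -s "$dest" ]; then
        echo "skip: $dest already present"
        return 0
    fi
    mkdir -p "$(dirname "$dest")"
    # Download under a temporary name so an interrupted transfer
    # is not mistaken for a complete file on the next run.
    wget -q -O "$dest.part" "$url" && mv "$dest.part" "$dest"
}
```

A call would look like `fetch_if_missing "$REF_URL" "$ref"`, where `REF_URL` stands for whichever resource URL you configure.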

Pipeline Steps

This script is divided into sections for clarity. Below is a breakdown of each part.

PART 0: Download Required Files

The script downloads:

Reference genome (FASTA file and associated index and dictionary files)
Known-sites files for base quality score recalibration (BQSR), such as dbSNP and known indels
Sample input reads (paired-end FASTQ files for demonstration)
GATK Funcotator data sources for annotation
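
If a downloaded reference does not ship with its index and dictionary files, they can be generated locally. The sketch below assumes a FASTA path of your choosing (`reference/genome.fa` is a placeholder) and prints the commands by default; set `DRY_RUN=0` to execute them. The `run` and `prepare_reference` helpers are illustrative.

```shell
#!/usr/bin/env bash
# Print a command, and execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Build the companion files that samtools, GATK, and bwa expect next to the FASTA.
prepare_reference() {
    local ref=$1    # path to the reference FASTA (placeholder)
    run samtools faidx "$ref"                                        # writes ${ref}.fai
    run gatk CreateSequenceDictionary -R "$ref" -O "${ref%.*}.dict"  # writes the .dict file
    run bwa index "$ref"                                             # writes the BWA index files
}

prepare_reference "${REF:-reference/genome.fa}"
```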

PART 1: Preprocessing and Germline Variant Calling

Align Reads: Uses bwa mem to align input reads to the reference genome.
Convert and Sort: Converts SAM to BAM, sorts, and indexes the BAM file.
Mark Duplicates: Identifies and marks duplicate reads using GATK.
Base Quality Score Recalibration (BQSR): Recalibrates base quality scores using known sites of variation.
Germline Variant Calling: Calls raw germline variants (SNPs and INDELs) using GATK HaplotypeCaller.
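
The steps above can be sketched as the command sequence below. It is an approximation under stated assumptions, not a copy of the script: the thread count, read group values, and intermediate file names (`dedup_reads.bam`, `dup_metrics.txt`, `recal_data.table`, `recal_reads.bam`) are hypothetical, and `MarkDuplicates` is used although the script may use a different GATK duplicate-marking tool. Commands are printed by default; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Print a command, and execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

part1() {
    local ref=$1 known=$2 r1=$3 r2=$4 out=$5 sample=${6:-sample1}

    # Align reads. GATK needs a read group, passed to bwa with -R.
    # -o is available in recent bwa releases; otherwise redirect stdout to the SAM file.
    run bwa mem -t 4 -R "@RG\\tID:${sample}\\tSM:${sample}\\tPL:ILLUMINA" \
        -o "$out/aligned_reads.sam" "$ref" "$r1" "$r2"

    # Convert to BAM, sort by coordinate, and index.
    run samtools sort -o "$out/sorted_reads.bam" "$out/aligned_reads.sam"
    run samtools index "$out/sorted_reads.bam"

    # Mark duplicate reads.
    run gatk MarkDuplicates -I "$out/sorted_reads.bam" \
        -O "$out/dedup_reads.bam" -M "$out/dup_metrics.txt"

    # BQSR: build the recalibration model, then apply it.
    run gatk BaseRecalibrator -I "$out/dedup_reads.bam" -R "$ref" \
        --known-sites "$known" -O "$out/recal_data.table"
    run gatk ApplyBQSR -I "$out/dedup_reads.bam" -R "$ref" \
        --bqsr-recal-file "$out/recal_data.table" -O "$out/recal_reads.bam"

    # Call raw germline SNPs and INDELs.
    run gatk HaplotypeCaller -R "$ref" -I "$out/recal_reads.bam" \
        -O "$out/raw_variants.vcf"
}

# Placeholder paths; replace them with your own files.
part1 reference/genome.fa reference/known_sites.vcf \
      reads/sample_1.fastq reads/sample_2.fastq results
```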

PART 2: Filter and Annotate Variants

Filter Variants: Applies quality filters to SNPs and INDELs to reduce false positives.
Select Passed Variants: Extracts variants that passed filtering criteria for further analysis.
Annotate with Funcotator: Annotates filtered variants using GATK Funcotator.
Export to Table: Outputs key fields from annotated SNPs to a tab-delimited table.
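
A rough sketch of these steps follows. The filter thresholds are GATK's generic hard-filtering recommendations and may differ from the ones in the script; the intermediate file names (`raw_snps.vcf`, `raw_indels.vcf`, `analysis-ready-snps.vcf`), the `hg38` reference version, and the exported fields are assumptions. Only the SNP branch is carried through annotation here. Commands are printed by default (without their original quoting); set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Print a command, and execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

part2() {
    local ref=$1 raw=$2 funcotator_data=$3 out=$4

    # Split the raw calls into SNPs and INDELs.
    run gatk SelectVariants -R "$ref" -V "$raw" \
        --select-type-to-include SNP -O "$out/raw_snps.vcf"
    run gatk SelectVariants -R "$ref" -V "$raw" \
        --select-type-to-include INDEL -O "$out/raw_indels.vcf"

    # Hard-filter each set; failing records get the filter name in the FILTER column.
    run gatk VariantFiltration -R "$ref" -V "$out/raw_snps.vcf" \
        --filter-expression "QD < 2.0" --filter-name "QD2" \
        --filter-expression "FS > 60.0" --filter-name "FS60" \
        --filter-expression "MQ < 40.0" --filter-name "MQ40" \
        -O "$out/filtered_snps.vcf"
    run gatk VariantFiltration -R "$ref" -V "$out/raw_indels.vcf" \
        --filter-expression "QD < 2.0" --filter-name "QD2" \
        --filter-expression "FS > 200.0" --filter-name "FS200" \
        -O "$out/filtered_indels.vcf"

    # Keep only SNPs that passed every filter.
    run gatk SelectVariants --exclude-filtered \
        -V "$out/filtered_snps.vcf" -O "$out/analysis-ready-snps.vcf"

    # Annotate with Funcotator (hg38 is an assumption).
    run gatk Funcotator --variant "$out/analysis-ready-snps.vcf" \
        --reference "$ref" --ref-version hg38 \
        --data-sources-path "$funcotator_data" \
        --output "$out/analysis-ready-snps-filteredGT-functotated.vcf" \
        --output-file-format VCF

    # Export selected fields to a tab-delimited table.
    run gatk VariantsToTable \
        -V "$out/analysis-ready-snps-filteredGT-functotated.vcf" \
        -F CHROM -F POS -F REF -F ALT -F FUNCOTATION \
        -O "$out/output_snps.table"
}

# Placeholder paths; replace them with your own files.
part2 reference/genome.fa results/raw_variants.vcf funcotator_data results
```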

Output

Results are stored in the specified output directory (output_directory/results). Key output files include:

aligned_reads.sam and sorted_reads.bam: Aligned reads, as the raw SAM from bwa and as the sorted BAM
raw_variants.vcf: Initial germline variant calls
filtered_snps.vcf and filtered_indels.vcf: Quality-filtered SNPs and INDELs
analysis-ready-snps-filteredGT-functotated.vcf: SNPs annotated with Funcotator
output_snps.table: Tab-delimited table with annotated SNPs
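
As a quick sanity check on a filtered VCF, you can count records per value of the FILTER column. The helper below is a small sketch for uncompressed VCF files; `filter_summary` is an illustrative name, not part of the pipeline script.

```shell
#!/usr/bin/env bash
# Count VCF records per FILTER value (column 7), skipping header lines.
# Works on uncompressed VCF; filter_summary is a hypothetical helper.
filter_summary() {
    awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print f, n[f] }' "$1" | sort
}
```

For example, `filter_summary results/filtered_snps.vcf` prints one line per filter status with its record count.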

Notes

Adjust Paths: Paths in the script (ref, reads_dir, results, funcotator_data) are customizable based on your file organization.
Download URLs: Placeholder URLs for genome resources should be replaced with actual resource URLs (e.g., from UCSC or NCBI).
Runtime: Processing times depend on data size and computational resources.
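
One way to keep the paths adjustable is to give each variable a default that the environment can override. The sketch below reuses the variable names listed above; `base_dir` and all default locations are placeholders, and the script itself may define these variables differently.

```shell
#!/usr/bin/env bash
# Each path falls back to a default unless it is already set in the environment.
# base_dir and the default locations are placeholders.
base_dir="${base_dir:-$PWD/output_directory}"
ref="${ref:-$base_dir/reference/genome.fa}"
reads_dir="${reads_dir:-$base_dir/reads}"
results="${results:-$base_dir/results}"
funcotator_data="${funcotator_data:-$base_dir/funcotator_data_sources}"

# Show the effective configuration.
printf '%s=%s\n' ref "$ref" reads_dir "$reads_dir" \
    results "$results" funcotator_data "$funcotator_data"
```

With paths defined this way, a single run can be redirected by exporting a variable such as `results` before starting the script.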

If you encounter any issues or have questions, feel free to open an issue or contact the repository maintainer.

This pipeline script is intended for educational and demonstration purposes. Contributions that improve the script’s efficiency or add new features are welcome!
