-
Notifications
You must be signed in to change notification settings - Fork 0
1‐Introduction
- Welcome to the G-WASPiper wiki!
- A handy pipeline for GWAS/genotyping data analysis.
- It can be easly used without having coding skills.
- Each pipelines are general indipentent , and can be highly modified by changing
R scripts
.
- This repository is my collection of pipelines written in R to simplify, please cite original tools/approaches referred in the pipelines. The main idea here is putting every step in a pipe to ensure repredociblity and simplyfy all process.
- The pipeline designed to start from genotyping results to create ancestery / GWAS / TWAS / PRS / MR / xQTL / GWIS / FINE mapping (will be updated) analysis.
- The main idea here creating standart pipeline for all process.
- There diffrent scripts for diffrent aims which written in R. python or bash.
- There is not any correct order to run any script it is highly depented aim and input file
- This pipeline is designed very flexible to start any step and go to the other
- But, please carefully read the instruction and check arguments in the scripts. If needed please change it!
- Input: Illumina final report, strand file (usually availble in genotyping proviers web page)
- Functions:
- Takes illumina final report files and convert it to pplink binary files. Checking and aligning all variants to TOP+ strand
- Filter low quaility variants GC Score > 0.15 (if neseccery please change it in the script before running)
- Check report and strand file structure
- Outputs
- final2plink/ (Igen files - Pheno01.fam Pheno01.lgen Pheno01.map), ped/map files (Pheno02.log Pheno02.map Pheno02.nosex Pheno02.ped), non-TOP+ SNPs (flip.missnp)
- raw_plink/ plink binary files (Pheno03.bed Pheno03.bim Pheno03.fam Pheno03.log Pheno03.nosex)
Basic Usage:
Rscript --no-save finalreport2plink.R --frn <final_report_name> --pfr <final_report_path> --sfn <strand_file_name> --psf <strand_file_path>
Example:
Rscript --no-save finalreport2plink.R --frn FinalReport.txt --pfr /path/to/final/report/ --sfn strand_file.csv --psf /path/to/strand/file/
How to run in the R:
system ("cd G-WASPiper/scripts ; Rscript --no-save finalreport2plink.R --frn FinalReport.txt --pfr /path/to/final/report/ --sfn strand_file.csv --psf /path/to/strand/file/")
Please change --frn (final report file names), --pfr (path for final report), --sfn(name of strand file, --psf (path for strand file)
- List of arguments:
Argument | Default | Explaination |
---|---|---|
--frn | FinalReport.txt | Name of final report |
--frp | data | Path for final report |
--sfn | strand_file.csv | Name of strand file |
--psf | data | Path for strand file |
-
Important notes
- Please add phenotype and sex information by modifing R script to .fam file if it is availble
- Before running this script you need to find correct strand file from service provider
- Now you should have plink binary (bfile) files and wants to do basic quality control for individual and genotyping level
- To make easy and systematic we will use in the each step Pheno[XX] naming, from previous step we have Pheno03 .fam .bim .bed in the raw_plink folder
- If your bfiles has different name please change it Pheno03 to do this step and continue next steps
- Please add phenotype and sex information to fam file if you didnt put it in the first step
- Before the removing any SNPs or individuals, we need to understand the data. How many individual do you have?
How to check number of individual the Terminal?:
wc -l Pheno03.fam
- The pipeline created for large scale studies, so sample size should be minimum ~500 the conduct this QC steps
- If you sample size below to this numbers, please dont use very strick tresholds to filter variant and individual (e.g. MAF, relatedness)
- If you sample size too small consider to merge public data set to run same QC steps
- The very first step of QC is make a decision for the MAF, mind, geno, hwe
R:
system("cd raw_plink ; plink --bfile Pheno03 --missing --freq --hardy ; cd .. ; mv raw_plink/plink* results")
Terminal:
plink --bfile raw_plink/Pheno03 --missing --freq --hardy
mv raw_plink/plink* results
system ("cd scripts ; Rscript --no-save basic_stat.R")
#############################################################################
system("cd raw_plink ; plink --bfile Pheno03 --missing --freq --hardy ; cd .. ; mv raw_plink/plink* results")
system ("cd scripts ; Rscript --no-save basic_stat.R")
system("cd raw_plink ; plink --bfile Pheno03 --geno 0.02 --mind 0.02 --maf 0.01 --hwe include-nonctrl 5e-08 --make-bed --out Pheno04 ; cd .. ; mv raw_plink/Pheno04* QCsteps")
system("cd QCsteps ; plink --bfile Pheno04 --check-sex ; cd .. ; cp QCsteps/plink* results")
system("cd QCsteps; grep PROBLEM plink.sexcheck| awk '{print$1,$2}'> sex_discrepancy.txt")
system("cd QCsteps ; plink --bfile Pheno04 --remove sex_discrepancy.txt --make-bed --out Pheno05")
system("cd QCsteps ; plink --bfile Pheno04 --autosome --make-bed --out Pheno06")
system("plink --bfile QCsteps/Pheno06 --exclude scripts/high-LD-regions-hg19-GRCh37.txt --range --indep-pairwise 50 5 0.3 --out indepSNP ; mv indepSNP* QCsteps/") #highld region for 37 and 38 provided
system("cd QCsteps ; plink --bfile Pheno06 --extract indepSNP.prune.in --het --out check")
system("Rscript --no-save heterozygosity.R")
system("plink --bfile QCsteps/Pheno06 --remove results/fail-het.txt --make-bed --out Pheno07 ; mv Pheno07* QCsteps")
#############################################################################
system("cd QCsteps ; plink --bfile Pheno06 --extract indepSNP.prune.in --genome --out pihat ; cd .. ; mv QCsteps/pihat* results")
system("Rscript --no-save Relatedness.R")
system("plink --bfile QCsteps/Pheno06 --remove results/relatedness_fail.txt --make-bed --out Pheno08 ; mv Pheno08* QCsteps")
system("Rscript --no-save 1000G_download.R")
system("Rscript --no-save 1000G_merge.R")
system("bash scripts/fraposa.sh")
system("Rscript --no-save scripts/fraposa.R")
system("Rscript --no-save scripts/ancestry.R")
system("bash scripts/plink2vcf.sh")
cd topmed
mkdir topmed_raw cp *zip topmed_raw
cd topmed mkdir topmed_raw cp *zip topmed_raw
Please change "your-password" argument with password that you get from topmed after your imputation done via email
system("bash scripts/unziptopmed.sh")
#!/bin/bash for ((chr=1; chr<=22; chr++)); do unzip -P 'your-password' chr_${chr}.zip done
wget -O GRCh38_full_analysis_set_plus_decoy_hla.fa.zst https://www.dropbox.com/s/xyggouv3tnamh0j/GRCh38_full_analysis_set_plus_decoy_hla.fa.zst?dl=1 zstd --decompress GRCh38_full_analysis_set_plus_decoy_hla.fa.zst
#concat combines all the chromosomes into a single file #view filters by info score #norm normalises indels. Split multiallelic sites into biallelic records. SNPs and indels merged into a single record #create final gzipped VCF file and annotate. Remove original SNP ID and assign new SNP ID as chrom:position:ref:alt
bcftools concat chr*.dose.vcf.gz -Ou | bcftools view -Ou -i 'R2>0.3' | bcftools norm -Ou -m -any | bcftools norm -Ou -f hs37d5.fa | bcftools annotate -Oz -x ID -I +'%CHROM:%POS:%REF:%ALT' -o allchr.converted.R2_0p3.vcf.gz
#--double-id means that both family and within-family IDs to be set to the sample ID
plink --vcf allchr.converted.R2_0p3.vcf.gz
--double-id
--allow-extra-chr 0
--maf 0.0000001
--make-bed
--out allchr.converted.R2_0p3.MAF_1e7
#You can also include these other flags - depends on what you want to do
--vcf-min-GP 0.9 \
--vcf-idspace-to _ \
system("bash scripts/QC_impute.R")
1) Marees, A. T., de Kluiver, H., Stringer, S., Vorspan, F., Curis, E., Marie‐Claire, C., & Derks, E. M. (2018). A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis. International journal of methods in psychiatric research, 27(2), e1608.
setwd("your/finalreport/path")
setwd("HOME/{user}/{path/folder containing your final report}")
folder name: final2plink/ flip.missnp Pheno01.fam Pheno01.lgen Pheno01.map Pheno02.log Pheno02.map Pheno02.nosex Pheno02.ped
folder name: your main folder/ InfiniumOmniExpress-24v1-4_A1.csv InfiniumOmniExpress-24v1-4_A1_csv.zip
The pipeline created for large scale studies, so sample size should be minimum 1000s the conduct this QC steps
Also please change this thresholds accourding to your aim and consider to merge your data with public data if possible
system("cd raw_plink ; plink --bfile Pheno03 --missing --freq --hardy ; cd .. ; mv raw_plink/plink* results")
system ("cd scripts ; Rscript --no-save basic_stat.R")
genothreshold.txt: shows how many SNP will remain after setting --geno filter with certain threshold
mindthreshold.txt: shows how individual will remain after setting --mind filter with certain threshold
hwethreshold.txt: Show how many SNP will remain after setting --hwe filter. all, aff(cases only),unaff(controls)
hwe.pdf: Two pages. First page: histogram of x=P values y=SNP count, second page x=P values y=SNP percentage
MAF >= 0 & MAF < 0.001 ~ "very_rare", MAF >= 0.001 & MAF < 0.005 ~ "rare", MAF >= 0.005 & MAF < 0.05 ~ "low_freq", MAF >= 0.05 & MAF <= 0.5 ~ "common
maf.pdf: Two pages. First page: histogram of x=MAF y=SNP count, second page x=MAF y=SNP by= categories
Warning! if your sample size less thound or only a few hundered be carefull to apply maf filter may results loss of too many variant
system("cd raw_plink ; plink --bfile Pheno03 --geno 0.02 --mind 0.02 --maf 0.01 --hwe include-nonctrl 5e-08 --make-bed --out Pheno04 ; cd .. ; mv raw_plink/Pheno04* QCsteps")
#outputs: Folder: QCsteps/ Pheno04.bed Pheno04.bim Pheno04.fam Pheno04.hh Pheno04.log
Subjects who were a priori determined as females must have a F value of <0.2, and subjects who were a priori determined as males must have a F value >0.8. This F value is based on the X chromosome inbreeding (homozygosity) estimate.
system("cd QCsteps ; plink --bfile Pheno04 --check-sex ; cd .. ; cp QCsteps/plink* results")
#outputs: Folder: results/ plink.sexcheck
system("cd QCsteps; grep PROBLEM plink.sexcheck| awk '{print$1,$2}'> sex_discrepancy.txt")
Optional(if needed): system("cd QCsteps ; plink --bfile Pheno04 --remove sex_discrepancy.txt --make-bed --out Pheno05")
system("cd QCsteps ; plink --bfile Pheno04 --autosome --make-bed --out Pheno06")
#the window at each step, and the multiple correlation coefficient for a SNP being regressed on all #other SNPs simultaneously.
In the case of small sample size please do not remove any individual before merging your data with public data like 1000G; see 1000G steps
system("plink --bfile QCsteps/Pheno06 --exclude scripts/high-LD-regions-hg19-GRCh37.txt --range --indep-pairwise 50 5 0.3 --out indepSNP ; mv indepSNP* QCsteps/") #highld region for 37 and 38 provided
system("cd QCsteps ; plink --bfile Pheno06 --extract indepSNP.prune.in --het --out check")
#Output:QCsteps/check.het
system("Rscript --no-save heterozygosity.R")
Optional(if needed): system("plink --bfile QCsteps/Pheno06 --remove results/fail-het.txt --make-bed --out Pheno07 ; mv Pheno07* QCsteps")
############################################################################# ############################################################################# #############################################################################
Assuming a random population sample we are going to exclude all individuals above the pihat threshold of 0.2 in this tutorial.
Recent shared ancestry for a pair of individuals (identity by descent, IBD) can be estimated using genome-wide IBS
Genotyping error, LD and population structure cause variation around these theoretical values and it is typical to remove
one individual from each pair with an IBD > 0.1875 (halfway between the expected IBD for third- and second-degree relaIves).
system("cd QCsteps ; plink --bfile Pheno06 --extract indepSNP.prune.in --genome --out pihat ; cd .. ; mv QCsteps/pihat* results")
For the practical reason IBD based relatedness threshold can be vary but so thresholds below used to identify related and relatedness_degree
system("Rscript --no-save Relatedness.R")
Output : results/ relatedness.txt relatedness_degree.txt relatedness_missing.txt relatedness_fail.txt relatedness.pdf
The R script will create relatedness_fail.txt which select related individuals with lowest genotyping rate
Optional(if needed): system("plink --bfile QCsteps/Pheno06 --remove results/relatedness_fail.txt --make-bed --out Pheno08 ; mv Pheno08* QCsteps")
system("Rscript --no-save 1000G_download.R")
You can delete all other raw files (all_phase3.pgen all_phase3.pvar clean.bed clean.fam genetic_map_b37 phase3_corrected.psam raw.bim raw.log
system("Rscript --no-save 1000G_merge.R")
Output(s) : Folder: QCsteps/ final_1k_merged.bed/bim/fam/log common_1000g.bed/bim/fam/log common_Pheno06.bed/bim/fam/log
system("bash scripts/fraposa.sh")
Outputs: Folder fraposa/Pheno06_stupref.png Pheno06_stupref1.png Pheno06_stupref.popu Pheno06_stupref1.popu
system("Rscript --no-save scripts/fraposa.R") #Outputs !!!!
system("Rscript --no-save scripts/ancestry.R")
system("bash scripts/plink2vcf.sh")
#Outputs: Folder: QCsteps/vcf files for each chr to updated TOPMed imputation server
cd topmed
mkdir topmed_raw cp *zip topmed_raw
cd topmed mkdir topmed_raw cp *zip topmed_raw
Please change "your-password" argument with password that you get from topmed after your imputation done via email
system("bash scripts/unziptopmed.sh")
#!/bin/bash for ((chr=1; chr<=22; chr++)); do unzip -P 'your-password' chr_${chr}.zip done
wget -O GRCh38_full_analysis_set_plus_decoy_hla.fa.zst https://www.dropbox.com/s/xyggouv3tnamh0j/GRCh38_full_analysis_set_plus_decoy_hla.fa.zst?dl=1 zstd --decompress GRCh38_full_analysis_set_plus_decoy_hla.fa.zst
#concat combines all the chromosomes into a single file #view filters by info score #norm normalises indels. Split multiallelic sites into biallelic records. SNPs and indels merged into a single record #create final gzipped VCF file and annotate. Remove original SNP ID and assign new SNP ID as chrom:position:ref:alt
bcftools concat chr*.dose.vcf.gz -Ou | bcftools view -Ou -i 'R2>0.3' | bcftools norm -Ou -m -any | bcftools norm -Ou -f hs37d5.fa | bcftools annotate -Oz -x ID -I +'%CHROM:%POS:%REF:%ALT' -o allchr.converted.R2_0p3.vcf.gz
#--double-id means that both family and within-family IDs to be set to the sample ID
plink --vcf allchr.converted.R2_0p3.vcf.gz
--double-id
--allow-extra-chr 0
--maf 0.0000001
--make-bed
--out allchr.converted.R2_0p3.MAF_1e7
#You can also include these other flags - depends on what you want to do
--vcf-min-GP 0.9 \
--vcf-idspace-to _ \
system("bash scripts/QC_impute.R")
- It is my personal reposotry to keep track all pipeline that I am using.
- Of course anyone can used as it or with modifaction.
- I will acknowledge any resourse/pipeline/code inclueded this codes.
- It is not automatic pipeline or click and run pipeline!
- You need to modify some arguments (MAF, INFO, HWE, p value etc.) in the codes, so please be carefull before running any pipeline.
- Some steps and pipelines are need strong computational resources and running this process with out any paralellization/optimization running pipelines as it will waste your time.
- If you have access to any HPC, please run this analysis in side to HPC. The pipelines not optimazated for parallel work.
- It is not a novel package/software, published work. I will try to answer/fix any question/bug, but it would be regulary basic.