Skip to content

Commit

Permalink
Whatshap allelic (#1086)
Browse files Browse the repository at this point in the history
* test dag

* fix modes

* abspath

* args

* fix

* sorted

* fix

* wildcard constrains

* does this work

* parenthesis

* expand

* expand

* suffix

* allelic feature counts

* dt qc

* deeptools allelic

* dtqc

* fixes

* multiqc

* pvcf

* reverse in

* multiqc

* parenthesis

* dqc

* DESeq2

* test dag

* link bam

* frombam

* index external

* quotes

* escape fromBAM

* test dag

* test dag

* test dag

* pytest

* pytest

* docs

* underline

* underline

---------

Co-authored-by: [email protected] <[email protected]>
  • Loading branch information
katsikora and [email protected] authored Jan 13, 2025
1 parent 4add952 commit 8d1132a
Show file tree
Hide file tree
Showing 14 changed files with 191 additions and 13 deletions.
12 changes: 12 additions & 0 deletions .ci_stuff/test_dag.sh
Original file line number Diff line number Diff line change
Expand Up @@ -348,6 +348,18 @@ WC=`mRNAseq -m allelic-mapping,deepTools_qc -i allelic_BAM_input/filtered_bam --
if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 1315 ]; then exit 1 ; fi
WC=`mRNAseq -m allelic-mapping,deepTools_qc,alignment-free -i PE_input -o output --sampleSheet .ci_stuff/test_sampleSheet_multiComp.tsv --snakemakeOptions " --dryrun --conda-prefix /tmp" --VCFfile allelic_input/file.vcf.gz --strains strain1 .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v -f .ci_stuff/test_ignore_patterns.txt | sed '/^\s*$/d' | sed '/^host:/d' | wc -l `
if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 3114 ]; then exit 1 ; fi
#whatshap-allelic
WC=`mRNAseq -m allelic-whatshap -i PE_input -o output --snakemakeOptions " --dryrun --conda-prefix /tmp" --phased-vcf allelic_input/file.vcf.gz --sampleSheet .ci_stuff/test_sampleSheet.tsv .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v -f .ci_stuff/test_ignore_patterns.txt | sed '/^\s*$/d' | sed '/^host:/d' | wc -l `
if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 1472 ]; then exit 1 ; fi
#whatshap-allelic + fromBam
WC=`mRNAseq -m allelic-whatshap -i BAM_input/filtered_bam --fromBAM --bamExt '.filtered.bam' -o output --sampleSheet .ci_stuff/test_sampleSheet.tsv --snakemakeOptions " --dryrun --conda-prefix /tmp" --phased-vcf allelic_input/file.vcf.gz .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v -f .ci_stuff/test_ignore_patterns.txt | sed '/^\s*$/d' | sed '/^host:/d' | wc -l `
if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 669 ]; then exit 1 ; fi
#whatshap-allelic + deepTools qc
WC=`mRNAseq -m allelic-whatshap,deepTools_qc -i PE_input -o output --snakemakeOptions " --dryrun --conda-prefix /tmp" --phased-vcf allelic_input/file.vcf.gz --sampleSheet .ci_stuff/test_sampleSheet.tsv .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v -f .ci_stuff/test_ignore_patterns.txt | sed '/^\s*$/d' | sed '/^host:/d' | wc -l `
if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 1965 ]; then exit 1 ; fi
#whatshap-allelic + fromBam + deepTools_qc
WC=`mRNAseq -m allelic-whatshap,deepTools_qc -i BAM_input/filtered_bam --fromBAM --bamExt '.filtered.bam' -o output --sampleSheet .ci_stuff/test_sampleSheet.tsv --snakemakeOptions " --dryrun --conda-prefix /tmp" --phased-vcf allelic_input/file.vcf.gz .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v -f .ci_stuff/test_ignore_patterns.txt | sed '/^\s*$/d' | sed '/^host:/d' | wc -l `
if [ ${PIPESTATUS[0]} -ne 0 ] || [ $WC -ne 1236 ]; then exit 1 ; fi

#ncRNAseq
WC=`ncRNAseq -i PE_input -o output --snakemakeOptions " --dryrun --conda-prefix /tmp" .ci_stuff/organism.yaml | tee >(cat 1>&2) | grep -v -f .ci_stuff/test_ignore_patterns.txt | sed '/^\s*$/d' | sed '/^host:/d' | wc -l `
Expand Down
6 changes: 6 additions & 0 deletions docs/content/News.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,12 @@
snakePipes News
===============

snakePipes 3.1.1
________________

* added whatshap-allelic mode to mRNA seq


snakePipes 3.1.0
________________

Expand Down
11 changes: 11 additions & 0 deletions docs/content/workflows/mRNAseq.rst
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,8 @@ There is a configuration file in ``snakePipes/workflows/mRNAseq/defaults.yaml``:
# for allele-spcific mapping
SNPFile:
NMaskedIndex:
# for allelic-whatshap
pvcf:
#### Flag to control the pipeline entry point
fromBAM: False
bamExt: '.bam'
Expand Down Expand Up @@ -202,6 +204,15 @@ Allele-specific, gene-level differential expression analysis is then performed u

**allelic-counting** mode requires the user to input, per sample, 4 bam files, corresponding to haplotype1, haplotype2, unassigned and haplotagged , e.g. as generated by whatshap. The respective suffixes ".genome1", ".genome2", ".unassigned", ".alelle_flagged" are required to be followed by the bam extention ".sorted.bam". This mode is mutually exclusive with "deepTools_qc". Only the allelic version of deepTools qc will be run, by default. Allelic version of featureCounts will be run by default. If sample sheet is provided, allelic DESeq2- or allelic Salmon-based differential gene expression analysis will be run.


"allelic-whatshap"
~~~~~~~~~~~~~~~~~~

**allelic-whatshap** mode applies a standard alignment to a nonmasked genome with STAR, followed by allele-specific splitting
of mapped files with whatshap, requiring a phased vcf file as input ( ``--phased-vcf`` ). Gene-level quantification is performed for each allele using **featureCounts**.
Allele-specific, gene-level differential expression analysis is then performed using **DESeq2**.


"alignment-free"
~~~~~~~~~~~~~~~~

Expand Down
3 changes: 2 additions & 1 deletion snakePipes/common_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,8 @@ def set_env_yamls():
'CONDA_SAMBAMBA_ENV': 'envs/sambamba.yaml',
'CONDA_pysam_ENV': 'envs/pysam.yaml',
'CONDA_SEACR_ENV': 'envs/chip_seacr.yaml',
'CONDA_FQLINT_ENV': 'envs/fqlint.yaml'
'CONDA_FQLINT_ENV': 'envs/fqlint.yaml',
'CONDA_WHATSHAP_ENV': 'envs/whatshap.yaml'
}


Expand Down
4 changes: 2 additions & 2 deletions snakePipes/shared/rules/DESeq2.singleComp.snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ def get_outdir(folder_name,sampleSheet):
## DESeq2 (on featureCounts)
rule DESeq2:
input:
counts_table = lambda wildcards : "featureCounts/counts_allelic.tsv" if 'allelic-mapping' in mode or "allelic-counting" in mode else "featureCounts/counts.tsv",
counts_table = lambda wildcards : "featureCounts/counts_allelic.tsv" if 'allelic-mapping' in mode or 'allelic-counting' in mode or 'allelic-whatshap' in mode else "featureCounts/counts.tsv",
sampleSheet = sampleSheet,
symbol_file = "Annotation/genes.filtered.symbol" #get_symbol_file
output:
Expand All @@ -20,7 +20,7 @@ rule DESeq2:
outdir = get_outdir("DESeq2",sampleSheet),
fdr = fdr,
importfunc = os.path.join(maindir, "shared", "rscripts", "DE_functions.R"),
allele_info = lambda wildcards : 'TRUE' if 'allelic-mapping' in mode or "allelic-counting" in mode else 'FALSE',
allele_info = lambda wildcards : 'TRUE' if 'allelic-mapping' in mode or 'allelic-counting' in mode or 'allelic-whatshap' in mode else 'FALSE',
tx2gene_file = 'NA',
rmdTemplate = os.path.join(maindir, "shared", "rscripts", "DESeq2Report.Rmd"),
formula = config["formula"],
Expand Down
23 changes: 23 additions & 0 deletions snakePipes/shared/rules/LinkBam.snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,29 @@ if pipeline=="rnaseq" and "allelic-counting" in mode:
shell: "if [[ ! -f {output[0]} ]]; then samtools index {input[0]}; fi"


elif pipeline=="rnaseq" and "allelic-whatshap" in mode:
rule link_bam:
input:
indir + "/{sample}" + bamExt
output:
"filtered_bam/{sample}.filtered.bam"
params:
input_bai = indir + "/{sample}" + bamExt + ".bai",
output_bai = "filtered_bam/{sample}.filtered.bam.bai"
run:
if os.path.exists(params.input_bai) and not os.path.exists(os.path.join(outdir,params.output_bai)):
os.symlink(params.input_bai,os.path.join(outdir,params.output_bai))
if not os.path.exists(os.path.join(outdir,output[0])):
os.symlink(os.path.join(outdir,input[0]),os.path.join(outdir,output[0]))

rule samtools_index_external:
input:
"filtered_bam/{sample}.filtered.bam"
output:
"filtered_bam/{sample}.filtered.bam.bai"
conda: CONDA_SHARED_ENV
shell: "if [[ ! -f {output[0]} ]]; then samtools index {input[0]}; fi"

else:
rule link_bam:
input:
Expand Down
6 changes: 6 additions & 0 deletions snakePipes/shared/rules/envs/whatshap.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
name: snakepipes_whatshap_environment_1.0
channels:
- conda-forge
- bioconda
dependencies:
- whatshap = 2.3
11 changes: 9 additions & 2 deletions snakePipes/shared/rules/multiQC.snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -66,15 +66,22 @@ def multiqc_input_check(return_value):
indir += "allelic_bams"
elif pipeline=="rnaseq":
# must be RNA-mapping, add files as per the mode
if ( "alignment" in mode or "deepTools_qc" in mode or "three-prime-seq" in mode ) and not "allelic-mapping" in mode and not "allelic-counting" in mode:
if ( "alignment" in mode or "deepTools_qc" in mode or "three-prime-seq" in mode ) and not "allelic-mapping" in mode and not "allelic-counting" in mode and not "allelic-whatshap" in mode:
infiles.append( expand(aligner+"/{sample}.markdup.bam", sample = samples) +
expand("Sambamba/{sample}.markdup.txt", sample = samples) +
expand("deepTools_qc/estimateReadFiltering/{sample}_filtering_estimation.txt",sample=samples)+
expand("featureCounts/{sample}.counts.txt", sample = samples))
indir += aligner + " featureCounts "
indir += " Sambamba "
indir += " deepTools_qc "
if "allelic-mapping" in mode or "allelic-counting" in mode:
if "allelic-whatshap" in mode and not fromBAM:
infiles.append( expand(aligner+"/{sample}.markdup.bam", sample = samples) +
expand("Sambamba/{sample}.markdup.txt", sample = samples) +
expand("deepTools_qc/estimateReadFiltering/{sample}_filtering_estimation.txt",sample=samples))
indir += aligner
indir += " Sambamba "
indir += " deepTools_qc "
if "allelic-mapping" in mode or "allelic-counting" in mode or "allelic-whatshap" in mode:
infiles.append( expand("featureCounts/{sample}.allelic_counts.txt", sample = samples) )
indir += aligner + " featureCounts "
if "allelic-mapping" in mode:
Expand Down
40 changes: 40 additions & 0 deletions snakePipes/shared/rules/whatshap.snakefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
rule whatshap_haplotag:
input:
ref = genome_fasta,
pvcf = pvcf,
bam = "filtered_bam/{sample}.filtered.bam",
bai = "filtered_bam/{sample}.filtered.bam.bai"
output:
hbam = "allelic_bams/{sample}.allele_flagged.sorted.bam",
hlist = "allelic_bams/{sample}_haplotype_list.tsv"
benchmark:
"allelic_bams/.benchmark/whatshap_haplotag.{sample}.benchmark"
threads: 4
conda: CONDA_WHATSHAP_ENV
shell: """
whatshap haplotag --ignore-read-groups -o {output.hbam} --reference {input.ref} --output-threads={threads} --output-haplotag-list={output.hlist} {input.pvcf} {input.bam}
"""

rule whatshap_split:
input:
hbam = "allelic_bams/{sample}.allele_flagged.sorted.bam",
hlist = "allelic_bams/{sample}_haplotype_list.tsv"
output:
h1bam = "allelic_bams/{sample}.genome1.sorted.bam",
h2bam = "allelic_bams/{sample}.genome2.sorted.bam",
unbam = "allelic_bams/{sample}.unassigned.sorted.bam"
benchmark:
"allelic_bams/.benchmark/whatshap_split.{sample}.benchmark"
conda: CONDA_WHATSHAP_ENV
shell: """
whatshap split --output-h1 {output.h1bam} --output-h2 {output.h2bam} --output-untagged {output.unbam} {input.hbam} {input.hlist}
"""


rule BAMindex_allelic:
input:
expand("allelic_bams/{sample}.{suffix}.sorted.bam",sample=samples,suffix=['allele_flagged', 'genome1', 'genome2', 'unassigned'])
output:
expand("allelic_bams/{sample}.{suffix}.sorted.bam.bai",sample=samples,suffix=['allele_flagged', 'genome1', 'genome2', 'unassigned'])
conda: CONDA_SHARED_ENV
shell: "samtools index {input}"
19 changes: 14 additions & 5 deletions snakePipes/workflows/mRNAseq/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,13 @@ if not fromBAM:
include: os.path.join(maindir, "shared", "rules", "Salmon_allelic.snakefile")
include: os.path.join(maindir, "shared", "rules", "sleuth.singleComp.snakefile")
# include: os.path.join(maindir, "shared", "rules", "DESeq2.singleComp.snakefile")
elif "allelic-whatshap" in mode:
include: os.path.join(maindir, "shared", "rules", "RNA_mapping.snakefile")
include: os.path.join(maindir, "shared", "rules", "whatshap.snakefile")
include: os.path.join(maindir, "shared", "rules", "deepTools_RNA_allelic.snakefile")
if "alignment-free" in mode:
include: os.path.join(maindir, "shared", "rules", "Salmon_allelic.snakefile")
include: os.path.join(maindir, "shared", "rules", "sleuth.singleComp.snakefile")
else:
# HISAT2/STAR
include: os.path.join(maindir, "shared", "rules", "RNA_mapping.snakefile")
Expand Down Expand Up @@ -108,11 +115,13 @@ else:
elif "allelic-counting" in mode:
allele_mode = "from_split_bam"
include: os.path.join(maindir, "shared", "rules", "deepTools_RNA_allelic.snakefile")

elif "allelic-whatshap" in mode:
include: os.path.join(maindir, "shared", "rules", "whatshap.snakefile")
include: os.path.join(maindir, "shared", "rules", "deepTools_RNA_allelic.snakefile")
include: os.path.join(maindir, "shared", "rules", "LinkBam.snakefile")


if "allelic-mapping" in mode or "allelic-counting" in mode:
if "allelic-mapping" in mode or "allelic-counting" in mode or "allelic-whatshap" in mode:
## featureCounts_allelic
include: os.path.join(maindir, "shared", "rules", "featureCounts_allelic.snakefile")
else:
Expand Down Expand Up @@ -170,7 +179,7 @@ def run_Trimming(trim, fastqc):

def run_alignment_free():
if "alignment-free" in mode:
if not "allelic-mapping" in mode:
if not "allelic-mapping" in mode and not "allelic-whatshap" in mode:
file_list = [
expand("Salmon/{sample}.quant.sf", sample=samples),
"Salmon/TPM.transcripts.tsv",
Expand Down Expand Up @@ -247,7 +256,7 @@ def make_nmasked_genome():
return([])

def run_allelesp_mapping():
if "allelic-mapping" in mode or "allelic-counting" in mode:
if "allelic-mapping" in mode or "allelic-counting" in mode or "allelic-whatshap" in mode:
allele_suffix = ['allele_flagged', 'genome1', 'genome2', 'unassigned']
file_list = [
expand("allelic_bams/{sample}.{suffix}.sorted.bam", sample = samples,
Expand Down Expand Up @@ -289,7 +298,7 @@ def run_deepTools_qc():
"deepTools_qc/plotCorrelation/correlation.pearson.bed_coverage.tsv",
"deepTools_qc/plotCorrelation/correlation.spearman.bed_coverage.tsv",
"deepTools_qc/plotPCA/PCA.bed_coverage.tsv"] )
if 'allelic-mapping' in mode:
if 'allelic-mapping' in mode or 'allelic-whatshap' in mode:
file_list.append(["deepTools_qc/plotEnrichment/plotEnrichment_allelic.tsv"])
if len(samples)>1:
file_list.append( ["deepTools_qc/multiBigwigSummary/coverage_allelic.bed.npz",
Expand Down
2 changes: 2 additions & 0 deletions snakePipes/workflows/mRNAseq/defaults.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,8 @@ plotFormat: png
# for allele-spcific mapping
SNPFile:
NMaskedIndex:
# for allelic-whatshap
pvcf:
#### Flag to control the pipeline entry point
fromBAM: False
bamExt: '.bam'
Expand Down
8 changes: 8 additions & 0 deletions snakePipes/workflows/mRNAseq/internals.snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,14 @@ else:
else:
samples = cf.get_sample_names_bam(infiles, bamExt)

if "allelic-whatshap" in mode:
if not pvcf:
print("Please provide a phased vcf file as input for whatshap!")
exit(1)
if not os.path.exists(pvcf):
print("Phased vcf file " + str(pvcf) + " not found!")
exit(1)

if formula and not sampleSheet:
print("In order to apply custom formula, please provide a sample sheet!")
exit(1)
Expand Down
13 changes: 10 additions & 3 deletions snakePipes/workflows/mRNAseq/mRNAseq.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ def parse_args(defaults={"verbose": False, "configFile": None,
"UMIDedup": False,
"UMIDedupOpts": "", "bcPattern": "NNNNCCCCCCCCC",
"UMIDedupSep": "_", "UMIBarcode": False, "rMats": False,
"fdr": 0.05}):
"fdr": 0.05, "pvcf": None}):
"""
Parse arguments from the command line.
"""
Expand All @@ -49,7 +49,7 @@ def parse_args(defaults={"verbose": False, "configFile": None,
# Workflow options
optional = parser.add_argument_group('Options')
optional.add_argument("-m", "--mode",
help="workflow running modes (available: 'alignment-free, alignment, allelic-mapping, allelic-counting, deepTools_qc, three-prime-seq')"
help="workflow running modes (available: 'alignment-free, alignment, allelic-mapping, allelic-counting, allelic-whatshap, deepTools_qc, three-prime-seq')"
" (default: '%(default)s')",
default=defaults["mode"])

Expand Down Expand Up @@ -95,6 +95,11 @@ def parse_args(defaults={"verbose": False, "configFile": None,
help="Design formula to use in linear model fit (default: '%(default)s')",
default=defaults["formula"])

optional.add_argument("--phased-vcf",
dest="pvcf",
help="Phased vcf required for whatshap haplotagging. (default: '%(default)s')",
default=defaults["pvcf"])


optional.add_argument("--dnaContam",
action="store_true",
Expand Down Expand Up @@ -150,13 +155,15 @@ def main():
# check for Allele-specific mapping mode
args.allele_mode = cf.checkAlleleParams(args)
# convert file path to abspath
if args.pvcf:
args.pvcf = os.path.abspath(args.pvcf)
if args.allele_mode == "create_and_map":
args.VCFfile = os.path.abspath(args.VCFfile)
elif args.allele_mode == "map_only":
args.SNPfile = os.path.abspath(args.SNPfile)
args.NMaskedIndex = os.path.abspath(args.NMaskedIndex)
modeTemp = args.mode.split(",")
validModes = set(["alignment", "alignment-free", "deepTools_qc", "allelic-mapping", "allelic-counting", "three-prime-seq"])
validModes = set(["alignment", "alignment-free", "deepTools_qc", "allelic-mapping", "allelic-counting", "allelic-whatshap", "three-prime-seq"])
for mode in modeTemp:
if mode not in validModes:
sys.exit("{} is not a valid mode!\n".format(mode))
Expand Down
46 changes: 46 additions & 0 deletions tests/test_jobcounts.py
Original file line number Diff line number Diff line change
Expand Up @@ -1725,6 +1725,52 @@ def test_allelic_alfree_multicomp(self, ifs):
_p = sp.run(ci, capture_output=True, text=True)
assert _p.returncode == 0
assert parseSpOut(_p) == 340
def test_whatshap_allelic(self, ifs):
ci = [
"mRNAseq",
'-i',
ifs / 'PE',
'-o',
ifs / 'outdir',
'--snakemakeOptions',
SMKOPTS,
ifs / 'org.yaml',
'--sampleSheet',
ifs / 'sampleSheet.tsv',
'-m',
'allelic-whatshap,deepTools_qc',
'--phased-vcf',
ifs / 'allelic_input' / 'file.vcf.gz'
]
print(' '.join([str(i) for i in ci]))
_p = sp.run(ci, capture_output=True, text=True)
assert _p.returncode == 0
assert parseSpOut(_p) == 208
def test_whatshap_allelic_fromBAM(self, ifs):
ci = [
"mRNAseq",
'-i',
ifs / 'allelic_bam_input' / 'filtered_bam',
'-o',
ifs / 'outdir',
'--snakemakeOptions',
SMKOPTS,
'--fromBAM',
'--bamExt',
'.filtered.bam',
ifs / 'org.yaml',
'--sampleSheet',
ifs / 'sampleSheet.tsv',
'-m',
'allelic-whatshap,deepTools_qc',
'--phased-vcf',
ifs / 'allelic_input' / 'file.vcf.gz'
]
print(' '.join([str(i) for i in ci]))
_p = sp.run(ci, capture_output=True, text=True)
assert _p.returncode == 0
assert parseSpOut(_p) == 127


class TestncRNAseq():
def test_default(self, ifs):
Expand Down

0 comments on commit 8d1132a

Please sign in to comment.