-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hg19 and hg38 analysis from same workflow branch #8
base: master
Are you sure you want to change the base?
Conversation
Conflicts: resources/analysisTools/snvPipeline/confidenceAnnotation_SNVs.py resources/analysisTools/snvPipeline/filter_PEoverlap.py resources/analysisTools/snvPipeline/snvAnnotation.sh resources/configurationFiles/analysisSNVCalling.xml
resources/analysisTools/snvPipeline/intermutationDistance_Coord_color.r
Outdated
Show resolved
Hide resolved
resources/analysisTools/snvPipeline/environments/tbi-lsf-cluster.sh
Outdated
Show resolved
Hide resolved
Conflicts: README.md resources/analysisTools/snvPipeline/environments/tbi-lsf-cluster.sh resources/analysisTools/snvPipeline/filter_PEoverlap.py resources/configurationFiles/analysisSNVCalling.xml
@@ -251,6 +251,10 @@ def parseVcf(file,num): | |||
while (l!= ""): | |||
t=l.split('\t') | |||
if (t[0][0] != "#") and isValid(t): | |||
# Skipping the non-primary assembly variants from purity calculations | |||
if t[0].startswith('HLA') or t[0].endswith('_alt'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t
-> fields
l
-> line
BTW (2 lines up): line[0] != "#"
is much clearer than fields[0][0] != "#"
(not to speak of t[0][0]
).
# Skipping the non-primary assembly variants from purity calculations | ||
if t[0].startswith('HLA') or t[0].endswith('_alt'): | ||
l=vcf.readline() | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is i
(next line)? Maybe mapped_chromosome
?
@@ -199,6 +199,10 @@ def calculateErrorMatrix(vcfFilename, referenceFilename, errorType): | |||
# 23.05.2016 JB: Excluded multiallelic SNVs | |||
if ',' in split_line[header.index("ALT")]: continue | |||
|
|||
# 21.02.2023 NP: Excluded SNVs with 'N' before or after "," in context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a short word about why you exclude these? I guess it is not obvious because it was added to the SNV workflow only after years of operation.
@@ -1,3 +1,483 @@ | |||
<<<<<<< HEAD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge error.
# Reference file for CRAM files | ||
reference_file = args.refFileName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICS reference_file
is only used once, in the CRAM branch of the if-else block below. Moving it down will make it clearer
# Reference file for CRAM files | |
reference_file = args.refFileName |
samfile = pysam.Samfile(args.alignmentFile, mode) | ||
elif args.alignmentFile.split(".")[-1] == "cram": | ||
mode += "c" | ||
samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = reference_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = reference_file) | |
# CRAM needs a reference file. | |
samfile = pysam.Samfile(args.alignmentFile, mode, reference_filename = args.refFileName) |
<cvalue name='CHROMOSOME_LENGTH_FILE' value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLength.tsv' type="path" /> | ||
|
||
<cvalue name="CHROMOSOME_INDICES" value="( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y HLA ALT )" type="bashArray" description="Chr indices for the calling"/> | ||
<cvalue name="CHROMOSOME_HLA_CONTIGS_FILE" value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.HLA_contigs.bed' type="path" description="HLA contig list"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a description
field with a short description of the file content and format? If I saw that correctly, it should be suited for samtools mpileup -R $file
, right? That would already be a valuable information. Same for the ALT file.
$ccoord = $ctrl[1]; | ||
if ($tcoord == $ccoord) # matching pair found! | ||
if ($tcoord == $ccoord && $current_t_chr eq $current_c_chr) # matching pair found! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I see this correctly, the algorithm takes the two position-sorted VCFs and steps forward the feature -- either tumor or control VCF -- that is expected "earlier" in the algorithm, until it finds a match. Right?
I think, here and elsewhere, it would make more sense and would make the code clearer, if the chromosome comparison came before the coordinate comparison, i.e.
if ($tcoord == $ccoord && $current_t_chr eq $current_c_chr) # matching pair found! | |
if ($current_t_chr eq $current_c_chr && $tcoord == $ccoord) # matching pair found! |
@@ -48,6 +48,7 @@ dat$chromosome <- factor(dat$chromosome, levels = paste0("chr",c(seq(1,22),"X"," | |||
|
|||
|
|||
chromLength = read.table(file = opt$chromLengthFile, header = F) | |||
chromLength$V1 = gsub("chr", "", chromLength$V1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good practice to define a header during loading of the table. At least it would add a bit of information about the structure of the table.
} | ||
else{ | ||
$all++; | ||
@help = split ("\t", $line); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also rename this to @line
. I think, it also has advantages if two variables that differ only by the type, but not by the content (in principle) are called the same. Some for @head
above.
<cvalue name="CHROM_SIZES_FILE" value="${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLenOnlyACGT_realChromosomes.tsv" type="path" /> | ||
<cvalue name='CHROMOSOME_LENGTH_FILE' value='${hg38BaseDirectory}/stats/GRCh38_decoy_ebv_alt_hla_phiX.fa.chrLength.tsv' type="path" /> | ||
|
||
<cvalue name="CHROMOSOME_INDICES" value="( 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y HLA ALT )" type="bashArray" description="Chr indices for the calling"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add CHROMOSOME_INDICES_PLOTTING
with a description.
* 3.0.0 | ||
|
||
* Major | ||
* Support for hg38/GRCh38 reference genome and variant calling from ALT and HLA contigs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any new mandatory configuration values? At least list them. Details should be in the XML.
Was the semantics of conf. values changed?
* Major | ||
* Support for hg38/GRCh38 reference genome and variant calling from ALT and HLA contigs. | ||
* Minor | ||
* For hg38: Removing mappability and repeat elements' annotations from penalty calculations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any new optional configuration values? (at least list them)
* skipREMAP: Option to remove repeat elements and mappability from confidence annotations in hg19. | ||
* Removing EVS And ExAC AF from the annotations and no-control workflow filtering | ||
* Support for variant calling from CRAM files | ||
* Bug fix: Removing "quote" around the raw filter option `<RAW_SNV_FILTER_OPTIONS>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. But this is a technical/code description. What was the effect of the bug to the user? (probably that multiple RAW_SNV_FILTER_OPTIONS
could not be used).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you try limiting the (heap) memory consumption of this Python process. See here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is passing over the VCF once. Not sure how much memory it uses, but maybe it is worthwhile limiting the memory with this approach.
Merging hg38 developments into the main (master) branch.
Main updates
Integrated testing