Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BamParser.Parse() returns null objects in rare instances #41

Open
RoboStephen opened this issue Dec 18, 2019 · 2 comments
Open

BamParser.Parse() returns null objects in rare instances #41

RoboStephen opened this issue Dec 18, 2019 · 2 comments

Comments

@RoboStephen
Copy link

Today I used dotnetbio to parse a bam file. In rare instances (for a total of 3 reads out of 1 million), BamParser.Parse() returned a null object rather than a SAMAlignedSequence object.

I stepped through in the debugger and I think these lines, in BamParser.cs, are the cause of my issue:
if (alignedSeq.RefEndPos - 1 < start && alignedSeq.RName!="*")
{
return null;
}

Here I'm reading the whole bam, so start == 0. This is a read which is unmapped, but it still has its Pos set to 1 and RName set because its mate was mapped with Pos==1. For the affected read, alignedSeq.RefEndPos is 0. By subtracting 1 from alignedSeq.RefEndPos we get -1, which is less than start == 0, so we return null.

This one-line change fixes the bug, and I believe it's correct in general - I confirmed the unit tests still pass:
if (alignedSeq.RefEndPos - 1 < start && alignedSeq.CIGAR != "*")
{
return null;
}

@evolvedmicrobe
Copy link
Member

Hi,

Thanks for commenting on this!

This strikes me as a bit tricky, I think the method you're looking at: GetAlignedSequence(int start, int end), is designed to return all queries that overlap a range, if the read is associated with a chromosome (e.g. RNAME!=*). For reads that are unmapped but associated with a chromosome for sorting purposes, I think there are circumstances both where one would, and would not, like to consider that read to overlap with the region queried. Do you happen to know what samtools view does for these reads if one specifies the start/end coordinates on the chromosome? I think when coordinates are specified the library should probably match that behavior.

When coordinates are not specified these reads should be returned, so that's a bug as you point out. Right now it looks like this case is indicated with a 0/MaxInt for start/end, perhaps with a different sentinel value to indicate all reads regardless of overlap should be returned (start = -1? ) we could get the correct behavior?. Right now if a read is unmapped (e.g. Bit 0x4) we shouldn't make any assumptions about RNAME/CIGAR/POS according the SAM spec, yet we'd want to return reads like this in all cases when we're parsing everything. Any idea if changing:

return GetAlignedSequence(0, int.MaxValue);

to
return GetAlignedSequence(-1, int.MaxValue); in some form is also compatible with the current API?

@RoboStephen
Copy link
Author

I ran a couple of samtools view tests. It looks like samtools treats the unmapped read as though it overlaps the first base of the chromosome and no others, when it comes to including or excluding it. That seems appropriate to me:

  • samtools view includes the read
  • samtools view $ContigName includes the read
  • samtools view $ContigName:1-$Length includes the read
  • samtools view $ContigName:1 includes the read
  • samtools view $ContigName:2-$Length excludes the read

I confirmed that if we change the start and end parameters to GetAlignedSequence to -1 and int.MaxValue (instead of 0 and int.MaxValue), that resolves this issue too. (There are several APIs which use those sentinel values; if we change one, we'd want to change them all for consistency)

Yet another option: If GetAlignedSequence accepted nullable ints, I think that would make the logic clearer than having special explicit sentinel values like 0 or -1 and int.MaxValue. GetAlignmentMapIterator already has one nullable-int argument (refSeq) so making start and end nullable seems natural.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants