Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrepancy in gene ID mapping on same positions in different samples #195

Open
Sidduppal opened this issue Oct 16, 2024 · 8 comments
Open
Labels
bug Something isn't working

Comments

@Sidduppal
Copy link

Hi,

First of all, thank you for developing this amazing tool!

I have encountered a discrepancy in the inStrain profile results regarding gene mapping at the same positions between different samples. Specifically:

  • In sample SLG529, no gene is mapped at position 331 on scaffold SLG1007_DASTool_bins_61.fa_k141_12596.
  • In sample SLG1132, the gene SLG1007_DASTool_bins_61.fa_k141_12596_2 is mapped at the same position (331) on the same scaffold (SLG1007_DASTool_bins_61.fa_k141_12596).

Additionally, in sample SLG529, the same gene (SLG1007_DASTool_bins_61.fa_k141_12596_2) is mapped at positions immediately before and after position 331 (positions 301 and 332), but not at position 331.

I ran inStrain profile using the same representative MAGs but with reads from different samples.

This is just one example, but these discrepancies were seen at many other sites as well. I'm unsure whether I'm missing something in my analysis or if this is a potential bug. Any clarification would be greatly appreciated.

  • inStrain version 1.8.0
  • Command used for sample SLG529 : inStrain profile SLG529.bam representative_megamag.fa -o SLG529.IS -p 64 -g megamag_representative.fna -s scaffoldtobin.stb --database_mode
  • Command used for sample SLG1132 : inStrain profile SLG1132.bam representative_megamag.fa -o SLG1132.IS -p 64 -g megamag_representative.fna -s scaffoldtobin.stb --database_mode

I've attached the relevant dataset for your reference.

Thank you!

SLG529.txt
SLG1132.txt

@MrOlm
Copy link
Owner

MrOlm commented Nov 15, 2024

Hello

Thanks for your question and for using inStrain! The discrepancy you’re seeing is likely due to differences in coverage or read alignment at position 331 between the samples. Low coverage in SLG529 might prevent the gene from being mapped at that position, even if it's mapped nearby.

Here’s what you can try:

  1. Check the coverage at position 331 in SLG529 using the .cov file.
  2. Confirm that the same -g and -s files were used for both samples.
  3. Inspect the alignments at position 331 in a genome viewer (e.g., IGV).

If the issue persists, feel free to share more details!

Best,
Matt

@MrOlm MrOlm closed this as completed Nov 15, 2024
@Sidduppal
Copy link
Author

Hey @MrOlm, thanks a lot for taking a look.

Unless I'm missing something the .cov file seems to give the coverage per gene and not for each position within a gene. I'm looking at genes_coverage.csv inside raw_datasubdirectory generated from the profile results. For the gene under consideration the coverage is 16.4.

Furthermore, it appears that the position_coverage at position 331 is equivalent if not higher to surrounding positions where the gene has been mapped correctly.

The same -g and -s files were used for both samples as well.

I also want to reiterate that this uneven mapping of genes was present at many other positions. I shared a subset of data but happy to share the full profile results over email, if you think that'll be helpful.

Cheers!

@MrOlm
Copy link
Owner

MrOlm commented Nov 19, 2024

Ah- I see the problem now. In the case of SLG529, the variant at position 331 has 3 alleles detected (allele_count > 2). In this case inStrain does not attempt to classify the mutation (S, N, etc.) because it doesn't know what allele to use, resulting in it not attempting to link the gene.

This isn't great behavior and I'll try and fix in the next update, but this is what's going on.

Best,
Matt

@MrOlm MrOlm reopened this Nov 19, 2024
@MrOlm MrOlm added the bug Something isn't working label Nov 19, 2024
@Sidduppal
Copy link
Author

Sidduppal commented Nov 19, 2024

Hey @MrOlm, I had a similar intuition however, after inspecting the data a bit more I found sites with a gene mapped and allele count more than 2. For example, positions 144 and 650 in the SLG1132.txt file attach above has allele count of 3 but also has genes mapped. Also, these positions seem be identify mutation_type as well, M and S respectively.

P.S. Issue was closed by mistake. I have re-oped it.

@Sidduppal Sidduppal reopened this Nov 19, 2024
@MrOlm
Copy link
Owner

MrOlm commented Nov 19, 2024

Mutation_type = M means that there are multiple genes at that position (https://instrain.readthedocs.io/en/latest/example_output.html)

In that sample, 650 has a gene mapped and 144 is M?

@Sidduppal
Copy link
Author

Both positions 144 and 650 have genes mapped. The mutation_type is M at position 144 and S at 650.
mutation_type is M at other positions where allele count is 2.

@MrOlm
Copy link
Owner

MrOlm commented Nov 20, 2024

InStrain thinks 144 has two genes mapped at that location; could that be the case?

650 seems to be working correctly. What is the problem with 650 in that sample?

@Sidduppal
Copy link
Author

So there is no problem per se with both of those sites. You mentioned above

In the case of SLG529, the variant at position 331 has 3 alleles detected (allele_count > 2). In this case inStrain does not attempt to classify the mutation (S, N, etc.) because it doesn't know what allele to use, resulting in it not attempting to link the gene.

I just pointed out that this is not always the case as there are sites where genes have been mapped and the allele count is greater than 2, eg site 144 and 650. Plus despite having more than 2 alleles inStrain has classified the mutation at these sites (under column mutation_type).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants