5% of BA.1.1 carrying the R346K mutations are classified as BA.1 #80

Open
wodanaz opened this issue Apr 1, 2022 · 4 comments


wodanaz commented Apr 1, 2022

Hi folks,

I posted this issue in pangolin and realized that it was the wrong place. Sorry about that.

In our genomic surveillance data, we have consistently found that about 5% of samples with high-quality sequencing seem to be misclassified as BA.1, even when using the most recent versions of pangolin and pangoLEARN. In addition, almost all of the samples misclassified as BA.1 have a phylogenetic backbone that corresponds to BA.1.1, and they have high-quality calls for the R346K mutation. Given the medical importance of that mutation (R346K), I hope this phenomenon can be reviewed.
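A minimal sketch of how such discordant samples could be flagged, assuming a pangolin lineage report in CSV form with `taxon` and `lineage` columns, and a set of sample names in which R346K was called by a separate variant-calling step. The function name `flag_discordant` and the sample names are hypothetical and used only for illustration.

```python
import csv
import io


def flag_discordant(report_csv, r346k_samples, suspect_lineage="BA.1"):
    """Return samples assigned exactly `suspect_lineage` despite carrying R346K.

    `report_csv` is any file-like object holding a CSV with `taxon` and
    `lineage` columns (assumed to match the pangolin lineage report).
    `r346k_samples` is a set of sample names with a confident R346K call.
    """
    reader = csv.DictReader(report_csv)
    return [
        row["taxon"]
        for row in reader
        if row["lineage"] == suspect_lineage and row["taxon"] in r346k_samples
    ]


# Hypothetical data, not taken from the surveillance set described above.
example_report = io.StringIO(
    "taxon,lineage\n"
    "sample_01,BA.1.1\n"
    "sample_02,BA.1\n"
    "sample_03,BA.1\n"
)
r346k_called = {"sample_01", "sample_02"}

discordant = flag_discordant(example_report, r346k_called)
print(discordant)
```

Here only `sample_02` is flagged: it carries R346K but was assigned BA.1, whereas `sample_03` was assigned BA.1 without an R346K call and is left alone.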

Example of read depth and quality in one sample

image

Phylogeny: an example of the tree placement of the samples classified as BA.1 (these samples are confirmed to have the R346K mutation)

image

Thank you so much, and keep up the great work.

Warmly,

Alejandro Berrio
Duke University


corneliusroemer commented Apr 1, 2022

It's a pangoLEARN issue, not due to wrong designations (which this repo [pango-designation, where it was originally posted] is for).

So I'll transfer it to pangoLEARN.

It's known that pangoLEARN can be wrong in ways that don't really make sense to humans - maybe the decision tree is overfitted.

The standard recommendation is to use UShER mode, which you can enable by appending --usher to your CLI run, like pangolin --usher input.fasta. UShER should have a much lower misclassification rate.
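For batch processing, the same invocation can be wrapped in a script. The following is a rough sketch, assuming pangolin is installed and on the PATH; only the --usher flag and the positional FASTA argument mentioned above are used, and the helper names `build_pangolin_cmd` and `run_pangolin` are hypothetical.

```python
import os
import shutil
import subprocess


def build_pangolin_cmd(fasta_path, use_usher=True):
    """Build the argument list for a pangolin run, optionally in UShER mode."""
    cmd = ["pangolin"]
    if use_usher:
        cmd.append("--usher")
    cmd.append(fasta_path)
    return cmd


def run_pangolin(fasta_path, use_usher=True):
    """Run pangolin if both the executable and the input file are available.

    Returns the CompletedProcess, or None when the run was skipped.
    """
    if shutil.which("pangolin") is None or not os.path.exists(fasta_path):
        return None
    return subprocess.run(build_pangolin_cmd(fasta_path, use_usher), check=True)


cmd = build_pangolin_cmd("input.fasta")
print(" ".join(cmd))
```

Building the command as a list rather than a shell string avoids quoting problems with unusual file names.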

You could also try out Nextclade's pango classifier. It should likewise have no problem classifying these sequences as BA.1.1.X.

Soon, pangolin v4 will be released, which uses UShER mode by default, so you won't even have to append --usher anymore.

I hope this helps.

@corneliusroemer corneliusroemer transferred this issue from cov-lineages/pango-designation Apr 1, 2022

wodanaz commented Apr 1, 2022

Fantastic, thank you!

aineniamh (Member) commented

I believe adding more representatives into the designations will resolve the pangoLEARN model error; that's how pangoLEARN issues are usually resolved (the decision tree is definitely overfit, and the more informative training data it gets, the better it does). This is why I suggested porting the issue over to designation. pangolin 4.0 has a new model, a random forest, which should be less overfit and give more interpretable confidence scores too. Hopefully the UShER mode will resolve all of this anyway, as pangolin 4.0 has just been released.

FYI @corneliusroemer, just for future reference: it's not an issue for the pangolin repo, as it's not related to the pangolin software. I appreciate your great explanation above though!

@wodanaz if you still see issues with the assignments using the latest pangolin version, modes, and models, please let me know and I'll get to the bottom of it! If there's something particularly tricky going on, we might need to add a constellation definition.

corneliusroemer commented

@aineniamh you're right, I reworded slightly. It was originally in pango-designation but it belongs only here in pangoLEARN, not pangolin.
