5% of BA.1.1 carrying the R346K mutations are classified as BA.1 #80

wodanaz · 2022-04-01T12:29:33Z

Hi folks,

I posted this issue in pangolin and realized that it was the wrong place. Sorry about that.

In our genomic surveillance data, we have consistently found that about 5% of samples with high-quality sequencing appear to be misclassified as BA.1, even when using the most recent versions of pangolin and pangoLEARN. Almost all of these samples have a phylogenetic backbone that corresponds to BA.1.1, and they have high-quality calls for the R346K mutation. Given the medical importance of that mutation (R346K), I hope this phenomenon can be reviewed.
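A check like the one described above could be sketched roughly as follows. This is only an illustration, not the pipeline used for these samples: it assumes a pangolin-style lineage report with `taxon` and `lineage` columns and a tab-separated mutation table with `seqName` and `aaSubstitutions` columns (modeled on Nextclade's output). The function name and the toy data are hypothetical.

```python
import csv
import io


def flag_suspect_ba1(lineage_csv, mutation_tsv, marker="S:R346K"):
    """Return names of samples assigned exactly BA.1 that carry `marker`.

    Assumed (hypothetical) inputs:
      lineage_csv  - comma-separated, columns `taxon` and `lineage`
      mutation_tsv - tab-separated, columns `seqName` and `aaSubstitutions`
                     (comma-separated substitutions such as "S:R346K")
    """
    lineages = {row["taxon"]: row["lineage"] for row in csv.DictReader(lineage_csv)}
    suspects = []
    for row in csv.DictReader(mutation_tsv, delimiter="\t"):
        substitutions = {s.strip() for s in row["aaSubstitutions"].split(",")}
        if marker in substitutions and lineages.get(row["seqName"]) == "BA.1":
            suspects.append(row["seqName"])
    return suspects


# Toy data for illustration only; not taken from the surveillance set.
lineage_report = io.StringIO(
    "taxon,lineage\n"
    "sample1,BA.1\n"
    "sample2,BA.1.1\n"
    "sample3,BA.1\n"
)
mutation_table = io.StringIO(
    "seqName\taaSubstitutions\n"
    "sample1\tS:G339D,S:R346K\n"
    "sample2\tS:G339D,S:R346K\n"
    "sample3\tS:G339D\n"
)

suspects = flag_suspect_ba1(lineage_report, mutation_table)
print(suspects)
```

Here only the sample that is called BA.1 while carrying S:R346K is flagged; a BA.1 call without the mutation, or a BA.1.1 call with it, is left alone.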

Example of read depth and quality in one sample

and Phylogeny:

and an example of the tree placement of the BA.1 samples (these samples are confirmed to have the R346K mutation)

Thank you so much, and keep up the great work.

Warmly,

Alejandro Berrio
Duke University

corneliusroemer · 2022-04-01T12:46:15Z

It's a pangoLEARN issue, not due to wrong designations (which this repo [pango-designation, where it was originally posted] is for).

So I'll transfer it to pangoLEARN.

It's known that pangoLEARN can be wrong in ways that don't really make sense to humans - maybe the decision tree is overfitted.

The standard recommendation is to use Usher mode, which you can enable by appending --usher to your CLI run, like pangolin --usher input.fasta. Usher should have a much lower misclassification rate.

You could also try out Nextclade's pango classifier. It should likewise have no problem classifying these sequences as BA.1.1.X.
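To see how often the two modes disagree on a given batch, one could compare the lineage reports from a default run and a --usher run. The sketch below assumes both reports are CSV files with `taxon` and `lineage` columns; the function name and the toy reports are hypothetical and stand in for real pangolin output.

```python
import csv
import io


def lineage_disagreements(report_a, report_b):
    """Map each shared taxon to (lineage_a, lineage_b) where the two reports differ."""
    a = {row["taxon"]: row["lineage"] for row in csv.DictReader(report_a)}
    b = {row["taxon"]: row["lineage"] for row in csv.DictReader(report_b)}
    shared = sorted(a.keys() & b.keys())
    return {taxon: (a[taxon], b[taxon]) for taxon in shared if a[taxon] != b[taxon]}


# Toy reports for illustration only; real ones would be read from the
# CSV files written by the two pangolin runs.
default_report = io.StringIO("taxon,lineage\nsample1,BA.1\nsample2,BA.1.1\n")
usher_report = io.StringIO("taxon,lineage\nsample1,BA.1.1\nsample2,BA.1.1\n")

diffs = lineage_disagreements(default_report, usher_report)
for taxon, (default_call, usher_call) in diffs.items():
    print(f"{taxon}: default={default_call} usher={usher_call}")
```

Samples present in only one of the two reports are ignored, so the comparison still works if one run dropped sequences that failed QC.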

Soon, pangolin v4 will be released which uses Usher mode by default, so you won't even have to append --usher anymore.

I hope this helps.

wodanaz · 2022-04-01T12:48:55Z

Fantastic, thank you!

aineniamh · 2022-04-01T13:20:55Z

I believe adding more representatives to the designations will resolve the pangoLEARN model error; that's how pangoLEARN issues are usually resolved. The decision tree is definitely over-fit, and the more informative training data it gets, the better it does. This is why I suggested porting the issue over to designation. pangolin 4.0 has a new model, a random forest, which should be less overfit and give more interpretable confidence scores too. Hopefully the usher mode will resolve all of this anyway, as pangolin 4.0 has just been released.

FYI @corneliusroemer, just for future reference: it's not an issue for the pangolin repo, as it's not related to the pangolin software. I appreciate your great explanation above though!

@wodanaz if you still see issues with the assignments with the latest pangolin version, modes and models, please let me know and I'll get to the bottom of it! If there's something particularly tricksy going on, we might need to add a constellation definition.

corneliusroemer · 2022-04-01T13:49:30Z

@aineniamh you're right, I reworded slightly. It was originally in pango-designation but it belongs only here in pangoLEARN, not pangolin.

corneliusroemer transferred this issue from cov-lineages/pango-designation Apr 1, 2022
