Runtime for Large Dataset (1.3M seqs) #14

Open
e-trop opened this issue Mar 9, 2023 · 1 comment


@e-trop

e-trop commented Mar 9, 2023

Hi there,

Thanks for making this tool available and for the clean repo!

I am trying to run MeshClust on a set of 1.3 million sequences with lengths ranging from 75 bp to 6,000 bp. From the paper, I saw that you were able to run MeshClust on a microbiome dataset of ~1 million sequences in ~2 hrs with the hardware specified in the paper.

I've run MeshClust on my dataset with a calculated identity threshold. It's been running for 12 hrs, has only processed 160k sequences, and is still on the first data pass. I see that there are still many (~50k) seqs in the reservoir. My guess is that it is taking so long because the reservoir is continually being refilled and the initialization step for mean shift is then rerun each time.

I wanted to check whether you had any ideas why it's taking this long, or ways I could split the data for better runtime.

Kind regards,
Evan

@hani-girgis
Member

Hi, Evan.

The length range of the microbiome sequences in the paper is 171–372, which is more homogeneous than yours.

Yes, dividing your data set based on length would work. Then I would cluster each group separately.
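As a rough illustration of the length-based split, here is a minimal Python sketch. It is not part of MeshClust; the helper names (`read_fasta`, `split_by_length`) and the bin edges are assumptions made for this example, and suitable edges for a 75 bp to 6,000 bp data set would need to be chosen by inspecting the length distribution.

```python
from collections import defaultdict

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def split_by_length(records, edges):
    """Group records into length bins.

    `edges` is an ascending list of inclusive upper bounds (hypothetical
    values here). Sequences longer than the last edge go to "longer".
    """
    bins = defaultdict(list)
    for header, seq in records:
        # The first edge that is >= the sequence length names the bin.
        label = next((e for e in edges if len(seq) <= e), "longer")
        bins[label].append((header, seq))
    return dict(bins)

# Tiny in-memory example; a real run would iterate over an open FASTA file.
fasta = """>a
ACGT
>b
ACGTACGTAC
>c
ACGTACGTACGTACGTACGTACGT
""".splitlines()

groups = split_by_length(read_fasta(fasta), edges=[5, 12])
for label, members in groups.items():
    print(label, [h for h, _ in members])
```

Each group could then be written to its own FASTA file and clustered separately.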

After that, you may want to extract the centers, run Identity (all-vs-all) on them, and merge similar centers (those with identity scores greater than the threshold) by selecting one of them.
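The merge step could be sketched as below. This assumes the all-vs-all scores have already been parsed into `(center_a, center_b, identity)` tuples; the tuple layout, the function name `reduce_centers`, and the greedy "keep the first, drop anything similar to a kept center" rule are illustrative choices, not the Identity output format or a method prescribed here.

```python
from collections import defaultdict

def reduce_centers(center_ids, scored_pairs, threshold):
    """Greedily keep one center out of each set of mutually similar centers.

    center_ids   : centers in the order they should be considered
    scored_pairs : iterable of (center_a, center_b, identity) tuples
                   (assumed layout, parsed beforehand)
    threshold    : pairs scoring above this value count as similar
    """
    similar = defaultdict(set)
    for a, b, score in scored_pairs:
        if a != b and score > threshold:
            similar[a].add(b)
            similar[b].add(a)

    kept, kept_set = [], set()
    for cid in center_ids:
        # Keep a center only if no already-kept center is similar to it.
        if not (similar.get(cid, set()) & kept_set):
            kept.append(cid)
            kept_set.add(cid)
    return kept

centers = ["c1", "c2", "c3", "c4"]
pairs = [
    ("c1", "c2", 0.97),
    ("c2", "c3", 0.96),
    ("c1", "c3", 0.80),
    ("c3", "c4", 0.50),
]
reduced = reduce_centers(centers, pairs, threshold=0.95)
print(reduced)
```

Note that with this greedy rule the result depends on the order of `center_ids`; sorting centers by cluster size first is one possible choice.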

Finally, run Identity on the reduced center set against the entire data set and assign each sequence to its closest center.
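The final assignment amounts to picking, for each sequence, the center with the highest identity score. A minimal sketch, again assuming the scores are already available as `(sequence_id, center_id, identity)` tuples (a hypothetical layout, not the Identity output format):

```python
def assign_to_centers(scores):
    """Map each sequence to its highest-scoring center.

    scores : iterable of (sequence_id, center_id, identity) tuples
             (assumed layout, parsed beforehand)
    Returns {sequence_id: (center_id, identity)}; ties keep the first seen.
    """
    best = {}
    for seq_id, center_id, score in scores:
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (center_id, score)
    return best

scores = [
    ("s1", "c1", 0.98),
    ("s1", "c3", 0.70),
    ("s2", "c3", 0.91),
    ("s2", "c1", 0.91),
]
assignment = assign_to_centers(scores)
for seq_id, (center_id, score) in assignment.items():
    print(seq_id, center_id, score)
```

Because it works on a stream of tuples, the same function can be fed scores read line by line, so the full score table does not need to fit in memory.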

This is a workaround for now, but this process can be automated in future releases.

Let me know if you have additional questions.

Best regards.

Hani
