Clustering long-read 18S amplicons #21

Open
pjmramond opened this issue Jan 18, 2024 · 3 comments

Comments

@pjmramond

Hello there
Thanks a lot already for the work on this package!

I am trying to cluster 34,937,058 sequences of about 1,000 bp (18S amplicons) contained in a single FASTA file. I'm using the following command on an HPC cluster:

meshclust \
  -d /export/lv6/projects/NIOZ320/Analysis/3.1_Ecological_Analysis/18S_NIOZ320_NIOZ326.fa \
  -o /export/lv6/projects/NIOZ320/Analysis/3.1_Ecological_Analysis/consensus_95/18S_NIOZ320_NIOZ326_cl_0.95.txt \
  -t 0.95 \
  -b 45000 \
  -v 180000

The code has been running for 125 days and was about to finish its 4th run, which I thought would be the last, but a 5th clustering run of the data has started (see screenshot). This last run has indicated from the beginning that there are "0 unprocessed sequences", and the number of centers found has been stagnating around 47,900 for quite some time.

I understand that this is a lot of data and that the error rate of Oxford Nanopore reads probably adds complexity for the clustering algorithm. The amplicons have nevertheless been quality filtered and represent consensuses of several amplicons (pre-clustered based on Unique Molecular Identifiers). A previous MeShClust run with a similar approach but on 16S data took ~80 days to cluster 33,306,880 amplicons and found 55,715 centers.

My questions are:

  1. Am I doing something wrong here? Can MeShClust support such a computation? ("swarm -d 3" ran faster but clustered only 500K reads).

  2. Is there a way to stop the run at this stage and get the current output (centers and their composition)? Is there a way to predict how many runs it will take MeShClust to produce an output?

Any help would be highly appreciated!
Best
Pierre

[Screenshot, 2024-01-18 16:12:14]

[Screenshot, 2024-01-18 16:30:44]

[Screenshot, 2024-01-18 16:30:59]

@pjmramond
Author

and now starting run 6...

@hani-girgis
Member

Hi, Pierre.

Thanks for your interest in MeShClust v3.0.

No, you are not doing anything wrong. MeShClust v3 should take longer than MeShClust v1 because of the all-vs-all comparisons done at the beginning and whenever enough sequences have accumulated in the reservoir.

The current version does NOT log the results (this is a good feature to include in the next release God willing).

The -p parameter controls the number of data passes (default: 10). The algorithm may converge before the 10th iteration if the number of clusters does not change during a data pass. The good news is that the algorithm should run faster in late iterations than in early ones because it may not need to do as many of the all-vs-all blocks.

If you would like to speed up the algorithm in the future, you may want to reduce the size of the all-vs-all block (-b) and increase the size of the batch (-v).

Please keep me posted.

Hani Z. Girgis, PhD
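
The tuning advice in the reply above (a smaller all-vs-all block via -b, a larger batch via -v, and a cap on data passes via -p) could be tried along the following lines. This is only a dry-run sketch that prints a command line rather than running it. The file names and every numeric value are hypothetical placeholders chosen for illustration, not tested or recommended settings, and whether a lower -p gives acceptable clusters for this data set is an assumption that would need checking.

```shell
#!/bin/sh
# Dry-run sketch: prints a meshclust command line instead of executing it.
# All file names and numeric values below are illustrative assumptions.

INPUT="18S_amplicons.fa"        # hypothetical input FASTA (-d)
OUTPUT="18S_clusters_0.95.txt"  # hypothetical output file (-o)
THRESHOLD=0.95                  # -t: same threshold as the original run
BLOCK=20000                     # -b: all-vs-all block size, reduced from 45000
BATCH=360000                    # -v: batch size, increased from 180000
PASSES=5                        # -p: upper bound on data passes (default per the thread: 10)

CMD="meshclust -d $INPUT -o $OUTPUT -t $THRESHOLD -b $BLOCK -v $BATCH -p $PASSES"

# Print the command for inspection; run it manually once the values look sensible.
echo "$CMD"
```

Because -p bounds the number of data passes, it also gives an upper limit on how many runs a job can go through, although the run may stop earlier if the number of clusters stops changing.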

@hani-girgis
Member

Hello again, Pierre.

I am working on the next version of MeShClust. It would be very helpful to use your data while developing and testing it. Are the 16S or the 18S data already published? If so, where can I download the data set(s)? Feel free to email me at hzgirgis at buffalo dot edu.

Best regards.

Hani Z. Girgis, PhD
