
Identity on very large data #7

Open
davidmaimoun opened this issue Jul 10, 2022 · 10 comments

Comments

@davidmaimoun

Hi,

Thanks for the tool!

Can I use it on very large data, like 300K-400K genomes (reads & assemblies)?

Best,

@hani-girgis
Member

Hello.

How long are these genomes? And how many reads are there?

Best,

Hani Z. Girgis, PhD

@davidmaimoun
Author

Approximately 5 Mb each, and I'm thinking of focusing on the assemblies first, i.e., if I have ~400k FASTA files, would it be possible to use Identity on them?

Thank you

@hani-girgis
Member

I believe so. How much RAM do you have available? The -b and -v parameters will help you control the memory consumption. I'd start with -b of 1000 and -v of 1000. Please keep me posted.
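
For example, an invocation along these lines (just an illustrative sketch; the input and output file names and the 0.9 threshold are placeholders):

identity -d assemblies.fasta -o assemblies_identity.txt -t 0.9 -b 1000 -v 1000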

Best regards.

@davidmaimoun
Author

Thank you very much for your help.

I'll let you know when I get the results.

Kind Regards

@LauraVP1994

I also have quite a large dataset, and it has already been running for a week. I was wondering whether there are ways to speed this up (I have already reduced the number of sequences as much as possible)? Also, it would be great to be able to see how much has been done and how much is still left to do...

@hani-girgis
Member

Hi.

Would you please provide some information about your dataset? How many sequences? What is the average length? What parameters are you using to run Identity?

Best regards.

@LauraVP1994

LauraVP1994 commented Jun 9, 2023

I have multiple datasets on which I would like to use it, since I'm using this tool to select sequences and shrink down my dataset. For example, I have a dataset of 553,123 Campylobacter coli sequences whose lengths range from 20,340 to 1,822,675 bp.

I'm currently using this command:
srun --cpus-per-task 40 --mem=100G identity -d Campylobacter.coli_concatenated_filter.fasta -o Campylobacter.coli_identity.txt -t 0.9

@hani-girgis
Member

Hello there.

I would divide the sequences into groups based on length. Plotting the length distribution will help with finding the boundaries. Then I'd cluster each group separately. If you would like me to take a look at the plot, feel free to email it to me at hzgirgis at buffalo dot edu. Please keep me posted.
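
As a rough sketch (assuming Biopython and matplotlib are available; the input file name is a placeholder), the length distribution can be plotted with something like:

from Bio import SeqIO
import matplotlib.pyplot as plt

# Collect the length of every sequence in the FASTA file.
lengths = [len(rec.seq) for rec in SeqIO.parse("sequences.fasta", "fasta")]

# A histogram of the lengths helps to spot natural group boundaries.
plt.hist(lengths, bins=100)
plt.xlabel("Sequence length (bp)")
plt.ylabel("Number of sequences")
plt.savefig("length_distribution.png")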

Best regards.

@LauraVP1994

And ideally, how big should these groups be (in number of sequences and in range of lengths)?

@hani-girgis
Member

hani-girgis commented Jun 16, 2023

You should use a large number of short sequences and a small number of long sequences. A plot of the length distribution would help with finding the cut-offs.
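
As a rough sketch (the cut-offs and file names are placeholders, and Biopython is assumed), once the cut-offs are chosen, the sequences can be written to one FASTA file per group and each file clustered with Identity separately:

from Bio import SeqIO

# Example cut-offs taken from the length-distribution plot (placeholders).
cutoffs = [100000, 500000]

# One list of records per group: below the first cut-off, between the
# cut-offs, and above the last cut-off.
groups = [[] for _ in range(len(cutoffs) + 1)]
for rec in SeqIO.parse("sequences.fasta", "fasta"):
    index = sum(len(rec.seq) > c for c in cutoffs)
    groups[index].append(rec)

# Write each group to its own FASTA file.
for i, records in enumerate(groups):
    SeqIO.write(records, "group_%d.fasta" % i, "fasta")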
