
Identity on very large data #7

Open
davidmaimoun opened this issue Jul 10, 2022 · 10 comments

Comments

@davidmaimoun

Hi,

Thanks for the tool!

Can I use it on very large data, like 300K-400K genomes (reads & assemblies)?

Best,

@hani-girgis
Member

Hello.

How long are these genomes? And how many reads are there?

Best,

Hani Z. Girgis, PhD

@davidmaimoun
Author

Approximately 5 Mb each, and I'm thinking of focusing on the assemblies first, i.e., if I have ~400k FASTA files, would it be possible to use Identity on them?

Thank you

@hani-girgis
Member

I believe so. How much RAM do you have available? The -b and -v parameters will help you control the memory consumption. I'd start with -b of 1000 and -v of 1000. Please keep me posted.
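
For example, an invocation along these lines (just an illustrative sketch; the input and output file names and the 0.9 threshold are placeholders):

identity -d assemblies.fasta -o assemblies_identity.txt -t 0.9 -b 1000 -v 1000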

Best regards.

@davidmaimoun
Author

Thank you very much for your help.

I'll let you know when I get the results.

Kind Regards

@LauraVP1994

I also have quite a large dataset, and it has already been running for a week. I was wondering whether there are ways to speed this up (I have already reduced the number of sequences as much as possible)? Also, it would be great to be able to see how much has been done and how much is still left to do...

@hani-girgis
Member

Hi.

Would you please provide some information about your dataset? How many sequences? What is the average length? What parameters are you using to run Identity?

Best regards.

@LauraVP1994

LauraVP1994 commented Jun 9, 2023

I have multiple datasets on which I would like to use it, since I'm using this tool to select sequences and shrink down my dataset. For example, I have a dataset of 553,123 Campylobacter coli sequences whose lengths range from 20,340 to 1,822,675 bp.

I'm currently using this command:
srun --cpus-per-task 40 --mem=100G identity -d Campylobacter.coli_concatenated_filter.fasta -o Campylobacter.coli_identity.txt -t 0.9

@hani-girgis
Member

Hello there.

I would divide the sequences into groups based on length. Plotting the length distribution will help with finding the boundaries. Then I'd cluster each group separately. If you would like me to take a look at the plot, feel free to email it to me at hzgirgis at buffalo dot edu. Please keep me posted.
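
As a rough sketch (assuming Biopython and matplotlib are available; the input file name is a placeholder), the length distribution can be plotted with something like:

from Bio import SeqIO
import matplotlib.pyplot as plt

# Collect the length of every sequence in the FASTA file.
lengths = [len(rec.seq) for rec in SeqIO.parse("sequences.fasta", "fasta")]

# A histogram of the lengths helps to spot natural group boundaries.
plt.hist(lengths, bins=100)
plt.xlabel("Sequence length (bp)")
plt.ylabel("Number of sequences")
plt.savefig("length_distribution.png")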

Best regards.

@LauraVP1994

And ideally, how big should these groups be (in number of sequences and in range of lengths)?

@hani-girgis
Member

hani-girgis commented Jun 16, 2023

You should use a large number of short sequences and a small number of long sequences. A plot of the length distribution would help with finding the cut-offs.
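
As a rough sketch (the cut-offs and file names are placeholders, and Biopython is assumed), once the cut-offs are chosen, the sequences can be written to one FASTA file per group and each file clustered with Identity separately:

from Bio import SeqIO

# Example cut-offs taken from the length-distribution plot (placeholders).
cutoffs = [100000, 500000]

# One list of records per group: below the first cut-off, between the
# cut-offs, and above the last cut-off.
groups = [[] for _ in range(len(cutoffs) + 1)]
for rec in SeqIO.parse("sequences.fasta", "fasta"):
    index = sum(len(rec.seq) > c for c in cutoffs)
    groups[index].append(rec)

# Write each group to its own FASTA file.
for i, records in enumerate(groups):
    SeqIO.write(records, "group_%d.fasta" % i, "fasta")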
