
Large data set crashing #4

Open
jpalmer37 opened this issue Jul 10, 2019 · 6 comments
@jpalmer37

Hi there,

I've been using Historian to reconstruct the ancestral sequences within several different sequence data sets. My largest data set, which contains roughly 1000 sequences, has been unable to complete a run with Historian. The output at a high verbosity setting is shown below; it simply states "Killed" when the program crashes.

[Screenshot from 2019-07-09 13-05-30]

I was first wondering whether this is expected when handling a large data set like this one. Is Historian able to handle sequence sets of this size?

If this failure isn't expected, do you know of a way to retrieve more information about the problem, or of adjustments I could try?

Thanks in advance,

John

@ihh
Member

ihh commented Jul 11, 2019

Hi John,

Sorry to hear you've had this issue. If you're willing to share your data file, I'd like to try and replicate it.

A few points:

  • From the info you've given (including the 1000 sequences and the cryptic "Killed" message), this sounds like an out-of-memory error.
  • If you can't share the input file, can you be more specific about its nature? (Are the sequences unaligned, or are you supplying a guide alignment? How long are they?)
  • Historian should be able to deal with large datasets, but you may need to limit its memory usage. The default options make some attempt to do this, but there is a slightly nontrivial interplay between the sequence length, number of sequences, diversity of sequences, and size of the ancestral sequence profiles, which may mean that memory usage needs to be fine-tuned.
  • There are a few options to constrain Historian's memory usage, specifically the amount of memory it allocates to profiles of ancestral sequences; for example, the `-profmaxmem` option may be useful. Running `historian -h` will list all options (see under "Reconstruction algorithm options" in the help text).
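To make that concrete, a memory-capped run might look something like the following (a sketch only: the filenames are placeholders, the "2G" value is arbitrary, and the exact argument format expected by `-profmaxmem` should be checked against `historian -h`):

```shell
# Hypothetical memory-capped run; filenames and the "2G" cap are placeholders.
# Check `historian -h` for the exact -profmaxmem argument format.
historian -vvv -guide seqs.fasta -tree seqs.tree -ancseq \
  -profmaxmem 2G -output fasta > recon.fasta
```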

@jpalmer37
Author

Hi Dr. Holmes,

It's interesting to hear what you mentioned about the memory usage. I will definitely take a look at the -profmaxmem option in the manual.

My data is publicly available online, so I'd certainly be willing to share my data with you. I use both a guide alignment and guide tree as input when running Historian, so I'll send both of those files to you. Would you prefer to receive them over email?

Thanks for the quick response!

@ihh
Member

ihh commented Jul 11, 2019

Hi John, great, thanks! You can attach the file here or send it by email, whichever is easiest. It may take me a few days to get around to debugging, but I will try to prioritize it.

One option, if you are using a guide alignment, is to constrain the reconstruction to be very close to that guide alignment. The `-band` option specifies the width of the band around the guide alignment that Historian will use. By default it is 40, but if you set it to e.g. `-band 5`, it should go much faster and use less memory (though it will obviously be more dependent on the accuracy of the guide alignment).
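For example, starting from a run that already supplies a guide alignment and tree, narrowing the band would look something like this (a sketch; the filenames are placeholders):

```shell
# Constrain the reconstruction to a band of width 5 around the guide
# alignment (the default width is 40); filenames are placeholders.
historian -vvv -guide seqs.fasta -tree seqs.tree -ancseq \
  -band 5 -output fasta > recon.fasta
```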

@jpalmer37
Author

Much appreciated, Dr. Holmes! This issue isn't urgent, so please take the time you need. I've included a second data set which is essentially identical and appears to run into the same problem. Both contain roughly the same number of sequences (~1000), but the one labeled 101034 contains greater sequence diversity.

historian_data.zip

And thank you for the suggestion. I read about the `-band` option previously but haven't experimented with it. Good to know that it might be useful for reducing memory consumption. Thanks again!

@ihh
Member

ihh commented Jul 18, 2019

Running this in the background on my laptop now. I do see some hefty memory usage. Could you supply the exact command line that led to the crash, and also details of your machine (most importantly memory, but also OS, CPU, etc.)?

@jpalmer37
Author

Certainly.

This is the original command I used to run Historian on both machines:

```shell
historian -vvv -guide ~/4MSA/111848.fasta -tree ~/7_MCC/rescaled/111848.tree -ancseq -output fasta > 111848_recon.fasta
```

This was the first machine where I detected the crash (but could not see the error messages):

CPU: 2 × Intel Xeon E5-2690v4 2.6 GHz (14-core/28-thread)
OS: CentOS 7.3
RAM: 56 GB

This is the machine where I performed a single test run to read the output of the crash:

CPU: AMD Ryzen Threadripper 1950X 4.0 GHz (16-core/32-thread)
OS: Ubuntu 18.04.2 LTS (Bionic Beaver)
RAM: 32 GB
