Process hangs on certain inputs #15

Open
gfetterman opened this issue Apr 9, 2019 · 7 comments

Comments

@gfetterman

When we run ms4alg.sort with certain combinations of (a) file length, (b) sorting parameters, and (c) number of threads, the process hangs.

The hang reliably occurs right after the "Reassigning events" phase: the last status line in the terminal reads Reassigning events for channel [channel_number] (phase 1). From that point the process never advances (we've let it sit for more than 24 hours with no change; the usual run time on "good" inputs is about 4-5 hours). The number of workers visible in top stays constant, as does memory usage, but the workers use no CPU time.

We can conclude three things:

  • file length is a factor: ~1hr-long files complete just fine. Longer files (3-4 hours) hang.
  • parameter choice is a factor: {adjacency_radius: 100, detect_threshold: 2, detect_interval: 5} will complete without a problem. {adjacency_radius: 150, detect_threshold: 1, detect_interval: 5} will hang.
  • number of threads is a factor: with num_workers:1, the process completes. With num_workers:12, it hangs. (We're running this on a 12-core machine.)

(NB: we've also exchanged a couple of emails on this issue with @tjd2002 .)
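
Since the stalled workers use no CPU time, the run is probably blocked waiting on something rather than computing. Below is a minimal sketch of one way to find out where, using Python's standard faulthandler module. It wraps any callable and, if the call is still running after a timeout, writes the stack of every thread in the calling process to a log file. The names run_with_watchdog, log_path, and timeout_s are hypothetical, and the sort call itself is left as a placeholder. Note that this only covers the parent process, not the pool workers.

```python
import faulthandler


def run_with_watchdog(fn, log_path, timeout_s=6 * 3600):
    """Run fn() and return its result.

    If fn() is still running after timeout_s seconds, the stacks of all
    threads in this process are written to log_path, and again every
    timeout_s seconds until fn() returns. The process is not killed.
    """
    # faulthandler writes straight to the file descriptor, so the file
    # has to stay open for as long as the watchdog is armed.
    with open(log_path, "w") as log:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log)
        try:
            return fn()
        finally:
            faulthandler.cancel_dump_traceback_later()


# Hypothetical usage: replace the lambda body with your own sorting call.
# run_with_watchdog(lambda: my_sort_call(), "ms4_hang.log")
```

The default timeout of six hours is an assumption, chosen to sit just above the 4-5 hour run time reported for inputs that complete.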

@tjd2002
Contributor

tjd2002 commented Apr 9, 2019 via email

@gfetterman
Author

We haven't titrated the size, but the cutoff sits somewhere between 1 hour and 3 hours.

Above that, it's mostly reliable, but there has been the odd time (invariably using what we've taken to thinking of as the "easy" parameter set described above) when a longer file hasn't hung.

@alexmorley
Collaborator

Have you checked resource usage during this process, both available RAM and temporary disk space? All of those changes would increase it.

@gfetterman
Author

Both before and during the hang, each thread is using between 100MB and 2GB of memory. The machine has 64GB of memory, and in total it never rises above about 50% in use. This doesn't appear to vary significantly between the two file sizes.

MountainSort does appear to be using a significant volume of temporary disk space - on the order of 4-5x the size of the file being sorted.
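
To rule out temporary disk space before starting a long run, a check along these lines may help. It is a minimal sketch, not part of MountainSort: the function name temp_space_report is hypothetical, the factor of 5 is taken from the 4-5x figure above, and the temporary directory defaults to Python's own guess, which may not be the directory the sorter actually writes to.

```python
import os
import shutil
import tempfile


def temp_space_report(input_path, temp_dir=None, factor=5):
    """Compare free space in temp_dir with factor x the input file size.

    temp_dir defaults to tempfile.gettempdir(); pass the directory your
    sorter really uses if it differs (that location is an assumption here).
    """
    temp_dir = temp_dir or tempfile.gettempdir()
    needed = factor * os.path.getsize(input_path)
    free = shutil.disk_usage(temp_dir).free
    return {
        "needed_bytes": needed,
        "free_bytes": free,
        "enough": free >= needed,
    }
```

If "enough" comes back False for the longer files but True for the 1-hour files, disk space becomes a candidate explanation worth testing; if it is True for both, it is less likely to be the cause.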

@magland
Owner

magland commented Sep 14, 2019

@gfetterman was this ever resolved for you?

@mafrasiabi

Hi, I recently ran into this problem (the process sits there doing nothing), even with the toy data. I'm using ms4 with spikeinterface and installed it with pip (ml-ms4alg==0.3.2). When I stop the process with a keyboard interrupt, I can see that it was stuck in the pool, even though I set num_workers to 1.

@teristam

teristam commented May 21, 2020

Hi, I recently ran into this problem (the process sits there doing nothing), even with the toy data. I'm using ms4 with spikeinterface and installed it with pip (ml-ms4alg==0.3.2). When I stop the process with a keyboard interrupt, I can see that it was stuck in the pool, even though I set num_workers to 1.

I just want to add that I've encountered the same problem. I have an old conda env that works properly, but when I created a new one, the process got stuck at the PCA step. I've attached the two environment files for reference:

Working env:
working.txt

Not working env:
notworking.txt
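
To narrow down which dependency changed between the two environments, the files can be compared package by package. The sketch below assumes each file lists packages as name==version (pip freeze style) or name=version=build (conda export style), possibly as YAML list items. The actual format of the attached files is not shown here, so that is an assumption, and parse_packages and diff_envs are hypothetical helper names.

```python
import re


def parse_packages(text):
    """Return {package_name: version} parsed from an environment listing."""
    packages = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("- "):  # YAML list item in a conda environment file
            line = line[2:].strip()
        if not line or line.startswith("#"):
            continue
        # Accept "name==version" as well as "name=version" / "name=version=build".
        match = re.match(r"^([A-Za-z0-9_.\-]+)=+([^=\s]+)", line)
        if match:
            packages[match.group(1).lower()] = match.group(2)
    return packages


def diff_envs(working_text, broken_text):
    """Return {name: (working_version, broken_version)} for every difference.

    A version of None means the package is missing from that environment.
    """
    working = parse_packages(working_text)
    broken = parse_packages(broken_text)
    return {
        name: (working.get(name), broken.get(name))
        for name in sorted(set(working) | set(broken))
        if working.get(name) != broken.get(name)
    }
```

With the two attachments saved locally, diff_envs(open("working.txt").read(), open("notworking.txt").read()) would list only the packages whose versions differ, which gives a short list of candidates to pin one at a time.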
