Task.await deadlock (task finishes but await never returns) #47
Comments
dune fragment:
I tried adding some debugging into domainslib (there is probably a better way but just used printfs here): https://github.com/ocaml-multicore/domainslib/compare/master...edwintorok:debug?expand=1 And I got this snippet:
So 785 was waiting on 784, then 784 got its result set with Atomic.set, but later on Atomic.get for 784 still returns None, which is weird. I could understand a little bit of delay, but looping over and over again it still says None (and it has even taken a mutex and made a syscall in the meantime), so I don't think it's a cache coherency issue or a missing fence as I initially thought.
A simplified form has been reported on the ocaml-multicore tracker; this seems to be an OCaml compiler/runtime issue, not just a domainslib issue.
Fixed by #51.
Is it valid to nest Task.await inside a Task.async, and to nest Task.async inside Task.async? (I would assume so; otherwise, how would you implement a monadic bind using async+await?)
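To make the question concrete, here is a minimal sketch of the kind of monadic bind meant above. It is not domainslib code: Domain.spawn and Domain.join from the OCaml 5 standard library are used as stand-ins for Task.async and Task.await so that the block runs without any extra library, and the names promise, async, await, return and bind are hypothetical. With domainslib the same shape would presumably pass the pool as an extra argument to both calls.

```ocaml
(* Assumes OCaml 5. Domain.spawn / Domain.join stand in for
   Task.async / Task.await; this is an illustration, not domainslib's API. *)

type 'a promise = 'a Domain.t

let async (f : unit -> 'a) : 'a promise = Domain.spawn f
let await (p : 'a promise) : 'a = Domain.join p

(* A promise that is fulfilled with [x]. *)
let return (x : 'a) : 'a promise = async (fun () -> x)

(* The outer async awaits [p], then runs [f], which itself calls async,
   and finally awaits the promise that [f] produced. Both kinds of
   nesting from the question (await inside async, async inside async)
   appear here. *)
let bind (p : 'a promise) (f : 'a -> 'b promise) : 'b promise =
  async (fun () -> await (f (await p)))

let ( >>= ) = bind

(* Chain three steps: start from 20, add 1, then double. *)
let answer =
  await
    (return 20 >>= fun x ->
     async (fun () -> x + 1) >>= fun y ->
     return (y * 2))
```

Each step here costs a whole domain, which is only acceptable for a tiny example; a task pool exists precisely to avoid that.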
I attempted to write a version of parallel_for that works on a stream, and not on an array of known size. It looks like this (with some debugging sprinkled in), but it gets stuck in Task.await in an infinite loop on multiple cores:
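For comparison only, and not the code from this report: a stream-based parallel loop can also be sketched without Task.async/Task.await at all, by letting a fixed number of worker domains pull items from a shared Seq under a mutex. This is a different design from the async-per-element approach described above. It assumes OCaml 5 (Domain and Mutex in the standard library), and parallel_iter_seq and its parameters are hypothetical names.

```ocaml
(* Assumes OCaml 5. A swapped-in technique: worker domains pulling from
   a shared sequence, rather than one async task per element. *)

let parallel_iter_seq ~num_domains (f : 'a -> unit) (s : 'a Seq.t) : unit =
  let lock = Mutex.create () in
  let rest = ref s in
  (* Take the next element of the stream, or None when it is exhausted.
     The mutex serialises access to [rest]; Fun.protect releases it even
     if forcing the sequence raises. *)
  let next () =
    Mutex.lock lock;
    Fun.protect ~finally:(fun () -> Mutex.unlock lock) (fun () ->
      match !rest () with
      | Seq.Nil -> None
      | Seq.Cons (x, tl) -> rest := tl; Some x)
  in
  (* Each worker keeps pulling and processing items until none are left. *)
  let rec worker () =
    match next () with
    | None -> ()
    | Some x -> f x; worker ()
  in
  let domains = List.init num_domains (fun _ -> Domain.spawn worker) in
  List.iter Domain.join domains

(* Usage: sum the integers 1..1000 from a stream of unknown length. *)
let total = Atomic.make 0

let () =
  parallel_iter_seq ~num_domains:4
    (fun x -> ignore (Atomic.fetch_and_add total x))
    (Seq.init 1000 (fun i -> i + 1))
```

The length of the stream is never needed up front; the trade-off is that every element fetch takes the lock, so this only pays off when f does a meaningful amount of work per item.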
Looking at a sample failure on OCaml 4.12.0+domains, it looks like the async task has "finished" (or at least it acquired the mutex, printed the message that it is near the end, and unlocked it), but then there is an await happening later (we know it is later due to the mutex), and that never returns and just spins endlessly.
It appears that Task.ml thinks there should be more tasks to run (because Atomic.get returned None), but when it goes looking there are none.
Now the previous task that has finished should have set the atomic value with Atomic.set (either on success or on failure), and the GC didn't appear to have run in between, so I'm puzzled as to why Atomic.get finds None (unless the implementation of Atomic.get is incorrect? Looking at the disassembly, I don't see any lock prefixes or fence instructions).
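The expectation described above can be sketched in isolation, independent of domainslib: once one domain has published a value with Atomic.set, another domain spinning on Atomic.get should eventually observe it. The block below assumes OCaml 5 (Domain and Atomic in the standard library); cell, producer and wait_for_result are hypothetical names, and this is not how Task.ml is actually written.

```ocaml
(* Assumes OCaml 5. A one-shot result cell, similar in spirit to a
   promise whose state starts out as None. *)
let cell : int option Atomic.t = Atomic.make None

(* Producer: publish the result with Atomic.set. *)
let producer () = Atomic.set cell (Some 42)

(* Consumer: spin until Atomic.get observes the published value.
   Domain.cpu_relax is a hint that this is a busy-wait loop. *)
let rec wait_for_result () =
  match Atomic.get cell with
  | Some v -> v
  | None -> Domain.cpu_relax (); wait_for_result ()

let result =
  let d = Domain.spawn producer in
  let v = wait_for_result () in
  Domain.join d;
  v
```

If this loop were to keep returning None after the producer had finished, that would point at the runtime or compiler rather than at the library built on top of it.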
Here is an example failure (and it fails in both native and bytecode modes):
See that 86458 is nearly at the end of the async task (there is not supposed to be anything blocking happening after this),
and then 86459 tries to wait for it, but gets stuck.
I'm on Fedora 34, and the libc here is somewhat buggy when it comes to multicore: pthread_cond_wait/pthread_cond_signal in particular is known to be buggy, as is anything with glibc>=2.27 (https://bugzilla.redhat.com/show_bug.cgi?id=1889892). However, the above stack trace doesn't show pthread_cond_wait.
Note that smaller values for the Array work fine, but with sizes in the thousands or millions it fails as above almost always (not always at the same task id).
Platform: