Skynet failures on larger machines #16652
Comments
@fengxue-IS Can you see if you can increase reproducibility on smaller machines by setting -Djdk.virtualThreadScheduler.parallelism/maxPoolSize/minRunnable to high values?
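For anyone trying the suggestion above, here is a minimal sketch of how the three scheduler properties could be spelled out as JVM arguments. The property names come from the comment; the helper class `SchedulerFlags` and the value 112 (borrowed from the hardware thread count in the issue description) are illustrative assumptions, and the thread does not establish which values actually reproduce the failure.

```java
import java.util.List;

// Hypothetical helper, not part of OpenJ9 or the Skynet test.
final class SchedulerFlags {
    // Builds the -D arguments for the virtual thread scheduler properties
    // named in the comment above. The values are chosen by the caller.
    static List<String> jvmArgs(int parallelism, int maxPoolSize, int minRunnable) {
        return List.of(
            "-Djdk.virtualThreadScheduler.parallelism=" + parallelism,
            "-Djdk.virtualThreadScheduler.maxPoolSize=" + maxPoolSize,
            "-Djdk.virtualThreadScheduler.minRunnable=" + minRunnable);
    }

    public static void main(String[] args) {
        // 112 is an assumed value that mimics the larger machine on a small one.
        System.out.println(String.join(" ", jvmArgs(112, 112, 112)));
    }
}
```

The printed arguments would go on the `java` command line ahead of the test's main class.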
I was able to reproduce this locally on an 8-core xlinux machine using
The corefile was not generated correctly due to a machine issue; I will generate it again to look into the corefile data more closely.
Looking forward to more stack traces to see if this is specific to Concurrent Global GC. But if this is scanning a carrier thread with a mounted virtual thread, one of the first generic questions from a GC perspective is whether the JIT ensured that both the carrier and virtual threads are at a safe point for walking their stacks (to call J9AllocateObject rather than J9AllocateObjectNoGC).
I have figured out what caused this issue. A temporary workaround is to disable concurrent marking via -Xgc:noConcurrentMarkKO. We use "one-way synchronization" between concurrent scanning and mounting instead of a full mutex: a concurrent-marking scan of a continuation blocks mounting of that same continuation object, and the mounting thread releases VM access while it waits. That opens a second race for VM access, between the scavenger (STW) and the mounting thread trying to reacquire VM access once the concurrent-mark continuation scan is done. If the scavenger wins exclusive VM access, the mounting thread is still blocked (it has not yet swapped the Java stack), but j9vmcontinuation.state has already been marked "mounted". The scavenger therefore skips the continuation scan, and the related references on the Java stack are not updated/copy-forwarded. With concurrent scavenger enabled, there are more chances to expose the problem. I will think about a solution for both the default gencon case and the concurrent scavenger case. Please assign the issue to me.
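The window described in this comment can be sketched as a deterministic toy model. Everything here is hypothetical: `ContinuationRaceModel`, its fields, and the integer "addresses" are illustrative stand-ins, not OpenJ9 internals. The `stackSwapped` check is only one conceivable guard, included to contrast the two outcomes; it is not the fix that was adopted.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: all names are hypothetical, not OpenJ9 code.
final class ContinuationRaceModel {
    enum State { UNMOUNTED, MOUNTED }

    static final class Continuation {
        State state = State.UNMOUNTED;
        boolean stackSwapped = false; // true once the carrier really runs on this stack
        final List<Integer> stackRefs = new ArrayList<>(List.of(100, 200)); // fake object addresses
    }

    // STW scavenge that moves every object by `delta` and fixes up the roots it scans.
    static void scavenge(Continuation c, int delta, boolean trustStateFlagOnly) {
        boolean skipContinuationScan = trustStateFlagOnly
                ? c.state == State.MOUNTED
                : c.state == State.MOUNTED && c.stackSwapped;
        // The carrier thread's stack scan only sees these frames after the swap.
        boolean carrierScanSeesFrames = c.stackSwapped;
        if (!skipContinuationScan || carrierScanSeesFrames) {
            for (int i = 0; i < c.stackRefs.size(); i++) {
                c.stackRefs.set(i, c.stackRefs.get(i) + delta);
            }
        }
    }

    // Replays the interleaving: state marked, mounter blocked, scavenger runs, mount completes.
    static boolean refsAreStale(boolean trustStateFlagOnly) {
        Continuation c = new Continuation();
        c.state = State.MOUNTED;               // marked "mounted" before the stack swap
        scavenge(c, 1000, trustStateFlagOnly); // scavenger wins exclusive VM access
        c.stackSwapped = true;                 // mounting thread finally swaps stacks
        return !c.stackRefs.equals(List.of(1100, 1200));
    }

    public static void main(String[] args) {
        System.out.println("state flag only: stale=" + refsAreStale(true));
        System.out.println("flag + swap check: stale=" + refsAreStale(false));
    }
}
```

In this model, trusting the state flag alone leaves the references at their old addresses, which corresponds to the stale pointers a later GC would trip over.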
@fengxue-IS Can you confirm via a grinder whether this is just a perf issue now? It is a perf issue if the failure goes away when the following workaround is applied: Note: Perf issues should not block the JDK19 release, since we have decided to address perf issues closer to the JEP's GA milestone.
Grinder seeing an assertion failure in GC with option
@LinHu2016 can you take a look at the output?
In general, this assertion means Global Mark discovered an object pointer (within the range of the heap) that no longer points to an object (stale?). These objects are taken from a Work Packet, so it is relatively easy to figure out which one it is. Searching the core for the bad pointer might help to understand what was being scanned when it was discovered and put into the WP. If you need to investigate such details and have a system core, please let me know.
@fengxue-IS I noticed the failure occurs in a run with -Xgc:noConcurrentMarkKO, so it doesn't look related to the "pending to be mounted" case that caused the originally reported issue. I will try to reproduce/debug the issue locally.
Keep this open until we have a fix in the 0.37 stream.
The Skynet test fails readily (maybe 1 in 5 runs), in a variety of different ways, when run on larger machines, for example an Intel Cascade Lake with 112 hardware threads.
When a failure occurs, it occurs almost immediately.
As the hardware thread count is lowered (by changing machines or even via numactl), the failures either do not appear or are much more intermittent.
Example failures:
1). GC crashes/asserts