OpenJDK java/lang/Thread/virtual/stress/TimedGet Invalid JIT return address #19249
Comments
20x x 5 grinder on single test https://openj9-jenkins.osuosl.org/job/Grinder/3426/ - passed
1x x 8 grinder on jdk_lang_j9_0 https://openj9-jenkins.osuosl.org/job/Grinder/3427/ - passed |
@BradleyWood, may I ask you to take a look at this failure? It's targeted for the 0.44 release, so it's high priority. |
I added it while we investigate since the failure occurred in a release build. If the frequency of the failure is low, we can move it out. |
@pshipton Have there been any failures on platforms other than PPC AIX? |
Not recently. There is also the closed issue #17163 |
@tajila fyi |
Perhaps related to #18910 |
No success yet on reproducing the problem locally |
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/179 - p8-java1-ibm02
|
Trying a 5x x 5 jdk_lang_j9_0 internal grinder - passed, all machines are 7.2 p8 or p9 |
https://openj9-jenkins.osuosl.org/job/Test_openjdk22_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/44 |
Matching symptoms have been reported by a customer (on x86) |
Added the userRaised label due to #19249 (comment) |
@BradleyWood Any new updates to this one? |
This issue looks specific to AIX PPC. @JamesKingdon If you have any details that might indicate the customer issue is caused by the same problem as this issue, please let me know. Otherwise, I would like to ask @zl-wang to assign this to someone on the power team. |
Hi Brad, I'm going to have to start putting internal case numbers on comments like the one above. I'm currently not able to locate the case that prompted that comment. |
@zl-wang Hi Julian - could you assign someone to work on this one? (0.48 target) |
@rmnattas could you take this up? |
@tajila @dmitripivkine see the virtual thread's stack backtrace above: there is no interpreter frame at all; every frame is JIT-compiled code. @rmnattas I am wondering whether there were recent code changes in the JIT metadata look-up (jitGetMapsFromPC, jitGetExceptionTable, etc.) that cause jitInfo to be NULL. |
As before, using the continuation J9VMThread (generated on the stack for the purposes of stack-walking), we find the following stack-trace
The method
The first compilation's entry point was patched to the pre-prologue, but the parked virtual thread still has it live on its stack.
When the parked virtual thread stack is walked it crashes with
Given that the first body metadata still exists (one indication is that KCA finds it when walking the AVLTree), I don't see why it crashes.
Looking at the core in a different way, to find the
The caller
Finding the VirtualThread
Showing the same stack-trace found before for the unmounted thread
And no carrier thread:
@babsingh, any suggestions given similarity to #18910 , possibly duplicate? |
Build got auto-removed but uploaded it here: https://ibm.box.com/s/jx79twimu30p1gm181ku1mymzxz0ozwm |
@rmnattas It's not a duplicate of #18910. Unlike #18910, the virtual thread in the above core file is completely unmounted and not in an intermediate, unsteady state.
Here is some existing code which walks the stack of an unmounted virtual thread: openj9/runtime/jvmti/jvmtiHelpers.cpp Lines 2101 to 2109 in 96ba473
|
we indeed can see
Now, the comment in #19249 (comment) points to potential problems in the faked J9VMThread that is set up in order to walk the VThread. Who sets up that faked J9VMThread? |
@TobiAjila One thing we noticed early: the faked J9VMThread is not 256-byte aligned, as below:
However, the low-order 8 bits all have other meanings, for locking at least. Has anybody made sure that these bits being (effectively) set do not interfere with the stack walk? |
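The alignment concern raised above can be illustrated with a small sketch. It assumes, as the comment states, that a J9VMThread address is expected to be 256-byte aligned so that the low-order 8 bits are free to carry other information. The helper names and the sample addresses are hypothetical and are not taken from the core file or from the OpenJ9 sources.

```python
# Sketch only: the names and sample addresses below are hypothetical.
ALIGNMENT = 256                  # assumed required alignment of a J9VMThread
LOW_BITS_MASK = ALIGNMENT - 1    # 0xFF, i.e. the low-order 8 bits


def is_aligned(addr):
    """Return True when the low-order 8 bits of addr are all zero."""
    return (addr & LOW_BITS_MASK) == 0


def split_tagged(value):
    """Split a word into (aligned pointer part, low-order 8 bits)."""
    return value & ~LOW_BITS_MASK, value & LOW_BITS_MASK


real_thread = 0x2A00     # hypothetical, 256-byte aligned
faked_thread = 0x2A48    # hypothetical, built on a native stack, not aligned

for addr in (real_thread, faked_thread):
    pointer, low_bits = split_tagged(addr)
    print(hex(addr), "aligned" if is_aligned(addr) else "NOT aligned",
          "pointer part:", hex(pointer), "low bits:", hex(low_bits))
```

If any code masks off the low-order bits of a thread pointer before dereferencing it, an unaligned faked thread would resolve to a different address (`0x2a00` rather than `0x2a48` in this sketch), which is the kind of interference the question asks about.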
@babsingh @TobiAjila from dissecting the core file and observing the failure, the difference is as below:
In case this difference gives you some ideas ... |
++ @fengxue-IS for insights. |
in @zl-wang @rmnattas is there any dependency on |
@fengxue-IS I don't think so. From the walker's perspective, there is no difference between a recompiled and a non-recompiled method. In the background, once a method is recompiled, the space occupied by the previous compilation is queued to the FaintBlockList (i.e. it becomes a candidate to be freed as early as the next GC, when stack-walking doesn't find it active on any thread; when it is freed, the associated metaData range is adjusted as well). In this case, the FaintBlock obviously hasn't been freed yet, since it is still active on that VThread at least and KCA can still find its metaData. |
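The reclamation rule described in the comment above can be modelled with a toy sketch. This is not the OpenJ9 implementation: `CodeBody`, `CodeCache`, and the addresses are hypothetical, and a "stack" is reduced to a list of return addresses. It only shows why metadata for an old body should remain discoverable while a parked thread still has a frame in it.

```python
from dataclasses import dataclass, field


@dataclass
class CodeBody:
    name: str
    start: int
    end: int  # exclusive

    def contains(self, pc):
        return self.start <= pc < self.end


@dataclass
class CodeCache:
    metadata: list = field(default_factory=list)  # bodies that lookup() can find
    faint: list = field(default_factory=list)     # old bodies awaiting reclamation

    def compile(self, name, start, end):
        body = CodeBody(name, start, end)
        self.metadata.append(body)
        return body

    def recompile(self, old, start, end):
        # The old body is only queued; its metadata stays registered for now.
        self.faint.append(old)
        return self.compile(old.name, start, end)

    def lookup(self, pc):
        for body in self.metadata:
            if body.contains(pc):
                return body
        return None

    def gc(self, thread_stacks):
        # A faint body is freed only if no walked stack has a return address in it.
        live_pcs = [pc for stack in thread_stacks for pc in stack]
        survivors = []
        for body in self.faint:
            if any(body.contains(pc) for pc in live_pcs):
                survivors.append(body)
            else:
                self.metadata.remove(body)
        self.faint = survivors


cache = CodeCache()
first = cache.compile("m", 0x1000, 0x1100)
second = cache.recompile(first, 0x2000, 0x2100)
parked_stack = [0x1040]                    # parked thread still inside the first body
cache.gc([parked_stack])
print(cache.lookup(0x1040) is first)       # old metadata survives the GC
cache.gc([[]])                             # no stack references the first body any more
print(cache.lookup(0x1040))                # old metadata is gone
```

Note that the rule only holds if every stack, including those of unmounted virtual threads, is actually visited by the walk that feeds `gc`.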
We have a new case TS017553893 with an assert from swalk.c:1629, is there anything I should look for in the corefiles to increase confidence that it's the same issue as being explored here? |
@JamesKingdon curious on which platform this new case is ...
|
The problem is happening on amd64, unfortunately I think the corefile is probably truncated - I'm having a lot of trouble making sense of it. The stack was
I can't identify the jit return address, but it looks like code. One thing that is catching my eye is that there are signs of native crypto code on the failing stack. |
@babsingh as mentioned All non-zero values:
Also, I'm not sure if it was supposed to be for another field, but I was told that it indicates the type of what's being executed?
|
Suggesting moving to 0.49. |
The asserts from swalk.c in TS017553893 were caused by the new memory disclaiming feature. The backing files were being written to an NFS-mounted directory, and the majority of the JIT data cache was reading as 0. Using |
is that a defined NFS behaviour? |
Not as far as I could find, it looks like mmap is supported on nfs with some caveats about distributed sharing. I'm wondering if there's some unexpected behaviour around us immediately unlinking the backing files. It took me a little longer to recognize the problem because the files didn't have our omrvmem* names, but had the special .nfs* names for pending deletes. |
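The pattern discussed above (map a backing file, then immediately unlink it) can be sketched as follows. This is a minimal illustration for a POSIX system, not the OMR code: the function names and the `vmem_sketch_` prefix are hypothetical, and running it on a local filesystem only shows the expected behaviour. It does not reproduce the NFS problem.

```python
import mmap
import os
import tempfile


def map_unlinked_file(directory, size):
    """Create a backing file in directory, map it shared, then unlink it."""
    fd, path = tempfile.mkstemp(prefix="vmem_sketch_", dir=directory)
    try:
        os.ftruncate(fd, size)
        region = mmap.mmap(fd, size)   # shared, read/write mapping by default on POSIX
        os.unlink(path)                # the name goes away while the mapping stays live
    finally:
        os.close(fd)                   # mmap keeps its own duplicate of the descriptor
    return region, path


def zero_fraction(region):
    """Fraction of bytes in the mapping that currently read as zero."""
    data = region[:]
    return data.count(0) / len(data)


region, path = map_unlinked_file(tempfile.gettempdir(), 4096)
region[:5] = b"hello"
print(os.path.exists(path))        # the original name no longer exists
print(bytes(region[:5]))           # data written through the mapping reads back
print(zero_fraction(region))       # share of the mapping that still reads as zero
```

On a local filesystem the data written through the mapping is expected to read back unchanged after the unlink. A check like `zero_fraction` over a region that should hold data is one rough way to spot the symptom described in the thread, where most of the cache read as 0.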
@rmnattas, I am assuming this should move out to the 0.51 release. |
openjdk21_j9_sanity.openjdk_ppc64le_linux(
|
https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Release_testList_0/16/ - p8-java1-ibm07
jdk_lang_j9_0
java/lang/Thread/virtual/stress/TimedGet.java
https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Release_testList_0/16/openjdk_test_output.tar.gz