OpenJDK java/lang/Thread/virtual/stress/TimedGet Invalid JIT return address #19249

Open
pshipton opened this issue Apr 1, 2024 · 44 comments

@pshipton
Member

pshipton commented Apr 1, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Release_testList_0/16/ - p8-java1-ibm07
jdk_lang_j9_0
java/lang/Thread/virtual/stress/TimedGet.java

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Release_testList_0/16/openjdk_test_output.tar.gz

18:33:51  STDOUT:
18:33:51  STDERR:
18:33:51  
18:33:51  
18:33:51  *** Invalid JIT return address 00000100114ABE44 in 0000010023C8BA70
18:33:51  
18:33:51  22:19:06.319 0x10023c92e00    j9vm.249    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK21_ppc64_aix_Release/openj9/runtime/vm/swalk.c:1629: ((0 ))
@pshipton
Member Author

pshipton commented Apr 1, 2024

20x x 5 grinder on single test https://openj9-jenkins.osuosl.org/job/Grinder/3426/ - passed

1x x 8 grinder on jdk_lang_j9_0 https://openj9-jenkins.osuosl.org/job/Grinder/3427/ - passed
1x x 8 grinder on jdk_lang_j9_0 https://openj9-jenkins.osuosl.org/job/Grinder/3432/ - 1 failure in java/lang/Thread/virtual/stress/Skynet.java#default

@hzongaro
Member

hzongaro commented Apr 1, 2024

@BradleyWood, may I ask you to take a look at this failure? It's targeted for the 0.44 release, so it's high priority.

@pshipton
Member Author

pshipton commented Apr 1, 2024

"It's targeted for the 0.44 release, so it's high priority."

I added it while we investigate since the failure occurred in a release build. If the frequency of the failure is low, we can move it out.

@BradleyWood
Member

@pshipton Have there been any failures on platforms other than PPC AIX?

@pshipton
Member Author

pshipton commented Apr 1, 2024

Not recently. There is also the closed issue #17163

@pshipton
Member Author

pshipton commented Apr 1, 2024

@tajila fyi

@pshipton
Member Author

pshipton commented Apr 1, 2024

Perhaps related to #18910

@BradleyWood
Member

No success yet reproducing the problem locally.

@pshipton
Member Author

pshipton commented Apr 11, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/179 - p8-java1-ibm02
jdk_lang_j9_0
java/lang/Thread/virtual/stress/Skynet.java#default

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk21_j9_sanity.openjdk_ppc64_aix_Nightly_testList_0/179/openjdk_test_output.tar.gz

20:37:28  *** Invalid JIT return address 00000100114AE880 in 00000100236FBA70
20:37:28  
20:37:28  00:22:45.178 0x10023703100    j9vm.249    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/Build_JDK21_ppc64_aix_Nightly/openj9/runtime/vm/swalk.c:1629: ((0 ))

@pshipton
Member Author

pshipton commented Apr 11, 2024

Trying a 5x x 5 jdk_lang_j9_0 internal grinder - passed, all machines are 7.2 p8 or p9
4x x 5 on 7.1 machines grinder - no 7.1 machines available for test

@JamesKingdon
Contributor

JamesKingdon commented Jul 10, 2024

Matching symptoms have been reported by a customer (on x86)

@pshipton
Member Author

Added the userRaised label due to #19249 (comment)

@vij-singh

@BradleyWood Any new updates to this one?

@BradleyWood
Member

This issue looks specific to AIX PPC. @JamesKingdon If you have any details that might indicate the customer issue is caused by the same problem as this issue, please let me know. Otherwise, I would like to ask @zl-wang to assign this to someone on the power team.

@JamesKingdon
Contributor

Hi Brad, I'm going to have to start putting internal case numbers on comments like the one above. I'm currently not able to locate the case that prompted that comment.

@vij-singh

@zl-wang Hi Julian - could you assign someone to work on this one? (0.48 target)

@zl-wang
Contributor

zl-wang commented Sep 26, 2024

@rmnattas could you take this up?

@zl-wang
Contributor

zl-wang commented Oct 2, 2024

@tajila @dmitripivkine see the virtual thread's stack backtrace above: there is no interpreter frame at all ... every frame is JIT-ed code.

@rmnattas I am wondering if there were recent code changes in the JIT metadata look-up (jitGetMapsFromPC, jitGetExceptionTable, etc.) ... causing jitInfo to be NULL.

@0xdaryl assigned rmnattas and unassigned BradleyWood Oct 8, 2024
@rmnattas
Contributor

rmnattas commented Oct 9, 2024

As before, using the continuation J9VMThread (generated on the stack for the purposes of stack-walking), we find the following stack trace:

0x0000010011b3aadc {java/lang/VirtualThread.park} JIT [0x100315fbeb0]
0x000001001192c7a4 {java/util/concurrent/LinkedTransferQueue$DualNode.await} JIT [0x100315fbfb0]
0x0000010011b2f760 {java/util/concurrent/SynchronousQueue$Transferer.xferLifo} JIT [0x100315fc0e0]
0x0000010011b2f068 {java/util/concurrent/SynchronousQueue.xfer} JIT [0x100315fc1a0]
0x0000010011b34790 {java/util/concurrent/SynchronousQueue.take} JIT [0x100315fc1f0]
0x000001001192b2fc {Skynet$Channel.receive} JIT [0x100315fc280]
0x00000100117458fc {Skynet.skynet} JIT [0x100315fc2b0]
0x0000010011b341ec {Skynet.lambda$skynet$1} JIT [0x100315fc320]      <-- walkState->pc / Invalid JIT return address
0x0000010011b34044 {Skynet$$Lambda/0x00000000243074b8.run} JIT [0x100315fc370]
0x0000010011742834 {java/lang/Thread.runWith} JIT [0x100315fc3a0]
0x0000010011931b9c {java/lang/VirtualThread.run} JIT [0x100315fc450]
0x0000010011b337e0 {jdk/internal/vm/Continuation.enter} JIT [0x100315fc520]
0x090000001ac61334 {libj9jit29.so}{} [0x100315fbeb0]

The method {Skynet.lambda$skynet$1} was recompiled and exists twice:

Searching for JIT'ed methods for the J9Method 0x0000010024325848 {Skynet.lambda$skynet$1}
    J9Class            J9Method               Start          Len {ClassPath/Name.MethodName}
--------------------------------------------------------------------------------------------
(0x0000010024325600 0x0000010024325848) 0x0000010011b34104    87 {Skynet.lambda$skynet$1(LSkynet$Channel;III)V}
(0x0000010024325600 0x0000010024325848) 0x0000010011b47f04   538 {Skynet.lambda$skynet$1(LSkynet$Channel;III)V}

The first compilation's entry point was patched to branch to the pre-prologue, but the parked virtual thread still has that body live on its stack.

0x10011b34154 {Skynet.lambda$skynet$1} -7                 00000000 invalid instruction
0x10011b34158 {Skynet.lambda$skynet$1} -6            >    7c0802a6 mflr      r0 <<< ^+4
0x10011b3415c {Skynet.lambda$skynet$1} -5            |    481efff5 bl        0x10011d24150 Trampoline {libj9jit29.so}{_samplingPatchCallSite} +0   
0x10011b34160 {Skynet.lambda$skynet$1} -4            |    00000100 invalid instruction
0x10011b34164 {Skynet.lambda$skynet$1} -3            |    11494670 maddhd    r10, r9, r8, r25
0x10011b34168 {Skynet.lambda$skynet$1} -2            |    e96f0050 ld        r11, 0x50(r15) J9VMThread.stackOverflowMark // stack or async check 
0x10011b3416c {Skynet.lambda$skynet$1} -1            |    00100250 invalid instruction
0x10011b34170 {Skynet.lambda$skynet$1} +0            |    e86e0018 ld        r3, 0x18(r14)
0x10011b34174 {Skynet.lambda$skynet$1} +1            |    808e0010 lwz       r4, 0x10(r14)
0x10011b34178 {Skynet.lambda$skynet$1} +2            |    80ae0008 lwz       r5, 8(r14)
0x10011b3417c {Skynet.lambda$skynet$1} +3            |    80ce0000 lwz       r6, 0(r14)
0x10011b34180 {Skynet.lambda$skynet$1} +4            *    4bffffd8 b         0x10011b34158 U>> ^-6   // <-- jitEntry

When the parked virtual thread stack is walked it crashes with *** Invalid JIT return address 0000010011B341EC in 0000010024004A80.

Given that the first body's metadata still exists (one indication is that KCA finds it when walking the AVLTree), I don't see why it crashes.
There's a cache that's used instead of walking the AVLTree, but it's not used for walking the continuation VThreads, hence it's not the cause.


Looking at the core in a different way,

To find the VThreadContinuation, I looked through the stack frames.

0x090000001a2ef578 {libj9vm29.so}{invalidJITReturnAddress} [0x10024003c00]
0x090000001ac7cd64 {libj9jit29.so}{jitWalkStackFrames} [0x10024003c70]
0x090000001a2ed058 {libj9vm29.so}{walkStackFrames} [0x10024003d00]
0x090000001a392a8c {libj9vm29.so}{walkContinuationStackFrames} [0x10024003df0]
0x090000001c8b8f9c {libj9gc_full29.so}{_ZN28GC_VMThreadStackSlotIterator21scanContinuationSlotsEP10J9VMThreadP8J9ObjectPvPFvP8J9JavaVMPS3_S4_P16J9StackWalkStatePKvEbb} [0x10024004a10]
0x090000001cb17830 {libj9gc_full29.so}{_ZN22MM_GlobalMarkingScheme10scanObjectEP19MM_EnvironmentVLHGCP8J9ObjectNS_10ScanReasonE} [0x10024004d40]
0x090000001cb18234 {libj9gc_full29.so}{_ZN22MM_GlobalMarkingScheme19markLiveObjectsScanEP19MM_EnvironmentVLHGC} [0x10024004e10]
...

The caller markLiveObjectsScan has the objectPtr (the VThreadContinuation in this case) in r27, and the callee stores r27 at 0xa8(r1). Using the SP of the callee we find the VThreadContinuation.

(kca) what (0x10024004d40+0xa8)
0x10024004de8: 0x0a00000007060f20 Obj - {java/lang/VirtualThread$VThreadContinuation}

Finding the VirtualThread

> !vthreads | grep -i 7060f20
!continuationstack 0x00000100315fabd0 !j9vmcontinuation 0x00000100315fabd0 !j9object 0x0A00000007060F20 (Continuation) !j9object 0x0A0000000829B110 (VThread) -

Showing the same stack-trace found before for the unmounted thread

> !continuationstack 0x00000100315fabd0
<100315fabd0>   !j9method 0x0000010023251020   jdk/internal/vm/Continuation.yieldImpl(Z)Z
<100315fabd0>   !j9method 0x0000010023250EC0   jdk/internal/vm/Continuation.yield0()Z
<100315fabd0>   !j9method 0x0000010023250EA0   jdk/internal/vm/Continuation.yield(Ljdk/internal/vm/ContinuationScope;)Z
<100315fabd0>   !j9method 0x000001002323F5E0   java/lang/VirtualThread.yieldContinuation()Z
<100315fabd0>   !j9method 0x000001002323F6C0   java/lang/VirtualThread.park()V
<100315fabd0>   !j9method 0x0000010023267C80   java/lang/Access.parkVirtualThread()V
<100315fabd0>   !j9method 0x00000100244FC218   jdk/internal/misc/VirtualThreads.park()V
<100315fabd0>   !j9method 0x00000100232369D0   java/util/concurrent/locks/LockSupport.park()V
<100315fabd0>   !j9method 0x00000100243ED0A8   java/util/concurrent/LinkedTransferQueue$DualNode.await(Ljava/lang/Object;JLjava/lang/Object;Z)Ljava/lang/Object;
<100315fabd0>   !j9method 0x00000100240E5F38   java/util/concurrent/SynchronousQueue$Transferer.xferLifo(Ljava/lang/Object;J)Ljava/lang/Object;
<100315fabd0>   !j9method 0x0000010024331BD8   java/util/concurrent/SynchronousQueue.xfer(Ljava/lang/Object;J)Ljava/lang/Object;
<100315fabd0>   !j9method 0x0000010024331C98   java/util/concurrent/SynchronousQueue.take()Ljava/lang/Object;
<100315fabd0>   !j9method 0x0000010024325BF8   Skynet$Channel.receive()Ljava/lang/Object;
<100315fabd0>   !j9method 0x0000010024325828   Skynet.skynet(LSkynet$Channel;III)V
<100315fabd0>   !j9method 0x0000010024325848   Skynet.lambda$skynet$1(LSkynet$Channel;III)V
<100315fabd0>   !j9method 0x0000010024325F68   Skynet$$Lambda/0x00000000243074b8.run()V
<100315fabd0>   !j9method 0x000001002321E480   java/lang/Thread.runWith(Ljava/lang/Object;Ljava/lang/Runnable;)V
<100315fabd0>   !j9method 0x000001002323F520   java/lang/VirtualThread.run(Ljava/lang/Runnable;)V
<100315fabd0>   !j9method 0x00000100244F0D50   java/lang/VirtualThread$VThreadContinuation$1.run()V
<100315fabd0>   !j9method 0x0000010023250E60   jdk/internal/vm/Continuation.enter(Ljdk/internal/vm/Continuation;)V
<100315fabd0>                           JNI call-in frame
<100315fabd0>                           Native method frame

And no carrier thread:

!j9object 0x0A0000000829B110 (VThread)
...
        Ljava/lang/Thread; carrierThread = !fj9object 0x0 (offset = 208) (java/lang/VirtualThread)
                I state = 0x00000004 (offset = 224) (java/lang/VirtualThread)  // PARKED

@babsingh, any suggestions given the similarity to #18910? Possibly a duplicate?

@babsingh
Contributor

babsingh commented Oct 9, 2024

"@babsingh, any suggestions given the similarity to #18910? Possibly a duplicate?"

@rmnattas Can you provide me a link to the core file, which was used during the above analysis?

@rmnattas
Contributor

rmnattas commented Oct 9, 2024

Build got auto-removed but uploaded it here: https://ibm.box.com/s/jx79twimu30p1gm181ku1mymzxz0ozwm
Core in: aqa-tests/TKG/output_17271688367563/jdk_lang_j9_0/work/java/lang/Thread/virtual/stress/Skynet_default

@babsingh
Contributor

babsingh commented Oct 9, 2024

@rmnattas It's not a duplicate of #18910. Unlike #18910, in the above core file the virtual thread is completely unmounted and not in an intermediate, unsteady state.

> !j9object 0x0A0000000829B110
!J9Object 0x0A0000000829B110 {
	struct J9Class* clazz = !j9class 0x1002323FF00 // java/lang/VirtualThread
	...
	Ljdk/internal/vm/Continuation; cont = !fj9object 0xa00000007060f20 (offset = 192) (java/lang/VirtualThread) <--- Contains the unmounted VThread's stack
	Ljava/lang/Thread; carrierThread = !fj9object 0x0 (offset = 208) (java/lang/VirtualThread) <--- NULL means the VThread is unmounted
	J inspectorCount = 0x0000000000000000 (offset = 176) (java/lang/VirtualThread) <hidden> <--- 0 means the VThread is in a steady state

> !fj9object 0xa00000007060f20
!J9Object 0x0A00000007060F20 {
	struct J9Class* clazz = !j9class 0x100244F1800 // java/lang/VirtualThread$VThreadContinuation
	J vmRef = 0x00000100315FABD0 (offset = 8) (jdk/internal/vm/Continuation) <--- Struct that stores the unmounted VThread's stack
 	...
}

> !j9vmcontinuation 0x00000100315FABD0
J9VMContinuation at 0x100315fabd0 { <--- Is the JIT walking the stack using walkContinuationStackFrames?
  Fields for J9VMContinuation:
	0x0: UDATA* arg0EA = !j9x 0x00000100315FBEB0
	0x8: UDATA* bytecodes = !j9x 0x0000000000000000
	0x10: UDATA* sp = !j9x 0x00000100315FBE88
	0x18: U8* pc = !j9x 0x0000000000000003
	0x20: struct J9Method* literals = !j9method 0x0000000000000000
	0x28: UDATA* stackOverflowMark = !j9x 0x00000100315FBDB0
	0x30: UDATA* stackOverflowMark2 = !j9x 0x00000100315FBDB0
	0x38: struct J9JavaStack* stackObject = !j9javastack 0x00000100315FADB0
	0x40: struct J9JITDecompilationInfo* decompilationStack = !j9jitdecompilationinfo 0x0000000000000000
	0x48: UDATA* j2iFrame = !j9x 0x0000000000000000
	0x50: struct J9JITGPRSpillArea jitGPRs = !j9jitgprspillarea 0x00000100315FAC20
	0x160: struct J9I2JState i2jState = !j9i2jstate 0x00000100315FAD30
	0x180: struct J9VMEntryLocalStorage* oldEntryLocalStorage = !j9vmentrylocalstorage 0x0000000000000000
	0x188: UDATA dropFlags = 0x0000000000000000 (0)
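
The three annotated fields in the dumps above can be read as a single predicate. The sketch below is hypothetical C: `VThreadSnapshot`, its field layout and `is_steady_unmounted` are illustrative stand-ins for values copied out of the core, not OpenJ9 structures, and the non-zero `vmRef` check is an extra sanity assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the fields read from the !j9object dumps. */
typedef struct VThreadSnapshot {
    const void *carrierThread;  /* NULL means the vthread is unmounted */
    int64_t inspectorCount;     /* 0 means the vthread is in a steady state */
    uint64_t vmRef;             /* address of the J9VMContinuation holding the stack */
} VThreadSnapshot;

/* True when the snapshot describes a fully unmounted vthread in a steady state. */
static bool is_steady_unmounted(const VThreadSnapshot *vt)
{
    assert(NULL != vt);
    return (NULL == vt->carrierThread)
        && (0 == vt->inspectorCount)
        && (0 != vt->vmRef);  /* assumption: an unmounted vthread has a continuation */
}
```

With the values from this core (`carrierThread` NULL, `inspectorCount` 0, `vmRef` 0x00000100315FABD0) the predicate holds, which is the basis for saying this is not the intermediate state seen in #18910.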

"When the parked virtual thread stack is walked it crashes with ..."

  • Can you confirm if the JIT is walking the unmounted vthread's stack using walkContinuationStackFrames?

Here is some existing code which walks the stack of an unmounted virtual thread:

if (IS_JAVA_LANG_VIRTUALTHREAD(currentThread, threadObject)) {
	if (NULL != targetThread) {
		/* A J9VMThread is available for the vthread: walk it directly. */
		walkState->walkThread = targetThread;
		rc = vm->walkStackFrames(currentThread, walkState);
	} else {
		/* Unmounted: the stack is held by the J9VMContinuation referenced from the Continuation object. */
		j9object_t contObject = (j9object_t)J9VMJAVALANGVIRTUALTHREAD_CONT(currentThread, threadObject);
		J9VMContinuation *continuation = J9VMJDKINTERNALVMCONTINUATION_VMREF(currentThread, contObject);
		vm->internalVMFunctions->walkContinuationStackFrames(currentThread, continuation, threadObject, walkState);
	}

  • If walkContinuationStackFrames is being used by the JIT, can you verify if the JIT specific fields in J9VMContinuation are valid?
  • J9VMContinuation contains fields from J9VMThread in order to store the vthread's state with minimal data while it is unmounted. Are any JIT-specific fields from J9VMThread, which are needed to preserve the vthread's state, missing in J9VMContinuation?

@zl-wang
Contributor

zl-wang commented Oct 9, 2024

we indeed can see walkContinuationStackFrames is being used by GC above in #19249 (comment) as below:

0x090000001a2ef578 {libj9vm29.so}{invalidJITReturnAddress} [0x10024003c00]
0x090000001ac7cd64 {libj9jit29.so}{jitWalkStackFrames} [0x10024003c70]
0x090000001a2ed058 {libj9vm29.so}{walkStackFrames} [0x10024003d00]
0x090000001a392a8c {libj9vm29.so}{walkContinuationStackFrames} [0x10024003df0]
0x090000001c8b8f9c {libj9gc_full29.so}{_ZN28GC_VMThreadStackSlotIterator21scanContinuationSlotsEP10J9VMThreadP8J9ObjectPvPFvP8J9JavaVMPS3_S4_P16J9StackWalkStatePKvEbb} [0x10024004a10]
0x090000001cb17830 {libj9gc_full29.so}{_ZN22MM_GlobalMarkingScheme10scanObjectEP19MM_EnvironmentVLHGCP8J9ObjectNS_10ScanReasonE} [0x10024004d40]
0x090000001cb18234 {libj9gc_full29.so}{_ZN22MM_GlobalMarkingScheme19markLiveObjectsScanEP19MM_EnvironmentVLHGC} [0x10024004e10]
...

now, with the comment in #19249 (comment), it points to potential problems in the faked J9VMThread used to walk the VThread. who sets up that faked J9VMThread? walkContinuationStackFrames? does it make sure everything needed for the walk is captured/copied in the set-up?

@zl-wang
Contributor

zl-wang commented Oct 10, 2024

@TobiAjila one thing we noticed early on: the faked J9VMThread is not 256-byte aligned, as below:

struct J9VMThread* walkThread = 0x0000010024003ec0

however, the low-order 8 bits all have other meanings, for locking at least. has anybody made sure that these (effectively set) bits do not interfere with the stack walk?
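
For reference, the alignment observation can be checked mechanically. This is a minimal C sketch with hypothetical helper names (`low_bits`, `is_aligned`); it only shows the arithmetic and says nothing about whether the walker actually inspects those bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits of addr that fall below a power-of-two alignment boundary. */
static uint64_t low_bits(uint64_t addr, uint64_t alignment)
{
    /* alignment must be a non-zero power of two */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return addr & (alignment - 1);
}

/* True when addr is a multiple of alignment. */
static bool is_aligned(uint64_t addr, uint64_t alignment)
{
    return 0 == low_bits(addr, alignment);
}
```

For walkThread = 0x0000010024003ec0, `low_bits(addr, 256)` is 0xc0, so the low-order 8 bits are not all zero.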

@zl-wang
Contributor

zl-wang commented Oct 10, 2024

@babsingh @TobiAjila from dissecting the core file and observing the failure, the difference is as below:

  1. it asserted in the run with "Invalid JIT return address" because metaData cannot be retrieved for code addr: 0000010011B341EC
    while ((walkState->jitInfo = jitGetExceptionTable(walkState)) != NULL) { // i.e. returning NULL on this line
  2. however, KCA can retrieve the metaData in core without problems on the same address as below (assembly listing with GCMap metaData inserted):
0x10011b341e8 GCMap  Bytecode -1:6  StackMap: 
0x10011b341e8 {Skynet.lambda$skynet$1} +30   -1:6    |||| 4bc11599 bl        0x10011745780 ^{Skynet.skynet} +4    <<< +28 
0x10011b341ec {Skynet.lambda$skynet$1} +31           |||| e80e0048 ld        r0, 0x48(r14) (SP+72) = 0x10011b34044 

in case this difference sparks some ideas ...
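
As a cross-check of the listing above, the failing address is exactly the return address of the `bl` at +30. A small C sketch, assuming fixed 4-byte Power instructions and reading the "+N" column as an instruction index from the +0 line of the earlier listing (0x10011b34170); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PPC_INSTRUCTION_BYTES 4u  /* Power instructions are fixed width */

/* Return address of a branch-and-link: the instruction after the call. */
static uint64_t return_address_of_call(uint64_t call_pc)
{
    return call_pc + PPC_INSTRUCTION_BYTES;
}

/* Distance of pc from entry_pc in instructions (the "+N" column). */
static uint64_t instruction_index(uint64_t entry_pc, uint64_t pc)
{
    assert(pc >= entry_pc);
    assert(0 == (pc - entry_pc) % PPC_INSTRUCTION_BYTES);
    return (pc - entry_pc) / PPC_INSTRUCTION_BYTES;
}
```

`return_address_of_call(0x10011b341e8)` gives 0x10011b341ec, and `instruction_index(0x10011b34170, 0x10011b341ec)` gives 31, matching the "+31" line, so the address the walker rejects is a well-formed return address inside the first compiled body.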

@babsingh
Contributor

++ @fengxue-IS for insights.

@fengxue-IS
Contributor

In walkContinuationStackFrames the VM uses J9_STACKWALK_NO_JIT_CACHE to disable the thread cache, to avoid having to manage a single-use cache.

@zl-wang @rmnattas is there any dependency on vmThread->jitArtifactSearchCache for such recompiled methods?
Looking at the code path, the carrierThread on which the vthread was mounted before the unmount could still be holding the jitInfo in its jitArtifactSearchCache, but the continuation stack will not have any reference to that cache since we are using a temporary stack thread for walking.
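
To make the question concrete, here is a rough C sketch of a PC-to-metadata lookup with an optional per-thread cache. Everything in it is a hypothetical stand-in (`WALK_NO_JIT_CACHE`, `CodeRange`, `WalkThread`, the linear scan in place of the AVL tree); it is not the OpenJ9 implementation, only the shape of the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WALK_NO_JIT_CACHE 0x1u  /* hypothetical flag value */

typedef struct CodeRange {
    uint64_t start;  /* first byte of the compiled body */
    uint64_t end;    /* one past the last byte */
} CodeRange;

typedef struct WalkThread {
    const CodeRange *cached;  /* single-entry per-thread cache, may be NULL */
} WalkThread;

/* Authoritative lookup: a linear scan standing in for the AVL tree walk. */
static const CodeRange *global_lookup(const CodeRange *table, size_t n, uint64_t pc)
{
    for (size_t i = 0; i < n; i++) {
        if (pc >= table[i].start && pc < table[i].end) {
            return &table[i];
        }
    }
    return NULL;
}

/* Consult the per-thread cache unless the walk asked to bypass it. */
static const CodeRange *find_metadata(WalkThread *t, unsigned flags,
                                      const CodeRange *table, size_t n, uint64_t pc)
{
    assert(NULL != t);
    int use_cache = (0 == (flags & WALK_NO_JIT_CACHE));
    if (use_cache && NULL != t->cached
            && pc >= t->cached->start && pc < t->cached->end) {
        return t->cached;  /* cache hit */
    }
    const CodeRange *found = global_lookup(table, n, pc);
    if (use_cache && NULL != found) {
        t->cached = found;  /* populate the cache only when it is in use */
    }
    return found;
}
```

In this model the cache is only a shortcut: with the bypass flag set, the result depends solely on the global structure, so a walk on a temporary thread should find the same entry as long as that entry is still registered there.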

@zl-wang
Contributor

zl-wang commented Oct 10, 2024

"is there any dependency on vmThread->jitArtifactSearchCache for such recompiled methods?"

@fengxue-IS i don't think so. from the walker's perspective, there is no difference between a recompiled and a non-recompiled method. in the background, once a method is recompiled, the space occupied by the previous compilation is queued to the FaintBlockList (i.e. it is a candidate to be freed as early as the next GC, when stack-walking doesn't find it active on any thread; when it is freed, the associated metaData range is adjusted as well). in this case, obviously the FaintBlock hasn't been freed yet, since it is still active on that VThread at least and KCA can still find its metaData.

@JamesKingdon
Contributor

We have a new case TS017553893 with an assert from swalk.c:1629, is there anything I should look for in the corefiles to increase confidence that it's the same issue as being explored here?

@zl-wang
Contributor

zl-wang commented Oct 10, 2024

@JamesKingdon curious which platform this new case is on ...
that assert can be reached from two places: jitWalkStackFrames and jitWalkResolveMethodFrame both appear to call that assert upon failing to look up the metaData for a jit-ed code address. so, at least, you can check the following in the core file to increase confidence that it is a dup of this one:

  1. the reported jit-ed code address is still valid (the containing compilation might have been recompiled, as in this one)
  2. KCA at least can still find its metaData
  3. not sure if this is significant to the failure though: the java application is using virtual threads

@JamesKingdon
Contributor

JamesKingdon commented Oct 10, 2024

The problem is happening on amd64. Unfortunately I think the corefile is probably truncated; I'm having a lot of trouble making sense of it. The stack was:

0x00007f2c63888aff {libj9vm29.so}{invalidJITReturnAddress} [0x7f2c63fe3390]
0x00007f2c629ff87c {libj9jit29.so}{jitWalkStackFrames} [0x7f2c63fe33b0]
0x00007f2c63887836 {libj9vm29.so}{walkStackFrames} [0x7f2c63fe3410]
0x00007f2c63842870 {libj9vm29.so}{Fast_java_lang_Throwable_fillInStackTrace} [0x7f2c63fe3490]

I can't identify the jit return address, but it looks like code. One thing that is catching my eye is that there are signs of native crypto code on the failing stack.

@rmnattas
Contributor

@babsingh as mentioned, walkContinuationStackFrames is being used, and I don't see the stack-walking code needing other values from J9VMThread.
Re valid values, the J9VMThread has vmThread.pc = 0x03, which may be similar to the mac/x86 issue, but I can't reproduce the failure to log the value. I'm not sure what would change it here if it's walking JITed frames, i.e. it should remain in jitWalkStackFrames.

All non-zero values:

(kca) s J9VMThread 0x0000010024003ec0
J9VMThread (2832 bytes)
                             struct J9JavaVM*  javaVM = 0x0000010010131e10 (offset: 8)
                                       UDATA*  arg0EA = 0x00000100315fbeb0 (offset: 16)
                                           UDATA*  sp = 0x00000100315fbe88 (offset: 32)
                                              U8*  pc = 0x0000000000000003 (offset: 40) // Warning: ptr is not correctly aligned!
                            UDATA*  stackOverflowMark = 0x00000100315fbdb0 (offset: 80)
                           UDATA*  stackOverflowMark2 = 0x00000100315fbdb0 (offset: 88)
                     struct J9JavaStack*  stackObject = 0x00000100315fadb0 (offset: 312)
     struct J9VMEntryLocalStorage*  entryLocalStorage = 0x0000010024003e60 (offset: 600)
        struct J9VMGCSublistFragment  gcRememberedSet   (addr: 0x0000010024004128) (offset: 616)
 struct MM_GCRememberedSetFragment  sATBBarrierRememberedSetFragment   (addr: 0x0000010024004158) (offset: 664)
        struct J9StackWalkState  inlineStackWalkState   (addr: 0x00000100240041c0) (offset: 768)
 struct J9ModronThreadLocalHeap  allocateThreadLocalHeap   (addr: 0x0000010024004480) (offset: 1472)
 struct J9ModronThreadLocalHeap  nonZeroAllocateThreadLocalHeap   (addr: 0x00000100240044b0) (offset: 1520)
               struct J9DLTInformationBlock  dltBlock   (addr: 0x0000010024004590) (offset: 1744)
        j9objectmonitor_t[]  objectMonitorLookupCache   (addr: 0x0000010024004790) (offset: 2256)
                        void*  jitArtifactSearchCache = 0x0000000000000001 (offset: 2600) // Warning: ptr is not correctly aligned!
                  struct J9GSParameters  gsParameters   (addr: 0x0000010024004938) (offset: 2680)

Also, I'm not sure if it was supposed to be for another field, but I was told that it indicates the type of what's being executed?

                        J9SF_FRAME_TYPE_NATIVE_METHOD = 0x0 / 0x3 (constant)
                      J9SF_FRAME_TYPE_JIT_JNI_CALLOUT = 0x0 / 0x6 (constant)
                          J9SF_FRAME_TYPE_JIT_RESOLVE = 0x0 / 0x5 (constant)
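
One possible reading, sketched below in C under the assumption that a small pc value is a frame-type tag rather than a code address: among the three constants printed above, pc = 0x3 matches J9SF_FRAME_TYPE_NATIVE_METHOD. `special_frame_name` is a hypothetical helper, and only the three values shown in the kca output are used.

```c
#include <stddef.h>
#include <stdint.h>

/* Values as printed in the kca output above. */
#define J9SF_FRAME_TYPE_NATIVE_METHOD   0x3
#define J9SF_FRAME_TYPE_JIT_RESOLVE     0x5
#define J9SF_FRAME_TYPE_JIT_JNI_CALLOUT 0x6

/* Hypothetical helper: name the frame type a small pc value would denote,
 * or NULL if the value is not one of the three constants listed. */
static const char *special_frame_name(uint64_t pc)
{
    switch (pc) {
    case J9SF_FRAME_TYPE_NATIVE_METHOD:
        return "native method";
    case J9SF_FRAME_TYPE_JIT_RESOLVE:
        return "JIT resolve";
    case J9SF_FRAME_TYPE_JIT_JNI_CALLOUT:
        return "JIT JNI callout";
    default:
        return NULL;  /* not one of the listed tags, e.g. a real code address */
    }
}
```

If that reading is right, `special_frame_name(0x3)` naming a native method frame would be consistent with the top of the !continuationstack output, where the vthread is parked inside Continuation.yieldImpl.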

@rmnattas
Contributor

Suggesting moving to 0.49.
Given that this is not a duplicate of the other POWER DAA issues, it's not as frequent or as high-risk a failure.
Continuing investigating in the meantime.

@JamesKingdon
Contributor

The asserts from swalk.c in TS017553893 were caused by the new memory disclaiming feature. The backing files were being written to an nfs mounted directory and the majority of the jit data cache was reading as 0. Using -Xjit:disableDataCacheDisclaiming,disableIprofilerDataDisclaiming avoided the crashes.

@zl-wang
Contributor

zl-wang commented Oct 21, 2024

"The backing files were being written to an nfs mounted directory and the majority of the jit data cache was reading as 0."

is that a defined NFS behaviour?

@JamesKingdon
Contributor

JamesKingdon commented Oct 21, 2024

Not as far as I could find, it looks like mmap is supported on nfs with some caveats about distributed sharing. I'm wondering if there's some unexpected behaviour around us immediately unlinking the backing files. It took me a little longer to recognize the problem because the files didn't have our omrvmem* names, but had the special .nfs* names for pending deletes.
Another complexity, in case it's relevant; the nfs client and server were colocated, so while we would have been writing to an nfs mounted directory the core file was showing the underlying physical files in the server filesystem.

@hzongaro
Member

@rmnattas, I am assuming this should move out to the 0.51 release.

@JasonFengJ9
Member

openjdk21_j9_sanity.openjdk_ppc64le_linux(sles15le-rtp-rt7-1)

[2025-01-06T10:12:23.959Z] variation: -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage Mode501 -XXgc:fvtest_forceCopyForwardHybridMarkCompactRatio=10
[2025-01-06T10:12:23.959Z] JVM_OPTIONS:  -Xdump:system:none -Xdump:heap:none -Xdump:system:events=gpf+abort+traceassert+corruptcache -XX:-JITServerTechPreviewMessage -Xjit -Xgcpolicy:balanced -Xnocompressedrefs -XXgc:fvtest_forceCopyForwardHybridMarkCompactRatio=10 -Xverbosegclog 

[2025-01-06T10:49:06.182Z] TEST: java/lang/Thread/virtual/stress/TimedGet.java

[2025-01-06T10:49:06.183Z] *** Invalid JIT return address 00007FFF292007FC in 00007FFF1275BA68
[2025-01-06T10:49:06.183Z] 
[2025-01-06T10:49:06.183Z] 10:48:50.819 0x7ffed8002400    j9vm.249    *   ** ASSERTION FAILED ** at /home/jenkins/workspace/build-scripts/jobs/jdk21u/jdk21u-linux-ppc64le-openj9-IBM/workspace/build/src/openj9/runtime/vm/swalk.c:1633: ((0 ))

[2025-01-06T10:49:06.184Z] TEST RESULT: Failed. Unexpected exit from test [exit code: 255]
[2025-01-06T10:49:06.184Z] --------------------------------------------------
[2025-01-06T10:54:34.133Z] Test results: passed: 936; failed: 1
[2025-01-06T10:55:09.930Z] Report written to /home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64le_linux_testList_0/aqa-tests/TKG/output_17361557429618/jdk_lang_j9_1/report/html/report.html
[2025-01-06T10:55:09.930Z] Results written to /home/jenkins/workspace/Test_openjdk21_j9_sanity.openjdk_ppc64le_linux_testList_0/aqa-tests/TKG/output_17361557429618/jdk_lang_j9_1/work
[2025-01-06T10:55:09.930Z] Error: Some tests failed or other problems occurred.
[2025-01-06T10:55:09.930Z] -----------------------------------
[2025-01-06T10:55:09.930Z] jdk_lang_j9_1_FAILED
