Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug On Restore: invalidate all method bodies if needed #20741

Closed
wants to merge 1 commit into from

Conversation

dsouzai
Copy link
Contributor

@dsouzai dsouzai commented Dec 3, 2024

Under -XX:+DebugOnRestore, if -Xnoaot -Xjit is specified post restore, the compiler invalidates all compiled bodies. It does so by calling TR::Recompilation::invalidateMethodBody. This, in tandem with other factors described below, can lead to a hang:

  1. A method (foo) gets compiled pre-checkpoint.
  2. foo gets recompiled proactively in the checkpoint hook. This patches the startPC of the old body to jump to a helper that forwards execution to the new body.
  3. Post-restore, -Xnoaot -Xnojit is specified, resulting in the startPC of the recompiled foo getting patched to trigger a sync recompilation
  4. Because of -Xnoaot -Xnojit, the sync recompilation fails; this results in the startPC of the recompiled foo (from step 2) getting patched to revert to the interpreter.
  5. Once in the interpreter, the extra field of the J9Method of foo gets reset to 0x1 (J9_STARTPC_NOT_TRANSLATED)
  6. On the next (interpreted) invocation of foo, the interpreter triggers a compilation request (because the count is technically 0)
  7. The compilation fails (because of -Xnoaot -Xnojit), which results in the extra getting set to -3 (J9_JIT_NEVER_TRANSLATE)
  8. Some compiled method has a direct call to the original body of foo (step 1).
  9. The helper cannot determine the new body to forward execution to (because the extra of the J9method of foo is set to -3); so it requests a compilation.
  10. The compilation request sequence has an early return in j9jit_testarossa_err, because compilation happens asynchronously and because linkageInfo->isBeingCompiled() returns true; this flag never gets reset even after the recompilation succeeds in step 2.
  11. Execution returns back to the startPC; return to step 9.

The fix for this hang is to call TR::Recompilation::methodCannotBeRecompiled rather than TR::Recompilation::invalidateMethodBody, and to call it on all compiled bodies, including the old stubs of bodies that were recompiled. This is OK to do because this happens anyway for the "invalidated" recompiled body in step 4, and this would have happened in step 10 if the early return never occurred.

It remains to be determined whether resetting the flag in step 10 to prevent the early return is safe.

Fixes #20663

I'll open another issue to track fixing this in a more general case (as this issue is technically possible even outside of -XX:+DebugOnRestore and CRIU.

Under -XX:+DebugOnRestore, if -Xnoaot -Xjit is specified post restore,
the compiler invalidates all compiled bodies. It does so by calling
TR::Recompilation::invalidateMethodBody. This, in tandem with other
factors described below, can lead to a hang:

1. A method (foo) gets compiled pre-checkpoint.
2. foo gets recompiled proactively in the checkpoint hook. This patches
   the startPC of the old body to jump to a helper that forwards
   execution to the new body.
3. Post-restore, -Xnoaot -Xnojit is specified, resulting in the
   startPC of the recompiled foo getting patched to trigger a
   sync recompilation
4. Because of -Xnoaot -Xnojit, the sync recompilation fails; this
   results in the startPC of the recompiled foo (from step 2) getting
   patched to revert to the interpreter.
5. Once in the intepreter, the extra field of the J9Method of foo gets
   reset to 0x1 (J9_STARTPC_NOT_TRANSLATED)
6. On the next (interpreted) invocation of foo, the interpreter triggers
   a compilation request (because the count is technically 0)
7. The compilation fails (because of -Xnoaot -Xnojit), which results in
   the extra getting set to -3 (J9_JIT_NEVER_TRANSLATE)
8. Some compiled method has a direct call to the original body of foo
   (step 1).
9. The helper cannot determine the new body to forward execution to
   (because the extra of the J9method of foo is set to -3); so it
   requests a compilation.
10. The compilation request sequence has an early return in
    j9jit_testarossa_err, because compilation happens asynchronously and
    because linkageInfo->isBeingCompiled() returns true; this flag never
    gets reset even after the recompilation succeeds in step 2.
11. Execution returns back to the startPC; return to step 9.

The fix for this hang is to call
TR::Recompilation::methodCannotBeRecompiled rather than
TR::Recompilation::invalidateMethodBody, and to call it on all compiled
bodies, including the old stubs of bodies that were recompiled. This is
OK to do because this happens anyway for the "invalidated" recompiled
body in step 4, and this would have happened in step 10 if the early
return never occurred.

It remains to be determined whether resetting the flag in step 10 to
prevent the early return is safe.

Signed-off-by: Irwin D'Souza <[email protected]>
@dsouzai dsouzai added comp:jit criu Used to track CRIU snapshot related work labels Dec 3, 2024
@dsouzai dsouzai marked this pull request as draft December 3, 2024 20:50
@dsouzai
Copy link
Contributor Author

dsouzai commented Dec 4, 2024

Found a better solution; closing this.

@dsouzai dsouzai closed this Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:jit criu Used to track CRIU snapshot related work
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cmdLineTester_criu_jitPostRestore Test -Xnojit -Xnoaot hang
1 participant