Apache Lucene CI builds sometimes fail with OpenJ9 specific issues #18400
Quick update: it turns out some of those failures were in fact a bug in Lucene! So please ignore those failures. But I think the above exception is possibly J9-specific.
@mikemccand Thanks for letting us know. Do you have instructions on how we can reproduce the issue here?
Hi @tajila -- normally the Lucene build failures include a nice `Reproduce with:` command line. If you clone Lucene's repository and run that command, it will try to reproduce the failure. However, I just tried that with OpenJ9 and the failure did not reproduce. When this happens (failure to reproduce the test on one run), we can sometimes add another parameter to run the test many times.
And the failure reproduced!
This doesn't repro with OpenJDK 21 (I had to remove the …).
Thanks @mikemccand, I was able to reproduce this with:
Great! Here is another possibly J9-specific failure: https://jenkins.thetaphi.de/job/Lucene-main-Linux/45480/
There's some OpenJ9-specific discussion about this build failure: https://lists.apache.org/thread/svt6bqqwdkb4kq7b9zhx630n4sj27ovq
This is unrelated to OpenJ9; it is caused by errorprone not working with forked compilers.
Whoops, thanks for the correction @uschindler. This one looks maybe unique to OpenJ9? https://jenkins.thetaphi.de/job/Lucene-9.x-Linux/14098
Yes it is. This is clearly a bug: it looks like it is flipping bits in the vInt encoding due to some problem in code optimizations. It only happens with OpenJ9, and it is an ongoing issue (it happens quite often).
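For readers unfamiliar with the format: Lucene's vInt is a variable-length encoding that stores an int in one to five bytes, seven payload bits per byte, with the high bit signalling that another byte follows. The sketch below is a minimal standalone illustration of that scheme (not Lucene's actual DataOutput/DataInput code); it shows why a single flipped bit corrupts either the decoded value or the number of bytes consumed, after which everything read from the stream goes off the rails.

```java
import java.io.ByteArrayOutputStream;

// Minimal sketch of a vInt-style encoding (7 data bits per byte, high bit = "more bytes follow").
// Illustration only, not Lucene's actual DataOutput/DataInput implementation.
public class VIntSketch {
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {        // more than 7 bits remaining
      out.write((value & 0x7F) | 0x80);   // low 7 bits, continuation bit set
      value >>>= 7;
    }
    out.write(value);                     // final byte, continuation bit clear
  }

  static int readVInt(byte[] in, int pos) {
    int value = 0;
    for (int shift = 0; ; shift += 7) {
      byte b = in[pos++];
      value |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) return value;  // continuation bit clear -> done
    }
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVInt(out, 300);                                   // encodes as 0xAC 0x02
    System.out.println(readVInt(out.toByteArray(), 0));    // prints 300
  }
}
```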
The above run also produced other errors. What is interesting: once OpenJ9 gets into an invalid state, it then produces more and more errors in tests as follow-up. The above run also broke LZ4 compression:
The main issue here looks like …
@tajila I am investigating the problem in the GitHub issue.
@tajila how did you ensure that the …?
I am working on a solution to disable Java Flight Recorder support when the module isn't there. This is just a build system issue and unrelated to the issues here. During Lucene builds we always use Hotspot VMs, but you can run the test suite with any other JVM. Just pass …
Hi, I created apache/lucene#12845 to work around the J9 VM not having Java Flight Recorder by checking for the module:
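For reference, the standard way to test for an optional platform module at runtime is via the boot module layer. Below is a minimal sketch of that check; the actual change in apache/lucene#12845 lives in the build and may look different.

```java
// Sketch: detect whether the optional jdk.jfr module is present, so JFR-dependent
// code can be skipped on VMs that do not ship it (e.g. OpenJ9 builds).
public class JfrCheck {
  public static void main(String[] args) {
    boolean jfrAvailable = ModuleLayer.boot().findModule("jdk.jfr").isPresent();
    System.out.println(jfrAvailable
        ? "jdk.jfr present; JFR support can be enabled"
        : "jdk.jfr not present; disabling Java Flight Recorder support");
  }
}
```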
The code was merged; if you update Lucene's main branch, you should be able to test this with pure OpenJ9 (also to run Gradle).
I was able to reproduce the test failure in …
@tajila what could the next step(s) be to find the root cause of the problem, based on the analysis below? It seems like creating a micro test that reproduces the issue, or identifying the possible commit(s) in the Lucene repo that may have caused it, could be next steps.
Configurations analyzed: openj9 jdk17, openj9 jdk21, hotspot jdk17, hotspot jdk21.
I would start by looking at what causes the assertion failure. Since you have access to the source, you can instrument it to see where the discrepancy arises.
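One way to build such an instrumented micro test, sketched below under the assumption that the discrepancy is a JIT miscompile: run the suspect operation once while it is still interpreted, record the result, then repeat it enough times for the JIT to compile it and fail on the first differing result. The class and method names are illustrative, not existing Lucene code.

```java
import java.util.Random;

// Hypothetical micro-test skeleton: run a deterministic operation once (interpreted),
// then many more times so the JIT compiles it, and fail on the first result that differs.
public class JitDiscrepancyCheck {
  // Stand-in for the suspect code path (e.g. an encode/decode round trip).
  static long suspectOperation(long seed) {
    Random r = new Random(seed);
    byte[] buf = new byte[1024];
    r.nextBytes(buf);
    long acc = 0;
    for (byte b : buf) {
      acc = acc * 31 + (b & 0xFF);
    }
    return acc;
  }

  public static void main(String[] args) {
    long expected = suspectOperation(42L);   // first run, most likely interpreted
    for (int i = 0; i < 1_000_000; i++) {    // enough iterations to trigger JIT compilation
      long actual = suspectOperation(42L);
      if (actual != expected) {
        throw new AssertionError("iteration " + i + ": expected " + expected + " but got " + actual);
      }
    }
    System.out.println("no discrepancy detected");
  }
}
```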
The intermittent assertion failure in …
Thanks for the report. Indeed, the method can be synchronized, because newWriter() is synchronized anyway, so it won't add more contention. @mikemccand do you want to take the lead, as you know IndexWriter better than I do? I think the more serious issues here are the ones corrupting data while reading; there's no concurrency involved: #18400 (comment)
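To illustrate the shape of the fix being discussed, here is a generic sketch (not Lucene's actual IndexWriter code): when a field is only ever written inside synchronized methods, an unsynchronized reader can observe a stale value, and synchronizing the reader closes the race without adding a new source of contention.

```java
// Generic sketch of the pattern discussed above; not Lucene's actual IndexWriter code.
class WriterPool {
  private int openWriters;

  // Writes happen under the object lock.
  synchronized void newWriter() {
    openWriters++;
  }

  // Racy variant: without synchronization a concurrent reader may observe a stale count.
  int openWriterCountRacy() {
    return openWriters;
  }

  // Fix: synchronize the getter. Since newWriter() already takes the same lock,
  // this does not introduce a new source of contention.
  synchronized int openWriterCount() {
    return openWriters;
  }
}
```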
I was able to reproduce the failure in …
@tajila the cause of the failure in …
Thanks @singh264 and @uschindler -- I'll open a Lucene GitHub issue to track this on Lucene's side! Love the open-source collaboration here!
@tajila would it be appropriate to request the GC or JIT team to assist in finding the root cause of the failure in …?
A workaround for the failure in …
@dmitripivkine requesting your feedback on this issue. This issue describes two Lucene test failures (testSegmentCountOnFlushRandom and testSortedVariableLengthBigVsStoredFields) whose cause seems specific to OpenJ9:
@hzongaro please take a look; #18400 (comment) indicates a JIT issue.
The fact that Balanced and Metronome work with -Xint most likely points to the JIT. Both of these GC policies require the JIT to be configured differently (mostly disabled optimizations). So I think a JIT investigation is a reasonable next step. If you have a reason to think it is GC related, we need a system core captured at the assertion point. Even with this core it might be hard to connect the failure to GC activity, so an explanation of the detected discrepancies would be appreciated.
A core file was generated with:
@BradleyWood, may I ask you to take a look at the potentially JIT-related failures reported here?
@singh264 I have no access to the file; you should change the permissions in Box. Also, please summarize the problem for this particular core (deeper than the trivial output you have provided).
The core file can now be downloaded with the Google Drive link; I was unable to extend the expiration date of the Box link.
A RuntimeException is thrown by …
The expected behaviour is that …
I'd like guidance on exploring these suggestions to understand how GC-related components impact the test behaviour. I will investigate the utilization of the bytes array with the default …
One detail specific to Balanced (and of course Metronome) is object …
First, @singh264, would you please try this experiment: increase the region size from the current 512k (0x80000) to 1m or even higher. I don't know whether the test can use arrays larger than the current …
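To make the experiment concrete, a rough back-of-the-envelope check is to compare the largest array the test allocates against the region size, since Balanced (and Metronome) represent arrays that do not fit into a single region discontiguously. In the sketch below only the 512k region size comes from the comment above; the array length and header allowance are illustrative placeholders.

```java
// Rough, illustrative sanity check: would the test's largest byte[] fit in one GC region?
public class RegionSizeCheck {
  public static void main(String[] args) {
    long regionSize = 0x80000;           // current region size from the comment above: 512 KB
    long headerOverhead = 16;            // rough allowance for the array header (VM dependent)
    long largestArrayBytes = 1_000_000;  // hypothetical: largest byte[] the test allocates

    boolean fitsInOneRegion = largestArrayBytes + headerOverhead <= regionSize;
    System.out.println("fits in one " + regionSize + "-byte region: " + fitsInOneRegion);
    System.out.println("regions needed (approx): "
        + ((largestArrayBytes + headerOverhead + regionSize - 1) / regionSize));
  }
}
```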
When running TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields:
The test TestLucene90DocValuesFormat.testSortedVariableLengthBigVsStoredFields passes with …
FYI @0xdaryl @hzongaro @jdmpapin @nbhuiyan: earlier in our meeting I referred to this user issue; the failure is most likely related to OpenJ9 MHs, and it is resolved with OJDK MHs as per #18400 (comment).
How can I check the status of OJDK MHs, given that the failure is most likely related to OpenJ9 MHs and is resolved with OJDK MHs as per #18400 (comment)?
It seems that OJDK MHs currently cannot be enabled until the performance issues are resolved.
How can I find information about the performance issues that need to be resolved? |
For JDK8 and JDK11, OJDK MH perf is being tracked through #12728. We are trying different solutions to address the perf issues. In the meantime, the workarounds mentioned in the previous comments can be used to address the failure. JDK17+ can also be used to address the failure, since OJDK MHs are enabled in JDK17+.
How can I know whether it is currently feasible to use JDK17+ to address the failure, since OJDK MHs are enabled in JDK17+?
Your previous test in #18400 (comment) implies that the issue will be addressed with JDK17+. Nothing more needs to be done. |
Hello from the Apache Lucene project, where we run CI builds with various JVMs and versions, including OpenJ9.
We sometimes see CI failures that seem to be OpenJ9-specific, and we don't have many dev resources to pursue them. You can see the recent failures in the Apache mailing list archives. Each test failure includes a "Reproduce with:" command line that in theory will recreate the failure in your dev area. This one looks more interesting:
I'm not sure if this is actionable by any OpenJ9 devs, but I wanted at least to establish some contact between our two projects so that together we could somehow get to the root cause of some of these issues. The root cause might be in Lucene (e.g. failing to properly detect the number of bytes for an object pointer in the current JVM -- there is some JVM-vendor-specific logic, I think) or might be something specific to OpenJ9.
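For what it's worth, that vendor-specific logic is usually some variant of asking the HotSpot diagnostic MXBean whether compressed oops are enabled, and falling back to a heuristic when that bean is unavailable (as on non-HotSpot VMs). The sketch below shows that general pattern; it is not necessarily what Lucene's RamUsageEstimator actually does.

```java
import java.lang.management.ManagementFactory;

// Hedged sketch of vendor-specific detection of object-pointer size: ask the HotSpot
// diagnostic bean whether compressed oops are enabled, and fall back to a heap-size
// heuristic when the bean is not available (e.g. on OpenJ9).
public class OopSizeProbe {
  public static void main(String[] args) {
    int refBytes;
    try {
      // HotSpot-only diagnostic bean, looked up reflectively so this also runs on VMs
      // that do not provide it.
      Class<?> beanClass = Class.forName("com.sun.management.HotSpotDiagnosticMXBean");
      Object bean = ManagementFactory.class
          .getMethod("getPlatformMXBean", Class.class)
          .invoke(null, beanClass);
      Object option = beanClass.getMethod("getVMOption", String.class)
          .invoke(bean, "UseCompressedOops");
      String value = (String) option.getClass().getMethod("getValue").invoke(option);
      refBytes = Boolean.parseBoolean(value) ? 4 : 8;
    } catch (ReflectiveOperationException | RuntimeException e) {
      // Bean or option not available: guess based on the maximum heap size.
      refBytes = Runtime.getRuntime().maxMemory() <= 32L * 1024 * 1024 * 1024 ? 4 : 8;
    }
    System.out.println("assumed bytes per object reference: " + refBytes);
  }
}
```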
We seem to be testing this OpenJ9 version:
64bit/openj9/jdk-17.0.5