-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concurrency bug DocumentsWriterPerThreadPool.getAndLock()
uncovered by OpenJ9 test failures?
#12916
Comments
Hi, Just this one isn't and this causes trouble. I think the fix mentioned here is fine and fixes a real bug which may also happen on Hotspot on some other hardware platform. It does not fail on x86 with hotspot. ARM could fail due to other memory behaviour and order. I will provide a PR to add synchronized tomorrow. |
Does someone understand if adding synchronization is fixing a real bug of it it just helps hide a J9 bug? This method is subject to contention and #12199 was about avoiding locking on this method, which proved to help significantly when indexing cheap documents on several threads. I'm looking at the test and the code and can't see what sort of race condition may cause this test to fail. |
This is a real bug and not one of openj9. You can reproduce this bug with enough tries on hotspot, too. |
They made statistics in the linked issue. Hotspot also fails. So they rejected it as openj9 issue. |
Thanks I had missed that, I'll look more into it. |
See this comment: eclipse-openj9/openj9#18400 (comment) |
Thanks. I'm trying to reproduce failures locally with the following command, without luck so far with JDK 17 and JDK 21. I'll dig more tomorrow. If any of you manages to reproduce, I'm interested in the command that you used.
|
Thanks. Maybe the OpenJ9 people can help how they reproduced. You can use More looking into that code; I think the whole code here (and IndexWriter in general) should be freed of ancient "synchronized" blocks and methods and should instead use the more modern Java synchronization patterns inclusive volatile / opaque reads and barriers. The problematic concurrency could for sure be solved with some better algorithm that allows to read lockfree and only write with locks (e.g., ReadWriteLock instead of synchronized). |
I made test fail on my AMD Ryzen 3700 (the Policeman Jenkins Sever) with Hotspot 17.0.9: $ java -version
openjdk version "17.0.9" 2023-10-17
OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode, sharing)
$ ./gradlew -p lucene/core -Dtests.seed=F7B4CD7A5624D5EC beast --tests TestIndexWriterThreadsToSegments.testSegmentCountOnFlushRandom -Dtests.jvmargs="-XX:+UseCompressedOops" -Ptests.iters=1000 -Ptests.dups=100 It failed on the 3rd beasting round:
|
I was able to reproduce the failure in
|
…ll` on a non-empty queue. Before this change, `ConcurrentApproximatePriorityQueue#poll` could sometimes return `null` even though the queue was empty at no point in time. The practical implication is that we can end up with more DWPTs in memory than indexing threads, which, while not a bug, is probably not efficient. I ran luceneutil's `IndexGeonames` with this change and could not notice a slowdown while this benchmark has been good at exhibiting contention issues in the past. This is likely due to the fact that most of the time, a DWPT can be obtained from the pool using the optimistic path, which doesn't block. Closes apache#12649 apache#12916
…ll` on a non-empty queue. Before this change, `ConcurrentApproximatePriorityQueue#poll` could sometimes return `null` even though the queue was empty at no point in time. The practical implication is that we can end up with more DWPTs in memory than indexing threads, which, while not strictly a bug, may require doing more merging than we'd like later on. I ran luceneutil's `IndexGeonames` with this change, and `ConcurrentApproximatePriorityQueue#poll` was not the main source of contention. I instrumented the code to check how many DWPTs got pulled from the queue using the optimistic path vs. pessimistic path and got 8525598 for the optimistic path vs. 12247 for the pessimistic path. Closes apache#12649 apache#12916
I was not able to reproduce the failure, but could create an isolated test on |
Ok, maybe the issue happens more often with OpenJ9 or AMD Ryzen CPUs. Should I take your PR and check with beasting on Policeman's Ryzen to see if I can reproduce with the above command? I can also test OpenJ9. |
Is there an explanation why it works better when making the method synchronized as suggested by the OpenJ9 people? |
That would be great. I have a Ryzen too, though a AMD Ryzen 9 3900X. I'm not sure what factors influence the reproducibility of this bug. The good news is that the new test on #12959 fails 100% of the time for me without the fix.
My understanding is that the way that the bug manifests looks like this:
And then the following sequence of events: Making the method synchronized helps by forcing T2 to wait until T1 is done pulling a DWPT from the queue before starting doing the same. The PR tries to go a bit more granular to limit contention. |
OK, it is running now with above command line (beasting) on OpenJDK Temurin 17.0.9. |
Hi, It is a different assertion error without message.
|
Nah it is exactly same error. I verified I am on right branch:
|
Thanks I'll keep digging. |
Did you run my above "beasting" command? It runs everything 100 times with each 1000 iterations. For me it fails in most cases after 3rd or fourth try. |
Also keep to have the machine busy at same moment (Jenkins jobs are running at same time). This may increase chance to hit this. |
I tried it this morning and stopped it after some time (I don't remember how long, but tens of minutes). I just started it again and no failures in the first 6 rounds (40 minutes), I'll let it run overnight to see if I can get it to fail. However I can get the test to fail with OpenJ9. Maybe I can use it to better understand what is happening.
Good idea. |
It took a long time (3.5h) but it ended up failing. And then I could also get it to fail on my branch, with both Hotspot and J9. I'll keep looking, most likely after the holidays. |
The latest PR seems to fix the issue. |
…on a non-empty queue. (#12959) Before this change, `DocumentsWriterPerThread#getAndLock` could sometimes return `null` even though the queue was empty at no point in time. The practical implication is that we can end up with more DWPTs in memory than indexing threads, which, while not strictly a bug, may require doing more merging than we'd like later on. I ran luceneutil's `IndexGeonames` with this change, and `DocumentsWriterPerThread#getAndLock` was not the main source of contention. Closes #12649 #12916
…on a non-empty queue. (#12959) Before this change, `DocumentsWriterPerThread#getAndLock` could sometimes return `null` even though the queue was empty at no point in time. The practical implication is that we can end up with more DWPTs in memory than indexing threads, which, while not strictly a bug, may require doing more merging than we'd like later on. I ran luceneutil's `IndexGeonames` with this change, and `DocumentsWriterPerThread#getAndLock` was not the main source of contention. Closes #12649 #12916
…on a non-empty queue. (apache#12959) Before this change, `DocumentsWriterPerThread#getAndLock` could sometimes return `null` even though the queue was empty at no point in time. The practical implication is that we can end up with more DWPTs in memory than indexing threads, which, while not strictly a bug, may require doing more merging than we'd like later on. I ran luceneutil's `IndexGeonames` with this change, and `DocumentsWriterPerThread#getAndLock` was not the main source of contention. Closes apache#12649 apache#12916
Closing as this issue has been addressed. |
Description
There is an exciting upstream (OpenJ9) comment here (thank you @singh264), copied below:
The intermittent assertion failure in testSegmentCountOnFlushRandom seems to have started after bae5b33 was added in the lucene repo, and a solution to fix the intermittent assertion failure is to make the DocumentsWriterPerThreadPool.getAndLock() method synchronized as it seems to fix a race condition that occurs while multiple threads concurrently add documents to an index during the test.
I have not looked closely yet ...
Version and environment details
No response
The text was updated successfully, but these errors were encountered: