ROCm memory issues affecting Microphysics codes #1386
Comments
I seem to recall seeing different behaviors when compiled in debug mode, which made me suspect it is a compiler issue. |
Ah, that's an interesting clue. @psharda you were going to try this, right? Did it ever finish building? |
Here's a simple test that generates a memory issue with ROCm 5.7.0:
then run on a single GPU, using the The output is:
|
I hesitate to ask... does this compile in less than a Hubble time in debug mode? |
with DEBUG=TRUE, I get:
|
and this runs fine with ROCm 5.3.0 |
|
with
|
Here's a backtrace from inside a thread:
|
These were causing memory errors in ROCm > 5.3.0 or so. (see AMReX-Astro#1386)
Just want to confirm: is it the case that #1422 and additional PRs will be needed to fully fix this? |
That's the thinking. We won't know until we do it though. Of course, ROCm could also just fix their issues... |
I really want ROCm 6.0 to be available for us to test with. |
@BenWibking @zingale could I meanwhile try our Quokka simulation with #1422 as the Microphysics submodule (since we have ROCm 6.0 available)? I guess we would also need to make changes in Quokka and/or Microphysics CMakeLists? |
We are still seeing the same memory error and crash that we were seeing before with ROCm 6.0, so something still appears to be wrong on their end. |
the |
The debug build is still linking... |
I don't think we have any more instances of pure Microphysics tests failing with ROCm > 5.3.0 |
Previously tracked as AMReX-Codes/amrex#3623.
Reproducer:
Error message:
According to Weiqun, the ASAN error is a false positive.
The {Castro, Quokka} production simulations crash and/or produce error messages like this:
(See: AMReX-Astro/Castro#2569 and quokka-astro/quokka#447.)
In all cases, the errors are not seen on host-only builds or NVIDIA GPUs.