ROCm memory issues affecting Microphysics codes #1386

Closed
BenWibking opened this issue Nov 14, 2023 · 17 comments

Comments

@BenWibking
Collaborator

BenWibking commented Nov 14, 2023

Previously tracked as AMReX-Codes/amrex#3623.

Reproducer:

git clone https://github.com/AMReX-Astro/Microphysics.git
cd Microphysics/unit_test/burn_cell
export AMREX_HOME=/path/to/amrex
export AMREX_AMD_ARCH=gfx90a:xnack+
export HSA_XNACK=1
export LD_LIBRARY_PATH=/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux:$LD_LIBRARY_PATH
make USE_HIP=TRUE CXXFLAGS="-std=c++17 -m64 -fgpu-rdc --offload-arch=gfx90a:xnack+ -pthread -g -O3 -munsafe-fp-atomics -fsanitize=address -shared-libsan" LDFLAGS="-fsanitize=address -shared-libsan" -j16
./main3d.hip.HIP.ex inputs_vode_example

Error message:

==1548068==ERROR: AddressSanitizer: global-buffer-overflow on address 0x0000020f0b48 at pc 0x7f5b9dea5ea7 bp 0x7ffe0c5b6de0 sp 0x7ffe0c5b65a0
READ of size 32 at 0x0000020f0b48 thread T0
    #0 0x7f5b9dea5ea6 in __interceptor_memcpy (/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xa5ea6) (BuildId: e2f6676d7d0ade0de2c4ac32fa5856892b18b70a)
    #1 0x7f5b997440a9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3440a9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #2 0x7f5b997462f6  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3462f6) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #3 0x7f5b997465a6  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x3465a6) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #4 0x7f5b99712434  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x312434) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #5 0x7f5b996dcc53  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x2dcc53) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #6 0x7f5b995835e9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1835e9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #7 0x7f5b99489c0e  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x89c0e) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #8 0x7f5b995e650e  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1e650e) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #9 0x7f5b99610bd9  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x210bd9) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #10 0x7f5b995e6f91  (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1e6f91) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #11 0x7f5b995f13e7 in hipLaunchKernel (/opt/rocm-5.7.0/lib/libamdhip64.so.5+0x1f13e7) (BuildId: 7342fbe1c361ada40d7aa3c1da36c32f3fbe143d)
    #12 0xa49041 in std::enable_if<MaybeDeviceRunnable<(anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)>::value, void>::type amrex::ParallelFor<256, int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int), void>(amrex::Gpu::KernelInfo const&, int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)&&) /home/bwibking/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:878:5
    #13 0xa49041 in void amrex::ParallelFor<int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int), void>(int, (anonymous namespace)::ResizeRandomSeed(unsigned long)::'lambda'(int)&&) /home/bwibking/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:1457:5
    #14 0xa49041 in (anonymous namespace)::ResizeRandomSeed(unsigned long) /home/bwibking/amrex/Src/Base/AMReX_Random.cpp:54:5
    #15 0xa49041 in amrex::InitRandom(unsigned long, int, unsigned long) /home/bwibking/amrex/Src/Base/AMReX_Random.cpp:104:5
    #16 0x987586 in amrex::Initialize(int&, char**&, bool, int, std::function<void ()> const&, std::ostream&, std::ostream&, void (*)(char const*)) /home/bwibking/amrex/Src/Base/AMReX.cpp:625:5
    #17 0x908243 in main /home/bwibking/Microphysics/unit_test/burn_cell/main.cpp:19:3
    #18 0x7f5b98c3feaf in __libc_start_call_main (/lib64/libc.so.6+0x3feaf) (BuildId: b39d468aead6d9ede227751ffe093da287488648)
    #19 0x7f5b98c3ff5f in __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x3ff5f) (BuildId: b39d468aead6d9ede227751ffe093da287488648)
    #20 0x8dc8c4 in _start (/home/bwibking/Microphysics/unit_test/burn_cell/main3d.hip.HIP.ex+0x8dc8c4)

0x0000020f0b48 is located 56 bytes before global variable 'helmholtz::itmax' defined in '../../EOS/helmholtz/actual_eos_data.cpp' (0x20f0b80) of size 8
0x0000020f0b48 is located 24 bytes before global variable 'helmholtz::input_is_constant' defined in '../../EOS/helmholtz/actual_eos_data.cpp' (0x20f0b60) of size 8
0x0000020f0b48 is located 0 bytes after global variable 'helmholtz::do_coulomb' defined in '../../EOS/helmholtz/actual_eos_data.cpp' (0x20f0b40) of size 8
SUMMARY: AddressSanitizer: global-buffer-overflow (/opt/rocm-5.7.0/llvm/lib/clang/17.0.0/lib/linux/libclang_rt.asan-x86_64.so+0xa5ea6) (BuildId: e2f6676d7d0ade0de2c4ac32fa5856892b18b70a) in __interceptor_memcpy
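
As a sanity check on the report above, here is a minimal Python sketch (not part of the original report) that redoes the address arithmetic. The addresses and sizes are copied from the ASAN output; the helper name `describe` is made up for illustration.

```python
# Addresses and sizes copied from the AddressSanitizer report above.
read_addr = 0x0000020F0B48
read_size = 32

# (start address, size in bytes) of the globals that ASAN names.
globals_near = {
    "helmholtz::do_coulomb": (0x20F0B40, 8),
    "helmholtz::input_is_constant": (0x20F0B60, 8),
    "helmholtz::itmax": (0x20F0B80, 8),
}


def describe(addr, start, size):
    """Hypothetical helper: place addr relative to the range [start, start + size)."""
    end = start + size
    if addr < start:
        return f"{start - addr} bytes before"
    if addr >= end:
        return f"{addr - end} bytes after"
    return f"{addr - start} bytes inside"


for name, (start, size) in globals_near.items():
    print(f"{name}: {describe(read_addr, start, size)}")

# One past the last byte touched by the 32-byte read.
read_end = read_addr + read_size
print(f"read covers [{read_addr:#x}, {read_end:#x})")
```

The distances agree with the report (0 bytes after `do_coulomb`, 24 bytes before `input_is_constant`, 56 bytes before `itmax`). The 32-byte read starts right where `do_coulomb` ends and stops exactly at the end of `input_is_constant`, so it crosses the 24-byte gap between the two globals. Whether that gap is an ASAN redzone is an assumption here, not something the report states.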

According to Weiqun, the ASAN error is a false positive.

The {Castro, Quokka} production simulations crash and/or produce error messages like this:

Memory access fault by GPU node-8 (Agent handle: 0x2975b60) on address 0x800033773000. Reason: Unknown.

(See: AMReX-Astro/Castro#2569 and quokka-astro/quokka#447.)

In all cases, the errors are not seen on host-only builds or NVIDIA GPUs.

@zingale
Member

zingale commented Nov 14, 2023

I seem to recall seeing different behaviors when compiled in debug mode, which made me suspect it is a compiler issue.

@BenWibking
Collaborator Author

> I seem to recall seeing different behaviors when compiled in debug mode, which made me suspect it is a compiler issue.

Ah, that's an interesting clue. @psharda you were going to try this, right? Did it ever finish building?

@zingale
Member

zingale commented Dec 13, 2023

Here's a simple test that generates a memory issue with ROCm 5.7.0:

module load cpe/23.09
module load rocm/5.7.0
module load PrgEnv-gnu craype-accel-amd-gfx90a cray-mpich

cd Microphysics/unit_test/test_react
make NETWORK_DIR=subch_simple USE_HIP=TRUE COMP=gnu -j 4

Then run on a single GPU, using the inputs_aprox13 inputs file.

The output is:

Initializing AMReX (23.12-11-g064db4eaa599)...
Initializing HIP...
HIP initialized with 1 device.
AMReX (23.12-11-g064db4eaa599) initialized
reading extern runtime parameters ...
reading in network electron-capture / beta-decay tables...
Memory access fault by GPU node-4 (Agent handle: 0x1f677b0) on address 0x7fffd6ce5000. Reason: Unknown.
SIGABRT
See Backtrace.0 file for details
srun: error: frontier05193: task 0: Exited with exit code 1
srun: Terminating StepId=1533275.0
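
For comparing fault messages like the ones above across runs, a small parser can help. This is a rough Python sketch, not a tool from this thread: the regular expression is fitted to the two fault lines quoted so far, and `parse_fault` and `PAGE_SIZE` are names invented here. The 4 KiB page size is an assumption.

```python
import re

# Pattern fitted to the fault lines quoted in this issue (an assumption about
# the general format, not taken from ROCm documentation).
FAULT_RE = re.compile(
    r"Memory access fault by GPU node-(?P<node>\d+) "
    r"\(Agent handle: (?P<agent>0x[0-9a-fA-F]+)\) "
    r"on address (?P<addr>0x[0-9a-fA-F]+)\. Reason: (?P<reason>[^.]+)\."
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def parse_fault(line):
    """Hypothetical helper: extract the fields of a GPU fault line, or None."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    return {
        "node": int(m.group("node")),
        "agent": int(m.group("agent"), 16),
        "address": addr,
        "reason": m.group("reason"),
        "page_aligned": addr % PAGE_SIZE == 0,
    }


# The two fault lines quoted so far in this issue.
lines = [
    "Memory access fault by GPU node-8 (Agent handle: 0x2975b60) on address 0x800033773000. Reason: Unknown.",
    "Memory access fault by GPU node-4 (Agent handle: 0x1f677b0) on address 0x7fffd6ce5000. Reason: Unknown.",
]

for line in lines:
    print(parse_fault(line))
```

Both addresses are multiples of 4096. That may mean the runtime reports the faulting page rather than the exact byte, but that reading is a guess and is not confirmed anywhere in this thread.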

@BenWibking
Collaborator Author

I hesitate to ask... does this compile in less than a Hubble time in debug mode?

@zingale
Member

zingale commented Dec 13, 2023

with DEBUG=TRUE, I get:

:0:rocdevice.cpp            :2692: 719740276446 us: [pid:85303 tid:0x7fffde461700] Callback: Queue 0x7ffeaba00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29

@zingale
Member

zingale commented Dec 13, 2023

and this runs fine with ROCm 5.3.0

@yut23
Collaborator

yut23 commented Dec 13, 2023

test_react appears to work fine with ROCm 5.4.0

@zingale
Member

zingale commented Dec 14, 2023

with rocgdb, I get:

guration: Returned hipSuccess : 
:3:hip_module.cpp           :678 : 298664095624 us: [pid:16441 tid:0x7fffed9cda80]  hipLaunchKernel ( 0x221e30, {4,1,1}, {256,1,1}, 0x7fffffff3a10, 0, stream:0x87d36a0 ) 
:3:rocvirtual.cpp           :783 : 298664095630 us: [pid:16441 tid:0x7fffed9cda80] Arg0:   = val:140648402845968
:3:rocvirtual.cpp           :2897: 298664095632 us: [pid:16441 tid:0x7fffed9cda80] ShaderName : _ZN5amrex13launch_globalILi256EZNS_6launchILi256EZNS_9ReduceOpsIJNS_11ReduceOpMaxEEE4evalINS_10ReduceDataIJNS_10ValLocPairIi6burn_tEEEEEZNS4_4evalINS_8FabArrayINS_9FArrayBoxEEESA_Z9main_mainvEUliiiiE_EENSt9enable_ifIXaasr10IsFabArrayIT_EE5valuesr10IsCallableIT1_iiiiEE5valueEvE4typeERKSH_RKNS_7IntVectERT0_OSI_EUliiiE_EEvRKNS_3BoxERSH_OSQ_EUlvE_EEvimP12ihipStream_tSY_EUlvE_EEvSQ_.intern.14460905eb7cb0a1
:3:hip_module.cpp           :679 : 298664095639 us: [pid:16441 tid:0x7fffed9cda80] hipLaunchKernel: Returned hipSuccess : 
:3:hip_error.cpp            :27  : 298664095641 us: [pid:16441 tid:0x7fffed9cda80]  hipGetLastError (  ) 
:3:hip_error.cpp            :27  : 298664095644 us: [pid:16441 tid:0x7fffed9cda80]  hipGetLastError (  ) 
:3:hip_stream.cpp           :451 : 298664095648 us: [pid:16441 tid:0x7fffed9cda80]  hipStreamSynchronize ( stream:0x87d36a0 ) 
:3:rocdevice.cpp            :2651: 298664095650 us: [pid:16441 tid:0x7fffed9cda80] No HW event
:3:rocvirtual.hpp           :67  : 298664095653 us: [pid:16441 tid:0x7fffed9cda80] Host active wait for Signal = (0x7fffcbee4000) for -1 ns
Memory access fault by GPU node-4 (Agent handle: 0x42bfbc0) on address 0x7ff7e03d5000. Reason: Unknown.

Thread 2 "main3d.hip.x86-" hit Breakpoint 1, 0x00007fffe80f81de in abort () from /lib64/libc.so.6
(gdb) interrupt
(gdb) 
Thread 1 "main3d.hip.x86-" stopped.
0x00007fffdf6769f9 in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
bt
#0  0x00007fffdf6769f9 in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#1  0x00007fffdf67684a in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#2  0x00007fffdf669fa9 in ?? () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#3  0x00007fffe9305793 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#4  0x00007fffe92fc318 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#5  0x00007fffe92ffcbf in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#6  0x00007fffe9301a03 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#7  0x00007fffe92ff225 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#8  0x00007fffe92d330b in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#9  0x00007fffe92d3920 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#10 0x00007fffe92d39cc in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#11 0x00007fffe92d6b28 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#12 0x00007fffe9239503 in ?? () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#13 0x00007fffe923992c in hipStreamSynchronize () from /opt/rocm-5.7.0/lib/libamdhip64.so.5
#14 0x0000000002f265c6 in amrex::Gpu::Device::streamSynchronize ()
    at /ccs/home/zingale/amrex/Src/Base/AMReX_GpuDevice.cpp:613
#15 0x0000000002fa45ec in amrex::Gpu::streamSynchronize ()
    at /ccs/home/zingale/amrex/Src/Base/AMReX_GpuDevice.H:241
#16 amrex::MFIter::Finalize (this=0x7fffffff3a70)
    at /ccs/home/zingale/amrex/Src/Base/AMReX_MFIter.cpp:242
#17 0x0000000002fa456c in amrex::MFIter::~MFIter (this=0x4292690)
    at /ccs/home/zingale/amrex/Src/Base/AMReX_MFIter.cpp:212
#18 0x0000000002ea2205 in amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&) (this=<optimized out>, mf=..., 
    nghost=..., reduce_data=..., f=...) at /ccs/home/zingale/amrex/Src/Base/AMReX_Reduce.H:453
#19 amrex::ParReduce<amrex::ReduceOpMax, amrex::ValLocPair<int, burn_t>, amrex::FArrayBox, main_main()::{lambda(int, int, int, int)#1}, void>(amrex::TypeList<amrex::ReduceOpMax>, amrex::TypeList<amrex::ValLocPair<int, burn_t> >, amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, main_main()::{lambda(int, int, int, int)#1}&&) (fa=..., nghost=..., operation_list=..., type_list=...,
    f=...) at /ccs/home/zingale/amrex/Src/Base/AMReX_ParReduce.H:103
#20 amrex::ParReduce<amrex::ReduceOpMax, amrex::ValLocPair<int, burn_t>, amrex::FArrayBox, main_main()::{lambda(int, int, int, int)#1}, void>(amrex::TypeList<amrex::ReduceOpMax>, amrex::TypeList<amrex::ValLocPair<int, burn_t> >, amrex::FabArray<amrex::FArrayBox> const&, main_main()::{lambda(int, int, int, int)#1}&&) (fa=..., operation_list=..., type_list=..., f=...)
    at /ccs/home/zingale/amrex/Src/Base/AMReX_ParReduce.H:288
#21 main_main () at main.cpp:203
#22 0x0000000002ea0d41 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:26
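
The `hipLaunchKernel` trace line in the log above records the launch configuration of the kernel that was in flight when the fault was reported. The following Python sketch (not from the thread) pulls it out. The regular expression is fitted to that single line, and the name `parse_launch` is invented for illustration. The argument order follows the `hipLaunchKernel` signature (function address, number of blocks, block dimensions, argument pointer, shared memory bytes, stream).

```python
import re
from math import prod

# Pattern fitted to the one hipLaunchKernel trace line quoted above
# (an assumption about the log format in general).
LAUNCH_RE = re.compile(
    r"hipLaunchKernel \( (?P<func>0x[0-9a-fA-F]+), "
    r"\{(?P<grid>\d+,\d+,\d+)\}, \{(?P<block>\d+,\d+,\d+)\}, "
    r"(?P<args>0x[0-9a-fA-F]+), (?P<shared>\d+), "
    r"stream:(?P<stream>0x[0-9a-fA-F]+) \)"
)


def parse_launch(line):
    """Hypothetical helper: extract grid/block sizes from a launch trace line."""
    m = LAUNCH_RE.search(line)
    if m is None:
        return None
    grid = tuple(int(x) for x in m.group("grid").split(","))
    block = tuple(int(x) for x in m.group("block").split(","))
    return {
        "grid": grid,                       # number of blocks per dimension
        "block": block,                     # threads per block per dimension
        "shared_mem_bytes": int(m.group("shared")),
        "total_threads": prod(grid) * prod(block),
    }


trace = (
    ":3:hip_module.cpp           :678 : 298664095624 us: "
    "[pid:16441 tid:0x7fffed9cda80]  hipLaunchKernel ( 0x221e30, {4,1,1}, "
    "{256,1,1}, 0x7fffffff3a10, 0, stream:0x87d36a0 ) "
)

print(parse_launch(trace))
```

For the line quoted here this gives 4 blocks of 256 threads (1024 threads in total) and no dynamic shared memory. The block size of 256 is consistent with the `launch_global<256>` instantiation visible in the mangled ShaderName.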

@yut23
Collaborator

yut23 commented Dec 14, 2023

Here's a backtrace from inside a thread:

#0  0x00007ff7b0c54630 in dgesl<23> (a1=..., pivot1=..., b1=...) at ../../util/linpack.H:24
#1  dvnlsd<amrex::Array1D<short, 1, 23>, burn_t, dvode_t<23> > (pivot=..., NFLAG=<optimized out>, state=..., vstate=...) at ../../integration/VODE/vode_dvnlsd.H:117                                                                           
#2  dvstep<burn_t, dvode_t<23> > (state=..., vstate=...) at ../../integration/VODE/vode_dvstep.H:177
#3  dvode<burn_t, dvode_t<23> > (state=..., vstate=...) at ../../integration/VODE/vode_dvode.H:186
#4  actual_integrator<burn_t> (state=..., dt=<optimized out>) at ../../integration/VODE/actual_integrator.H:88
#5  integrator<burn_t> (state=..., dt=<optimized out>) at ../../integration/integrator.H:14
#6  burner<burn_t> (state=..., dt=<optimized out>) at ../../interfaces/burner.H:92
#7  do_react (i=<optimized out>, j=<optimized out>, k=<optimized out>, state=..., burn_state=..., n_rhs=..., p=...) at ./react_zones.H:49                                                                                                      
#8  main_main()::{lambda(int, int, int, int)#1}::operator()(int, int, int, int) const (this=<optimized out>, box_no=<optimized out>, i=<optimized out>, j=<optimized out>, k=<optimized out>) at main.cpp:211                                  
#9  amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}::operator()(int, int, int) const (this=<optimized out>, i=<optimized out>, j=<optimized out>, k=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_Reduce.H:459
#10 amrex::Reduce::detail::call_f<amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}>(amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1} const&, int, int, int, amrex::IndexType) (f=..., i=<optimized out>, j=<optimized out>, k=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_Reduce.H:324
#11 amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}>(amrex::Box const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&&)::{lambda()#1}::operator()() const (this=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_Reduce.H:545
#12 amrex::launch<256, amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, amrex::ReduceOps<amrex::ReduceOpMax>::eval<amrex::FabArray<amrex::FArrayBox>, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >, main_main()::{lambda(int, int, int, int)#1}>(amrex::FabArray<amrex::FArrayBox> const&, amrex::IntVect const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, main_main()::{lambda(int, int, int, int)#1}&&)::{lambda(int, int, int)#1}>(amrex::Box const&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&&)::{lambda()#1}>(int, unsigned long, ihipStream_t*, amrex::ReduceData<amrex::ValLocPair<int, burn_t> >&&)::{lambda()#1}::operator()() const (this=<optimized out>) at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:779
#13 _ZN5amrex13launch_globalILi256EZNS_6launchILi256EZNS_9ReduceOpsIJNS_11ReduceOpMaxEEE4evalINS_10ReduceDataIJNS_10ValLocPairIi6burn_tEEEEEZNS4_4evalINS_8FabArrayINS_9FArrayBoxEEESA_Z9main_mainvEUliiiiE_EENSt9enable_ifIXaasr10IsFabArrayIT_EE5valuesr10IsCallableIT1_iiiiEE5valueEvE4typeERKSH_RKNS_7IntVectERT0_OSI_EUliiiE_EEvRKNS_3BoxERSH_OSQ_EUlvE_EEvimP12ihipStream_tSY_EUlvE_EEvSQ_.intern.3d5caca8830a6260 () at /ccs/home/etjohnson/dev/amrex/Src/Base/AMReX_GpuLaunchGlobal.H:16

yut23 added a commit to yut23/Microphysics that referenced this issue Dec 14, 2023
These were causing memory errors in ROCm > 5.3.0 or so.
(see AMReX-Astro#1386)
@BenWibking
Collaborator Author

Just want to confirm: is it the case that #1422 and additional PRs will be needed to fully fix this?

@zingale
Member

zingale commented Dec 30, 2023

That's the thinking. We won't know until we do it though. Of course, ROCm could also just fix their issues...

@zingale
Member

zingale commented Dec 30, 2023

I really want ROCm 6.0 to be available for us to test with.

@psharda
Collaborator

psharda commented Dec 30, 2023

@BenWibking @zingale could I meanwhile try our Quokka simulation with #1422 as the Microphysics submodule (since we have ROCm 6.0 available)? I guess we would also need to make changes in Quokka and/or Microphysics CMakeLists?

@BenWibking
Collaborator Author

> I really want ROCm 6.0 to be available for us to test with.

We are still seeing the same memory error and crash that we were seeing before with ROCm 6.0, so something still appears to be wrong on their end.

@zingale
Member

zingale commented Jan 2, 2024

the test_react problem with subch_simple works now with ROCm 5.7.1 with the latest version of Microphysics. So we need to find another test problem.

@BenWibking
Collaborator Author

> the test_react problem with subch_simple works now with ROCm 5.7.1 with the latest version of Microphysics. So we need to find another test problem.

unit_test/burn_cell still reports the same false positive ASAN error with ROCm 6.0. It runs fine without ASAN, though.

The debug build is still linking...

@zingale
Member

zingale commented Feb 10, 2024

I don't think we have any more instances of pure Microphysics tests failing with ROCm > 5.3.0.
For Castro, we worked around an issue and Castro now runs with ROCm 6.0: AMReX-Astro/Castro#2749
