pack_size != -1 "Memory access fault" on Frontier #115

Open
pgrete opened this issue Sep 9, 2024 · 9 comments

@pgrete
Contributor

pgrete commented Sep 9, 2024

While running some tests on Frontier I noticed the following issue:

$ srun -N 1 -n 8 -c 1 --gpus-per-node=8 --gpu-bind=closest /ccs/proj/ast146/pgrete/src/athenapk/build-bump-parth/bin/athenaPK -i ./linear_wave3d.in parthenon/meshblock/nx1=256 parthenon/meshblock/nx2=256 parthenon/meshblock/nx3=256 parthenon/mesh/nx1=1024 parthenon/mesh/nx2=1024 parthenon/mesh/nx3=1024 parthenon/time/nlim=20 parthenon/time/integrator=rk2 parthenon/mesh/pack_size=4
Memory access fault by GPU node-8 (Agent handle: 0x61253f0) on address 0x7ff7f2522000. Reason: Unknown.
srun: error: frontier08577: task 0: Aborted
srun: Terminating StepId=2345368.15
slurmstepd: error: *** STEP 2345368.15 ON frontier08577 CANCELLED AT 2024-09-06T06:38:27 ***
^[[A^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=2345368.15 tasks 1-7: running
srun: StepId=2345368.15 task 0: exited abnormally

It should be confirmed whether this is Frontier-specific or a more general AthenaPK or Parthenon issue.

@pgrete
Contributor Author

pgrete commented Sep 9, 2024

It does work as expected on GH200, so the "Memory access fault" seems to be one of the standard Frontier/LUMI/MI250X/Cray errors.

@BenWibking
Contributor

It's probably an LLVM AMDGPU compiler bug. It's been known for years, but AMD has not been able to fix it: https://discourse.llvm.org/t/how-to-verify-correct-regalloc-for-a-kernel/80811

The cause: when register pressure is high and there is conditional execution (as in virtually all of our kernels), the compiler can produce incorrect machine code for restoring registers that were spilled to memory (because the kernel ran out of hardware registers). That code trashes the registers that hold memory addresses. Then, boom, memory error and crash.

So far we've only seen it with reaction networks (which use ~1000s of registers). But as the AMD engineer says in the thread, it's not predictable when it happens, it cannot be verified that any given kernel is compiled correctly, and the bug is difficult to spot even when manually inspecting the generated machine code.
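
One way to at least narrow down which kernels are candidates for this failure mode is to ask the compiler for per-kernel resource usage and look for spills. Below is a minimal sketch, assuming the build is done with clang's `-Rpass-analysis=kernel-resource-usage` remarks enabled and that the remark lines look like the samples in the LLVM AMDGPU documentation (`Function Name:`, `VGPRs Spill:`, `ScratchSize [bytes/lane]:`). The exact labels may differ between ROCm releases, and `kernels_with_spills` is a hypothetical helper name. Spilling alone does not mean a kernel was miscompiled; it only marks kernels where a register-restore path exists at all.

```python
import re

# Assumed remark layout (may vary between LLVM/ROCm versions):
#   remark: file.cpp:27:0: Function Name: my_kernel
#   remark: file.cpp:27:0:     VGPRs Spill: 12
NAME_RE = re.compile(r"Function Name:\s*(\S+)")
FIELD_RE = re.compile(
    r"(SGPRs Spill|VGPRs Spill|ScratchSize \[bytes/lane\]):\s*(\d+)"
)


def kernels_with_spills(build_log):
    """Return {kernel: {field: value}} for kernels with nonzero spill/scratch."""
    usage = {}
    current = None
    for line in build_log.splitlines():
        name = NAME_RE.search(line)
        if name:
            # A new "Function Name" remark starts the next kernel's record.
            current = name.group(1)
            usage[current] = {}
            continue
        field = FIELD_RE.search(line)
        if field and current is not None:
            usage[current][field.group(1)] = int(field.group(2))
    # Keep only kernels where at least one spill-related counter is nonzero.
    return {k: v for k, v in usage.items() if any(n > 0 for n in v.values())}
```

Fed the captured compiler stderr of a build, this returns only the kernels that spill, which gives a shorter list to inspect first.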

@BenWibking
Contributor

Here's another example of this kind of compiler bug: llvm/llvm-project#96353

@pgrete
Contributor Author

pgrete commented Sep 9, 2024

yikes... I guess we'll wait and see then.

@BenWibking
Contributor

BenWibking commented Oct 1, 2024

The PR that was expected to fix (all?) of these kinds of bugs was just merged into LLVM: llvm/llvm-project#93526.

It may be possible to build a working compiler using Spack with spack install llvm@main target=zen3,amdgpu on Frontier.

@BenWibking
Contributor

This should be fixed with rocm/6.3.1. I'll try running your example and test it.

@BenWibking
Contributor

BenWibking commented Jan 21, 2025

Hmm, that doesn't work.

With rocm/6.3.1 and -DCMAKE_CXX_COMPILER=amdclang++:

wibking@login04:/ccs/proj/ast146/bwibking/athenapk_rocm_6.3.1> srun -q debug -t 00:05:00 -A ast146 -N 1 -n 8 -c 1 --gpus-per-node=8 --gpu-bind=closest /ccs/proj/ast146/bwibking/athenapk_rocm_6.3.1/athenaPK -i ./linear_wave3d.in parthenon/meshblock/nx1=256 parthenon/meshblock/nx2=256 parthenon/meshblock/nx3=256 parthenon/mesh/nx1=1024 parthenon/mesh/nx2=1024 parthenon/mesh/nx3=1024 parthenon/time/nlim=20 parthenon/time/integrator=rk2 parthenon/mesh/pack_size=4
cycle=0 time=0.0000000000000000e+00 dt=4.3945268554996863e-04 zone-cycles/wsec_step=0.00e+00 wsec_total=3.38e-02 wsec_step=6.81e+00
Memory access fault by GPU node-11 (Agent handle: 0x59799f0) on address 0x7f2dee42e000. Reason: Unknown.
Memory access fault by GPU node-7 (Agent handle: 0x59799f0) on address 0x7f9f2f000000. Reason: Unknown.
Memory access fault by GPU node-8 (Agent handle: 0x5979880) on address 0x7f91127c1000. Reason: Unknown.
Failed to read GPU memory: Input/output error
GPU core dump failed
Failed to allocate file: No space left on device
GPU core dump failed
srun: error: frontier00578: task 0: Aborted
srun: Terminating StepId=2956710.0

With rocm/6.3.1 and -DCMAKE_CXX_COMPILER=hipcc:

wibking@login04:/ccs/proj/ast146/bwibking/athenapk_rocm_6.3.1_hipcc> srun -q debug -t 00:05:00 -A ast146 -N 1 -n 8 -c 1 --gpus-per-node=8 --gpu-bind=closest /ccs/proj/ast146/bwibking/athenapk_rocm_6.3.1_hipcc/athenaPK -i ./linear_wave3d.in parthenon/meshblock/nx1=256 parthenon/meshblock/nx2=256 parthenon/meshblock/nx3=256 parthenon/mesh/nx1=1024 parthenon/mesh/nx2=1024 parthenon/mesh/nx3=1024 parthenon/time/nlim=20 parthenon/time/integrator=rk2 parthenon/mesh/pack_size=4
cycle=0 time=0.0000000000000000e+00 dt=4.3945268554996863e-04 zone-cycles/wsec_step=0.00e+00 wsec_total=3.45e-02 wsec_step=6.62e+00
Memory access fault by GPU node-11 (Agent handle: 0x5979a00) on address 0x7f56082b0000. Reason: Unknown.
Failed to read GPU memory: Input/output error
GPU core dump failed
srun: error: frontier00578: task 5: Aborted (core dumped)
srun: Terminating StepId=2956719.0
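
When comparing builds like the two above, it can help to reduce each run's output to the number of faults per GPU. Below is a minimal sketch that parses fault lines in the format shown in these logs; `summarize_faults` is a hypothetical helper, not part of AthenaPK or Parthenon.

```python
import re
from collections import Counter

# Matches lines such as:
#   Memory access fault by GPU node-11 (Agent handle: 0x59799f0)
#   on address 0x7f2dee42e000. Reason: Unknown.
FAULT_RE = re.compile(
    r"Memory access fault by GPU (node-\d+) "
    r"\(Agent handle: (0x[0-9a-fA-F]+)\) "
    r"on address (0x[0-9a-fA-F]+)"
)


def summarize_faults(log_text):
    """Count faults per GPU node and collect the faulting addresses."""
    per_node = Counter()
    addresses = []
    for match in FAULT_RE.finditer(log_text):
        per_node[match.group(1)] += 1
        addresses.append(int(match.group(3), 16))
    return per_node, addresses
```

Running it over the captured stdout and stderr of each job gives a compact per-build summary that is easier to diff than the raw logs.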

@BenWibking
Contributor

BenWibking commented Jan 21, 2025

With rocm/6.3.1, -DCMAKE_CXX_COMPILER=hipcc, and pack_size=1:

wibking@login04:/ccs/proj/ast146/bwibking/athenapk_rocm_6.3.1_hipcc> srun -q debug -t 00:05:00 -A ast146 -N 1 -n 8 -c 1 --gpus-per-node=8 --gpu-bind=closest /ccs/proj/ast146/bwibking/athenapk_rocm_6.3.1_hipcc/athenaPK -i ./linear_wave3d.in parthenon/meshblock/nx1=256 parthenon/meshblock/nx2=256 parthenon/meshblock/nx3=256 parthenon/mesh/nx1=1024 parthenon/mesh/nx2=1024 parthenon/mesh/nx3=1024 parthenon/time/nlim=20 parthenon/time/integrator=rk2 parthenon/mesh/pack_size=1
Memory access fault by GPU node-9 (Agent handle: 0x5979a00) on address 0x7f051ee1b000. Reason: Unknown.
Memory access fault by GPU node-4 (Agent handle: 0x5979a00) on address 0x7f9691404000. Reason: Unknown.
Memory access fault by GPU node-11 (Agent handle: 0x5979a00) on address 0x7f731759a000. Reason: Unknown.
Failed to allocate file: No space left on device
GPU core dump failed
Failed to allocate file: No space left on device
GPU core dump failed
srun: error: frontier00578: task 6: Aborted
srun: Terminating StepId=2956722.0

I've verified it works correctly with pack_size=-1 and this binary.
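
To map out which pack_size values fail with a given binary, the runs above could be scripted. Below is a minimal sketch, assuming it is launched from inside an allocation; the launcher and mesh arguments are copied from the commands in this thread, while `build_command` and `sweep` are hypothetical helper names.

```python
import subprocess

# Launcher and problem setup copied from the runs in this thread.
SRUN = ["srun", "-N", "1", "-n", "8", "-c", "1",
        "--gpus-per-node=8", "--gpu-bind=closest"]
PROBLEM = [
    "-i", "./linear_wave3d.in",
    "parthenon/meshblock/nx1=256",
    "parthenon/meshblock/nx2=256",
    "parthenon/meshblock/nx3=256",
    "parthenon/mesh/nx1=1024",
    "parthenon/mesh/nx2=1024",
    "parthenon/mesh/nx3=1024",
    "parthenon/time/nlim=20",
    "parthenon/time/integrator=rk2",
]


def build_command(binary, pack_size):
    """Assemble the full srun command line for one pack_size value."""
    return SRUN + [binary] + PROBLEM + [f"parthenon/mesh/pack_size={pack_size}"]


def sweep(binary, pack_sizes, run=subprocess.run):
    """Run each pack_size once; map it to True if the run finished cleanly."""
    results = {}
    for pack_size in pack_sizes:
        proc = run(build_command(binary, pack_size),
                   capture_output=True, text=True)
        output = (proc.stdout or "") + (proc.stderr or "")
        results[pack_size] = (
            proc.returncode == 0 and "Memory access fault" not in output
        )
    return results
```

For example, `sweep("/path/to/athenaPK", [-1, 1, 2, 4, 8])` would return a dictionary of pass/fail flags per value. Site-specific flags such as `-q debug`, `-t`, and `-A` would need to be added to `SRUN` where required.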

@pgrete
Contributor Author

pgrete commented Jan 21, 2025

I'm not sure whether I want this to be a compiler bug (which means investing a lot of time only to discover there's little we can do) or an AthenaPK/Parthenon bug :D

btw, regarding "Failed to allocate file: No space left on device": is our project space on /ccs/proj above quota?
