
Rationalize Frontier compiler entries between SCREAM and E3SM. #6773

Open
rljacob opened this issue Nov 22, 2024 · 15 comments

@rljacob
Member

rljacob commented Nov 22, 2024

Opening this issue to discuss how to make sure all SCREAM cases build with the "frontier" machine description and at least one compiler entry, so we can remove "frontier-scream-gpu" from config_machines.xml.

See https://acme-climate.atlassian.net/wiki/spaces/EIDMG/pages/3446079573/How+to+describe+heterogenous+node+machines+with+CIME for background.

In E3SM-Project/scream#2969 (comment) it was noted that SCREAM's "craygnuamdgpu" tells the user that it "Uses Cray wrappers with Gnu compilers for the host, and uses the AMD Hip compiler for the GPU". We should follow that convention.
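Concretely, consolidation would mean the single "frontier" entry in config_machines.xml advertises the GPU-capable compilers itself. A minimal sketch of what such an entry could look like (illustrative placeholder only, not the actual E3SM entry; element values are assumptions):

```xml
<!-- Illustrative sketch only: one "frontier" machine whose COMPILERS list
     includes the GPU-capable entries, removing the need for a separate
     frontier-scream-gpu machine. Values here are placeholders. -->
<machine MACH="frontier">
  <DESC>ORNL Frontier, AMD MI250X GPUs</DESC>
  <COMPILERS>crayclang,craygnuamdgpu</COMPILERS>
  <MPILIBS>mpich</MPILIBS>
</machine>
```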

@rljacob
Member Author

rljacob commented Nov 22, 2024

@bartgol
Contributor

bartgol commented Nov 22, 2024

Cosmetic comment: I personally find the string craygnuamdgpu hard to parse. For "combo" compilers such as this, it may be more readable to use hostcompiler-devicecompiler, that is, add some dashes in there. In this case, craygnu-amdgpu, so that the mach+compiler combo would be frontier_craygnu-amdgpu.

@grnydawn
Contributor

To drive the merging of Frontier compiler entries, I think it would be useful to choose one or two Scream test cases within the E3SM machine/compiler configurations and resolve any issues encountered during their build and execution on Frontier. If necessary, we may create a new Scream test case.

@rljacob
Member Author

rljacob commented Nov 23, 2024

Use any of the SCREAM test cases that already exist, such as ne30_ne30.F2010-SCREAMv1, which is in e3sm_scream_v1_medres in https://github.com/E3SM-Project/E3SM/blob/49fdbe3661f2b8c95d8459081f500fae3a069ba0/cime_config/tests.py#L704C6-L704C28

Those all pass on frontier when using --machine frontier-scream-gpu --compiler craygnuamdgpu

You could start by just copying the craygnuamdgpu compiler entry to the frontier machine file while we figure out how to name these.

@sarats
Member

sarats commented Dec 9, 2024

FYI, discussed this topic extensively during the Perf/Infra call today: https://acme-climate.atlassian.net/wiki/spaces/EP/pages/4825645058/2024-12-09+Performance+Infrastructure+Meeting+Notes

@jgfouca and @grnydawn to coordinate and work together to consolidate config.

@rljacob
Member Author

rljacob commented Jan 9, 2025

During the EAMxx dev call, we decided on using a "-" between the CPU and GPU compiler names instead of putting "gpu" in the compiler name. If there is no dash, that means it's a CPU-only compile.
craycray-amd = cray wrapper around cray compiler for host, amd compiler on gpu
craygnu-amd = cray wrapper around gnu compiler, amd compiler on gpu
gnu-amd = use gfortran directly on host, amd compiler on gpu.
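Under this convention the dash itself carries the meaning: its presence marks a host/device combo, its absence a CPU-only build. As a hypothetical illustration (this helper is not part of CIME; the parsing rule is assumed from the comment above), splitting on the first dash recovers the two parts:

```python
def parse_compiler(name: str) -> dict:
    """Split a compiler name per the proposed convention: an optional dash
    separates the host toolchain from the device compiler; no dash means a
    CPU-only build. Hypothetical helper, not part of CIME."""
    host, sep, device = name.partition("-")
    return {"host": host, "device": device if sep else None}

# Examples from the convention above:
print(parse_compiler("craygnu-amd"))  # {'host': 'craygnu', 'device': 'amd'}
print(parse_compiler("gnu"))          # {'host': 'gnu', 'device': None}
```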

@ndkeen
Contributor

ndkeen commented Jan 9, 2025

I've tried to bring up these topics before, but wanted to again state for the record:
For example, on perlmutter we have gnu and gnugpu, and while I don't think gpu belongs in the compiler name -- the compiler name is gnu -- we do need a way to know whether the exe runs on GPU nodes. So I think we need a different way (one not currently supported in CIME) to specify the hardware to be used. A new variable could then be tested on in cmake.
Since we don't have that, making the change you suggest (i.e., not having the gpu string in the compiler name) could lead to confusing cmake tests, where you basically just need to ask whether the exe is for GPU or not.

I also think on frontier we can simply use gnu and gnugpu and then build with GNU fortran/C++ in the same way we do on perlmutter. Of course, I defer to the POC, and if they wanted, they could build differently -- but even so, I'm not sure it's a good idea to have complicated-sounding compiler names. We could still have gnu or amd, with the details of how it's built kept in the config files.

@bartgol
Contributor

bartgol commented Jan 9, 2025

@rljacob what if we used openacc? Should the compiler be called gnu-openacc? gnu-gnu? And what if we used openacc for f90, and nvcc for C++? Should we call it gnu-nvcc? gnu-openacc-nvcc?

@rljacob
Member Author

rljacob commented Jan 9, 2025

We have 3 things: wrapper, host compiler, device compiler. That would add a fourth thing: GPU programming model. Why would that be necessary?

@bartgol
Contributor

bartgol commented Jan 9, 2025

I'm just trying to understand what happens if we use nvcc for C++ and openacc for f90, both running on device. NVCC is a compiler, not a prog model. So would you do gnu-gnu-nvcc, since you use two different GPU compilers?

Re: openacc. It is a prog model, so what would you use, gnu-gnu for code that uses openacc (or openmp-target) for the accelerator?

@grnydawn
Contributor

I am working on renaming the craygnuamdgpu and crayclang-scream compilers in the frontier-scream-gpu machine based on the basic approach Rob explained and discussions in the issue threads. The following summarizes the issues I encountered during the renaming process:

1. Compiler Wrapper

  • Not all compiler configurations use a compiler wrapper. For example, craygnuamdgpu uses the hipcc compiler as MPICXX.

2. Programming Language

  • Fortran compiler configurations differ from C++ compiler configurations. For example, MPIFC uses ftn, whereas MPICXX uses hipcc or mpicxx.

3. GPU Programming Framework

  • There is no way to express the use of CUDA, HIP, OpenACC, or OpenMP offload.

4. Model-Specific Compiler

  • The craygnuamdgpu compiler sets MPICXX as hipcc, which causes compiler errors on Omega because hipcc is not an MPI-enabled compiler.

Tentatively, I renamed craygnuamdgpu to craygnu-hipcc and crayclang-scream to craycray-hipcc. However, these new names still retain the issues listed above.

Thoughts on Renaming

  1. A compiler wrapper may not be necessary in the CIME compiler name. Compiler wrappers are usually not an issue during compilation, and in most cases, we know what compiler wrapper is used based on the system name.
  2. We may want to distinguish the Fortran compiler and the C++ compiler because they might be different, such as crayftn for Fortran and hipcc for C++.
  3. The C++ compiler name provides hints about the GPU programming framework in use, such as nvcc or hipcc.
  4. We may still need to express additional information, such as model-specific details or the GPU programming framework.

Proposed Naming Convention

Considering these factors, I came up with the following naming convention:

<Fortran compiler name without compiler wrapper>-<C++(GPU) compiler name without compiler wrapper>-<Optional specific information such as model name or GPU programming framework>

For example:

  • Rename crayclang-scream to cray-hipcc.
  • Rename craygnuamdgpu to gnu-hipcc-eamxx.
  • Rename crayclang (CPU-only) to cray.

The eamxx addition in the second name reflects that the compiler configuration uses hipcc for MPICXX and is likely incompatible with other C++ models, such as Omega.
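As a hypothetical sketch of how the proposed three-field names could be interpreted (the helper and its field names are illustrative, not CIME code; the assumption that a single-field name like cray covers both Fortran and C++ is mine):

```python
def parse_name(name: str) -> dict:
    """Interpret a compiler name under the proposed
    <fortran>-<cxx/gpu>[-<qualifier>] convention. Hypothetical helper:
    a lone field is assumed to name both the Fortran and C++ compilers."""
    parts = name.split("-")
    return {
        "fortran": parts[0],
        "cxx": parts[1] if len(parts) > 1 else parts[0],
        "qualifier": parts[2] if len(parts) > 2 else None,
    }

print(parse_name("gnu-hipcc-eamxx"))
# {'fortran': 'gnu', 'cxx': 'hipcc', 'qualifier': 'eamxx'}
print(parse_name("cray"))
# {'fortran': 'cray', 'cxx': 'cray', 'qualifier': None}
```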

@rljacob
Member Author

rljacob commented Jan 23, 2025

This is good but I would like to do away with "model-specific" compiler configs. Why does EAMxx need that?

@ndkeen
Contributor

ndkeen commented Jan 23, 2025

Again for the record, I don't think this is the best approach. We don't need those longer names.

@grnydawn
Contributor

> This is good but I would like to do away with "model-specific" compiler configs. Why does EAMxx need that?

Most compiler names would not include the third, model-specific part. The model name in "gnu-hipcc-eamxx" is added because that configuration may not be able to compile the Omega code, due to the "eamxx-specific" compiler settings explained above. If we still want to remove the third part, I believe the "eamxx-specific" compiler configuration would need to be removed.

@grnydawn
Contributor

> Again for the record, I don't think this is the best approach. We don't need those longer names.

As noted above, most compiler names would not include the third, model-specific part. So, in most cases, names under this scheme would be about as long as in the other suggestions. I think we need to decide which information to retain in the compiler name and which to discard.
