-
I would suggest starting with your MPI installation: make sure that it supports GPU-aware MPI. These segmentation faults are common when Neko is configured with device MPI but the MPI installation doesn't support it.
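For Open MPI, one way to check this is to ask `ompi_info` whether the library was built with CUDA support. The sketch below assumes Open MPI (other MPI implementations report this differently), and the helper name `mpi_cuda_support` is made up for illustration.

```shell
# Hypothetical helper: reads `ompi_info --parsable --all` output on stdin and
# prints the value of mpi_built_with_cuda_support ("true" or "false"),
# or "unknown" if the parameter is not listed.
mpi_cuda_support() {
    awk -F: '/mpi_built_with_cuda_support:value:/ { v = $NF }
             END { print (v == "" ? "unknown" : v) }'
}

# Query the Open MPI installation found first in PATH, if there is one.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --parsable --all | mpi_cuda_support
else
    echo "ompi_info not found in PATH (is this Open MPI?)"
fi
```

If this prints `false` or `unknown` while Neko is built with device MPI, the MPI library is the first thing to rebuild or replace.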
-
Thank you very much, Niclas Jansson.
-
Thank you.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
export PKG_CONFIG_PATH=/home/wzc/json-fortran-8.4.0/installation/lib/pkgconfig:$PKG_CONFIG_PATH
cd gslib-master
export GSLIB=/home/wzc/gslib-master/build
tar xvf parmetis-4.0.3.tar
cd neko-0.7.2
./configure FC=gfortran CC=gcc
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_VISIBLE_DEVICES=0,1,2,3,5,6,7
mpirun --allow-run-as-root -np 4 ./neko pipe.case >& out &
-
Yes, Open MPI supports it, but it has to be explicitly configured with CUDA support when it is built and installed.
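As a rough sketch, a CUDA-aware Open MPI is built by passing `--with-cuda` to Open MPI's own `configure`. The CUDA path matches the one used elsewhere in this thread. The source directory name and the install prefix are placeholders, not taken from this setup.

```shell
# Run inside an unpacked Open MPI source tree (directory name is a placeholder).
cd openmpi-x.y.z

# --with-cuda points at the CUDA toolkit; the prefix is an assumed install location.
./configure --with-cuda=/usr/local/cuda --prefix="$HOME/opt/openmpi-cuda"
make -j 8
make install

# Put the new wrappers (mpicc, mpif90, mpirun) first in PATH before rebuilding Neko.
export PATH="$HOME/opt/openmpi-cuda/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/opt/openmpi-cuda/lib:$LD_LIBRARY_PATH"
```

Neko then needs to be reconfigured and rebuilt against this MPI, so that `mpif90` at build time and `mpirun` at run time come from the same installation.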
-
Thank you, Adperezm and Niclas.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
It was created by neko configure 0.7.2, which was
$ ./configure FC=gfortran CC=gcc MPIFC=mpif90 MPICC=mpicc FCGLAGS=-02 -pedantic -std=f2008 --prefix=/home/wzc/neko-0.7.2/installation --with-gslib=/home/wzc/gslib-master/build --with-parmetis=/home/wzc/parmetis-4.0.3/installation --enable-contrib --with-cuda=/usr/local/cuda CUDA_FLAGS=-03 CUDA_ARCH=-arch=sm_80 NVCC=/usr/local/cuda/bin/nvcc
---------
Platform.
---------
hostname = wzc-X11DPG-OT
/usr/bin/uname -p = x86_64
/bin/arch = x86_64
PATH: /usr/local/sbin
-----------
Core tests.
-----------
configure:2646: checking for a BSD-compatible install
configure:3567: $? = 0
Fatal Error: Coarrays disabled at (1), use '-fcoarray=' to enable
configure:4664: $? = 0
configure:5394: $? = 0
Fatal Error: ac_nonexistent.h: No such file or directory
Error: Unclassifiable statement at (1)
Error: Unclassifiable statement at (1)
Error: Unclassifiable statement at (1)
Error: Unclassifiable statement at (1)
Error: Unclassifiable statement at (1)
Error: Unclassifiable statement at (1)
Error: Unclassifiable statement at (1)
Error: Unclassifiable statement at (1)
Error: Unclassifiable statement at (1)
Error: Unclassifiable statement at (1)
----------------------
Running config.status.
----------------------
This file was extended by neko config.status 0.7.2, which was
CONFIG_FILES =
on wzc-X11DPG-OT
config.status:927: creating Makefile
----------------
Cache variables.
----------------
ac_cv_build=x86_64-pc-linux-gnu
-----------------
Output variables.
-----------------
ACLOCAL='${SHELL} /home/wzc/neko-0.7.2/missing aclocal-1.15'
-----------
confdefs.h.
-----------
/* confdefs.h */
configure: exit 0
-
Hi all, I just figured out that compiling Neko in the Docker container that was used for running nekRS partially solves the segfault problem:
./makeneko pipe.f90
mpirun -np 4 ./neko pipe.case >& out &
Any hints about running on multiple GPUs?
-
Dear Neko developers and users,
Thank you for the great work of making Neko publicly available.
I built neko-0.7.2 on my local workstation with 8 RTX 4090 GPUs.
The build went smoothly; however, when I try to run the simulation, it fails with the segmentation faults shown below.
Any hints about the error?
    _  __  ____  __ __  ____
   / |/ / / __/ / //_/ / __ \
  /    / / _/  / ,<   / /_/ /
 /_/|_/ /___/ /_/|_|  \____/
(version: 0.7.2)
(build: 2024-05-07 on x86_64-pc-linux-gnu using gnu)
-------Job Information--------
Start time: 06:46 / 2024-05-07
Running on: 2 MPI ranks
CPU type : Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
Bcknd type: Accelerator (CUDA)
Dev. name : NVIDIA GeForce RTX 4090
Real type : double precision
-------------Case-------------
Reading case file pipe.case
-------------Mesh-------------
Reading a binary Neko file pipe.nmsh
gdim = 3, nelements = 36480
Reading elements
Reading BC/zone data
Reading deformation data
Mesh read, setting up connectivity
Done setting up mesh and connectivity
-----Material properties------
Read non-dimensional values:
Re : 2.650000E+03
Set dimensional values:
rho : 1.000000E+00
mu : 3.773585E-04
[wzc-X11DPG-OT:472013] Read -1, expected 159976, errno = 14
[wzc-X11DPG-OT:472012] Read -1, expected 159976, errno = 14
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x73ee96823960 in ???
#1 0x73ee96822ac5 in ???
#2 0x73ee9644251f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#0 0x79fbf5c23960 in ???
#1 0x79fbf5c22ac5 in ???
#2 0x79fbf584251f in ???
at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#3 0x73ee965aedcd in ???
at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:317
#4 0x73ee8b936243 in ???
#5 0x73ee8b85f555 in ???
#6 0x73ee8b85d810 in ???
#7 0x73ee8b93aae4 in ???
#8 0x73ee8b93af90 in ???
#3 0x79fbf59aedcd in ???
at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:317
#4 0x79fbeae24243 in ???
#5 0x79fbea847555 in ???
#6 0x79fbea845810 in ???
#7 0x79fbeae28ae4 in ???
#8 0x79fbeae28f90 in ???
#9 0x73ee96376713 in ???
#10 0x73ee9932f744 in ???
#9 0x79fbf5ab9713 in ???
#10 0x79fbf60c9744 in ???
#11 0x79fbf610a326 in ???
#11 0x73ee99370326 in ???
#12 0x6106d1e6ba06 in __gs_device_mpi_MOD_gs_device_mpi_nbwait
at gs/bcknd/device/gs_device_mpi.F90:424
#12 0x5ac32c9cea06 in __gs_device_mpi_MOD_gs_device_mpi_nbwait
at gs/bcknd/device/gs_device_mpi.F90:424
#13 0x6106d1e6dceb in __gather_scatter_MOD_gs_op_vector
at gs/gather_scatter.f90:1313
#14 0x6106d1e784fc in __gather_scatter_MOD_gs_init
at gs/gather_scatter.f90:209
#13 0x5ac32c9d0ceb in __gather_scatter_MOD_gs_op_vector
at gs/gather_scatter.f90:1313
#14 0x5ac32c9db4fc in __gather_scatter_MOD_gs_init
at gs/gather_scatter.f90:209
#15 0x6106d1f59920 in __fluid_scheme_MOD_fluid_scheme_init_common
at fluid/fluid_scheme.f90:305
#16 0x6106d1f5b406 in __fluid_scheme_MOD_fluid_scheme_init_all
at fluid/fluid_scheme.f90:528
#15 0x5ac32cabc920 in __fluid_scheme_MOD_fluid_scheme_init_common
at fluid/fluid_scheme.f90:305
#16 0x5ac32cabe406 in __fluid_scheme_MOD_fluid_scheme_init_all
at fluid/fluid_scheme.f90:528
#17 0x5ac32cad4564 in __fluid_pnpn_MOD_fluid_pnpn_init
at fluid/fluid_pnpn.f90:132
#17 0x6106d1f71564 in __fluid_pnpn_MOD_fluid_pnpn_init
at fluid/fluid_pnpn.f90:132
#18 0x6106d1f53d9b in case_init_common
at /home/wzc/neko-0.7.2/src/case.f90:206
#19 0x6106d1f567f6 in __case_MOD_case_init_from_file
#18 0x5ac32cab6d9b in case_init_common
at /home/wzc/neko-0.7.2/src/case.f90:206
#19 0x5ac32cab97f6 in __case_MOD_case_init_from_file
at /home/wzc/neko-0.7.2/src/case.f90:117
at /home/wzc/neko-0.7.2/src/case.f90:117
#20 0x5ac32cb212a6 in __neko_MOD_neko_init
at /home/wzc/neko-0.7.2/src/neko.f90:255
#20 0x6106d1fbe2a6 in __neko_MOD_neko_init
at /home/wzc/neko-0.7.2/src/neko.f90:255
#21 0x5ac32cb845e9 in turboneko
at /home/wzc/neko-0.7.2/src/driver.f90:6
#22 0x5ac32c999e2e in main
at /home/wzc/neko-0.7.2/src/driver.f90:3
#21 0x6106d20215e9 in turboneko
at /home/wzc/neko-0.7.2/src/driver.f90:6
#22 0x6106d1e36e2e in main
at /home/wzc/neko-0.7.2/src/driver.f90:3
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 0 with PID 0 on node wzc-X11DPG-OT exited on signal 11 (Segmentation fault).
Best Wishes,
Zhicheng Wang