
test-boutpp-legacy-model fails #3041

Open
sagitter opened this issue Dec 8, 2024 · 4 comments

sagitter commented Dec 8, 2024

Hi all.

This ticket reports the bout++-5.1.1 test-boutpp-legacy-model failure in Fedora 42 with openmpi-5.0.6, sundials-7.1.1, and petsc-3.22.2:

 4/49 Test  #6: test-boutpp-legacy-model ............***Failed    1.85 sec
BOUT++ version 5.2.0
Revision: Unknown
Code compiled on Oct 25 2024 at 00:00:00

B.Dudson (University of York), M.Umansky (LLNL) 2007
Based on BOUT by Xueqiao Xu, 1999

Processor number: 0 of 1

pid: 19020

Compile-time options:
	Runtime error checking enabled, level 2
	Parallel NetCDF support disabled
	Metrics mode is 2D
	FFT support enabled
	Natural language support enabled
	LAPACK support enabled
	NetCDF support enabled (NetCDF4)
	PETSc support enabled
	Pretty function name support enabled
	PVODE support enabled
	Score-P support disabled
	SLEPc support disabled
	SUNDIALS support enabled
	Backtrace in exceptions enabled
	Colour in logs enabled
	OpenMP parallelisation disabled
	Extra debug output disabled
	Floating-point exceptions disabled
	Signal handling support enabled
	Field name tracking enabled
	Message stack enabled
	Compiled with flags : "-Wall -Wextra -Wnull-dereference -Wno-cast-function-type -DCHECK=2 -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -mtls-dialect=gnu2 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer-O2 -g -DNDEBUG"
	Command line options for this run : boutpp 
Reading options file data/BOUT.inp
	Option datadir = data (default)
	Option optionfile = BOUT.inp (default)
	Option settingsfile = BOUT.settings (default)
Writing options to file data/BOUT.settings
	Option mesh:type = bout (default)

Getting grid data from options
	Option mesh:calcParallelSlices_on_communicate = 1 (default)
	Option mesh:maxregionblocksize = 64 (default)
	Option mesh:staggergrids = true (data/BOUT.inp)
	Option mesh:include_corner_cells = 1 (default)
	Option mesh:ddz:fft_filter = 0 (default)
	Option mesh:symmetricGlobalX = 1 (default)
	Option optionfile = BOUT.inp (default)
WARNING: The default of this option has changed in release 4.1.
If you want the old setting, you have to specify mesh:symmetricGlobalY=false in BOUT.inp
	Option mesh:symmetricGlobalY = 1 (default)
Loading mesh
	Option input:transform_from_field_aligned = 1 (default)
	Option input:max_recursion_depth = 0 (default)
	Option mesh:n = 1 (data/BOUT.inp)
	Option mesh:MXG = 0 (data/BOUT.inp)
	Option mesh:nx = 1 (data/BOUT.inp)
	Option mesh:n = 1 (data/BOUT.inp)
	Option mesh:ny = 1 (data/BOUT.inp)
	Option mesh:n = 1 (data/BOUT.inp)
	Option mesh:nz = 1 (data/BOUT.inp)
	Read nz from input grid file
	Grid size: 1 x 1 x 1
	Option mesh:MXG = 0 (data/BOUT.inp)
	Option mesh:MYG = 0 (data/BOUT.inp)
	Guard cells (x,y,z): 0, 0, 0
Variable 'ixseps1' not in mesh options. Setting to 1
Variable 'ixseps2' not in mesh options. Setting to 1
Variable 'jyseps1_1' not in mesh options. Setting to -1
Variable 'jyseps1_2' not in mesh options. Setting to 0
Variable 'jyseps2_1' not in mesh options. Setting to 0
Variable 'jyseps2_2' not in mesh options. Setting to 0
Variable 'ny_inner' not in mesh options. Setting to 0
Finding value for NXPE (ideal = 1.000000)
	Candidate value: 1
	 -> Good value
	Domain split (NXPE=1, NYPE=1) into domains (localNx=1, localNy=1)
	Option IncIntShear = 0 (default)
	Option periodicX = 0 (default)
	Option async_send = 0 (default)
	Option ZMIN = 0 (default)
	Option ZMAX = 1 (default)
	EQUILIBRIUM IS SINGLE NULL (SND) 
Connection between top of Y processor 0 and bottom of 0 in range 0 <= x < 1
=> This processor sending in up
=> This processor sending in down
WARNING adding connection: poloidal index -1 out of range
	MYPE_IN_CORE = true
	DXS = 1, DIN = 0. DOUT = -1
	UXS = 1, UIN = 0. UOUT = -1
	XIN = -1, XOUT = -1
	Twist-shift: DI UI 
	Option twistshift = 0 (default)
Variable 'ShiftAngle' not in mesh options. Setting to empty vector
No boundary regions; domain is periodic
No boundary regions in this processor
Constructing default regions
	Boundary region inner X
	Boundary region outer X
	Option mesh:extrapolate_x = 0 (default)
	Option mesh:extrapolate_y = 0 (default)
Variable 'dx' not in mesh options. Setting to 1.000000e+00
	Option mesh:dy = 1/n/2/pi (data/BOUT.inp)
	Option mesh:n = 1 (data/BOUT.inp)
	Option ZMIN = 0 (default)
	Option ZMAX = 1 (default)
Variable 'dz' not in mesh options. Setting to 6.283185e+00
	Option mesh:paralleltransform:type = identity (default)
Variable 'g11' not in mesh options. Setting to 1.000000e+00
Variable 'g22' not in mesh options. Setting to 1.000000e+00
Variable 'g33' not in mesh options. Setting to 1.000000e+00
Variable 'g12' not in mesh options. Setting to 0.000000e+00
Variable 'g13' not in mesh options. Setting to 0.000000e+00
Variable 'g23' not in mesh options. Setting to 0.000000e+00
	Local maximum error in diagonal inversion is 0.000000e+00
	Local maximum error in off-diagonal inversion is 0.000000e+00
Variable 'J' not in mesh options. Setting to 0.000000e+00
	WARNING: Jacobian 'J' not found. Calculating from metric tensor
Variable 'Bxy' not in mesh options. Setting to 0.000000e+00
	WARNING: Magnitude of B field 'Bxy' not found. Calculating from metric tensor
Variable 'ShiftTorsion' not in mesh options. Setting to 0.000000e+00
	WARNING: No Torsion specified for zShift. Derivatives may not be correct
Calculating differential geometry terms
	Communicating connection terms
	Option non_uniform = 1 (default)
Variable 'd2x' not in mesh options. Setting to 0.000000e+00
	WARNING: differencing quantity 'd2x' not found. Calculating from dx
Variable 'd2y' not in mesh options. Setting to 0.000000e+00
	WARNING: differencing quantity 'd2y' not found. Calculating from dy
	done
	Option input:transform_from_field_aligned = true (default)
	Option input:max_recursion_depth = 0 (default)
	Option append = 0 (default)
	Option datadir = data (default)
	Option output:enabled = 1 (default)
	Option datadir = data (default)
	Option restart_files:enabled = 1 (default)
	Option solver:type = cvode (default)
	Option solver:monitor_timestep = 0 (default)
	Option solver:save_repeat_run_id = 0 (default)
	Option solver:is_nonsplit_model_diffusive = 1 (default)
	Option solver:mms = 0 (default)
	Option solver:mms_initialise = 0 (default)
	Option nout = 10 (data/BOUT.inp)
	Option solver:nout = 10 (default)
	Option timestep = 0.1 (data/BOUT.inp)
	Option solver:output_step = 0.1 (default)
	Option solver:diagnose = 0 (default)
	Option solver:adams_moulton = 0 (default)
	Option solver:func_iter = 0 (default)
	Option solver:cvode_max_order = -1 (default)
	Option solver:cvode_stability_limit_detection = 0 (default)
	Option solver:atol = 1e-18 (data/BOUT.inp)
	Option solver:rtol = 1e-14 (data/BOUT.inp)
	Option solver:use_vector_abstol = 0 (default)
	Option solver:mxstep = 500 (default)
	Option solver:max_timestep = 0 (default)
	Option solver:min_timestep = 0 (default)
	Option solver:start_timestep = 0 (default)
	Option solver:mxorder = -1 (default)
	Option solver:max_nonlinear_iterations = 3 (default)
	Option solver:apply_positivity_constraints = 0 (default)
	Option solver:maxl = 5 (default)
	Option solver:use_precon = 0 (default)
	Option solver:rightprec = 0 (default)
	Option solver:use_jacobian = 0 (default)
	Option solver:cvode_nonlinear_convergence_coef = 0.1 (default)
	Option solver:cvode_linear_convergence_coef = 0.05 (default)
Setting up output (experimental output) file
	Option restart = 0 (default)
Variable 'grid_id' not in mesh options. Setting to 
Variable 'hypnotoad_version' not in mesh options. Setting to 
Variable 'hypnotoad_git_hash' not in mesh options. Setting to 
Variable 'hypnotoad_git_diff' not in mesh options. Setting to 
Variable 'hypnotoad_geqdsk_filename' not in mesh options. Setting to 
Variable 'grid_id' not in mesh options. Setting to 
Variable 'hypnotoad_version' not in mesh options. Setting to 
Variable 'hypnotoad_git_hash' not in mesh options. Setting to 
Variable 'hypnotoad_git_diff' not in mesh options. Setting to 
Variable 'hypnotoad_geqdsk_filename' not in mesh options. Setting to 
	Option wall_limit = -1 (default)
	Option stopCheck = 0 (default)
	Option stopCheckName = BOUT.stop (default)
	Option datadir = data (default)
Setting boundary for variable n
	Option input:transform_from_field_aligned = true (default)
	Option input:max_recursion_depth = 0 (default)
	Option all:function = 0.0 (default)
	Option all:scale = 1 (default)
	Option all:evolve_bndry = 0 (default)
	Option n:evolve_bndry = 0 (default)
Solver running for 10 outputs with output timestep of 1.000000e-01
Initialising solver
Initialising SUNDIALS' CVODE solver
	3d fields = 1, 2d fields = 0 neq=1, local_N=1
	Using BDF method
	Using Newton iteration
	No preconditioning
	Using difference quotient approximation for Jacobian
Running simulation

Run ID: 485f3aef-59ff-4137-83c7-d47312cb242b

Run started at  : Sun Dec  8 20:26:53 2024
	Option restart = false (default)
	Option append = false (default)
	Option dump_on_restart = 1 (default)
	Option input:validate = 0 (default)
	Option input:error_on_unused_options = 1 (default)
	Option optionfile = BOUT.inp (default)
	Option datadir = data (default)
Sim Time  |  RHS evals  | Wall Time |  Calc    Inv   Comm    I/O   SOLVER

0.000e+00          1       2.40e-02     0.7    0.0    0.3  147.5  -48.4
1.000e-01         16       2.84e-02     1.8    0.0    0.0   67.4   30.8
2.000e-01          1       2.07e-02     0.3    0.0    0.0   63.0   36.7
3.000e-01          1       2.06e-02     0.3    0.0    0.0   63.3   36.5
4.000e-01          1       2.08e-02     0.3    0.0    0.0   63.8   35.9
5.000e-01          1       2.05e-02     0.3    0.0    0.0   63.2   36.5
6.000e-01          2       2.07e-02     0.4    0.0    0.0   63.6   36.0
7.000e-01          1       2.09e-02     0.3    0.0    0.0   63.5   36.2
8.000e-01          1       2.09e-02     0.3    0.0    0.0   63.5   36.3
9.000e-01          1       2.16e-02     0.3    0.0    0.0   65.4   34.4
1.000e+00          1       2.13e-02     0.5    0.0    0.0   63.2   36.3
-  Step 10 of 10. Elapsed 0:00:00.2 ETA 0:00:-0.0
Run finished at  : Sun Dec  8 20:26:53 2024
Run time : 0 s
	Option datadir = data (default)
	Option settingsfile = BOUT.settings (default)
Writing options to file data/BOUT.settings
	Option time_report:show = 0 (default)
*** The MPI_Comm_rank() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[b296200f4c484f7ab5e8cf496fd97116:19020] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Full build log: https://download.copr.fedorainfracloud.org/results/sagitter/ForTesting/fedora-rawhide-x86_64/08364151-bout++/builder-live.log.gz

@dschwoerer (Contributor) commented

Thanks for the heads up. I will try to debug this ...

@dschwoerer (Contributor) commented

Great, this seems to be a race condition / heisenbug ...

While I was not able to reproduce this with gdb, I think it could be caused by this executing after MPI_Finalize:

Thread 1 "python3" hit Breakpoint 2.1, 0x0000fffff469d058 in PMPI_Comm_rank () from /usr/lib64/openmpi/lib/libmpi.so.40
#0  0x0000fffff469d058 in PMPI_Comm_rank () from /usr/lib64/openmpi/lib/libmpi.so.40
#1  0x0000fffff495afe4 [PAC] in SUNLogger_Destroy () from /usr/lib64/openmpi/lib/libsundials_core.so.7
#2  0x0000fffff4956a00 [PAC] in SUNContext_Free () from /usr/lib64/openmpi/lib/libsundials_core.so.7
#3  0x0000fffff71cf4d4 [PAC] in sundials::Context::~Context (this=<optimized out>, this=<optimized out>) at /usr/include/c++/14/bits/unique_ptr.h:464
#4  CvodeSolver::~CvodeSolver (this=<optimized out>, this=<optimized out>) at /builddir/build/BUILD/bout++-5.2.0.dev574+ge3f21cd2f-build/BOUT++-v5.2.0.dev574+ge3f21cd2f/src/solver/impls/cvode/cvode.cxx:165
#5  0x0000fffff71cf538 [PAC] in CvodeSolver::~CvodeSolver (this=<optimized out>, this=<optimized out>) at /builddir/build/BUILD/bout++-5.2.0.dev574+ge3f21cd2f-build/BOUT++-v5.2.0.dev574+ge3f21cd2f/src/solver/impls/cvode/cvode.cxx:165
#6  0x0000fffff73f33bc [PAC] in std::default_delete<Solver>::operator() (this=<optimized out>, __ptr=<optimized out>) at /usr/include/c++/14/bits/unique_ptr.h:93
#7  std::unique_ptr<Solver, std::default_delete<Solver> >::~unique_ptr (this=<optimized out>, this=<optimized out>) at /usr/include/c++/14/bits/unique_ptr.h:399
#8  PythonModel::~PythonModel (this=<optimized out>, this=<optimized out>) at /builddir/build/BUILD/bout++-5.2.0.dev574+ge3f21cd2f-build/BOUT++-v5.2.0.dev574+ge3f21cd2f/build_openmpi/tools/pylib/_boutpp_build/helper.h:123
#9  PythonModel::~PythonModel (this=<optimized out>, this=<optimized out>) at /builddir/build/BUILD/bout++-5.2.0.dev574+ge3f21cd2f-build/BOUT++-v5.2.0.dev574+ge3f21cd2f/build_openmpi/tools/pylib/_boutpp_build/helper.h:123
#10 0x0000fffff73ee3d4 [PAC] in __pyx_pf_9libboutpp_16PhysicsModelBase_12_boutpp_dealloc (__pyx_v_self=0xfffff76cb570) at /builddir/build/BUILD/bout++-5.2.0.dev574+ge3f21cd2f-build/BOUT++-v5.2.0.dev574+ge3f21cd2f/build_openmpi/tools/pylib/_boutpp_build/libboutpp.cpp:53135
#11 __pyx_pw_9libboutpp_16PhysicsModelBase_13_boutpp_dealloc (__pyx_v_self=0xfffff76cb570, __pyx_args=<optimized out>, __pyx_nargs=<optimized out>, __pyx_kwds=<optimized out>) at /builddir/build/BUILD/bout++-5.2.0.dev574+ge3f21cd2f-build/BOUT++-v5.2.0.dev574+ge3f21cd2f/build_openmpi/tools/pylib/_boutpp_build/libboutpp.cpp:53096
#12 0x0000fffff73efd28 [PAC] in __pyx_pf_9libboutpp_12PhysicsModel_10_boutpp_dealloc (__pyx_self=<optimized out>, __pyx_v_self=<optimized out>) at /builddir/build/BUILD/bout++-5.2.0.dev574+ge3f21cd2f-build/BOUT++-v5.2.0.dev574+ge3f21cd2f/build_openmpi/tools/pylib/_boutpp_build/libboutpp.cpp:55123
#13 __pyx_pw_9libboutpp_12PhysicsModel_11_boutpp_dealloc (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_nargs=<optimized out>, __pyx_kwds=<optimized out>) at /builddir/build/BUILD/bout++-5.2.0.dev574+ge3f21cd2f-build/BOUT++-v5.2.0.dev574+ge3f21cd2f/build_openmpi/tools/pylib/_boutpp_build/libboutpp.cpp:55059
#14 0x0000fffff73f3e5c [PAC] in __pyx_pf_9libboutpp_2finalise (__pyx_self=<optimized out>) at /builddir/build/BUILD/bout++-5.2.0.dev574+ge3f21cd2f-build/BOUT++-v5.2.0.dev574+ge3f21cd2f/build_openmpi/tools/pylib/_boutpp_build/libboutpp.cpp:56134
#15 __pyx_pw_9libboutpp_3finalise (__pyx_self=<optimized out>, unused=<optimized out>) at /builddir/build/BUILD/bout++-5.2.0.dev574+ge3f21cd2f-build/BOUT++-v5.2.0.dev574+ge3f21cd2f/build_openmpi/tools/pylib/_boutpp_build/libboutpp.cpp:55887
#16 0x0000fffff7c36470 [PAC] in atexit_callfuncs () from /lib64/libpython3.13.so.1.0
#17 0x0000fffff7c031f8 [PAC] in _Py_Finalize.constprop.0 () from /lib64/libpython3.13.so.1.0
#18 0x0000fffff7c2e938 [PAC] in Py_RunMain () from /lib64/libpython3.13.so.1.0
#19 0x0000fffff7bc0160 [PAC] in Py_BytesMain () from /lib64/libpython3.13.so.1.0
#20 0x0000fffff78362dc [PAC] in __libc_start_call_main () from /lib64/libc.so.6
#21 0x0000fffff78363bc [PAC] in __libc_start_main_impl () from /lib64/libc.so.6
#22 0x0000aaaaaaab00f0 [PAC] in _start ()

@sagitter do you think it would be reasonable to not call MPI_Comm_rank in a destructor, and if this is needed cache it from an earlier call?

Otherwise I am not sure how to avoid this race condition.
I already have a loop that deallocates before calling finalise, but for some reason this does not seem to happen sequentially:
https://github.com/boutproject/BOUT-dev/blob/master/tools/pylib/_boutpp_build/boutpp.pyx.jinja#L1205


sagitter commented Dec 9, 2024

> @sagitter do you think it would be reasonable to not call MPI_Comm_rank in a destructor, and if this is needed cache it from an earlier call?

I don't know how to help you; I'm sorry.

@dschwoerer (Contributor) commented

I looked at the code: the MPI_Comm_rank call could be avoided, but there is another MPI call, to free a communicator. I doubt that can reasonably be moved out, so this needs to be fixed in BOUT++ :-(
