Possible regression in jaxlib 0.4.25+ causing training deadlocks on GPU #25453
-
Unfortunately it's impossible to say what's going wrong from this information alone. It looks like a deadlock, but we can't tell how or why without knowing more. I can think of two things that might help: b) a reproducer that we could run would also help.
-
We have a model training script that began to experience deadlocks during GPU computation after upgrading from jax 0.4.13 to 0.4.25+. Specifically, the issue appears with jaxlib 0.4.25, disappears with 0.4.26, and is present again from jaxlib 0.4.27 onwards. We'd appreciate any insight into how we can further understand what's going on. We're attempting to create an MRE in the meantime, but our training code is quite complicated and we're still bisecting the issue. The deadlocks occur in both single- and multi-GPU training runs, and when testing the single-GPU case, removing all sharding-related code does not resolve them.
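For concreteness, the MRE we're working toward has roughly the shape below. This is only a sketch: every name, shape, and the toy loss are placeholders, and our real jitted step (described in the next section) is far more involved.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, batch):
    # Stand-in loss: a linear model with squared error, purely illustrative.
    preds = batch["x"] @ params["w"] + params["b"]
    return jnp.mean((preds - batch["y"]) ** 2)


@jax.jit
def single_step(params, batch):
    # Loss and gradients in one jitted function, mirroring the structure of
    # the step that hangs in our training code.
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return params, loss


params = {"w": jnp.zeros((128, 1)), "b": jnp.zeros(())}
key = jax.random.PRNGKey(0)

for step in range(10_000_000):
    key, kx, ky = jax.random.split(key, 3)
    batch = {
        "x": jax.random.normal(kx, (256, 128)),
        "y": jax.random.normal(ky, (256, 1)),
    }
    params, loss = single_step(params, batch)
    if step % 1000 == 0:
        # float() blocks on the device result, so a hang inside the step
        # would surface here.
        print(step, float(loss))
```

The idea is to keep everything on-device inside a single jitted call so that, if the hang is tied to dispatch or execution of a jitted computation, a long-running loop like this would eventually hit it.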
Regression description:
At some point during training we call into a jitted `single_step` function (computing loss and gradients) and the function never exits (nor does it crash), as evidenced by a `py-spy` trace. This happens non-deterministically, minutes to hours into a training run. We use Weights & Biases for logging, and its system resource logs show that at the time of the deadlock our GPU power usage drops to a nontrivial level and stays there with extremely low variation (image below), while the Python process is simply waiting for control to return from the jitted call. To reiterate, this appears to be a regression: training runs just fine on jaxlib 0.4.24 and below.
Here's what the GPU power usage looks like, with the hang occurring at around ~26k on the x-axis:
For reference, our H100s idle at ~100W, so something is happening.
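In case it's useful, here's a sketch (not something we've validated against this particular hang) of how periodic Python-level stack dumps can be taken from inside the training process using only the standard library. Like a `py-spy` dump without `--native`, it will only show the Python frame blocked in the jitted call if the hang is inside XLA/CUDA code, but it confirms where Python is stuck without attaching an external tool.

```python
import faulthandler
import signal
import sys

# Every 10 minutes, write the traceback of every thread to stderr. The dump
# runs from faulthandler's watchdog thread, so it still fires even when the
# main thread is blocked inside a native call.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)

# Also allow an on-demand dump via `kill -USR1 <pid>` from inside the pod.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```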
Attempt at diagnostics:
When I exec into the training pod after the hang has occurred, I see that the training Python process is alive (PID 1) but is waiting on a futex. I obtained a backtrace from `gdb`; I can provide the rest of the `bt` if anyone thinks it would be helpful.
Obligatory environment dump:
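For anyone comparing setups, the same information can be collected with something along these lines (assuming a jax build recent enough to expose `print_environment_info`):

```python
import jax
import jaxlib

# Versions of the two packages involved in the regression window.
print("jax:", jax.__version__, "jaxlib:", jaxlib.__version__)

# Prints jax/jaxlib/numpy/Python versions plus device info (and nvidia-smi
# output on GPU machines).
jax.print_environment_info()
```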
Any insights into what's happening or suggestions for debugging this issue would be massively appreciated!