DeepSpeed with ZeRO3 strategy cannot build 'fused_adam' #6892
Comments
@LeonardoZini, can you please share the log showing the
With ZeRO stage 3 model sharding, special handling is required to access parameters. See the following links for more details.
The logs are this one
and this one.
Thank you for the references!
@LeonardoZini, I noticed a compiler error in your build log. Can you try the following to see if it reproduces the compile-time error?
>>> import torch, deepspeed
>>> from deepspeed.ops.adam.fused_adam import FusedAdam
>>> x = FusedAdam([torch.empty(100)])
@LeonardoZini, were you able to confirm the compile-time error?
Sorry for the late reply. I have attached the logs:
STDERR
STDOUT
GCC version
I know that the error comes from the unsupported compiler version.
I don't have access to an older compiler on the cluster I am working on, so this is now a relatively low priority for me. I don't know whether this is expected behavior, but I suggest extending compatibility to newer compilers.
@LeonardoZini, thanks for the update. I am glad that you are unblocked. I don't think the problem is a lack of compatibility with newer compilers, since it works in my environment with gcc-13.
(torch_2_cuda) tjruwase@IronLambda:~$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
(torch_2_cuda) tjruwase@IronLambda:~$ python --version
Python 3.12.3
(torch_2_cuda) tjruwase@IronLambda:~$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
(torch_2_cuda) tjruwase@IronLambda:~$ python -c "import torch, deepspeed; from deepspeed.ops.adam.fused_adam import FusedAdam; x = FusedAdam([torch.empty(100)])"
I suspect that the problem in your environment might stem from your CUDA 11.8. Is it okay to close this ticket?
NVIDIA's installation guide says that GCC 11 minor versions are supported by CUDA 11.8. GCC 12.x is likely unsupported.
Closing, since a version incompatibility has been identified as the cause. Please reopen if needed.
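For anyone hitting the same build failure, here is a minimal sketch of a pre-flight check that compares the host GCC against the CUDA toolkit before DeepSpeed tries to compile `fused_adam`. The helper names and the `MAX_GCC_MAJOR` table are illustrative assumptions, not part of DeepSpeed: the 11.8 entry follows the comment above, and the 12.6 entry only reflects the environment that worked in this thread. Both should be checked against NVIDIA's installation guide.

```python
import re
from typing import Optional, Tuple

# ASSUMPTION: highest GCC major version accepted per CUDA release.
# (11, 8) -> 11 follows the comment in this thread; (12, 6) -> 13 is only
# what worked in the maintainer's environment. Verify against NVIDIA's guide.
MAX_GCC_MAJOR = {(11, 8): 11, (12, 6): 13}


def parse_gcc_major(version_output: str) -> Optional[int]:
    """Extract the GCC major version from the first line of `gcc --version`."""
    lines = version_output.strip().splitlines()
    if not lines:
        return None
    # The version follows the parenthesized vendor string, e.g. ") 13.3.0".
    match = re.search(r"\)\s+(\d+)\.\d+", lines[0])
    return int(match.group(1)) if match else None


def parse_cuda_release(version_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' text of `nvcc --version`."""
    match = re.search(r"release\s+(\d+)\.(\d+)", version_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def host_compiler_supported(gcc_major: int,
                            cuda_release: Tuple[int, int]) -> Optional[bool]:
    """Return True/False when the pair is in the table, None when unknown."""
    limit = MAX_GCC_MAJOR.get(cuda_release)
    if limit is None:
        return None
    return gcc_major <= limit


# Version strings copied from the terminal session earlier in this thread.
gcc_out = "gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0"
nvcc_out = "Cuda compilation tools, release 12.6, V12.6.85"
print(host_compiler_supported(parse_gcc_major(gcc_out),
                              parse_cuda_release(nvcc_out)))  # True
```

In practice the two strings would come from running `gcc --version` and `nvcc --version` inside the job, since that is where the JIT build happens. A result of `None` just means the table has no entry for that CUDA release.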
Describe the bug
I am using DeepSpeed with the Hugging Face Trainer to fine-tune an LLM. With the ZeRO2 strategy I don't have any problems, but I also need to shard the parameters because I am working with long-context sequences.
When using ZeRO3, the trainer raises an exception at the beginning of training:
RuntimeError: Error building extension 'fused_adam'
I installed DeepSpeed with the command
TORCH_CUDA_ARCH_LIST="8.0;8.6;9.0" DS_BUILD_OPS=1 pip install deepspeed --global-option="build_ext"
(I also tried version 0.15.4). I also tried
TORCH_CUDA_ARCH_LIST="8.0;8.6;9.0" DS_BUILD_FUSED_ADAM=1 pip install deepspeed --global-option="build_ext"
The difference is that if I specify "offload_optimizer" and "offload_param" in the DeepSpeed JSON config file, it doesn't throw any error, but I lose all references to the model parameters (the weights are empty tensors).
I am using a SLURM scheduler, and one thing I noticed is that the ds_report outputs differ: outside SLURM, fused_adam appears to be installed, while inside SLURM it does not.
pip env
ds_report output
output
ds_report output in SLURM
output
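Since the two ds_report outputs differ, it may help to compare the build-related environment on the login node with the one inside the SLURM job. The following is a rough sketch using only the standard library; `collect_build_env` and `diff_env` are hypothetical helper names, and the list of variables is a guess at what usually affects a JIT extension build, not an exhaustive or official list.

```python
import os
import shutil
import sys


def collect_build_env() -> dict:
    """Snapshot the settings that commonly influence a JIT extension build."""
    return {
        "python": sys.executable,
        "gcc": shutil.which("gcc"),      # None if gcc is not on PATH
        "nvcc": shutil.which("nvcc"),    # None if nvcc is not on PATH
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "TORCH_CUDA_ARCH_LIST": os.environ.get("TORCH_CUDA_ARCH_LIST"),
        "TORCH_EXTENSIONS_DIR": os.environ.get("TORCH_EXTENSIONS_DIR"),
        "PATH": os.environ.get("PATH", ""),
    }


def diff_env(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    import json
    # Run once on the login node and once under srun, then compare the dumps.
    print(json.dumps(collect_build_env(), indent=2))
```

Running the script once directly and once through `srun`, then passing the two loaded dictionaries to `diff_env`, should show whether a different compiler, CUDA toolkit, or Python environment is picked up inside the job.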
output log
As highlighted in these logs, the number of parameters goes from 4568002560 before the training loop to 266240 after the training loop (the Parameter Offload entry makes me wonder...).
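One possible explanation for the drop in the reported count: under ZeRO stage 3 each parameter is partitioned, so the local tensor can be an empty placeholder while DeepSpeed keeps the full element count in a `ds_numel` attribute on the parameter. The sketch below counts parameters with that in mind. `count_parameters` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real parameters so the example runs without torch or DeepSpeed installed.

```python
from types import SimpleNamespace
from typing import Iterable


def count_parameters(params: Iterable) -> int:
    """Sum element counts, preferring `ds_numel` when a parameter has it."""
    total = 0
    for p in params:
        # ASSUMPTION: ZeRO-3 partitioned parameters expose `ds_numel`;
        # ordinary parameters only have `numel()`.
        ds_numel = getattr(p, "ds_numel", None)
        total += ds_numel if ds_numel is not None else p.numel()
    return total


# Stand-ins: two partitioned parameters (empty locally) and one dense one.
sharded = [
    SimpleNamespace(numel=lambda: 0, ds_numel=1000),
    SimpleNamespace(numel=lambda: 0, ds_numel=24),
]
dense = [SimpleNamespace(numel=lambda: 16)]

print(count_parameters(sharded + dense))           # 1040
print(sum(p.numel() for p in sharded + dense))     # 16, the naive count
```

With a real model this would be called as `count_parameters(model.parameters())`. If the naive count is much smaller than the `ds_numel`-aware one, the weights are most likely sharded rather than lost.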
System info:
Launcher context
I am launching with torchrun:
srun torchrun --nnodes=1 --nproc-per-node=2 --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT --rdzv-id=$SLURM_JOB_NAME --rdzv-backend="c10d" --max_restarts=$MAX_RESTARTS trainer.py