What is wrong when executing “sh train.sh”? #13
Comments
Hi, do you use one GPU or multiple GPUs?
Thanks so much for your quick reply. I reinstalled the PyTorch framework corresponding to the newer CUDA 12.2 version, and it is OK now. I only use a single GPU device. Do you think there will be any problems with a single GPU?
We ran it successfully using a single GPU. You need to change the corresponding parameters in the script (we describe this in README.md). By the way, what is the version of PyTorch in your environment?
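A minimal sketch of what that single-GPU launch could look like; the exact contents of train.sh are not shown in this thread, so the command below is an assumption based on the torch.distributed.launch traceback further down:

# one worker process for one visible GPU
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 runner.py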
Sorry for the late reply; my computer's disk crashed. The PyTorch version in my environment is 2.2.2.
Hi, your version of PyTorch seems to be too high. We recommend using PyTorch version 1.7.1.
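For reference, one possible way to install that version, assuming a conda environment and the CUDA 10.2 toolkit implied by the NCCL 2.7.8+cuda10.2 line in the log below:

# install PyTorch 1.7.1 built against CUDA 10.2 from the official pytorch channel
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch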
OK, I will try with PyTorch 1.7.1. Thanks so much for your comments.
I think the nonexistent volume file mentioned above results in the following error. Is that right? How can I solve it? Thanks.
Merger: Similar Token Merger (STM)
CPU: registered at /opt/conda/conda-bld/pytorch_1640811792945/work/build/aten/src/ATen/RegisterCPU.cpp:18433 [kernel]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 946051) of binary: /home/ubuntu/anaconda3/envs/lrgt_env/bin/python
sh train.sh
ubun:2375340:2375340 [0] NCCL INFO Bootstrap : Using [0]enp4s0:10.214.24.190<0>
ubun:2375340:2375340 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ubun:2375340:2375340 [0] NCCL INFO NET/IB : No device found.
ubun:2375340:2375340 [0] NCCL INFO NET/Socket : Using [0]enp4s0:10.214.24.190<0>
ubun:2375340:2375340 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
ubun:2375341:2375341 [0] NCCL INFO Bootstrap : Using [0]enp4s0:10.214.24.190<0>
ubun:2375341:2375341 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ubun:2375341:2375341 [0] NCCL INFO NET/IB : No device found.
ubun:2375341:2375341 [0] NCCL INFO NET/Socket : Using [0]enp4s0:10.214.24.190<0>
ubun:2375341:2375341 [0] NCCL INFO Using network Socket
ubun:2375341:2375387 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000
ubun:2375341:2375387 [0] NCCL INFO init.cc:840 -> 5
ubun:2375341:2375387 [0] NCCL INFO group.cc:73 -> 5 [Async thread]
ubun:2375340:2375386 [0] init.cc:573 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 1000
ubun:2375340:2375386 [0] NCCL INFO init.cc:840 -> 5
ubun:2375340:2375386 [0] NCCL INFO group.cc:73 -> 5 [Async thread]
Traceback (most recent call last):
File "runner.py", line 70, in
Traceback (most recent call last):
File "runner.py", line 70, in
main()
File "runner.py", line 50, in main
main()
File "runner.py", line 50, in main
init_distributed_mode()
File "runner.py", line 31, in init_distributed_mode
init_distributed_mode()
File "runner.py", line 31, in init_distributed_mode
torch.distributed.init_process_group(backend='nccl')
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
torch.distributed.init_process_group(backend='nccl')
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
barrier()
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in
main()
File "/home/ubuntu/anaconda3/envs/lrgt/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/lrgt/bin/python', '-u', 'runner.py', '--local_rank=1']' returned non-zero exit status 1.
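For context on the log above: the "Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000" warning means torch.distributed.launch spawned two worker processes (note --local_rank=1 in the failing command) but both ranks landed on the same CUDA device, which NCCL rejects with the "invalid usage" error. A minimal sketch of the two launch variants, assuming train.sh wraps torch.distributed.launch as the traceback indicates:

# single GPU: spawn exactly one process
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 runner.py
# two GPUs: expose both devices; runner.py is then expected to select its device from --local_rank
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 runner.py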