Before Reporting
I have pulled the latest code of the main branch and run it again; the bug still exists.
I have read the README carefully, and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
I have searched the DAMO-YOLO issues and found no similar bugs.
OS
Ubuntu
Device
Colab T4
CUDA version
12.2
TensorRT version
No response
Python version
Python 3.10.12
PyTorch version
2.3.0+cu121
torchvision version
0.18.0+cu121
Describe the bug
I'm trying to fine-tune the model on a custom dataset on Colab with this command:
!cd damo-yolo/ && export PYTHONPATH=/content/damo-yolo && CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 tools/train.py --node_rank=0 -f configs/damoyolo_tinynasL20_T.py
and it ends with the error shown in the Logs below. It looks like there is a problem with the arguments: the launcher passes --local-rank (with a hyphen), but the train parser only accepts --local_rank (with an underscore). To get training running, I had to edit tools/train.py, changing 'local_rank' on line 31 to 'local-rank'.
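For context, here is a minimal sketch of a backward-compatible version of that argument, assuming tools/train.py builds its options with a standard argparse.ArgumentParser (the real DAMO-YOLO parser has more options than shown here). It registers both spellings of the flag and falls back to the LOCAL_RANK environment variable, so neither old nor new launchers break:

```python
import argparse
import os

def make_parser():
    parser = argparse.ArgumentParser("Damo-Yolo train parser")
    # Accept both spellings: torch.distributed.launch in PyTorch 2.x passes
    # --local-rank, while older launchers and existing scripts use --local_rank.
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank",
        type=int,
        # Fall back to the environment variable exported by torchrun.
        default=int(os.environ.get("LOCAL_RANK", 0)),
        help="local rank of this process on the node",
    )
    # Placeholder for the remaining DAMO-YOLO options (-f/--config_file, etc.).
    parser.add_argument("-f", "--config_file", type=str, default=None)
    return parser

if __name__ == "__main__":
    args = make_parser().parse_args()
    print("local_rank =", args.local_rank)
```

Note also that --node_rank in the command above comes after tools/train.py, so the launcher forwards it to the script as well (hence the second unrecognized argument in the log); it is an option of torch.distributed.launch itself and presumably belongs before the script path.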
To Reproduce
export PYTHONPATH=/content/damo-yolo && CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 tools/train.py --node_rank=0 -f configs/damoyolo_tinynasL20_T.py
Hyper-parameters/Configs
-f configs/damoyolo_tinynasL20_T.py
nproc_per_node=1
Logs
/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
further instructions
warnings.warn(
usage: Damo-Yolo train parser [-h] [-f CONFIG_FILE] [--local_rank LOCAL_RANK]
[--tea_config TEA_CONFIG] [--tea_ckpt TEA_CKPT]
...
Damo-Yolo train parser: error: unrecognized arguments: --local-rank=0 --node_rank=0
E0625 18:21:09.782000 137977072861184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 2) local_rank: 0 (pid: 9042) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 198, in
main()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-06-25_18:21:09
host : 64f93cf95e56
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 9042)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
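As a side note, the FutureWarning at the top of the log suggests dropping the CLI flag entirely and reading the rank from the environment; a minimal sketch of that approach, assuming nothing else in DAMO-YOLO relies on args.local_rank coming from the parser:

```python
import os

# torchrun exports LOCAL_RANK for every worker process (and the deprecated
# torch.distributed.launch can do the same with --use-env), so the script can
# read it directly instead of expecting a --local-rank / --local_rank flag.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
```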
Screenshots
datasets folder
Additional
No response