Before Reporting
I have pulled the latest code of the main branch and run it again; the bug still exists.
I have read the README carefully, and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
I have searched the DAMO-YOLO issues and found no similar bugs.
OS
Ubuntu
Device
Colab T4
CUDA version
12.2
TensorRT version
No response
Python version
Python 3.10.12
PyTorch version
2.3.0+cu121
torchvision version
0.18.0+cu121
Describe the bug
I'm trying to fine-tune the model on a custom dataset on Colab with this command:
!cd damo-yolo/ && export PYTHONPATH=/content/damo-yolo && CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 tools/train.py --node_rank=0 -f configs/damoyolo_tinynasL20_T.py
and it ends with the error shown in the Logs below. It looks like there is a problem with the arguments: the launcher passes --local-rank (with a hyphen), but the train parser only accepts --local_rank (with an underscore). To get training running, I had to edit tools/train.py, changing 'local_rank' on line 31 to 'local-rank'.
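For context, here is a minimal sketch of a backward-compatible version of that argument, assuming tools/train.py builds its options with a standard argparse.ArgumentParser (the real DAMO-YOLO parser has more options than shown here). It registers both spellings of the flag and falls back to the LOCAL_RANK environment variable, so neither old nor new launchers break:

```python
import argparse
import os

def make_parser():
    parser = argparse.ArgumentParser("Damo-Yolo train parser")
    # Accept both spellings: torch.distributed.launch in PyTorch 2.x passes
    # --local-rank, while older launchers and existing scripts use --local_rank.
    parser.add_argument(
        "--local_rank", "--local-rank",
        dest="local_rank",
        type=int,
        # Fall back to the environment variable exported by torchrun.
        default=int(os.environ.get("LOCAL_RANK", 0)),
        help="local rank of this process on the node",
    )
    # Placeholder for the remaining DAMO-YOLO options (-f/--config_file, etc.).
    parser.add_argument("-f", "--config_file", type=str, default=None)
    return parser

if __name__ == "__main__":
    args = make_parser().parse_args()
    print("local_rank =", args.local_rank)
```

Note also that --node_rank in the command above comes after tools/train.py, so the launcher forwards it to the script as well (hence the second unrecognized argument in the log); it is an option of torch.distributed.launch itself and presumably belongs before the script path.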
To Reproduce
export PYTHONPATH=/content/damo-yolo && CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node=1 tools/train.py --node_rank=0 -f configs/damoyolo_tinynasL20_T.py
Hyper-parameters/Configs
-f configs/damoyolo_tinynasL20_T.py
nproc_per_node=1
Logs
/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
further instructions
warnings.warn(
usage: Damo-Yolo train parser [-h] [-f CONFIG_FILE] [--local_rank LOCAL_RANK]
[--tea_config TEA_CONFIG] [--tea_ckpt TEA_CKPT]
...
Damo-Yolo train parser: error: unrecognized arguments: --local-rank=0 --node_rank=0
E0625 18:21:09.782000 137977072861184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 2) local_rank: 0 (pid: 9042) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 198, in
main()
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 194, in main
launch(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 179, in launch
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-06-25_18:21:09
host : 64f93cf95e56
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 9042)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
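As a side note, the FutureWarning at the top of the log suggests dropping the CLI flag entirely and reading the rank from the environment; a minimal sketch of that approach, assuming nothing else in DAMO-YOLO relies on args.local_rank coming from the parser:

```python
import os

# torchrun exports LOCAL_RANK for every worker process (and the deprecated
# torch.distributed.launch can do the same with --use-env), so the script can
# read it directly instead of expecting a --local-rank / --local_rank flag.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
```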
Screenshots
datasets folder
Additional
No response