System Info
Ubuntu
Information
Reproduction
Run the fine-tuning code on the 2B-parameter model.
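For reference, the 2B fine-tune loads the CogVideoX transformer through diffusers. A quick, hypothetical check (not part of the repo's scripts; the model id and loading call are assumptions based on the 2B setup described above) is to list the block classes the loaded transformer actually contains, since any layer class name handed to DeepSpeed has to match one of them exactly:

```python
# Hypothetical diagnostic, not from finetune/train.py: list the module classes inside the
# CogVideoX-2b transformer as loaded by diffusers. Any MoE/leaf layer class name passed to
# DeepSpeed would have to match one of these names exactly; an empty string never will.
from diffusers import CogVideoXTransformer3DModel

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="transformer"
)
print(sorted({type(m).__name__ for m in transformer.modules()}))
```

This is only a sanity check on the model side; the exception under Expected behavior below points at the launch configuration rather than the model itself.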
Expected behavior
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/user/Music/workspace/CogVideo-main/finetune/train.py", line 19, in <module>
[rank0]: main()
[rank0]: File "/home/user/Music/workspace/CogVideo-main/finetune/train.py", line 15, in main
[rank0]: trainer.fit()
[rank0]: File "/home/user/Music/workspace/CogVideo-main/finetune/trainer.py", line 613, in fit
[rank0]: self.prepare_for_training()
[rank0]: File "/home/user/Music/workspace/CogVideo-main/finetune/trainer.py", line 323, in prepare_for_training
[rank0]: self.components.transformer, self.optimizer, self.data_loader, self.lr_scheduler = self.accelerator.prepare(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/accelerate/accelerator.py", line 1333, in prepare
[rank0]: result = self._prepare_deepspeed(*args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/accelerate/accelerator.py", line 1773, in _prepare_deepspeed
[rank0]: deepspeed_plugin.set_moe_leaf_modules(model)
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/accelerate/utils/dataclasses.py", line 1365, in set_moe_leaf_modules
[rank0]: raise Exception(
[rank0]: Exception: Could not find a transformer layer class called '' to wrap in the model.
[rank0]:[W114 11:56:53.702274058 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0114 11:56:55.162000 346242 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 346290) of binary: /home/user/anaconda3/bin/python
Traceback (most recent call last):
File "/home/user/anaconda3/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/user/anaconda3/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/user/anaconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1153, in launch_command
deepspeed_launcher(args)
File "/home/user/anaconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 846, in deepspeed_launcher
distrib_run.run(args)
File "/home/user/anaconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/user/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
[1]:
time : 2025-01-14_11:56:55
host : user
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 346291)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2025-01-14_11:56:55
host : user
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 346290)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
This is the complete error output.
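The exception itself comes from accelerate's DeepSpeed MoE handling: per the traceback, `DeepSpeedPlugin.set_moe_leaf_modules` looks up each configured MoE layer class name in the model and raises when a name cannot be found, and here the name it is looking for is the empty string ''. That suggests an empty MoE layer class setting is being passed through from the launch configuration. A hypothetical check (the default config path and the `deepspeed_moe_layer_cls_names` key are assumptions about a typical accelerate setup, not something confirmed by this log) is:

```python
# Hypothetical check, not from the repo: look for an empty MoE layer class entry in the
# accelerate config used for `accelerate launch`. Both the default path and the key name
# below are assumptions; point config_path at whatever file is passed via --config_file.
import os
import yaml  # PyYAML

config_path = os.path.expanduser("~/.cache/huggingface/accelerate/default_config.yaml")
with open(config_path) as f:
    cfg = yaml.safe_load(f)

moe_names = cfg.get("deepspeed_config", {}).get("deepspeed_moe_layer_cls_names")
print(f"deepspeed_moe_layer_cls_names = {moe_names!r}")
# An empty string here ('') would make accelerate try to wrap a layer class named '',
# which matches the exception above; removing or filling in that entry is one thing to try.
```

If that entry is absent, the empty value may instead be arriving through the environment the launcher sets up for DeepSpeed; that is only a guess inferred from the empty quotes in the error message, not a confirmed root cause.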