Could you open-source the implementation code of train_one_epoch_ddp? Data parallel #15

Open
1759780295 opened this issue Dec 25, 2024 · 0 comments

Comments

@1759780295

When I implement data parallelism myself, I get the following error:
Traceback (most recent call last):
File "/home/user01/hzz/MLIC++/train.py", line 158, in <module>
main()
File "/home/user01/hzz/MLIC++/train.py", line 119, in main
current_step = train_one_epoch(
File "/home/user01/hzz/MLIC++/utils/training.py", line 18, in train_one_epoch
out_net = model(d)
File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/user01/hzz/MLIC++/models/mlicpp.py", line 93, in forward
self.update_resolutions(x.size(2) // 16, x.size(3) // 16)
File "/home/user01/hzz/MLIC++/models/mlicpp.py", line 191, in update_resolutions
self.local_context[i].update_resolution(H, W, next(self.parameters()).device, mask=None)
StopIteration
Could the author open-source the train_one_epoch_ddp implementation?
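
The traceback ends at `next(self.parameters())` inside `update_resolutions`. A likely cause, assuming PyTorch 1.5 or later: the replicas that `nn.DataParallel` builds no longer expose their parameters through `parameters()`, so the iterator is empty and `next` raises `StopIteration`. One possible workaround is to take the device from the input tensor instead. The sketch below is not the MLIC++ code: `LocalContext` and `Net` are hypothetical stand-ins, and the extra `device` argument on `update_resolutions` is an assumption made for illustration.

```python
import torch
import torch.nn as nn


class LocalContext(nn.Module):
    """Hypothetical stand-in for the local_context blocks named in the traceback."""

    def __init__(self):
        super().__init__()
        self.resolution = None
        self.device = None

    def update_resolution(self, H, W, device, mask=None):
        # Only record the arguments; the real block presumably builds masks here.
        self.resolution = (H, W)
        self.device = device


class Net(nn.Module):
    """Hypothetical model mirroring the call pattern in the traceback."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.local_context = nn.ModuleList([LocalContext() for _ in range(2)])

    def update_resolutions(self, H, W, device):
        # The device is passed in rather than read from next(self.parameters()),
        # which can be an empty iterator inside a DataParallel replica.
        for ctx in self.local_context:
            ctx.update_resolution(H, W, device, mask=None)

    def forward(self, x):
        # The input tensor always lives on the replica's own device.
        self.update_resolutions(x.size(2) // 16, x.size(3) // 16, x.device)
        return self.conv(x)


# A module without parameters reproduces the failure mode on any machine:
# next() on an empty iterator raises StopIteration.
try:
    next(nn.Identity().parameters())
    raised = False
except StopIteration:
    raised = True

model = Net()
out = model(torch.randn(1, 3, 64, 64))
```

With this pattern the forward pass does not depend on `parameters()` at all, so it should behave the same on a single device, under `nn.DataParallel`, or under `DistributedDataParallel`.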
