Description
I am computing features on multiple GPUs on the same node using DeepFeatureExtractor.
My feature extraction code is essentially the same as in the new notebook that demonstrates the feature extraction process: #887
What I Did
The nn.DataParallel wrapper built into tiatoolbox handles the multi-GPU computation. I pulled the changes that introduced torch.compile and switched from ON_GPU to device.
I updated the argument in DeepFeatureExtractor's predict method to use device instead of on_gpu.
The error traceback is too long to paste in full, but here are some of the errors (from a single run).
File "/tmp/torchinductor_qun786/vv/cvvkeueuq2m4jcjzub4hcfpkhpogtc5b2xddykdgxvsxcvnpfa2w.py", line 173, in call
buf2 = extern_kernels.convolution(buf0, buf1, stride=(14, 14), padding=(0, 0), dilation=(1, 1), transposed=False, output_padding=(0, 0), groups=1, bias=None)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)
...
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
...
RuntimeError: Triton Error [CUDA]: invalid device context
From what I can gather, torch.compile does not work well with nn.DataParallel.
Please let me know if you can reproduce the error by running the DeepFeatureExtractor feature extraction code with rcParam["torch_compile_mode"] = "default" on a node with at least 2 devices.
Maybe nn.DistributedDataParallel is a better option to use: https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead
tiatoolbox/tiatoolbox/models/models_abc.py
Lines 42 to 61 in 5f1cecb
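In the meantime, one possible workaround is to skip torch.compile whenever the model ends up wrapped in nn.DataParallel. The following is only a minimal sketch of that idea, not tiatoolbox's actual code: `should_compile` and `prepare_model` are hypothetical helper names, and the "disable" value used as an opt-out sentinel is an assumption made for illustration.

```python
def should_compile(num_devices: int, compile_mode: str) -> bool:
    """Compile only when it is requested and at most one device is in use."""
    # "disable" is an assumed sentinel meaning "do not compile".
    return compile_mode != "disable" and num_devices <= 1


def prepare_model(model, device: str = "cuda", compile_mode: str = "default"):
    """Hypothetical helper: move the model to `device`, then compile or wrap it."""
    # Imported lazily so should_compile stays usable without PyTorch installed.
    import torch
    from torch import nn

    num_devices = torch.cuda.device_count() if device == "cuda" else 1
    model = model.to(device)
    if should_compile(num_devices, compile_mode):
        # Single-device path: torch.compile with the requested mode.
        return torch.compile(model, mode=compile_mode)
    if num_devices > 1:
        # Multi-GPU path: plain nn.DataParallel, with no torch.compile.
        return nn.DataParallel(model)
    return model
```

With this approach, multi-GPU runs would lose whatever speed-up torch.compile provides, so it may be better seen as a stopgap until a switch to nn.DistributedDataParallel (or another fix) is in place.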