Update train_ssd.py to support multiple GPUs #4

Open: wants to merge 3 commits into base: master

@NISHANTSHRIVASTAV commented Aug 27, 2021

Hello @dusty-nv,

Following your suggestion in "Support multiple GPU" and the issue referenced here, @Mystique-orca and I have added multi-GPU support for training the SSD-based object detection model in PyTorch.

We have tested the modified train_ssd.py in our environment for object detection using 3 NVIDIA Tesla T4 GPUs. The GPUs to use are selected by passing their indices to the --gpu-devices argument.

For example:

python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=12 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1 2
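
For readers following along, one way a script could accept a list of GPU indices like the command above is argparse with `nargs="+"`. This is only a minimal sketch, not the code in this PR: the `build_parser` helper and the default of `[0]` are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical --gpu-devices option."""
    parser = argparse.ArgumentParser(description="Sketch of GPU selection for SSD training")
    # nargs="+" collects one or more integers, e.g. "--gpu-devices 0 1 2".
    # The default of [0] is an assumption in this sketch.
    parser.add_argument("--gpu-devices", type=int, nargs="+", default=[0],
                        help="indices of the GPUs to train on")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--gpu-devices", "0", "1", "2"])
print(args.gpu_devices)  # [0, 1, 2]
```

With this shape, omitting the flag yields a single-element list, so single-GPU and multi-GPU runs share one code path for reading the option.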

Please let us know if we can provide more information.

Hope this will help the community!

Thanks

nishant_shrivastav23 and others added 3 commits August 27, 2021 17:48
…tiple GPUs to work on default single GPU i.e Jetson

Co-authored-by: @Mystique-orca <sumeshrmeppadath@gmail.com>
@Mystique-orca and I have enabled multiple GPUs to support for training SSD based Object Detection Model in PyTorch
…tiple GPUs to work on default single GPU i.e Jetson

Co-authored-by: @Mystique-orca <sumeshrmeppadath@gmail.com>
@dusty-nv
Owner

Thanks @NISHANTSHRIVASTAV - can you make this work on a single GPU (i.e. Jetson) the same way it did previously? If it required no changes in CLI arguments, etc. for the single-GPU use case, I would merge it.

@NISHANTSHRIVASTAV
Author

> Thanks @NISHANTSHRIVASTAV - can you make this work on a single GPU (i.e. Jetson) the same way it did previously? If it required no changes in CLI arguments, etc. for the single-GPU use case, I would merge it.

@dusty-nv Yes, it works on a single GPU with the same CLI argument, --gpu-devices; you just pass the index of that GPU.

For example:

For single GPU

python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0

For 2 GPUs

python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1

For n GPUs

python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1 ... n-1

@dusty-nv
Owner

The default should be --gpu-devices 0. I also meant that I would prefer it not to use net.DataParallel() if only 1 GPU is being used, as I don't want there to be any unintended side-effects when running on Jetson systems (especially the memory-limited Nano 2GB device)

@NISHANTSHRIVASTAV
Author

NISHANTSHRIVASTAV commented Aug 28, 2021

> The default should be --gpu-devices 0. I also meant that I would prefer it not to use net.DataParallel() if only 1 GPU is being used, as I don't want there to be any unintended side-effects when running on Jetson systems (especially the memory-limited Nano 2GB device)

Hi @dusty-nv,

In the latest commit, following your suggestions, we have updated the multi-GPU SSD training implementation so that it also works on the default single GPU (e.g. Jetson). When training with multiple GPUs, it uses the net.DataParallel model; when training with a single GPU, such as on Jetson, it uses the default net model. No change to the CLI arguments is needed.

Please let us know if we can provide more information.

Thanks

@Mystique-orca

Hi @dusty-nv
As @NISHANTSHRIVASTAV mentioned, the code works as it did before when the --gpu-devices CLI argument is not provided or the default command is used. The net.DataParallel model is used only when more than one GPU device is provided.
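
The behavior described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: `use_data_parallel` and `prepare_model` are hypothetical helper names, `gpu_devices` is assumed to be a list of integer GPU indices, and the wrapper used is PyTorch's `torch.nn.DataParallel`.

```python
def use_data_parallel(gpu_devices):
    """Return True only when more than one GPU index was requested."""
    return len(gpu_devices) > 1


def prepare_model(net, gpu_devices):
    """Place `net` on the first requested GPU, wrapping it only for multi-GPU runs."""
    # Imported inside the function so the decision helper above can be
    # exercised on a machine without PyTorch installed.
    import torch

    device = torch.device(f"cuda:{gpu_devices[0]}")
    if use_data_parallel(gpu_devices):
        # DataParallel replicates the module and splits each batch across device_ids.
        net = torch.nn.DataParallel(net, device_ids=gpu_devices)
    # With a single index the model is returned unwrapped, as in the original script.
    return net.to(device)
```

With a single index such as `[0]`, the model never passes through `DataParallel`, which is the property requested for Jetson. One thing to keep in mind with a sketch like this is that a wrapped model exposes the original network as `net.module`, so its `state_dict()` keys carry a `module.` prefix unless the inner module is saved instead.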

Can you let us know if this request can be merged?
If any suggestions or changes are required, we are open to incorporating those too.

Many thanks!

@Gcardoso233

Gcardoso233 commented Sep 23, 2021

Hello, I've been trying to apply these changes to my 1_train_ssd, as I also want to run multi-GPU training, but I keep hitting this error:
'RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)'

Did you have a similar issue, or do you know where I'm making a mistake?

This is my first computer vision project and I would really appreciate your input! Thanks

4 participants