Update train_ssd.py to support multiple GPUs #4

NISHANTSHRIVASTAV · 2021-08-27T14:55:01Z

Following your suggestion in "Support multiple GPU" and the related issue, @Mystique-orca and I have added multi-GPU support for training the SSD-based object detection model in PyTorch.

We have tested the modified train_ssd.py in our environment for object detection on 3 NVIDIA Tesla T4 GPUs. The indices of the GPUs to use are passed with the --gpu-devices argument.

For example:

python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=12 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1 2

Please let us know if we could provide more information.

Hope this will help the community!

Thanks

dusty-nv · 2021-08-27T15:17:14Z

Thanks @NISHANTSHRIVASTAV - can you make this work on a single GPU (i.e. Jetson) the same way it did previously? If it required no changes to CLI arguments etc. in the single-GPU use case, I would merge it.

NISHANTSHRIVASTAV · 2021-08-27T15:40:41Z

@dusty-nv Yes, it will work on a single GPU with the same CLI argument, --gpu-devices; we just need to pass the index of the GPU.

For example:

For a single GPU

python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0

For 2 GPUs

python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1

For n GPUs

python train_ssd.py --dataset-type=voc --data=<path-to-dataset-dir> --model-dir=<path-to-model-dir> --batch-size=4 --epochs=400 --workers=0 --use-cuda=True --gpu-devices 0 1 .. n-1

dusty-nv · 2021-08-27T16:03:34Z

The default should be --gpu-devices 0. I also meant that I would prefer it not to use net.DataParallel() if only 1 GPU is being used, as I don't want there to be any unintended side-effects when running on Jetson systems (especially the memory-limited Nano 2GB device)

NISHANTSHRIVASTAV · 2021-08-28T12:57:14Z

Hi @dusty-nv,

In the latest commit we have modified the multi-GPU SSD training implementation so that it works on the default single GPU (i.e. Jetson), as you suggested. When training with multiple GPUs it uses the net.DataParallel model; when training with a single GPU, specifically on Jetson, it uses the default net model, with no change to the CLI arguments.
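As a rough illustration of the behaviour described above (a minimal sketch, not the actual diff from this PR): the helper names parse_gpu_devices, use_data_parallel and prepare_net are hypothetical, and the sketch assumes the wrapper in question is PyTorch's torch.nn.DataParallel.

```python
import argparse


def parse_gpu_devices(argv=None):
    """Parse --gpu-devices as a list of integer GPU indices (default: [0])."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu-devices", type=int, nargs="+", default=[0],
                        help="indices of the GPUs to train on")
    # parse_known_args skips the other train_ssd.py flags in this sketch
    args, _ = parser.parse_known_args(argv)
    return args.gpu_devices


def use_data_parallel(gpu_devices):
    """Wrap the model only when more than one GPU index was requested."""
    return len(gpu_devices) > 1


def prepare_net(net, gpu_devices):
    """Move net to the first listed GPU; wrap it only in the multi-GPU case."""
    import torch  # imported lazily so the helpers above run without PyTorch

    device = torch.device(f"cuda:{gpu_devices[0]}")
    net = net.to(device)
    if use_data_parallel(gpu_devices):
        # Assumption: the multi-GPU path relies on torch.nn.DataParallel
        net = torch.nn.DataParallel(net, device_ids=gpu_devices)
    return net
```

With no --gpu-devices argument the list defaults to [0], so the single-GPU path leaves the model unwrapped, which is the case that matters on Jetson.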

Please let us know if we could provide more information.

Thanks

Mystique-orca · 2021-09-16T06:51:55Z

Hi @dusty-nv,
As @NISHANTSHRIVASTAV mentioned, the code works as it did before when the --gpu-devices CLI argument is not provided or the default command is used. The net.DataParallel model is used only when more than one GPU device is provided.

Can you let us know if this request can be merged?
If any suggestions or changes are required, we are open to incorporating those too.

Many thanks!

Gcardoso233 · 2021-09-23T10:22:32Z

Hello, I've been trying to apply these changes to my 1_train_ssd, as I also want to train on multiple GPUs, but I keep hitting this error:
'RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)'

Did you have a similar issue, or do you know where I'm making a mistake?

This is my first Computer Vision project and I would really appreciate your input! Thanks

nishant_shrivastav23 and others added 3 commits August 27, 2021 17:48

Modified SSD based Object Detection Training implementation using Mul…

6caa5de

…tiple GPUs to work on default single GPU i.e Jetson Co-authored-by: @Mystique-orca <sumeshrmeppadath@gmail.com>

Update train_ssd.py to support multiple GPUs

1fadb25

@Mystique-orca and I have enabled multiple GPUs to support for training SSD based Object Detection Model in PyTorch

Modified SSD based Object Detection Training implementation using Mul…

398c25a

…tiple GPUs to work on default single GPU i.e Jetson Co-authored-by: @Mystique-orca <sumeshrmeppadath@gmail.com>
