
About training time #3

Open
ZJWang9928 opened this issue Mar 14, 2024 · 16 comments

Comments

@ZJWang9928

Hello, I am running the training code of lanegraph2seq on nuScenes. Each batch takes about 3.6 seconds, and hence the total training process will take about 20 days. Is this speed normal?
[screenshot: training log]

BTW, would it be possible for you to release the pre-trained checkpoint ckpts/lssego_segmentation_48x32_b4x8_resnet_adam_24e_ponsplit_19.pth?

@VictorLlu
Collaborator

Apologies for the delay due to the ECCV submission deadline. In compliance with our company's confidentiality policies, the original code cannot be published. This version of the code has been independently reproduced. While the original code is known for its speed, I'm currently unable to determine the cause of performance issues in this version. I am actively investigating the matter.

@VictorLlu
Collaborator

The checkpoint is the pretraining checkpoint. Please refer to issue
#2 (comment)

@ZJWang9928
Author

https://github.com/fudan-zvg/RoadNet/blob/9a83cf6aa09896e6df6c36c3a534e9b9ab075a7b/RoadNetwork/rntr/init.py#L24C1-L24C52

Hi @VictorLlu! Thank you for your update. But the following file still seems to be missing... Could you please update it?
from .data import nuscenes_converter_pon_centerline
#5

@ZJWang9928
Author

@VictorLlu Comparing training with 1 GPU and 8 GPUs, I found that the batch time is roughly NUM_GPUs * batch_time_per_GPU + $\Delta$. Is this behavior abnormal?
NUM_GPUs = 1: [screenshot: training log]
NUM_GPUs = 2: [screenshot: training log]
NUM_GPUs = 4: [screenshot: training log]
NUM_GPUs = 8: [screenshot: training log]

@ZJWang9928
Author

@VictorLlu
FYI, I printed the time spent on data preprocessing (time1), forward propagation (time2), and back propagation (time3) in mmengine/model/wrappers/distributed.py when training with 8 GPUs. It seems that the major cause of the long average batch time is the extremely slow back propagation in some iterations.
[screenshot: per-stage timing log]
BTW, could you please update the nuscenes_converter_pon_centerline file soon? It would be of great help! Thank you.
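For reference, here is a minimal sketch of that kind of per-stage timing (the function name, the split into data_preprocessor / forward / update_params, and the use of mmengine's OptimWrapper API are my assumptions, not the actual instrumentation used above):

import time
import torch

def timed_train_step(model, data, optim_wrapper):
    # Hypothetical instrumentation sketch; cuda.synchronize() makes the timers
    # reflect finished GPU work rather than asynchronous kernel launch time.
    t0 = time.perf_counter()
    data = model.data_preprocessor(data, training=True)  # time1: data preprocessing
    torch.cuda.synchronize()
    t1 = time.perf_counter()

    losses = model(**data, mode='loss')                  # time2: forward propagation
    torch.cuda.synchronize()
    t2 = time.perf_counter()

    parsed_loss, log_vars = model.parse_losses(losses)
    optim_wrapper.update_params(parsed_loss)             # time3: backward + optimizer step
    torch.cuda.synchronize()
    t3 = time.perf_counter()

    print(f'time1={t1 - t0:.3f}s  time2={t2 - t1:.3f}s  time3={t3 - t2:.3f}s')
    return log_vars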

@VictorLlu
Collaborator

I've made a minor modification to the image loading process:

import mmcv
import numpy as np
from mmengine.fileio import get

# Read the raw bytes of each camera image with the configured file backend,
# then decode them in memory instead of going through mmcv.imread.
img_bytes = [
    get(name, backend_args=self.backend_args) for name in filename
]
img = [
    mmcv.imfrombytes(img_byte, flag=self.color_type)
    for img_byte in img_bytes
]
# Stack the multi-view images along a new last axis: (H, W, C, num_views).
img = np.stack(img, axis=-1)

This approach replaces the use of mmcv.imread. It has provided some improvement, yet the loading time remains significantly long.
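If it helps to narrow things down, here is a generic sketch (the file paths and camera count are placeholders, not taken from the repo) for checking whether raw file I/O or JPEG decoding dominates the loading time:

import time
import mmcv
from mmengine.fileio import get

# Placeholder paths standing in for the six camera images of one nuScenes sample.
filename = ['demo_cam_%d.jpg' % i for i in range(6)]

t0 = time.perf_counter()
img_bytes = [get(name) for name in filename]                    # raw file I/O
t1 = time.perf_counter()
imgs = [mmcv.imfrombytes(b, flag='color') for b in img_bytes]   # in-memory decode
t2 = time.perf_counter()

print(f'I/O: {t1 - t0:.3f}s  decode: {t2 - t1:.3f}s')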

@VictorLlu
Collaborator

VictorLlu commented Apr 3, 2024

I find it is highly related to the num_workers setting.

I've noticed that the delay between iterations directly corresponds to the num_workers setting in multi-GPU training scenarios. Despite eliminating every time-consuming element in the dataloader, it still experiences delays at intervals consistent with the num_workers count. This suggests that the issue might stem from mmdetection3d rather than the dataloader itself.
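For anyone cross-checking their own setup, these are the dataloader fields involved (an illustrative mmengine-style snippet; the values are examples, not this repository's actual config):

# Illustrative values only; adjust to your hardware.
train_dataloader = dict(
    batch_size=2,
    num_workers=8,            # the stalls described above recur roughly every num_workers iterations
    persistent_workers=True,  # keep worker processes alive between epochs
    pin_memory=True,          # faster host-to-GPU copies
    sampler=dict(type='DefaultSampler', shuffle=True),
    # dataset=...  (unchanged)
)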

@VictorLlu
Collaborator

VictorLlu commented Apr 3, 2024

The issue has also been identified in other models of mmdetection3d, suggesting it might be a problem inherent to this version.
I will push an mmdetection 0.17.1 version in the next few days.

@EchoQiHeng

When I use a single 2080 GPU, it takes 59 days to complete the training...

@ZJWang9928
Author

Hi! It has been two weeks now; when will you be able to update the mmdetection 0.17.0 version? It would be of significant help.

@raimberri

Hi, I found a solution in the MMDetection issues (open-mmlab/mmdetection#11503): update your PyTorch version to >= 2.2. I tested it, and it reduced the training time from 25 days to 4 days.
[screenshot: training log (20240617-143821)]
[screenshot: training log (20240617-143826)]
Hopefully this works for you.
BTW, I used the latest pytorch-2.3.1, installed via conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia, with the rest of the environment unchanged.
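As a quick sanity check after the upgrade (a generic snippet, nothing specific to this repo):

import torch

print(torch.__version__)          # expect >= 2.2, e.g. 2.3.1
print(torch.version.cuda)         # expect 11.8 for the conda command above
print(torch.cuda.is_available())  # should be True on a GPU machine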

@y352184741

Hi, I have updated PyTorch to the latest version and successfully reduced the training time. But the grads become NaN after certain iterations and the losses become 0. Did you have this problem during training?
[screenshot: training log]
[screenshot: training log]
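If it helps with debugging, two generic knobs for this kind of failure (illustrative; neither is necessarily what this repo configures by default):

import torch

# 1) Locate the first op that produces NaN/Inf gradients (slow; debug runs only).
torch.autograd.set_detect_anomaly(True)

# 2) Clip exploding gradients through mmengine's optimizer wrapper
#    (the optimizer type and max_norm here are example values).
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=2e-4, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2),
)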

@raimberri

I met the same problem. What I tried was changing batch_size from 2 to 4 and the learning rate from 2e-4 to 1e-4; then the problem was gone and the model trains normally.
Original hyperparameters (batch_size=2, lr=2e-4) log:
[screenshot: training log]
Modified hyperparameters (batch_size=4, lr=1e-4) log:
[screenshot: training log]
However, I didn't dive into it in detail since the model hasn't finished training yet, so I can only offer a rough guess that the problem was caused by some abnormal data input/GT, and that increasing the batch_size may mitigate the impact.
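For clarity, the change described above would look roughly like this in an mmengine-style config (the field names are the usual ones; this is not the repository's actual file):

train_dataloader = dict(batch_size=4)          # was batch_size=2
optim_wrapper = dict(optimizer=dict(lr=1e-4))  # was lr=2e-4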

@y352184741

Hi~ Have you finished the training and successfully reproduced the results from the paper?

@wangpinzhi

Hello, did you reproduce the results in the paper?

@raimberri

raimberri commented Jul 30, 2024

FYI, I do have some results, though not very good, shown below. Since the model didn't converge well (probably because of the hyperparameter settings and limited GPU resources) and I didn't spend much time on optimizing it or implementing a well-designed visualization script, waiting for the officially released model weights and visualization script would be the ultimate solution.
[screenshot: reproduced results]
