
About training time #3

Open
ZJWang9928 opened this issue Mar 14, 2024 · 16 comments

Comments

@ZJWang9928

Hello, I am running the training code of lanegraph2seq on nuScenes. Each batch takes about 3.6 seconds, and hence the total training process will take about 20 days. Is this speed normal?
[screenshot: training log]

BTW, would it be possible for you to release the pre-trained checkpoint ckpts/lssego_segmentation_48x32_b4x8_resnet_adam_24e_ponsplit_19.pth?

@VictorLlu
Collaborator

Apologies for the delay due to the ECCV submission deadline. In compliance with our company's confidentiality policies, the original code cannot be published. This version of the code has been independently reproduced. While the original code is known for its speed, I'm currently unable to determine the cause of performance issues in this version. I am actively investigating the matter.

@VictorLlu
Collaborator

The checkpoint is the pretraining checkpoint. Please refer to issue
#2 (comment)

@ZJWang9928
Author

https://github.com/fudan-zvg/RoadNet/blob/9a83cf6aa09896e6df6c36c3a534e9b9ab075a7b/RoadNetwork/rntr/init.py#L24C1-L24C52

Hi @VictorLlu! Thank you for your update. But the following file still seems to be missing... Could you please update it?
from .data import nuscenes_converter_pon_centerline
#5

@ZJWang9928
Author

@VictorLlu Comparing training with 1 GPU and 8 GPUs, I found that the batch time is roughly NUM_GPUs * batch_time_per_GPU + $\Delta$. Is this behavior abnormal?
NUM_GPUs = 1: [screenshot: training log]
NUM_GPUs = 2: [screenshot: training log]
NUM_GPUs = 4: [screenshot: training log]
NUM_GPUs = 8: [screenshot: training log]

@ZJWang9928
Author

@VictorLlu
FYI, I printed the time spent on data preprocessing (time1), forward propagation (time2), and back propagation (time3) in mmengine/model/wrappers/distributed.py when training with 8 GPUs. It seems that the major cause of the long average batch time is the extremely slow back propagation in some iterations.
[screenshot: per-stage timing log]
BTW, could you please update the nuscenes_converter_pon_centerline file soon? It would be of great help! Thank you.
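For reference, here is a minimal sketch of that kind of per-stage timing (the function name, the split into data_preprocessor / forward / update_params, and the use of mmengine's OptimWrapper API are my assumptions, not the actual instrumentation used above):

import time
import torch

def timed_train_step(model, data, optim_wrapper):
    # Hypothetical instrumentation sketch; cuda.synchronize() makes the timers
    # reflect finished GPU work rather than asynchronous kernel launch time.
    t0 = time.perf_counter()
    data = model.data_preprocessor(data, training=True)  # time1: data preprocessing
    torch.cuda.synchronize()
    t1 = time.perf_counter()

    losses = model(**data, mode='loss')                  # time2: forward propagation
    torch.cuda.synchronize()
    t2 = time.perf_counter()

    parsed_loss, log_vars = model.parse_losses(losses)
    optim_wrapper.update_params(parsed_loss)             # time3: backward + optimizer step
    torch.cuda.synchronize()
    t3 = time.perf_counter()

    print(f'time1={t1 - t0:.3f}s  time2={t2 - t1:.3f}s  time3={t3 - t2:.3f}s')
    return log_vars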

@VictorLlu
Collaborator

I've made a minor modification to the image loading process:

import mmcv
import numpy as np
from mmengine.fileio import get

# Read the raw bytes of each camera image with the configured file backend,
# then decode them in memory instead of going through mmcv.imread.
img_bytes = [
    get(name, backend_args=self.backend_args) for name in filename
]
img = [
    mmcv.imfrombytes(img_byte, flag=self.color_type)
    for img_byte in img_bytes
]
# Stack the multi-view images along a new last axis: (H, W, C, num_views).
img = np.stack(img, axis=-1)

This approach replaces the use of mmcv.imread. It has provided some improvement, yet the loading time remains significantly long.
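If it helps to narrow things down, here is a generic sketch (the file paths and camera count are placeholders, not taken from the repo) for checking whether raw file I/O or JPEG decoding dominates the loading time:

import time
import mmcv
from mmengine.fileio import get

# Placeholder paths standing in for the six camera images of one nuScenes sample.
filename = ['demo_cam_%d.jpg' % i for i in range(6)]

t0 = time.perf_counter()
img_bytes = [get(name) for name in filename]                    # raw file I/O
t1 = time.perf_counter()
imgs = [mmcv.imfrombytes(b, flag='color') for b in img_bytes]   # in-memory decode
t2 = time.perf_counter()

print(f'I/O: {t1 - t0:.3f}s  decode: {t2 - t1:.3f}s')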

@VictorLlu
Collaborator

VictorLlu commented Apr 3, 2024

I find it is highly related to the num_workers setting.

I've noticed that the delay between iterations directly corresponds to the num_workers setting in multi-GPU training scenarios. Despite eliminating every time-consuming element in the dataloader, it still experiences delays at intervals consistent with the num_workers count. This suggests that the issue might stem from mmdetection3d rather than the dataloader itself.
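For anyone cross-checking their own setup, these are the dataloader fields involved (an illustrative mmengine-style snippet; the values are examples, not this repository's actual config):

# Illustrative values only; adjust to your hardware.
train_dataloader = dict(
    batch_size=2,
    num_workers=8,            # the stalls described above recur roughly every num_workers iterations
    persistent_workers=True,  # keep worker processes alive between epochs
    pin_memory=True,          # faster host-to-GPU copies
    sampler=dict(type='DefaultSampler', shuffle=True),
    # dataset=...  (unchanged)
)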

@VictorLlu
Collaborator

VictorLlu commented Apr 3, 2024

The issue has also been identified in other models of mmdetection3d, suggesting it might be a problem inherent to this version.
I will push an mmdetection 0.17.1 version in the next few days.

@EchoQiHeng

When I use a single 2080 GPU, it takes 59 days to complete the training...

@ZJWang9928
Author

Hi! It has been two weeks now; when will you be able to update the mmdetection 0.17.0 version? It would be of significant help.

@raimberri

Hi, I found a solution in the MMDetection issues (open-mmlab/mmdetection#11503): update your PyTorch version to >= 2.2. I tested it, and it reduced the training time from 25 days to 4 days.
[screenshot: training log (20240617-143821)]
[screenshot: training log (20240617-143826)]
Hopefully this works for you.
BTW, I used the latest pytorch-2.3.1, installed via conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia, with the rest of the environment unchanged.
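As a quick sanity check after the upgrade (a generic snippet, nothing specific to this repo):

import torch

print(torch.__version__)          # expect >= 2.2, e.g. 2.3.1
print(torch.version.cuda)         # expect 11.8 for the conda command above
print(torch.cuda.is_available())  # should be True on a GPU machine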

@y352184741

Hi, I have updated PyTorch to the latest version and successfully reduced the training time. But the grads become NaN after certain iterations and the losses become 0. Did you have this problem during training?
[screenshot: training log]
[screenshot: training log]
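If it helps with debugging, two generic knobs for this kind of failure (illustrative; neither is necessarily what this repo configures by default):

import torch

# 1) Locate the first op that produces NaN/Inf gradients (slow; debug runs only).
torch.autograd.set_detect_anomaly(True)

# 2) Clip exploding gradients through mmengine's optimizer wrapper
#    (the optimizer type and max_norm here are example values).
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='AdamW', lr=2e-4, weight_decay=0.01),
    clip_grad=dict(max_norm=35, norm_type=2),
)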

@raimberri

I met the same problem. What I tried was changing batch_size from 2 to 4 and the learning rate from 2e-4 to 1e-4; then the problem was gone and the model trains normally.
Original hyperparameters (batch_size=2, lr=2e-4) log:
[screenshot: training log]
Modified hyperparameters (batch_size=4, lr=1e-4) log:
[screenshot: training log]
However, I didn't dive into it in detail since the model hasn't finished training yet, so I can only offer a rough guess that the problem was caused by some abnormal data input/GT, and that increasing the batch_size may mitigate the impact.
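For clarity, the change described above would look roughly like this in an mmengine-style config (the field names are the usual ones; this is not the repository's actual file):

train_dataloader = dict(batch_size=4)          # was batch_size=2
optim_wrapper = dict(optimizer=dict(lr=1e-4))  # was lr=2e-4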

@y352184741

Hi~ Have you finished the training and successfully reproduced the results from the paper?

@wangpinzhi

Hello, did you reproduce the results in the paper?

@raimberri

raimberri commented Jul 30, 2024

FYI, I do have some results, though not very good, shown below. Since the model didn't converge well (probably because of the hyperparameter settings and limited GPU resources) and I didn't spend much time on optimizing it or implementing a well-designed visualization script, waiting for the officially released model weights and visualization script would be the ultimate solution.
[screenshot: reproduced results]
