About training time #3
Apologies for the delay due to the ECCV submission deadline. In compliance with our company's confidentiality policy, the original code cannot be published; this version of the code has been independently reproduced. The original code is known for its speed, but I'm currently unable to determine the cause of the performance issues in this version. I am actively investigating the matter.
The checkpoint is the pretraining checkpoint. Please refer to the issue.
Hi! @VictorLlu Thank you for your update. But the following file still seems to be missing... Can you please update it?
@VictorLlu Comparing training with 1 GPU and 8 GPUs, I found that the batch time is almost equal to NUM_GPUs * batch_time_per_GPU + …
@VictorLlu
I've made a minor modification to the image loading process:

```python
img_bytes = [
    get(name, backend_args=self.backend_args) for name in filename
]
img = [
    mmcv.imfrombytes(img_byte, flag=self.color_type)
    for img_byte in img_bytes
]
img = np.stack(img, axis=-1)
```

This approach replaces the use of …
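For context, here is a minimal sketch of how a snippet like this typically sits inside an mmdetection3d-style multi-view image loading transform. The class, attribute, and result-dict key names below are illustrative assumptions, not the project's actual code:

```python
import mmcv
import numpy as np
from mmengine.fileio import get


class LoadMultiViewImagesSketch:
    """Illustrative transform: read all camera images for one sample."""

    def __init__(self, color_type='unchanged', backend_args=None):
        self.color_type = color_type
        self.backend_args = backend_args

    def transform(self, results):
        # Hypothetical key holding the list of per-view image paths.
        filename = results['img_filename']
        # Read the raw bytes for every view, decode them, and stack once.
        img_bytes = [get(name, backend_args=self.backend_args) for name in filename]
        img = [mmcv.imfrombytes(b, flag=self.color_type) for b in img_bytes]
        results['img'] = np.stack(img, axis=-1)
        return results
```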
I find it highly related to the …
I've noticed that the delay between iterations directly corresponds to the …
The issue has also been observed in other mmdetection3d models, suggesting it might be a problem inherent to this version.
When I use a single 2080 GPU, it takes 59 days to complete the training...
Hi! As it has been two weeks, could you please update the mmdetection 0.17.0 version? It would be of significant help.
Hi, I have found a solution in the MMDetection issues (open-mmlab/mmdetection#11503): update your PyTorch version to >= 2.2. I have tested it, and it reduced the training time from 25 days to 4 days.
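For anyone applying this fix, a quick sanity check after upgrading (a minimal sketch; choose the PyTorch build that matches your CUDA setup):

```python
import torch

# After upgrading, this should report 2.2.0 or newer.
print(torch.__version__)
# Confirm the upgraded build can still see the GPU.
print(torch.cuda.is_available())
```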
Hi, I have updated PyTorch to the latest version and successfully reduced the training time. However, the gradients become NaN after a certain number of iterations and the losses drop to 0. Did you have this problem during training?
I met the same problem. What I tried was changing batch_size from 2 to 4 and the learning rate from 2e-4 to 1e-4; after that the problem was gone and the model trains normally.
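For illustration, in an mmdetection3d-style config the change described above would look roughly like the excerpt below. The key names are assumptions for the sketch, not the repository's actual config:

```python
# Hypothetical config excerpt; exact keys depend on the project's config files.
train_dataloader = dict(batch_size=4)  # previously 2
optim_wrapper = dict(
    optimizer=dict(type='AdamW', lr=1e-4),  # previously 2e-4
)
```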
Hi~ Have you finished the training and successfully reproduced the results from the paper?
Hello, did you reproduce the results in the paper?
FYI, I do have some results, though not very good, shown below. The model didn't converge well (probably due to the hyperparameter settings and limited GPU resources), and I didn't spend much time optimizing it or implementing a well-designed visualization script, so waiting for the officially released model weights and visualization script would be the ultimate solution.
Hello, I am running the training code of lanegraph2seq on nuScenes. Each batch takes about 3.6 seconds, and hence the total training process will take about 20 days. Is this speed normal?
BTW, would it be possible for you to release the pre-trained checkpoint ckpts/lssego_segmentation_48x32_b4x8_resnet_adam_24e_ponsplit_19.pth?
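Regarding the 3.6 s/batch figure above, a back-of-the-envelope check of what iteration count that estimate implies (assuming the 20-day figure is simply seconds per batch times total iterations):

```python
# Implied iteration count behind the reported estimate.
seconds_per_batch = 3.6
estimated_days = 20
implied_iterations = estimated_days * 24 * 3600 / seconds_per_batch
print(f"~{implied_iterations:,.0f} iterations")  # roughly 480,000 iterations
```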