Training time #3

Open
youweiliang opened this issue Nov 10, 2020 · 9 comments
@youweiliang

Hi, thanks for the implementation. Could you provide the (approximate) training time needed to produce the results in the table in the README?

@guanfuchen

Same problem here: training BYOL with this code seems slow. How can we optimize it for faster training?

@yaox12
Owner

yaox12 commented Nov 17, 2020

@guanfuchen
First, training BYOL is slow naturally, due to:

  1. With the same architecture (ResNet-50, for example), BYOL performs more than twice as many forward passes as supervised learning; see the sketch after this list.
  2. BYOL, like other self-supervised methods such as SimCLR, requires more training epochs to converge. Most of the ablation results in the paper are reported at 300 epochs, while the best performance is reached at 1000 epochs.
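For concreteness, here is a minimal sketch (not exactly my code; `online`, `target`, and `predictor` are placeholder modules) of one BYOL training step, showing the four encoder forward passes per batch of view pairs where supervised training does one:

```python
import torch
import torch.nn.functional as F

def byol_step(online, target, predictor, view1, view2):
    # The online network forwards both augmented views...
    p1 = predictor(online(view1))
    p2 = predictor(online(view2))
    # ...and the target network forwards them again, without gradients.
    with torch.no_grad():
        z1 = target(view1)
        z2 = target(view2)
    # Symmetric loss, equivalent up to constants to the paper's MSE
    # between L2-normalized predictions and target projections.
    return -(F.cosine_similarity(p1, z2, dim=-1).mean()
             + F.cosine_similarity(p2, z1, dim=-1).mean())
```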

Second, self-supervised methods usually use a large batch size, e.g., 4096, for better performance, and large-scale distributed training brings a lot of efficiency challenges:

  1. BYOL occupies more than twice the GPU memory of supervised training. That is to say, it needs twice as many GPUs to keep the same batch size. Unfortunately, communication cost or throughput becomes a major bottleneck under these circumstances.
  2. I adopt synchronized batch normalization (syncBN) in my code. It performs an extra synchronization across all GPUs during the forward pass, which is very time-consuming. While the authors do not stress using syncBN in BYOL, I found it hard to reproduce the reported results without it, so I added it back, referring to the implementation of SimCLR; see the sketch after this list.
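For reference, a minimal sketch of enabling syncBN with native PyTorch (my code may use a different syncBN implementation; this assumes the torch.distributed process group is already initialized):

```python
import torch
import torchvision

model = torchvision.models.resnet50()
# Replace every BatchNorm layer with SyncBatchNorm, which synchronizes
# batch statistics across all GPUs during the forward pass.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = torch.nn.parallel.DistributedDataParallel(model.cuda())
```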

Besides, there are some points that could be optimized. The most important one is data loading: the dataloader reads raw images and then applies data augmentations, which is rather slow. You can:

  1. Use a better-optimized data augmentation library, e.g., albumentations (improves things a little). An example is here.
  2. Use newer torchvision versions (with pytorch >= 1.7.0), which can apply augmentations to tensors (in batches and on GPUs); I have not tried this, but see the sketch after this list.
  3. Use DALI for reading (converting raw images to tfrecords accelerates things a lot) and augmentation (however, some augmentations used by BYOL may not be supported by DALI).
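As a sketch of option 2 (untested, as said above; assumes torchvision >= 0.8 paired with pytorch >= 1.7):

```python
import torch
from torchvision import transforms

# Transforms are now nn.Modules that accept (batched) tensors,
# so the augmentation pipeline can run on the GPU.
gpu_aug = torch.nn.Sequential(
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
).cuda()

batch = torch.rand(32, 3, 256, 256, device="cuda")  # dummy decoded batch
out = gpu_aug(batch)  # caveat: one set of random parameters per batch
```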

@guanfuchen

Thanks, great answer. DALI is not very flexible, and yes, syncBN is important. I am considering MoCo; it seems quicker than BYOL, and it does not use syncBN. @yaox12

@guanfuchen

guanfuchen commented Dec 10, 2020

  1. Another problem: the transformation is slow; in particular, cv2.GaussianBlur is very slow, yet linear evaluation accuracy seems lower when using the PIL GaussianBlur (sketched after this list). This is the OpenSelfSup code for the BYOL GaussianBlur transformation.

  2. Can we use byol_transform_a to get the same result as byol_transform?
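For reference, a PIL-based blur along the lines of the MoCo v2 one (a sketch; `PILGaussianBlur` is just an illustrative name):

```python
import random
from PIL import Image, ImageFilter

class PILGaussianBlur:
    """MoCo v2-style blur: PIL takes only a single radius (sigma)."""
    def __init__(self, sigma=(0.1, 2.0)):
        self.sigma = sigma

    def __call__(self, img: Image.Image) -> Image.Image:
        radius = random.uniform(*self.sigma)
        return img.filter(ImageFilter.GaussianBlur(radius=radius))
```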


@yaox12
Owner

yaox12 commented Dec 10, 2020

@guanfuchen

  1. The official implementation of MoCo also uses the PIL GaussianBlur; however, I haven't tried it because it accepts only a single parameter, sigma. I kept the implementation as close to the BYOL paper as possible.
  2. albumentations in byol_transform_a.py leads to a similar linear evaluation result, but I remember it didn't accelerate things much. The color adjustment in byol_transform_a.py may not be correct; you can refer to albumentations-team/albumentations#672 (How to make an equivalent migration from torchvision.transforms.ColorJitter to albumentations) and albumentations-team/albumentations#698 (Is there a HueSaturationLightness (to recreate torchvision.transforms.ColorJitter)?) for help. BTW, GaussianBlur has been added to torchvision now; I think it deserves a try (sketched after this list).
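A minimal sketch of what that could look like (kernel_size=23 and sigma in [0.1, 2.0] follow the BYOL paper; RandomApply reproduces the per-view blur probabilities):

```python
from torchvision import transforms

blur = transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))
view1_blur = transforms.RandomApply([blur], p=1.0)  # always blur view 1
view2_blur = transforms.RandomApply([blur], p=0.1)  # rarely blur view 2
```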

@guanfuchen

Thanks @yaox12, your code is really excellent! I will try, but the cost is high. For linear evaluation I set the weight decay to 0 (sketched below), and then the accuracy is almost the same as with the original cv2 implementation.
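The change is just the optimizer setting for linear evaluation, i.e., something like (`classifier` and the learning rate are placeholders):

```python
import torch

# lr=30.0 is a common linear-eval choice, not a verified value here;
# the only point of this sketch is weight_decay=0.0.
optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0,
                            momentum=0.9, weight_decay=0.0)
```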

@yaox12 yaox12 pinned this issue Dec 16, 2020
@yaox12 yaox12 reopened this Dec 16, 2020
@yaox12 yaox12 unpinned this issue Dec 16, 2020
@youweiliang
Author

@yaox12 Hi, I notice you actually did not use mixed precision in the BYOL training, since opt_level is "O0" in your train_config.yaml. Have you tried opt_level: "O1"? Would it cause a significant accuracy drop?

@yaox12
Owner

yaox12 commented Jan 21, 2021

@youweiliang See #7

@youweiliang
Author

Thanks. I have just tried torch.cuda.amp (PyTorch 1.6) for mixed-precision training, as sketched below, and also observed similar computation times with and without amp. I tend to agree with you on the communication bottleneck.
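The amp loop I tried looks roughly like this (a sketch; `loader`, `optimizer`, and the `byol_step` placeholders from the earlier sketch are illustrative):

```python
import torch

scaler = torch.cuda.amp.GradScaler()
for view1, view2 in loader:
    optimizer.zero_grad()
    # Run the forward pass in mixed precision...
    with torch.cuda.amp.autocast():
        loss = byol_step(online, target, predictor, view1, view2)
    # ...and scale the loss to avoid underflow in fp16 gradients.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```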
