Dear authors,
I recently found that the RADIO model uses almost half the GPU memory of a normal pre-trained ViT, even at the same model size as ViT-L. I am wondering why. Do you use any efficient attention operations?
Thanks a lot :)
The reason behind this is that we store the checkpoint weights in bf16. If you load RADIO directly and don't cast its dtype, the weights occupy only half the memory of a traditional ViT in fp32. Note that AMP-BF16 still keeps the model weights in fp32, which is why you won't see this particular saving in the weight memory when running under AMP.
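For illustration, here is a minimal PyTorch sketch of that AMP-BF16 behavior (not RADIO-specific, and assuming a CUDA device is available): the stored parameters stay in fp32, so autocast alone does not shrink weight memory; only the computation inside the autocast region runs in bf16.

```python
import torch

# Plain fp32 module standing in for any ViT; autocast does not change its weights.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)

print(next(model.parameters()).dtype)  # torch.float32 -- no weight-memory saving
print(y.dtype)                         # torch.bfloat16 -- compute/activations in bf16
```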
We store the weights in bf16 to make storage, downloading, and loading cheaper, and the change in model accuracy appears to be negligible.
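A quick way to see the bf16 storage is to load the model and inspect the parameter dtype and total footprint. This is only a sketch: the torch.hub entrypoint and version string below follow the pattern in the RADIO README but should be treated as assumptions (substitute whatever version you actually use), and `param_bytes` is a helper defined just for this example.

```python
import torch

# Assumed load call; check the entrypoint/version against the RADIO README.
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', progress=True)

def param_bytes(module: torch.nn.Module) -> int:
    """Total bytes occupied by a module's parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters())

print(next(model.parameters()).dtype)                      # expect torch.bfloat16
print(f"as loaded:      {param_bytes(model) / 2**20:.1f} MiB")

# Casting to fp32 roughly doubles the weight memory, recovering the
# footprint of a conventional fp32 ViT checkpoint of the same size.
model.float()
print(f"after .float(): {param_bytes(model) / 2**20:.1f} MiB")
```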