
Much more memory-efficient than a normal ViT? #108

Open
wuyouliaoxi opened this issue Dec 17, 2024 · 3 comments

Comments

@wuyouliaoxi

Dear authors,

I recently found that the RADIO model uses almost half the GPU memory of a normally pre-trained ViT, even at the same model size as ViT-L. I am wondering why. Do you use any efficient attention operations?

Thanks a lot :)

@gheinrich
Collaborator

Hello, perhaps that is because RADIO defaults to BF16 and you compared against an FP32 model?

@wuyouliaoxi
Author

> Hello, perhaps that is because RADIO defaults to BF16 and you compared against an FP32 model?

Thank you for the reply! I enabled AMP mode for both cases and used RADIOv2.5-L, which is listed as FP32 on your homepage.
So it's strange to me...

@mranzinger
Collaborator

The reason behind this is that we store the checkpoint weights in bf16. If you load RADIO directly and don't cast its dtype, it occupies only half the memory of a traditional ViT in fp32. AMP-BF16 traditionally keeps the model weights themselves in fp32, which is why you won't see that particular memory saving for the weights in that mode.

We store the weights in bf16 to make storage, downloading, and loading cheaper, and the change in model accuracy seems to be negligible.
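
For readers who want to verify the dtype/memory point above, here is a minimal sketch. It assumes the model is loaded through the NVlabs/RADIO torch.hub entry point; the `radio_model` name and the `version` string follow the README but may need adjusting for your setup, and only parameter storage (not activations) is counted.

```python
# Minimal sketch (not from the thread): compare parameter memory of the bf16
# checkpoint vs. an fp32 cast. Hub entry point and version string are assumptions
# based on the RADIO README.
import torch

model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', progress=True)

def param_bytes(m: torch.nn.Module) -> int:
    # Total bytes occupied by the parameters, whatever their dtype.
    return sum(p.numel() * p.element_size() for p in m.parameters())

print(next(model.parameters()).dtype)                         # expected: torch.bfloat16
print(f"bf16 weights: {param_bytes(model) / 2**20:.0f} MiB")

# Casting to fp32 (what AMP effectively assumes for the master weights)
# roughly doubles the parameter memory.
model.float()
print(f"fp32 weights: {param_bytes(model) / 2**20:.0f} MiB")
```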
