Dear authors,
I recently found that the RADIO model uses almost half the GPU memory of a normal pre-trained ViT, even at the same model size as ViT-L. I am wondering why. Do you use any efficient attention operations?
Thanks a lot :)
The reason behind this is that we store the checkpoint weights in bf16. If you load RADIO directly and don't cast its dtype, the weights occupy only half the memory of a traditional ViT in fp32. Note that AMP-BF16 still keeps the model weights in fp32, which is why you won't see this particular saving in the weight memory when running under AMP.
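For illustration, here is a minimal PyTorch sketch of that AMP-BF16 behavior (not RADIO-specific, and assuming a CUDA device is available): the stored parameters stay in fp32, so autocast alone does not shrink weight memory; only the computation inside the autocast region runs in bf16.

```python
import torch

# Plain fp32 module standing in for any ViT; autocast does not change its weights.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)

print(next(model.parameters()).dtype)  # torch.float32 -- no weight-memory saving
print(y.dtype)                         # torch.bfloat16 -- compute/activations in bf16
```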
We store the weights in bf16 to make storage, downloading, and loading cheaper, and the change in model accuracy appears to be negligible.
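A quick way to see the bf16 storage is to load the model and inspect the parameter dtype and total footprint. This is only a sketch: the torch.hub entrypoint and version string below follow the pattern in the RADIO README but should be treated as assumptions (substitute whatever version you actually use), and `param_bytes` is a helper defined just for this example.

```python
import torch

# Assumed load call; check the entrypoint/version against the RADIO README.
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', progress=True)

def param_bytes(module: torch.nn.Module) -> int:
    """Total bytes occupied by a module's parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters())

print(next(model.parameters()).dtype)                      # expect torch.bfloat16
print(f"as loaded:      {param_bytes(model) / 2**20:.1f} MiB")

# Casting to fp32 roughly doubles the weight memory, recovering the
# footprint of a conventional fp32 ViT checkpoint of the same size.
model.float()
print(f"after .float(): {param_bytes(model) / 2**20:.1f} MiB")
```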