RADIO with LoRA #83

Open
aries-young opened this issue Aug 25, 2024 · 4 comments

@aries-young

I used RADIO-L as the visual encoder for LLaVA and added LoRA to RADIO-L in both the pretraining and finetuning stages. However, we observed the following two intriguing results:

  1. When RADIO-L encodes images at 768 resolution with a center crop, the trained LLaVA model's results on evaluation sets such as MMBench are similar to those of LLaVA-1.5 with CLIP-L-336.
  2. When RADIO-L encodes images at 336 resolution, or at even smaller resolutions such as 224, with a center crop, LLaVA training is more likely to experience a sudden increase in loss, leading to abnormal training results.

I'm not certain whether the issue is caused by RADIO-L's sensitivity to resolution or by the way RADIO-L is integrated with LoRA. I look forward to discussing this in more depth with you.
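
For context on how much the crop resolution changes the number of visual tokens, here is a small sketch. It assumes RADIO-L patchifies with 16x16 patches, which is consistent with the 16*16*3 = 768 input dimension of the patch_generator.embedder LoRA weight listed in the next comment; treat it as illustrative arithmetic rather than measured behavior.

# Assumption: RADIO-L uses a 16x16 patch embedder, so a square center crop of
# side `res` produces (res // 16) ** 2 spatial tokens.
for res in (768, 336, 224):
    tokens_per_side = res // 16
    print(res, tokens_per_side ** 2)  # 768 -> 2304, 336 -> 441, 224 -> 196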

@aries-young
Author

The detailed LoRA parameters attached to RADIO-L in our experiments are as follows (a configuration sketch that would reproduce these shapes is given after the listing):
visual_encoder.base_model.model.radio_model.model.blocks.0.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.0.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.0.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.0.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.0.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.0.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.0.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.0.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.1.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.1.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.1.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.1.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.1.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.1.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.1.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.1.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.2.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.2.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.2.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.2.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.2.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.2.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.2.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.2.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.3.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.3.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.3.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.3.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.3.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.3.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.3.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.3.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.4.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.4.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.4.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.4.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.4.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.4.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.4.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.4.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.5.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.5.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.5.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.5.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.5.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.5.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.5.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.5.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.6.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.6.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.6.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.6.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.6.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.6.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.6.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.6.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.7.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.7.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.7.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.7.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.7.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.7.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.7.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.7.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.8.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.8.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.8.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.8.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.8.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.8.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.8.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.8.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.9.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.9.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.9.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.9.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.9.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.9.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.9.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.9.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.10.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.10.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.10.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.10.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.10.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.10.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.10.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.10.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.11.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.11.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.11.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.11.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.11.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.11.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.11.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.11.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.12.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.12.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.12.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.12.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.12.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.12.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.12.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.12.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.13.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.13.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.13.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.13.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.13.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.13.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.13.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.13.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.14.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.14.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.14.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.14.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.14.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.14.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.14.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.14.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.15.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.15.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.15.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.15.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.15.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.15.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.15.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.15.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.16.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.16.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.16.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.16.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.16.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.16.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.16.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.16.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.17.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.17.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.17.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.17.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.17.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.17.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.17.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.17.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.18.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.18.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.18.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.18.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.18.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.18.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.18.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.18.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.19.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.19.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.19.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.19.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.19.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.19.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.19.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.19.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.20.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.20.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.20.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.20.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.20.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.20.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.20.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.20.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.21.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.21.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.21.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.21.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.21.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.21.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.21.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.21.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.22.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.22.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.22.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.22.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.22.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.22.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.22.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.22.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.23.attn.qkv.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.23.attn.qkv.lora_B.default.weight torch.Size([3072, 64])
visual_encoder.base_model.model.radio_model.model.blocks.23.attn.proj.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.23.attn.proj.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.blocks.23.mlp.fc1.lora_A.default.weight torch.Size([64, 1024])
visual_encoder.base_model.model.radio_model.model.blocks.23.mlp.fc1.lora_B.default.weight torch.Size([4096, 64])
visual_encoder.base_model.model.radio_model.model.blocks.23.mlp.fc2.lora_A.default.weight torch.Size([64, 4096])
visual_encoder.base_model.model.radio_model.model.blocks.23.mlp.fc2.lora_B.default.weight torch.Size([1024, 64])
visual_encoder.base_model.model.radio_model.model.patch_generator.embedder.lora_A.default.weight torch.Size([64, 768])
visual_encoder.base_model.model.radio_model.model.patch_generator.embedder.lora_B.default.weight torch.Size([1024, 64])
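
For reference, below is a minimal sketch of a peft LoraConfig that would produce adapters with the shapes above: rank 64 on the attention qkv/proj layers, the MLP fc1/fc2 layers, and the patch-generator embedder. It is an illustrative reconstruction, not our exact training code; load_radio_l is a hypothetical helper for loading the RADIO-L backbone, and lora_alpha/lora_dropout are assumptions since they are not visible in the listing.

from peft import LoraConfig, get_peft_model

# Hypothetical helper that returns the RADIO-L backbone (e.g. loaded via torch.hub).
radio_l = load_radio_l()

lora_config = LoraConfig(
    r=64,                 # rank 64, matching the [64, 1024] lora_A shapes above
    lora_alpha=16,        # assumption: not visible in the parameter listing
    lora_dropout=0.05,    # assumption
    bias="none",
    # Match the module names that appear in the parameter listing above.
    target_modules=["qkv", "proj", "fc1", "fc2", "embedder"],
)

visual_encoder = get_peft_model(radio_l, lora_config)
visual_encoder.print_trainable_parameters()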

@gheinrich
Collaborator

Hello, how are you pre-processing inputs into RADIO-L? Is the data passed as RGB values in a [0,1] range?

@aries-young
Author

aries-young commented Aug 27, 2024

> Hello, how are you pre-processing inputs into RADIO-L? Is the data passed as RGB values in a [0,1] range?

We initialized the image preprocessor in the following manner:

dynamic_image_processor = dict(
    type=CLIPImageProcessor.from_pretrained,
    pretrained_model_name_or_path=visual_encoder_name_or_path,
    trust_remote_code=True,
    do_resize=True,
    do_center_crop=True
)
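
For completeness, here is a minimal sketch of how such a processor is typically invoked, assuming the standard transformers CLIPImageProcessor API. The checkpoint path and image file are placeholders (the path stands in for visual_encoder_name_or_path); this is illustrative rather than our exact data pipeline.

from PIL import Image
from transformers import CLIPImageProcessor

# Placeholder path standing in for visual_encoder_name_or_path in the config above.
processor = CLIPImageProcessor.from_pretrained(
    "path/to/radio-l-checkpoint",
    trust_remote_code=True,
    do_resize=True,
    do_center_crop=True,
)

image = Image.open("example.jpg").convert("RGB")  # placeholder image
pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
# Inspect the value range, since the question above is whether inputs are RGB in [0, 1].
print(pixel_values.shape, pixel_values.min().item(), pixel_values.max().item())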

And here is an example of an input image tensor that we fed to the RADIO model:

tensor([[[[0.3412, 0.1765, 0.2118,  ..., 0.4275, 0.4353, 0.0863],
          [0.3255, 0.4980, 0.3176,  ..., 0.5804, 0.4314, 0.3608],
          [0.6784, 0.6275, 0.5020,  ..., 0.4510, 0.2863, 0.2353],
          ...,
          [0.5020, 0.4784, 0.4941,  ..., 0.6118, 0.6314, 0.6706],
          [0.4784, 0.4824, 0.5255,  ..., 0.6000, 0.6549, 0.6706],
          [0.4706, 0.4824, 0.4980,  ..., 0.6078, 0.6745, 0.6471]],

         [[0.7098, 0.5608, 0.5843,  ..., 0.7686, 0.9020, 0.7804],
          [0.6392, 0.8118, 0.6471,  ..., 0.9608, 0.9176, 0.9647],
          [0.8941, 0.8196, 0.7529,  ..., 0.8863, 0.7647, 0.6980],
          ...,
          [0.4549, 0.4353, 0.4510,  ..., 0.6275, 0.6392, 0.6784],
          [0.4314, 0.4353, 0.4784,  ..., 0.6157, 0.6627, 0.6824],
          [0.4235, 0.4353, 0.4510,  ..., 0.6235, 0.6824, 0.6588]],

         [[0.6588, 0.4706, 0.4667,  ..., 0.7843, 0.8980, 0.7373],
          [0.5961, 0.7373, 0.5333,  ..., 0.9608, 0.9059, 0.9294],
          [0.8667, 0.7647, 0.6627,  ..., 0.8667, 0.7373, 0.6706],
          ...,
          [0.4078, 0.3882, 0.4039,  ..., 0.7059, 0.7216, 0.7608],
          [0.3882, 0.3922, 0.4314,  ..., 0.6941, 0.7412, 0.7608],
          [0.3804, 0.3882, 0.4078,  ..., 0.7059, 0.7647, 0.7373]]]])

@gheinrich
Collaborator

Hello, yes, the dynamic range of your inputs does indeed look OK. In our LLaVA-1.5 experiments we don't do a center crop. Instead, we resize the image so that its longest edge becomes 768 pixels, keeping the input image's aspect ratio, and pad the shortest edge to the nearest multiple of 16 pixels. We did not evaluate the model on MMBench; however, our results on TextVQA, VQAv2, GQA, and POPE were very much in favor of RADIO-L (see the README at the root of this repository). We didn't use LoRA; instead, we kept RADIO-L frozen and trained only the projector and the LLM.
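
For anyone wanting to reproduce that preprocessing, here is a minimal sketch of the resize-and-pad step described above. It is an illustrative reconstruction rather than the repository's exact code; in particular, zero-padding on the right/bottom is an assumption.

import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def resize_and_pad(image: torch.Tensor, longest_edge: int = 768, pad_multiple: int = 16) -> torch.Tensor:
    # image: (C, H, W) RGB tensor with values in [0, 1].
    _, h, w = image.shape
    scale = longest_edge / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    image = TF.resize(image, [new_h, new_w], antialias=True)
    # Pad each spatial dimension up to the nearest multiple of `pad_multiple`.
    pad_h = (pad_multiple - new_h % pad_multiple) % pad_multiple
    pad_w = (pad_multiple - new_w % pad_multiple) % pad_multiple
    # F.pad takes (left, right, top, bottom) for the last two dims; padding with
    # zeros on the right/bottom is an assumption here.
    return F.pad(image, (0, pad_w, 0, pad_h), value=0.0)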
