The models required by Vision Search Assistant are listed below. They should be downloaded to your device automatically the first time you run the code. If the automatic download fails, you can download them manually.
We use GroundingDINO as the grounding model.
Model | Box AP on COCO | Weights |
---|---|---|
GroundingDINO-Tiny | 48.4 | Huggingface |
GroundingDINO-Base | 56.7 | Huggingface |
We use LLaVA-v1.6 as the core Vision Language Model.
Version | LLM | LLaVA-Bench-Wild | Weights |
---|---|---|---|
LLaVA-1.6 | Vicuna-7B | 81.6 | Huggingface |
LLaVA-1.6 | Vicuna-13B | 87.3 | Huggingface |
LLaVA-1.6 | Mistral-7B | 83.2 | Huggingface |
LLaVA-1.6 | Hermes-Yi-34B | 89.6 | Huggingface |
We use InternLM as the searching model.
Model | CMMLU | Weights |
---|---|---|
InternLM2.5-1.8B-Chat | - | Huggingface |
InternLM2.5-7B-Chat | 78.0 | Huggingface |
InternLM2.5-20B-Chat | - | Huggingface |
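
If the automatic download fails, the checkpoints listed above can be fetched manually. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The repository IDs in `ASSUMED_REPOS` are assumptions (the tables above only link to "Huggingface"), and the `checkpoints/` layout and the helper names `local_dir_for` and `download_all` are hypothetical, not part of this project. Check each ID against the corresponding link before using it.

```python
from pathlib import Path

# ASSUMPTION: these Hugging Face repo IDs are not stated in this README.
# Verify each one against the "Weights" links in the tables above.
ASSUMED_REPOS = {
    "grounding": "IDEA-Research/grounding-dino-base",
    "vlm": "liuhaotian/llava-v1.6-vicuna-7b",
    "search": "internlm/internlm2_5-7b-chat",
}


def local_dir_for(repo_id: str, root: str = "checkpoints") -> Path:
    """Map 'org/name' to '<root>/name' (hypothetical directory layout)."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(repos=None, root: str = "checkpoints") -> dict:
    """Download every repo in `repos` and return {role: local path}."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    repos = ASSUMED_REPOS if repos is None else repos
    paths = {}
    for role, repo_id in repos.items():
        target = local_dir_for(repo_id, root)
        # snapshot_download fetches all files of the repo into `local_dir`.
        paths[role] = snapshot_download(repo_id=repo_id, local_dir=str(target))
    return paths
```

Calling `download_all()` would place each model under `checkpoints/<model-name>`. To fetch a different variant from the tables, pass your own mapping, for example `download_all({"search": "<repo id from the Weights link>"})`. How the project is pointed at a local directory is not covered here, so consult its configuration before relying on this layout.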