We evaluate LLaVA-Mini on 11 image benchmarks and 7 video benchmarks. Here we provide the evaluation scripts. To speed up evaluation, we provide multi-GPU parallel versions of the scripts, sketched below.
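The parallel scripts follow the chunked-inference pattern common to LLaVA-style evaluation: the question file is split into one shard per GPU, each shard runs in the background, and the per-chunk answers are merged afterwards. The sketch below only illustrates the idea; the flag names and paths here are assumptions, so consult `scripts/llavamini/eval_image/*.sh` for the exact invocations.

```bash
# Illustrative sketch of the multi-GPU chunking pattern (paths and flags are
# assumptions; the real invocations live in scripts/llavamini/eval_image/*.sh).
gpu_list="0,1,2,3"                       # GPUs to run on
IFS=',' read -ra GPULIST <<< "$gpu_list"
CHUNKS=${#GPULIST[@]}

for IDX in $(seq 0 $((CHUNKS - 1))); do
    # Each GPU processes one shard of the question file.
    CUDA_VISIBLE_DEVICES="${GPULIST[$IDX]}" python -m llava.eval.model_vqa_loader \
        --model-path ICTNLP/llava-mini-llama-3.1-8b \
        --question-file questions.jsonl \
        --answers-file "answers_${IDX}.jsonl" \
        --num-chunks "$CHUNKS" \
        --chunk-idx "$IDX" &
done
wait  # all shards done; the scripts then concatenate answers_*.jsonl
```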
## Image-based Benchmarks

The evaluation pipelines for all image-based benchmarks are consistent with those used in LLaVA-v1.5. Before preparing task-specific data, first download `eval.zip`, which contains custom annotations, scripts, and the prediction files of LLaVA-v1.5. Extract it to `./playground/data/eval`; this also provides the general directory structure for all datasets.
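For instance, assuming `eval.zip` has been downloaded to the repository root (the download location and archive layout are assumptions), the extraction might look like:

```bash
mkdir -p ./playground/data/eval
# If the archive already contains a top-level eval/ folder, extract to ./playground/data instead.
unzip -q eval.zip -d ./playground/data/eval
```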
### VQAv2

- Download `test2015` and put it under `./playground/data/eval/vqav2` (see the sketch after this list).
- Inference:
```bash
bash scripts/llavamini/eval_image/vqav2.sh
```
- Submit the results to the evaluation server: `./playground/data/eval/vqav2/answers_upload`.
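VQAv2 uses the COCO `test2015` image set. A download sketch, assuming the standard COCO mirror (verify the URL before use):

```bash
wget http://images.cocodataset.org/zips/test2015.zip
unzip -q test2015.zip -d ./playground/data/eval/vqav2  # creates .../vqav2/test2015
```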
### GQA

- Download the data and evaluation scripts following the official instructions and put them under `./playground/data/eval/gqa/data` (see the sketch after this list). You may need to modify `eval.py` due to missing assets in the GQA v1.2 release.
- Inference and evaluate:
```bash
bash scripts/llavamini/eval_image/gqa.sh
```
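A download sketch, assuming the GQA v1.2 links from the official site (verify before use):

```bash
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip
wget https://downloads.cs.stanford.edu/nlp/data/gqa/questions1.2.zip
unzip -q images.zip -d ./playground/data/eval/gqa/data
unzip -q questions1.2.zip -d ./playground/data/eval/gqa/data
```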
### VizWiz

- Download `test.json` and extract `test.zip` to `test`. Put them under `./playground/data/eval/vizwiz` (see the sketch after this list).
- Inference:
```bash
bash scripts/llavamini/eval_image/vizwiz.sh
```
- Submit the results to the evaluation server: `./playground/data/eval/vizwiz/answers_upload`.
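A placement sketch, assuming `test.json` and `test.zip` have already been downloaded to the working directory:

```bash
mkdir -p ./playground/data/eval/vizwiz
mv test.json ./playground/data/eval/vizwiz/
# If test.zip lacks a top-level test/ folder, extract to .../vizwiz/test instead.
unzip -q test.zip -d ./playground/data/eval/vizwiz
```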
### ScienceQA

- Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo (see the sketch after this list).
- Inference and evaluate:
```bash
bash scripts/llavamini/eval_image/sqa.sh
```
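A placement sketch: the JSON files can be copied from a clone of the ScienceQA repo, while `images` is hosted separately and must be fetched per that repo's instructions.

```bash
git clone https://github.com/lupantech/ScienceQA.git
mkdir -p ./playground/data/eval/scienceqa
cp ScienceQA/data/scienceqa/pid_splits.json ./playground/data/eval/scienceqa/
cp ScienceQA/data/scienceqa/problems.json ./playground/data/eval/scienceqa/
# `images` must be downloaded separately (see the ScienceQA repo) and placed here too.
```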
### TextVQA

- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `./playground/data/eval/textvqa` (see the sketch after this list).
- Inference and evaluate:
```bash
bash scripts/llavamini/eval_image/textvqa.sh
```
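A download sketch, assuming the links from the TextVQA site (verify before use):

```bash
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json -P ./playground/data/eval/textvqa
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip -q train_val_images.zip -d ./playground/data/eval/textvqa
```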
### POPE

- Download `coco` from POPE and put it under `./playground/data/eval/pope` (see the sketch after this list).
- Inference and evaluate:
```bash
bash scripts/llavamini/eval_image/pope.sh
```
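A fetch sketch, assuming the annotations live in the `output/coco` folder of the POPE repo (repo URL and path are assumptions; verify against the official POPE link):

```bash
git clone https://github.com/AoiDragon/POPE.git
mkdir -p ./playground/data/eval/pope
cp -r POPE/output/coco ./playground/data/eval/pope/coco
```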
### MME

- Download the data following the official instructions.
- Place the downloaded images in `MME_Benchmark_release_version`.
- Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME` (see the layout sketch after this list).
- Inference and evaluate:
```bash
bash scripts/llavamini/eval_image/mme.sh
```
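The resulting layout should look roughly like this (a sketch based on the steps above):

```bash
# ./playground/data/eval/MME/
# ├── MME_Benchmark_release_version/   # downloaded images and annotations
# └── eval_tool/                       # official evaluation tool
```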
### MMBench

- Download `mmbench_dev_20230712.tsv` and put it under `./playground/data/eval/mmbench` (see the sketch after this list).
- Inference:
```bash
bash scripts/llavamini/eval_image/mmbench.sh
```
- Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712`.
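A download sketch, assuming the OpenMMLab mirror used by LLaVA's docs (verify the URL; the MMBench-CN file below is fetched analogously):

```bash
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv \
    -P ./playground/data/eval/mmbench
```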
### MMBench-CN

- Download `mmbench_dev_cn_20231003.tsv` and put it under `./playground/data/eval/mmbench`.
- Inference:
```bash
bash scripts/llavamini/eval_image/mmbench_cn.sh
```
- Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
### SEED-Bench

- Follow the official instructions to download the images and videos. Put the images under `./playground/data/eval/seed_bench/SEED-Bench-image`.
- Extract the middle frame from each downloaded video and put the frames under `./playground/data/eval/seed_bench/SEED-Bench-video-image`. We provide our script `extract_video_frames.py`, modified from the official one (an ffmpeg-based alternative is sketched after this list).
- Inference and evaluate:
```bash
bash scripts/llavamini/eval_image/seed.sh
```
- Optionally, submit the results to the leaderboard: `./playground/data/eval/seed_bench/answers_upload`, using the official jupyter notebook.
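If you prefer not to use `extract_video_frames.py`, a hedged ffmpeg-based alternative for grabbing each video's middle frame looks like this (the input video directory is an assumption):

```bash
mkdir -p ./playground/data/eval/seed_bench/SEED-Bench-video-image
for f in ./playground/data/eval/seed_bench/videos/*.mp4; do
    dur=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$f")
    mid=$(echo "$dur / 2" | bc -l)   # seek to the temporal midpoint
    ffmpeg -y -v error -ss "$mid" -i "$f" -frames:v 1 \
        "./playground/data/eval/seed_bench/SEED-Bench-video-image/$(basename "${f%.*}").png"
done
```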
### LLaVA-Bench-in-the-Wild

- Extract the contents of `llava-bench-in-the-wild` to `./playground/data/eval/llava-bench-in-the-wild`.
- Inference and evaluate:
```bash
bash scripts/llavamini/eval_image/llavabench.sh
```
### MM-Vet

- Extract `mm-vet.zip` to `./playground/data/eval/mmvet`.
- Inference:
```bash
bash scripts/llavamini/eval_image/mmvet.sh
```
- Evaluate the predictions in `./playground/data/eval/mmvet/results` using the official jupyter notebook.
## Video-based Benchmarks

### Video-based Generative Performance Benchmarking

- Inference:
```bash
bash scripts/llavamini/eval_video/run_general_benckmarking.sh
bash scripts/llavamini/eval_video/run_temporal_benckmarking.sh
bash scripts/llavamini/eval_video/run_consistency_benckmarking.sh
```
- Evaluate using gpt-3.5-turbo (an API key is required; see the note after this list):
```bash
bash scripts/llavamini/eval_video/eval_benchmark_1_correctness.sh
bash scripts/llavamini/eval_video/eval_benchmark_2_detail.sh
bash scripts/llavamini/eval_video/eval_benchmark_3_contextual.sh
bash scripts/llavamini/eval_video/eval_benchmark_4_temporal.sh
bash scripts/llavamini/eval_video/eval_benchmark_5_consistency.sh
```
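The GPT-assisted scoring scripts call the OpenAI API. How the key is passed is script-specific; exporting it as an environment variable, as below, is a common convention (an assumption here, so check the scripts):

```bash
export OPENAI_API_KEY="sk-..."   # placeholder; substitute your own key
```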
### MSVD-QA

- Download the videos and questions from here.
- Inference:
```bash
bash scripts/llavamini/eval_video/run_qa_msvd.sh
```
- Evaluate using gpt-3.5-turbo:
```bash
bash scripts/llavamini/eval_video/eval_qa_msvd.sh
```
### MSRVTT-QA

- Download the videos and questions from here.
- Inference:
```bash
bash scripts/llavamini/eval_video/run_qa_msrvtt.sh
```
- Evaluate using gpt-3.5-turbo:
```bash
bash scripts/llavamini/eval_video/eval_qa_msrvtt.sh
```
### ActivityNet-QA

- Download the videos and questions following the official repo.
- Inference:
```bash
bash scripts/llavamini/eval_video/run_qa_activitynet.sh
```
- Evaluate using gpt-3.5-turbo:
```bash
bash scripts/llavamini/eval_video/eval_qa_activitynet.sh
```
### MVBench

- Download the videos and questions following the official repo.
- Inference and evaluate:
```bash
bash scripts/llavamini/eval_video/run_mvbench_mc.sh
```
### MLVU

- Download the videos and questions following the official repo.
- Inference and evaluate:
```bash
bash scripts/llavamini/eval_video/run_mlvu_mc.sh
```
### EgoSchema

- Download the videos and questions following the official repo.
- Inference and evaluate:
```bash
bash scripts/llavamini/eval_video/run_egoschema_mc.sh
```