Evaluation

We evaluate LLaVA-Mini on 11 image benchmarks and 7 video benchmarks. Here, we provide the evaluation scripts. To speed up evaluation, we also provide multi-GPU parallel versions of the scripts.
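
Each benchmark below follows the same pattern: prepare the data under ./playground/data/eval, then run the corresponding script. As a minimal sketch, a run can be restricted to specific GPUs via CUDA_VISIBLE_DEVICES (the parallel scripts may also manage GPU assignment internally, so check each script before overriding):

# hypothetical example: run the VQAv2 evaluation on four GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/llavamini/eval_image/vqav2.sh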

Image-based Benchmarks

The evaluation pipelines for all image-based benchmarks are consistent with those used in LLaVA-v1.5. Before preparing task-specific data, first download eval.zip, which contains custom annotations, scripts, and the prediction files of LLaVA-v1.5. Extract it to ./playground/data/eval; this also provides the general directory structure for all datasets.
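
A minimal sketch of this preparation step, assuming eval.zip has already been downloaded into the repository root:

mkdir -p ./playground/data/eval
unzip eval.zip -d ./playground/data/eval   # adjust the -d target if the archive already contains a top-level eval/ folder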

VQAv2

  1. Download test2015 and put it under ./playground/data/eval/vqav2.
  2. Inference.
bash scripts/llavamini/eval_image/vqav2.sh
  3. Submit the results to the evaluation server: ./playground/data/eval/vqav2/answers_upload.
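
A sketch of step 1 above, assuming the standard COCO mirror for the test2015 images (the URL is an assumption; use the official VQAv2/COCO download page if it differs):

cd ./playground/data/eval/vqav2
wget http://images.cocodataset.org/zips/test2015.zip   # assumed COCO mirror for test2015
unzip test2015.zip                                     # creates ./test2015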

GQA

  1. Download the data and evaluation scripts following the official instructions and put them under ./playground/data/eval/gqa/data (a download sketch follows this list). You may need to modify eval.py to account for the missing assets in the GQA v1.2 release.
  2. Inference and evaluate.
bash scripts/llavamini/eval_image/gqa.sh
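
A sketch of step 1 above; the Stanford download URLs below are assumptions, so prefer the links in the official GQA instructions if they differ:

cd ./playground/data/eval/gqa/data
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip         # assumed image archive
wget https://downloads.cs.stanford.edu/nlp/data/gqa/questions1.2.zip   # assumed question archive
unzip images.zip && unzip questions1.2.zip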

VizWiz

  1. Download test.json and extract test.zip to test. Put them under ./playground/data/eval/vizwiz.
  2. Inference.
bash scripts/llavamini/eval_image/vizwiz.sh
  3. Submit the results to the evaluation server: ./playground/data/eval/vizwiz/answers_upload.
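
A sketch of step 1 above, assuming test.zip unpacks into a test/ image folder:

cd ./playground/data/eval/vizwiz
unzip test.zip       # assumed to produce ./test containing the images
ls test.json test    # both should now sit under ./playground/data/eval/vizwiz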

ScienceQA

  1. Under ./playground/data/eval/scienceqa, download images, pid_splits.json, problems.json from the data/scienceqa folder of the ScienceQA repo.
  2. Inference and evaluate.
bash scripts/llavamini/eval_image/sqa.sh
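
A sketch of step 1 above, assuming the lupantech/ScienceQA repository layout in which data/scienceqa holds the two JSON files (the test images still need to be fetched per that repo's instructions):

cd ./playground/data/eval/scienceqa
git clone https://github.com/lupantech/ScienceQA.git /tmp/ScienceQA   # assumed official repo
cp /tmp/ScienceQA/data/scienceqa/pid_splits.json .
cp /tmp/ScienceQA/data/scienceqa/problems.json .
# download the test images into ./images following the ScienceQA repo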

TextVQA

  1. Download TextVQA_0.5.1_val.json and the images, and extract them to ./playground/data/eval/textvqa (see the download sketch after this list).
  2. Inference and evaluate.
bash scripts/llavamini/eval_image/textvqa.sh
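
A sketch of step 1 above; the download URLs are assumptions based on the official TextVQA release, so verify them on the TextVQA website:

cd ./playground/data/eval/textvqa
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json    # assumed official URL
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip    # assumed official URL
unzip train_val_images.zip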

POPE

  1. Download coco from POPE and put under ./playground/data/eval/pope.
  2. Inference and evaluate.
bash scripts/llavamini/eval_image/pope.sh
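
A sketch of step 1 above, assuming the POPE repository keeps its COCO annotations under output/coco (verify against the current repo layout):

git clone https://github.com/AoiDragon/POPE.git /tmp/POPE   # assumed POPE repo location
cp -r /tmp/POPE/output/coco ./playground/data/eval/pope/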

MME

  1. Download the data following the official instructions here.
  2. Place the downloaded images in MME_Benchmark_release_version.
  3. Put the official eval_tool and MME_Benchmark_release_version under ./playground/data/eval/MME (see the layout sketch after this list).
  4. Inference and evaluate.
bash scripts/llavamini/eval_image/mme.sh
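
The layout the script expects after steps 1–3 is roughly the following (a sketch; the contents of the two folders follow the official MME release):

./playground/data/eval/MME/eval_tool/
./playground/data/eval/MME/MME_Benchmark_release_version/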

MMBench

  1. Download mmbench_dev_20230712.tsv and put under ./playground/data/eval/mmbench.
  2. Inference.
bash scripts/llavamini/eval_image/mmbench.sh
  3. Submit the results to the evaluation server: ./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712.

MMBench-CN

  1. Download mmbench_dev_cn_20231003.tsv and put under ./playground/data/eval/mmbench.
  2. Inference.
bash scripts/llavamini/eval_image/mmbench_cn.sh
  3. Submit the results to the evaluation server: ./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003.

SEED-Bench

  1. Follow the official instructions to download the images and the videos. Put the images under ./playground/data/eval/seed_bench/SEED-Bench-image.
  2. Extract the middle frame from each downloaded video and put the frames under ./playground/data/eval/seed_bench/SEED-Bench-video-image. We provide our script extract_video_frames.py, modified from the official one (see the sketch after this list).
  3. Inference and evaluate.
bash scripts/llavamini/eval_image/seed.sh
  4. Optionally, submit the results in ./playground/data/eval/seed_bench/answers_upload to the leaderboard using the official Jupyter notebook.
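
The provided extract_video_frames.py handles step 2; for reference, an equivalent shell sketch with ffprobe/ffmpeg that grabs one frame from the temporal middle of each video (illustrative only, not the script the repo uses; the video location is an assumption):

for v in ./playground/data/eval/seed_bench/videos/*.mp4; do   # assumed location of the downloaded videos
  dur=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$v")
  mid=$(awk "BEGIN{print $dur/2}")                             # seek to half the duration
  ffmpeg -y -ss "$mid" -i "$v" -frames:v 1 \
    "./playground/data/eval/seed_bench/SEED-Bench-video-image/$(basename "${v%.*}").png"
done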

LLaVA-Bench-in-the-Wild

  1. Extract contents of llava-bench-in-the-wild to ./playground/data/eval/llava-bench-in-the-wild.
  2. Inference and evaluate.
bash scripts/llavamini/eval_image/llavabench.sh

MM-Vet

  1. Extract mm-vet.zip to ./playground/data/eval/mmvet.
  2. Inference.
bash scripts/llavamini/eval_image/mmvet.sh
  3. Evaluate the predictions in ./playground/data/eval/mmvet/results using the official Jupyter notebook.

Video-based Benchmarks

Video-based generative performance benchmark

  1. Download the videos from here, and the question-answer pairs from here.
  2. Inference.
bash scripts/llavamini/eval_video/run_general_benckmarking.sh
bash scripts/llavamini/eval_video/run_temporal_benckmarking.sh
bash scripts/llavamini/eval_video/run_consistency_benckmarking.sh
  3. Evaluate using gpt-3.5-turbo (see the setup sketch after these commands).
bash scripts/llavamini/eval_video/eval_benchmark_1_correctness.sh
bash scripts/llavamini/eval_video/eval_benchmark_2_detail.sh
bash scripts/llavamini/eval_video/eval_benchmark_3_contextual.sh
bash scripts/llavamini/eval_video/eval_benchmark_4_temporal.sh
bash scripts/llavamini/eval_video/eval_benchmark_5_consistency.sh
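
These evaluation scripts query gpt-3.5-turbo through the OpenAI API. How the key is passed depends on each script's arguments, but a common setup is via the environment (a sketch, assuming the scripts or their Python backends read OPENAI_API_KEY):

export OPENAI_API_KEY=<your-openai-key>
bash scripts/llavamini/eval_video/eval_benchmark_1_correctness.sh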

MSVD-QA

  1. Download the videos and questions from here.
  2. Inference.
bash scripts/llavamini/eval_video/run_qa_msvd.sh
  3. Evaluate using gpt-3.5-turbo.
bash scripts/llavamini/eval_video/eval_qa_msvd.sh

MSRVTT-QA

  1. Download the videos and questions from here.
  2. Inference.
bash scripts/llavamini/eval_video/run_qa_msrvtt.sh
  3. Evaluate using gpt-3.5-turbo.
bash scripts/llavamini/eval_video/eval_qa_msrvtt.sh

ActivityNet-QA

  1. Download the videos and questions following the official repo.
  2. Inference.
bash scripts/llavamini/eval_video/run_qa_activitynet.sh
  3. Evaluate using gpt-3.5-turbo.
bash scripts/llavamini/eval_video/eval_qa_activitynet.sh

MVBench

  1. Download the videos and questions following the official repo.
  2. Inference and evaluate.
bash scripts/llavamini/eval_video/run_mvbench_mc.sh

MLVU

  1. Download the videos and questions following the official repo.
  2. Inference and evaluate.
bash scripts/llavamini/eval_video/run_mlvu_mc.sh

EgoSchema

  1. Download the videos and questions following the official repo.
  2. Inference and evaluate.
bash scripts/llavamini/eval_video/run_egoschema_mc.sh