Dilbert Vision Model Evaluation

Benchmarking VLMs on strict accuracy metrics

A rigorous benchmark of state-of-the-art vision-language models on structured comic strip transcription. Models are evaluated by a GPT-5.2 Judge against human-annotated ground truth, with a focus on text accuracy, speaker identification, and panel segmentation.
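For concreteness, per-panel scoring along those three axes can be sketched as below. This is a minimal illustration, not the project's actual harness: the dataclass fields and the judge_panel stub are assumptions introduced here, and a real run would send both transcripts to the judge model rather than compare strings locally.

```python
from dataclasses import dataclass


@dataclass
class Panel:
    """One panel of a strip: position, speaker, and spoken text."""
    index: int
    speaker: str
    text: str


@dataclass
class PanelScore:
    """Per-axis verdicts for a single panel."""
    text_correct: bool
    speaker_correct: bool
    segmentation_correct: bool


def judge_panel(predicted: Panel, ground_truth: Panel) -> PanelScore:
    """Stand-in for the judge call. A real harness would prompt the judge
    model with both transcripts and parse its verdict; exact matching is
    used here only so the sketch runs on its own."""
    return PanelScore(
        text_correct=predicted.text.strip().casefold()
        == ground_truth.text.strip().casefold(),
        speaker_correct=predicted.speaker == ground_truth.speaker,
        segmentation_correct=predicted.index == ground_truth.index,
    )


def aggregate(scores: list[PanelScore]) -> dict[str, float]:
    """Mean accuracy on each axis across all judged panels."""
    n = max(len(scores), 1)
    return {
        "text_accuracy": sum(s.text_correct for s in scores) / n,
        "speaker_accuracy": sum(s.speaker_correct for s in scores) / n,
        "panel_segmentation_accuracy": sum(s.segmentation_correct for s in scores) / n,
    }
```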

Evaluation Results


Model A vs Model B Comparison

Comparing Gemini 3 Flash (Model A) vs Qwen 3 VL (Model B)

Agreement Breakdown

Accuracy is reported only for panels where the independently run models agree; panels where they disagree are excluded from the accuracy computation.
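A minimal sketch of that filtering step, under two assumptions not stated above: "agreement" means Model A and Model B return the same answer for a panel, and panels are keyed by a shared ID. The function name and data shapes are illustrative, not the project's actual code.

```python
def agreement_filtered_accuracy(
    model_a: dict[int, str],
    model_b: dict[int, str],
    ground_truth: dict[int, str],
) -> dict[str, float]:
    """Keep only panels where both models give the same answer, then
    score that shared answer against the human-annotated ground truth."""
    panels = model_a.keys() & model_b.keys() & ground_truth.keys()
    agreed = [p for p in panels if model_a[p] == model_b[p]]
    correct = sum(model_a[p] == ground_truth[p] for p in agreed)
    return {
        "panels_total": len(panels),
        "panels_agreed": len(agreed),
        "agreement_rate": len(agreed) / max(len(panels), 1),
        "accuracy_on_agreed": correct / max(len(agreed), 1),
    }
```

Note that the resulting accuracy is conditional on agreement, so it is not directly comparable to an accuracy computed over all panels.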