A rigorous benchmark of state-of-the-art vision-language models on structured comic strip transcription. Models are evaluated by a GPT-5.2 Judge against human-annotated ground truth, focusing on text accuracy, speaker identification, and panel segmentation.
Comparing Gemini 3 Flash (Model A) vs Qwen 3 VL (Model B)
Accuracy is reported only for panels on which the two models independently agree; disagreement cases are excluded from the accuracy computation.
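The agreement gate can be made concrete with a short sketch: only panels where both models produce the same transcription count toward accuracy, and accuracy is then the share of those panels that also match the ground truth. The function and field names below are illustrative, not the project's actual code, and the `same` comparator is left as a parameter (one possible implementation is sketched near the end of this page).

```python
from typing import Callable

def accuracy_on_agreement(
    model_a: dict,
    model_b: dict,
    ground_truth: dict,
    same: Callable[[dict, dict], bool],
) -> float:
    """Accuracy over agreed panels only.

    `same(p, q)` decides whether two panel transcriptions match; panels are
    assumed to line up by index across the three inputs.
    """
    agreed = [
        (a, gt)
        for a, b, gt in zip(
            model_a["panels"], model_b["panels"], ground_truth["panels"]
        )
        if same(a, b)
    ]
    if not agreed:
        return 0.0  # no agreed panels, so nothing is reported
    return sum(1 for a, gt in agreed if same(a, gt)) / len(agreed)
```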
You are given a comic strip from the "Dilbert" series.
Task:
Transcribe the comic panel by panel from left to right.
For each panel:
- Identify ALL speakers present in that panel
- Transcribe the exact dialogue spoken by each speaker
Speaker Identification Guide:
- **Dilbert**: Main character, wears a white shirt, black tie, and glasses.
- **Dogbert**: A small white dog with glasses.
- **Pointy-Haired Boss**: Boss with distinct pointy hair.
- **Wally**: Short sleeve shirt, tie, glasses, flat hair.
- **Alice**: Woman with distinct triangular hair.
- **Phil, Prince of Insufficient Light**: Pale, tired, dimly glowing, slouched.
- If a character is clearly one of these, use their name.
- If the character is not one of these known characters, use "Unknown" or a generic label (e.g. "Doctor").
Important:
- A single panel MAY contain multiple speakers
- Each speaker must be listed separately with their own dialogue
- Use the context of the text (e.g. if someone is called "Dilbert" in text, label them Dilbert)
Rules:
- Do NOT explain the joke
- Do NOT summarize
- Do NOT merge panels
- Preserve left-to-right panel order
- Output STRICT JSON only
- Be precise and literal
Output STRICT JSON only in this format:
{
  "panels": [
    {
      "panel": 1,
      "dialogue": [
        { "speaker": "", "text": "" }
      ]
    }
  ]
}
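Because the output must be strict JSON, a harness can reject any reply that fails to parse or that deviates from the shape above. A minimal validation sketch, assuming only the key names shown in the format (everything else is illustrative harness code, not part of the benchmark):

```python
import json

def parse_transcription(raw: str) -> dict:
    """Parse a model reply and enforce the strict JSON shape shown above.

    Raises ValueError (or json.JSONDecodeError) if the reply is not valid
    JSON or is missing required keys.
    """
    data = json.loads(raw)
    if not isinstance(data.get("panels"), list):
        raise ValueError("missing 'panels' list")
    for panel in data["panels"]:
        if "panel" not in panel or not isinstance(panel.get("dialogue"), list):
            raise ValueError(f"malformed panel entry: {panel}")
        for line in panel["dialogue"]:
            if "speaker" not in line or "text" not in line:
                raise ValueError(f"malformed dialogue entry: {line}")
    return data
```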
You are an impartial evaluation model acting as a benchmark judge.
Your task is to evaluate vision-language model outputs against a provided ground truth for Dilbert comic strips.
You MUST:
- Follow the evaluation metrics and weights exactly as provided.
- Be strict, consistent, and deterministic.
- Penalize hallucinations more heavily than omissions.
- Reward uncertainty (e.g., UNKNOWN speaker) over confident incorrect answers.
- NOT infer or correct model outputs.
- NOT rewrite text.
- NOT normalize capitalization unless explicitly instructed.
--------------------------------------------------
EVALUATION METRICS (FIXED – DO NOT MODIFY)
1. Text Accuracy (40%)
- Compare extracted dialogue text with ground truth
- Penalize missing words
- Penalize hallucinated words more heavily
- Partial matches receive partial credit
2. Speaker Accuracy (25%)
- Correct speaker → full credit
- UNKNOWN when speaker is ambiguous → 70% credit
- Incorrect speaker → 0 credit
3. Capitalization Accuracy (15%)
- Ground truth ALL CAPS must remain ALL CAPS
- Penalize incorrect casing
- Ignore punctuation differences unless casing is affected
4. Panel Alignment (10%)
- Correct number of panels
- Correct dialogue assigned to correct panel
5. Hallucination Penalty (10%)
- No hallucination → full credit
- Minor hallucination → partial credit
- Major invented dialogue or speakers → zero credit
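The five metrics combine into a single judge score as a weighted sum. The arithmetic is sketched below with the fixed weights; the per-metric sub-scores (each in [0, 1]) are assumed to be produced by the judge itself, and the worked example at the bottom is purely illustrative.

```python
# Fixed weights from the metric definitions above.
WEIGHTS = {
    "text_accuracy": 0.40,
    "speaker_accuracy": 0.25,
    "capitalization_accuracy": 0.15,
    "panel_alignment": 0.10,
    "hallucination_penalty": 0.10,  # full credit when nothing is hallucinated
}

def overall_score(sub_scores: dict[str, float]) -> float:
    """Combine per-metric sub-scores (each in [0, 1]) into one weighted score."""
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)

# Illustrative example: perfect text and panel alignment, an UNKNOWN speaker
# where the ground truth is ambiguous (70% credit), slightly off casing,
# and no hallucination.
example = {
    "text_accuracy": 1.0,
    "speaker_accuracy": 0.7,
    "capitalization_accuracy": 0.9,
    "panel_alignment": 1.0,
    "hallucination_penalty": 1.0,
}
print(round(overall_score(example), 3))  # 0.91
```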
You are an expert transcriber and editor for comic strips.
Your task is to REVIEW and CORRECT a transcription of a Dilbert comic strip generated by another AI model.
Inputs:
1. The original comic strip image.
2. The current transcription (JSON format).
Instructions:
- Compare the transcription carefully against the image.
- Fix ANY errors in speaker attribution (e.g., Dilbert vs Dogbert).
- Add ANY MISSING lines of dialogue.
- Remove ANY HALLUCINATED lines of dialogue.
- Fix incorrect capitalization (dialogue is usually ALL CAPS).
- Fix panel ordering if wrong.
- Do NOT rewrite text if it is substantially correct (ignore minor punctuation differences).
- Do NOT add explanations or extra keys.
You must follow these core transcription rules exactly (same as the original generator):
You are given a comic strip from the "Dilbert" series.
Task:
Transcribe the comic panel by panel from left to right.
For each panel:
- Identify ALL speakers present in that panel
- Transcribe the exact dialogue spoken by each speaker
Speaker Identification Guide:
- **Dilbert**: Main character, wears a white shirt, black tie, and glasses.
- **Dogbert**: A small white dog with glasses.
- **Pointy-Haired Boss**: Boss with distinct pointy hair.
- **Wally**: Short sleeve shirt, tie, glasses, flat hair.
- **Alice**: Woman with distinct triangular hair.
- If a character is clearly one of these, use their name.
- If the character is not one of these known characters, use "Unknown" or a generic label (e.g. "Doctor").
Important:
- A single panel MAY contain multiple speakers
- Each speaker must be listed separately with their own dialogue
- Use the context of the text (e.g. if someone is called "Dilbert" in text, label them Dilbert)
Rules:
- Do NOT explain the joke
- Do NOT summarize
- Do NOT merge panels
- Preserve left-to-right panel order
- Output STRICT JSON only
- Be precise and literal
Output STRICT JSON only in this format:
{
  "panels": [
    {
      "panel": 1,
      "dialogue": [
        { "speaker": "NAME", "text": "DIALOGUE" }
      ]
    }
  ]
}
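One way to wire this review pass, sketched under the assumption of a generic vision-language client: the reviewer prompt above is concatenated with the first-pass JSON, sent alongside the original image, and the corrected strict-JSON reply is parsed. The `call_vlm` callable is a placeholder, not a real API.

```python
import json
from typing import Callable

def review_transcription(
    call_vlm: Callable[[str, bytes], str],  # placeholder: (prompt, image bytes) -> raw reply
    reviewer_prompt: str,
    image_bytes: bytes,
    first_pass: dict,
) -> dict:
    """Run the review/correction pass over a first-pass transcription.

    The reviewer prompt is combined with the current transcription, sent
    alongside the original image, and the corrected strict-JSON reply is
    parsed and returned.
    """
    prompt = (
        reviewer_prompt
        + "\n\nCurrent transcription:\n"
        + json.dumps(first_pass, indent=2)
    )
    raw = call_vlm(prompt, image_bytes)
    return json.loads(raw)  # the reviewer must also reply with strict JSON
```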
This dashboard compares raw outputs from two independent models to assess inter-model agreement. The comparison is strict and automated, with zero human intervention or AI correction.
"Dilbert" → "dilbert")"Hello!" →
"hello")