Evaluating AI models on their ability to generate SVG images from text prompts
This benchmark evaluates 10 AI models across 30 creative prompts, with each image scored on 5 criteria by an ensemble of 7 vision models.
- **Prompt:** How well does the image match the text prompt?
- **Structure:** Are objects drawn correctly and recognizably?
- **Physics:** Does the scene make physical/spatial sense?
- **Complete:** Are all requested elements present?
- **Coherence:** Is the style consistent and visually appealing?
- **Avg Score:** Weighted average of all criteria (higher is better)
Each image was scored by 7 vision AI models; scores were averaged across evaluators for robustness.
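A minimal sketch of this scoring pipeline, assuming equal criterion weights (the benchmark's actual weighting is not stated here) and random placeholder scores in place of real judge outputs:

```python
import numpy as np

# scores[e, c]: score from evaluator e on criterion c for a single image,
# with 7 evaluators and 5 criteria (Prompt, Structure, Physics, Complete,
# Coherence). Placeholder data stands in for real judge outputs.
scores = np.random.default_rng(0).uniform(4, 9, size=(7, 5))

# Hypothetical equal weights; the benchmark's actual weights are assumptions.
weights = np.full(5, 1 / 5)

per_evaluator = scores @ weights  # weighted average per evaluator
overall = per_evaluator.mean()    # ensemble mean across the 7 evaluators
print(f"overall score: {overall:.2f}")
```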
| Rank | Model | Avg Score | Prompt | Structure | Physics | Complete | Coherence |
|---|---|---|---|---|---|---|---|
Average scores showing how each evaluator rates each generator. Use this matrix to identify evaluator bias and to decide which evaluator to trust most.
A chi-square independence analysis shows that evaluator scores are almost entirely explained by two global effects: overall generator strength and overall evaluator strictness. The interaction between specific evaluators and specific generators is extremely small (χ² = 7.88, df = 54, p ≈ 1, Cramér's V ≈ 0.016).
In practical terms, evaluators are not strongly favoring or penalizing particular generators. They apply a relatively consistent scoring style across all models. Differences in scores mostly reflect how strong a generator is overall and whether an evaluator is generally strict or lenient — not evaluator favoritism.
Why this matters: Rankings are not driven by hidden evaluator bias.
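A sketch of how such a test can be run with SciPy, assuming the summed evaluator × generator scores are arranged as a contingency table; the data below is a random placeholder, not the benchmark's actual scores:

```python
import numpy as np
from scipy.stats import chi2_contingency

# table[g, e]: summed score that evaluator e gave generator g over all
# prompts. 10 generators x 7 evaluators, so dof = (10-1)*(7-1) = 54.
# Placeholder data for illustration only.
table = np.random.default_rng(1).uniform(150, 250, size=(10, 7))

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V normalizes chi-square by sample size and table shape; values
# near 0 mean the evaluator-generator interaction is negligible.
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3g}, V={v:.3f}")
```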
Evaluators show clear differences in scoring style. Some are consistently harsher or more lenient, while others use a narrower or broader scoring range. For example, Allen AI Molmo 8B is the harshest evaluator (lowest mean score), while Gemini 3 Flash is the most lenient. Google Gemma 3 27B shows the most compressed range, differentiating models the least, whereas Gemini Flash Image spreads scores the most.
These differences reflect calibration, not bias. High agreement between evaluators (Spearman ρ ≈ 0.95–0.99) indicates that, despite stylistic differences, evaluators largely agree on relative model quality.
Why this matters: Raw score differences reflect style, not unreliability.
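A sketch of how pairwise evaluator agreement can be measured, again with placeholder data; Spearman correlation compares rank orderings, so differences in strictness or scoring range drop out:

```python
import numpy as np
from scipy.stats import spearmanr

# scores[e, g]: mean score from evaluator e for generator g (7 x 10),
# random placeholder data in place of the real score matrix.
scores = np.random.default_rng(2).uniform(4, 9, size=(7, 10))

# Pairwise Spearman rank correlation between evaluators. High rho means
# two evaluators order the generators the same way even if one is harsher.
n_eval = scores.shape[0]
for i in range(n_eval):
    for j in range(i + 1, n_eval):
        rho, _ = spearmanr(scores[i], scores[j])
        print(f"evaluator {i} vs {j}: rho={rho:+.3f}")
```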
While overall interaction effects are minimal, a few mild quirks exist. For example, Gemini Flash Image scores GPT-5.2 Pro higher than expected and is slightly harsher on weaker generators, while Google Gemma 3 27B is unusually generous to Grok. None of these deviations are large enough to indicate serious bias.
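Quirks like these surface as residuals from the independence model: the gap between each observed cell and what global generator strength plus evaluator strictness would predict. A sketch, using the expected table returned by chi2_contingency and placeholder data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# table[g, e]: summed score per generator-evaluator pair (placeholder data).
table = np.random.default_rng(1).uniform(150, 250, size=(10, 7))
chi2, p, dof, expected = chi2_contingency(table)

# Standardized (Pearson) residuals: cells far from 0 mark evaluator-generator
# pairs scored higher or lower than the two global effects predict.
residuals = (table - expected) / np.sqrt(expected)
g, e = np.unravel_index(np.abs(residuals).argmax(), residuals.shape)
print(f"largest quirk: generator {g}, evaluator {e}, residual {residuals[g, e]:+.2f}")
```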
If a single evaluator must be trusted most, Gemini 3 Flash stands out: it shows extremely high agreement with others, low idiosyncratic behavior, and a reasonable scoring range. It balances neutrality with discrimination, making it the best overall reviewer. For stress-testing, Allen AI Molmo 8B is useful — provided its harsher scale is normalized.
Why this matters: Answers "which evaluator should I trust and why?"
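Normalizing a harsher evaluator's scale can be done with a per-evaluator z-score; a sketch with placeholder data:

```python
import numpy as np

# scores[e, g]: mean score from evaluator e for generator g (placeholder).
scores = np.random.default_rng(3).uniform(4, 9, size=(7, 10))

# Z-score each evaluator's row: subtracting its mean removes strictness,
# dividing by its std removes range compression, so a harsh judge like
# Molmo 8B becomes directly comparable to a lenient one.
z = (scores - scores.mean(axis=1, keepdims=True)) / scores.std(axis=1, keepdims=True)

# Ranking by column means now reflects relative generator quality with
# evaluator style factored out.
order = z.mean(axis=0).argsort()[::-1]
print("generators, best to worst:", order)
```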
Detailed analyses of representative prompts show why models scored differently.