Evaluating AI models on their ability to generate SVG images from text prompts
This benchmark evaluates 10 AI models across 30 creative prompts, with each image scored on 5 criteria by an ensemble of 7 vision models.
- **Prompt:** How well does the image match the text prompt?
- **Structure:** Are objects drawn correctly and recognizably?
- **Physics:** Does the scene make physical/spatial sense?
- **Complete:** Are all requested elements present?
- **Coherence:** Is the style consistent and visually appealing?
- **Avg Score:** Weighted average of all criteria (higher is better)
Each image was scored by 7 vision AI models; scores were averaged across evaluators for robustness.
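A minimal sketch of this scoring pipeline, assuming equal criterion weights (the benchmark's actual weighting is not stated here) and random placeholder scores in place of real judge outputs:

```python
import numpy as np

# scores[e, c]: score from evaluator e on criterion c for a single image,
# with 7 evaluators and 5 criteria (Prompt, Structure, Physics, Complete,
# Coherence). Placeholder data stands in for real judge outputs.
scores = np.random.default_rng(0).uniform(4, 9, size=(7, 5))

# Hypothetical equal weights; the benchmark's actual weights are assumptions.
weights = np.full(5, 1 / 5)

per_evaluator = scores @ weights  # weighted average per evaluator
overall = per_evaluator.mean()    # ensemble mean across the 7 evaluators
print(f"overall score: {overall:.2f}")
```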
| Rank | Model | Avg Score | Prompt | Structure | Physics | Complete | Coherence |
|---|---|---|---|---|---|---|---|
Average scores showing how each evaluator rates each generator. Use this matrix to identify evaluator bias and to decide which evaluator to trust most.
A chi-square independence analysis shows that evaluator scores are almost entirely explained by two global effects: overall generator strength and overall evaluator strictness. The interaction between specific evaluators and specific generators is extremely small (χ² = 7.88, df = 54, p ≈ 1, Cramér's V ≈ 0.016).
In practical terms, evaluators are not strongly favoring or penalizing particular generators. They apply a relatively consistent scoring style across all models. Differences in scores mostly reflect how strong a generator is overall and whether an evaluator is generally strict or lenient — not evaluator favoritism.
Why this matters: Rankings are not driven by hidden evaluator bias.
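A sketch of how such a test can be run with SciPy, assuming the summed evaluator × generator scores are arranged as a contingency table; the data below is a random placeholder, not the benchmark's actual scores:

```python
import numpy as np
from scipy.stats import chi2_contingency

# table[g, e]: summed score that evaluator e gave generator g over all
# prompts. 10 generators x 7 evaluators, so dof = (10-1)*(7-1) = 54.
# Placeholder data for illustration only.
table = np.random.default_rng(1).uniform(150, 250, size=(10, 7))

chi2, p, dof, expected = chi2_contingency(table)

# Cramér's V normalizes chi-square by sample size and table shape; values
# near 0 mean the evaluator-generator interaction is negligible.
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3g}, V={v:.3f}")
```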
Evaluators show clear differences in scoring style. Some are consistently harsher or more lenient, while others use a narrower or broader scoring range. For example, Allen AI Molmo 8B is the harshest evaluator (lowest mean score), while Gemini 3 Flash is the most lenient. Google Gemma 3 27B shows the most compressed range, differentiating models the least, whereas Gemini Flash Image spreads scores the most.
These differences reflect calibration, not bias. High agreement between evaluators (Spearman ρ ≈ 0.95–0.99) indicates that, despite stylistic differences, evaluators largely agree on relative model quality.
Why this matters: Raw score differences reflect style, not unreliability.
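A sketch of how pairwise evaluator agreement can be measured, again with placeholder data; Spearman correlation compares rank orderings, so differences in strictness or scoring range drop out:

```python
import numpy as np
from scipy.stats import spearmanr

# scores[e, g]: mean score from evaluator e for generator g (7 x 10),
# random placeholder data in place of the real score matrix.
scores = np.random.default_rng(2).uniform(4, 9, size=(7, 10))

# Pairwise Spearman rank correlation between evaluators. High rho means
# two evaluators order the generators the same way even if one is harsher.
n_eval = scores.shape[0]
for i in range(n_eval):
    for j in range(i + 1, n_eval):
        rho, _ = spearmanr(scores[i], scores[j])
        print(f"evaluator {i} vs {j}: rho={rho:+.3f}")
```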
While overall interaction effects are minimal, a few mild quirks exist. For example, Gemini Flash Image scores GPT-5.2 Pro higher than expected and is slightly harsher on weaker generators, while Google Gemma 3 27B is unusually generous to Grok. None of these deviations are large enough to indicate serious bias.
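Quirks like these surface as residuals from the independence model: the gap between each observed cell and what global generator strength plus evaluator strictness would predict. A sketch, using the expected table returned by chi2_contingency and placeholder data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# table[g, e]: summed score per generator-evaluator pair (placeholder data).
table = np.random.default_rng(1).uniform(150, 250, size=(10, 7))
chi2, p, dof, expected = chi2_contingency(table)

# Standardized (Pearson) residuals: cells far from 0 mark evaluator-generator
# pairs scored higher or lower than the two global effects predict.
residuals = (table - expected) / np.sqrt(expected)
g, e = np.unravel_index(np.abs(residuals).argmax(), residuals.shape)
print(f"largest quirk: generator {g}, evaluator {e}, residual {residuals[g, e]:+.2f}")
```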
If a single evaluator must be trusted most, Gemini 3 Flash stands out: it shows extremely high agreement with others, low idiosyncratic behavior, and a reasonable scoring range. It balances neutrality with discrimination, making it the best overall reviewer. For stress-testing, Allen AI Molmo 8B is useful — provided its harsher scale is normalized.
Why this matters: Answers "which evaluator should I trust and why?"
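Normalizing a harsher evaluator's scale can be done with a per-evaluator z-score; a sketch with placeholder data:

```python
import numpy as np

# scores[e, g]: mean score from evaluator e for generator g (placeholder).
scores = np.random.default_rng(3).uniform(4, 9, size=(7, 10))

# Z-score each evaluator's row: subtracting its mean removes strictness,
# dividing by its std removes range compression, so a harsh judge like
# Molmo 8B becomes directly comparable to a lenient one.
z = (scores - scores.mean(axis=1, keepdims=True)) / scores.std(axis=1, keepdims=True)

# Ranking by column means now reflects relative generator quality with
# evaluator style factored out.
order = z.mean(axis=0).argsort()[::-1]
print("generators, best to worst:", order)
```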
Detailed analyses of representative prompts show why models scored differently.