This page documents what is on the visualization, why each design choice was made, where the numbers came from, and what stakeholders should walk away believing. If you are about to present the slide and someone is going to push back on a bar, the answer is on this page.
The OpenAI FrontierScience paper has one famous pair of numbers: GPT-5.2 scores 77% on the Olympiad set and 25% on the Research set. Underneath them is a much more interesting story: AI is now superhuman on closed-form science exams and still subhuman on actual research work.
The original FrontierScience data story we built walks through that gap in nine sections. Useful for a deep dive. Useless for a stakeholder slide. The ask here was different:
“Build a single slide-sized view that an exec can absorb in 5 seconds and walk away knowing where AI is winning, where humans are winning, and how the nuance plays out.”
That meant zooming out from one paper to the whole AI-vs-human-researcher landscape in May 2026 — pulling in GPQA, HLE, FrontierMath, SWE-bench, IMO, Codeforces, ARC-AGI, RadLE, METR, AlphaFold — and reducing it to a single visual.
The brief explicitly said no treemap. A diverging bar is the right substitute because the comparison is one-dimensional and oppositional: each row asks "who is better at X?" and that question has a left answer or a right answer.
Bonus: vertical ordering doubles as a sort. The viewer sees AI-leading rows clumped at the top, human-leading rows at the bottom — the silhouette gives you the answer before you read a single label.
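One way to get that silhouette is to sort rows by the signed gap between the AI and human scores. The sketch below is an illustration under that assumption, not the page's build code: the `Row` type and `sort_for_diverging_bar` helper are hypothetical names, and the chart's real ordering may differ. The sample rows are copied from the audit table on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One bar on the chart (hypothetical structure, for illustration only)."""
    capability: str
    ai: int      # AI score on the chart's 0-100 scale
    human: int   # human score on the same scale

    @property
    def gap(self) -> int:
        # Positive gap: AI leads. Negative gap: humans lead.
        return self.ai - self.human


def sort_for_diverging_bar(rows: list[Row]) -> list[Row]:
    """Order rows from the largest AI lead down to the largest human lead."""
    return sorted(rows, key=lambda r: r.gap, reverse=True)


# A few rows copied from the audit table on this page.
rows = [
    Row("Frontier mathematics", 6, 90),
    Row("GPQA Diamond", 93, 65),
    Row("ARC-AGI-3", 1, 100),
    Row("Speed per research task", 98, 15),
    Row("FrontierScience-Olympiad", 77, 72),
]

for row in sort_for_diverging_bar(rows):
    side = "AI leads" if row.gap > 0 else "Human leads"
    print(f"{row.capability:<28} {row.gap:+4d}  {side}")
```

Sorting on the signed gap, rather than on either score alone, is what puts AI-leading rows first and human-leading rows last with a single key.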
A stakeholder slide has to hand the viewer their soundbite. The four-card strip exists so anyone presenting can quote a number without zooming into the chart. The numbers were chosen for surprise value, not balance: two AI wins, one trajectory, one collapse.
Red/green carries a value judgement (good/bad) that this comparison shouldn't. Teal/amber are visually distinct, colour-blind safe, and value-neutral. They also match the existing FrontierScience story's palette (--accent, --accent2) so the two pages feel like one product.
Hovering over any bar reveals its AI score, human score, one-line insight, and source. Labels on the chart itself stay short, so the slide is readable as a static screenshot and interactive when projected from a laptop.
The body uses `height:100vh; overflow:hidden` and a CSS grid with four rows (`auto auto 1fr auto`). The chart and right rail share the flexible `1fr` row in the middle, so the slide stays composed at any roughly 16:9 viewport from 1280×720 up to a 4K monitor. A `@media (max-height:720px)` rule shrinks the layout for short viewports, and we tested down to 720px tall.
Every bar on the chart corresponds to a row below. AI numbers are quoted from the underlying benchmark's public leaderboard or paper. Human numbers are either published, when an expert baseline exists, or structural, when the comparison is an order-of-magnitude attribute such as speed, scale, or cost.

Rows where AI leads:

| Capability (benchmark) | AI (0–100) | Human (0–100) | Source |
|---|---|---|---|
| Multiple-choice PhD-level science (GPQA Diamond) | 93 | 65 | GPT-5.2 Pro on GPQA. Expert baseline from Rein et al. 2023. |
| Competitive programming (Codeforces Elo) | 91 | 80 | OpenAI o3 reached 2727 Elo ≈ top 0.1% of ~600k human competitors. Codeforces blog. |
| Software engineering (SWE-bench Verified) | 94 | 75 | Claude Mythos at 93.9% on the verified split. SWE-bench Verified. |
| Olympiad math (IMO 2025; 35/42 = gold) | 83 | 78 | DeepMind & OpenAI both at gold cutoff. IMO 2025 coverage. |
| Speed per research task (inverted: time-to-attempt) | 98 | 15 | FrontierScience: PhDs took 3-5h per Research problem; AI returns attempts in seconds. arXiv:2601.21165. |
| Scale & parallelism (concurrent attempts) | 97 | 8 | AlphaFold predicted 200M+ structures. DeepMind. |
| Olympiad-style closed-form science (FrontierScience-Olympiad) | 77 | 72 | GPT-5.2 verbatim from the FrontierScience paper. Human reference is the medalist-author standard. |

Rows where humans lead:

| Capability (benchmark) | AI (0–100) | Human (0–100) | Source |
|---|---|---|---|
| Open-ended research reasoning (FrontierScience-Research, ≥7/10 rubric) | 25 | 85 | GPT-5.2 quoted verbatim. Human reference: by construction, PhD authors must clear 7/10 themselves. |
| Frontier mathematics (FrontierMath Tier 4) | 6 | 90 | o4-mini best at 6.3%; most models 0%. Human mathematicians solve in days. Epoch AI. |
| Humanity's Last Exam (HLE) | 45 | 90 | Gemini 3.1 Pro Preview at 44.7% (May 2026). Expert humans ≈90% in their domains. CAIS HLE. |
| Novel adaptive reasoning (ARC-AGI-3, interactive) | 1 | 100 | All frontier models <1% on the March 2026 benchmark. arcprize.org. |
| Hard diagnostic radiology (Radiology Last Exam, RadLE) | 30 | 83 | Board-certified radiologists 83% vs GPT-5 30%. RadLE preprint. |
| Niche subfield expertise | 22 | 90 | FrontierScience paper's top failure mode: "niche concept failure", where the model substitutes a related-but-wrong concept. |
| Novel hypothesis / ideation | 15 | 92 | Sakana AI Scientist v2 got one workshop paper through, but 42% of experiments failed. Evaluation paper. |
| Wet-lab / physical experiment | 0 | 100 | FrontierScience §4 limitation: text-only eval; cannot interact with reality. Holds for every benchmark above. |
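
Where a Source cell quotes a raw figure, a reviewer can check it against the plotted bar. The sketch below assumes each bar is the raw figure rounded to the nearest whole number, and that the IMO bar is 35 of 42 points expressed as a percentage; both are assumptions of this example, not documented rules. The `AUDIT` list and `mismatches` helper are hypothetical names, and only rows with a quoted raw figure are included.

```python
# (label, raw figure quoted in the Source column, value plotted on the chart)
# Assumption: plotted values are nearest-integer roundings of the raw figure.
AUDIT = [
    ("SWE-bench Verified, AI", 93.9, 94),
    ("IMO 2025, AI (35 of 42 points, assumed to be a percentage)", 35 / 42 * 100, 83),
    ("FrontierMath Tier 4, AI", 6.3, 6),
    ("Humanity's Last Exam, AI", 44.7, 45),
    ("RadLE, AI", 30.0, 30),
    ("RadLE, human", 83.0, 83),
]


def mismatches(audit):
    """Return the labels whose plotted value is not the rounded raw figure."""
    return [label for label, raw, plotted in audit if round(raw) != plotted]


bad = mismatches(AUDIT)
print("all rows reconcile" if not bad else f"check these rows: {bad}")
```

Structural rows such as speed and scale have no raw percentage to round, so a check like this cannot cover them.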
GPT-5.2 Pro beats PhD experts with internet access on GPQA Diamond. That experts-with-search baseline was the original gold standard for "general scientific intelligence." The benchmark is now effectively solved.
AlphaFold predicted 200M+ structures, work that would have taken hundreds of millions of years in the lab. 3M+ researchers in 190+ countries use the database. 2024 Nobel in Chemistry.
OpenAI o3 reached 2727 Elo on Codeforces, which puts it in the top 0.1% of ~600k competitive programmers. The same model family scored IMO gold (35/42) in 2025.
On ARC-AGI-3 (March 2026), every frontier model is below 1%. Humans get 100%. When the benchmark removes memorized patterns and demands novel reasoning, the gap is total.
FrontierMath's research-level problems take human mathematicians days. The best AI solves 6.3%. Most frontier models score zero. Mathematics-as-research is not solved.
Emergency intracranial hemorrhage detection: radiologists' diagnostic odds ratio is 323× higher than AI's, with 4 false positives vs AI's 293. AI is not yet the doctor.
METR's task-completion time horizon tracks the longest task an AI can do reliably, measured by how long that task takes a human. That horizon is doubling every 4.3 months (post-2023 trend, updated Jan 2026). The slide's shock strip shows the same kind of trajectory: Humanity's Last Exam went from 8% (Jan 2025) to 44.7% (May 2026). Whatever AI cannot do today is on a four-month clock.
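To make the doubling claim concrete: a quantity that doubles every d months grows by a factor of 2^(m/d) over m months. The sketch below applies that formula to the 4.3-month figure quoted above. It assumes the doubling rate stays constant, which is an assumption of this example and not a forecast from METR; the function names are illustrative.

```python
import math


def growth_factor(months: float, doubling_months: float = 4.3) -> float:
    """How many times larger a quantity gets after `months` of steady doubling."""
    return 2 ** (months / doubling_months)


def months_to_multiply(factor: float, doubling_months: float = 4.3) -> float:
    """Months of steady doubling needed to grow by `factor`."""
    return doubling_months * math.log2(factor)


# Assumes the 4.3-month doubling time holds for the whole period.
print(f"12 months at a 4.3-month doubling: {growth_factor(12):.1f}x")
print(f"Months to reach 10x: {months_to_multiply(10):.1f}")
```

Under that constant-rate assumption, the time horizon grows roughly sevenfold in a year.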
All AI scores are from public benchmark leaderboards or papers, current as of May 2026. Human-expert scores are either published expert baselines or order-of-magnitude normalizations on a 0–100 scale (clearly disclosed above). This page is an audit trail for the visualization; if you spot a number you can't reconcile against the cited source, that's a bug — not a feature.