Design notes · sources · methodology

About this slide:
Where AI stands vs Human Researchers

This page documents what is on the visualization, why each design choice was made, where the numbers came from, and what stakeholders should walk away believing. If you are about to present the slide and someone is going to push back on a bar, the answer is on this page.

1 · Why this slide exists

The OpenAI FrontierScience paper has one famous pair of numbers (GPT-5.2 scores 77% on the Olympiad set and 25% on the Research set) and a much more interesting story underneath it: AI is now superhuman on closed-form science exams and still subhuman on actual research work.

The original FrontierScience data story we built walks through that gap in nine sections. Useful for a deep dive. Useless for a stakeholder slide. The ask here was different:

“Build a single slide-sized view that an exec can absorb in 5 seconds and walk away knowing where AI is winning, where humans are winning, and how the nuance plays out.”

That meant zooming out from one paper to the whole AI-vs-human-researcher landscape in May 2026 — pulling in GPQA, HLE, FrontierMath, SWE-bench, IMO, Codeforces, ARC-AGI, RadLE, METR, AlphaFold — and reducing it to a single visual.

2 · What's actually on the slide

The visual layout, top to bottom

3 · Design decisions and why

Format: diverging horizontal bars (not a treemap)

The brief explicitly said no treemap. A diverging bar is the right substitute because the comparison is one-dimensional and oppositional: each row asks "who is better at X?" and that question has a left answer or a right answer.

Bonus: vertical ordering doubles as a sort. The viewer sees AI-leading rows clumped at the top, human-leading rows at the bottom — the silhouette gives you the answer before you read a single label.
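A minimal sketch of that ordering, assuming rows are sorted by the signed gap (AI score minus human score). The notes only say that AI-leading rows sit at the top and human-leading rows at the bottom, so the exact sort key, the function names, and the choice of five sample rows are assumptions; the scores themselves are taken from the data tables in section 4.

```python
# Sample rows from the section 4 tables: (capability, ai_score, human_score).
ROWS = [
    ("Open-ended research reasoning", 25, 85),
    ("Multiple-choice PhD-level science", 93, 65),
    ("Novel adaptive reasoning", 1, 100),
    ("Olympiad math", 83, 78),
    ("Speed per research task", 98, 15),
]


def signed_gap(row):
    """Positive when AI leads, negative when humans lead."""
    _, ai, human = row
    return ai - human


def order_for_chart(rows):
    """Largest AI lead first, largest human lead last (assumed sort key)."""
    return sorted(rows, key=signed_gap, reverse=True)


ordered = order_for_chart(ROWS)
for name, ai, human in ordered:
    side = "AI" if ai > human else "Human"
    print(f"{name:<36} {ai - human:+4d}  {side} side")
```

With this key, the sign of the gap picks the side of the axis and its magnitude picks the vertical position, which is what produces the top-heavy and bottom-heavy silhouette.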

Why a shock strip on top

A stakeholder slide has to hand the viewer their soundbite. The four-card strip exists so anyone presenting can quote a number without zooming into the chart. The numbers were chosen for surprise value, not balance: two AI wins, one trajectory, one collapse.

Color: teal vs amber, not red vs green

Red/green carries a value judgement (good/bad) that this comparison shouldn't imply. Teal and amber are visually distinct, colour-blind safe, and value-neutral. They also match the existing FrontierScience story's palette (--accent, --accent2), so the two pages feel like one product.

Tooltips, not labels

Hovering over any bar reveals the AI score, the human score, a one-line insight, and the source. Labels on the chart itself stay short, so the slide reads well as a static screenshot and stays interactive when projected from a laptop.

The single-screen constraint

Body uses height:100vh;overflow:hidden and a CSS grid with four rows (auto auto 1fr auto). The chart and right rail share the flexible 1fr row in the middle, so the slide stays composed at any 16:9-ish viewport from 1280×720 up to a 4K monitor. We tested down to 720px tall via a @media (max-height:720px) shrink.

4 · The data, with sources

Every bar on the chart corresponds to a row below. AI numbers are quoted from the underlying benchmark's public leaderboard or paper. Human numbers are one of two kinds: published, when an expert baseline exists, or structural, when the comparison is an order-of-magnitude attribute like speed, scale, or cost.
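One way to keep that published-versus-structural distinction auditable is to store it on each row. The sketch below is illustrative only: the class name ChartRow and its field names are hypothetical, not taken from the slide's actual code. The two sample rows use values from the tables in this section.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ChartRow:
    """One bar on the chart (hypothetical record layout)."""
    capability: str
    ai: int                    # 0-100 scale
    human: int                 # 0-100 scale
    human_basis: Literal["published", "structural"]
    source: str

    def __post_init__(self):
        # Reject values that cannot be drawn on the 0-100 axis.
        for name in ("ai", "human"):
            value = getattr(self, name)
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be within 0-100, got {value}")
        if self.human_basis not in ("published", "structural"):
            raise ValueError(f"unknown human_basis: {self.human_basis!r}")


rows = [
    ChartRow("Multiple-choice PhD-level science", 93, 65,
             "published", "Rein et al. 2023"),
    ChartRow("Speed per research task", 98, 15,
             "structural", "arXiv:2601.21165"),
]

# List the rows whose human number is a normalization rather than a measurement.
structural = [r.capability for r in rows if r.human_basis == "structural"]
print(structural)
```

Tagging each row this way makes it straightforward to list every bar whose human score is a normalization, which is the first thing a skeptical reviewer tends to ask for.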

AI-leading capabilities

| Capability (benchmark or measure) | AI | Human | Source |
|---|---|---|---|
| Multiple-choice PhD-level science (GPQA Diamond) | 93 | 65 | GPT-5.2 Pro on GPQA. Expert baseline from Rein et al. 2023. |
| Competitive programming (Codeforces Elo) | 91 | 80 | OpenAI o3 reached 2727 Elo ≈ top 0.1% of ~600k human competitors. Codeforces blog. |
| Software engineering (SWE-bench Verified) | 94 | 75 | Claude Mythos at 93.9% on the verified split. SWE-bench Verified. |
| Olympiad math (IMO 2025; 35/42 = gold) | 83 | 78 | DeepMind & OpenAI both at gold cutoff. IMO 2025 coverage. |
| Speed per research task (inverted: time-to-attempt) | 98 | 15 | FrontierScience: PhDs took 3-5h per Research problem; AI returns attempts in seconds. arXiv:2601.21165. |
| Scale & parallelism (concurrent attempts) | 97 | 8 | AlphaFold predicted 200M+ structures. DeepMind. |
| Olympiad-style closed-form science (FrontierScience-Olympiad) | 77 | 72 | GPT-5.2 verbatim from the FrontierScience paper. Human reference is the medalist-author standard. |

Human-leading capabilities

| Capability (benchmark or measure) | AI | Human | Source |
|---|---|---|---|
| Open-ended research reasoning (FrontierScience-Research, ≥7/10 rubric) | 25 | 85 | GPT-5.2 quoted verbatim. Human reference: by construction, PhD authors must clear 7/10 themselves. |
| Frontier mathematics (FrontierMath Tier 4) | 6 | 90 | o4-mini best at 6.3%; most models 0%. Human mathematicians solve in days. Epoch AI. |
| Humanity's Last Exam | 45 | 90 | Gemini 3.1 Pro Preview at 44.7% (May 2026). Expert humans ≈90% in their domains. CAIS HLE. |
| Novel adaptive reasoning (ARC-AGI-3 interactive) | 1 | 100 | All frontier models <1% on the March 2026 benchmark. arcprize.org. |
| Hard diagnostic radiology (Radiology Last Exam, RadLE) | 30 | 83 | Board-certified radiologists 83% vs GPT-5 30%. RadLE preprint. |
| Niche subfield expertise | 22 | 90 | FrontierScience paper's top failure mode: "niche concept failure", where the model substitutes a related-but-wrong concept. |
| Novel hypothesis / ideation | 15 | 92 | Sakana AI Scientist v2 got one workshop paper through, but 42% of experiments failed. Evaluation paper. |
| Wet-lab / physical experiment | 0 | 100 | FrontierScience §4 limitation: text-only eval; cannot interact with reality. Holds for every benchmark above. |
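Where a source reports a raw score, the AI column appears to be that figure scaled to 0-100 and rounded to the nearest integer. The check below is a sketch under that assumption; the helper name to_bar is hypothetical, and the inputs are the figures quoted in the Source column.

```python
def to_bar(value, maximum=100.0):
    """Scale a raw score to the chart's 0-100 axis and round to an integer."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return round(100.0 * value / maximum)


# label -> (value recomputed from the cited figure, value shown in the table)
checks = {
    "IMO 2025 (35/42)": (to_bar(35, 42), 83),
    "SWE-bench Verified (93.9%)": (to_bar(93.9), 94),
    "Humanity's Last Exam (44.7%)": (to_bar(44.7), 45),
    "FrontierMath Tier 4 (6.3%)": (to_bar(6.3), 6),
}

for label, (computed, on_chart) in checks.items():
    status = "ok" if computed == on_chart else "MISMATCH"
    print(f"{label:<30} computed={computed:>3} chart={on_chart:>3} {status}")
```

Rows whose numbers are structural (speed, scale, wet-lab) have no raw score to recompute, so a check like this only covers the benchmark-backed bars.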

5 · The mind-blowing facts worth quoting

AI wins, in plain English

93% > 65%

GPT-5.2 Pro beats the PhD-expert baseline on GPQA Diamond. The questions were written to be "Google-proof," and the expert baseline was the original gold standard for "general scientific intelligence." AI now clears it comfortably.

200M proteins

AlphaFold predicted structures that would have taken hundreds of millions of years in the lab. 3M+ researchers in 190+ countries use the database. 2024 Nobel in Chemistry.

2727 Elo

OpenAI o3 on Codeforces — that's top 0.1% among ~600k competitive programmers. The same model family scored IMO gold (35/42) in 2025.

Humans still own this

<1% ↔ 100%

On ARC-AGI-3 (March 2026), every frontier model is below 1%. Humans get 100%. When the benchmark removes memorized patterns and demands novel reasoning, the gap is total.

6% on Tier 4

FrontierMath's research-level problems take human mathematicians days. The best AI solves 6.3%. Most frontier models score zero. Mathematics-as-research is not solved.

323×

Emergency intracranial hemorrhage detection: radiologists' diagnostic odds ratio is 323× higher than AI's, with 4 false positives vs AI's 293. AI is not yet the doctor.

The fact that should change every roadmap

~4.3 months

METR's task-completion time horizon tracks the longest task an AI can do reliably, measured by how long that task takes a human. That horizon is doubling every 4.3 months (post-2023 trend, updated Jan 2026). The slide's shock strip shows a similar climb on a different benchmark: Humanity's Last Exam went from 8% (Jan 2025) to 44.7% (May 2026). Whatever AI cannot do today is on a roughly four-month clock.
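To turn the doubling time into planning arithmetic, here is a small sketch that assumes the 4.3-month doubling simply continues as a steady exponential. That is an extrapolation, not something METR's figure guarantees, and the function names are illustrative.

```python
import math

DOUBLING_MONTHS = 4.3  # METR doubling time quoted above


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Multiplier on the time horizon after `months` of steady doubling."""
    return 2 ** (months / doubling_months)


def months_to_multiply(factor, doubling_months=DOUBLING_MONTHS):
    """Months needed for the horizon to grow by `factor` at the same rate."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return doubling_months * math.log2(factor)


print(f"Horizon multiplier after 12 months: {growth_factor(12):.1f}x")
print(f"Months until a 10x longer horizon:  {months_to_multiply(10):.1f}")
```

Under this assumption the horizon grows a little under sevenfold in a year, and a tenfold increase takes a bit more than fourteen months. Note that the Humanity's Last Exam percentages are a different kind of quantity (a bounded score, not a time horizon), so this formula should not be applied to them directly.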

6 · Methodology and honest caveats

How AI scores were chosen

How human scores were chosen — and what we're hand-waving

What this slide does NOT claim

7 · Why this is useful (and to whom)

For execs & leadership

  • One slide answers: where can we deploy AI today vs where do we still need humans?
  • Forces the right framing: not "AI vs humans" but "which tasks have crossed over."
  • The shock strip is a soundbite library — quote one of those numbers in any AI conversation for the next quarter.

For researchers & technical folks

  • Calibrates intuition. If you only read benchmarks in your field, you have a distorted view.
  • Identifies the still-hard problems — ARC-AGI-3, FrontierMath Tier 4, FrontierScience-Research, RadLE — as the places where moving the number actually means something.
  • Cited & reproducible: every number traces back to a public benchmark.

For product & strategy

  • Maps capabilities to product surface: AI is ready for "closed-form" workflows (knowledge retrieval, code completion, structured science); not yet ready for "open-ended" workflows (novel hypothesis, multi-step research, physical interaction).
  • The 4.3-month doubling is the planning horizon: anything you defer "until AI is good enough" likely arrives in 3 product cycles.

For skeptics & AI-doomers alike

  • Two truths held together: AI has crossed superhuman on multiple expert benchmarks; AI cannot yet do the open-ended work that defines research.
  • No hype, no doom — just where the bars currently sit and where they're moving.

8 · Going deeper


All AI scores are from public benchmark leaderboards or papers, current as of May 2026. Human-expert scores are either published expert baselines or order-of-magnitude normalizations on a 0–100 scale (clearly disclosed above). This page is an audit trail for the visualization; if you spot a number you can't reconcile against the cited source, that's a bug — not a feature.