Design notes · sources · methodology

About this slide:
Where AI stands vs Human Researchers

This page documents what is on the visualization, why each design choice was made, where the numbers came from, and what stakeholders should walk away believing. If you are about to present the slide and someone is going to push back on a bar, the answer is on this page.

1 · Why this slide exists

The OpenAI FrontierScience paper has one famous pair of numbers (GPT-5.2 scores 77% on the Olympiad set and 25% on the Research set) and a much more interesting story underneath them: AI is now superhuman on closed-form science exams and still subhuman on actual research work.

The original FrontierScience data story we built walks through that gap in nine sections. Useful for a deep dive. Useless for a stakeholder slide. The ask here was different:

“Build a single slide-sized view that an exec can absorb in 5 seconds and walk away knowing where AI is winning, where humans are winning, and how the nuance plays out.”

That meant zooming out from one paper to the whole AI-vs-human-researcher landscape in May 2026 — pulling in GPQA, HLE, FrontierMath, SWE-bench, IMO, Codeforces, ARC-AGI, RadLE, METR, AlphaFold — and reducing it to a single visual.

2 · What's actually on the slide

The visual layout, top to bottom

Header. Title + the single most important sentence: AI beats experts on exams, but cannot do real research yet, and the gap is closing exponentially.
Shock strip (4 mini-cards). Each one is a number designed to be quoted in the meeting: 93% on GPQA, 200M protein structures, 8% → 44.7% in 16 months on HLE, <1% on ARC-AGI-3. Two AI wins, one trajectory wake-up, one stark gap.
Diverging bar chart (the centerpiece, 15 capabilities). AI score extends left in teal, human-expert score extends right in amber. AI-leading capabilities are stacked above human-leading ones, so the visual silhouette is the story.
Right rail (4 nuance cards). The "but…" details: AI is top 0.1% on Codeforces and at IMO gold, yet scores only 30% on hard radiology against radiologists' 83%, and METR's time horizon is doubling roughly every 4.3 months.
Footer. Legend + sources, plus a link back here.

3 · Design decisions and why

Format: diverging horizontal bars (not a treemap)

The brief explicitly said no treemap. A diverging bar is the right substitute because the comparison is one-dimensional and oppositional: each row asks "who is better at X?" and that question has a left answer or a right answer.

Bonus: vertical ordering doubles as a sort. The viewer sees AI-leading rows clumped at the top, human-leading rows at the bottom — the silhouette gives you the answer before you read a single label.
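
The ordering idea above can be sketched in a few lines of Python. This is a minimal illustration, not the slide's actual code: the `Row` type, the `order_rows` helper, and the sort-by-gap rule are assumptions made for the example, and the scores are a subset copied from the data tables in section 4.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    capability: str
    ai: int      # AI score on the slide's 0-100 scale
    human: int   # human-expert score on the same scale

    @property
    def gap(self) -> int:
        # Positive means AI leads, negative means humans lead.
        return self.ai - self.human


def order_rows(rows: list[Row]) -> list[Row]:
    # Assumed rule: largest AI lead first, largest human lead last.
    return sorted(rows, key=lambda r: r.gap, reverse=True)


# A subset of the rows from the data tables in section 4.
ROWS = [
    Row("ARC-AGI-3", 1, 100),
    Row("GPQA Diamond", 93, 65),
    Row("FrontierScience-Research", 25, 85),
    Row("SWE-bench Verified", 94, 75),
    Row("FrontierScience-Olympiad", 77, 72),
    Row("RadLE", 30, 83),
]

if __name__ == "__main__":
    for row in order_rows(ROWS):
        side = "AI leads" if row.gap > 0 else "Human leads"
        print(f"{row.capability:<26} {row.gap:+4d}  {side}")
```

Sorting on the signed gap puts every AI-leading row above every human-leading row without a separate grouping step, which is one simple way to get the silhouette the chart relies on.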

Why a shock strip on top

A stakeholder slide has to hand the viewer their soundbite. The four-card strip exists so anyone presenting can quote a number without zooming into the chart. The numbers were chosen for surprise value, not balance: two AI wins, one trajectory, one collapse.

Color: teal vs amber, not red vs green

Red/green carries a value judgement (good/bad) that this comparison shouldn't have. Teal/amber are visually distinct, colour-blind safe, and value-neutral. They also match the existing FrontierScience story's palette (--accent, --accent2) so the two pages feel like one product.

Tooltips, not labels

Hovering over any bar reveals its AI score, human score, a one-line insight, and the source. Labels stay short on the chart itself, so the slide is readable as a static screenshot AND interactive when projected from a laptop.

The single-screen constraint

Body uses height:100vh;overflow:hidden and a CSS grid with four rows (auto auto 1fr auto): header, shock strip, chart area, footer. The chart and right rail share the flexible 1fr row, so the slide stays composed at any 16:9-ish viewport from 1280×720 up to a 4K monitor. We tested down to 720px tall via a @media (max-height:720px) shrink.

4 · The data, with sources

Every bar on the chart corresponds to a row below. AI numbers are quoted from the underlying benchmark's public leaderboard or paper. Human numbers are either published (when an expert baseline exists) or structural (when the comparison is an order-of-magnitude attribute such as speed, scale, or cost).

AI-leading capabilities

Capability (benchmark)	AI (0–100)	Human (0–100)	Source
Multiple-choice PhD-level science (GPQA Diamond)	93	65	GPT-5.2 Pro on GPQA. Expert baseline from Rein et al. 2023.
Competitive programming (Codeforces Elo)	91	80	OpenAI o3 reached 2727 Elo ≈ top 0.1% of ~600k human competitors. Codeforces blog.
Software engineering (SWE-bench Verified)	94	75	Claude Mythos at 93.9% on the verified split. SWE-bench Verified.
Olympiad math (IMO 2025; 35/42 = gold)	83	78	DeepMind & OpenAI both at gold cutoff. IMO 2025 coverage.
Speed per research task (inverted: time-to-attempt)	98	15	FrontierScience: PhDs took 3-5h per Research problem; AI returns attempts in seconds. arXiv:2601.21165.
Scale & parallelism (concurrent attempts)	97	8	AlphaFold predicted 200M+ structures. DeepMind.
Olympiad-style closed-form science (FrontierScience-Olympiad)	77	72	GPT-5.2 verbatim from the FrontierScience paper. Human reference is the medalist-author standard.

Human-leading capabilities

Capability (benchmark)	AI (0–100)	Human (0–100)	Source
Open-ended research reasoning (FrontierScience-Research, ≥7/10 rubric)	25	85	GPT-5.2 quoted verbatim. Human reference: by construction, PhD authors must clear 7/10 themselves.
Frontier mathematics (FrontierMath Tier 4)	6	90	o4-mini best at 6.3%; most models 0%. Human mathematicians solve in days. Epoch AI.
Humanity's Last Exam	45	90	Gemini 3.1 Pro Preview at 44.7% (May 2026). Expert humans ≈90% in their domains. CAIS HLE.
Novel adaptive reasoning (ARC-AGI-3 interactive)	1	100	All frontier models <1% on the March 2026 benchmark. arcprize.org.
Hard diagnostic radiology (Radiology Last Exam, RadLE)	30	83	Board-certified radiologists 83% vs GPT-5 30%. RadLE preprint.
Niche subfield expertise	22	90	FrontierScience paper's top failure mode: "niche concept failure" (the model substitutes a related-but-wrong concept).
Novel hypothesis / ideation	15	92	Sakana AI Scientist v2 got one workshop paper through, but 42% of experiments failed. Evaluation paper.
Wet-lab / physical experiment	0	100	FrontierScience §4 limitation: text-only eval; cannot interact with reality. Holds for every benchmark above.

5 · The mind-blowing facts worth quoting

AI wins, in plain English

93% > 65%

GPT-5.2 Pro beats the PhD-level domain experts' 65% on GPQA Diamond, a benchmark built to be "Google-proof." That expert baseline was the original gold standard for "general scientific intelligence." It has now been surpassed.

200M proteins

AlphaFold predicted structures that would have taken hundreds of millions of years in the lab. 3M+ researchers in 190+ countries use the database. 2024 Nobel in Chemistry.

2727 Elo

OpenAI o3 on Codeforces — that's top 0.1% among ~600k competitive programmers. The same model family scored IMO gold (35/42) in 2025.

Humans still own this

<1% ↔ 100%

On ARC-AGI-3 (March 2026), every frontier model is below 1%. Humans get 100%. When the benchmark removes memorized patterns and demands novel reasoning, the gap is total.

6% on Tier 4

FrontierMath's research-level problems take human mathematicians days. The best AI solves 6.3%. Most frontier models score zero. Mathematics-as-research is not solved.

323×

Emergency intracranial hemorrhage detection: radiologists' diagnostic odds ratio is 323× higher than AI's, with 4 false positives vs AI's 293. AI is not yet the doctor.

The fact that should change every roadmap

~4.3 months

METR's task-completion time horizon tracks the longest task an AI can do reliably, measured by how long it takes a human. That horizon is doubling every 4.3 months (post-2023, updated Jan 2026). On the slide's shock strip, Humanity's Last Exam went from 8% (Jan 2025) to 44.7% (May 2026). Whatever AI cannot do today is on a four-month clock.
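
The doubling claim is easy to turn into arithmetic. The sketch below is illustrative only: the helper names `growth_factor` and `implied_doubling_months` are made up for this example, and both functions assume clean exponential growth. The HLE calculation in particular is plain arithmetic on the two data points quoted above, applied to a bounded percentage score, so it should not be read as a METR or CAIS result.

```python
import math


def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplier after `months` for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


def implied_doubling_months(start: float, end: float, months: float) -> float:
    """Doubling time that takes `start` to `end` in `months`, assuming exponential growth."""
    return months / math.log2(end / start)


if __name__ == "__main__":
    # METR figure from the text: horizon doubles every 4.3 months.
    print(f"12-month multiplier: {growth_factor(12, 4.3):.1f}x")
    # HLE figures from the text: 8% to 44.7% over 16 months.
    print(f"HLE implied doubling: {implied_doubling_months(8.0, 44.7, 16):.1f} months")
```

Under these assumptions, a 4.3-month doubling compounds to roughly 7× over twelve months, while the HLE move from 8% to 44.7% corresponds to a doubling time of roughly 6.4 months. The two numbers measure different things, which is why the slide quotes them separately.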

6 · Methodology and honest caveats

How AI scores were chosen

Top frontier model for each benchmark as of May 2026 (GPT-5.2 Pro, Gemini 3.1 Pro Preview, Claude Mythos, OpenAI o3, etc.). We use the public leaderboard's headline number, not internal results.
No averaging across models: each bar shows the "ceiling," which is what a stakeholder cares about.
Verbatim where possible: GPQA 93%, HLE 44.7%, FrontierScience 77/25, IMO 35/42, SWE-bench 93.9%. No estimation from charts.
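
A quick way to audit the "verbatim" rule is to check that each integer on the chart is just the quoted figure rounded. The sketch below assumes round-to-nearest, which the page does not state explicitly; the `QUOTED` list and the `mismatches` helper are hypothetical names for this example, with values copied from the data tables in section 4.

```python
# (label, quoted value in percent, integer shown in the data table)
QUOTED = [
    ("SWE-bench Verified", 93.9, 94),
    ("Humanity's Last Exam", 44.7, 45),
    ("FrontierMath Tier 4", 6.3, 6),
    ("IMO 2025", 35 / 42 * 100, 83),  # 35 of 42 points
]


def mismatches(rows):
    """Return labels whose rounded quoted value differs from the charted integer."""
    return [label for label, quoted, charted in rows if round(quoted) != charted]


if __name__ == "__main__":
    bad = mismatches(QUOTED)
    print("all charted values match" if not bad else f"check these rows: {bad}")
```

A check like this only confirms the page is internally consistent; reconciling the quoted figures against the cited leaderboards is still a manual step.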

How human scores were chosen — and what we're hand-waving

Published expert baselines where they exist: GPQA expert ≈65%, HLE expert ≈90%, RadLE radiologists 83%.
By construction on FrontierScience-Research: the rubric is calibrated so a competent PhD scores ≥7/10. We encode that as ~85.
Structural attributes (speed, scale, cost, parallelism) are normalized 0–100 from order-of-magnitude estimates. A 3-hour human attempt vs a 10-second AI attempt is ~1000×; that becomes "AI 98 vs Human 15," not "AI 1000 vs Human 1." The bars are meant to be compared visually, not divided into each other.
The FrontierScience paper itself notes (§4) that no formal human baseline was run. Most public AI benchmarks share this gap. We're upfront about it: the human scores in this slide are defensible best-evidence estimates, not measured experiments.
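
The structural-attribute example above can be checked with a few lines of arithmetic. This sketch does not reproduce the slide's normalization, which is not specified on this page; it only compares the raw speed ratio from the text with the ratio of the plotted bars, to show why the bars should not be divided. All constant and function names are hypothetical.

```python
import math

HUMAN_SECONDS = 3 * 60 * 60   # 3-hour human attempt, from the text
AI_SECONDS = 10               # 10-second AI attempt, from the text
AI_BAR, HUMAN_BAR = 98, 15    # plotted values from the data table


def speed_ratio(human_s: float, ai_s: float) -> float:
    """How many times faster the AI attempt is than the human attempt."""
    return human_s / ai_s


raw = speed_ratio(HUMAN_SECONDS, AI_SECONDS)   # raw speed advantage
orders = math.log10(raw)                       # same advantage in orders of magnitude
bar_ratio = AI_BAR / HUMAN_BAR                 # what dividing the bars would suggest

if __name__ == "__main__":
    print(f"raw ratio: {raw:.0f}x ({orders:.1f} orders of magnitude)")
    print(f"bar ratio: {bar_ratio:.1f}x")
```

The raw advantage is about three orders of magnitude, while the plotted bars differ by a factor of less than seven, so the bar lengths carry direction and rough size, not the underlying multiplier.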

What this slide does NOT claim

It is not a forecast. The 4.3-month doubling is a measured past trend, not a promised future.
It is not a productivity claim. A 94% on SWE-bench Verified ≠ "AI replaces engineers." Real codebases are much harder; on SWE-Bench Pro, top models score only ~23%.
It is not a benchmark-by-benchmark teardown. Each row collapses thousands of hours of evaluation work into one number. The companion citations are the audit trail.
"AI wins on speed" is not the same as "AI wins on quality." The same model that returns an answer in 10 seconds is often the model that gets a niche concept wrong.

7 · Why this is useful (and to whom)

For execs & leadership

One slide answers: where can we deploy AI today vs where do we still need humans?
Forces the right framing: not "AI vs humans" but "which tasks have crossed over."
The shock strip is a soundbite library — quote one of those numbers in any AI conversation for the next quarter.

For researchers & technical folks

Calibrates intuition. If you only read benchmarks in your field, you have a distorted view.
Identifies the still-hard problems — ARC-AGI-3, FrontierMath Tier 4, FrontierScience-Research, RadLE — as the places where moving the number actually means something.
Cited & reproducible: every number traces back to a public benchmark.

For product & strategy

Maps capabilities to product surface: AI is ready for "closed-form" workflows (knowledge retrieval, code completion, structured science); not yet ready for "open-ended" workflows (novel hypothesis, multi-step research, physical interaction).
The 4.3-month doubling sets the planning horizon: if the measured trend holds, anything you defer "until AI is good enough" could arrive within about 3 product cycles.

For skeptics & AI-doomers alike

Two truths held together: AI has crossed superhuman on multiple expert benchmarks; AI cannot yet do the open-ended work that defines research.
No hype, no doom — just where the bars currently sit and where they're moving.

8 · Going deeper

FrontierScience paper (arXiv) — the source paper this slide is grounded in.
Two faces of FrontierScience — data story — the original 9-section deep dive.
METR Time Horizons — the 4.3-month doubling argument.
Humanity's Last Exam — the hardest known AI test.
FrontierMath — the math research benchmark.
ARC Prize — the novel-reasoning benchmark.
AlphaFold: Five Years of Impact — the AI-for-science exemplar.

All AI scores are from public benchmark leaderboards or papers, current as of May 2026. Human-expert scores are either published expert baselines or order-of-magnitude normalizations on a 0–100 scale (clearly disclosed above). This page is an audit trail for the visualization; if you spot a number you can't reconcile against the cited source, that's a bug — not a feature.