A data story · OpenAI's FrontierScience eval

Two evals.
One model.
A 52-point gap.

GPT-5.2 scores 77% on FrontierScience-Olympiad — and 25% on FrontierScience-Research. The 160-problem dataset shows you exactly why, and the answer isn't "the model got worse." It's that one of the graders is finally looking at the work.

GPT-5.2, the top model

77% · Olympiad
Single-expression answer check ("U = CT"). Pass / fail.

25% · Research
Earned at least 7 of 10 rubric points on 25% of problems. Show your work.

Same model, same prompt style, no tools. The score collapses because Research grading checks every step.

Why this benchmark exists

Science benchmarks keep dying. When PhDs released GPQA in November 2023, GPT-4 scored 39%. Two years later, GPT-5.2 scores 92%. So OpenAI built something harder — from scratch, with experts.

The benchmark-saturation problem

GPT-4 · Nov 2023 · GPQA: 39%
GPT-5.2 · 2025 · GPQA (saturated): 92%
GPT-5.2 · 2025 · FS-Olympiad: 77%
GPT-5.2 · 2025 · FS-Research: 25%

Section 1

What "25%" actually means.

FrontierScience-Research has 60 problems. Each one is worth 10 rubric points. To pass, the model must earn at least 7 of 10 points — the paper's explicit "suitable solution" threshold. 25% means GPT-5.2 cleared that bar on 15 of the 60 problems.

15 / 60

Problems passed @ ≥7 pts

[Figure: a grid of 60 circles, one per Research problem. Filled circles mark problems passed under the threshold rule.]

Results averaged across 30 independent trials per problem in the paper; the 25% figure rounds to 15 / 60.
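A minimal sketch of how a headline number like this could be computed from per-trial rubric totals. The aggregation order (a pass/fail verdict per trial, averaged per problem, then averaged over problems) is an assumption, and `problem_pass_rate`, `benchmark_score`, and the example scores are hypothetical names and data, not the paper's code.

```python
from statistics import mean

PASS_THRESHOLD = 7.0  # points out of 10, the "suitable solution" bar


def problem_pass_rate(trial_scores, threshold=PASS_THRESHOLD):
    """Fraction of trials on one problem whose rubric total meets the threshold."""
    return mean(1.0 if s >= threshold else 0.0 for s in trial_scores)


def benchmark_score(scores_by_problem, threshold=PASS_THRESHOLD):
    """Average the per-problem pass rates over all problems."""
    return mean(problem_pass_rate(s, threshold) for s in scores_by_problem.values())


# Hypothetical data: 3 trials on each of 4 problems (the paper uses 30 trials on 60).
example = {
    "p1": [8.0, 7.0, 9.5],    # passes 3 of 3 trials
    "p2": [6.5, 7.0, 4.0],    # passes 1 of 3
    "p3": [2.0, 3.0, 6.875],  # passes 0 of 3
    "p4": [7.5, 6.0, 6.0],    # passes 1 of 3
}
score = benchmark_score(example)
print(f"{score:.1%}")
```

Because the score is an average over trials, "15 / 60" is best read as an equivalent count rather than a fixed list of 15 solved problems.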

Section 2

The two faces of FrontierScience.

A real Olympiad problem next to a real Research problem, pulled straight from the released dataset. The problems look the same length on the page. The answers are from different planets.

[Example panel: Olympiad physics problem 0, with its answer given as one expression]

How it's graded: a judge model checks expression equivalence with the reference answer. One pass/fail verdict per problem, averaged across 20 trials.

[Example panel: Research physics problem 0, with its answer given as a rubric worth 10 pts total]

How it's graded: a judge model awards points per rubric item, then sums. ≥7 / 10 = pass. Averaged across 30 trials.
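The sum-then-threshold step can be sketched as follows. This is only an illustration of the arithmetic: the judge model that decides each `awarded` value is out of scope, and `RubricItem`, `grade`, and the example items are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    description: str
    points: float   # maximum points available for this item
    awarded: float  # points the judge gave, between 0 and `points`


def grade(items, threshold=7.0):
    """Sum the awarded points and apply the pass threshold."""
    total = sum(item.awarded for item in items)
    return total, total >= threshold


# Hypothetical rubric whose maximum points sum to 10.
items = [
    RubricItem("identifies the governing equation", 3.0, 3.0),
    RubricItem("applies the correct limit", 2.0, 2.0),
    RubricItem("derives the intermediate result", 2.0, 1.0),
    RubricItem("states the final expression", 1.5, 1.5),
    RubricItem("checks units", 1.0, 0.0),
    RubricItem("discusses validity range", 0.5, 0.5),
]
total, passed = grade(items)
print(total, passed)  # 8.0 True
```

A response can therefore miss whole rubric items and still pass, as long as the remaining items add up to at least 7.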

Section 3

The expert pipeline behind 160 problems.

Hard problems aren't found, they're written. OpenAI commissioned 87 experts — 42 international olympiad medalists and 45 PhD scientists — to write and review problems for months. The released gold set is the survivors.

Olympiad authors

42 former medalists and national-team coaches with 108 medals between them (45 gold · 37 silver · 26 bronze)

Research authors

45 PhD scientists: doctoral candidates, postdocs, and professors at globally recognized institutions.

Per Research problem

3–5h

time a PhD scientist spends drafting one problem. 2+ independent reviewers before it ships.

Held back

~480

problems written but not open-sourced — held out to detect contamination of the public 160.

From submitted → gold set

Olympiad: 500+ submitted → 100 in gold set
Research: 200+ submitted → 60 in gold set

Olympiad problems that internal models already solved correctly were discarded as too easy. The benchmark is therefore adversarial by design against OpenAI's own models, so the 77% figure was achieved despite that selection pressure.
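The discard rule amounts to a simple filter. The sketch below assumes a per-problem boolean "solved by an internal model" flag; the function name, the problem IDs, and the outcomes are all hypothetical.

```python
def filter_too_easy(candidates, was_solved):
    """Keep only the candidate problems the reference model did not solve."""
    return [p for p in candidates if not was_solved(p)]


# Hypothetical outcomes of running an internal model on three candidates.
internal_results = {"oly-001": True, "oly-002": False, "oly-003": False}
kept = filter_too_easy(list(internal_results), lambda pid: internal_results[pid])
print(kept)  # ['oly-002', 'oly-003']
```

Any set built this way starts out at zero accuracy for the model used as the filter, which is why scores from later models on the same set are informative.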

Section 4

What's in the box, by subject.

The two splits aren't just graded differently — they're composed differently. Olympiad leans heavily on physics and chemistry because those subjects produce questions with verifiable single-expression answers. Research is evenly split because the rubric format lets biology — messy and open-ended — show up on equal footing.

[Chart: subject mix of each split. Legend: Physics, Chemistry, Biology]

How that translates to scores: on Olympiad, models perform best on chemistry, then physics, then biology. On Research, the order shifts to chemistry > biology > physics — biology rubrics give partial credit where olympiad-style biology questions don't.

Section 5

Answer length is the visual tell.

If you plot how long each split's "answer" field is, you see two different universes, even on a log axis. The median Olympiad answer is 57 characters, and some are as short as "U = CT". The median Research answer is 2,248 characters: a fully structured rubric.

[Charts: Olympiad answer length (n=100) · Research answer length (n=60)]

x-axis: character count on a log scale. Vertical dashed lines mark the medians.
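A small sketch of the length comparison, assuming each split is available as a list of non-empty answer strings. `length_summary` and the toy answers are hypothetical; only the two medians (57 and 2,248) come from the text above.

```python
import math
from statistics import median


def length_summary(answers):
    """Median answer length in characters, plus its position on a log10 axis."""
    lengths = [len(a) for a in answers]
    med = median(lengths)
    return {"n": len(lengths), "median_chars": med, "log10_median": math.log10(med)}


# Toy Olympiad-style answers, for illustration only.
toy = length_summary(["U = CT", "x = 2", "E = mc^2"])
print(toy["median_chars"])  # 6

# Distance between the two reported medians on a log10 axis.
gap_in_decades = math.log10(2248 / 57)
print(round(gap_in_decades, 2))  # 1.6
```

The two medians sit about 1.6 decades apart, which is why a linear axis would squash the Olympiad distribution into a single bar.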

Section 6

Anatomy of every Research rubric.

Each Research problem is worth exactly 10 points, split across 7 to 18 sub-criteria. Every bar below is one of the 60 problems; every segment is one rubric item with width = its point value. The most common item size is 1.0 point; the finest granularity is 0.125 points.

[Chart: one stacked bar per Research problem, colored by subject (Physics, Chemistry, Biology); each segment is one rubric item.]

Items / problem

7 – 18

Median 10. Some chemistry problems pack 18 small partial-credit items.

Total per problem

10.0

Exact, every problem. The threshold for "success" is ≥7.

Most common item size

1.0 pt

401 of 635 rubric items in the released set. Smallest item: 0.125 pts.
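The summary statistics in this section could be reproduced along these lines. The sketch assumes each rubric has already been parsed into a list of point values per problem (the parsing of the released answer field is not shown); `rubric_stats` and the two toy rubrics are hypothetical.

```python
from collections import Counter


def rubric_stats(rubrics, expected_total=10.0):
    """Summarize item counts and item sizes, and flag totals that are off."""
    sizes = Counter(p for items in rubrics.values() for p in items)
    counts = [len(items) for items in rubrics.values()]
    bad = [pid for pid, items in rubrics.items()
           if abs(sum(items) - expected_total) > 1e-9]
    return {
        "items_min": min(counts),
        "items_max": max(counts),
        "most_common_size": sizes.most_common(1)[0][0],
        "smallest_item": min(sizes),
        "bad_totals": bad,
    }


# Two toy rubrics, each summing to exactly 10 points.
toy_rubrics = {
    "a": [1.0] * 7 + [3.0],
    "b": [1.0] * 6 + [2.0, 1.0, 0.5, 0.25, 0.125, 0.125],
}
stats = rubric_stats(toy_rubrics)
print(stats)
```

On the toy data this reports 8 to 12 items per problem, a most common item size of 1.0, a smallest item of 0.125, and no problem with an off total.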

Section 7

More compute helps. It doesn't close the gap.

The paper sweeps GPT-5.2's reasoning effort and reports two endpoints. More test-time compute reliably lifts both scores — but Research stays under 30%.

Olympiad — GPT-5.2

67.5% → 77.1% (+9.6 pts)

Lower effort → higher effort. Already strong at low compute.

Research — GPT-5.2

18% → 25% (+7 pts)

Same model, same compute step, and a smaller gain in absolute terms (+7 vs. +9.6 pts).
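The arithmetic behind the two cards can be checked directly from the four endpoint values quoted above. The helper `gain` is hypothetical, and the relative-gain view is an addition for context, not a figure from the paper.

```python
def gain(low, high):
    """Absolute gain in points and relative gain in percent, rounded to 0.1."""
    absolute = round(high - low, 1)
    relative = round(100 * (high - low) / low, 1)
    return absolute, relative


olympiad = gain(67.5, 77.1)
research = gain(18.0, 25.0)
ratio = round(research[0] / olympiad[0], 2)

print(olympiad)  # (9.6, 14.2)
print(research)  # (7.0, 38.9)
print(ratio)     # 0.73
```

In absolute points Research gains about three-quarters as much as Olympiad. Relative to its much lower starting point it gains more, yet it still ends at 25%.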

Confirmed numbers from the paper

Only values the paper quotes verbatim. Figure 6 in the paper shows the full leaderboard across 9 models (GPT-4o, o4-mini, o3, Claude Opus 4.5, Grok 4, Gemini 3 Pro, GPT-5, GPT-5.1, GPT-5.2). None of the other eight beats GPT-5.2, but specific percentages aren't given in prose, so we don't quote estimated bar heights.

The most surprising line in the paper

"Surprisingly, GPT-5 outperforms GPT-5.1 on the Research set and ties GPT-5.2."

In other words: a newer GPT release isn't strictly better on open-ended scientific reasoning. The Research split exposes per-version quirks that get averaged out by single-answer benchmarks.

Caveat: GPT-5.2 was evaluated at "xhigh" reasoning effort; the other reasoning models were evaluated at "high". GPT-5.2's edge therefore partially reflects more test-time compute.

Section 8

Where models actually break.

The paper bins wrong answers into four categories. Only one is a math/code bug — the others are about reasoning, knowledge, and accuracy. Tool use wouldn't obviously fix any of them.

Reasoning · logic error

The model gets the setup right but chains the steps incorrectly — wrong limit, wrong regime, dropped term.

Niche concept failure

The problem hinges on a specific sub-field idea (e.g., weak-value amplification, Marcus theory, enzymatic kinetics). The model substitutes a closely related but wrong concept.

Calculation error

Algebra mistakes, sign errors, misapplied constants. This bucket is the most responsive to scaling reasoning effort.

Factual inaccuracy

Hallucinated reaction products, wrong enzymes, fabricated references. Most common on biology, where the answer space is wide and external lookup is unavailable.

Section 9

Browse the dataset yourself.

Pick a split and a subject; shuffle through real problems. The Olympiad answers fit on one line. The Research answers don't fit on one screen.


Source: Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang & Tejal Patwardhan (OpenAI), FrontierScience: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks. arXiv:2601.21165 · openai.com/index/frontierscience · dataset (Apache 2.0).
All dataset charts on this page are computed live from the released 160-problem gold set. All model/effort numbers are quoted verbatim from the paper text; figure-only numbers are omitted to avoid estimation error.