GPT-5.2 scores 77% on FrontierScience-Olympiad — and 25% on FrontierScience-Research. The 160-problem dataset shows you exactly why, and the answer isn't "the model got worse." It's that one of the graders is finally looking at the work.
Olympiad grading: one short final answer, such as "U = CT". Pass / fail.
Same model, same prompt style, no tools. The score collapses because Research grading checks every step.
Science benchmarks keep dying. When PhDs released GPQA in November 2023, GPT-4 scored 39%. Two years later, GPT-5.2 scores 92%. So OpenAI built something harder — from scratch, with experts.
FrontierScience-Research has 60 problems. Each one is worth 10 rubric points. To pass, the model must earn at least 7 of 10 points — the paper's explicit "suitable solution" threshold. 25% means GPT-5.2 cleared that bar on 15 of the 60 problems.
Figure: Problems passed at ≥7 points. Every ● is one Research problem (60 total); filled circles mark problems a model passed under the threshold rule.
Results averaged across 30 independent trials per problem in the paper; the 25% figure rounds to 15 / 60.
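The threshold rule can be sketched in a few lines. This is a minimal illustration, not the paper's grading code: the function names and the toy scores are made up, and averaging the per-problem pass fraction over trials is an assumption about how the 30 trials are aggregated.

```python
from statistics import mean

PASS_THRESHOLD = 7.0  # rubric points out of 10: the "suitable solution" bar


def problem_pass_fraction(trial_points: list[float]) -> float:
    """Fraction of one problem's trials that reach the threshold."""
    return mean([1.0 if p >= PASS_THRESHOLD else 0.0 for p in trial_points])


def research_pass_rate(points_by_problem: dict[str, list[float]]) -> float:
    """Mean per-problem pass fraction (assumed aggregation, see lead-in)."""
    return mean([problem_pass_fraction(t) for t in points_by_problem.values()])


# Made-up scores for illustration only: 4 problems, 2 trials each.
toy_scores = {
    "p1": [8.0, 6.5],   # one of two trials passes
    "p2": [7.0, 7.5],   # both pass (7.0 is exactly at the bar)
    "p3": [3.0, 4.0],   # neither passes
    "p4": [6.75, 9.0],  # one of two trials passes
}
toy_rate = research_pass_rate(toy_scores)

# The headline arithmetic: 25% of the 60 Research problems.
problems_cleared = round(0.25 * 60)
```

Note that a score of 6.75 fails even though it is close: the rule is a hard cutoff, which is part of why the Research number is so much lower than the Olympiad one.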
A real Olympiad problem next to a real Research problem, pulled straight from the released dataset. The problems look the same length on the page. The answers are from different planets.
Hard problems aren't found, they're written. OpenAI commissioned 87 experts — 42 international olympiad medalists and 45 PhD scientists — to write and review problems for months. The released gold set is the survivors.
42 former medalists / national-team coaches with 108 medals (45 G · 37 S · 26 B)
45 PhD scientists: doctoral candidates, postdocs, and professors at globally recognized institutions.
time a PhD scientist spends drafting one problem. 2+ independent reviewers before it ships.
problems written but not open-sourced — held out to detect contamination of the public 160.
Olympiad problems that internal models already solved correctly were discarded as too easy. The benchmark is therefore adversarial by design against OpenAI's own models, so the 77% figure was reached despite that selection pressure.
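A rough sketch of that filtering step, under loud assumptions: the problem shape, the solver interface, and the string-matching check are all hypothetical stand-ins, since the article does not describe how answers were actually compared.

```python
from typing import Callable

Problem = dict                 # hypothetical shape: {"id", "question", "answer"}
Solver = Callable[[str], str]  # hypothetical: question text -> final answer


def answers_match(predicted: str, reference: str) -> bool:
    # Placeholder check: ignore whitespace and case. A real grader would
    # need proper equivalence checking for expressions.
    def normalize(s: str) -> str:
        return "".join(s.split()).lower()
    return normalize(predicted) == normalize(reference)


def discard_solved(problems: list[Problem], solvers: list[Solver]) -> list[Problem]:
    """Keep only problems that none of the given solvers answers correctly."""
    kept = []
    for problem in problems:
        solved = any(
            answers_match(solve(problem["question"]), problem["answer"])
            for solve in solvers
        )
        if not solved:
            kept.append(problem)
    return kept
```

Filtering this way biases the surviving set against the models used as solvers, which is the selection pressure described above.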
The two splits aren't just graded differently — they're composed differently. Olympiad leans heavily on physics and chemistry because those subjects produce questions with verifiable single-expression answers. Research is evenly split because the rubric format lets biology — messy and open-ended — show up on equal footing.
How that translates to scores: on Olympiad, models perform best on chemistry, then physics, then biology. On Research, the order shifts to chemistry > biology > physics — biology rubrics give partial credit where olympiad-style biology questions don't.
If you plot the length of each split's "answer" field, you see two different
universes, even on a log axis. The median Olympiad answer is 57 characters
(a short expression in the vein of "U = CT"). The median Research answer is
2,248 characters: a fully structured rubric.
x-axis: character count on a log scale. Vertical dashed lines mark the medians.
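To make the log-axis comparison concrete, here is a small sketch that computes a median answer length and the gap between the two quoted medians in orders of magnitude. The helper names are illustrative, and the toy answers are not taken from the dataset.

```python
import math
from statistics import median


def median_answer_length(answers: list[str]) -> float:
    """Median character count of a list of answer strings."""
    return median(len(a) for a in answers)


def log10_gap(short_median: float, long_median: float) -> float:
    """Distance between two medians on a log10 axis."""
    return math.log10(long_median / short_median)


# Medians quoted in the text: 57 characters vs 2,248 characters.
ratio = 2248 / 57          # roughly 39x longer
gap = log10_gap(57, 2248)  # roughly 1.6 orders of magnitude

# Toy usage with made-up answers (lengths 6, 3, 10).
toy_median = median_answer_length(["U = CT", "abc", "x" * 10])
```

On a linear axis the Olympiad answers would be squeezed into the first few pixels, which is why the chart uses a log scale.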
Each Research problem is worth exactly 10 points, split across 7 to 18 sub-criteria. Every bar below is one of the 60 problems; every segment is one rubric item with width = its point value. The most common item size is 1.0 point; the finest granularity is 0.125 points.
Median 10. Some chemistry problems pack 18 small partial-credit items.
Exact, every problem. The threshold for "success" is ≥7.
401 of 635 rubric items in the released set. Smallest item: 0.125 pts.
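The rubric constraints described above (exactly 10 points, 7 to 18 items, smallest item 0.125) lend themselves to a quick sanity check. The sketch below is illustrative only: the function name is made up, and treating every item as a multiple of 0.125 is an assumption that goes slightly beyond what the text states.

```python
from fractions import Fraction

TOTAL_POINTS = Fraction(10)
MIN_ITEMS, MAX_ITEMS = 7, 18
STEP = Fraction(1, 8)  # 0.125 points; assumes items are multiples of it


def check_rubric(item_points: list[float]) -> list[str]:
    """Return a list of human-readable issues; empty means the rubric passes."""
    issues = []
    # Fractions avoid float rounding when summing many small items.
    points = [Fraction(p) for p in item_points]
    if not MIN_ITEMS <= len(points) <= MAX_ITEMS:
        issues.append(f"item count {len(points)} outside {MIN_ITEMS}-{MAX_ITEMS}")
    if sum(points) != TOTAL_POINTS:
        issues.append(f"points sum to {float(sum(points))}, expected 10")
    if any(p <= 0 or p % STEP != 0 for p in points):
        issues.append("item is not a positive multiple of 0.125")
    return issues


# Made-up rubric: 8 items of 1.0, 3 of 0.5, 4 of 0.125 (15 items, 10 points).
example_rubric = [1.0] * 8 + [0.5] * 3 + [0.125] * 4
example_issues = check_rubric(example_rubric)
```

A check like this is cheap to run over all 60 problems before trusting any chart built from the rubric data.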
The paper sweeps GPT-5.2's reasoning effort and reports two endpoints. More test-time compute reliably lifts both scores — but Research stays under 30%.
lower effort → higher effort. Already strong at low compute.
Same model, same compute step, less than a third the gain in absolute terms.
Only values the paper quotes verbatim. Figure 6 in the paper shows the full leaderboard across 9 models (GPT-4o, o4-mini, o3, Claude Opus 4.5, Grok 4, Gemini 3 Pro, GPT-5, GPT-5.1, GPT-5.2) — all trail GPT-5.2 — but specific percentages aren't given in prose, so we don't quote estimated bar heights.
"Surprisingly, GPT-5 outperforms GPT-5.1 on the Research set and ties GPT-5.2."
In other words: a newer GPT release isn't strictly better on open-ended scientific reasoning. The Research split exposes per-version quirks that get averaged out by single-answer benchmarks.
Caveat: GPT-5.2 was evaluated at "xhigh" reasoning effort; the other reasoning models were evaluated at "high". GPT-5.2's edge therefore partially reflects more test-time compute.
The paper bins wrong answers into four categories. Only one is a math/code bug — the others are about reasoning, knowledge, and accuracy. Tool use wouldn't obviously fix any of them.
The model gets the setup right but chains the steps incorrectly — wrong limit, wrong regime, dropped term.
The problem hinges on a specific sub-field idea (e.g., weak-value amplification, Marcus theory, enzymatic kinetics). The model substitutes a closely related but wrong concept.
Algebra mistakes, sign errors, misapplied constants. This bucket is the most responsive to scaling reasoning effort.
Hallucinated reaction products, wrong enzymes, fabricated references. Most common on biology, where the answer space is wide and external lookup is unavailable.
Sample problems by split and subject: the Olympiad answers fit on one line. The Research answers don't fit on one screen.
Source: Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang & Tejal
Patwardhan (OpenAI), FrontierScience: Evaluating AI's Ability to Perform Expert-Level
Scientific Tasks. arXiv:2601.21165 ·
openai.com/index/frontierscience ·
dataset (Apache 2.0).
All dataset charts on this page are computed live from the released 160-problem gold set.
All model/effort numbers are quoted verbatim from the paper text; figure-only numbers are
omitted to avoid estimation error.