
The Score Went Up. The Model Didn't.


I'm going to say something that sounds dramatic but is, in a precise technical sense, true: most LLM benchmark scores you've ever seen are not interpretable as a property of the model.

Not wrong. Not useless. But not what people think they are. They're conditional statistics — performance under a specific prompt, decoding scheme, tool configuration, and evaluation harness — reported as if they were intrinsic traits, like height. Missing administration metadata makes construct validity un-auditable; uncontrolled protocol factors add construct-irrelevant variance. The result is numbers that look precise but whose meaning can't be verified.

Here's a concrete example. The LM Council's February 2026 benchmarks page reports GPQA Diamond scores for the top three frontier models: Gemini 3 Pro at 92.6% ±1.7, GPT-5.2 at 91.4% ±1.8, Claude Opus 4.6 at 90.5% ±1.7. Those confidence intervals overlap completely — given the reported uncertainty, you can't confidently order these models without a paired design or more trials. But it gets worse: Claude Opus 4.6 has separate entries for "high, 32k thinking" (90.5%) and "high, 64k thinking" (88.8%). Same model weights, same knowledge, different thinking budget. That 1.7-point gap is not a difference in what the model knows — it's a difference in how much inference-time compute it was allowed to burn.

Or consider METR Time Horizons, which measures the length of human tasks (in minutes) that an AI model can complete at 50% reliability: Claude Opus 4.5 leads at 288.9 minutes ±558.2. The uncertainty range is nearly twice the point estimate. The emperor is wearing error bars, and the error bars are bigger than the emperor.

These aren't cherry-picked embarrassments. They're the norm. When a benchmark reports "Model X scores 90%," what it usually means is: "Model X, with this particular reasoning budget, this tool configuration, this agent scaffold, on this particular run, scored 90% — and under different conditions the score might be anywhere from 83% to 95%."

Benchmark scores are conditional expectations masquerading as intrinsic traits. And the gap between what they claim to measure (intrinsic model capability, which I'll call Θ) and what they actually measure (a context-dependent system property) is the central measurement crisis in AI evaluation.

I've spent the last several months applying Item Response Theory — the mathematical framework that underpins every standardised test from the SAT to clinical psychology assessments — to this problem. What I found is that the field is ignoring approximately 80 years of measurement science that addresses exactly the validity threats present in current practice.

The first step is admitting we're conflating three different statistical objects.


Three Things People Call "A Score" (They're Not the Same Thing)

A recent NIST report (AI 800-3) makes a distinction that should be tattooed on every leaderboard: there's benchmark accuracy, generalised accuracy, and system performance. They are different objects, and confusing them is exactly how we end up overinterpreting leaderboards.

Benchmark accuracy is performance on a fixed item set under a fixed protocol. It's what you actually measured. "Claude Opus 4.6 scored 90.5% on GPQA Diamond under these specific conditions."

Generalised accuracy is expected performance on the superpopulation of items the benchmark samples from. It's what you usually want to know. "How well does this model handle PhD-level science questions in general?" Getting from benchmark accuracy to generalised accuracy requires a statistical model — at minimum, confidence intervals that account for item sampling; ideally, IRT-based ability estimation.

System performance is benchmark accuracy after adding tools, agent harnesses, retry loops, and inference-time compute. It's what most modern evaluations actually report, without admitting the shift. "Claude Opus 4.6 with 32k thinking tokens, Stirrup agent harness, and 5 retries scored 90.5% on GPQA Diamond."

Most published benchmark results are system performance reported as if they were generalised accuracy. The NIST report argues for statistical modelling (including GLMMs) to estimate uncertainty and decompose variance — the same direction this piece is headed. When a national metrology institute says this, it stops being my personal crusade and becomes institutionalised measurement doctrine.

Here's what's going on.


The Dirty Secret: Three Sources of Noise Nobody Reports

Before we get to what benchmarks measure, we need to talk about something even more fundamental: LLM benchmark scores have at least three distinct sources of variance, and most evaluations don't report any of them.

In psychometrics, this decomposition is standard practice — it's called generalizability theory (Brennan, 2001), and it treats each source of error as a separate "facet" that contributes to total measurement uncertainty. For LLMs, the three critical facets are:

1. Item sampling error — the variance that comes from having a finite number of test questions. A 198-item test (GPQA Diamond's size) samples from a much larger universe of possible questions. A different sample would give a different score, even if the model's underlying ability were perfectly stable. This is the classical reliability problem, and it's the easiest to quantify: confidence intervals from binomial proportions.

2. Within-item stochasticity — the variance that comes from the model itself being non-deterministic. Run the same benchmark item 100 times with different random seeds, and you'll get different answers. Each item doesn't have a binary "correct/incorrect" outcome for a given model — it has a response probability. A model might get a particular GPQA question right 73% of the time. On any given run, that item is a Bernoulli trial, not a known outcome.

This gets brutal in agentic settings, where variance compounds across multiple tool calls, environment interactions, and retry policies. Recent work that reran GAIA and FRAMES tasks 64 times per question found enormous per-question variance and ICC (intraclass correlation) values as low as 0.30 for complex agentic tasks — meaning less than a third of the observed variance reflected genuine task difficulty rather than run-to-run noise.

3. Protocol and harness variance — the variance that comes from evaluation design choices that aren't part of the model but aren't reported as conditions. This includes: prompt format, few-shot examples, answer-option ordering, instruction placement, decoding parameters (temperature, top_p), tool availability, retry policies, scoring heuristics, and the agent framework itself.

The "Evaluation is All You Need" paper (Sun et al., 2025) demonstrated this with controlled experiments: on GPQA Diamond, simply changing the order of answer options and the position of the correct answer shifted scores by over 5 percentage points. Not different models. Not different prompts. Just A-B-C-D ordering versus randomised ordering. Five points — enough to reshuffle a leaderboard — from a design choice most evaluators don't even report.

Each source requires a different fix: item sampling error needs more items or IRT-based ability estimation; within-item stochasticity needs multiple trials per item (pass@k); protocol variance needs standardised administration or explicit reporting of conditions. Conflating them produces uncertainty estimates that are either too narrow (if you only account for item sampling) or too vague (if you wave your hands at "noise" without decomposing it).

Here's what this looks like concretely. Imagine a 198-item test where a model has 150 items it gets right >95% of the time ("knows cold"), 30 items in a near-50/50 zone, and 18 items it gets right <10% of the time ("doesn't know").

The expected score is about 83.3%. But those 30 coin-flip items alone produce roughly 1.4 percentage points of standard deviation in the overall score — about ±2.7 points for a 95% confidence interval — and that's just from within-item stochasticity on the uncertain items, before you add protocol variance or item sampling error. Layer on the 5+ point swings from evaluation design choices documented by Sun et al., and you're looking at total uncertainty windows that dwarf the gaps between "ranked" models on most leaderboards.
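The arithmetic behind those numbers is short enough to show in full. A minimal sketch, treating the "knows cold" items as certain hits and the "doesn't know" items as certain misses:

```python
# Hypothetical 198-item test from the text: 150 items the model always gets,
# 30 genuine coin-flips, 18 items it never gets (an idealised approximation).
groups = [(1.0, 150), (0.5, 30), (0.0, 18)]   # (per-item success probability, item count)
n_items = sum(n for _, n in groups)

expected_correct = sum(p * n for p, n in groups)
# Items are independent Bernoulli trials, so their variances add.
variance_correct = sum(n * p * (1 - p) for p, n in groups)

expected_score = expected_correct / n_items            # ~83.3%
sd_points = 100 * variance_correct ** 0.5 / n_items    # ~1.4 percentage points
print(f"{expected_score:.1%} ± {1.96 * sd_points:.1f} points (95% CI, stochasticity only)")
```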

What the field needs:

At minimum, run every benchmark item multiple times (pass@k where k ≥ 5, ideally k ≥ 20 for contested items) and report the mean with confidence intervals that decompose item sampling error from within-item stochasticity. Better yet, fit a proper IRT model that estimates item-level response probabilities and derives ability estimates with standard errors. Best of all, use adaptive testing that targets items near the model's ability boundary — where response probabilities are around 50% and the measurement information is highest — rather than wasting evaluation budget on items the model either always gets right or always gets wrong.
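Here's a minimal sketch of the first recommendation: run each item k times, estimate per-item response probabilities, and report a score whose uncertainty separates item sampling from within-item stochasticity. It's a moment-based decomposition, not a full generalizability-theory or IRT fit, and the simulated outcomes are purely illustrative.

```python
import numpy as np

def decompose_benchmark_variance(outcomes: np.ndarray):
    """Decompose score uncertainty for an (n_items x k_trials) 0/1 outcome matrix.

    Returns (mean score, item-sampling SE, within-item stochastic SE, total SE).
    Note: the item-sampling term uses the spread of estimated per-item probabilities,
    which slightly overstates true item spread because it still contains trial noise.
    """
    n_items, k = outcomes.shape
    p_hat = outcomes.mean(axis=1)          # per-item response probabilities
    score = p_hat.mean()

    # How much the score would move if we drew a different sample of items.
    se_item_sampling = p_hat.std(ddof=1) / np.sqrt(n_items)

    # Bernoulli run-to-run noise around each p_i, shrunk by the k repeats.
    se_stochastic = np.sqrt(np.mean(p_hat * (1 - p_hat)) / k) / np.sqrt(n_items)

    return score, se_item_sampling, se_stochastic, np.hypot(se_item_sampling, se_stochastic)

# Simulated example: a 198-item benchmark, 5 trials per item (pass@k-style repeats).
rng = np.random.default_rng(0)
true_p = rng.beta(2.0, 1.0, size=198)                   # hypothetical item response probabilities
outcomes = rng.binomial(1, true_p[:, None], size=(198, 5))
print(decompose_benchmark_variance(outcomes))
```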

One organisation is actually doing this. Artificial Analysis — an independent benchmarking firm that has become the closest thing the industry has to a trusted third-party evaluator — runs multiple repeats across their entire evaluation suite and targets a 95% confidence interval of less than ±1% on their composite Intelligence Index. Their GPQA Diamond evaluations use 5 repeats per question. Their AIME 2025 uses 10 repeats. Their agentic benchmarks (Terminal-Bench Hard, τ²-Bench Telecom) use 3 repeats each. This is what responsible measurement looks like — and the fact that it costs them significantly more than single-run evaluation is a cost the field needs to accept.

But Artificial Analysis is the exception. The fact that most benchmarks are still reported as single-run point estimates in February 2026 is, frankly, embarrassing. We wouldn't accept a clinical diagnostic test with this level of measurement noise. We shouldn't accept it for deployment decisions about models that affect millions of users.


The Four Levels of Construct Shift

When we evaluate an LLM, we're supposedly measuring what it knows — the quality of the internal representations and reasoning patterns it acquired during training. In measurement terms, this is the construct: the latent trait we intend to measure.

But inference-time augmentation progressively shifts the construct being measured — from intrinsic model capability toward system-level performance. This isn't cheating; it's a validity threat that psychometricians have studied for decades under names like "construct-irrelevant variance" and "method effects." The problem isn't that augmentation exists — it's that it's uncontrolled and unreported.

I call this the Inference-Time Augmentation (ITA) hierarchy. A quick notation guide before we dive in:

Notation. Throughout this piece: Θ (theta) = base model capability (the stable property we want to measure). σ = scaffolding sensitivity (how much a model's score improves with initial augmentation — not the statistical standard deviation). κ = capacity ceiling (where more scaffolding stops helping). τ₁, τ₂, τ₃ = the contribution of reasoning compute, tool use, and agentic scaffolding respectively to the observed score.

The four ITA levels:

Level 0 — Raw capability. The model answers from its weights alone. No scaffolding, no chain-of-thought, no tools. This is as close to measuring Θ as we can get.

Level 1 — Reasoning compute. We add chain-of-thought prompting, extended thinking budgets, best-of-N sampling. The model still has no external information — but it's burning more compute. Now the score reflects Θ plus a reasoning efficiency factor. A model that's great at decomposing problems into steps looks "smarter" than one that isn't, even if they have identical underlying knowledge.

Level 2 — Tool use. We hand the model a calculator, a code interpreter, web search. The model becomes an orchestrator. A 7B model with a calculator and a 405B model without one can now score identically on arithmetic for completely different reasons. The score is a three-way mixture of base capability, reasoning efficiency, and tool orchestration ability.

Level 3 — Agentic scaffolding. Multi-step tool chains, self-correction loops, multi-agent debate. We're no longer measuring a model. We're measuring a system. The score reflects engineering decisions as much as model properties.

Each level shifts the construct further from Θ toward "system capability." And current benchmarks jump between these levels without acknowledgment, producing scores that conflate everything together.

Here's the thing: educational psychometrics solved this problem decades ago. When researchers discovered that timed exams measure two things at once — ability and processing speed — they developed formal methods to either model the dimensions separately or standardise conditions to eliminate the confound. The relevant framework is generalizability theory (Brennan, 2001), which treats each source of variance — items, occasions, raters, prompts, seeds — as a separate "facet." For LLM evaluation, the prompt format, reasoning budget, tool configuration, and agent scaffold are literally facets in the G-theory sense. We have the mathematical machinery. The LLM evaluation community just hasn't picked it up.

The standardisation problem is real and insidious. As Artificial Analysis co-founder Micah Hill-Smith noted in a recent interview: when Google needed a number that said Gemini 1.0 Ultra was better than GPT-4, they constructed 32 unpublished chain-of-thought examples for every MMLU topic — a prompting strategy they never shipped to users. Labs routinely prompt their own models differently than competitors', cherry-picking few-shot examples, reasoning scaffolds, and temperature settings. A few points on MMLU can be the difference between "SOTA" and "also-ran", and those points can be manufactured through prompt engineering alone.

This isn't speculation. Sun et al. (2025) ran controlled experiments showing that evaluation design choices as mundane as answer-option ordering and correct-answer placement on GPQA Diamond shift scores by over 5 percentage points — enough to reshuffle a leaderboard entirely. Their paper's title, "Evaluation is All You Need," is only half-joking: with the right evaluation setup, you can manufacture improvements that don't reflect genuine model progress.

Artificial Analysis addresses this with what they call a "mystery shopper policy" — registering accounts on domains not associated with their company and running evaluations incognito, so labs can't serve different models on private endpoints. They also enforce standardised testing conditions: temperature 0 for non-reasoning models, temperature 0.6 for reasoning models, and identical prompting strategies across all models. This is the minimum standard the field should demand, and most published benchmarks don't come close.


The February 2026 Benchmark Landscape: Case Studies in Construct Shift

The ITA framework isn't abstract. Every major benchmark in current use illustrates the problem. (I examine the full landscape — 20+ benchmarks classified by ITA level, construct, and variance source — in the Appendix: Benchmark Field Guide below.) Three case studies capture the core argument.

Case Study 1: GPQA Diamond — Leaderboards Inside Overlapping Uncertainty

GPQA Diamond (198 PhD-level science questions) is the current workhorse reasoning benchmark. The top three frontier models — Gemini 3 Pro at 92.6% ±1.7, GPT-5.2 at 91.4% ±1.8, Claude Opus 4.6 at 90.5% ±1.7 — are separated by gaps smaller than their reported uncertainty. Given those intervals, you can't confidently order these models without a paired design or more trials. The "leaderboard" is three overlapping distributions, not a ranking.
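What would a paired design look like? Both models answer the same 198 items, and you test only the items where they disagree — far more powerful than eyeballing two independently computed intervals. A minimal sketch (an exact McNemar-style test on the discordant pairs; the toy outcome counts are invented for illustration):

```python
from scipy.stats import binomtest

def paired_comparison_pvalue(model_a_correct, model_b_correct) -> float:
    """Exact McNemar-style test on the items where two models disagree.

    Inputs are 0/1 outcome sequences on the SAME items. Under the null of equal
    ability, model A should win roughly half of the discordant items.
    """
    a_only = sum(1 for a, b in zip(model_a_correct, model_b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(model_a_correct, model_b_correct) if b and not a)
    discordant = a_only + b_only
    if discordant == 0:
        return 1.0
    return binomtest(a_only, discordant, p=0.5).pvalue

# Toy example: 198 items — 170 both correct, 14 A-only, 8 B-only, 6 both wrong.
a = [1] * 170 + [1] * 14 + [0] * 8 + [0] * 6
b = [1] * 170 + [0] * 14 + [1] * 8 + [0] * 6
print(paired_comparison_pvalue(a, b))   # ~0.29: a 6-item edge on 198 items is not a ranking
```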

More revealing: Claude Opus 4.6 has separate entries for "high, 32k thinking" (90.5%) and "high, 64k thinking" (88.8%). Same model weights, same knowledge, different thinking budget. That 1.7-point gap is pure τ₁ (reasoning compute), not Θ. And it's enough to move a model from "first place" to "third place" — based on a parameter the evaluator chose, not a property the model has.

Sun et al. (2025) demonstrated that evaluation design choices as mundane as answer-option ordering can shift GPQA scores by over 5 percentage points. Five points — more than the spread across the top three models — from a design choice most evaluators don't even report.

In the NIST three-object framework: GPQA leaderboards report system performance (model + thinking budget + prompt design) as if it were generalised accuracy (how well the model handles PhD-level science in general). The gap between those two objects is where the misleading happens.

Case Study 2: METR Time Horizons — Uncertainty Is the Story

METR's time horizon metric measures the human task duration at which an AI model reaches 50% success. Claude Opus 4.5 leads at 288.9 minutes ±558.2. The uncertainty is nearly twice the point estimate. The 95% interval spans roughly 1 hour 49 minutes to 20 hours 25 minutes — derived via bootstrapping, and METR themselves caution that the upper end likely reflects limitations of the task suite rather than true capability.

This is what honest measurement looks like. METR is admirably explicit about the limitations: the task suite doesn't include enough long tasks to reliably bound the upper end, and different scaffolding choices would produce different results. They standardise what they use and publish the conditions. The score is scaffold-dependent and they say so.

If every benchmark team showed this level of transparency, we wouldn't need this essay. The problem is that most teams report similar levels of uncertainty (when they measure it at all) but present the point estimate as if it's the whole story.

Case Study 3: ARC-AGI-2 — The Dosage Curve Made Visible

ARC-AGI-2 is simultaneously the most psychometrically honest benchmark in the field and the most vivid demonstration of ITA-driven construct shift.

François Chollet designed ARC in 2019 to measure fluid intelligence — novel abstract reasoning that resists memorisation. ARC-AGI-1 endured five years with minimal progress until test-time reasoning compute cracked it; by early 2026 it's saturated at 96%. ARC-AGI-2 raised the difficulty while preserving human solvability (100% of tasks solved by at least two people in under two attempts). But AI performance plummeted — pure LLMs score 0%.

The current leaderboard is a masterclass in construct shift:

  • Gemini 3 Deep Think: 84.6% at $13.62/task
  • Claude Opus 4.6 Thinking Max: 68.8%
  • GPT-5.2 Thinking: 52.9% at $1.90/task
  • Gemini 3 Pro baseline: 31.1% at $0.81/task
  • NVARC (fine-tuned 4B model): ~24% at $0.20/task

Gemini 3 Pro baseline scores 31.1%. The same underlying model in Deep Think mode scores 84.6%. That's a 53.5 percentage point gain from reasoning compute alone — same Θ, radically different scores. No other benchmark so cleanly isolates the τ₁ dimension.

The cost data makes this even starker. ARC Prize reports cost per task alongside accuracy — explicitly because "intelligence is about finding the solution efficiently, not exhaustively." Deep Think costs 17× more than baseline for a 2.7× accuracy increase. Meanwhile, NVARC — a 4-billion-parameter model, 100× smaller than frontier systems — reaches 77% of the baseline Gemini 3 Pro score at 25% of the cost. The "intelligence" rankings depend entirely on whether you're measuring raw accuracy, cost-adjusted accuracy, or parameter-adjusted accuracy. Three different constructs on the same leaderboard.
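You can see the ranking flip directly from the numbers above. A quick sketch using the quoted accuracy and cost figures (Claude Opus 4.6 is omitted because no cost was listed for it):

```python
# (accuracy %, cost per task in $), as quoted in the leaderboard above
entries = {
    "Gemini 3 Deep Think": (84.6, 13.62),
    "GPT-5.2 Thinking":    (52.9, 1.90),
    "Gemini 3 Pro":        (31.1, 0.81),
    "NVARC (4B)":          (24.0, 0.20),
}

by_accuracy = sorted(entries, key=lambda m: entries[m][0], reverse=True)
by_efficiency = sorted(entries, key=lambda m: entries[m][0] / entries[m][1], reverse=True)

print(by_accuracy)     # Deep Think first, NVARC last
print(by_efficiency)   # NVARC first, Deep Think last — the ordering inverts completely
```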

ARC-AGI-3 (expected March 2026) will shift the format entirely: from static grid puzzles to interactive environments requiring exploration, planning, memory, and goal acquisition. That's a deliberate move from Level 0–1 measurement to Level 2–3 — making the construct shift explicit rather than accidental.


The Pattern Across All 20+ Benchmarks

(See the full Appendix: Benchmark Field Guide for the detailed breakdown. Here's the summary.)

Four things jump out from the full landscape analysis:

First, the benchmarks that most cleanly measure base capability — SimpleBench, ARC-AGI, GSO, VPCT — are not the ones labs headline. The headline benchmarks (SWE-bench, GPQA, AIME, HLE) are Level 1–3 and heavily subject to construct shift. They produce the most impressive-looking numbers, which is exactly why labs prefer them.

Second, confidence intervals, where reported, frequently overlap between the "top-ranked" models. The GPQA Diamond "leaderboard" is three overlapping distributions, not a ranking. Add in model stochasticity from multiple runs, and the "winner" on any given day is essentially random.

Third, the same benchmark produces wildly different scores depending on the evaluation framework. SWE-bench scores differ by 15+ percentage points between independent and lab-reported evaluations because the scaffolding differs. METR Time Horizons has confidence intervals larger than the point estimate. These are not stable measurements of stable properties.

Fourth, the organisations doing it best — Artificial Analysis (standardised harnesses, multiple repeats, retired saturated tests), METR (explicit scaffold documentation, honest uncertainty), Epoch AI (SimpleBench's 5-run methodology), and NIST (variance decomposition guidance) — prove that rigorous evaluation is feasible. The problem isn't that we don't know how. It's that the incentive structure rewards impressive-looking numbers over honest ones.


The Best Current Practice (And Why It's Still Not Enough)

If anyone is doing LLM evaluation right, it's Artificial Analysis. Their Intelligence Index v4 is the most methodologically rigorous composite score in the industry, and it's worth examining in detail — both as a model for what the field should be doing, and as an illustration of how far even the best current practice remains from proper psychometric measurement.

What they get right:

The Intelligence Index aggregates 10 evaluations across four equally-weighted categories: Agents (25%), Coding (25%), General (25%), and Scientific Reasoning (25%). This categorical structure implicitly acknowledges that "intelligence" is multidimensional — you can't reduce it to one number without at least structuring the dimensions first. Their agent benchmarks (GDPval-AA, τ²-Bench Telecom) sit in a dedicated category, separate from static Q&A benchmarks like GPQA Diamond and HLE. This is a crucial design choice: it prevents agentic scores from shifting the construct measured by assessments of base reasoning, and it means a model that dominates at terminal operation (Level 3 in our hierarchy) doesn't get credit for "being smarter" — it gets credit for "being a better agent."

They run repeats: 5 repeats for GPQA Diamond, 10 for AIME 2025, 3 for agentic benchmarks, 5 for IFBench and CritPt. They've experimentally validated that their composite index achieves 95% confidence intervals of less than ±1%. They use a standardised agent harness (Stirrup) for agentic evaluations, ensuring that scaffolding differences between models are minimised — though not eliminated, since reasoning models inevitably use different thinking budgets.

They've built their own benchmarks to fill gaps. AA-Omniscience is a 6,000-question knowledge benchmark that does something almost no other benchmark does: it penalises hallucination. The scoring function assigns points for correct answers, subtracts points for confident wrong answers, and treats "I don't know" as neutral. This means a model that abstains when uncertain scores better than one that confabulates — measuring epistemic calibration alongside knowledge. In the Intelligence Index, Omniscience contributes equally via accuracy (50%) and non-hallucination rate (50%). This is a direct analogue of the IRT concept of item-person fit: a model that "knows what it doesn't know" has better calibrated item response functions.
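Here's a minimal sketch of that kind of abstention-aware scoring rule. The +1 / −1 / 0 weights are my illustration of the idea, not AA-Omniscience's published specification:

```python
def abstention_aware_score(responses) -> float:
    """Score (correct, answered) pairs: +1 correct, -1 confidently wrong, 0 for abstaining.

    `correct` is ignored when `answered` is False (the model said "I don't know").
    """
    total = 0.0
    for correct, answered in responses:
        if not answered:
            continue                       # abstention is neutral
        total += 1.0 if correct else -1.0  # confident errors cost as much as correct answers earn
    return total / len(responses)

# A cautious model beats a confabulator even though it answers fewer items correctly:
cautious = [(True, True)] * 60 + [(False, False)] * 40      # 60 right, 40 abstentions
confabulator = [(True, True)] * 70 + [(False, True)] * 30   # 70 right, 30 confident misses
print(abstention_aware_score(cautious), abstention_aware_score(confabulator))   # 0.6 vs 0.4
```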

Their AA-LCR benchmark (Long Context Reasoning) uses ~100k tokens of input per question across 230 documents, testing whether models can actually reason over long contexts rather than just retrieve from them. And they've been ruthless about retiring saturated benchmarks — their V1 index would be completely useless today because tasks like HumanEval-style Python function writing are now trivial for small models.

Where the ITA framework exposes remaining gaps:

Even with all this rigour, the Intelligence Index still conflates ITA levels within categories. GPQA Diamond (in Scientific Reasoning, 25% weight) is evaluated with reasoning models at temperature 0.6 using their full thinking budget. A model with 64k thinking tokens and a model with 4k thinking tokens both contribute to the same GPQA Diamond score, but they're operating at different points on the τ₁ dimension. The index doesn't decompose how much of each model's GPQA score comes from base knowledge (Θ) versus reasoning compute (τ₁).

Similarly, Terminal-Bench Hard (in Coding, 25% weight) is pure Level 3 — the score depends on the Terminus 2 agent harness as much as the model. Artificial Analysis standardises this by using the same harness for all models, which is far better than the Wild West of lab-reported SWE-bench scores. But the score is still a system property, not a model property. A model that would excel with a different agent architecture gets no credit for that latent capability.

The deepest issue is that the Intelligence Index is still a single number. A composite of composites. Two models with identical index scores can have radically different profiles: one might dominate on GDPval (real-world tasks with tools) while struggling on HLE (pure reasoning), while the other shows the reverse pattern. The index score hides this. The categorical breakdown (four 25% buckets) is better than nothing, but it's still four numbers where you need at minimum six: base capability (Θ), reasoning sensitivity (σ), capacity ceiling (κ), tool orchestration (τ₂), agentic system performance (τ₃), and epistemic calibration.

Artificial Analysis is the existence proof that rigorous, independent, multi-repeat evaluation at scale is economically viable. Their methodology should be the floor, not the ceiling, of what the field demands. But the framework I'm proposing would push even their best-in-class approach further — from "fair measurement under standardised conditions" to "decomposed measurement that separates the dimensions entirely."


Why Tool Use Breaks the Standard Model

Reasoning compute is a quantitative problem — it adds a dimension, which existing statistical methods can handle. Tool use is a qualitative pathology that causes standard unidimensional IRT to misfit.

Below some ability threshold, a model can't use tools at all. It produces malformed API calls, hallucinates function signatures, can't parse results. Above that threshold, tools work and certain items become trivially easy. This creates a step-function in the item response surface where unidimensional IRT expects a smooth monotonic curve — and the result is systematic misfit.

The fix isn't to abandon IRT; it's to reach for the right extensions. Multidimensional IRT can model base competence and tool competence as separate latent dimensions. Hurdle models can capture the threshold below which tool use fails entirely. Mixture IRT can identify subpopulations of items that behave differently with and without tools. The toolbox exists — but the field is still using the simplest model and wondering why the fit is poor.
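To make the misfit concrete, here's a toy item response function with a tool hurdle — a sketch of the shape these extensions need to capture. Every parameter here is made up for illustration; nothing is fitted to real benchmark data.

```python
import numpy as np

def p_correct(theta_base, theta_tool, item, guess=0.25):
    """Hurdle-style item response function for tool-dependent items (illustrative).

    theta_base: latent base ability; theta_tool: latent tool competence.
    item: dict with discrimination 'a', difficulty 'b', a 'needs_tool' flag,
    and a tool-use threshold 'tool_b'.
    """
    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Standard 2PL curve on the base ability dimension.
    p_base = logistic(item["a"] * (theta_base - item["b"]))
    if not item["needs_tool"]:
        return p_base

    # Hurdle: below the tool threshold the model can't use the tool at all, so
    # performance collapses toward guessing; above it, the tool makes the item easier.
    p_hurdle = logistic(4.0 * (theta_tool - item["tool_b"]))                  # steep, step-like
    p_with_tool = logistic(item["a"] * (theta_base - (item["b"] - 2.0)))      # tool lowers difficulty
    return p_hurdle * p_with_tool + (1 - p_hurdle) * guess

item = {"a": 1.2, "b": 1.0, "needs_tool": True, "tool_b": 0.0}
for theta_tool in (-1.5, 0.0, 1.5):
    print(theta_tool, round(p_correct(theta_base=0.5, theta_tool=theta_tool, item=item), 3))
```

Plot that over theta_tool and you get the step-function a unidimensional model can't fit: near-guessing below the threshold, near-ceiling above it.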

Worse, tools create a fan-spread: they amplify existing ability differences rather than compensating for them. High-ability models leverage tools multiplicatively — better queries, multi-step chains, output verification. Low-ability models either can't use tools at all or use them poorly. Tool-augmented benchmarks systematically exaggerate the real ability gap between strong and weak models.


The Gorilla Problem: Same Score, Different Universe

Picture two models:

Model A is a 7B parameter model fine-tuned extensively on function-calling datasets. On a tool-augmented benchmark, it scores 85%.

Model B is a 405B parameter base model without tool training. Same benchmark, same tools available. It scores 75%.

The leaderboard says Model A is better. By a lot.

Now remove the tools. Model A collapses to 30%. Model B scores 82%. The ranking inverts completely.

This isn't hypothetical. Models like Gorilla, xLAM, and NexusRaven are real, deployed systems with exactly this profile. Apply this to SWE-bench specifically: a small model embedded in a brilliantly engineered agentic framework can outscore a much larger model in a naïve framework. The SWE-bench score tells you about the system. Deployers who switch to a different framework will get completely different results from the same model. The benchmark score is not a property of the model — it's a property of the model-scaffold interaction. Reporting it as a model property is a category error.

This is exactly what we see in the Feb 2026 data: DeepSeek V3.2-Speciale scores 77.8% on SWE-bench in one evaluation framework but lower in independent evaluations. The model didn't change — the scaffolding did.


The Fix: Stop Reporting One Condition. Report a Curve.

Here's where it gets interesting — and where the measurement framework moves from diagnosis to prescription. Everything I've described has a precise parallel in human intelligence testing.

Raven's Progressive Matrices is the gold-standard test of fluid intelligence. For decades, psychometricians have studied what happens when you coach people on it. The result is consistent: coached scores jump by 0.3–0.5 standard deviations. Then they hit a wall. More coaching doesn't help. The wall height correlates with working memory capacity — an architectural constraint, not a knowledge deficit.

Chain-of-thought prompting is coaching. Extended thinking is practice time. And the wall — the point where more scaffolding stops helping — is the model's true capacity.

This motivates the strategy dosage curve, which I believe is the single most important measurement primitive the field is missing. Instead of testing a model at a single augmentation level, you test it under escalating scaffolding:

  • S0: Zero-shot, single-token answer
  • S1: "Think step by step"
  • S2: Structured decomposition scaffold
  • S3: A worked example of similar problems
  • S4: Tree-of-thought with multiple reasoning paths, self-critique, majority vote

Then you fit a saturation curve and extract two parameters (as defined in the notation box above):

  • σ (scaffolding sensitivity): How much does the model improve with initial scaffolding?
  • κ (capacity ceiling): Where does it plateau no matter what you throw at it?
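One way to operationalise the fit — a sketch assuming an exponential-saturation form, applied to two hypothetical models like those in the same-score illusion below. Here σ is read off as the gain from the first dose of scaffolding and κ as the fitted asymptote; the S0–S4 scores are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturation(dose, p0, rate, kappa):
    """Exponential-saturation dosage curve: starts at p0 and plateaus at kappa."""
    return kappa - (kappa - p0) * np.exp(-rate * dose)

# Hypothetical S0..S4 scores for two models (illustrative numbers only).
doses = np.arange(5, dtype=float)
models = {
    "70B base":   np.array([0.55, 0.66, 0.73, 0.77, 0.79]),
    "7B distill": np.array([0.73, 0.735, 0.74, 0.74, 0.74]),
}

for name, scores in models.items():
    (p0, rate, kappa), _ = curve_fit(
        saturation, doses, scores,
        p0=[scores[0], 1.0, scores[-1]],
        bounds=([0.0, 0.0, 0.0], [1.0, 10.0, 1.0]),
    )
    sigma = saturation(1.0, p0, rate, kappa) - p0   # gain from the first dose of scaffolding
    print(f"{name}: sigma (initial gain) = {sigma:.3f}, kappa (ceiling) = {kappa:.3f}")
```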

These two numbers tell you radically more than any single benchmark score.

We can see this directly in the Feb 2026 data. Claude Opus 4.6 has separate GPQA entries for "high, 32k thinking" (90.5%) and "high, 64k thinking" (88.8%) and "high, no thinking" (unlisted but significantly lower). That's three points on a dosage curve from a single model.

But the most dramatic dosage curve data is the ARC-AGI-2 case study above: Gemini 3 Pro baseline at 31.1% versus Deep Think at 84.6%. A 53.5-point gain, same weights, different reasoning budget. That is a dosage curve — we just don't call it one yet.

If labs reported full dosage curves — S0 through S4 for every model on every benchmark — we could decompose σ and κ for every model-benchmark pair.

The same-score illusion is everywhere. Imagine two models that both score ~73% on GPQA Diamond under standard chain-of-thought:

A base 70B model went from 55% at S0 to 73% at S2, plateauing around 80%. High σ, high κ.

An R1-distilled 7B went from 73% at S0 to 74% at S4. σ ≈ 0 (strategy baked in), κ = 75%.

Under a single-condition benchmark, they look equivalent. But the 70B has headroom. When harder items arrive next year, it will scale. The 7B won't. The dosage curve is the only way to expose this.


The Capacity Wall: Where All Roads Converge

For any given model, as items get harder, all augmentation conditions converge. Easy items? The model gets them right with or without chain-of-thought. Medium items? Scaffolding helps substantially. Hard items? The model fails with and without scaffolding. Nothing helps.

That convergence point — the difficulty at which no amount of prompting, compute, or strategy makes a difference — is the capacity wall. And the capacity wall is the model's true capability.

Look at VPCT (the physics ramp benchmark) as a case study: Gemini 3 Pro scores 91%, while GPT-5.2 ranges from 67% at "high" thinking to 84% at "xhigh" thinking. The 17-point gap between thinking settings for the same model is pure reasoning compute (τ₁). But both settings will converge at some difficulty level — and that convergence reveals the true gap between models, stripped of scaffolding.


Confidence Intervals: The Minimum Viable Fix

Of all the problems I've described, the stochasticity problem has the simplest fix and the least excuse for remaining unsolved.

As shown earlier, the coin-flip items in a typical benchmark contribute roughly 1.4 percentage points of standard deviation — about ±2.7 points for a 95% confidence interval — before you add protocol variance. That's enough to swamp the 1–2 point gaps on frontier leaderboards.

The fix is elementary: run the benchmark multiple times and report the mean ± standard error. Better yet, fit an IRT model that estimates per-item response probabilities and derives a latent ability estimate with proper standard errors. This gives you a continuous ability scale with known precision, rather than a percentage that could be anywhere in a multi-point window.
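And here is what "fit an IRT model" can look like in its simplest form: a grid-search maximum-likelihood ability estimate under a 2PL model, with the standard error taken from Fisher information. The item parameters below are simulated; in practice they would come from calibrating the benchmark across many models first.

```python
import numpy as np

def estimate_ability(responses, a, b, grid=np.linspace(-4, 4, 801)):
    """Maximum-likelihood ability estimate under a 2PL IRT model with known item parameters.

    responses: 0/1 vector of item outcomes; a, b: item discriminations and difficulties.
    Returns (theta_hat, standard_error).
    """
    responses, a, b = map(np.asarray, (responses, a, b))
    # Log-likelihood of each candidate theta on a grid (simple and robust in one dimension).
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))
    loglik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    theta_hat = grid[np.argmax(loglik)]

    # Fisher information at theta_hat gives the measurement precision: SE = 1 / sqrt(I).
    p_hat = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    info = np.sum(a**2 * p_hat * (1 - p_hat))
    return theta_hat, 1.0 / np.sqrt(info)

# Toy example: 50 items of varying difficulty, one simulated model with true theta = 0.7.
rng = np.random.default_rng(1)
a = rng.uniform(0.8, 2.0, 50)
b = rng.normal(0.0, 1.0, 50)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-a * (0.7 - b))))
print(estimate_ability(y, a, b))   # prints (theta_hat, standard_error)
```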

Some benchmarks are starting to do this. The LM Council reports confidence intervals. METR Time Horizons honestly shows ±558.2 minutes on its leading entry. BALROG reports ±2.2%. But many of the most-cited benchmarks — MMLU-Pro, SimpleBench, GDPval, Terminal-Bench — still report single-run point estimates. And even where confidence intervals exist, they typically capture only item sampling error, not within-item stochasticity or protocol variance.

We need all three. And we need deployers to understand that when two models are within each other's confidence intervals — as the top three models on GPQA Diamond currently are — the "ranking" is noise.


Profiles, Not Scores

The implication is straightforward: a single benchmark ranking is psychometrically incoherent when augmentation strategies vary and stochastic uncertainty is unquantified.

The fix isn't complicated. Instead of:

Model X: GPQA = 90.5%

Report:

Model X: Θ = 1.2 ±0.08, σ = 0.4, κ = 91% ±1.2, τ₂ = 1.4, class = generalist
Model Y: Θ = −0.2 ±0.12, σ ≈ 0, κ = 75% ±0.9, τ₂ = 1.6, class = tool-specialist
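In machine-readable form, that profile is just a handful of fields — a sketch of the record I'd want every leaderboard entry to expose (the field names are mine, not an existing standard):

```python
from dataclasses import dataclass

@dataclass
class CapabilityProfile:
    theta: float        # base capability (latent ability estimate)
    theta_se: float     # standard error of that estimate
    sigma: float        # scaffolding sensitivity: gain from initial augmentation
    kappa: float        # capacity ceiling: dosage-curve plateau, in score units
    kappa_se: float
    tau2: float         # tool-orchestration contribution
    model_class: str    # e.g. "generalist" or "tool-specialist"

model_x = CapabilityProfile(1.2, 0.08, 0.4, 0.91, 0.012, 1.4, "generalist")
model_y = CapabilityProfile(-0.2, 0.12, 0.0, 0.75, 0.009, 1.6, "tool-specialist")
```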

Now a deployer can ask the right question. "Am I deploying this with guaranteed tool access? Then Model Y's profile matters. Am I measuring what the model has actually learned? Then Θ and κ matter. Am I evaluating a training method? Then I care about whether σ moved or κ moved. And am I confident in the measurement? Then I check the standard errors."

No single number captures all of this. And pretending one does is how you end up trusting a small tool-calling specialist over a large model that genuinely understands the domain — until the novel API schema arrives that wasn't in the fine-tuning data, and the specialist falls apart while the generalist adapts.


Scaling Laws = Spearman's g — And That's the Problem

There's a deeper question hiding here that the intelligence research community will recognise: do LLMs have a general factor?

Spearman's key insight (1904) was that when you administer a broad and varied battery of cognitive tests to a broad population, you can decompose the variance in scores into two components: a general factor (g) that loads onto every test, and specific factors (s) unique to each test. Together, g + s explain the majority of the variance. The empirical signature of g is the positive manifold — the finding that all cognitive tests correlate positively, because g contributes to all of them.

LLMs have exactly this structure. The empirical scaling relationships discovered by Kaplan et al. (2020) and refined by Hoffmann et al. (2022) show that as you increase compute, data, and parameters together, loss decreases as a smooth power law across essentially all tasks simultaneously. More scale → lower loss → better performance on everything. One underlying resource dimension that loads onto every benchmark. That's not merely analogous to Spearman's g. It's the closest AI parallel possible. Scaling laws are the general factor for LLMs.

And that's exactly why IQ-style composite scores are so seductive — they're capturing something real. A model trained with more compute genuinely is "generally more capable," in the same way a person with higher g genuinely does better across a range of cognitive tasks. The positive manifold at the frontier isn't an illusion. The leaderboard is measuring a real thing.

But here's the catch. g is real, and g is what makes IQ tests meaningful — and you can still cheat at IQ tests.

Decades of research on practice effects, coaching, and test-taking strategy show consistent results: coached IQ scores jump by 0.3–0.5 standard deviations. The test-taker gets better at the format — time management, elimination strategies, pattern familiarity — without any change in the underlying g. The score goes up. The trait doesn't. Psychometricians call this construct-irrelevant variance: the test is now measuring test-taking skill on top of the construct it's supposed to isolate.

This is the entire article in one analogy.

Inference-time augmentation is coaching for LLM evals. Extended thinking budgets, chain-of-thought scaffolding, tool access, agent loops — these are the LLM equivalent of test prep courses. They inflate the score without changing what's in the weights. Gemini 3 Pro doesn't gain 53.5 percentage points of fluid reasoning ability when you turn on Deep Think. It gains 53.5 percentage points of test performance by deploying a better problem-solving strategy. The g-like trait (what was learned during training) hasn't changed. The score has.

That's what "lying" means. Not fraud. Not meaninglessness. Scores that reflect test-taking strategy rather than the underlying trait the test is supposed to measure.

The specific factors matter too. Specialist models — Gorilla-7B for tool calling, R1-distilled models for reasoning chains — are the LLM equivalent of human savants: individuals with low g but enormous s on one dimension. Both g and s exist in Spearman's framework simultaneously. The general factor is real AND the specific factors are large enough to invert rankings depending on the task. A composite score captures the general factor and hides the specific-factor structure — which is exactly the information a deployer choosing between a generalist and a specialist actually needs.

As the model ecosystem diversifies — more specialist fine-tunes, more RL-trained reasoners, more domain-specific distillations — the specific factors grow larger relative to g. The coaching effects grow more extreme. And single composite scores become progressively less adequate, even though the general factor they track remains real.


What Needs to Change

Report ITA conditions alongside scores — and adopt a standard Evaluation Card. Every SWE-bench score should specify the agent framework. Every GPQA score should specify the thinking budget. Every AIME score should specify the reasoning scaffold. Without this, scores are not comparable. The stochasticity-in-agentic-evals work (Mustahsan et al., 2025) proposes exactly this kind of reporting template, and it should become standard practice.

Here's a minimum viable Evaluation Card — the metadata that should accompany every published benchmark result:

Evaluation Card (minimum metadata)

  • Model: name + version + endpoint (public API / private / self-hosted)
  • Benchmark: name + split + version/commit hash
  • Prompting: system prompt + few-shot examples (full text or hash)
  • Decoding: temperature, top_p, max_tokens, stop rules
  • Reasoning/effort: thinking budget / effort setting / number of reasoning tokens
  • Tools: enabled tools + versions + retry/timeout policy
  • Harness: agent framework + planner/critic loops + k for pass@k
  • Trials: number of items × number of trials per item × seed strategy
  • Score: mean ± CI, with variance decomposed into item sampling, within-item stochasticity, and protocol variance where possible
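The same card in machine-readable form, with illustrative values throughout (the field names and example entries are mine, not an existing schema):

```python
evaluation_card = {
    "model": {"name": "example-model", "version": "2026-02-01", "endpoint": "public API"},
    "benchmark": {"name": "GPQA Diamond", "split": "diamond", "version": "commit <hash>"},
    "prompting": {"system_prompt_hash": "<sha256>", "few_shot_examples": 0},
    "decoding": {"temperature": 0.6, "top_p": 1.0, "max_tokens": 32768, "stop": []},
    "reasoning": {"effort": "high", "thinking_budget_tokens": 32768},
    "tools": {"enabled": [], "retry_policy": None, "timeout_s": None},
    "harness": {"agent_framework": None, "pass_at_k": 1},
    "trials": {"n_items": 198, "trials_per_item": 5, "seed_strategy": "fixed per trial"},
    "score": {
        "mean": 0.905,
        "ci95": [0.888, 0.922],
        "se_components_pts": {"item_sampling": 0.7, "within_item": 0.4, "protocol": None},
    },
}
```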

That turns a critique into a standard. Any benchmark result published without at least this metadata should be treated as incomplete. NIST's AI 800-3 guidance points in the same direction: specify a statistical model for what you think you're measuring, and report uncertainty appropriately rather than pretending a single accuracy number is self-explanatory.

Report confidence intervals. Always. Single-run point estimates are psychometric malpractice. Artificial Analysis has proven this is economically viable at scale — they run 3–10 repeats per evaluation and achieve ±1% confidence on their composite index. If a 20-person company can do this, major labs have no excuse. At minimum, report item-sampling uncertainty. Ideally, run multiple times and decompose the three variance sources (item sampling, within-item stochasticity, protocol variance). The field should treat any benchmark result reported without uncertainty quantification the way we treat a p-value reported without an effect size: as incomplete to the point of being uninterpretable.

Adopt dosage curves. Five ITA conditions per benchmark instead of one. It's not free, but it catches the same-score illusion that single-condition benchmarks miss entirely. The Claude Opus 4.6 GPQA entries already show this is feasible — they just need to be formalised.

Build more ARC-AGI-style benchmarks. Tests that resist augmentation by testing implicit, fast reasoning rather than multi-step chains. The commonsense and fluid reasoning benchmarks are the closest thing we have to clean Θ measurement. We need harder versions — and, as ARC-AGI-3's shift toward interactive environments shows, versions that measure agentic reasoning by design rather than having it creep in as construct-irrelevant variance.

Stop putting Level 0 and Level 3 scores on the same leaderboard. MMLU-Pro and SWE-bench are measuring different constructs. GPQA (no tools) and GAIA (tool-dependent) test different things. Listing them side by side implies commensurability that doesn't exist. Artificial Analysis's categorical structure — separating Agents, Coding, General, and Scientific Reasoning into distinct 25% buckets — is a step in the right direction. But even within those categories, ITA levels are still mixed.

Replace single rankings with capability profiles. A profile that reports base capability, strategy sensitivity, capacity ceiling, tool orchestration, and confidence intervals tells you what you need to know for your specific deployment context. A single rank does not.


The LLM evaluation crisis is a measurement crisis. Not a data crisis, not a compute crisis, not a benchmark-leakage crisis — though those are real too. It's that we're measuring multiple constructs at once, reporting them without confidence intervals or administration metadata, calling them one thing, and making deployment decisions on the result.

Benchmark scores are conditional expectations masquerading as intrinsic traits. The fix is to make the conditions explicit, decompose the variance, and report profiles instead of rankings. We've had the tools to do this since 1968. It's time we picked them up.


Appendix: Benchmark Field Guide (February 2026)

Each benchmark is classified by ITA level, stated construct, actual construct as measured, and whether confidence intervals are reported. This is the reference companion to the three case studies in the main essay.

The Pattern

| Benchmark | ITA Level | What it claims | What it actually measures (Feb 2026) | CI reported? |
|---|---|---|---|---|
| GPQA Diamond | 0–1 | Expert science knowledge | Θ + τ₁ (thinking budget dominates) | Yes (±1.7–1.9) |
| MMLU-Pro | 0–1 | Knowledge + reasoning | Knowledge + strategy; top models within 7 pts | No (usually) |
| HLE | 0–1 | Frontier capability | Ceiling detection; floor effects for most | Yes (±1.6–1.9) |
| FrontierMath | 1 | Research-level maths | Reasoning scaffold quality (τ₁) | Yes (±2.9) |
| SimpleBench | 0 | Common-sense reasoning | Closest to clean Θ in reasoning domain | No |
| MATH Level 5 | 1 | Maths reasoning | Saturated — top models >97% | Yes (±0.3) |
| Mock AIME | 1 | Competition maths | τ₁-dominated; thinking budget is the variable | Yes (±2.7–3.3) |
| SWE-bench Verified | 2–3 | Coding ability | System capability (Θ + τ₁ + τ₂ + τ₃) | Yes (±2.1–2.2) |
| Terminal-Bench | 3 | Terminal operation | Agentic scaffolding quality | No |
| WebDev Arena | 3 + pref | Web development | System capability + aesthetic preference | No (Elo CI) |
| METR Time Horizons | 3 | Task completion speed | System capability; CI is ±558 minutes | Yes (huge) |
| GDPval | 2–3 | Real work output | Pragmatist system capability | No |
| BALROG | 2–3 | Game reasoning | Long-horizon planning + stochastic variance | Yes (±2.2) |
| Chatbot Arena | Uncontrolled | Human preference | Persuasiveness + style + unknown ITA mix | Yes (Elo CI) |
| ARC-AGI-1 | 0–1 | Fluid reasoning | Saturated (96%); data contamination suspected | Varies |
| ARC-AGI-2 | 0–1 (but τ₁ dominates) | Fluid reasoning | 53-pt swing from thinking budget alone; cost-per-task metric | Varies |
| ARC-AGI-3 | 2–3 (interactive) | Agentic reasoning | Upcoming; first format change since 2019 | TBD |
| GSO | 0–1 | Code optimisation | Understanding + minimal scaffolding benefit | No |
| VPCT | 0–1 | Physical reasoning | Θ; wide spread across thinking budgets | No |
| AA-Omniscience | 0 | Knowledge + calibration | Uniquely measures epistemic calibration | Yes (via repeats) |
| CritPt | 1 | Research physics | Hard enough to resist saturation | Yes (5 repeats) |
| τ²-Bench Telecom | 2–3 | Conversational agents | Dual-control agentic capability | Yes (3 repeats) |

Knowledge & Reasoning Tier

GPQA Diamond — See Case Study 1 in main essay.

MMLU-Pro expanded to 10 answer choices and harder items, replacing the saturated MMLU. The top 10 models are now within six percentage points of each other — from GPT-5.2 Pro at 88.7% down to Mistral 3 at 82.8%. "Harder" inevitably means "more reasoning required," pushing the benchmark from Level 0 toward Level 1. When a reasoning model with 64k thinking budget takes MMLU-Pro, it's measuring Θ + τ₁. When a standard model takes it zero-shot, it's closer to Θ alone. Both appear in the same leaderboard column.

Humanity's Last Exam (HLE) is 2,500 expert-crafted questions designed to be the hardest benchmark ever built. Grok 4 Heavy's 50% score made headlines. But for the majority of models, scores are near the floor. IRT shows that measurement information peaks when item difficulty matches ability — HLE items are so hard they provide almost no discriminative information for models below the frontier.

FrontierMath (several hundred unpublished expert-level mathematics problems) shows GPT-5.2 and Claude Opus 4.6 essentially tied around 40% ±2.9. Overlapping confidence intervals, reasoning models at extreme thinking budgets. The "40%" is a system property, not a model property.

SimpleBench ("trick" questions requiring common-sense reasoning) partially resists the τ₁ construct shift. The spread between first (Gemini 3 Pro at 76.4%) and second (Claude Opus 4.6 at 67.6%) — almost 10 points — is much larger than on reasoning benchmarks, suggesting it taps a genuinely different construct.

MATH Level 5 is saturated at the top: GPT-5 at 98.1%, o4-mini at 97.8%. Time to retire it as a frontier discriminator.

Mock AIME (45 problems harder than MATH Level 5) replaces official AIME to avoid data contamination. A model's AIME score is dominated by its reasoning scaffold quality — how effectively it decomposes, backtracks, and verifies. When labs report AIME scores from models with different thinking budgets, they're comparing systems, not models.

Coding Tier

SWE-bench Verified — The LM Council data (standardised Epoch AI evaluation) shows Claude Sonnet 4.5 at 64.8%, while the Awesome Agents rankings (aggregating different evaluations) show Claude Opus 4.5 at 80.9%. Same benchmark, 15+ percentage-point difference because the scaffolding differs. SWE-bench is inherently Level 2–3: the score is a four-dimensional measurement (Θ + τ₁ + τ₂ + τ₃). The 15-point gap is almost entirely τ₃.

Terminal-Bench 2.0 — GPT-5.3 Codex at 75.1%, Claude Opus 4.6 at 69.9%. Pure Level 3 — agentic terminal operation. Useful for evaluating agents, but the score tells you nothing about the model's code understanding independent of terminal operation ability.

WebDev Arena — Models build websites, humans vote. Claude Opus 4.5 at 1512 Elo. Level 3 + human preference. Honest about being an arena, but scores appear alongside GPQA in aggregate rankings, implying commensurability that doesn't exist.

Agentic Tier

METR Time Horizons — See Case Study 2 in main essay.

GDPval (44 knowledge-work occupations) — Claude Opus 4.1 at 43.6%, GPT-5 at 34.8%. Pragmatist construct ("can this model do work worth paying for?") rather than mentalist. Honest about what it measures.

BALROG (text-based games) — Grok 4 at 43.6% ±2.2. At least reports error bars, but they capture item sampling within a single run, not run-to-run stochasticity.

DeepResearchBench — Claude Sonnet 4.5 at 57.7%, GPT-5 at 57.4%. Gap of 0.3 points without reported confidence intervals. Noise.

Human Preference Tier

Chatbot Arena (LMArena) — Over 6 million votes, Bradley-Terry Elo. Claude Opus 4.6 Thinking at 1506. Captures whether responses feel helpful, but ITA conditions are completely uncontrolled. The Elo score conflates knowledge, reasoning, tool use, writing style, response length, and whatever makes humans click "prefer." "Style Control" Elo adjusts for verbosity bias, but fundamentally, you're comparing systems at unknown ITA levels.

Emerging Frontier

ARC-AGI — See Case Study 3 in main essay.

GSO (General Speedup Optimization) — GPT-5.2 at 27.4%, Claude Opus 4.5 at 26.5%. Possibly the most ITA-resistant coding benchmark: optimising performance requires genuine algorithmic understanding, not just code generation.

VPCT (Visual Physics Comprehension Test) — Predicting where a ball lands on a ramp. Gemini 3 Pro at 91%, GPT-5.2 ranges from 67% (high thinking) to 84% (xhigh thinking). The 17-point spread across thinking settings for the same model demonstrates τ₁ construct shift cleanly.

AA-Omniscience — 6,000-question knowledge benchmark that penalises confident wrong answers while treating abstention as neutral. The only major benchmark measuring epistemic calibration alongside knowledge. From an IRT perspective, it captures whether a model's confidence is calibrated, not just whether its answers are correct.

CritPt — Research-level physics, 70 challenges. Answers as Python functions, symbolic expressions, and numerical values. Hard enough that no model has saturated it, and multi-format requirements resist pattern-matching inflation.

τ²-Bench Telecom — Simulates technical support with dual-control conversational planning. Measures "can this agent actually resolve a customer problem" rather than "intelligence." Level 2–3 with an honest construct definition.

Factorio Learning Environment — Building factories in a video game. Claude 3.7 Sonnet leads at 29.1 but no model comes close to solving the challenge. Points toward evaluation in rich, dynamic environments rather than static question sets.


This post summarises work from "Test-Time Compute and Tool Use as Threats to Measurement Validity in LLM Evaluation: An Item Response Theory Analysis."
