Data

We fit the model to scores on 43 benchmarks from the Epoch Benchmarking Hub, with data spanning 2023 to the present. The benchmarks, grouped by the source of their scores, are:

Internal evaluations

Chess Puzzles, FrontierMath Tiers 1-3, FrontierMath Tier 4, GPQA Diamond, MATH Level 5, OTIS Mock AIME 2024-2025, SimpleQA Verified, SWE-Bench Verified

External benchmark leaderboards

Aider polyglot, APEX-Agents, ARC-AGI-2, BALROG, DeepResearch Bench, Fiction.liveBench, GeoBench, GSO, HLE, Lech Mazur Writing, OS World, PostTrainBench, SimpleBench, Terminal-Bench, The Agent Company, VPCT, WeirdML V2

Developer-reported scores

ANLI, ARC AI2, ARC-AGI, BIG-Bench Hard, CADEval, CSQA2, Cybench, GSM8K, HellaSwag, LAMBADA, MMLU, OpenBookQA, PIQA, ScienceQA, SuperGLUE, TriviaQA, Video MME, WinoGrande

Preprocessing

Our methodology requires benchmarks to be scored on a 0-1 scale (equivalently, 0% to 100%). For benchmarks where random guessing would score above 0 (e.g. multiple-choice benchmarks), we rescale scores so that random-guessing performance maps to zero.
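As an illustration of the chance-level rescaling described above, here is a minimal sketch. It assumes a linear rescaling that keeps a perfect score at 1; the text only states that random guessing maps to zero, so the exact formula is an assumption, and the function name `rescale_for_chance` is hypothetical.

```python
def rescale_for_chance(score: float, chance: float) -> float:
    """Linearly rescale a raw 0-1 score so that `chance` maps to 0 and 1 stays at 1.

    Assumption: the rescaling is linear. The source only says that
    random-guessing performance is mapped to zero.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (score - chance) / (1.0 - chance)


# Four-option multiple choice: random guessing scores 0.25.
print(rescale_for_chance(0.25, 0.25))   # 0.0
print(rescale_for_chance(0.625, 0.25))  # 0.5
print(rescale_for_chance(1.0, 0.25))    # 1.0
```

Under this sketch, a raw score below chance comes out negative. The text does not say how such scores are handled, so no clipping is applied here.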

To capture the upper end of each model’s capabilities, we aggregate across evaluation settings (e.g. thinking effort and inference provider), taking the highest score on each benchmark. We aggregate only over models that share both a name and a release date (e.g. we do not aggregate across versions of GPT-4o released on different days). We drop models with fewer than 4 benchmark scores from the model fit, to avoid low-certainty estimates. We also exclude models released before 2023 due to data sparsity; we hope to increase coverage in the future.
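The aggregation and filtering rules above can be sketched as follows. The record layout, the function name `aggregate_scores`, and the example model names and scores are all hypothetical, chosen only to illustrate the rules; this is not the actual pipeline.

```python
from collections import defaultdict
from datetime import date

# Hypothetical record layout:
# (model_name, release_date, setting, benchmark, score)


def aggregate_scores(records, min_benchmarks=4, cutoff=date(2023, 1, 1)):
    """Keep the best score per (model name, release date, benchmark), then filter."""
    best = defaultdict(dict)
    for name, released, _setting, benchmark, score in records:
        if released < cutoff:
            continue  # models released before 2023 are excluded
        key = (name, released)  # aggregate only within same name AND same release day
        previous = best[key].get(benchmark)
        if previous is None or score > previous:
            best[key][benchmark] = score  # highest score across settings
    # Drop models with fewer than `min_benchmarks` distinct benchmark scores.
    return {k: v for k, v in best.items() if len(v) >= min_benchmarks}


records = [
    ("model-a", date(2024, 5, 13), "low effort", "GPQA Diamond", 0.40),
    ("model-a", date(2024, 5, 13), "high effort", "GPQA Diamond", 0.55),
    ("model-a", date(2024, 5, 13), "default", "MATH Level 5", 0.50),
    ("model-a", date(2024, 5, 13), "default", "MMLU", 0.80),
    ("model-a", date(2024, 5, 13), "default", "GSM8K", 0.90),
    # Same name, later release day: treated as a separate model, then dropped
    # because it has only one benchmark score.
    ("model-a", date(2024, 8, 6), "default", "MMLU", 0.85),
    # Released before 2023: excluded.
    ("model-old", date(2022, 6, 1), "default", "MMLU", 0.60),
]

kept = aggregate_scores(records)
print(kept)
```

In this example only the `model-a` version released on 2024-05-13 survives, with its GPQA Diamond score taken from the higher-scoring setting.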