ECI Documentation

Data

We fit the model to scores on 41 benchmarks drawn from the Epoch Benchmarking Hub, covering models released from 2023 to the present. We use the following benchmarks:

Internal evaluations

Chess Puzzles, FrontierMath Tiers 1-3, FrontierMath Tier 4, GPQA Diamond, MATH Level 5, OTIS Mock AIME 2024-2025, SimpleQA Verified, SWE-Bench Verified

External benchmark leaderboards

Aider polyglot, APEX-Agents, ARC-AGI-2, BALROG, DeepResearch Bench, Fiction.liveBench, GeoBench, GSO, HLE, Lech Mazur Writing, OS World, PostTrainBench, SimpleBench, Terminal-Bench, The Agent Company, VPCT, WeirdML V2

Developer reported scores

ANLI, ARC AI2, ARC-AGI, BIG-Bench Hard, CADEval, Cybench, GSM8K, HellaSwag, LAMBADA, MMLU, OpenBookQA, PIQA, ScienceQA, SuperGLUE, TriviaQA, WinoGrande

Preprocessing

To be used in our methodology, benchmarks need to be scored on a 0-1 scale (or equivalently, 0% to 100%). For benchmarks where random guessing would score above 0 (e.g. those with multiple choice responses), we rescale scores so that random-guessing performance maps to zero.
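As a rough illustration of the chance correction, the sketch below assumes a linear rescaling that sends chance-level performance to 0 and leaves a perfect score at 1. The text above only states that random guessing is mapped to zero, so the exact formula, the function name `rescale_for_chance`, and the choice not to clip below-chance scores are all assumptions made for this example.

```python
def rescale_for_chance(score: float, chance: float) -> float:
    """Rescale a raw 0-1 score so that chance-level performance becomes 0.

    Assumption: a linear map that keeps a perfect score at 1. Scores below
    chance come out negative here; whether they should be clipped is not
    specified in the text.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (score - chance) / (1.0 - chance)


# A four-option multiple-choice benchmark has a chance rate of 0.25.
raw_percent = 70.0                      # score reported as a percentage
raw = raw_percent / 100.0               # convert to the 0-1 scale first
adjusted = rescale_for_chance(raw, chance=0.25)
print(round(adjusted, 3))
```

Under this assumed formula, a raw score of 70% on a four-option benchmark corresponds to an adjusted score of about 0.6, and a raw score of exactly 25% corresponds to 0.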

In order to capture the upper end of each model’s capabilities, we aggregate across model evaluation settings (e.g. thinking effort and inference provider), taking the highest score for each benchmark. We only aggregate over models released on the same day with the same name (e.g. we do not aggregate across versions of GPT-4o released on different days). We drop any models with fewer than 4 benchmark scores from our model fit, to avoid low-certainty estimates. We also exclude models released before 2023 due to data sparsity; we hope to increase coverage in the future.
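The aggregation and filtering steps described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the row layout `(name, release_date, benchmark, score)`, the function name `preprocess`, and the constant names are hypothetical, and scores are assumed to be already rescaled to the 0-1 range.

```python
from collections import defaultdict
from datetime import date

MIN_BENCHMARKS = 4                 # models with fewer scores are dropped
EARLIEST_RELEASE = date(2023, 1, 1)  # models released before 2023 are excluded


def preprocess(rows):
    """Aggregate and filter rows of (name, release_date, benchmark, score).

    Rows that share a model name, release date, and benchmark are treated as
    different evaluation settings of the same model, and only the highest
    score is kept.
    """
    best = {}
    for name, released, benchmark, score in rows:
        key = (name, released, benchmark)
        best[key] = max(score, best.get(key, score))

    # Regroup by model; name and release date together identify a model,
    # so same-named versions released on different days stay separate.
    per_model = defaultdict(dict)
    for (name, released, benchmark), score in best.items():
        per_model[(name, released)][benchmark] = score

    return {
        model: scores
        for model, scores in per_model.items()
        if model[1] >= EARLIEST_RELEASE and len(scores) >= MIN_BENCHMARKS
    }
```

Keying on the pair of name and release date is what keeps, for example, two versions of a model released on different days from being merged, even though they share a name.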

