Overview

ECI is a composite metric which uses scores from 39 distinct benchmarks to generate a single, general capability scale. At a high level, ECI stitches together its component benchmarks, determining their relative difficulty by making comparisons wherever models are evaluated on multiple benchmarks. Individual models obtain higher ECI scores if they perform better on harder benchmarks.

We give an overview of our methodology below; further technical details will be available in our forthcoming paper, A Rosetta Stone for AI Benchmarks, which was funded by Google DeepMind and written in collaboration with researchers from their AGI Safety & Alignment team. However, the ECI is an independent Epoch AI product that Epoch has full rights over.

Model

The technical foundation for the ECI comes from Item Response Theory (IRT), a statistical framework originally developed for educational testing. IRT makes it possible to compare students even when they took different tests, for example tests from different years, one of which may be harder than the other.

The core of our model is a simple logistic function:

\[\textrm{performance}(m,b) = \sigma(\alpha_b [C_m - D_b])\]

Here, \(\sigma\) represents the logistic function, \(C_m\) is the model’s capability, \(D_b\) is the benchmark’s difficulty, and \(\alpha_b\) is a slope parameter related to the distribution of difficulty across questions within the benchmark. The formula says that a model’s performance on a given benchmark depends on how capable it is relative to the benchmark’s difficulty, and on how “steep” the benchmark is. Higher \(\alpha_b\) values correspond to “steeper” benchmarks, where individual questions have a narrower range of difficulties and there is no long tail of much harder questions. These benchmarks tend to saturate quickly once models gain some headway.
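
As a rough illustration, this prediction can be written out directly. The following sketch is not our production code, and the parameter values are invented purely to show the shape of the curve:

```python
import numpy as np

def predicted_performance(capability, difficulty, slope):
    """Two-parameter logistic curve: sigma(alpha_b * (C_m - D_b))."""
    return 1.0 / (1.0 + np.exp(-slope * (capability - difficulty)))

# Invented example values: a model somewhat above the benchmark's
# difficulty, on a moderately steep benchmark.
print(predicted_performance(capability=1.2, difficulty=0.8, slope=2.5))
# -> roughly 0.73, i.e. the model is predicted to score about 73%
```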

We fit model capability (\(C_m\)), benchmark difficulty (\(D_b\)), and benchmark slope (\(\alpha_b\)) parameters that best explain the full set of observed scores. We do not assume any relationship between capability and time or compute inside the model.

We fit the model via non-linear least-squares estimation, using a ridge regularization penalty to discourage overfitting. The scale of the resulting values is arbitrary; we currently rescale so that Claude 3.5 Sonnet is fixed at 130 and GPT-5 (medium) is fixed at 150, which keeps scoring consistent across recent models while balancing the detail we provide against our underlying uncertainty.
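
To make the fitting procedure concrete, here is a stylized sketch of this kind of joint fit. It is not our actual pipeline: the toy data, the log-parameterization of the slopes, and the regularization strength are placeholder choices made for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy data: scores[i] is the score of model m_idx[i] on benchmark b_idx[i].
# The real fit uses ~1,100 scores across 147 models and 39 benchmarks.
rng = np.random.default_rng(0)
n_models, n_benchmarks = 5, 4
m_idx = np.repeat(np.arange(n_models), n_benchmarks)
b_idx = np.tile(np.arange(n_benchmarks), n_models)
scores = rng.uniform(0.0, 1.0, size=m_idx.size)

def unpack(params):
    C = params[:n_models]                             # model capabilities
    D = params[n_models:n_models + n_benchmarks]      # benchmark difficulties
    alpha = np.exp(params[n_models + n_benchmarks:])  # slopes, kept positive
    return C, D, alpha

def residuals(params, lam=0.1):
    C, D, alpha = unpack(params)
    pred = 1.0 / (1.0 + np.exp(-alpha[b_idx] * (C[m_idx] - D[b_idx])))
    # Append a ridge penalty as extra residuals to discourage overfitting.
    return np.concatenate([pred - scores, np.sqrt(lam) * params])

fit = least_squares(residuals, x0=np.zeros(n_models + 2 * n_benchmarks))
C_hat, D_hat, alpha_hat = unpack(fit.x)
```

Note that shifting all capabilities and difficulties by a constant, or stretching them while shrinking the slopes to match, leaves the predictions unchanged; this is why the scale is arbitrary and the final scores are pinned to two reference models.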

Data

We fit the model to 1123 benchmark scores covering 147 models and 39 benchmarks spanning 2023–present, drawn from the Epoch Benchmarking Hub. We use the following benchmarks:

Internal evaluations: FrontierMath Tiers 1-3, FrontierMath Tier 4, GPQA Diamond, MATH Level 5, OTIS Mock AIME 2024-2025, SWE-Bench Verified

External benchmark leaderboards: Aider Polyglot, BALROG, DeepResearch Bench, Factorio Learning Environment, Fiction.liveBench, GeoBench, GSO, SimpleBench, Terminal-Bench, VPCT, WeirdML V2

Developer reported scores: ANLI, ARC (AI2), ARC-AGI, BIG-Bench Hard, BoolQ, CADEval, Cybench, GSM8K, HellaSwag, LAMBADA, Lech Mazur Writing, LiveBench, MMLU, OpenBookQA, OSWorld, OSUniverse, PIQA, ScienceQA, SuperGLUE, TriviaQA, Video MME, WinoGrande

To be used in our methodology, benchmarks need to be scored on a 0-1 scale (or equivalently, 0% to 100%). For benchmarks where random guessing would score above 0 (e.g. those with multiple-choice responses), we rescale so that random-guessing performance maps to zero.
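
For example, a four-option multiple-choice benchmark has a chance accuracy of \(r = 0.25\); the adjustment described here amounts to a linear rescaling along the lines of

\[\textrm{score}_{\textrm{adjusted}} = \frac{\textrm{score}_{\textrm{raw}} - r}{1 - r},\]

which maps chance-level performance to 0 while leaving a perfect score at 1.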

We drop any models with fewer than 4 benchmark scores from our model fit, to avoid low-certainty estimates. We also exclude models released before 2023 due to data sparsity; we hope to increase coverage in the future.

FAQ

What does the ECI represent?

Similar to an IQ test, ECI is designed to capture a broad, underlying capability useful across many tasks, rather than performance on specific skills. ECI summarizes a model’s capability across benchmarks it has been evaluated on, giving more weight to the benchmarks that carry the most signal.

ECI scores are abstract values that can’t be interpreted in isolation, but they can be compared across models to judge meaningful differences in underlying capabilities. See the next question for more details.

How should I interpret ECI values?

ECI scores are similar to Elo scores; absolute values are meaningless by themselves, but meaningful comparisons can be made between models. The ECI scale is linear; in theory, a 10-point jump should be equally “impressive” when moving from 100 to 110 as when moving from 140 to 150. Similar principles hold when looking at trends in ECI; absolute slope values are meaningless, but relative changes in slope indicate faster or slower progress.

The scale of ECI scores is arbitrary. To put values in a convenient range, we have scaled the raw scores so that Claude 3.5 Sonnet = 130 and GPT-5 = 150. As future models and benchmarks are incorporated, these calibration points may be revisited. Note that as with Elo scores, there is no maximum achievable ECI.
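
Concretely, writing \(C_m\) for the fitted capability of model \(m\), the published scores are an affine transformation of the form

\[\textrm{ECI}_m = 130 + 20 \cdot \frac{C_m - C_{\textrm{Claude 3.5 Sonnet}}}{C_{\textrm{GPT-5}} - C_{\textrm{Claude 3.5 Sonnet}}},\]

which pins the two calibration models at 130 and 150 while leaving all relative differences between models unchanged.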

Values are not linearly related to benchmark accuracy, since benchmarks differ in both difficulty and slope.

Why isn’t this model’s ECI higher, if it leads some benchmarks?

ECI reflects how well a model performs across many benchmarks. Performance on harder benchmarks is more informative about general capability, with benchmark difficulty inferred statistically from overlapping model results (not set by hand).

Models which are highly specialized may receive low ECI scores, despite being very capable within their domain.

How do you decide which benchmarks to use?

As a general principle, Epoch aims to collect benchmarks that are diverse, economically valuable, unsaturated, and widely used. We supplemented our existing set of internally run evaluations and benchmark-developer-reported scores with a set of older benchmarks, whose scores we obtained from developer reports (i.e. model cards and technical reports). Having a wider range of benchmark difficulties allows us to obtain more precise ECI estimates.

Benchmarks need to be scored between 0 and 1; for this reason, we left out Elo-style benchmarks like WebDev Arena.

Why did the ECI score of a model change?

The model used to produce ECI scores is fit jointly across all data. As new models and new benchmark evaluations are obtained, values may shift slightly even for models whose data have not changed.

Isn’t it a problem if model developers release only their best scores?

It is true that evaluations from model developer reports might be cherry-picked in order to make the model look better, and relying on these evaluations alone could bias our estimates. We mitigate this by running models on our own internal evaluations and by collecting evaluations from independently run leaderboards. Since these sources are reported regardless of outcome, they reduce the impact of cherry-picking.

Isn’t it a problem if model developers optimize for benchmark scores?

Although our approach inherits the benefits of benchmarks, it also inherits their limitations. One shortcoming is that model developers can optimize for high performance on certain benchmarks, which may lead us to overestimate the capabilities of some models.

Why isn’t my model included?

We strive to cover as many models as possible, with particular focus on plausibly-frontier level models. However, we require a minimum of 4 benchmark evaluations for any model to ensure ECI scores are both stable and fair. If a model you care about doesn’t yet have an ECI score, we are most likely trying to obtain more benchmark scores for it.

Why isn’t my benchmark included?

See “How do you decide which benchmarks to use?”. We are actively expanding the number of benchmarks we support while balancing considerations like breadth of coverage, relevance, and feasibility of maintenance.

Acknowledgements

This work was based on research conducted with support from Google DeepMind, and thus draws directly on the methodology introduced in our forthcoming joint paper, A Rosetta Stone for AI Benchmarks. However, the ECI is an independent Epoch AI product that Epoch has full rights over. We thank Rohin Shah, Samuel Albanie, Anna Wang, Eli Lifland, Nate Rush, Ezra Edelman, and Isabel Juniewicz.