The technical foundation for the ECI comes from Item Response Theory (IRT), a statistical framework originally developed for educational testing. IRT makes it possible to compare students even when they took different tests, for example tests from different years, where one test may be harder than another.
The core of our model is a simple logistic function:
\(\textrm{performance}(m,b) = \sigma(\alpha_b [C_m - D_b])\)
Here, \(\sigma\) represents the logistic function, \(C_m\) is the model’s capability, \(D_b\) is the benchmark’s difficulty, and \(\alpha_b\) is a slope parameter related to the distribution of difficulty across questions within the benchmark. The formula says that a model’s performance on a given benchmark depends on how capable the model is relative to the benchmark’s difficulty, and on how “steep” the benchmark is. Higher \(\alpha_b\) values correspond to “steeper” benchmarks, where individual questions have a narrower range of difficulties and there is no long tail of much harder questions. These benchmarks tend to saturate quickly once models are able to gain some headway.
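To make the formula concrete, here is a minimal sketch in Python. The function name, the illustrative parameter values, and the example loop are ours for illustration only and are not part of the ECI implementation.

```python
import numpy as np

def predicted_performance(capability, difficulty, slope):
    """Logistic (IRT-style) curve: sigma(alpha_b * (C_m - D_b)).

    `capability`, `difficulty`, and `slope` correspond to C_m, D_b,
    and alpha_b in the formula above.
    """
    return 1.0 / (1.0 + np.exp(-slope * (capability - difficulty)))

# Illustrative values only: a "steep" benchmark (large slope) saturates
# quickly once a model's capability exceeds the benchmark's difficulty.
for slope in (0.5, 3.0):
    scores = [predicted_performance(c, difficulty=1.0, slope=slope)
              for c in (0.0, 1.0, 2.0)]
    print(f"slope={slope}: {[round(s, 3) for s in scores]}")
```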
We fit model capability (\(C_m\)), benchmark difficulty (\(D_b\)), and benchmark slope (\(\alpha_b\)) parameters that best explain the full set of observed scores. We do not assume any relationship between capability and time or compute inside the model.
We fit the model via non-linear least-squares estimation, with a ridge regularization penalty to discourage overfitting. The scale of the resulting values is arbitrary; we currently rescale them so that Claude 3.5 Sonnet is fixed at 130 and GPT-5 is fixed at 150, which allows consistent scoring across recent models in a way that balances communicating our uncertainty with providing detailed information.
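The sketch below illustrates this kind of fit using SciPy's `least_squares`. The toy score matrix, ridge strength, and anchor indices are hypothetical stand-ins; the actual ECI fitting code may differ in its details.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical observed scores: rows are models, columns are benchmarks.
# NaN marks (model, benchmark) pairs with no observed score.
scores = np.array([
    [0.85, 0.40, np.nan],
    [0.95, 0.70, 0.30],
    [np.nan, 0.90, 0.60],
])
n_models, n_benchmarks = scores.shape
observed = ~np.isnan(scores)

RIDGE = 0.1  # illustrative regularization strength

# Initial guess: zero capabilities/difficulties, unit slopes.
x0 = np.concatenate([np.zeros(n_models),
                     np.zeros(n_benchmarks),
                     np.ones(n_benchmarks)])

def unpack(params):
    C = params[:n_models]                          # model capabilities C_m
    D = params[n_models:n_models + n_benchmarks]   # benchmark difficulties D_b
    alpha = params[n_models + n_benchmarks:]       # benchmark slopes alpha_b
    return C, D, alpha

def residuals(params):
    C, D, alpha = unpack(params)
    pred = 1.0 / (1.0 + np.exp(-alpha[None, :] * (C[:, None] - D[None, :])))
    fit_res = (pred - scores)[observed]
    # Ridge penalty expressed as extra residuals that shrink parameters
    # toward the initial guess, discouraging overfitting on sparse data.
    return np.concatenate([fit_res, np.sqrt(RIDGE) * (params - x0)])

fit = least_squares(residuals, x0)
C, D, alpha = unpack(fit.x)

# Rescale capabilities so two anchor models land on fixed values,
# analogous to fixing Claude 3.5 Sonnet at 130 and GPT-5 at 150.
anchor_low, anchor_high = 0, 1  # hypothetical anchor model indices
eci = 130 + (C - C[anchor_low]) * (150 - 130) / (C[anchor_high] - C[anchor_low])
print(np.round(eci, 1))
```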