Similar to an IQ test, ECI is designed to capture a broad, underlying capability useful across many tasks, rather than performance on specific skills. ECI summarizes a model’s capability across benchmarks it has been evaluated on, giving more weight to the benchmarks that carry the most signal.
ECI scores are abstract values that can’t be interpreted in isolation, but they can be compared across models to judge meaningful differences in underlying capabilities. See the next question for more details.
Absolute ECI values are meaningless by themselves, but meaningful comparisons can be made between models. The ECI scale is linear: in theory, a 10-point jump should be equally “impressive” whether it takes a model from 100 to 110 or from 140 to 150. That said, just because ECI is linear doesn’t mean that other quantities we care about are linear in ECI units. For instance, at the time of ECI’s launch, a 5-point gain in ECI appeared to correspond roughly to a doubling of the METR Time Horizon. Similar principles hold when looking at trends in ECI: absolute slope values are meaningless, but relative changes in slope indicate faster or slower progress.
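To make that nonlinearity concrete, here is a minimal sketch of the exponential relationship implied by the doubling observation. The function name and reference values are hypothetical, and the 5-points-per-doubling figure is only the rough correspondence noted at launch, not a law:

```python
def implied_time_horizon(eci, eci_ref, horizon_ref_minutes, points_per_doubling=5.0):
    """If every 5 ECI points corresponds to one doubling of the METR Time
    Horizon (a rough empirical observation), the horizon grows
    exponentially in ECI. Illustrative sketch only; values are made up."""
    return horizon_ref_minutes * 2 ** ((eci - eci_ref) / points_per_doubling)

# A model 10 ECI points above a reference model with a 1-hour horizon would
# be expected to have roughly a 4-hour horizon (two doublings).
print(implied_time_horizon(eci=150, eci_ref=140, horizon_ref_minutes=60))  # 240.0
```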
The scale of ECI scores is arbitrary. To put values in a convenient range, we have scaled the raw scores so that Claude 3.5 Sonnet = 130 and GPT-5 = 150. As future models and benchmarks are incorporated, these calibration points may be revisited. Note that as with Elo scores, there is no maximum achievable ECI.
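For illustration, this calibration amounts to an affine rescaling that maps two anchor models onto fixed target values. The sketch below assumes raw scores on some arbitrary latent scale; the raw numbers and the third model are made up:

```python
def calibrate(raw_scores, anchor_a, anchor_b, target_a=130.0, target_b=150.0):
    """Affine-rescale raw scores so two anchor models land on fixed targets
    (here, Claude 3.5 Sonnet = 130 and GPT-5 = 150). Sketch only."""
    scale = (target_b - target_a) / (raw_scores[anchor_b] - raw_scores[anchor_a])
    offset = target_a - scale * raw_scores[anchor_a]
    return {model: scale * score + offset for model, score in raw_scores.items()}

raw = {"claude-3.5-sonnet": 1.2, "gpt-5": 2.0, "some-other-model": 1.6}
print(calibrate(raw, "claude-3.5-sonnet", "gpt-5"))
# {'claude-3.5-sonnet': 130.0, 'gpt-5': 150.0, 'some-other-model': 140.0}
```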
ECI values are not linearly related to benchmark accuracy, since benchmarks differ in both difficulty and slope (how sharply accuracy rises as underlying capability increases).
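One way to picture this, borrowing the two-parameter logistic link from item response theory (we are not claiming this is the exact functional form behind ECI): expected accuracy depends on the gap between a model’s ability and a benchmark’s difficulty, scaled by a per-benchmark slope, so accuracy saturates near 0 and 1:

```python
import math

def expected_accuracy(ability, difficulty, slope):
    # 2PL-style logistic link: the same ability gain translates into very
    # different accuracy gains depending on where a benchmark's difficulty
    # and slope place the model on the curve. Illustrative sketch only.
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))

# The same +1 gain in ability moves accuracy by different amounts:
for difficulty, slope in [(0.0, 1.0), (2.0, 0.5), (2.0, 3.0)]:
    before = expected_accuracy(1.0, difficulty, slope)
    after = expected_accuracy(2.0, difficulty, slope)
    print(f"difficulty={difficulty}, slope={slope}: {before:.2f} -> {after:.2f}")
```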
ECI reflects how well a model performs across many benchmarks. Performance on harder benchmarks is more informative about general capability, with benchmark difficulty inferred statistically from overlapping model results (not set by hand).
Models that are highly specialized may receive low ECI scores, despite being very capable within their domain.
As a general principle, Epoch aims to collect benchmarks that are diverse, economically valuable, unsaturated, and widely used. We supplemented our existing set of internally run and benchmark-developer-reported scores with a set of older benchmarks, with scores obtained from developer reports (i.e., model cards and technical reports). Having a wider range of benchmark difficulties allows us to obtain more precise ECI estimates.
Benchmarks need to be scored between 0 and 1; for this reason, we left out Elo-style benchmarks like WebDev Arena.
The model used to produce ECI scores is fit jointly across all data. As new models and new benchmark evaluations are obtained, values may shift slightly even for models whose data have not changed.
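Here is a minimal sketch of why that happens, using a toy logistic model fit by gradient descent on an incomplete accuracy matrix (not Epoch’s actual implementation). Abilities and difficulties are estimated jointly from overlapping results, so adding a new model’s evaluations nudges every estimate:

```python
import numpy as np

def fit(acc, mask, steps=3000, lr=0.05):
    """Jointly estimate model abilities and benchmark difficulties from an
    incomplete model-by-benchmark accuracy matrix, via gradient descent on a
    cross-entropy loss with a logistic link. Toy sketch only."""
    n_models, n_benchmarks = acc.shape
    ability = np.zeros(n_models)
    difficulty = np.zeros(n_benchmarks)
    for _ in range(steps):
        pred = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
        err = (pred - acc) * mask        # only observed cells contribute
        ability -= lr * err.sum(axis=1)
        difficulty += lr * err.sum(axis=0)
        shift = difficulty.mean()        # pin the scale's location (gauge fix)
        difficulty -= shift
        ability -= shift
    return ability, difficulty

acc = np.array([[0.90, 0.60, 0.20],
                [0.80, 0.40, 0.10],
                [0.95, 0.70, 0.30]])
mask = np.ones_like(acc)
before, _ = fit(acc, mask)

# Add a fourth model evaluated on only two benchmarks: refitting shifts the
# first three models' estimates slightly, even though their data are unchanged.
acc2 = np.vstack([acc, [0.70, 0.50, 0.00]])
mask2 = np.vstack([mask, [1.0, 1.0, 0.0]])
after, _ = fit(acc2, mask2)
print(np.round(before, 3), np.round(after[:3], 3))
```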
It is true that evaluations from model developer reports might be cherry-picked to make the model look better, and relying on these evaluations alone could bias our estimates. We mitigate this by running models on our own internal evaluations, and by collecting evaluations from independently run leaderboards. Since these sources are reported regardless of outcome, they reduce the impact of cherry-picking.
Although our approach inherits the benefits of benchmarks, it also inherits their limitations. One shortcoming is that model developers can optimize for high performance on certain benchmarks, leading us to overestimate the capabilities of some models.
We strive to cover as many models as possible, with particular focus on plausibly frontier-level models. However, we require a minimum of 4 benchmark evaluations for any model to ensure ECI scores are both stable and fair. If a model you care about doesn’t yet have an ECI score, we are most likely trying to obtain more benchmark scores for it.
See “How do you decide which benchmarks to use?”. We are actively expanding the number of benchmarks we support while balancing considerations like breadth of coverage, relevance, and feasibility of maintenance.