Epoch Capabilities Index

The Epoch Capabilities Index (ECI) combines scores from many different AI benchmarks into a single “general capability” scale, allowing comparisons between models even over timespans long enough for single benchmarks to reach saturation.

Learn more about how the ECI is calculated.

Domain-specific ECI Explorer

The Domain-specific Epoch Capabilities Index (ECI) uses the methodology and benchmark parameters from the general ECI, but only incorporates benchmarks from a specific domain, such as software engineering (SWE) or math.

Learn more about how the domain-specific ECI is calculated.

By default, a model must have scores on at least 2 benchmarks to be included. Results for models with very few benchmarks may be noisy and have unreliable confidence intervals.

More about this dataset

Documentation

The general ECI is a composite metric which uses scores from 40+ distinct benchmarks to generate a single, general capability scale. At a high level, ECI stitches together its component benchmarks, determining their relative difficulty by making comparisons wherever models are evaluated on multiple benchmarks. Individual models obtain higher ECI scores if they perform better on harder benchmarks.
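As a rough illustration of this stitching idea, the sketch below fits a simple logistic, item-response-style model to a toy score matrix: each model gets a capability parameter, each benchmark a difficulty and a slope, and the fit uses only the (model, benchmark) pairs that were actually evaluated. All data and parameter choices here are hypothetical; the actual ECI methodology is described in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_eci(scores, steps=5000, lr=0.05):
    """Jointly fit model capabilities, benchmark difficulties, and
    benchmark slopes by gradient descent on squared error, using only
    the observed (non-NaN) cells of the score matrix."""
    mask = ~np.isnan(scores)
    obs = np.nan_to_num(scores)
    n_models, n_bench = scores.shape
    theta = np.zeros(n_models)   # model capability
    diff = np.zeros(n_bench)     # benchmark difficulty
    slope = np.ones(n_bench)     # benchmark slope (discrimination)
    for _ in range(steps):
        delta = theta[:, None] - diff[None, :]
        pred = sigmoid(slope[None, :] * delta)
        grad_z = (pred - obs) * mask * pred * (1 - pred)
        theta -= lr * (grad_z * slope[None, :]).sum(axis=1)
        diff += lr * (grad_z * slope[None, :]).sum(axis=0)
        slope -= lr * (grad_z * delta).sum(axis=0)
        theta -= theta.mean()    # pin the location of the scale
    return theta, diff, slope

# Hypothetical scores (rows: models, cols: benchmarks); NaN = not evaluated.
scores = np.array([
    [0.90, 0.60, np.nan],
    [0.80, 0.45, 0.20],
    [np.nan, 0.30, 0.10],
    [0.95, 0.75, 0.40],
])
theta, diff, slope = fit_eci(scores)
```

Here `theta` plays the role of the capability scale: a model's fitted capability is informed by every benchmark it shares with other models, and strong performance on a benchmark with high fitted `diff` pushes its capability up more than strong performance on an easy one.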

We give an overview of our methodology here; further technical details are available in our paper, A Rosetta Stone for AI Benchmarks, which was funded by Google DeepMind, and written in collaboration with researchers from their AGI Safety & Alignment team. However, the ECI is an independent Epoch AI product that Epoch has full rights over.

The domain-specific ECIs show how LLMs’ ECI scores are influenced by the selection of benchmarks used to calculate them. They largely use the methodology of the general ECI, but incorporate only benchmarks from a specific domain, such as software engineering (SWE) or math. In particular, we keep the benchmark difficulties and slopes from the general ECI fit, and refit only the LLM capability parameters.
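A minimal sketch of that refitting step, assuming hypothetical difficulties and slopes held fixed from a general-ECI-style fit: only the model's capability parameter is re-estimated, against the domain's benchmark subset. All values and index choices are illustrative.

```python
import numpy as np

# Hypothetical parameters taken as fixed from a general-ECI-style fit.
diff = np.array([-1.0, 0.5, 2.0])    # benchmark difficulties
slope = np.array([1.0, 1.2, 0.8])    # benchmark slopes
math_idx = [1, 2]                    # illustrative "math" benchmark subset

def refit_capability(scores_row, bench_idx, steps=2000, lr=0.1):
    """Refit one model's capability on a benchmark subset, holding the
    benchmark difficulties and slopes fixed."""
    d, a, obs = diff[bench_idx], slope[bench_idx], scores_row[bench_idx]
    theta = 0.0
    for _ in range(steps):
        pred = 1.0 / (1.0 + np.exp(-a * (theta - d)))
        theta -= lr * ((pred - obs) * pred * (1 - pred) * a).sum()
    return theta

# A model that does relatively well on the harder "math" benchmarks
# ends up with a higher domain capability than general capability.
row = np.array([0.50, 0.80, 0.60])
general = refit_capability(row, [0, 1, 2])
math_only = refit_capability(row, math_idx)
```

Because the difficulties and slopes are shared with the general fit, the refit capability lands on the same scale, which is what makes a model's math ECI directly comparable to its general ECI.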

The results are given on a scale comparable to the general ECI, so if an LLM has a higher math ECI than general ECI, it performs better on math benchmarks than on non-math benchmarks. However, this scaling means the results cannot be used to assess progress trends across domains, since every domain is scaled to increase at the same overall pace as the general ECI (although it is still possible for, e.g., one family of LLMs to go from underperforming to overperforming within a domain). See the methodology for the full details.

Code for the ECI is available in a public repository here.

Frequently asked questions

What does the ECI represent?

How should I interpret ECI values?

Why isn’t this model’s ECI higher, if it leads some benchmarks?

How do you decide which benchmarks to use?

Why did the ECI score of a model change?

Isn’t it a problem if model developers release only their best scores?

Isn’t it a problem if model developers optimize for benchmark scores?

Why isn’t my model included?

Why isn’t my benchmark included?

Why do some LLMs have a general ECI but not a Math-ECI or SWE-ECI?

How should I interpret the results? E.g., what does it mean for an LLM to have a Math-ECI of 160 but a general ECI of 155?

Do the results mean that math and software engineering capabilities have been improving at the same rate as the general ECI?

Did you add additional math or software engineering benchmarks to construct the SWE-ECI or Math-ECI?