The Epoch Capabilities Index (ECI) combines scores from many different AI benchmarks into a single “general capability” scale, allowing comparisons between models even over timespans long enough for single benchmarks to reach saturation.
The Domain-specific Epoch Capabilities Index (ECI) uses the methodology and benchmark parameters from the general ECI, but only incorporates benchmarks from a specific domain, such as software engineering (SWE) or math.
Minimum number of benchmarks a model must have to be included. Results for models with very few benchmarks may be noisy and have unreliable CIs.
The general ECI is a composite metric which uses scores from 40+ distinct benchmarks to generate a single, general capability scale. At a high level, ECI stitches together its component benchmarks, determining their relative difficulty by making comparisons wherever models are evaluated on multiple benchmarks. Individual models obtain higher ECI scores if they perform better on harder benchmarks.
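The stitching idea can be illustrated with a toy item-response-style curve, where a model's expected accuracy on a benchmark depends on the model's ability and the benchmark's difficulty and slope. This is only a sketch of the general approach; the actual ECI model is specified in the paper, and the function and parameter names below are illustrative.

```python
import math

def predicted_score(ability, difficulty, slope):
    """Toy item-response curve: expected benchmark accuracy in (0, 1).

    Harder benchmarks (higher difficulty) require higher ability to
    score well; slope controls how sharply accuracy rises with ability.
    (Illustrative only; the actual ECI model is described in the paper.)
    """
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))

# A more capable model scores higher on every benchmark, but the gap
# is most visible on benchmarks near its own difficulty level.
weak, strong = 0.5, 2.5
hard_benchmark = 2.0
assert predicted_score(strong, hard_benchmark, 1.0) > predicted_score(weak, hard_benchmark, 1.0)
```

Because many models are evaluated on overlapping benchmark sets, both the ability parameters and the difficulty/slope parameters can be inferred jointly from the observed scores.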
We give an overview of our methodology here; further technical details are available in our paper, A Rosetta Stone for AI Benchmarks, which was funded by Google DeepMind, and written in collaboration with researchers from their AGI Safety & Alignment team. However, the ECI is an independent Epoch AI product that Epoch has full rights over.
The domain-specific ECIs show how LLMs’ ECI scores depend on the selection of benchmarks used to calculate them. They largely follow the methodology of the general ECI but only incorporate benchmarks from a specific domain, such as software engineering (SWE) or math. In particular, we keep the benchmark difficulty and slope parameters from the general ECI fit, and only refit the LLM capability parameters.
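The refit step can be sketched as a one-dimensional fit: with each benchmark's difficulty and slope frozen at the values from the general fit, only the model's capability parameter is re-estimated against the domain's observed scores. This is a minimal illustration under assumed names and a squared-error loss; the real fitting procedure and loss are described in the methodology.

```python
import math

def predicted(ability, difficulty, slope):
    """Toy item-response curve for expected benchmark accuracy."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))

def refit_ability(observed, grid_lo=-5.0, grid_hi=5.0, steps=2001):
    """Refit only the model's ability, holding each benchmark's
    difficulty and slope fixed (as in the domain-specific ECI).

    `observed` is a list of (score, difficulty, slope) triples for the
    domain's benchmarks; we minimize squared error over a 1-D grid.
    (Illustrative; the real fit and loss may differ.)
    """
    def loss(a):
        return sum((s - predicted(a, d, k)) ** 2 for s, d, k in observed)
    grid = [grid_lo + i * (grid_hi - grid_lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=loss)

# Two math benchmarks with known difficulty/slope; scores generated by
# a model with ability 1.0 should let us recover that ability.
obs = [(predicted(1.0, 0.0, 1.0), 0.0, 1.0),
       (predicted(1.0, 2.0, 1.5), 2.0, 1.5)]
assert abs(refit_ability(obs) - 1.0) < 0.01
```

Freezing the benchmark parameters is what keeps the domain-specific scores on a scale comparable to the general ECI.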
The results are given on a scale comparable to the general ECI, so if an LLM has a higher math ECI than general ECI, it performs better on math benchmarks than on non-math benchmarks. However, this scaling means that the results cannot be used to assess progress trends across domains, as all domains are scaled to increase at the same overall pace as the general ECI (though it is still possible for, e.g., one family of LLMs to go from underperforming to overperforming within a domain). See the methodology for full details.
Code for the ECI is available in a public repository here.
Similar to an IQ test, ECI is designed to capture a broad, underlying capability useful across many tasks, rather than performance on specific skills. ECI summarizes a model’s capability across benchmarks it has been evaluated on, giving more weight to the benchmarks that carry the most signal.
ECI scores are abstract values that can’t be interpreted in isolation, but they can be compared across models to judge meaningful differences in underlying capabilities. See the next question for more details.
Absolute ECI values are meaningless on their own, but meaningful comparisons can be made between models. The ECI scale is linear: in theory, a 10-point jump from 100 to 110 should be just as “impressive” as one from 140 to 150. That said, just because ECI is linear doesn’t mean that other quantities we care about are linear in ECI units. For instance, at the time of ECI’s launch, a 5-point gain in ECI appeared to roughly correspond to a doubling of the METR Time Horizon. Similar principles hold when looking at trends in ECI: absolute slope values are meaningless, but relative changes in slope indicate faster or slower progress.
The scale of ECI scores is arbitrary. To put values in a convenient range, we have scaled the raw scores so that Claude 3.5 Sonnet = 130 and GPT-5 = 150. As future models and benchmarks are incorporated, these calibration points may be revisited. Note that as with Elo scores, there is no maximum achievable ECI.
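Pinning two calibration models to fixed values amounts to an affine rescaling of the raw fitted scores. The anchor values (Claude 3.5 Sonnet = 130, GPT-5 = 150) come from the text above, but the raw capability values in this sketch are hypothetical placeholders.

```python
def rescale(raw, raw_anchor_lo, raw_anchor_hi, lo=130.0, hi=150.0):
    """Affine map sending the two anchor models' raw fitted scores
    to the published calibration values (130 and 150 by default)."""
    scale = (hi - lo) / (raw_anchor_hi - raw_anchor_lo)
    return lo + scale * (raw - raw_anchor_lo)

# Hypothetical raw capability values for the two calibration models:
raw_sonnet, raw_gpt5 = 1.8, 3.1
assert rescale(raw_sonnet, raw_sonnet, raw_gpt5) == 130.0
assert rescale(raw_gpt5, raw_sonnet, raw_gpt5) == 150.0
```

Because the map is affine, relative gaps between models are preserved, and there is no upper bound on the rescaled scores.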
Values are not linearly related to benchmark accuracy, since benchmarks differ in both difficulty and slope.
ECI reflects how well a model performs across many benchmarks. Performance on harder benchmarks is more informative about general capability, with benchmark difficulty inferred statistically from overlapping model results (not set by hand).
Models which are highly specialized may receive low ECI scores, despite being very capable within their domain.
As a general principle, Epoch aims to collect benchmarks that are diverse, economically valuable, unsaturated, and widely used. We supplemented our existing set of internally run evaluations and developer-reported scores with a set of older benchmarks, with scores obtained from developer reports (i.e. model cards and technical reports). Having a wider range of benchmark difficulties allows us to obtain more precise ECI estimates.
Benchmarks need to be scored between 0 and 1; for this reason, we left out Elo-style benchmarks like WebDev Arena.
The model used to produce ECI scores is fit jointly across all data. As new models and new benchmark evaluations are obtained, values may shift slightly even for models whose data have not changed.
It is true that evaluations from model developer reports might be cherry-picked in order to make the model look better. Relying on these evaluations alone could bias our estimation. We mitigate this by running models on our own internal evaluations, and by collecting evaluations from independently run leaderboards. Since these sources are reported regardless of outcome, they reduce the impact of cherry-picking.
Although our approach inherits the benefits of benchmarks, it also inherits their limitations. One shortcoming is that model developers can optimize for high performance on particular benchmarks, causing us to overestimate the capabilities of some models.
We strive to cover as many models as possible, with particular focus on plausibly-frontier level models. However, we require a minimum of 4 benchmark evaluations for any model to ensure ECI scores are both stable and fair. If a model you care about doesn’t yet have an ECI score, we are most likely trying to obtain more benchmark scores for it.
See “How do you decide which benchmarks to use?”. We are actively expanding the number of benchmarks we support while balancing considerations like breadth of coverage, relevance, and feasibility of maintenance.
We require at least 2 benchmarks within a domain to calculate its domain-specific ECI, to avoid overly noisy results. Some LLMs might pass the 4-benchmark minimum for inclusion in the general ECI but not have enough benchmarks within a given domain to be given a domain-specific ECI.
This means that the LLM’s performance on math benchmarks is what we would expect from an LLM with a general ECI of 160, but it performs less well on other benchmarks, resulting in a general ECI of only 155.
No, the methodology we use causes the domain-specific ECI values to all be scaled to the same level as the general ECI. See the methodology section for details.
No, the benchmarks used to calculate the domain-specific ECIs are a subset of the benchmarks used to construct the general ECI.
Have a question? Noticed something wrong? Let us know.