The Epoch Capabilities Index (ECI) combines scores from many different AI benchmarks into a single “general capability” scale, allowing comparisons between models even over timespans long enough for single benchmarks to reach saturation.
The Domain-specific Epoch Capabilities Index (ECI) uses the methodology and benchmark parameters from the general ECI, but only incorporates benchmarks from a specific domain, such as software engineering (SWE) or math.
Minimum number of benchmarks a model must have to be included. Results for models with very few benchmarks may be noisy and have unreliable CIs.
The general ECI is a composite metric which uses scores from 40+ distinct benchmarks to generate a single, general capability scale. At a high level, ECI stitches together its component benchmarks, determining their relative difficulty by making comparisons wherever models are evaluated on multiple benchmarks. Individual models obtain higher ECI scores if they perform better on harder benchmarks.
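The stitching idea can be illustrated with a toy item-response-style curve, where a model's expected accuracy on a benchmark depends on the model's ability and the benchmark's difficulty and slope. This is only a sketch of the general approach; the actual ECI model is specified in the paper, and the function and parameter names below are illustrative.

```python
import math

def predicted_score(ability, difficulty, slope):
    """Toy item-response curve: expected benchmark accuracy in (0, 1).

    Harder benchmarks (higher difficulty) require higher ability to
    score well; slope controls how sharply accuracy rises with ability.
    (Illustrative only; the actual ECI model is described in the paper.)
    """
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))

# A more capable model scores higher on every benchmark, but the gap
# is most visible on benchmarks near its own difficulty level.
weak, strong = 0.5, 2.5
hard_benchmark = 2.0
assert predicted_score(strong, hard_benchmark, 1.0) > predicted_score(weak, hard_benchmark, 1.0)
```

Because many models are evaluated on overlapping benchmark sets, both the ability parameters and the difficulty/slope parameters can be inferred jointly from the observed scores.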
We give an overview of our methodology here; further technical details are available in our paper, A Rosetta Stone for AI Benchmarks, which was funded by Google DeepMind, and written in collaboration with researchers from their AGI Safety & Alignment team. However, the ECI is an independent Epoch AI product that Epoch has full rights over.
The domain-specific ECIs show how LLMs’ ECI scores depend on the selection of benchmarks used to calculate them. They largely follow the methodology of the general ECI but only incorporate benchmarks from a specific domain, such as software engineering (SWE) or math. In particular, we keep the benchmark difficulty and slope parameters from the general ECI fit, and only refit the LLM capability parameters.
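The refit step can be sketched as a one-dimensional fit: with each benchmark's difficulty and slope frozen at the values from the general fit, only the model's capability parameter is re-estimated against the domain's observed scores. This is a minimal illustration under assumed names and a squared-error loss; the real fitting procedure and loss are described in the methodology.

```python
import math

def predicted(ability, difficulty, slope):
    """Toy item-response curve for expected benchmark accuracy."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))

def refit_ability(observed, grid_lo=-5.0, grid_hi=5.0, steps=2001):
    """Refit only the model's ability, holding each benchmark's
    difficulty and slope fixed (as in the domain-specific ECI).

    `observed` is a list of (score, difficulty, slope) triples for the
    domain's benchmarks; we minimize squared error over a 1-D grid.
    (Illustrative; the real fit and loss may differ.)
    """
    def loss(a):
        return sum((s - predicted(a, d, k)) ** 2 for s, d, k in observed)
    grid = [grid_lo + i * (grid_hi - grid_lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=loss)

# Two math benchmarks with known difficulty/slope; scores generated by
# a model with ability 1.0 should let us recover that ability.
obs = [(predicted(1.0, 0.0, 1.0), 0.0, 1.0),
       (predicted(1.0, 2.0, 1.5), 2.0, 1.5)]
assert abs(refit_ability(obs) - 1.0) < 0.01
```

Freezing the benchmark parameters is what keeps the domain-specific scores on a scale comparable to the general ECI.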
The results are given on a scale comparable to the general ECI, so if an LLM has a higher math ECI than general ECI, it performs better on math benchmarks than on non-math benchmarks. However, this scaling means that the results cannot be used to assess progress trends across domains, as all domains are scaled to increase at the same overall pace as the general ECI (though it is still possible for, e.g., one family of LLMs to go from underperforming to overperforming within a domain). See the methodology for full details.
Code for the ECI is available in a public repository here.
Similar to an IQ test, ECI is designed to capture a broad, underlying capability useful across many tasks, rather than performance on specific skills. ECI summarizes a model’s capability across benchmarks it has been evaluated on, giving more weight to the benchmarks that carry the most signal.
ECI scores are abstract values that can’t be interpreted in isolation, but they can be compared across models to judge meaningful differences in underlying capabilities. See the next question for more details.
Absolute ECI values are meaningless on their own, but meaningful comparisons can be made between models. The ECI scale is linear: in theory, a 10-point jump from 100 to 110 should be just as “impressive” as one from 140 to 150. That said, just because ECI is linear doesn’t mean that other quantities we care about are linear in ECI units. For instance, at the time of ECI’s launch, a 5-point gain in ECI appeared to roughly correspond to a doubling of the METR Time Horizon. Similar principles hold when looking at trends in ECI: absolute slope values are meaningless, but relative changes in slope indicate faster or slower progress.
The scale of ECI scores is arbitrary. To put values in a convenient range, we have scaled the raw scores so that Claude 3.5 Sonnet = 130 and GPT-5 = 150. As future models and benchmarks are incorporated, these calibration points may be revisited. Note that as with Elo scores, there is no maximum achievable ECI.
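Pinning two calibration models to fixed values amounts to an affine rescaling of the raw fitted scores. The anchor values (Claude 3.5 Sonnet = 130, GPT-5 = 150) come from the text above, but the raw capability values in this sketch are hypothetical placeholders.

```python
def rescale(raw, raw_anchor_lo, raw_anchor_hi, lo=130.0, hi=150.0):
    """Affine map sending the two anchor models' raw fitted scores
    to the published calibration values (130 and 150 by default)."""
    scale = (hi - lo) / (raw_anchor_hi - raw_anchor_lo)
    return lo + scale * (raw - raw_anchor_lo)

# Hypothetical raw capability values for the two calibration models:
raw_sonnet, raw_gpt5 = 1.8, 3.1
assert rescale(raw_sonnet, raw_sonnet, raw_gpt5) == 130.0
assert rescale(raw_gpt5, raw_sonnet, raw_gpt5) == 150.0
```

Because the map is affine, relative gaps between models are preserved, and there is no upper bound on the rescaled scores.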
Values are not linearly related to benchmark accuracy, since benchmarks differ in both difficulty and slope.
ECI reflects how well a model performs across many benchmarks. Performance on harder benchmarks is more informative about general capability, with benchmark difficulty inferred statistically from overlapping model results (not set by hand).
Models which are highly specialized may receive low ECI scores, despite being very capable within their domain.
As a general principle, Epoch aims to collect benchmarks that are diverse, economically valuable, unsaturated, and widely used. We supplemented our existing set of internally run evaluations and developer-reported scores with a set of older benchmarks, with scores obtained from developer reports (i.e. model cards and technical reports). Having a wider range of benchmark difficulties allows us to obtain more precise ECI estimates.
Benchmarks need to be scored between 0 and 1; for this reason, we left out Elo-style benchmarks like WebDev Arena.
The model used to produce ECI scores is fit jointly across all data. As new models and new benchmark evaluations are obtained, values may shift slightly even for models whose data have not changed.
It is true that evaluations from model developer reports might be cherry-picked in order to make the model look better. Relying on these evaluations alone could bias our estimation. We mitigate this by running models on our own internal evaluations, and by collecting evaluations from independently run leaderboards. Since these sources are reported regardless of outcome, they reduce the impact of cherry-picking.
Although our approach inherits the benefits of benchmarks, it also inherits their limitations. One shortcoming is that model developers can optimize for high performance on particular benchmarks, causing us to overestimate the capabilities of some models.
We strive to cover as many models as possible, with particular focus on plausibly-frontier level models. However, we require a minimum of 4 benchmark evaluations for any model to ensure ECI scores are both stable and fair. If a model you care about doesn’t yet have an ECI score, we are most likely trying to obtain more benchmark scores for it.
See “How do you decide which benchmarks to use?”. We are actively expanding the number of benchmarks we support while balancing considerations like breadth of coverage, relevance, and feasibility of maintenance.
We require at least 2 benchmarks within a domain to calculate its domain-specific ECI, to avoid overly noisy results. Some LLMs might pass the 4-benchmark minimum for inclusion in the general ECI but not have enough benchmarks within a given domain to be given a domain-specific ECI.
This means that the LLM’s performance on math benchmarks is what we would expect from an LLM with a general ECI of 160, but it performs less well on other benchmarks, resulting in a general ECI of only 155.
No, the methodology we use causes the domain-specific ECI values to all be scaled to the same level as the general ECI. See the methodology section for details.
No, the benchmarks used to calculate the domain-specific ECIs are a subset of the benchmarks used to construct the general ECI.
Have a question? Noticed something wrong? Let us know.