Similar to an IQ test, ECI is designed to capture a broad, underlying capability useful across many tasks, rather than performance on specific skills. ECI summarizes a model’s capability across benchmarks it has been evaluated on, giving more weight to the benchmarks that carry the most signal.
ECI scores are abstract values that can’t be interpreted in isolation, but they can be compared across models to judge meaningful differences in underlying capabilities. See the next question for more details.
Absolute ECI values are meaningless by themselves, but meaningful comparisons can be made between models. The ECI scale is linear: in theory, a 10-point jump should be equally “impressive” whether it takes a model from 100 to 110 or from 140 to 150. That said, just because ECI is linear doesn’t mean that other quantities we care about are linear in ECI units. For instance, at the time of ECI’s launch, a 5-point gain in ECI appeared to correspond roughly to a doubling of the METR Time Horizon. Similar principles hold when looking at trends in ECI: absolute slope values are meaningless, but relative changes in slope indicate faster or slower progress.
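To make that nonlinearity concrete, here is a minimal sketch of the exponential relationship implied by the doubling observation. The function name and reference values are hypothetical, and the 5-points-per-doubling figure is only the rough correspondence noted at launch, not a law:

```python
def implied_time_horizon(eci, eci_ref, horizon_ref_minutes, points_per_doubling=5.0):
    """If every 5 ECI points corresponds to one doubling of the METR Time
    Horizon (a rough empirical observation), the horizon grows
    exponentially in ECI. Illustrative sketch only; values are made up."""
    return horizon_ref_minutes * 2 ** ((eci - eci_ref) / points_per_doubling)

# A model 10 ECI points above a reference model with a 1-hour horizon would
# be expected to have roughly a 4-hour horizon (two doublings).
print(implied_time_horizon(eci=150, eci_ref=140, horizon_ref_minutes=60))  # 240.0
```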
The scale of ECI scores is arbitrary. To put values in a convenient range, we have scaled the raw scores so that Claude 3.5 Sonnet = 130 and GPT-5 = 150. As future models and benchmarks are incorporated, these calibration points may be revisited. Note that as with Elo scores, there is no maximum achievable ECI.
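For illustration, this calibration amounts to an affine rescaling that maps two anchor models onto fixed target values. The sketch below assumes raw scores on some arbitrary latent scale; the raw numbers and the third model are made up:

```python
def calibrate(raw_scores, anchor_a, anchor_b, target_a=130.0, target_b=150.0):
    """Affine-rescale raw scores so two anchor models land on fixed targets
    (here, Claude 3.5 Sonnet = 130 and GPT-5 = 150). Sketch only."""
    scale = (target_b - target_a) / (raw_scores[anchor_b] - raw_scores[anchor_a])
    offset = target_a - scale * raw_scores[anchor_a]
    return {model: scale * score + offset for model, score in raw_scores.items()}

raw = {"claude-3.5-sonnet": 1.2, "gpt-5": 2.0, "some-other-model": 1.6}
print(calibrate(raw, "claude-3.5-sonnet", "gpt-5"))
# {'claude-3.5-sonnet': 130.0, 'gpt-5': 150.0, 'some-other-model': 140.0}
```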
ECI values are not linearly related to benchmark accuracy, since benchmarks differ in both difficulty and slope (how sharply accuracy rises as underlying capability increases).
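One way to picture this, borrowing the two-parameter logistic link from item response theory (we are not claiming this is the exact functional form behind ECI): expected accuracy depends on the gap between a model’s ability and a benchmark’s difficulty, scaled by a per-benchmark slope, so accuracy saturates near 0 and 1:

```python
import math

def expected_accuracy(ability, difficulty, slope):
    # 2PL-style logistic link: the same ability gain translates into very
    # different accuracy gains depending on where a benchmark's difficulty
    # and slope place the model on the curve. Illustrative sketch only.
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))

# The same +1 gain in ability moves accuracy by different amounts:
for difficulty, slope in [(0.0, 1.0), (2.0, 0.5), (2.0, 3.0)]:
    before = expected_accuracy(1.0, difficulty, slope)
    after = expected_accuracy(2.0, difficulty, slope)
    print(f"difficulty={difficulty}, slope={slope}: {before:.2f} -> {after:.2f}")
```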
ECI reflects how well a model performs across many benchmarks. Performance on harder benchmarks is more informative about general capability, with benchmark difficulty inferred statistically from overlapping model results (not set by hand).
Models that are highly specialized may receive low ECI scores, despite being very capable within their domain.
As a general principle, Epoch aims to collect benchmarks that are diverse, economically valuable, unsaturated, and widely used. We supplemented our existing set of internally run and benchmark-developer-reported scores with a set of older benchmarks, with scores obtained from developer reports (i.e., model cards and technical reports). Having a wider range of benchmark difficulties allows us to obtain more precise ECI estimates.
Benchmarks need to be scored between 0 and 1; for this reason, we left out Elo-style benchmarks like WebDev Arena.
The model used to produce ECI scores is fit jointly across all data. As new models and new benchmark evaluations are obtained, values may shift slightly even for models whose data have not changed.
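Here is a minimal sketch of why that happens, using a toy logistic model fit by gradient descent on an incomplete accuracy matrix (not Epoch’s actual implementation). Abilities and difficulties are estimated jointly from overlapping results, so adding a new model’s evaluations nudges every estimate:

```python
import numpy as np

def fit(acc, mask, steps=3000, lr=0.05):
    """Jointly estimate model abilities and benchmark difficulties from an
    incomplete model-by-benchmark accuracy matrix, via gradient descent on a
    cross-entropy loss with a logistic link. Toy sketch only."""
    n_models, n_benchmarks = acc.shape
    ability = np.zeros(n_models)
    difficulty = np.zeros(n_benchmarks)
    for _ in range(steps):
        pred = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
        err = (pred - acc) * mask        # only observed cells contribute
        ability -= lr * err.sum(axis=1)
        difficulty += lr * err.sum(axis=0)
        shift = difficulty.mean()        # pin the scale's location (gauge fix)
        difficulty -= shift
        ability -= shift
    return ability, difficulty

acc = np.array([[0.90, 0.60, 0.20],
                [0.80, 0.40, 0.10],
                [0.95, 0.70, 0.30]])
mask = np.ones_like(acc)
before, _ = fit(acc, mask)

# Add a fourth model evaluated on only two benchmarks: refitting shifts the
# first three models' estimates slightly, even though their data are unchanged.
acc2 = np.vstack([acc, [0.70, 0.50, 0.00]])
mask2 = np.vstack([mask, [1.0, 1.0, 0.0]])
after, _ = fit(acc2, mask2)
print(np.round(before, 3), np.round(after[:3], 3))
```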
It is true that evaluations from model developer reports might be cherry-picked to make the model look better, and relying on these evaluations alone could bias our estimates. We mitigate this by running models on our own internal evaluations, and by collecting evaluations from independently run leaderboards. Since these sources are reported regardless of outcome, they reduce the impact of cherry-picking.
Although our approach inherits the benefits of benchmarks, it also inherits their limitations. One shortcoming is that model developers can optimize for high performance on certain benchmarks, leading us to overestimate the capabilities of some models.
We strive to cover as many models as possible, with particular focus on plausibly frontier-level models. However, we require a minimum of 4 benchmark evaluations for any model to ensure ECI scores are both stable and fair. If a model you care about doesn’t yet have an ECI score, we are most likely trying to obtain more benchmark scores for it.
See “How do you decide which benchmarks to use?”. We are actively expanding the number of benchmarks we support while balancing considerations like breadth of coverage, relevance, and feasibility of maintenance.