The technical foundation for the ECI comes from Item Response Theory (IRT), a statistical framework originally developed for educational testing. IRT makes it possible to compare students even when they took different tests, for example tests from different years, where one test may be harder than another.
The core of our model is a simple logistic function:
\(\textrm{performance}(m,b) = \sigma(\alpha_b [C_m - D_b])\)
Here, \(\sigma\) represents the logistic function, \(C_m\) is the model’s capability, \(D_b\) is the benchmark’s difficulty, and \(\alpha_b\) is a slope parameter related to the distribution of difficulty across questions within the benchmark. The formula says that a model’s performance on a given benchmark depends on how capable the model is relative to the benchmark’s difficulty, and on how “steep” the benchmark is. Higher \(\alpha_b\) values correspond to “steeper” benchmarks, where individual questions have a narrower range of difficulties and there is no long tail of much harder questions. These benchmarks tend to saturate quickly once models are able to gain some headway.
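To make the formula concrete, here is a minimal sketch in Python. The function name, the illustrative parameter values, and the example loop are ours for illustration only and are not part of the ECI implementation.

```python
import numpy as np

def predicted_performance(capability, difficulty, slope):
    """Logistic (IRT-style) curve: sigma(alpha_b * (C_m - D_b)).

    `capability`, `difficulty`, and `slope` correspond to C_m, D_b,
    and alpha_b in the formula above.
    """
    return 1.0 / (1.0 + np.exp(-slope * (capability - difficulty)))

# Illustrative values only: a "steep" benchmark (large slope) saturates
# quickly once a model's capability exceeds the benchmark's difficulty.
for slope in (0.5, 3.0):
    scores = [predicted_performance(c, difficulty=1.0, slope=slope)
              for c in (0.0, 1.0, 2.0)]
    print(f"slope={slope}: {[round(s, 3) for s in scores]}")
```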
We fit model capability (\(C_m\)), benchmark difficulty (\(D_b\)), and benchmark slope (\(\alpha_b\)) parameters that best explain the full set of observed scores. We do not assume any relationship between capability and time or compute inside the model.
We fit the model via non-linear least-squares estimation, with a ridge regularization penalty to discourage overfitting. The scale of the resulting values is arbitrary; we currently rescale them so that Claude 3.5 Sonnet is fixed at 130 and GPT-5 is fixed at 150, which allows consistent scoring across recent models in a way that balances communicating our uncertainty with providing detailed information.
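The sketch below illustrates this kind of fit using SciPy's `least_squares`. The toy score matrix, ridge strength, and anchor indices are hypothetical stand-ins; the actual ECI fitting code may differ in its details.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical observed scores: rows are models, columns are benchmarks.
# NaN marks (model, benchmark) pairs with no observed score.
scores = np.array([
    [0.85, 0.40, np.nan],
    [0.95, 0.70, 0.30],
    [np.nan, 0.90, 0.60],
])
n_models, n_benchmarks = scores.shape
observed = ~np.isnan(scores)

RIDGE = 0.1  # illustrative regularization strength

# Initial guess: zero capabilities/difficulties, unit slopes.
x0 = np.concatenate([np.zeros(n_models),
                     np.zeros(n_benchmarks),
                     np.ones(n_benchmarks)])

def unpack(params):
    C = params[:n_models]                          # model capabilities C_m
    D = params[n_models:n_models + n_benchmarks]   # benchmark difficulties D_b
    alpha = params[n_models + n_benchmarks:]       # benchmark slopes alpha_b
    return C, D, alpha

def residuals(params):
    C, D, alpha = unpack(params)
    pred = 1.0 / (1.0 + np.exp(-alpha[None, :] * (C[:, None] - D[None, :])))
    fit_res = (pred - scores)[observed]
    # Ridge penalty expressed as extra residuals that shrink parameters
    # toward the initial guess, discouraging overfitting on sparse data.
    return np.concatenate([fit_res, np.sqrt(RIDGE) * (params - x0)])

fit = least_squares(residuals, x0)
C, D, alpha = unpack(fit.x)

# Rescale capabilities so two anchor models land on fixed values,
# analogous to fixing Claude 3.5 Sonnet at 130 and GPT-5 at 150.
anchor_low, anchor_high = 0, 1  # hypothetical anchor model indices
eci = 130 + (C - C[anchor_low]) * (150 - 130) / (C[anchor_high] - C[anchor_low])
print(np.round(eci, 1))
```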