Domain-specific ECI

Overview

Domain-specific ECIs show how LLMs’ ECI scores are influenced by the selection of benchmarks used to calculate them.

The domain-specific ECI largely uses the methodology from the general ECI but only incorporates benchmarks from a specific domain, such as software engineering (SWE) or math. In particular, we keep the benchmark difficulty and slopes from the general ECI fit, and only refit the LLM capability parameters. See the methodology section below for the full details. The cyber ECI is an exception, as it also incorporates cybersecurity benchmarks that are not part of the general ECI; see the Cyber ECI section below.

The results are given on a scale comparable to the general ECI, so if an LLM has a higher math ECI than general ECI, it performs better on math benchmarks than on non-math benchmarks. However, this scaling means that the results cannot be used to assess long-term progress trends in different domains, because every domain-specific ECI is scaled to increase at the same overall pace as the general ECI. It is still possible to detect one-off capability jumps, or to see, for example, one family of LLMs go from underperforming to overperforming within a domain.

Methodology

The methodology borrows heavily from the general ECI, so we recommend reading the information in its methodology section for background.

In particular we use the same logistic model for performance of LLM m on benchmark b:

\(\textrm{performance}(m,b) = \sigma(\alpha_b [C_m - D_b])\)
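As an illustration, the model above can be written as a small Python function. This is a minimal sketch: the function names and the example parameter values are assumptions chosen for illustration, not values from the ECI fit.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predicted_performance(capability, difficulty, slope):
    """performance(m, b) = sigma(alpha_b * (C_m - D_b)).

    capability -> C_m, difficulty -> D_b, slope -> alpha_b.
    """
    return sigmoid(slope * (capability - difficulty))


# Hypothetical values: an LLM whose capability equals the benchmark
# difficulty sits at the midpoint of the logistic curve.
midpoint = predicted_performance(capability=150.0, difficulty=150.0, slope=0.1)
```

When the capability equals the benchmark difficulty, the predicted performance is 0.5 regardless of the slope; the slope controls how quickly performance rises as capability exceeds difficulty.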

We keep the benchmark difficulty (\(D_b\)) and benchmark slope (\(\alpha_b\)) parameters from the general ECI fit and recalculate only the LLM capability parameters (\(C_m\)). This ensures the resulting values are comparable to the general ECI values, but it also means that the results cannot be used to assess long-term progress trends in different domains. For the cyber ECI, the same procedure is applied starting from a “general + cyber” joint fit rather than the published general ECI fit; see the Cyber ECI section below.

For a given subset of the benchmarks, we find the domain-specific ECI (\(C_m\)) value for each LLM that minimizes the squared prediction error calculated only on the benchmarks within the subset of interest. Because the benchmark parameters are already on the general ECI scale (where Sonnet 3.5 has a score of 130 and GPT-5 a score of 150), no rescaling is required.
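The fitting step can be sketched as a one-dimensional search over \(C_m\) with the benchmark parameters held fixed. This is a sketch under stated assumptions: the text does not say which optimizer is used, so a simple grid search stands in for it here, and the benchmark names, parameter values, scores, and search range are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def squared_error(capability, results, params):
    """Sum of squared prediction errors for one LLM.

    results: {benchmark: observed score in [0, 1]}
    params:  {benchmark: (difficulty D_b, slope alpha_b)}, held fixed
    """
    return sum(
        (sigmoid(params[b][1] * (capability - params[b][0])) - score) ** 2
        for b, score in results.items()
    )


def fit_domain_capability(results, params, domain, lo=50.0, hi=250.0, step=0.05):
    """Grid-search the capability that minimizes squared error on `domain` only.

    The grid search and its range are assumptions of this sketch.
    """
    subset = {b: s for b, s in results.items() if b in domain}
    if not subset:
        raise ValueError("no observed results inside the requested domain")
    n_steps = int(round((hi - lo) / step))
    grid = (lo + i * step for i in range(n_steps + 1))
    return min(grid, key=lambda c: squared_error(c, subset, params))


# Hypothetical benchmark parameters and scores, for illustration only.
params = {"math_a": (140.0, 0.10), "math_b": (155.0, 0.08), "swe_a": (150.0, 0.12)}
results = {"math_a": 0.80, "math_b": 0.45, "swe_a": 0.40}
math_eci = fit_domain_capability(results, params, domain={"math_a", "math_b"})
```

Only the results inside `domain` enter the error, which is what makes the fitted value domain-specific; the benchmark parameters themselves are never updated.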

To calculate confidence intervals we use a two-stage bootstrap:

  • First, we take the 100 bootstrap samples of the benchmark parameters that were calculated in the general ECI fit.
  • Then, for each of those sets of parameters, we generate a further 10 bootstrap samples by sampling with replacement, for each LLM, from its observed benchmark results (within the subset of interest). This creates 1000 samples in total.
  • For each of those 1000 samples we fit the domain-specific ECI for each LLM. We then take the 5th and 95th percentiles of the resulting distribution to construct the CIs.
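The bootstrap steps above can be sketched for a single LLM as follows. This is an illustrative sketch, not the actual implementation: the function names are hypothetical, the grid-search fit is an assumed stand-in for whatever optimizer is used, and the percentile interpolation (Python's `statistics.quantiles` default) is also an assumption. `param_samples` stands in for the 100 bootstrap samples of benchmark parameters from the general ECI fit.

```python
import math
import random
import statistics


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fit_capability(scores, params, lo=50.0, hi=250.0, step=0.5):
    """Grid-search the capability minimizing squared error (assumed optimizer).

    scores: list of (benchmark, observed score) pairs, duplicates allowed
    params: {benchmark: (difficulty, slope)}
    """
    n_steps = int(round((hi - lo) / step))

    def error(c):
        return sum(
            (sigmoid(params[b][1] * (c - params[b][0])) - s) ** 2
            for b, s in scores
        )

    return min((lo + i * step for i in range(n_steps + 1)), key=error)


def bootstrap_ci(scores, param_samples, n_resamples=10, seed=0):
    """Return (5th percentile, 95th percentile) of the refitted capability.

    scores:        one LLM's in-domain results as (benchmark, score) pairs
    param_samples: list of benchmark-parameter dicts, one per outer sample
    """
    rng = random.Random(seed)
    fits = []
    for params in param_samples:          # stage 1: benchmark-parameter samples
        for _ in range(n_resamples):      # stage 2: resample this LLM's results
            resample = [rng.choice(scores) for _ in scores]
            fits.append(fit_capability(resample, params))
    cuts = statistics.quantiles(fits, n=20)  # 19 cut points at 5%, 10%, ..., 95%
    return cuts[0], cuts[-1]
```

With 100 entries in `param_samples` and the default `n_resamples=10`, the inner loop produces the 1000 fits described above.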

N.B. This approach differs slightly from the general ECI bootstrap: it gives each LLM a fixed number of resamples each time, whereas the general ECI bootstrap resamples over all observed benchmark-LLM pairs. As a result, it produces slightly smaller CIs when all benchmark results are included.

Cyber ECI

The cyber ECI is constructed differently from other domain-specific ECIs, because the general ECI’s benchmark pool contains almost no cybersecurity benchmarks. We first refit a new “general + cyber” version of the ECI, incorporating 19 additional cyber benchmarks, then follow the standard domain-specific methodology described above, starting from that joint fit. The general ECI values shown alongside the cyber ECI are still the published general ECI values, and some LLMs covered by the cyber data (e.g. Claude Mythos Preview (Early)) have a cyber ECI but no general ECI. Because the cyber ECI starts from this separate joint fit, it may lag behind updates to the general ECI.

Data

Since the domain-specific ECI incorporates the benchmark parameters from the general ECI, it is influenced by all the data covered in its data section.

See the benchmark hub for which benchmarks are categorized as SWE or math.

FAQs

See the general ECI FAQs for general information about the ECI.

Why do some LLMs have a general ECI but not a Math-ECI or SWE-ECI?

We require at least 2 benchmarks within a domain to calculate its domain-specific ECI, to avoid overly noisy results. Some LLMs pass the 4-benchmark minimum for inclusion in the general ECI but do not have enough benchmarks within a given domain to be given a domain-specific ECI.
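The two thresholds can be illustrated with a small check. This is a hedged sketch for the math and SWE case only, where the domain benchmarks are a subset of the general ECI pool; the function name and benchmark names are hypothetical.

```python
MIN_GENERAL_BENCHMARKS = 4  # minimum for inclusion in the general ECI
MIN_DOMAIN_BENCHMARKS = 2   # minimum for a domain-specific ECI


def eligibility(benchmarks_run, domain_benchmarks):
    """Report whether an LLM meets each minimum.

    benchmarks_run:    benchmarks the LLM has results on
    domain_benchmarks: benchmarks categorized under the domain of interest
    """
    run = set(benchmarks_run)
    in_domain = run & set(domain_benchmarks)
    return {
        "general": len(run) >= MIN_GENERAL_BENCHMARKS,
        "domain": len(in_domain) >= MIN_DOMAIN_BENCHMARKS,
    }


# Hypothetical LLM with four results, only one of them in the math domain.
status = eligibility(
    ["math_a", "swe_a", "swe_b", "qa_a"],
    domain_benchmarks={"math_a", "math_b"},
)
```

In this hypothetical case the LLM qualifies for a general ECI but not for a math ECI, because only one of its four results falls within the math domain.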

Why do some LLMs have a cyber ECI but no general ECI?

Unlike the other domain-specific ECIs, the cyber ECI incorporates cyber benchmarks that are not part of the general ECI. An LLM with enough cyber benchmark results can therefore be given a cyber ECI even if it does not meet the 4-benchmark minimum for a general ECI. These LLMs are marked with a “No General ECI” note in the explorer. See the Cyber ECI section above for more details.

How should I interpret the results? e.g. What does it mean for an LLM to have a Math-ECI of 160 but a general ECI of 155?

This means that the LLM’s performance on math benchmarks is what we would expect from an LLM with general ECI 160, but it performs less well on other benchmarks, resulting in a general ECI of only 155.

Do the results mean that math and software engineering capabilities have been improving at the same rate as the general ECI?

No. The methodology we use causes all domain-specific ECI values to be scaled to the same level as the general ECI, so they cannot be used to compare rates of progress between domains. See the methodology section for details.

Did you add additional math or software engineering benchmarks to construct the SWE-ECI or Math-ECI?

No, the benchmarks used to calculate the math and SWE domain-specific ECIs are a subset of the benchmarks used to construct the general ECI. The cyber ECI is the exception: it incorporates many cyber benchmarks that are not part of the general ECI (see the Cyber ECI section above).

Acknowledgements

We thank Alexander Barry for his work on the domain-specific ECI. For acknowledgements for the overall ECI see here.