Domain-specific ECI

Overview

Domain-specific ECIs show how LLMs’ ECI scores are influenced by the selection of benchmarks used to calculate them.

The domain-specific ECI largely uses the methodology from the general ECI but only incorporates benchmarks from a specific domain, such as software engineering (SWE) or math. In particular, we keep the benchmark difficulty and slopes from the general ECI fit, and only refit the LLM capability parameters. See the methodology section below for the full details.

The results are given on a scale comparable to the general ECI, so if an LLM has a higher math ECI than general ECI, it performs better on math benchmarks than on non-math benchmarks. However, this scaling means that the results cannot be used to assess progress trends in different domains, as every domain-specific ECI is scaled to increase at the same overall pace as the general ECI. It is still possible for, e.g., one family of LLMs to go from underperforming to overperforming within a domain.

Methodology

The methodology borrows heavily from the general ECI, so we recommend reading the information in its methodology section for background.

In particular, we use the same logistic model for the performance of LLM m on benchmark b:

$$\mathrm{performance}(m, b) = \sigma\big(\alpha_b [C_m - D_b]\big)$$

where $\sigma$ is the logistic function. We keep the benchmark difficulty ($D_b$) and benchmark slope ($\alpha_b$) parameters from the general ECI fit, and recalculate only the LLM capability parameters ($C_m$). This ensures the resulting values are comparable to the general ECI values, but it also means that the results cannot be used to assess progress trends in different domains.
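As a rough illustration of the logistic model, the sketch below evaluates the predicted performance for a few capability values while holding one benchmark's parameters fixed. The function name and the difficulty and slope values are hypothetical, chosen only for illustration; they are not taken from the actual ECI fit.

```python
import math

def predicted_performance(capability, difficulty, slope):
    """Logistic prediction sigma(slope * (capability - difficulty)), a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-slope * (capability - difficulty)))

# Hypothetical benchmark parameters, held fixed as if taken from the general ECI fit.
difficulty, slope = 140.0, 0.1

for capability in (130.0, 140.0, 150.0):
    print(capability, round(predicted_performance(capability, difficulty, slope), 3))
# prints:
# 130.0 0.269
# 140.0 0.5
# 150.0 0.731
```

An LLM whose capability equals the benchmark difficulty is predicted to score 0.5, and the slope controls how quickly the prediction moves away from 0.5 as capability changes.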

For a given subset of the benchmarks, we find the domain-specific ECI ($C_m$) value for each LLM that minimizes the squared prediction error, calculated only on the benchmarks within the subset of interest. Because the fixed benchmark parameters are already on the general ECI scale (where Sonnet 3.5 has a score of 130 and GPT-5 a score of 150), no rescaling is required.
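The fitting step can be sketched as a one-dimensional least-squares problem per LLM. The code below is a minimal sketch, not the actual implementation: the benchmark names, parameter values, scores, and search bounds are all hypothetical, and the optimizer (a coarse grid search refined by ternary search) is an assumption, since the source does not say which optimizer is used.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def squared_error(capability, results, params):
    """Sum of squared prediction errors over one LLM's benchmark results."""
    total = 0.0
    for bench, score in results.items():
        difficulty, slope = params[bench]
        total += (score - sigmoid(slope * (capability - difficulty))) ** 2
    return total

def fit_capability(results, params, lo=0.0, hi=300.0, steps=600):
    """Coarse grid search, then ternary-search refinement around the best grid point."""
    width = (hi - lo) / steps
    grid = [lo + i * width for i in range(steps + 1)]
    best = min(grid, key=lambda c: squared_error(c, results, params))
    left, right = max(lo, best - width), min(hi, best + width)
    for _ in range(100):
        m1 = left + (right - left) / 3
        m2 = right - (right - left) / 3
        if squared_error(m1, results, params) < squared_error(m2, results, params):
            right = m2
        else:
            left = m1
    return (left + right) / 2

# Hypothetical fixed (difficulty, slope) parameters, standing in for the general ECI fit.
params = {"math_a": (135.0, 0.15), "math_b": (150.0, 0.10), "swe_a": (145.0, 0.12)}
# Hypothetical observed scores for one LLM, as fractions in [0, 1].
observed = {"math_a": 0.85, "math_b": 0.45, "swe_a": 0.70}

# Restrict to the domain of interest before fitting.
math_benchmarks = {"math_a", "math_b"}
math_results = {b: s for b, s in observed.items() if b in math_benchmarks}

print("math ECI (sketch):", round(fit_capability(math_results, params), 2))
print("all-benchmark fit (sketch):", round(fit_capability(observed, params), 2))
```

Because only the capability is refit, the problem is a single-parameter search per LLM, and the fitted value is directly on the scale of the fixed benchmark parameters.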

To calculate confidence intervals, we use a two-step bootstrap process:

  • First, we take the 100 bootstrap samples of the benchmark parameters that were calculated in the general ECI fit.
  • Then, for each of those parameter sets, we generate a further 10 bootstrap samples by resampling each LLM’s observed benchmark results (within the subset of interest) with replacement, giving 1000 samples in total.
  • For each of those 1000 samples, we fit the domain-specific ECI for each LLM. We then take the 5th and 95th percentiles of the resulting distribution to construct the CIs.
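The steps above can be sketched for a single LLM as follows. This is an illustrative sketch under stated assumptions: the 100 parameter draws are simulated here by adding random noise to hypothetical base parameters (in the real procedure they come from the general ECI bootstrap), and the benchmark names, scores, search bounds, and optimizer are all hypothetical.

```python
import math
import random
import statistics

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def fit_capability(pairs, params, lo=0.0, hi=300.0, steps=300):
    """Least-squares capability for (benchmark, score) pairs, benchmark params fixed."""
    def loss(c):
        return sum((s - sigmoid(params[b][1] * (c - params[b][0]))) ** 2 for b, s in pairs)
    width = (hi - lo) / steps
    best = min((lo + i * width for i in range(steps + 1)), key=loss)
    left, right = max(lo, best - width), min(hi, best + width)
    for _ in range(60):
        m1 = left + (right - left) / 3
        m2 = right - (right - left) / 3
        if loss(m1) < loss(m2):
            right = m2
        else:
            left = m1
    return (left + right) / 2

def bootstrap_fits(pairs, param_draws, resamples_per_draw=10, seed=0):
    """One fit per (parameter draw, resample of this LLM's results) combination."""
    rng = random.Random(seed)
    fits = []
    for params in param_draws:
        for _ in range(resamples_per_draw):
            # Resample this LLM's in-domain results with replacement.
            resample = rng.choices(pairs, k=len(pairs))
            fits.append(fit_capability(resample, params))
    return fits

# Hypothetical base (difficulty, slope) parameters for three math benchmarks.
base = {"math_a": (135.0, 0.15), "math_b": (150.0, 0.10), "math_c": (160.0, 0.12)}

# Stand-in for the 100 bootstrap parameter sets from the general ECI fit:
# simulated here with random noise, purely for illustration.
noise = random.Random(1)
param_draws = [
    {b: (d + noise.gauss(0, 1.0), s * math.exp(noise.gauss(0, 0.05)))
     for b, (d, s) in base.items()}
    for _ in range(100)
]

# Hypothetical observed in-domain scores for one LLM.
observed = [("math_a", 0.85), ("math_b", 0.45), ("math_c", 0.20)]

fits = bootstrap_fits(observed, param_draws)  # 100 draws x 10 resamples = 1000 fits
cuts = statistics.quantiles(fits, n=20)       # cut points at 5%, 10%, ..., 95%
low, high = cuts[0], cuts[-1]
print(f"90% CI (sketch): [{low:.1f}, {high:.1f}] from {len(fits)} fits")
```

Each LLM is resampled from its own results only, so every resample contains the same number of results as that LLM's observed in-domain set.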

N.B. This is a slightly different approach from the general ECI bootstrap: it gives each LLM a fixed number of resampled results each time, whereas the general ECI bootstrap resamples over all observed benchmark-LLM pairs. As a result, it produces slightly smaller CIs when all benchmark results are included.

Data

Since the domain-specific ECI incorporates the benchmark parameters from the general ECI, it is influenced by all the data covered in its data section.

For the list of benchmarks categorised as SWE or Math, see the benchmark hub.

FAQs

See the general ECI FAQs for general information about the ECI.

Why do some LLMs have a general ECI but not a Math-ECI or SWE-ECI?

We require at least 2 benchmarks within a domain to calculate an LLM’s domain-specific ECI, to avoid overly noisy results. Some LLMs pass the 4-benchmark minimum for inclusion in the general ECI but don’t have enough benchmarks within a domain to be given a domain-specific ECI.

How should I interpret the results? E.g., what does it mean for an LLM to have a Math-ECI of 160 but a general ECI of 155?

This means that the LLM’s performance on math benchmarks is what we would expect from an LLM with a general ECI of 160, but it performs less well on other benchmarks, resulting in a general ECI of only 155.

Do the results mean that math and software engineering capabilities have been improving at the same rate as the general ECI?

No. The methodology we use causes all the domain-specific ECI values to be scaled to the same level as the general ECI, so they follow its overall pace by construction. See the methodology section for details.

Did you add additional math or software engineering benchmarks to construct the SWE-ECI or Math-ECI?

No, the benchmarks used to calculate the domain-specific ECIs are a subset of the benchmarks used to construct the general ECI.

Acknowledgements

We thank Alexander Barry for his work on the domain-specific ECI. For acknowledgements for the overall ECI, see the general ECI acknowledgements.