Domain-specific ECIs show how LLMs’ ECI scores are influenced by the selection of benchmarks used to calculate them.
The domain-specific ECI largely follows the methodology of the general ECI but only incorporates benchmarks from a specific domain, such as software engineering (SWE) or math. In particular, we keep the benchmark difficulty and slope parameters from the general ECI fit, and only refit the LLM capability parameters. See the methodology section below for full details.
The results are given on a scale comparable to the general ECI, so if an LLM has a higher math ECI than general ECI, it performs better on math benchmarks than on non-math benchmarks. However, this scaling means the results cannot be used to assess progress trends in different domains, as all domains are scaled to increase at the same overall pace as the general ECI (though it is still possible for, e.g., one family of LLMs to go from underperforming to overperforming within a domain).
The methodology borrows heavily from the general ECI, so we recommend reading the information in its methodology section for background.
In particular, we use the same logistic model for the performance of LLM m on benchmark b:
performance(m, b) = σ(β_b · (C_m − D_b))

where σ is the logistic function.
We keep the benchmark difficulty (D_b) and benchmark slope (β_b) parameters from the general ECI fit, and only recalculate the LLM capability parameters (C_m). This ensures the resulting values are comparable to the general ECI values, but means the results cannot be used to assess progress trends in different domains.
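As a concrete illustration, the logistic model can be evaluated directly. The parameter values below are made up for the example, not taken from the actual fit:

```python
import math

def sigmoid(x):
    """Logistic function: sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical parameters: capability C_m = 150, difficulty D_b = 140, slope beta_b = 0.05.
# performance(m, b) = sigma(beta_b * (C_m - D_b))
performance = sigmoid(0.05 * (150 - 140))  # sigma(0.5), roughly 0.62
```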
For a given subset of the benchmarks, we find domain-specific ECI (C_m) values for each LLM that minimize the squared prediction error, calculated only on the benchmarks within the subset of interest. (We use parameter values that are already on the general ECI scale, where Sonnet 3.5 scores 130 and GPT-5 scores 150, so no rescaling is required.)
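Because the difficulty and slope parameters are held fixed, the refit reduces to a one-dimensional least-squares problem in C_m per LLM. The sketch below illustrates this; the benchmark names, parameter values, and the simple grid search over candidate capabilities are all assumptions made for the example, not the production code:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predicted_performance(capability, difficulty, slope):
    # Logistic model with difficulty and slope held fixed from the general ECI fit.
    return sigmoid(slope * (capability - difficulty))

def fit_domain_capability(observed, benchmarks, grid=None):
    """Find the capability C_m minimizing squared prediction error over a domain.

    observed:   benchmark name -> observed score in [0, 1]
    benchmarks: benchmark name -> (difficulty D_b, slope beta_b), fixed from the general fit
    """
    if grid is None:
        grid = [c / 10 for c in range(3001)]  # candidate capabilities 0.0 .. 300.0

    def sq_error(c):
        return sum(
            (predicted_performance(c, *benchmarks[b]) - y) ** 2
            for b, y in observed.items()
        )

    return min(grid, key=sq_error)

# Hypothetical math-domain data (illustrative numbers only).
benchmarks = {"math_bench_a": (140.0, 0.05), "math_bench_b": (155.0, 0.04)}
observed = {"math_bench_a": 0.75, "math_bench_b": 0.40}
math_eci = fit_domain_capability(observed, benchmarks)
```

Since the fixed difficulty and slope parameters are already on the general ECI scale, the fitted C_m is directly comparable to a general ECI score.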
To calculate confidence intervals we use a two-step process:
N.B. This is a slightly different approach from the general ECI bootstrap (it gives each LLM a fixed number of resamples each time, whereas the general ECI bootstrap resamples over all observed benchmark–LLM pairs), and so it results in slightly smaller CIs when all benchmark results are included.
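One way to implement the per-LLM resampling mentioned in the note is sketched below. This is an illustration under stated assumptions: the helper names, the grid-search refit, and the number of bootstrap rounds are invented for the example, and the actual pipeline may differ in detail.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_capability(obs, grid=None):
    """Refit C_m by grid search; obs is a list of (score, difficulty, slope) triples."""
    if grid is None:
        grid = [c / 2 for c in range(601)]  # candidate capabilities 0.0 .. 300.0

    def sq_error(c):
        return sum((sigmoid(s * (c - d)) - y) ** 2 for y, d, s in obs)

    return min(grid, key=sq_error)

def bootstrap_ci(obs, n_boot=200, alpha=0.05, seed=0):
    """Percentile CI for one LLM: each round draws a fixed number of resamples
    (len(obs)) with replacement from that LLM's own benchmark results, then refits C_m."""
    rng = random.Random(seed)
    ests = sorted(
        fit_capability([rng.choice(obs) for _ in obs]) for _ in range(n_boot)
    )
    lo = ests[int((alpha / 2) * n_boot)]
    hi = ests[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Fixing the number of resamples per LLM keeps the effective sample size constant across rounds, which is what makes these CIs slightly narrower than the general ECI bootstrap's when all benchmark results are included.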
Since the domain-specific ECI incorporates the benchmark parameters from the general ECI, it is influenced by all the data covered in its data section.
To see which benchmarks are categorised as SWE or Math, see the benchmark hub.
See the general ECI FAQs for general information about the ECI.
We require at least 2 benchmarks within a domain to calculate its domain-specific ECI, to avoid overly noisy results. Some LLMs pass the 4-benchmark minimum for inclusion in the general ECI but do not have enough benchmarks within a given domain to receive a domain-specific ECI.
This means that the LLM’s performance on math benchmarks is what we would expect from an LLM with a general ECI of 160, but it performs less well on other benchmarks, resulting in a general ECI of only 155.
No; by construction, the methodology scales all domain-specific ECI values to the same level as the general ECI. See the methodology section for details.
No, the benchmarks used to calculate the domain-specific ECIs are a subset of the benchmarks used to construct the general ECI.
We thank Alexander Barry for his work on the domain-specific ECI. For acknowledgements for the overall ECI see here.