Domain-specific ECIs show how LLMs’ ECI scores are influenced by the selection of benchmarks used to calculate them.
The domain-specific ECI largely uses the methodology from the general ECI but only incorporates benchmarks from a specific domain, such as software engineering (SWE) or math. In particular, we keep the benchmark difficulty and slopes from the general ECI fit, and only refit the LLM capability parameters. See the methodology section below for the full details. The cyber ECI is an exception, as it also incorporates cybersecurity benchmarks that are not part of the general ECI; see the Cyber ECI section below.
The results are given on a scale comparable to the general ECI, so if an LLM has a higher math ECI than general ECI, it performs better on math benchmarks than on non-math benchmarks. However, this scaling means that the results cannot be used to assess long-term progress trends in different domains, because all domains are scaled to increase at the same overall pace as the general ECI. It is still possible to detect one-off capability jumps, or to see, for example, one family of LLMs go from underperforming to overperforming within a domain.
The methodology borrows heavily from the general ECI, so we recommend reading the information in its methodology section for background.
In particular, we use the same logistic model for the performance of LLM \(m\) on benchmark \(b\):
\(\textrm{performance}(m,b) = \sigma(\alpha_b [C_m - D_b])\)
We keep the benchmark difficulty (\(D_b\)) and benchmark slope (\(\alpha_b\)) parameters from the general ECI fit, and recalculate only the LLM capability parameters. This keeps the resulting values comparable to the general ECI values, but it means that the results cannot be used to assess long-term progress trends in different domains. For the cyber ECI, the same procedure is applied starting from a refitted “general + cyber” joint fit rather than the published general ECI fit; see the Cyber ECI section below.
For a given subset of the benchmarks, we find the domain-specific ECI (\(C_m\)) value for each LLM that minimizes the squared prediction error, calculated only on the benchmarks within the subset of interest. Because the benchmark parameters are already on the general ECI scale (where Sonnet 3.5 has a score of 130 and GPT-5 a score of 150), no rescaling is required.
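The fitting step can be illustrated with a minimal sketch, assuming one LLM and a small domain subset. The benchmark names, difficulty and slope values, search bounds, and the grid-plus-ternary-search optimizer are all hypothetical choices made for illustration; the text above specifies only the logistic model and the squared-error objective, not the optimizer actually used.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predicted_performance(capability, difficulty, slope):
    # performance(m, b) = sigma(alpha_b * (C_m - D_b))
    return sigmoid(slope * (capability - difficulty))


def squared_error(capability, scores, benchmark_params):
    # Sum of squared prediction errors over the domain's benchmarks only.
    total = 0.0
    for name, score in scores.items():
        difficulty, slope = benchmark_params[name]
        predicted = predicted_performance(capability, difficulty, slope)
        total += (score - predicted) ** 2
    return total


def fit_domain_eci(scores, benchmark_params, lo=0.0, hi=300.0, step=0.1):
    # Benchmark parameters stay fixed; only the capability C_m is searched.
    # The bounds lo/hi and the grid step are assumptions, not published values.
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    best = min(grid, key=lambda c: squared_error(c, scores, benchmark_params))

    # Refine around the best grid point with a ternary search.
    a, b = max(lo, best - step), min(hi, best + step)
    for _ in range(40):
        m1 = a + (b - a) / 3
        m2 = b - (b - a) / 3
        if squared_error(m1, scores, benchmark_params) < squared_error(
            m2, scores, benchmark_params
        ):
            b = m2
        else:
            a = m1
    return (a + b) / 2


# Made-up (difficulty D_b, slope alpha_b) pairs standing in for a general ECI fit.
params = {"math_bench_1": (135.0, 0.10), "math_bench_2": (150.0, 0.20)}

# Synthetic scores generated from a known capability, to check the fit recovers it.
true_capability = 140.0
scores = {
    name: predicted_performance(true_capability, d, a)
    for name, (d, a) in params.items()
}
fitted = fit_domain_eci(scores, params)
print(round(fitted, 3))
```

Because the difficulties and slopes are held fixed, the fit reduces to a one-dimensional search per LLM, which is why a simple bounded search is enough for this sketch.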
To calculate confidence intervals we use a two-step process:
N.B. This approach differs slightly from the general ECI bootstrap: it gives each LLM a fixed number of resamples each time, whereas the general ECI bootstrap resamples over all observed benchmark-LLM pairs. As a result, it produces slightly smaller CIs when all benchmark results are included.
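The distinction drawn in this note can be sketched as two resampling schemes. This is only an illustration of one reading of the note, namely that "a fixed number of resamples" means each LLM keeps the same number of benchmark results in every bootstrap draw. It is not a reproduction of the two-step process itself, and the function names and data layout are hypothetical.

```python
import random


def resample_per_llm(results_by_llm, rng):
    # Each LLM's own results are resampled with replacement,
    # so every LLM keeps a fixed number of results in each draw.
    return {
        llm: rng.choices(results, k=len(results))
        for llm, results in results_by_llm.items()
    }


def resample_pooled(results_by_llm, rng):
    # All (LLM, result) pairs are pooled and resampled together,
    # so the number of results per LLM varies from draw to draw.
    pairs = [(llm, r) for llm, results in results_by_llm.items() for r in results]
    drawn = rng.choices(pairs, k=len(pairs))
    resampled = {llm: [] for llm in results_by_llm}
    for llm, r in drawn:
        resampled[llm].append(r)
    return resampled


# Made-up (benchmark, score) results for two hypothetical LLMs.
data = {
    "llm_a": [("bench_1", 0.62), ("bench_2", 0.48), ("bench_3", 0.71)],
    "llm_b": [("bench_1", 0.35), ("bench_2", 0.22)],
}
rng = random.Random(0)
per_llm_draw = resample_per_llm(data, rng)
pooled_draw = resample_pooled(data, rng)
```

In the pooled scheme an LLM can end up with fewer (or no) results in a given draw, which adds variability to its refitted capability; that is consistent with the note that the per-LLM scheme gives slightly smaller CIs.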
The cyber ECI is constructed differently from other domain-specific ECIs, because the general ECI’s benchmark pool contains almost no cybersecurity benchmarks. We first refit a new “general + cyber” version of the ECI, incorporating 19 additional cyber benchmarks, then follow the standard domain-specific methodology described above, starting from that joint fit. The general ECI values shown alongside the cyber ECI are still the published general ECI values, and some LLMs covered by the cyber data (e.g. Claude Mythos Preview (Early)) have a cyber ECI but no general ECI. Because the cyber ECI starts from this separate joint fit, it may lag behind updates to the general ECI.
Since the domain-specific ECI incorporates the benchmark parameters from the general ECI, it is influenced by all the data covered in the general ECI's data section.
To see which benchmarks are categorized as SWE or math, see the benchmark hub.
See the general ECI FAQs for general information about the ECI.
To avoid overly noisy results, we require an LLM to have results on at least 2 benchmarks within a domain before calculating its domain-specific ECI. Some LLMs pass the 4-benchmark minimum for inclusion in the general ECI but do not have enough in-domain benchmarks to be given a domain-specific ECI.
Unlike the other domain-specific ECIs, the cyber ECI incorporates cyber benchmarks that are not part of the general ECI. An LLM with enough cyber benchmark results can therefore be given a cyber ECI even if it does not meet the 4-benchmark minimum for a general ECI. These LLMs are marked with a “No General ECI” note in the explorer. See the Cyber ECI section above for more details.
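The two inclusion thresholds can be expressed as a small check. This is a sketch under stated assumptions: the benchmark names are invented, and it assumes the 4-benchmark minimum is counted over benchmarks in the general ECI pool, which the text does not spell out.

```python
MIN_DOMAIN_BENCHMARKS = 2   # minimum in-domain benchmarks for a domain-specific ECI
MIN_GENERAL_BENCHMARKS = 4  # minimum benchmarks for inclusion in the general ECI


def eci_eligibility(llm_benchmarks, general_benchmarks, domain_benchmarks):
    # Returns which scores an LLM qualifies for, given the benchmarks it has results on.
    # domain_benchmarks need not be a subset of general_benchmarks (the cyber case).
    evaluated = set(llm_benchmarks)
    n_general = len(evaluated & set(general_benchmarks))
    n_domain = len(evaluated & set(domain_benchmarks))
    return {
        "general_eci": n_general >= MIN_GENERAL_BENCHMARKS,
        "domain_eci": n_domain >= MIN_DOMAIN_BENCHMARKS,
    }


# Hypothetical benchmark pools.
general_pool = {"g1", "g2", "g3", "g4", "m1", "m2"}
math_pool = {"m1", "m2"}
cyber_pool = {"c1", "c2", "c3"}

# General ECI but no math ECI: only one math benchmark result.
no_math = eci_eligibility({"g1", "g2", "g3", "g4", "m1"}, general_pool, math_pool)

# Cyber ECI but no general ECI: the "No General ECI" case.
cyber_only = eci_eligibility({"c1", "c2", "g1"}, general_pool, cyber_pool)
```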
This means that the LLM’s performance on math benchmarks is what we would expect from an LLM with general ECI 160, but it performs less well on other benchmarks, resulting in a general ECI of only 155.
No. The methodology we use scales all domain-specific ECI values to the same level as the general ECI. See the methodology section for details.
No, the benchmarks used to calculate the math and SWE domain-specific ECIs are a subset of the benchmarks used to construct the general ECI. The cyber ECI is the exception: it incorporates many cyber benchmarks that are not part of the general ECI (see the Cyber ECI section above).
We thank Alexander Barry for his work on the domain-specific ECI. For acknowledgements for the overall ECI see here.