SWE-bench Verified
GitHub issues from real-world Python repos, testing whether models can implement valid code fixes. Note: our SWE-bench Verified scaffold was significantly upgraded in February 2026. ECI currently uses data from the SWE-bench team’s runs using mini-SWE-agent.
About SWE-bench Verified
SWE-bench Verified is a human-validated subset of the original SWE-bench dataset, consisting of 500 samples that evaluate AI models’ ability to solve real-world software engineering issues. Epoch evaluations of this benchmark use 484 samples that are validated on our infrastructure. Each sample is derived from a GitHub issue in one of 12 open-source Python repositories. Our review of SWE-bench Verified examines the benchmark in more detail.
The evaluation workflow is as follows: the model is given access to a code repository and a description of an issue that needs to be fixed. The model must then investigate and modify the repository to resolve the issue. Once the model has made its changes, the solution is evaluated by running unit tests on the modified codebase.
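Concretely, SWE-bench grades each sample against two recorded test lists: FAIL_TO_PASS (tests that fail before the fix and must pass afterwards) and PASS_TO_PASS (tests that must not regress). A minimal sketch of that grading rule (the function and argument names here are ours, for illustration only):

```python
def is_resolved(fail_to_pass: list[str], pass_to_pass: list[str],
                passed: set[str]) -> bool:
    """A sample counts as resolved only if every FAIL_TO_PASS test now
    passes and no PASS_TO_PASS test has regressed after the model's edit."""
    return all(t in passed for t in fail_to_pass) and \
           all(t in passed for t in pass_to_pass)
```

In practice these lists come from the sample's metadata, and the test run happens inside the sample's own container.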
The dataset was curated through a rigorous human annotation process involving 93 software developers. Each sample was reviewed by three separate annotators to ensure the issue description is well-specified, the unit tests are appropriate, and the sample is free of other major issues that could make evaluation unreliable. Nevertheless, some samples may remain ambiguous – and we have previously estimated an error rate of 5-10%.
Methodology
There are five parts to our SWE-bench Verified evaluation methodology:
- Prompts and other information to direct the model.
- Scaffold and tools used when running the model.
- Samples used for evaluation.
- Limits on model input/output.
- Environment the benchmark runs in.
Prompts and information: We use a fairly simple prompt, close to the one used in the SWE-bench developers’ bash-only runs, which is in turn similar to the SWE-bench Verified prompts Anthropic has used with its models. The full prompt can be seen in our log viewer.
Our prompt tells the model about the issue (i.e. SWE-bench’s provided issue description), the repository, the available token budget, and the available tools. It also tells the model not to modify the tests. This instruction matters because the gold-standard tests from the repository are used to grade the solution: without it, the model might refactor code in a way that is conceptually correct but graded as a failure, for example because a method was renamed or now returns a different data structure.
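Schematically, the prompt carries fields like the following (paraphrased placeholders, not our exact wording; the full prompt is in our log viewer):

```python
# Hypothetical template illustrating the information the prompt conveys.
PROMPT_TEMPLATE = """\
You are working in a checkout of {repo}.
Solve the following issue:

{issue_description}

You have a budget of {token_budget} tokens and these tools: {tool_names}.
Do not modify the existing tests; they will be used to grade your solution.
"""

prompt = PROMPT_TEMPLATE.format(
    repo="astropy/astropy",                  # example repository
    issue_description="<issue text here>",
    token_budget=2_000_000,
    tool_names="bash, text_editor, apply_patch",
)
```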
Scaffold and tools: Unless otherwise specified, our scaffold is a simple loop in which the model takes one action per turn (e.g. reasoning plus a tool call) and then sees any output. After each turn, the model is reminded of its remaining token budget. We also include results from other scaffolds, such as Claude Code or Codex, evaluated via Inspect-SWE; for these, we provide a tool for checking token usage.
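The loop described above can be sketched as follows. Here `generate` and `run_tool` are hypothetical callables standing in for the LLM API and the tool executor; the real scaffold (built on Inspect) has considerably more machinery.

```python
def agent_loop(generate, run_tool, task_prompt, token_budget):
    """Minimal sketch of the scaffold loop: one model action per turn,
    then any tool output, with a budget reminder after each turn."""
    messages = [{"role": "user", "content": task_prompt}]
    used = 0
    while True:
        action = generate(messages)   # one action: reasoning and/or a tool call
        used += action["tokens"]
        messages.append({"role": "assistant", "content": action["content"]})
        # Stop on a final answer, or force submission once the budget is spent.
        if action.get("tool") is None or used >= token_budget:
            return action["content"]
        output = run_tool(action["tool"], action["args"])
        # The model sees the tool output plus its remaining token budget.
        messages.append({"role": "tool",
                         "content": f"{output}\n[{token_budget - used} tokens remaining]"})
```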
We provide a bash tool to run commands, Inspect’s Anthropic-style text_editor tool for file viewing and modification, and an OpenAI-style apply_patch tool to apply patches to files. These are called via tool-calling APIs. We include these tools because they are associated with the two leading code-LLM APIs and are often adopted by other developers; they are also well defined by public specifications and have persisted (with minor changes) across several generations of coding models.
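For illustration, a stripped-down version of the bash tool could look like this (the real tools follow the published Anthropic and OpenAI tool specifications, and commands run inside the sample's container with proper sandboxing and output truncation):

```python
import subprocess

def bash_tool(cmd: str, timeout: int = 60) -> str:
    """Run a shell command and return combined stdout/stderr as the
    tool result shown to the model. Sketch only: error handling and
    sandboxing are the real scaffold's job."""
    proc = subprocess.run(["bash", "-c", cmd], capture_output=True,
                          text=True, timeout=timeout)
    return proc.stdout + proc.stderr
```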
Samples used: We exclude 16 samples that do not run reliably on our infrastructure. Typical issues we have observed include tests that require network access, or repository versions that are incompatible with the sample’s specified dependencies. Excluding such samples has been fairly common practice in evaluations run by AI developers: for example, the GPT-5 evaluation used 477 tasks and the Claude 3.7 Sonnet evaluation used 489, while later Anthropic releases fixed and used all 500. The SWE-bench developers have fixed many such issues over time for the official leaderboard evals, although some problems may remain.
We evaluate using the remaining 484 samples, including one sample where the test list has been amended. This should have a negligible impact on overall SWE-bench performance, for example GPT-5 Mini’s results on the SWE-bench bash-only leaderboard would be 59.8% if they also excluded these issues, compared to a reported 59.9% when all samples are used.1
Limits on model input/output: We limit model token usage to 2M uncached read/write tokens (including reasoning) and 20M cached token reads. We enforce limits after each turn: a model is forced to submit its current working state on the first turn after it has exceeded either limit. Cumulative token usage grows quadratically with the number of turns, because the full conversation history is passed as input each time the LLM produces its next message or tool call. We have not observed models hitting these high limits in practice, other than in pathological cases such as getting stuck in a loop.
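To see why usage grows quadratically: if each turn adds a roughly fixed number of tokens to the history, turn n re-reads everything from the previous n−1 turns as input, so total input tokens scale with the square of the turn count. A toy calculation (idealized fixed-size turns, which real conversations are not):

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every call re-sends the full history."""
    total, history = 0, 0
    for _ in range(turns):
        total += history            # the model re-reads all earlier turns
        history += tokens_per_turn  # this turn's output joins the history
    return total

# 100 turns of ~1,000 tokens each already re-reads ~5M input tokens,
# which is why the cached-read limit (20M) is much higher than the
# uncached read/write limit (2M).
```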
Environment the samples run in: We run the benchmark within a barebones Linux-based Docker container. We do not allow network access, to prevent cheating. Dependencies for each sample are installed based on their spec in the SWE-bench version used to build our original Docker images (v2.1.0). When instantiating a Docker container, we remove all git history after that sample’s original GitHub issue. This prevents models from cheating by looking at the human solutions, which are part of the subsequent git history.
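A sketch of the history-truncation step, assuming `base_commit` names the sample's pre-issue commit (the real setup happens while building the Docker image and also removes remote refs, tags, and branches that could still reach later commits):

```python
import subprocess

def strip_future_history(repo_dir: str, base_commit: str) -> None:
    """Reset the repo to the sample's base commit, then expire reflogs
    and prune objects so later commits (including the human-written fix)
    are no longer retrievable. Sketch only, under the assumptions above."""
    def run(*args):
        subprocess.run(["git", *args], cwd=repo_dir, check=True)
    run("reset", "--hard", base_commit)  # move HEAD/branch back to the base commit
    run("reflog", "expire", "--expire=now", "--all")
    run("gc", "--prune=now")             # drop now-unreachable later commits
```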
Changelog
2026-02-13. We added third-party scaffolds (Claude Code and Codex), fixed a bug where token usage was not being counted properly for Google models’ reasoning, and added the new post-GPT-5.1 versions of OpenAI’s native apply_patch and shell tools. These changes are not expected to affect existing scores, as they did not apply to models that were already evaluated. The version number incremented to v2.0.2.
2026-02-13. We upgraded our Inspect dependency. This is not expected to affect scores for this benchmark. The version number incremented to v2.0.1.
2026-02-12. We performed a major upgrade of scaffolding, environments, and token limits. We updated the Epoch version of this benchmark to v2.0.0, and re-evaluated key models. This led to model performance improving significantly. Our default graph view above only displays results from v2.0.0 onwards.
In principle, if these excluded samples showed a different difficulty distribution to the rest of SWE-bench Verified, then they could affect results more than 0.1pp. For example, if a model were scoring 100% on the rest of SWE-bench Verified and 0% on these samples, then the maximum score difference would be 3.2pp. However, given that 11/16 of these samples were already solved in at least one run from SWE-bench Verified bash-only v1, the effect of removing them is expected to be far smaller.
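The 3.2pp figure follows directly from the sample counts:

```python
# Worst case: a model solves all 484 retained samples and none of the 16
# excluded ones. Compare its score on our subset with the full 500.
score_subset = 484 / 484                      # 1.000
score_full = 484 / 500                        # 0.968
max_diff_pp = round((score_subset - score_full) * 100, 1)
```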