SWE-bench Verified

SWE-bench Verified is a human-validated subset of the original SWE-bench dataset, consisting of 500 samples that evaluate AI models’ ability to solve real-world software engineering issues. Each sample is derived from an actual GitHub issue in one of 12 open-source Python repositories.

The evaluation workflow is as follows: the model is given access to a code repository and a description of an issue that needs to be fixed. The model must then modify the appropriate files in the repository to resolve the issue. Once the model has made its changes, the solution is evaluated by running unit tests on the modified codebase.
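
To make the grading step concrete, here is a minimal sketch of how a SWE-bench-style harness scores a sample: it runs the issue's designated tests against the modified repository and counts the sample as solved only if they pass. The grade_sample helper, pytest invocation, and timeout below are illustrative assumptions, not the actual harness code; the real harness additionally checks that previously passing tests still pass.

```python
import subprocess

# Illustrative sketch of SWE-bench-style grading. A sample ships with
# "fail-to-pass" tests: tests that fail before the fix and must pass
# after it. Paths, test IDs, and the timeout are assumptions.
def grade_sample(repo_dir: str, fail_to_pass: list[str]) -> bool:
    result = subprocess.run(
        ["python", "-m", "pytest", *fail_to_pass],
        cwd=repo_dir,
        capture_output=True,
        timeout=1800,  # illustrative per-sample test timeout (seconds)
    )
    # Solved iff every required test now passes on the modified code.
    return result.returncode == 0
```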

The dataset was carefully curated through a rigorous human annotation process involving 93 software developers. Each sample was reviewed by three separate annotators to ensure the issue description is well-specified, the unit tests are appropriate, and the sample is free of other major issues that could make evaluation unreliable.

Methodology

For SWE-bench Verified, we recommend the log viewer as the best way to understand the evaluation settings (for example, the run logs for claude-3-7-sonnet-20250219).

We use the tools from SWE-Agent 1.0.1’s default config, as described in the SWE-Agent documentation.

We give models the following prompt:

You will be solving an open issue from the astropy/astropy repository. You have access to a Python environment containing the repository, and you will need to modify the code to solve the issue. When you are done, call the submit_answer tool to let us know.

  • Environment:
    • You have access to a Python environment containing the repository. You interact with the environment by calling various tools.
    • The environment does not have internet access.
  • Token limits:
    • There is a hard limit of 1,000,000 tokens. If you exceed this, the conversation will end immediately (even if you haven’t called the submit tool). The final state of the repository at that point will be used as your answer.
    • Both input and output tokens count towards the limit.
  • SWE-Agent Tools:
    • Available tools by category:
      • defaults: create, goto, open, scroll_down, scroll_up
      • search: find_file, search_dir, search_file
      • edit_replace: edit, insert
    • Additional guidance on the file viewing and editing tools:
      • The file viewing and editing tools let you view 100 lines at a time with an overlap of 2 lines.
      • You can use the scroll_up and scroll_down commands to navigate through larger files.
      • Avoid using the scrolling commands multiple times. Instead, use the goto command to jump to a specific line, or open the file at a specific line.
  • You also have access to the bash tool:
    • You can use the tool to execute arbitrary shell commands in the repository environment.
    • The tool has a timeout of 300 seconds.
  • The submit_answer tool takes no arguments and simply records that you have finished making changes to the repository.

Here is the issue you need to solve:

{issue}

Due to the cost of running evaluations, we run each model only once per issue. We set a limit of 1 million tokens per issue, which is mentioned in the prompt. Note that reaching the token limit does not mean the model has failed the task: the model does not need to formally submit an answer, so as long as the repository is left in a state that passes the tests, the task is scored as correctly completed.
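
The budget rule can be illustrated with a short sketch. The agent and task interfaces below are hypothetical stand-ins, not our harness code; only the budget logic mirrors the description above.

```python
TOKEN_LIMIT = 1_000_000  # hard cap on input + output tokens per issue

def run_episode(agent, task):
    """Hypothetical agent loop illustrating the budget rule above."""
    tokens_used = 0
    while not agent.has_submitted():
        reply = agent.step(task)  # one model request/response round
        tokens_used += reply.input_tokens + reply.output_tokens
        if tokens_used >= TOKEN_LIMIT:
            break  # the conversation ends immediately at the limit
    # Submission is optional for credit: the repository's final state
    # is graded by the unit tests either way.
```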

Gemini 2.5 models

For Gemini 2.5 models, we faced difficulties using the API between March and June 2025. Some requests systematically failed without an informative error message: they either returned a status code of 500 or sent no data over the network at all. In June 2025, Google staff confirmed that they did not intend to address the remaining issues we had documented.

As a result, as of July 2025, we evaluate Gemini 2.5 models with scoring rules specific to these models:

  • Requests that fail in the ways described above are retried up to 10 times, with exponential backoff and a maximum backoff of 30 minutes (sketched below).
  • If all 10 retries fail, the corresponding sample is marked as incorrect.
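
This is a standard capped exponential backoff. A minimal sketch, where request_fn, the base delay, and TransientAPIError are hypothetical stand-ins for the API call and the failure modes described above:

```python
import random
import time

MAX_RETRIES = 10
MAX_BACKOFF = 30 * 60  # seconds; backoff is capped at 30 minutes

class TransientAPIError(Exception):
    """Stands in for the 500s and empty responses described above."""

def call_with_retries(request_fn):
    for attempt in range(MAX_RETRIES + 1):  # initial try + 10 retries
        try:
            return request_fn()
        except TransientAPIError:
            if attempt == MAX_RETRIES:
                break  # out of retries
            # Exponential backoff with jitter; the 60 s base is a guess.
            delay = min(60 * 2 ** attempt, MAX_BACKOFF)
            time.sleep(delay + random.uniform(0, delay / 10))
    # All retries failed: the sample is marked as incorrect.
    return None
```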

Samples were marked as incorrect due to this policy in the following cases:

| Model | Benchmark | Accuracy | Samples marked as incorrect after 10 retries | Run ID |
|---|---|---|---|---|
| gemini-2.5-pro-preview-06-05 | FrontierMath-2025-02-28-Public | 30% (±15%) | 1/10 = 10% | DVgoky7pTLzriyfcj3oiTC |
| gemini-2.5-pro-preview-06-05 | FrontierMath-2025-02-28-Private | 10% (±2%) | 21/290 = 7% | hNLQNC6TASXNNzat8kVS77 |
| gemini-2.5-pro | FrontierMath-2025-02-28-Public | 40% (±16%) | 1/10 = 10% | g7m7nM2iRMskxAUbyrV766 |
| gemini-2.5-pro | FrontierMath-2025-02-28-Private | 11% (±2%) | 25/290 = 9% | HN7YDEAzGvPRMBDo7mGuXG |

Grok 4

For grok-4-0709, we experienced timeouts and network errors using the API in July 2025.

As a result, as of July 2025, we evaluated Grok 4 using specific scoring rules:

FrontierMath-2025-02-28-Private was evaluated using our standard settings; the record ID is gda5UeWrA8HcbDCRuLJ56H. We used the streaming API. 1 of 290 samples was not scored because the server sent no response. (We allow up to 1% of samples to fail without being scored.)

For OTIS Mock AIME 2024-2025, GPQA diamond, and FrontierMath-Tier-4-2025-07-01-Private, we used a maximum output length of 128,000 tokens per request (the default is no maximum), as recommended by xAI. If any requests failed due to network errors or timeouts, we moved the corresponding sample directly to the scoring phase of the evaluation (which generally causes it to be marked as incorrect). This was due to the highly time-sensitive nature of these evaluations.
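
As an illustration, these request settings map onto xAI's OpenAI-compatible API roughly as follows. The error handling is our sketch of the "move directly to scoring" rule, and attempt_sample is a hypothetical helper, not our actual harness code:

```python
from openai import APIConnectionError, APITimeoutError, OpenAI

# xAI exposes an OpenAI-compatible endpoint; model name and token cap
# are taken from the description above.
client = OpenAI(base_url="https://api.x.ai/v1", api_key="...")

def attempt_sample(messages: list[dict]) -> str | None:
    try:
        response = client.chat.completions.create(
            model="grok-4-0709",
            messages=messages,
            max_tokens=128_000,  # xAI-recommended cap (default: none)
        )
        return response.choices[0].message.content
    except (APITimeoutError, APIConnectionError):
        # Failed requests skip straight to scoring, which generally
        # marks the sample incorrect.
        return None
```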

| Benchmark | Accuracy | Samples with API errors | Run ID |
|---|---|---|---|
| OTIS Mock AIME 2024-2025 | 84% (±5%) | 4 out of 45 × 8 = 360 (1%) | cvTPRDCM38zSTn9Y3MUb9d |
| GPQA diamond | 87% (±2%) | 7 out of 198 × 8 = 1584 (0.4%) | A85Zfq2qguE4X9xXBweBHP |
| FrontierMath-Tier-4-2025-07-01-Private | 2% (±2%) | 8 out of 48 (16%) | QxtNUmV2L34UyrySmBLTbv |

xAI compensated us for this evaluation and provided compute credits. We signed no NDA and maintained complete editorial independence: we publish all results regardless of performance.