About SWE-bench (Bash Only)

SWE-bench Verified is a human-validated subset of the original SWE-bench dataset, consisting of 500 samples that evaluate AI models’ ability to solve real-world software engineering issues. Each sample is derived from an actual GitHub issue in one of 12 open-source Python repositories.
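Each sample bundles the issue text with everything needed to reproduce and grade a fix. The sketch below illustrates the rough shape of one record; the field names follow the published SWE-bench schema, but treat the exact names and the sample values here as assumptions, not a definitive listing.

```python
# Illustrative shape of a single SWE-bench Verified sample.
# Field names are based on the public SWE-bench dataset schema;
# the values shown are placeholders, not real data.
sample = {
    "repo": "astropy/astropy",                 # one of the 12 source repos
    "instance_id": "astropy__astropy-XXXXX",   # unique sample identifier
    "base_commit": "<sha the repo is checked out at>",
    "problem_statement": "<text of the GitHub issue to fix>",
    "FAIL_TO_PASS": ["<tests that must pass after the fix>"],
    "PASS_TO_PASS": ["<tests that must keep passing>"],
}

# A harness would check out `repo` at `base_commit`, show the model
# `problem_statement`, and grade the result with the two test lists.
assert sample["repo"].count("/") == 1  # repos are named "owner/name"
```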

The evaluation workflow is as follows: the model is given access to a code repository and a description of an issue that needs to be fixed. The model must then modify the appropriate files in the repository to resolve the issue. Once the model has made its changes, the solution is evaluated by running unit tests on the modified codebase, including tests that must flip from failing to passing for the fix to count.
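The pass/fail signal ultimately comes from a test runner's exit code on the patched repository. The toy harness below is a minimal, self-contained sketch of that loop using a throwaway "repository" and the standard library; the real benchmark instead applies the model's diff to a pinned checkout of the actual project, so `run_tests`, the file layout, and the in-place edit here are all illustrative assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_tests(repo_dir: Path) -> bool:
    """Run the repo's unit tests; exit code 0 means the issue is resolved."""
    result = subprocess.run(
        [sys.executable, "-m", "unittest", "discover"],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0

# Toy stand-in for a repository with an open issue: a module with a bug
# and a unit test that exposes it.
repo = Path(tempfile.mkdtemp())
(repo / "calc.py").write_text("def add(a, b):\n    return a - b  # bug\n")
(repo / "test_calc.py").write_text(
    "import unittest\n"
    "import calc\n"
    "class T(unittest.TestCase):\n"
    "    def test_add(self):\n"
    "        self.assertEqual(calc.add(2, 3), 5)\n"
)

before = run_tests(repo)   # False: the test reproduces the issue

# The "model's edit" -- in the real benchmark this is a diff the model
# produces after exploring the repository, not a hardcoded rewrite.
(repo / "calc.py").write_text("def add(a, b):\n    return a + b\n")

after = run_tests(repo)    # True: the fix resolves the issue
```

The before/after contrast mirrors how a fix is graded: the same test command is run twice, and only the exit code changes.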

The dataset was carefully curated through a rigorous human annotation process involving 93 software developers. Each sample was reviewed by three separate annotators to ensure the issue description is well-specified, the unit tests are appropriate, and the sample is free of other major issues that could make evaluation unreliable.