GSO
Global Software Optimization (GSO) is a benchmark that measures a model’s ability to improve the performance of a codebase by modifying its code. Performance engineering is a challenging part of software engineering that typically requires specialized knowledge and expertise. Implementing successful optimizations also requires strong reasoning that remains coherent over the long exploratory paths involved in developing them. These factors make optimization a challenging task for models, because it depends on more than the narrow coding abilities that many other benchmarks evaluate.
Questions were generated by scanning public GitHub repositories for past code changes that substantially sped up a program’s execution time. Models are given the code in its original state before the commit and tasked with modifying it to increase performance without breaking its functionality. Problems were curated by humans before entering the final benchmark set, with the aim of selecting sufficiently complex problems that require substantial insight and code modification to pass. The benchmark’s authors also tested problems on models in advance to identify any that could be completed too easily with trivial changes.
The benchmark includes performance tests, written for the programs in question, that are run as part of the evaluation to determine how much the model’s modification accelerated the program. Correctness tests ensure that the model’s changes do not break the program’s intended functionality.
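As a rough illustration of how such an evaluation step can work (this is a simplified sketch, not the benchmark’s actual harness, and the script paths and commands are placeholders), the idea is to run the correctness tests first and only then compare wall-clock timings against the unmodified repository:

```python
import subprocess
import time

def measure(script: str, repeats: int = 3) -> float:
    """Run a performance test script several times and keep the best wall-clock time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(["python", script], check=True)
        best = min(best, time.perf_counter() - start)
    return best

def evaluate(perf_script: str, correctness_cmd: list, baseline_seconds: float) -> float:
    """Hypothetical evaluation step: reject the patch if correctness tests fail,
    otherwise report the speedup relative to the unmodified repository."""
    if subprocess.run(correctness_cmd).returncode != 0:
        raise RuntimeError("correctness tests failed: patch rejected")
    return baseline_seconds / measure(perf_script)

# Example usage (paths and commands are placeholders):
# speedup = evaluate("perf_test.py", ["pytest", "tests/"], baseline_seconds=12.3)
```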
An example problem was based on the code before this commit to the NumPy repository, which improved the performance of the ufunc.at method. More examples are available in Appendix J of the paper.
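For a sense of the kind of workload involved, the snippet below (an illustration, not the benchmark’s own test) times np.add.at, an instance of the ufunc.at method whose performance that commit improved:

```python
import time
import numpy as np

# ufunc.at applies an operation in place at the given indices,
# accumulating correctly even when indices repeat.
values = np.zeros(1000)
indices = np.random.randint(0, 1000, size=1_000_000)
updates = np.random.rand(1_000_000)

start = time.perf_counter()
np.add.at(values, indices, updates)
print(f"np.add.at took {time.perf_counter() - start:.4f}s")
```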
A model’s score is the percentage of questions on which it programmed an optimization achieving at least 95% of the speedup attained by the original GitHub commit.
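Concretely, if speedup is taken to be the ratio of the original runtime to the optimized runtime, the pass criterion can be sketched as follows (a simplified reading of the metric, with hypothetical variable names):

```python
def passes(base_time, model_time, human_time, threshold=0.95):
    """Return True if the model's speedup reaches at least `threshold`
    (95%) of the speedup achieved by the original human commit."""
    model_speedup = base_time / model_time   # e.g. 2.0 means twice as fast
    human_speedup = base_time / human_time
    return model_speedup >= threshold * human_speedup

# Example: the human commit made the program 4x faster; a model patch
# making it 3.9x faster passes, since 3.9 >= 0.95 * 4.0.
print(passes(base_time=10.0, model_time=10.0 / 3.9, human_time=2.5))  # True
```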
The benchmark was introduced in GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents, also available on arXiv.
Methodology
We get our data from the GSO leaderboard.
The language models acted as agents using OpenHands, a popular framework that allows models to perform software engineering tasks on a computer, such as executing code, creating and modifying files, browsing the web, and interacting with the terminal.
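Conceptually, an agent scaffold of this kind alternates between the model proposing an action and the environment returning an observation. The sketch below is a generic illustration of that loop, not OpenHands’ actual API; the model.next_action interface is hypothetical:

```python
import subprocess

def run_command(command: str) -> str:
    """Execute a shell command in the workspace and return its output."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=300)
    return proc.stdout + proc.stderr

def agent_loop(model, task_prompt: str, max_steps: int = 50) -> None:
    """Generic agent scaffold: the model repeatedly proposes a shell command
    (or signals it is finished) and observes the result of running it."""
    history = [task_prompt]
    for _ in range(max_steps):
        action = model.next_action(history)  # hypothetical model interface
        if action.startswith("FINISH"):
            break
        history.append(run_command(action))
```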
The models received the following prompt:
“I’ve uploaded a python code repository in the directory workspace_dir_name. Consider the following test script showing an example usage of the repository:
<test_script>
[[ SPECIFICATION TEST]]
</test_script>
Can you help me implement the necessary changes to the repository so that the runtime of the <test_script> is optimized? Basic guidelines:
1. Your task is to make changes to non-test files in the /workspace directory to improve the performance of the <test_script>.
2. Make changes while ensuring the repository is functionally equivalent to the original.
3. Do not overoptimize for just the specific inputs in <test_script>. Make general performance improvements for the usage scenario shown.
4. You may need to rebuild the repo for your changes to take effect before testing. Some rebuilds may take time to run, so be patient with running them.
Follow these steps to improve performance:
1. As a first step, explore the repository structure.
2. Create a script in the /workspace directory (e.g., /workspace/test_opt.py) to reproduce and time the example, then execute it with python /workspace/<filename.py>.
3. Edit the source code of the repository to improve performance.
4. Rebuild and rerun your script to confirm that performance has improved.”
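Step 2 of the prompt asks the agent to write its own timing harness. A minimal sketch of what such a /workspace/test_opt.py might look like is below; the some_repo module and expensive_function are placeholders standing in for the repository under optimization:

```python
# /workspace/test_opt.py -- hypothetical timing harness an agent might write
import time

# Placeholder import: in practice this would be the repository being optimized.
from some_repo import expensive_function

def main(repeats: int = 5) -> None:
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        expensive_function()  # the workload from the specification test
        best = min(best, time.perf_counter() - start)
    print(f"best of {repeats} runs: {best:.4f}s")

if __name__ == "__main__":
    main()
```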
Reasoning models were allowed to use test-time reasoning, but models were otherwise evaluated under the same conditions. The amount of test-time compute allowed for each model is fixed manually and reported on the leaderboard. In the Opt@1 setting, the model is given only one attempt at each question; Opt@k is scored in the same way, but the model is given k attempts per question.
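Under one natural reading of that description, a question counts toward Opt@k if any of the k attempts clears the speedup threshold. A hedged sketch of that aggregation:

```python
def opt_at_k(attempt_results: list) -> float:
    """attempt_results[i] holds pass/fail booleans for the k attempts on question i.
    A question counts if any attempt passed; the score is the passing fraction."""
    solved = sum(1 for attempts in attempt_results if any(attempts))
    return solved / len(attempt_results)

# Example: 3 questions, 2 attempts each -> 2 of 3 solved -> ~0.67
print(opt_at_k([[False, True], [False, False], [True, True]]))
```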
The evaluation code is available on Github.