About GSO

Performance engineering is a demanding part of software engineering that requires deep skill and computer science understanding. GSO evaluates models on their ability to optimize the performance of programs. Its problems are constructed automatically from performance optimizations made by human developers in real-world software packages on GitHub. The model is shown the package's code in the state before the optimization was made and is tasked with improving its performance; the result is then measured against what the human developer achieved. The reported metric, Opt@K, is the percentage of problems on which the model achieves at least 95% of the human speedup within K attempts.
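
To make the success criterion concrete, here is a minimal sketch of how a single attempt could be scored; the function name, inputs, and structure are our own illustration, not GSO's evaluation code:

def attempt_succeeds(runtime_before, runtime_after, human_speedup, threshold=0.95):
    # Hypothetical per-attempt check: the model's speedup on the target test must
    # reach at least 95% of the speedup achieved by the human developer's commit.
    # (GSO also requires the repository to stay functionally equivalent; this
    # sketch only looks at runtime.)
    model_speedup = runtime_before / runtime_after
    return model_speedup >= threshold * human_speedup

# Example: the human commit gave a 3.0x speedup; the model cut runtime from 12s to 4.5s.
print(attempt_succeeds(12.0, 4.5, human_speedup=3.0))  # 12/4.5 ≈ 2.67x < 2.85x, so False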

Methodology

We get our data from the GSO leaderboard.

The language models were run as agents using OpenHands, a popular framework that lets models perform software engineering tasks on a computer: executing code, creating and modifying files, browsing the web, and interacting with the terminal.

The models received the following prompt:

I’ve uploaded a python code repository in the directory workspace_dir_name. Consider the following test script showing an example usage of the repository:
<test_script>
[[ SPECIFICATION TEST]]
</test_script>
Can you help me implement the necessary changes to the repository so that the runtime of the <test_script> is optimized? Basic guidelines:
1. Your task is to make changes to non-test files in the /workspace directory to improve the performance of the <test_script>.
2. Make changes while ensuring the repository is functionally equivalent to the original.
3. Do not overoptimize for just the specific inputs in <test_script>. Make general performance improvements for the usage scenario shown.
4. You may need to rebuild the repo for your changes to take effect before testing. Some rebuilds may take time to run, so be patient with running them.
Follow these steps to improve performance:
1. As a first step, explore the repository structure.
2. Create a script in the /workspace directory (e.g., /workspace/test_opt.py) to reproduce and time the example, then execute it with python /workspace/<filename.py>.
3. Edit the source code of the repository to improve performance.
4. Rebuild and rerun your script to confirm that performance has improved.
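
For concreteness, the timing script requested in step 2 might look roughly like the following; this is a hypothetical stand-in, with a dummy workload in place of the repository usage from a real <test_script>:

# /workspace/test_opt.py -- hypothetical timing harness
import statistics
import time

def run_example():
    # Stand-in workload; a real script would import the repository and
    # reproduce the usage shown in <test_script>.
    return sum(i * i for i in range(1_000_000))

if __name__ == "__main__":
    runtimes = []
    for _ in range(5):  # repeat to reduce timing noise
        start = time.perf_counter()
        run_example()
        runtimes.append(time.perf_counter() - start)
    print(f"median runtime over {len(runtimes)} runs: {statistics.median(runtimes):.4f}s")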

Reasoning models were allowed to use test-time reasoning, but models were otherwise evaluated identically. The amount of test-time compute allowed for each model is fixed manually and reported on the leaderboard. In the Opt@1 setting, the model is given only one attempt at each problem; Opt@K is scored the same way but with K attempts per problem.
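
As an illustration of how Opt@K could be aggregated across problems, here is a minimal sketch assuming a simple any-of-K rule; the data layout and names are ours, not the leaderboard's code:

def opt_at_k(results, k):
    # results[i][j] is True if attempt j on problem i reached at least 95% of
    # the human speedup. Under this reading, a problem counts as solved for
    # Opt@K if any of its first k attempts succeeded, and the score is the
    # fraction of solved problems.
    solved = sum(any(attempts[:k]) for attempts in results)
    return solved / len(results)

# Example: two problems, two attempts each.
print(opt_at_k([[False, True], [False, False]], k=1))  # 0.0
print(opt_at_k([[False, True], [False, False]], k=2))  # 0.5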

The evaluation code is available on GitHub.