About Terminal-Bench

A Terminal-Bench task is a folder containing an instruction, a Docker environment, and a test script. To complete a task successfully, models need to understand the shell environment and how to use its programs, as well as the state of the machine, including its filesystem and running processes. They must also plan coherently and execute the necessary steps without being told explicitly what those steps are. A sketch of a task folder appears below.
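
To make the folder structure concrete, here is a minimal Python sketch that checks whether a directory looks like a task. The specific file names (task.yaml, Dockerfile, run-tests.sh) are assumptions for illustration; consult the Terminal-Bench repository for the actual schema.

```python
from pathlib import Path

# Hypothetical layout of a Terminal-Bench task folder.
# File names below are illustrative assumptions, not the official schema.
EXPECTED_FILES = [
    "task.yaml",      # natural-language instruction and task metadata
    "Dockerfile",     # defines the container the agent works inside
    "run-tests.sh",   # script that checks the final machine state
]

def looks_like_task(folder: Path) -> bool:
    """Return True if the folder contains the expected task components."""
    return all((folder / name).exists() for name in EXPECTED_FILES)

print(looks_like_task(Path("tasks/example-task")))
```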

Methodology

Models run on this benchmark are paired with an agent tool that lets them interact with the terminal environment across multiple turns. These agent tools include Anthropic’s Claude Code, OpenAI’s Codex CLI, Terminus, and Goose. Performance is reported per model-agent combination because the choice of agent can substantially affect results. Some of these tools enforce or dynamically adjust the model’s level of reasoning effort in a way that a user or evaluator cannot override or inspect in this context.
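
To illustrate the model-agent split, the following is a minimal, hypothetical agent loop in Python: the model proposes a shell command, the harness executes it, and the output is fed back as context. The function names and stopping condition are illustrative assumptions, not Terminal-Bench’s actual interface.

```python
import subprocess

def propose_command(history: list[str]) -> str:
    """Placeholder for a model call; a real agent would query an LLM here."""
    return "echo done" if history else "ls /"

def agent_loop(max_steps: int = 10) -> list[str]:
    """Hypothetical propose-execute-observe loop between a model and a shell."""
    history: list[str] = []
    for _ in range(max_steps):
        cmd = propose_command(history)
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        history.append(f"$ {cmd}\n{result.stdout}{result.stderr}")
        if cmd == "echo done":  # illustrative stopping condition
            break
    return history

for step in agent_loop():
    print(step)
```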

Scores on the leaderboard are the percentage of tasks in the task set that were solved, generally averaged over multiple runs of the entire set.
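
As a worked example of this calculation, the sketch below computes a score from per-task pass/fail results across repeated runs. This is illustrative arithmetic, not the official harness code.

```python
def leaderboard_score(runs: list[list[bool]]) -> float:
    """Average, over repeated runs, of the fraction of tasks solved.

    `runs` holds one list per run, recording pass/fail for each task.
    (Illustrative; the official harness computes scores internally.)
    """
    per_run = [sum(run) / len(run) for run in runs]
    return 100 * sum(per_run) / len(per_run)

# Two runs over the same 4-task set: 3/4 and 2/4 solved -> 62.5
print(leaderboard_score([[True, True, True, False],
                         [True, False, True, False]]))
```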

The code for the benchmark is available in the Terminal-Bench GitHub repository. We source our data from the Terminal-Bench leaderboard.