MirrorCode

What's the largest software project AI can complete on its own?

AI has made rapid progress on software engineering benchmarks in the past few years. However, most such benchmarks tend to focus on shorter tasks like fixing bugs or implementing individual features. MirrorCode is our benchmark, co-developed with METR, to test AI models on long-horizon coding tasks. In a MirrorCode task, AI models are tasked with reimplementing an entire program end-to-end, without access to the original source code. AI-generated solutions must match the original program’s output exactly on end-to-end tests, including held-out tests. MirrorCode’s 25 target programs span different areas of computing: Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression.

How MirrorCode is different

Scale-aware evaluations

Crucially, we provide a large enough inference budget to make a serious attempt at MirrorCode tasks. Many existing software engineering benchmarks limit inference spending to around $1–10, even when the task would take weeks for a human to complete. For example, one of the largest MirrorCode tasks cost $2,600 for a single run and involved AI working for 19 days without human intervention.

Difficult, but fair

Reimplementing entire programs is extremely challenging for human software engineers. We believe a human engineer without AI would take months to solve the most complex MirrorCode tasks. However, MirrorCode tasks are also feasible; we know that there is enough information for the tasks to be fair.

Cheat-resistant by design

We sandbox AI models, requiring them to conduct their work without access to the internet, without access to the original codebase, and with no way to cheat on the task. There are end-to-end tests that models never see while developing their code, so they cannot simply create a lookup table to mimic the original program's outputs.

AI can already perform some long-horizon coding tasks

AI can already solve long-horizon MirrorCode tasks, despite their difficulty. For example, Claude Opus 4.7 reimplemented gotree: a bioinformatics toolkit with ~16,000 lines of Go and 40+ commands.1 We believe this same task would take a human engineer without AI assistance 2–17 weeks. Opus 4.7 solved it in 14 hours, costing $251.

However, MirrorCode is not fully solved. Claude Opus 4.7’s headline score is only 56%, meaning there is significant room for further improvement.2 We look forward to evaluating new models on the benchmark.

We also found that AI models are improving rapidly over time. Leading models from a year ago would have scored about 30%, and were limited to simpler programs, such as a calendar utility. There was no clear overall trend in cost: GPT-5.5 cost 3× more than GPT-5 to solve the same tasks, whereas Claude Opus 4.7 was 3× cheaper than Claude Opus 4.1.

One important caveat to these results is data contamination. Because MirrorCode tasks involve reimplementing open-source programs, AI models are likely to have seen the original codebases in pretraining. This might lead to inflated performance on the benchmark. However, AI successfully reimplemented several target programs that passed our memorization screen, and failed to reimplement programs where the screen showed evidence of memorization. This suggests that the results were not dominated by memorization, but we cannot rule out the possibility that memorization contributes to AI performance. Overall, we expect that the capabilities measured by MirrorCode would generalize to an unseen codebase. We discuss this further, along with more results and details on benchmark construction, in the paper.

Open-source code

We release our scaffold and 22 of the 25 MirrorCode target programs (totaling 132 task instances across the six supported programming languages) as open-source, with the other three targets held out as a private test set.

This work was co-developed with METR and supported by a grant from METR. The authors of MirrorCode are Tom Adamczewski, David Owen, and David Rein. Florian Brand, Giles Edkins, Allen Hart, and Daniel O’Connell contributed additional target programs. Rasmus Faber-Espensen made crucial infrastructure improvements and gave advice on engineering

Notes
  1. The best-scoring AI gotree implementations passed 2000/2001 tests, but failed a single edge-case test for a niche command to manipulate date annotations. Consequently, they do not strictly solve the task to 100% completion, but we consider the reimplementation near-perfect, covering essentially all scoped functionality. Return

  2. On 21/25 MirrorCode targets, AI models have at least once passed 99% of tests or more. Typically, outstanding test failures are from a handful of edge cases. At the stricter threshold of reimplementation (100% of tests passing), eight MirrorCode targets have never been solved in any run. Benchmark scores are lower than 17/25 ≈ 70% because several targets are not solved reliably: AI solves them only in some runs. Return