GBAEval

GBAEval evaluates AI coding agents on a long-horizon software engineering task: building a Game Boy Advance emulator from scratch in Rust and WebAssembly. Runs are graded with replay, procedural, and audio tests, so the score reflects functional correctness across video output, hardware behavior, and sound.

Methodology

We source results from the public GBAEval leaderboard. The leaderboard reports an overall score that combines the replay, procedural, and audio section scores. Our chart shows the overall score by default, and the individual section scores are also available.
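As a rough illustration of how three section scores can be combined into a single overall number, here is a minimal sketch. The weights below are placeholders, not the leaderboard's published values, and the function name `overall_score` is hypothetical. Section scores are assumed to lie between 0 and 1.

```python
# Placeholder weights, NOT the leaderboard's actual values (assumption).
# Replay is given the largest share; the rest is split between the other two.
HYPOTHETICAL_WEIGHTS = {"replay": 0.5, "procedural": 0.25, "audio": 0.25}


def overall_score(section_scores, weights=HYPOTHETICAL_WEIGHTS):
    """Weighted average of section scores, each assumed to lie in [0, 1]."""
    if set(section_scores) != set(weights):
        raise ValueError("section names must match the weight table")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * section_scores[name] for name in weights)
    return weighted_sum / total_weight


example = {"replay": 0.8, "procedural": 0.6, "audio": 0.4}
print(overall_score(example))
```

Dividing by the total weight keeps the result on the same scale as the section scores even if the placeholder weights are changed to values that do not sum to 1.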

In GBAEval, agents are asked to build a Game Boy Advance emulator in Rust and WebAssembly. Candidate emulators are graded against Mesen2, a reference emulator, across three categories. Replay tests run fixed button-input traces and compare the resulting video frames. Procedural tests run ROMs that exercise hardware behavior. DMA audio tests compare generated sound output. The overall score weights replay tests most heavily, with procedural and audio tests making up the remainder.
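To make the replay comparison concrete, the sketch below scores a candidate's frames against reference frames produced from the same input trace. It is an illustration only: the function name `replay_match_fraction` is hypothetical, and exact byte-for-byte equality is an assumption. The actual grader may use a different comparison or tolerate small differences.

```python
def replay_match_fraction(candidate_frames, reference_frames):
    """Fraction of reference frames that the candidate reproduces exactly.

    Frames are assumed to be raw pixel buffers (bytes). Exact equality is an
    assumption; the real grader may tolerate small differences.
    """
    if not reference_frames:
        raise ValueError("reference trace has no frames")
    matches = sum(
        1 for cand, ref in zip(candidate_frames, reference_frames) if cand == ref
    )
    # Frames the candidate never produced count as mismatches.
    return matches / len(reference_frames)


# Tiny 2-byte "frames" standing in for full 240x160 GBA frames.
reference = [b"\x00\x00", b"\x01\x01", b"\x02\x02", b"\x03\x03"]
candidate = [b"\x00\x00", b"\x01\x01", b"\xff\xff", b"\x03\x03"]
print(replay_match_fraction(candidate, reference))  # 3 of 4 frames match
```

Dividing by the number of reference frames, rather than the number of frames compared, means an emulator that crashes or stalls partway through a trace is penalized for the frames it never rendered.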