ExploitBench

ExploitBench, created by researchers at Carnegie Mellon University, evaluates how far LLM-based agents can progress when attempting to exploit real software vulnerabilities. Rather than scoring each attempt as a binary success or failure, it measures progressive capability along a ladder that runs from reaching vulnerable code, to triggering a crash, to building exploitation primitives, up to full arbitrary code execution.

The benchmark targets N-day vulnerabilities (bugs that already have a patch) in Chromium’s V8 JavaScript and WebAssembly engine, evaluated against default release builds with mitigations such as the heap sandbox, ASLR, and stack canaries enabled.

Methodology

We source results from the public ExploitBench leaderboard.

ExploitBench comprises 41 V8 bugs, all reported in 2024 or later. The exploitation process is decomposed into 16 “flags” grouped into five tiers, ranging from T5 (coverage of vulnerable code) down to T1 (control-flow hijacking and code execution). Each agent runs under a uniform turn budget with multiple seeds per bug, and a deterministic oracle grades every flag. Our chart plots the leaderboard’s mean capability metric: the fraction of flags an agent captures, averaged across seeds. The underlying mean flag count (on a 0–16 scale) is available in the tooltip.
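
To make the relationship between the two reported numbers concrete, here is a minimal Python sketch of how a mean flag count and a mean capability fraction could be derived from per-run flag counts. It is an illustration only, not the leaderboard's actual scoring code: the function names and the sample numbers are hypothetical, and the exact way the leaderboard pools runs across bugs and seeds is assumed here to be a plain average.

```python
TOTAL_FLAGS = 16  # number of graded flags per bug, as stated in the text


def mean_flag_count(flags_per_run):
    """Average number of flags earned per run, on a 0 to 16 scale.

    `flags_per_run` is a list of integer flag counts, one per run
    (hypothetical input format; a plain average is an assumption).
    """
    if not flags_per_run:
        raise ValueError("need at least one run")
    for count in flags_per_run:
        if not 0 <= count <= TOTAL_FLAGS:
            raise ValueError(f"flag count out of range: {count}")
    return sum(flags_per_run) / len(flags_per_run)


def mean_capability(flags_per_run):
    """Average fraction of flags earned per run, between 0.0 and 1.0."""
    return mean_flag_count(flags_per_run) / TOTAL_FLAGS


# Made-up flag counts for three seeds of one agent on one bug.
example_runs = [4, 8, 12]
print(mean_flag_count(example_runs))  # 8.0
print(mean_capability(example_runs))  # 0.5
```

Under these assumptions the two figures differ only by a factor of 16, so the chart value and the tooltip value carry the same information on different scales.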

For full details, see the ExploitBench paper and code.