ExploitBench, created by researchers at Carnegie Mellon University, evaluates how far LLM-based agents can progress when attempting to exploit real software vulnerabilities. Rather than scoring success or failure as a binary, it measures progressive capability along a ladder that runs from reaching vulnerable code, to triggering a crash, to building exploitation primitives, up to full arbitrary code execution.
The benchmark targets N-day vulnerabilities (bugs that already have a patch) in Chromium’s V8 JavaScript and WebAssembly engine, evaluated against default release builds with mitigations such as the heap sandbox, ASLR, and stack canaries enabled.
We source results from the public ExploitBench leaderboard.
ExploitBench comprises 41 V8 bugs, all reported in 2024 or later. The exploitation process is decomposed into 16 “flags” grouped into five tiers, ranging from T5 (coverage of vulnerable code) down to T1 (control-flow hijacking and code execution). Agents run under a uniform turn budget with multiple seeds per bug, and a deterministic oracle grades each flag. Our chart plots the leaderboard’s mean capability metric, which is the average fraction of the 16 flags an agent lights across seeds; the underlying mean flag count (on a 0–16 scale) is available in the tooltip.
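To make the relationship between the two reported numbers concrete, here is a minimal Python sketch of how a mean flag count on the 0–16 scale maps to a capability fraction. The function names and the example flag counts are hypothetical, chosen only for illustration. How the leaderboard itself aggregates across bugs is not described here, so the sketch simply averages over a flat list of runs.

```python
from statistics import mean

# Number of graded flags per bug, as stated in the description above.
TOTAL_FLAGS = 16


def mean_flag_count(flags_per_run):
    """Average number of flags lit per run (0-16 scale).

    `flags_per_run` is a hypothetical list with one integer per run (seed).
    """
    if not flags_per_run:
        raise ValueError("need at least one run")
    for n in flags_per_run:
        if not 0 <= n <= TOTAL_FLAGS:
            raise ValueError(f"flag count {n} is outside 0..{TOTAL_FLAGS}")
    return mean(flags_per_run)


def mean_capability(flags_per_run):
    """Average fraction of flags lit per run (0.0-1.0 scale)."""
    return mean_flag_count(flags_per_run) / TOTAL_FLAGS


# Illustrative flag counts for three seeds; these are not real results.
example_runs = [4, 8, 12]
print(mean_flag_count(example_runs))  # 8
print(mean_capability(example_runs))  # 0.5
```

Under these assumptions, the charted value is the tooltip value divided by 16, so an agent averaging 8 flags per run would plot at 0.5.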
For full details, see the ExploitBench paper and code.
A benchmark measuring how far LLM agents can climb a "capability ladder" of software exploitation against real vulnerabilities in hardened software.