ExploitBench

ExploitBench, created by researchers at Carnegie Mellon University, evaluates how far LLM-based agents can progress when attempting to exploit real software vulnerabilities. Rather than scoring success or failure as a binary, it measures progressive capability along a ladder that runs from reaching vulnerable code, to triggering a crash, to building exploitation primitives, up to full arbitrary code execution.

The benchmark targets N-day vulnerabilities (bugs that already have a patch) in Chromium’s V8 JavaScript and WebAssembly engine, evaluated against default release builds with mitigations such as the heap sandbox, ASLR, and stack canaries enabled.

Methodology

We source results from the public ExploitBench leaderboard.

ExploitBench comprises 41 V8 bugs, all reported in 2024 or later. The exploitation process is decomposed into 16 “flags” grouped into five tiers, ranging from T5 (coverage of vulnerable code) down to T1 (control-flow hijacking and code execution). Agents run under a uniform turn budget with multiple seeds per bug, and a deterministic oracle grades each flag. Our chart plots the leaderboard’s mean capability metric, which is the average fraction of the 16 flags an agent captures across seeds; the underlying mean flag count (on a 0–16 scale) is available in the tooltip.
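The relationship between the two reported numbers can be illustrated with a minimal sketch. It assumes that flag counts are first averaged over seeds for each bug and then averaged over bugs; the leaderboard's exact aggregation order is not stated here, so treat that as an assumption. The function names, bug identifiers, and flag counts below are hypothetical and are not taken from the benchmark.

```python
from statistics import mean

TOTAL_FLAGS = 16  # the text states 16 graded flags per bug


def mean_flag_count(runs):
    """Average flags captured, on the 0-16 scale.

    `runs` maps a bug id to a list of per-seed flag counts.
    Assumption: average over seeds within a bug, then over bugs.
    """
    per_bug = []
    for bug_id, seed_counts in runs.items():
        for count in seed_counts:
            if not 0 <= count <= TOTAL_FLAGS:
                raise ValueError(f"{bug_id}: flag count {count} out of range")
        per_bug.append(mean(seed_counts))
    return mean(per_bug)


def mean_capability(runs):
    """Mean capability: the mean flag count as a fraction of all 16 flags."""
    return mean_flag_count(runs) / TOTAL_FLAGS


# Hypothetical data: three bugs, three seeds each (not real results).
example_runs = {
    "bug-a": [4, 6, 5],
    "bug-b": [0, 2, 1],
    "bug-c": [16, 12, 14],
}

flags = mean_flag_count(example_runs)
capability = mean_capability(example_runs)
print(f"mean flag count: {flags:.2f} / {TOTAL_FLAGS}")
print(f"mean capability: {capability:.1%}")
```

Under this assumption the charted value and the tooltip value carry the same information: one is the other divided by 16.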

For full details, see the ExploitBench paper and code.
