Podcast
May 1, 2026

Are AI benchmarks doomed?

In this episode, Greg Burnham and Tom Adamczewski join Anson Ho to push back on benchmark pessimism and dig into what the next generation of AI benchmarks could look like.

Greg Burnham leads Epoch’s benchmarking team. Tom Adamczewski is a senior research engineer who develops new benchmarks, including MirrorCode.

Topics we cover: why benchmark saturation isn’t as alarming as it seems, how AI can speed up benchmark development, the benchmark-reality gap, whether an AGI benchmark can exist, and the role of human evaluation in future benchmarks.

We also discuss MirrorCode, a benchmark of long-horizon coding tasks co-developed by Epoch and METR, and FrontierMath: Open Problems, Epoch’s benchmark of real unsolved math research problems.


In this podcast