Are AI benchmarks doomed?

Greg Burnham leads Epoch’s benchmarking team. Tom Adamczewski is a senior research engineer who develops new benchmarks, including MirrorCode.

Topics we cover: why benchmark saturation isn’t as alarming as it seems, how AI can speed up benchmark development, the benchmark-reality gap, whether an AGI benchmark can exist, and the role of human evaluation in future benchmarks.

We also discuss MirrorCode, a benchmark (co-developed by Epoch and METR) of long-horizon coding tasks, and FrontierMath: Open Problems, Epoch’s benchmark of real unsolved math research problems.

Transcript

In this podcast

Greg Burnham

Greg Burnham is a researcher at Epoch AI. Prior to this, he worked at Elemental Cognition and Bridgewater Associates. He has a BA in mathematics from Princeton University.

Tom Adamczewski

Tom Adamczewski started Epoch AI's benchmark engineering team. He now works on developing new evals to measure economically important AI capabilities. Before Epoch AI, he created a Monte Carlo simulation application and worked on payments technology.

Anson Ho

Anson Ho is a researcher at Epoch AI. He is interested in helping develop a more rigorous understanding of future developments in AI and its societal impacts.

AI Progress

Industry

Infrastructure

Impacts

Featured

Publications

Data explorers

Benchmarks by Epoch AI

Papers & Reports

Data Insights

Newsletter

Podcast

Capabilities

Models

Frontier Data Centers

Chip Owners

Companies

Polling on AI Use

Epoch Capabilities Index

FrontierMath: Open Problems

FrontierMath: Tiers 1-4

Are AI benchmarks doomed?

Transcript

Are AI benchmarks doomed? [00:00:36]

The costs and benefits of benchmark development [00:03:13]

MirrorCode and scalable benchmarks [00:11:48]

AI speed-up in benchmark development [00:20:57]

The benchmark-reality gap [00:23:28]

Can an AGI benchmark exist? [00:38:26]

Beyond automated scoring [00:43:18]

How AI changes benchmark building in practice [01:00:45]

In this podcast

Are AI benchmarks doomed?

AI Progress

Industry

Infrastructure

Impacts

Featured

Publications

Data explorers

Benchmarks by Epoch AI

Scaling

Software progress

Open models

Capabilities

Math

Leading companies

Finances

Geopolitics

Chips

Data centers

Energy

Adoption and use

Economic impact

Future of AI

Publications

Papers & Reports

Data Insights

Newsletter

Podcast

Data explorers

Capabilities

Models

Frontier Data Centers

Chip Owners

Companies

Polling on AI Use

Benchmarks by Epoch AI

Epoch Capabilities Index

FrontierMath: Open Problems

FrontierMath: Tiers 1-4

About Epoch AI

Donate

Team

Careers

Consultations

For press

Transparency

Transcript

Are AI benchmarks doomed? [00:00:36]

The costs and benefits of benchmark development [00:03:13]

MirrorCode and scalable benchmarks [00:11:48]

AI speed-up in benchmark development [00:20:57]

The benchmark-reality gap [00:23:28]

Can an AGI benchmark exist? [00:38:26]

Beyond automated scoring [00:43:18]

How AI changes benchmark building in practice [01:00:45]

In this podcast

Related topics