
These benchmarks track a wide range of digital work. Progress will correlate with economic utility, but the tasks are too self-contained to indicate full automation.

In this episode, Daniel Litt chats with the hosts about AI’s limits in mathematics, accelerating math research, and how to measure progress on open problems.

Is this because skills generalize very well, or because developers are pushing on all benchmarks at once?

We review OSWorld, a prominent computer-use benchmark. Its tasks are relatively simple, many don't require GUIs, and success often hinges on interpreting ambiguous instructions. The benchmark is also not stable over time.

57% of problems have been solved at least once.

No company has gone from $10B to $100B as quickly as OpenAI projects it will.

It has improved at using background knowledge and doing precise computations. It can be a helpful research assistant and may take a more conceptual approach to geometry. It shows limited creativity and sometimes struggles with citations.

The problems gave AI only a slim chance to show new capabilities.

It’s good at involved computations, improving at proofs from a low base, and useful for literature search. It still favors low-level grinds and leans on background knowledge.
Most discussion about AI and the IMO focuses on gold medals, but that's not what deserves the most attention.


We investigate the validity of the GPQA Diamond benchmark: its flawed questions, where models struggle, and why it still informs AI evaluation.