AI systems can now write code, pass professional exams, and assist with scientific research, and their capabilities are improving remarkably fast. But measuring exactly what AI can and cannot do is genuinely difficult, with benchmarks struggling to keep pace and real-world performance often diverging from test scores. Epoch tracks AI capabilities across tasks and benchmarks, examining how fast progress is happening, how predictable it is, and what it reveals about where the technology is heading.



Give up at least one of: text only, short time horizon, easy to grade, and expert human superiority.

In this episode, Greg Burnham and Tom Adamczewski join Anson Ho to push back on benchmark pessimism and dig into what the next generation of AI benchmarks could look like.

We investigate progress trends on four capability metrics to determine whether AI capabilities have recently accelerated. Three of four metrics show strong evidence of acceleration, driven by reasoning models.

In our new benchmark, MirrorCode, Claude Opus 4.6 autonomously reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks.

Can Chinese and open model companies compete with the frontier through, for example, distillation and talent?

We release a database of over 1,100 biological AI models across nine categories. We analyze their safeguards, accessibility, training data sources, and the foundation models they build on.

These benchmarks track a wide range of digital work. Progress will correlate with economic utility, but the tasks are too self-contained to indicate full automation.

We assess the current state of autonomous robotics by evaluating robot performance on concrete tasks across industrial, household, and navigation domains.

Beyond benchmarks as leading indicators for task automation

In this episode, Daniel Litt chats with the hosts about AI’s limits in mathematics, accelerating math research, and how to measure progress on open problems.


Mostly right about benchmarks, mixed results on real-world impacts



Running benchmarks involves many moving parts, each of which can influence the final score. The two most impactful components are scaffolds and API providers.

Most benchmarks saturate too quickly to study long-run AI trends. We solve this using a statistical framework that stitches benchmarks together, with big implications for algorithmic progress and AI forecasting.

Is this because skills generalize very well, or because developers are pushing on all benchmarks at once?


We review OSWorld, a prominent computer use benchmark. Its tasks are relatively simple, many don’t require GUIs, and success often hinges on interpreting ambiguous instructions. It is also not stable over time.


57% of problems have been solved at least once.

It has improved at using background knowledge and doing precise computations. It can be a helpful research assistant and may take a more conceptual approach to geometry. It shows limited creativity and sometimes struggles with citations.


Continual learning, scaling RL, and research feedback loops

If scaling persists to 2030, AI investments will reach hundreds of billions of dollars and require gigawatts of power. Benchmarks suggest AI could improve productivity in valuable areas such as scientific R&D.


Many multi-agent setups are based on fancy prompts, but this is unlikely to persist

The problems gave AI only a slim chance to show new capabilities
Reasoning models were as big an improvement as the Transformer, at least on some benchmarks

It’s good at involved computations, improving at proofs from a low base, and useful for literature search. It still favors low-level grinds and leans on background knowledge.

We are releasing a public registry of optimized Docker images for SWE-bench. This allows us to run SWE-bench Verified in 62 minutes on a single GitHub Actions VM.
Most discussion about AI and the IMO focuses on gold medals, but that's not the thing to pay most attention to.


We take a deep dive into SWE-bench Verified, a prominent agentic coding benchmark. While one of the best public tests of AI coding agents, it is limited by its focus on simple bug fixes in familiar open-source repositories.
Assessing whether AI labs' biorisk evaluations effectively measure models' potential to enable amateur bioweapons development.

Examining o3-mini's math reasoning: an erudite, vibes-based solver that excels in knowledge but lacks precision, creativity, and formal human rigor.
Investigating the GPQA Diamond benchmark's validity: uncovering flawed questions, model challenges, and why it still informs AI evaluation.

How do humans and AIs compare on FrontierMath? We ran a competition at MIT to put this to the test.
Why don't AIs automate more real-world tasks if they can handle 1-hour ones? Anson Ho explores key capability and context bottlenecks.

Historically, AI benchmarks haven't reflected real-world impacts because they weren't optimized to do so, not because of fundamental limitations. But this might be changing.
Forecasting AI progress requires more than extrapolating current capabilities; understanding fundamental task difficulty is key to predicting future breakthroughs.
AI reasoning models will achieve superhuman performance in math and coding, yet their economic applications will lag behind, limiting real-world impact.

AI progress is accelerating, with next-gen models surpassing GPT-4 in compute power, driving major leaps in reasoning, coding, and math capabilities.

We've overhauled our AI benchmarking infrastructure to provide more transparent, systematic, and up-to-date evaluations of AI model capabilities.

We've expanded our Biology AI Dataset, now covering 360+ models. Our analysis reveals rapid scaling from 2017-2021, followed by a notable slowdown in biological model development.

We clarify that OpenAI commissioned Epoch AI to produce 300 math questions for the FrontierMath benchmark. OpenAI owns these questions and has access to the statements and solutions, except for a 50-question holdout set.

We are hosting a competition to establish rigorous human performance baselines for FrontierMath. With a prize pool of $10,000, your participation will contribute directly to measuring AI progress in solving challenging mathematical problems.
This Gradient Updates issue explains Moravec's paradox and offers a speculative picture of how hard various economic tasks are to automate based on the paradox.

How will AI transform mathematics? Fields Medalists and other leading mathematicians discuss whether they expect AI to automate advanced math research.

We are launching the AI Benchmarking Hub: a platform presenting our evaluations of leading models on challenging benchmarks, with analysis of trends in AI capabilities.




FrontierMath: a new benchmark of expert-level math problems designed to measure AI's mathematical abilities. See how leading AI models perform against the collective mathematics community.

We compare open and closed AI models, and study how openness has evolved. The best open model today is on par with closed models in performance and training compute, but with a lag of about one year.

The expanded Epoch database now includes biological sequence models, revealing potential regulatory gaps in the White House’s Executive Order on AI, as well as growth in the compute used to train these models.

We investigate large language model performance across five orders of magnitude of compute scaling, finding that compute-focused extrapolations are a promising way to forecast AI capabilities.
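As a rough sketch of what a compute-focused extrapolation can look like (the data points and the logistic functional form here are illustrative assumptions, not the fitting procedure from the report): fit benchmark accuracy against log training compute, then read off the prediction at a larger compute budget.

# Illustrative sketch: accuracy modeled as a logistic function of log10 compute.
# The data points below are made up; swap in real (compute, score) pairs to use it.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_compute, midpoint, slope):
    # Benchmark accuracy as a logistic function of log10 training compute (FLOP).
    return 1.0 / (1.0 + np.exp(-slope * (log_compute - midpoint)))

log_flop = np.array([21.0, 22.0, 23.0, 24.0, 25.0])   # hypothetical models
accuracy = np.array([0.10, 0.22, 0.45, 0.68, 0.83])   # hypothetical scores

params, _ = curve_fit(logistic, log_flop, accuracy, p0=[23.0, 1.0])
print(f"Predicted accuracy at 1e26 FLOP: {logistic(26.0, *params):.2f}")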

We develop a model for predicting record improvements in video game speedrunning and apply it to predicting machine learning benchmarks. This model suggests that machine learning benchmarks are not close to saturation, and that large sudden improvements are infrequent, but not ruled out.

We summarize and compare several models and forecasts predicting when transformative AI will be developed.

We explore how to estimate the probability of an event given information of past occurrences. We explain a problem with the naive application of Laplace’s rule in this context, and suggest a modification to correct it.
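For reference, the standard rule of succession under discussion: after observing s occurrences in n independent trials, Laplace's rule estimates the probability of an occurrence on the next trial as

P(\text{occurrence on trial } n+1 \mid s \text{ occurrences in } n \text{ trials}) = \frac{s+1}{n+2}

so with zero occurrences the estimate is 1/(n+2). A well-known difficulty with applying this naively to events in continuous time is that the answer depends on how finely the timeline is divided into trials; the specific correction proposed here is laid out in the piece itself.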

I give a visual explanation of Ajeya Cotra’s draft report, Forecasting TAI with biological anchors, summarizing the key assumptions, intuitions, and conclusions.