AI capabilities

AI systems can now write code, pass professional exams, and assist with scientific research; their capabilities are improving remarkably fast. But measuring exactly what AI can and cannot do is genuinely difficult: benchmarks struggle to keep pace, and real-world performance often diverges from test scores. Epoch tracks AI capabilities across tasks and benchmarks, examining how fast progress is happening, how predictable it is, and what it reveals about where the technology is heading.

Newsletter
May 5, 2026
RIP Classic Reasoning Benchmarks. What's Next?

Future benchmarks must give up at least one of: text-only format, short time horizons, easy grading, or expert human superiority.

By Greg Burnham

Podcast
May 1, 2026
Are AI benchmarks doomed?

In this episode, Greg Burnham and Tom Adamczewski join Anson Ho to push back on benchmark pessimism and dig into what the next generation of AI benchmarks could look like.

By Greg Burnham, Tom Adamczewski, and Anson Ho

Report
Apr. 16, 2026
Have AI Capabilities Accelerated?

We investigate progress trends on four capability metrics to determine whether AI capabilities have recently accelerated. Three of four metrics show strong evidence of acceleration, driven by reasoning models.

By Jean-Stanislas Denain and Alexander Barry

Report
Apr. 10, 2026
MirrorCode: Evidence that AI can already do some weeks-long coding tasks

In our new benchmark, MirrorCode, Claude Opus 4.6 autonomously reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks.

By Tom Adamczewski, David Rein, David Owen, and Florian Brand

Newsletter
Apr. 7, 2026
Keeping up with the GPTs

Can Chinese and open-model companies compete with the frontier through tactics such as distillation and talent acquisition?

By Anson Ho

Report
Feb. 20, 2026
Expanding our analysis of biological AI models

We release a database of over 1,100 biological AI models across nine categories. We analyze their safeguards, accessibility, training data sources, and the foundation models they build on.

By David Atanasov, Niccolò Zanichelli, and Jean-Stanislas Denain

Report
Feb. 13, 2026
What do “economic value” benchmarks tell us?

These benchmarks track a wide range of digital work. Progress will correlate with economic utility, but tasks are too self-contained to indicate full automation.

By Florian Brand and Greg Burnham

Report
Feb. 10, 2026
Where Autonomy Works: Evaluating Robot Capabilities in 2026

We assess the current state of autonomous robotics by evaluating robot performance on concrete tasks across industrial, household, and navigation domains.

By Yann Rivière and Jean-Stanislas Denain

Newsletter
Feb. 6, 2026
How close is AI to taking my job?

Beyond benchmarks as leading indicators for task automation

By Anson Ho

Podcast
Jan. 29, 2026
AI math capabilities could be jagged for a long time – Daniel Litt

In this episode, Daniel Litt chats with the hosts about AI’s limits in mathematics, accelerating math research, and how to measure progress on open problems.

By Daniel Litt, Greg Burnham, and Anson Ho

Data Insight
Jan. 23, 2026
Benchmark scores are well correlated, even across domains

By Luke Emberson and Yafah Edelman

Newsletter
Jan. 16, 2026
How well did forecasters predict 2025 AI progress?

Mostly right about benchmarks, mixed results on real-world impacts

By Anson Ho

Data Insight
Jan. 2, 2026
Chinese AI models have lagged the US frontier by 7 months on average since 2023

By Luke Emberson

Data Insight
Dec. 23, 2025
AI capabilities progress has sped up

By Yafah Edelman and Jaeho Lee

Newsletter
Dec. 23, 2025
Why benchmarking is hard

Running benchmarks involves many moving parts, each of which can influence the final score. The two most impactful components are scaffolds and API providers.

By Florian Brand and Jean-Stanislas Denain

Paper
Dec. 2, 2025
A Rosetta Stone for AI benchmarks

Most benchmarks saturate too quickly to study long-run AI trends. We solve this using a statistical framework that stitches benchmarks together, with big implications for algorithmic progress and AI forecasting.

By Anson Ho, Jean-Stanislas Denain, David Atanasov, Samuel Albanie, and Rohin Shah
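
As a toy illustration of what "stitching" benchmarks can mean, the sketch below fits one latent capability per model plus a difficulty and discrimination parameter per benchmark, placing benchmarks that saturate at different times on a common scale. The logistic form, parameter names, and synthetic data are illustrative assumptions, not the paper's actual specification.

```python
# Toy benchmark "stitching": one latent capability per model, a
# difficulty/discrimination pair per benchmark, logistic link.
# Illustrative only -- the paper's framework may differ.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)

n_models, n_benchmarks = 8, 3
true_theta = np.sort(rng.normal(0, 1.5, n_models))   # latent capability per model
true_d = np.array([-1.0, 0.5, 2.0])                  # difficulty per benchmark
true_a = np.array([1.2, 0.8, 1.0])                   # discrimination per benchmark

def predict(theta, d, a):
    # Logistic link: accuracy rises as capability exceeds difficulty.
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - d[None, :])))

scores = predict(true_theta, true_d, true_a)
scores += rng.normal(0, 0.02, scores.shape)          # observation noise

def residuals(params):
    theta = params[:n_models]
    d = params[n_models:n_models + n_benchmarks]
    a = params[n_models + n_benchmarks:]
    return (predict(theta, d, a) - scores).ravel()

x0 = np.concatenate([np.zeros(n_models), np.zeros(n_benchmarks), np.ones(n_benchmarks)])
fit = least_squares(residuals, x0)
print("recovered capabilities:", np.round(fit.x[:n_models], 2))
```

In practice such models also need identifiability constraints (for example, pinning the latent scale), which the paper's statistical framework handles more carefully.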

Newsletter
Nov. 20, 2025
Benchmark Scores = General Capability + Claudiness

Is this because skills generalize very well, or because developers are pushing on all benchmarks at once?

By Greg Burnham

Data Insight
Nov. 6, 2025
Epoch’s Capabilities Index stitches together benchmarks across a wide range of difficulties

By Jaeho Lee and Luke Emberson

Report
Oct. 30, 2025
What does OSWorld tell us about AI’s ability to use computers?

We review OSWorld, a prominent computer use benchmark. Its tasks are relatively simple, many don’t require GUIs, and success often hinges on interpreting ambiguous instructions. It is also not stable over time.

By Florian Brand and Greg Burnham

Data Insight
Oct. 30, 2025
Open-weight models lag state-of-the-art by around 3 months on average

By Luke Emberson

Newsletter
Oct. 17, 2025
Less than 70% of FrontierMath is within reach for today’s models

57% of problems have been solved at least once.

By Greg Burnham

Report
Oct. 9, 2025
Evaluating Gemini 2.5 Deep Think's math capabilities

It has improved at using background knowledge and doing precise computations. It can be a helpful research assistant and may take a more conceptual approach to geometry. It shows limited creativity and sometimes struggles with citations.

By Greg Burnham

Data Insight
Sep. 30, 2025
AI capabilities have steadily improved over the past year

By Luke Emberson

Data Insight
Sep. 19, 2025
AI developers accurately report GPQA Diamond scores for recent models

By Jaeho Lee and Yafah Edelman

Newsletter
Sep. 19, 2025
The huge potential implications of long-context inference

Continual learning, scaling RL, and research feedback loops

By Jean-Stanislas Denain and Anson Ho

Report
Sep. 16, 2025
What will AI look like in 2030?

If scaling persists to 2030, AI investments will reach hundreds of billions of dollars and require gigawatts of power. Benchmarks suggest AI could improve productivity in valuable areas such as scientific R&D.

By David Owen

Data Insight
Sep. 3, 2025
LLMs have not yet solved the hardest problems on high school math contests

By Greg Burnham

Data Insight
Aug. 29, 2025
GPT-5 and GPT-4 were both major leaps in benchmarks from the previous generation

By Luke Emberson and Josh You

Newsletter
Aug. 22, 2025
Why future AI agents will be trained to work together

Many multi-agent setups are based on fancy prompts, but this is unlikely to persist

By Anson Ho and Jean-Stanislas Denain

Data Insight
Aug. 8, 2025
Compute is not a bottleneck for robotic manipulation

By Ben Cottier, Scott Longwell, James Sanders, David Owen, Yafah Edelman, and Luke Emberson

Newsletter
Aug. 7, 2025
We didn’t learn much from the IMO

The problems gave AI only a slim chance to show new capabilities

By Greg Burnham

Newsletter
Aug. 2, 2025
Quantifying the algorithmic improvement from reasoning models

Reasoning models were as big an improvement as the Transformer, at least on some benchmarks

By Anson Ho and Arden Berg

Report
Jul. 25, 2025
Evaluating Grok 4’s math capabilities

It’s good at involved computations, improving at proofs from a low base, and useful for literature search. It still favors low-level grinds and leans on background knowledge.

By Greg Burnham

Update
Jul. 10, 2025
How to run SWE-bench Verified in one hour on one machine

We are releasing a public registry of optimized Docker images for SWE-bench. This allows us to run SWE-bench Verified in 62 minutes on a single GitHub Actions VM.

By Tom Adamczewski
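
Pre-built images take environment setup off the critical path. Below is a minimal sketch of pre-pulling per-task evaluation images in parallel with docker-py; the registry namespace and image naming scheme are hypothetical placeholders, not Epoch's actual registry, and the task IDs are sample SWE-bench instance IDs.

```python
# Pre-pull per-task evaluation images in parallel so a benchmark run
# spends no time building environments. NAMESPACE and the naming
# scheme below are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

import docker  # pip install docker

NAMESPACE = "example-registry"  # hypothetical registry namespace
TASK_IDS = [                    # sample SWE-bench Verified instance IDs
    "astropy__astropy-12907",
    "django__django-11001",
]

client = docker.from_env()

def pull_image(task_id: str) -> str:
    repo = f"{NAMESPACE}/sweb.eval.{task_id}"  # hypothetical naming scheme
    client.images.pull(repo, tag="latest")
    return repo

with ThreadPoolExecutor(max_workers=8) as pool:
    for repo in pool.map(pull_image, TASK_IDS):
        print("pulled", repo)
```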

Newsletter
Jul. 8, 2025
What will the IMO tell us about AI math capabilities?

Most discussion about AI and the IMO focuses on gold medals, but that isn't what deserves the most attention.

By Greg Burnham

Data Insight
Jun. 25, 2025
LLMs now accept longer inputs, and the best models can use them more effectively

By Greg Burnham and Tom Adamczewski

Report
Jun. 13, 2025
What skills does SWE-bench Verified evaluate?

We take a deep dive into SWE-bench Verified, a prominent agentic coding benchmark. While one of the best public tests of AI coding agents, it is limited by its focus on simple bug fixes in familiar open-source repositories.

By Florian Brand and Jean-Stanislas Denain

Newsletter
Jun. 13, 2025
Do the biorisk evaluations of AI labs actually measure the risk of developing bioweapons?

Assessing whether AI labs' biorisk evaluations effectively measure models' potential to enable amateur bioweapons development.

By Anson Ho and Arden Berg

Data Insight
Jun. 11, 2025
LLM providers offer a trade-off between accuracy and speed

By Greg Burnham and Tom Adamczewski

Newsletter
Jun. 6, 2025
Beyond benchmark scores: Analyzing o3-mini’s mathematical reasoning

Examining o3-mini's mathematical reasoning: an erudite, vibes-based solver that excels in knowledge but lacks precision, creativity, and the formal rigor of human mathematicians.

By Anson Ho, Jean-Stanislas Denain, and Elliot Glazer

Newsletter
May 30, 2025
GPQA Diamond: What’s left?

We investigate the validity of the GPQA Diamond benchmark: uncovering flawed questions, examining where models still struggle, and explaining why the benchmark continues to inform AI evaluation.

By Greg Burnham

Newsletter
May 23, 2025
Is AI already superhuman on FrontierMath?

How do humans and AIs compare on FrontierMath? We ran a competition at MIT to put this to the test.

By Anson Ho

Newsletter
May 2, 2025
Where’s my ten minute AGI?

Why don't AIs automate more real-world tasks if they can handle 1-hour ones? Anson Ho explores key capability and context bottlenecks.

By Anson Ho

Data Insight
Apr. 17, 2025
LLM responses to benchmark questions are getting longer over time

By Luke Emberson, Ben Cottier, Josh You, Tom Adamczewski, and Jean-Stanislas Denain

Newsletter
Mar. 28, 2025
The real reason AI benchmarks haven’t reflected economic impacts

Historically, AI benchmarks haven’t reflected real-world impacts because they weren’t optimized to do so, not because of fundamental limitations – and this might be changing.

By Anson Ho and Jean-Stanislas Denain

Newsletter
Mar. 7, 2025
What AI can currently do is not the story

Forecasting AI progress requires more than extrapolating current capabilities; understanding fundamental task difficulty is key to predicting future breakthroughs.

By Ege Erdil

Newsletter
Feb. 28, 2025
The promise of reasoning models

AI reasoning models will achieve superhuman performance in math and coding, yet their economic applications will lag behind, limiting real-world impact.

By Matthew Barnett

Data Insight
Feb. 21, 2025
Biology AI models are scaling 2-4x per year after rapid growth from 2019-2021

By Pablo Villalobos and David Atanasov

Newsletter
Feb. 21, 2025
AI progress is about to speed up

AI progress is accelerating, with next-gen models surpassing GPT-4 in compute power, driving major leaps in reasoning, coding, and math capabilities.

By Ege Erdil

Update
Feb. 7, 2025
A more systematic and transparent AI benchmarking hub

We've overhauled our AI benchmarking infrastructure to provide more transparent, systematic, and up-to-date evaluations of AI model capabilities.

By Tom Adamczewski

Update
Jan. 29, 2025
Announcing our expanded biology AI coverage

We've expanded our Biology AI Dataset, now covering 360+ models. Our analysis reveals rapid scaling from 2017-2021, followed by a notable slowdown in biological model development.

By Pablo Villalobos and David Atanasov

Update
Jan. 23, 2025
Clarifying the creation and use of the FrontierMath benchmark

We clarify that OpenAI commissioned Epoch AI to produce 300 math questions for the FrontierMath benchmark. They own these and have access to the statements and solutions, except for a 50-question holdout set.

By Tamay Besiroglu and Jaime Sevilla

Update
Updated Mar. 18, 2025
FrontierMath competition: Setting benchmarks for AI evaluation

We are hosting a competition to establish rigorous human performance baselines for FrontierMath. The competition offers a $10,000 prize pool, and your participation will contribute directly to measuring AI progress in solving challenging mathematical problems.

By Tamay Besiroglu, Elliot Glazer, and Caroline Falkman Olsson

Newsletter
Dec. 27, 2024
Moravec’s paradox and its implications

This Gradient Updates issue explains Moravec’s paradox and offers a speculative picture, based on the paradox, of how hard various economic tasks will be to automate.

By Ege Erdil

Report
Dec. 4, 2024
What is the future of AI in mathematics? Interviews with leading mathematicians

How will AI transform mathematics? Fields Medalists and other leading mathematicians discuss whether they expect AI to automate advanced math research.

By Anson Ho and Tamay Besiroglu

Update
Nov. 27, 2024
Introducing Epoch AI's AI benchmarking hub

We are launching the AI Benchmarking Hub: a platform presenting our evaluations of leading models on challenging benchmarks, with analysis of trends in AI capabilities.

By The Epoch AI Team

Data Insight
Updated Feb. 7, 2025
Accuracy increases with estimated training compute

By Jean-Stanislas Denain

Data Insight
Updated Feb. 7, 2025
Models with downloadable weights currently lag behind the top-performing models

By Jean-Stanislas Denain

Data Insight
Updated Feb. 7, 2025
US models currently outperform non-US models

By Jean-Stanislas Denain

Paper
Nov. 8, 2024
FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI

A new benchmark of expert-level math problems designed to measure AI’s mathematical abilities. See how leading AI models perform against the collective mathematics community.

By Tamay Besiroglu, Elliot Glazer, and Caroline Falkman Olsson

Report
Nov. 4, 2024
How far behind are open models?

We compare open and closed AI models, and study how openness has evolved. The best open model today is on par with closed models in performance and training compute, but with a lag of about one year.

By Ben Cottier, Josh You, Natalia Martemianova, and David Owen

Report
Jan. 17, 2024
Biological sequence models in the context of the AI directives

The expanded Epoch database now includes biological sequence models, revealing potential regulatory gaps in the White House’s Executive Order on AI, as well as growth in the compute used to train these models.

By Nicole Maug, Aidan O'Gara, and Tamay Besiroglu

Paper
Jun. 9, 2023
How predictable is language model benchmark performance?

We investigate large language model performance across five orders of magnitude of compute scaling, finding that compute-focused extrapolations are a promising way to forecast AI capabilities.

By David Owen
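
A minimal sketch of the compute-focused extrapolation style this paper studies: fit a logistic curve in log-compute to observed benchmark accuracies, then extrapolate to higher compute. The functional form and data points below are illustrative assumptions, not the paper's fitted values.

```python
# Fit accuracy as a logistic function of log10(training compute),
# then extrapolate. Data and functional form are illustrative only.
import numpy as np
from scipy.optimize import curve_fit

log_compute = np.array([20.0, 21.0, 22.0, 23.0, 24.0])  # log10 FLOP
accuracy = np.array([0.05, 0.12, 0.30, 0.55, 0.78])     # benchmark accuracy

def sigmoid(x, k, x0):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

(k, x0), _ = curve_fit(sigmoid, log_compute, accuracy, p0=[1.0, 23.0])
print(f"predicted accuracy at 1e25 FLOP: {sigmoid(25.0, k, x0):.2f}")
```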

Paper
Apr. 21, 2023
Power laws in speedrunning and machine learning

We develop a model for predicting record improvements in video game speedrunning and apply it to predicting machine learning benchmarks. This model suggests that machine learning benchmarks are not close to saturation, and that large sudden improvements are infrequent, but not ruled out.

By Ege Erdil and Jaime Sevilla
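
For context on the title: a power law here means records that improve roughly as a power of cumulative effort. The sketch below fits that generic form, record(n) ~ C * n^(-alpha), via log-log regression on made-up data; the paper's actual model may be more elaborate.

```python
# Fit the generic power-law form record(n) ~ C * n**(-alpha) to a
# series of best-known times via linear regression in log-log space.
# Data are invented for illustration.
import numpy as np

attempts = np.array([1, 2, 5, 10, 50, 200, 1000])                 # cumulative attempts
records = np.array([120.0, 102.0, 80.0, 69.0, 47.0, 35.0, 25.0])  # best time (s)

slope, intercept = np.polyfit(np.log(attempts), np.log(records), 1)
alpha, C = -slope, np.exp(intercept)
print(f"alpha = {alpha:.2f}, C = {C:.1f} s")
```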

Report
Jan. 17, 2023
Literature review of transformative artificial intelligence timelines

We summarize and compare several models and forecasts predicting when transformative AI will be developed.

By Keith Wynroe, David Atkinson, and Jaime Sevilla

Report
Jul. 15, 2022
A time-invariant version of Laplace’s rule

We explore how to estimate the probability of an event given information of past occurrences. We explain a problem with the naive application of Laplace’s rule in this context, and suggest a modification to correct it.

By Jaime Sevilla and Ege Erdil
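
The problem the report describes can be shown numerically: under a uniform prior, Laplace's rule implies P(no event in the next m trials | 0 events in n trials) = (n + 1) / (n + m + 1), so slicing the same event-free window into finer trials changes the answer. The sketch below demonstrates the inconsistency; it illustrates the problem, not the report's proposed fix.

```python
# Naive Laplace's rule with a uniform prior: after 0 events in n
# trials, P(no event in the next m trials) = (n + 1) / (n + m + 1).
# The same event-free decade gives different one-year forecasts
# depending on trial granularity -- the problem the report corrects.
from fractions import Fraction

def p_no_event(n: int, m: int) -> Fraction:
    # Posterior predictive of m further failures after n failures.
    return Fraction(n + 1, n + m + 1)

p_yearly = 1 - p_no_event(10, 1)     # 10 yearly trials, predict 1 more year
p_monthly = 1 - p_no_event(120, 12)  # 120 monthly trials, predict 12 months

print(f"yearly trials:  P(event next year) = {float(p_yearly):.3f}")   # ~0.083
print(f"monthly trials: P(event next year) = {float(p_monthly):.3f}")  # ~0.090
```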

Report
Jun. 6, 2022
Grokking “Forecasting TAI with biological anchors”

I give a visual explanation of Ajeya Cotra’s draft report, Forecasting TAI with biological anchors, summarising the key assumptions, intuitions, and conclusions.

By Anson Ho