AI Benchmarking
Our database of benchmark results, featuring the performance of leading AI models on challenging tasks. It includes results from benchmarks evaluated internally by Epoch AI as well as data collected from external sources. Explore trends across time, by benchmark, or by model.
Benchmarking updates
September 29, 2025
Claude Sonnet 4.5 has set a new state of the art in our evaluations on SWE-Bench Verified.
See results for Claude Sonnet 4.5
July 11, 2025
Introducing FrontierMath Tier 4: a benchmark of extremely challenging research-level math problems, designed to test the limits of AI’s reasoning capabilities.
Read our announcement thread
July 10, 2025
SWE-Bench can be tricky to run. We have released a public registry of Docker containers that makes running it fast and easy.
See how