Benchmarking updates

September 29, 2025
Claude Sonnet 4.5 has set a new state of the art in our evaluations on SWE-Bench Verified.
See results for Claude Sonnet 4.5
July 11, 2025
Introducing FrontierMath Tier 4: a benchmark of extremely challenging research-level math problems, designed to test the limits of AI’s reasoning capabilities.
Read our announcement thread
July 10, 2025
SWE-Bench can be tricky to run. We released a public registry of Docker containers that makes running it fast and easy.
See how