Data on AI Benchmarking

Dashboard Benchmarks Models ECI Data About

Benchmarking updates

September 29, 2025

Claude Sonnet 4.5 just established a new state-of-the-art performance in our evaluations of SWE-Bench Verified.

July 11, 2025

Introducing FrontierMath Tier 4: a benchmark of extremely challenging research-level math problems, designed to test the limits of AI’s reasoning capabilities.

Read our announcement thread

July 10, 2025

SWE-Bench can be tricky to run. We released a public registry of Docker containers that make it easy and fast.

See how