Benchmarking updates

February 13, 2026
We analyzed three benchmarks (RLI, GDPval, and APEX-Agents) designed to test AI's ability to perform economically valuable digital work.
Read the report
February 6, 2026
Opus 4.6 matched the top score on FrontierMath Tiers 1–3, marking the first time an Anthropic model has been at the frontier of this benchmark.
See the results
January 27, 2026
We released FrontierMath: Open Problems, which tests AI on unsolved math research problems.
Discover the benchmark