Benchmarks
Explore evaluations across 44 distinct benchmarks, covering mathematics, coding, agentic action, and more.
350 expert-written problems in advanced mathematics, each requiring hours or even days to solve.
A challenging multiple-choice question set in biology, chemistry, and physics, authored by PhD-level experts.
45 competition-style math problems from OTIS, harder than MATH Level 5 but easier than FrontierMath.
1,000 factoid questions about politics, science and technology, art, sports, geography, music, and more.
100 novel puzzles, generated programmatically with a chess engine. Each puzzle has a single best next move.