Explore evaluations across 50 distinct benchmarks, covering mathematics, coding, agentic action, and more.
An index aggregating many different benchmarks into a single, general capability scale.
50 exceptionally difficult research-level math problems.
300 expert-written math problems ranging from advanced undergraduate material to early-career research.
GitHub issues from real-world Python repos, testing whether models can implement valid code fixes.
100 novel puzzles, generated programmatically with a chess engine. Each puzzle has a single best next move.
1,000 factoid questions about politics, science and technology, art, sports, geography, music, and more.
A challenging multiple-choice question set in biology, chemistry, and physics, authored by PhD-level experts.
45 competition-style math problems from OTIS, harder than MATH Level 5 but easier than FrontierMath.
The hardest tier of problems from the MATH dataset, drawn from competitions like the AMC 10, AMC 12, and AIME.
500 GitHub issues from real-world Python repos, testing whether models can implement valid code fixes.
A benchmark evaluating models on challenging programming problems from Exercism, an online programming education platform.
A benchmark of nonstandard ML engineering tasks in a variety of domains.
A benchmark testing common-sense reasoning, including ‘trick questions’ and situations that require an understanding of space, time, or social cues.
A benchmark measuring model performance on well-specified tasks drawn from selected real-world occupations.
A benchmark of tasks that require a model to work in a computer terminal, testing its ability to understand and employ the programs available to it.
A benchmark measuring how well CLI agents can post-train small base language models under a fixed compute budget.
The duration of the longest tasks that models can complete correctly more often than not, across a set of software engineering and related tasks.
A benchmark testing models’ ability to answer questions by finding and synthesizing information from the internet.
A benchmark that evaluates models on their ability to play a series of games of widely varying difficulty.
A collection of software optimization challenges that test models’ ability to modify a program’s code to significantly improve its performance.
A platform where users vote on which of two anonymous models does a better job of producing websites according to the user’s requests.
A benchmark testing models’ ability to understand long creative writing pieces.
A set of questions testing models’ ability to comprehend diagrams of simple scenarios in which balls fall down a series of ramps and land in buckets.
A benchmark, based on the popular game GeoGuessr, that tests models’ ability to identify the real-world location where a picture was taken.
A comprehensive benchmark evaluating multimodal large language models on multiple-choice video question answering across diverse domains, durations, and modalities, with and without subtitles.
A benchmark of adversarially collected natural language inference examples that tests robust textual reasoning under distribution shift.
A concept-learning benchmark that evaluates few-shot abstraction, pattern induction, and program-like reasoning from input–output examples.
A multiple-choice science benchmark assessing grade-school science knowledge and reasoning on questions from the AI2 ARC dataset.
A curated subset of BIG-bench tasks that remain difficult for large language models, probing compositional, symbolic, and multi-step reasoning.
A yes/no reading comprehension benchmark where models answer naturally occurring questions given a short supporting passage.
A text-to-CAD benchmark evaluating whether models can generate valid parametric 3D designs that meet geometric and rendering checks.
A harder, bias-reduced multiple-choice benchmark that probes everyday commonsense beyond lexical shortcuts.
A cybersecurity agent benchmark measuring autonomous vulnerability discovery and exploitation across sandboxed challenges.
A grade-school math word problem benchmark focused on multi-step arithmetic reasoning and exact-match solutions.
An adversarially filtered commonsense sentence-completion benchmark measuring plausibility in everyday scenarios.
A long-context language modeling benchmark where the final word of a passage must be predicted from broader discourse.
A writing-quality benchmark that scores models on multi-genre composition using a standardized rubric (clarity, coherence, style, and instruction-following).
A dynamic, broad-coverage benchmark of real-world tasks updated in periodic releases and scored with standardized judging protocols.
A multi-task exam-style benchmark covering dozens of academic and professional subjects to test breadth of knowledge and problem solving.
A small open-book science QA benchmark requiring the combination of a core fact with additional commonsense to answer 4-choice questions.
A computer-use benchmark where agents complete real desktop and web tasks in reproducible OS environments using keyboard/mouse actions and structured UI observations.
A physical commonsense benchmark where models choose the more feasible solution to everyday problems.
A multimodal multiple-choice science benchmark combining text, images, and diagrams with rich rationales.
A suite of diverse language understanding tasks designed to be more challenging than GLUE, emphasizing reasoning and sample efficiency.
A community-run evaluation of end-to-end software agents that attempt realistic tasks in reproducible environments.
An open-domain question answering benchmark with challenging trivia questions paired with evidence documents.
A large-scale pronoun resolution and coreference benchmark designed to reduce annotation artifacts and emphasize commonsense.
A benchmark evaluating whether AI models can perform economically valuable knowledge work across investment banking, management consulting, law, and primary medical care.
A harder successor to ARC-AGI that tests few-shot abstract reasoning and pattern generalization on grid-based tasks, with an added emphasis on compute efficiency per task solved.
A set of 2,500 expert-authored questions spanning over 100 academic subjects, designed to test the limits of frontier AI models on problems that require deep, specialized knowledge.