Explore evaluations across 50 distinct benchmarks, covering mathematics, coding, agentic action, and more.
An index aggregating many different benchmarks into a single, general capability scale.
50 exceptionally difficult research-level math problems.
300 expert-written math problems ranging from advanced undergraduate material to early-career research.
GitHub issues from real-world Python repos, testing whether models can implement valid code fixes.
100 novel puzzles, generated programmatically with a chess engine. Each puzzle has a single best next move.
1,000 factoid questions about politics, science and technology, art, sports, geography, music, and more.
A challenging multiple-choice question set in biology, chemistry, and physics, authored by PhD-level experts.
45 competition-style math problems from OTIS, harder than MATH Level 5 but easier than FrontierMath.
The hardest tier of problems from the MATH dataset, drawn from competitions like the AMC 10, AMC 12, and AIME.
500 GitHub issues from real-world Python repos, testing whether models can implement valid code fixes.
A benchmark evaluating models on challenging programming problems from Exercism, an online programming education platform.
A benchmark of nonstandard ML engineering tasks in a variety of domains.
A benchmark testing common-sense reasoning, including ‘trick questions’ and situations that require an understanding of space, time, or social cues.
A benchmark measuring model performance on well-specified tasks drawn from selected real-world occupations.
A benchmark of tasks that require a model to work in a computer terminal, testing its ability to understand and employ the programs available to it.
A benchmark measuring how well CLI agents can post-train small base language models under a fixed compute budget.
The duration of the longest tasks that models can complete correctly more often than not, across a set of software engineering and related tasks.
A benchmark testing models’ ability to answer questions by finding and synthesizing information from the internet.
A benchmark that evaluates models on their ability to play a series of games of widely varying difficulty.
A collection of software optimization challenges that test models’ ability to modify a program’s code to significantly improve its performance.
A platform where users vote on which of two anonymous models does a better job of producing websites according to the user’s requests.
A benchmark testing models’ ability to understand long creative writing pieces.
A set of questions testing models’ ability to comprehend diagrams of simple scenarios in which balls fall down a series of ramps and land in buckets.
A benchmark, based on the popular game GeoGuessr, that tests models’ ability to identify the real-world location where a picture was taken.
A comprehensive benchmark evaluating multimodal large language models on multiple-choice video question answering across diverse domains, durations, and modalities, with and without subtitles.
A benchmark of adversarially collected natural language inference examples that tests robust textual reasoning under distribution shift.
A concept-learning benchmark that evaluates few-shot abstraction, pattern induction, and program-like reasoning from input–output examples.
A multiple-choice science benchmark assessing grade-school science knowledge and reasoning on questions from the AI2 ARC dataset.
A curated subset of BIG-bench tasks that remain difficult for large language models, probing compositional, symbolic, and multi-step reasoning.
A yes/no reading comprehension benchmark where models answer naturally occurring questions given a short supporting passage.
A text-to-CAD benchmark evaluating whether models can generate valid parametric 3D designs that meet geometric and rendering checks.
A harder, bias-reduced multiple-choice benchmark that probes everyday commonsense beyond lexical shortcuts.
A cybersecurity agent benchmark measuring autonomous vulnerability discovery and exploitation across sandboxed challenges.
A grade-school math word problem benchmark focused on multi-step arithmetic reasoning and exact-match solutions.
An adversarially filtered commonsense sentence-completion benchmark measuring plausibility in everyday scenarios.
A long-context language modeling benchmark where the final word of a passage must be predicted from broader discourse.
A writing-quality benchmark that scores models on multi-genre composition using a standardized rubric (clarity, coherence, style, and instruction-following).
A dynamic, broad-coverage benchmark of real-world tasks updated in periodic releases and scored with standardized judging protocols.
A multi-task exam-style benchmark covering dozens of academic and professional subjects to test breadth of knowledge and problem solving.
A small open-book science QA benchmark requiring the combination of a core fact with additional commonsense to answer 4-choice questions.
A computer-use benchmark where agents complete real desktop and web tasks in reproducible OS environments using keyboard/mouse actions and structured UI observations.
A physical commonsense benchmark where models choose the more feasible solution to everyday problems.
A multimodal multiple-choice science benchmark combining text, images, and diagrams with rich rationales.
A suite of diverse language understanding tasks designed to be more challenging than GLUE, emphasizing reasoning and sample efficiency.
A community-run evaluation of end-to-end software agents that attempt realistic tasks in reproducible environments.
An open-domain question answering benchmark with challenging trivia questions paired with evidence documents.
A large-scale pronoun resolution and coreference benchmark designed to reduce annotation artifacts and emphasize commonsense.
A benchmark evaluating whether AI models can perform economically valuable knowledge work across investment banking, management consulting, law, and primary medical care.
A harder successor to ARC-AGI that tests few-shot abstract reasoning and pattern generalization on grid-based tasks, with an added emphasis on compute efficiency per task solved.
A set of 2,500 expert-authored questions spanning over 100 academic subjects, designed to test the limits of frontier AI models on problems that require deep, specialized knowledge.