About Humanity's Last Exam

Humanity’s Last Exam (HLE) is a benchmark created by the Center for AI Safety and Scale AI to address benchmark saturation – the problem of frontier models quickly achieving near-perfect scores on existing evaluations. The dataset contains 2,500 questions across over 100 subjects, including mathematics, physics, chemistry, biology, medicine, computer science, the humanities, and the social sciences. Questions were contributed by nearly 1,000 subject-matter experts – primarily professors, researchers, and graduate degree holders – affiliated with over 500 institutions across 50 countries. Contributors competed for a $500,000 prize pool.

Questions are designed to require graduate-level expertise or highly specialized knowledge and cannot be quickly answered via internet retrieval. Approximately 80% of questions are exact-match (the model must produce a precise answer string), while the remaining 20% are multiple-choice with five or more answer options. About 10% of questions are multimodal, requiring the model to interpret a diagram or figure alongside text. All questions have unambiguous, verifiable answers. During curation, submitted questions were first filtered by frontier LLMs – only questions that stumped those models advanced through two subsequent rounds of human expert review before final inclusion.
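The composition figures above imply rough question counts, sketched below. The percentages are as reported, so these are rounded estimates rather than official split sizes:

```python
# Approximate question counts implied by the reported composition of HLE.
# These are estimates derived from rounded percentages, not official splits.
TOTAL_QUESTIONS = 2_500

exact_match = round(0.80 * TOTAL_QUESTIONS)       # precise answer string required
multiple_choice = TOTAL_QUESTIONS - exact_match   # five or more options each
multimodal = round(0.10 * TOTAL_QUESTIONS)        # include a diagram or figure

print(exact_match, multiple_choice, multimodal)   # 2000 500 250
```

Note that the multimodal subset overlaps the other two categories: a multimodal question is still graded as either exact-match or multiple-choice.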