About Humanity's Last Exam

Humanity’s Last Exam (HLE) is a benchmark created by the Center for AI Safety and Scale AI to address benchmark saturation – the problem of frontier models quickly achieving near-perfect scores on existing evaluations. The dataset contains 2,500 questions across over 100 subjects, including mathematics, physics, chemistry, biology, medicine, computer science, the humanities, and the social sciences. Questions were contributed by nearly 1,000 subject-matter experts – primarily professors, researchers, and graduate degree holders – affiliated with over 500 institutions across 50 countries. Contributors competed for a $500,000 prize pool.

Questions are designed to require graduate-level expertise or highly specialized knowledge and cannot be quickly answered via internet retrieval. Approximately 80% of questions are exact-match (the model must produce a precise answer string), while the remaining 20% are multiple-choice with five or more answer options. About 10% of questions are multimodal, requiring the model to interpret a diagram or figure alongside text. All questions have unambiguous, verifiable answers. During curation, submitted questions were first filtered by frontier LLMs – only questions that stumped those models advanced through two subsequent rounds of human expert review before final inclusion.
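The composition figures above imply rough question counts, sketched below. The percentages are as reported, so these are rounded estimates rather than official split sizes:

```python
# Approximate question counts implied by the reported composition of HLE.
# These are estimates derived from rounded percentages, not official splits.
TOTAL_QUESTIONS = 2_500

exact_match = round(0.80 * TOTAL_QUESTIONS)       # precise answer string required
multiple_choice = TOTAL_QUESTIONS - exact_match   # five or more options each
multimodal = round(0.10 * TOTAL_QUESTIONS)        # include a diagram or figure

print(exact_match, multiple_choice, multimodal)   # 2000 500 250
```

Note that the multimodal subset overlaps the other two categories: a multimodal question is still graded as either exact-match or multiple-choice.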