GSM8K

GSM8K is a benchmark of short grade-school math word problems. Each problem requires a model to decompose the prompt, carry out intermediate calculations, and produce a final numeric answer. The tasks are intentionally simple for humans but challenging for models that lack scratchpad reasoning or accurate arithmetic.

Because each answer is a free-form number, the benchmark is commonly used to assess chain-of-thought prompting and tool-assisted math. Scoring is by exact match on the final answer, which makes errors easy to diagnose and results easy to compare across models and prompting strategies.
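As a rough illustration of exact-match scoring on free-form numeric answers, the sketch below takes the last number that appears in a model's output and compares it with the last number in the reference. The function names (`extract_final_number`, `exact_match`, `accuracy`) and the "last number wins" extraction rule are assumptions made for this example, not the benchmark's official evaluation code. Real harnesses may extract the answer differently.

```python
import re
from decimal import Decimal, InvalidOperation

# Matches integers and decimals, allowing thousands separators (e.g. "1,200.5").
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number in `text` as a Decimal, or None if there is none.

    Assumption: the final answer is the last number mentioned in the text.
    """
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    try:
        return Decimal(matches[-1].replace(",", ""))
    except InvalidOperation:
        return None


def exact_match(prediction, reference):
    """True when both texts end in the same numeric value."""
    pred = extract_final_number(prediction)
    ref = extract_final_number(reference)
    return pred is not None and pred == ref


def accuracy(predictions, references):
    """Fraction of predictions whose final number matches the reference."""
    if not references:
        return 0.0
    correct = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return correct / len(references)
```

Comparing `Decimal` values rather than strings means that `1,200.0` and `1200` count as the same answer, while an output with no number at all is scored as incorrect.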
