About GSM8K

GSM8K (Grade School Math 8K) contains roughly 8,500 short word problems at a grade-school level. Each one requires a model to decompose the prompt into steps, carry out the intermediate calculations, and produce a final numeric answer. The problems are intentionally simple for humans but remain challenging for models that lack scratchpad reasoning or accurate arithmetic.

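Because answers are free-form numbers rather than multiple-choice options, a model has to compute the result instead of selecting it, so the benchmark is commonly used to assess chain-of-thought prompting and tool-assisted math. Exact-match scoring of the final number keeps results comparable across models and prompting strategies, and makes individual mistakes easy to diagnose.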
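
As a rough illustration of exact-match scoring, the sketch below compares the final number in a model's output with the reference answer. It relies on the convention that GSM8K reference solutions end with a line of the form "#### <answer>". Everything else is an assumption made for this example: the function names are hypothetical, taking the last number in the model's output as its answer is a common heuristic rather than an official rule, and the sample problem text is invented, not taken from the dataset.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Matches integers and decimals with an optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def to_number(text: str) -> Optional[Decimal]:
    """Parse a numeric string such as '1,250' or '18.0' into a Decimal."""
    try:
        return Decimal(text.replace(",", "").strip())
    except InvalidOperation:
        return None


def reference_answer(solution: str) -> Optional[Decimal]:
    """Read the number that follows the final '####' marker of a reference solution."""
    if "####" not in solution:
        return None
    return to_number(solution.rsplit("####", 1)[1])


def predicted_answer(completion: str) -> Optional[Decimal]:
    """Assumed heuristic: treat the last number in the model output as its answer."""
    matches = _NUMBER.findall(completion)
    return to_number(matches[-1]) if matches else None


def exact_match(completion: str, solution: str) -> bool:
    """True only when both answers parse and are numerically equal."""
    pred = predicted_answer(completion)
    gold = reference_answer(solution)
    return pred is not None and gold is not None and pred == gold


# Invented example in the reference-solution style (not a real dataset item).
solution = (
    "Each box holds 12 pencils, so 5 boxes hold 5 * 12 = 60 pencils.\n"
    "After giving away 15, 60 - 15 = 45 remain.\n"
    "#### 45"
)
print(exact_match("5 boxes hold 60 pencils. Giving away 15 leaves 45.0.", solution))
print(exact_match("The answer is 75.", solution))
```

Comparing parsed Decimal values rather than raw strings means "45", "45.0", and "1,250" versus "1250" are treated as equal. A stricter or looser normalization (units, fractions, currency symbols) is a design choice that should be stated alongside any reported score.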