SimpleBench

Externally evaluated

SimpleBench is a multiple-choice text benchmark designed to test LLMs on reasoning tasks where unspecialized humans currently outperform them. It comprises 213 handcrafted questions that fall roughly into four categories: spatial reasoning, temporal reasoning, social intelligence, and linguistic adversarial robustness (i.e., trick questions).

Here is an example of a question from SimpleBench (the correct answer is B):

Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute?

A. 30
B. 0
C. 20
D. 10
E. 11
F. 5

The intended trick is that the pan is frying an egg throughout, so any ice cubes placed in it would have melted: no whole ice cubes remain at the end of the third minute.

The benchmark questions are mostly kept private to prevent test set contamination. For more information and public examples, see the SimpleBench website.

Methodology

We source SimpleBench results directly from the official website.

SimpleBench evaluates models on 213 multiple-choice questions, each with six answer options. Models are evaluated using a standardized prompt, and each model is run 5 times per question, typically with a temperature of 0.7 and a top-p of 0.95 (except for o-series models, whose temperature cannot be controlled). The reported score is the average accuracy across these 5 runs.
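As a rough illustration of this protocol, the Python sketch below runs each question 5 times with the stated sampling parameters and averages the per-run accuracies. It is not the official harness: the `query_model` callable and the question dictionaries are hypothetical placeholders.

```python
import statistics

# Hypothetical sketch, not the official SimpleBench harness.
# `query_model` is a placeholder callable that takes a prompt plus sampling
# parameters and returns the model's chosen letter (e.g. "B").

NUM_RUNS = 5
TEMPERATURE = 0.7
TOP_P = 0.95


def evaluate(questions, query_model):
    """questions: list of dicts with 'prompt' and 'answer' (a letter A-F)."""
    run_accuracies = []
    for _ in range(NUM_RUNS):
        correct = sum(
            query_model(q["prompt"], temperature=TEMPERATURE, top_p=TOP_P) == q["answer"]
            for q in questions
        )
        run_accuracies.append(correct / len(questions))
    # The reported score is the mean accuracy across the 5 runs.
    return statistics.mean(run_accuracies)
```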

The standard prompt used is:

You are an expert at reasoning and you always pick the most realistic answer. Think step by step and output your reasoning followed by your final answer using the following format: Final Answer: X where X is one of the letters A, B, C, D, E, or F.

For o-series models, a slightly modified prompt is used to remove the Chain-of-Thought directive:

You are an expert at reasoning and you always pick the most realistic answer. Output your final answer using the following format: Final Answer: X where X is one of the letters A, B, C, D, E, or F. Do not include additional formatting in the final answer.
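To illustrate how a response in the requested format might be scored, the following sketch extracts the answer letter from the "Final Answer: X" marker. The function name and regular expression are assumptions for illustration, not the official answer-extraction code.

```python
import re


def extract_final_answer(response: str):
    """Return the letter after the last 'Final Answer:' marker, or None."""
    matches = re.findall(r"Final Answer:\s*([A-F])", response)
    return matches[-1] if matches else None


# Example: extract_final_answer("...step-by-step reasoning... Final Answer: B") -> "B"
```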

For full details on the dataset creation, human evaluation methodology (yielding a provisional baseline of 83.7% from 9 participants), and specific model results, refer to the SimpleBench website.