About SimpleBench

SimpleBench's questions are not especially difficult for humans: the benchmark authors established a human baseline of 83.7% by administering a subset of the questions to nine participants. At the same time, the questions were designed to remain challenging for the models available when the benchmark was developed. The evaluation setup is simple, consisting of a single prompt intended to elicit each model's best performance; part of that prompt is included specifically to induce chain-of-thought behavior in models that were not trained to reason.

Methodology

We source SimpleBench results directly from the official website.

SimpleBench evaluates models on 213 multiple-choice questions, each with six answer options. Models are evaluated using a standardized prompt. Each model is run 5 times per question, typically with a temperature of 0.7 and a top-p of 0.95 (except for o-series models, whose temperature cannot be adjusted). The reported score is the average accuracy across these 5 runs.
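
To make the scoring protocol concrete, the sketch below averages accuracy over five independent passes through the questions. The `complete` and `grade` callables and the question schema are placeholders assumed for illustration; they are not the official harness.

```python
import statistics
from typing import Callable

# Hypothetical interfaces, not part of the official harness:
#   complete(prompt, temperature, top_p) -> one sampled model response
#   grade(response, gold_letter)         -> True if the response picks the gold letter
Complete = Callable[[str, float, float], str]
Grade = Callable[[str, str], bool]

TEMPERATURE = 0.7  # sampling settings described above;
TOP_P = 0.95       # o-series models instead run with their fixed defaults
RUNS = 5

def simplebench_score(questions: list[dict], complete: Complete, grade: Grade) -> float:
    """Average accuracy over RUNS independent passes through every question."""
    per_run_accuracy = []
    for _ in range(RUNS):
        correct = sum(
            # assumed question schema: {"prompt": str, "answer": "A".."F"}
            grade(complete(q["prompt"], TEMPERATURE, TOP_P), q["answer"])
            for q in questions
        )
        per_run_accuracy.append(correct / len(questions))
    return statistics.mean(per_run_accuracy)
```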

The standard prompt used is:

You are an expert at reasoning and you always pick the most realistic answer. Think step by step and output your reasoning followed by your final answer using the following format: Final Answer: X where X is one of the letters A, B, C, D, E, or F.

For o-series models, a slightly modified prompt that omits the chain-of-thought directive is used:

You are an expert at reasoning and you always pick the most realistic answer. Output your final answer using the following format: Final Answer: X where X is one of the letters A, B, C, D, E, or F. Do not include additional formatting in the final answer.
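
Both prompts require the model to end with a line of the form "Final Answer: X". One simple way to grade a response, assumed here rather than taken from the official harness, is to pull the last such letter out with a regular expression:

```python
import re

_FINAL_ANSWER = re.compile(r"Final Answer:\s*\(?([A-F])\)?", re.IGNORECASE)

def extract_final_answer(response: str) -> str | None:
    """Return the last letter A-F given after 'Final Answer:', or None if absent."""
    matches = _FINAL_ANSWER.findall(response)
    return matches[-1].upper() if matches else None

# Example: a chain-of-thought response followed by the required format.
assert extract_final_answer("The glove likely falls near the truck... Final Answer: B") == "B"
```

A `grade` callable for the earlier sketch could then simply compare this extracted letter to the gold answer.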

For full details on the dataset creation, human evaluation methodology (yielding a provisional baseline of 83.7% from nine participants), and specific model results, refer to the SimpleBench website.