SimpleQA Verified
1,000 factoid questions about politics, science and technology, art, sports, geography, music, and more.
About SimpleQA Verified
SimpleQA Verified is a factoid question-answering benchmark consisting of 1,000 questions. The table below shows the topics covered by questions in the benchmark.
| Topic | Percentage |
|---|---|
| Politics | 18% |
| Science and technology | 16% |
| Art | 15% |
| Sports | 12% |
| Geography | 11% |
| Music | 10% |
| History | 5% |
| Other | 14% |
The table below shows the answer types of questions in the benchmark.
| Answer Type | Percentage |
|---|---|
| Date | 22% |
| Person | 20% |
| Number | 19% |
| Place | 15% |
| Other | 25% |
The benchmark was created by Google, building on previous work by OpenAI. Questions were originally sourced from human workers and selected to be difficult for GPT-4; they were subsequently partially filtered to be difficult for GPT-4o, Gemini 2.0 Flash, and Claude 3.7 Sonnet. Google’s main additions were to remove highly similar questions, balance topic areas, and improve answer key accuracy.
To see all questions along with AI answers and scoring details, open the log viewer for a run (e.g. click here for GPT-5). The log viewer is protected against bot access.
Methodology
Questions are posed to models with no additional prompting and no answer choice options. Models generally give answers with varying degrees of additional detail. A grader model determines whether the given answer contains the correct piece of information. The grader model is prompted with a grading prompt developed by Google along with the benchmark.
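As a rough illustration of this flow, the sketch below poses each question to the model being evaluated and then asks a grader model to classify the answer. The `ask_model` and `ask_grader` callables, the dataset field names, and the abbreviated grading prompt are all placeholders for illustration; the actual implementation uses the grading prompt released by Google with the benchmark.

```python
# Minimal sketch of the evaluation loop. ask_model() and ask_grader() stand in
# for calls to the evaluated model and the grader model; GRADING_PROMPT is an
# abbreviated placeholder, not Google's actual grading prompt.
GRADING_PROMPT = """Given a question, a gold target, and a predicted answer,
reply with exactly one of: CORRECT, INCORRECT, NOT_ATTEMPTED.

Question: {question}
Gold target: {target}
Predicted answer: {prediction}"""


def grade(question: str, target: str, prediction: str, ask_grader) -> str:
    """Ask the grader model for a ternary verdict on a single answer."""
    verdict = ask_grader(GRADING_PROMPT.format(
        question=question, target=target, prediction=prediction))
    return verdict.strip().upper()  # "CORRECT", "INCORRECT", or "NOT_ATTEMPTED"


def evaluate(benchmark, ask_model, ask_grader) -> list[str]:
    """Pose each question verbatim (no extra prompting) and grade the answer."""
    verdicts = []
    for item in benchmark:  # "question"/"answer" field names assumed for illustration
        prediction = ask_model(item["question"])
        verdicts.append(grade(item["question"], item["answer"], prediction, ask_grader))
    return verdicts
```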
The grader model has a ternary output: correct, incorrect, and “not attempted”. The last category distinguishes a model that gives an incorrect answer from one that says it cannot come up with the answer, the latter being more desirable. We report the simple proportion of questions answered correctly. This deviates from Google’s methodology, which computes a single score as the harmonic mean (F1 score) of overall correct (the proportion of all questions answered correctly) and correct-given-attempted (the proportion of attempted questions answered correctly). We favor the simple measure for interpretability.
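For concreteness, the sketch below shows how the two scores differ, starting from a list of per-question verdicts like the one produced above. The worked numbers in the final comment are hypothetical, chosen only to illustrate the arithmetic.

```python
def simple_accuracy(verdicts: list[str]) -> float:
    """Proportion of all questions answered correctly (the score we report)."""
    return sum(v == "CORRECT" for v in verdicts) / len(verdicts)


def google_f1(verdicts: list[str]) -> float:
    """Google's headline score: harmonic mean of overall-correct and
    correct-given-attempted."""
    n = len(verdicts)
    correct = sum(v == "CORRECT" for v in verdicts)
    attempted = sum(v != "NOT_ATTEMPTED" for v in verdicts)
    overall_correct = correct / n
    correct_given_attempted = correct / attempted if attempted else 0.0
    if overall_correct + correct_given_attempted == 0:
        return 0.0
    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


# Hypothetical example: 700 correct, 200 incorrect, 100 not attempted out of
# 1,000 questions gives simple accuracy 0.70, while Google's F1 is
# 2 * 0.70 * 0.778 / (0.70 + 0.778) ≈ 0.74.
```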
Google maintains the official leaderboard here, and we generally aim to match their methodology. We have chosen to run this benchmark ourselves so that we can control which models we run it for, e.g. to have more scores to use in calculating ECI.
The code used in our implementation can be found here.