AI Developers Accurately Report GPQA Diamond Scores for Recent Models

Despite strong incentives to overstate performance, AI developers accurately report GPQA Diamond scores for their recent top models, according to our new analysis. We compared scores reported by developers to the scores from our independent, standardized evaluations on GPQA Diamond, which tests models on expert-level science questions. While the scores developers reported often differed from our own, they consistently fell well within our confidence intervals.

There is some inherent randomness in evaluating LLMs: asking the same question repeatedly may produce different answers. To account for this, we test the models multiple times on each question and aggregate the results. We believe this lets us estimate a model’s true performance, typically to within 4–6 percentage points, with 90% confidence. In other words, we are 90% confident that a model’s performance falls within the ranges shown in the chart above, accounting for the natural variability in the evaluations.

Published

September 19, 2025

Learn more

Overview

Our approach compares self-reported GPQA Diamond benchmark scores from leading AI developers against independent evaluations conducted internally by Epoch. We define two key quantities: (1) self-reported accuracy, taken directly from model release papers, blog posts, or documentation; and (2) Epoch evaluation accuracy, measured through standardized internal runs with controlled sampling. We treat Epoch’s evaluations as a strong estimate of each model’s actual accuracy and variability, and compare all reported scores against them.

This allows us to quantify whether self-reported scores are consistent with our independent estimates. Broadly, we find that most developers’ reported scores align closely with Epoch’s evaluations, suggesting that systematic overstatement is rare.

Data

The dataset comes from Epoch AI’s Benchmarking Hub, covering frontier models released between 2023 and 2025. For each developer we track, we selected its best-performing model for which we also ran an internal GPQA evaluation. Self-reported scores were collected primarily from official model release papers, developer blog posts, or Hugging Face model documentation. The exception is Mistral Large 2407: Mistral did not report GPQA results for this model, so we used the results of a third-party evaluator, which the company endorsed.

Epoch’s evaluation scores are pre-computed values from our Benchmarking Hub, based on sample sizes large enough to keep the standard error around 2–3 percentage points. The dataset therefore represents a curated subset of models, rather than the full universe of released models.
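
For intuition, the sketch below shows one simple way an accuracy estimate, its standard error, and a 90% confidence interval can be computed from repeated runs on each question. The per-question results are simulated, and the exact procedure behind our pre-computed values may differ.

```python
import numpy as np

# Simulated results: rows = GPQA Diamond's 198 questions, columns = repeated runs
# (1 = correct, 0 = incorrect). Hypothetical data, not Epoch's actual evaluations.
rng = np.random.default_rng(0)
results = rng.binomial(1, 0.85, size=(198, 8))

# Aggregate: average over runs within each question, then over questions.
question_means = results.mean(axis=1)
accuracy = question_means.mean()

# Standard error of the overall accuracy, clustering by question.
se = question_means.std(ddof=1) / np.sqrt(len(question_means))

# 90% confidence interval (two-sided, normal approximation).
half_width = 1.645 * se
print(f"estimated accuracy: {accuracy:.1%} ± {half_width:.1%} (90% CI)")
```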

Analysis

Our method applies a one-sample z-test at the α = 0.05 level for each model. Specifically, we treat Epoch’s GPQA score as the baseline population mean, and use Epoch’s standard error, calculated following the method detailed in a 2024 paper, as the population standard deviation. For each model, we test the null hypothesis that the developer’s self-reported GPQA score is drawn from the same distribution as Epoch’s estimate against the alternative that it is not.
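
As a concrete example, here is a minimal sketch of this test for a single model, using the GPT-5 figures from the table below (SciPy is assumed for the normal CDF):

```python
from scipy.stats import norm

# GPT-5 figures from the table below (percentage points).
self_reported = 85.7   # developer-reported GPQA Diamond score
epoch_score = 84.5     # Epoch's independent evaluation (baseline mean)
epoch_se = 2.2         # standard error of Epoch's estimate

# One-sample z-test: is the reported score consistent with Epoch's estimate?
z = (self_reported - epoch_score) / epoch_se
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.2f}")  # z ≈ 0.55, p ≈ 0.59, well above 0.05
```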

The results show that for all of the top models tested, the computed p-values are well above 0.05, indicating that their self-reported GPQA scores do not differ significantly from Epoch’s baseline scores. While it is technically possible for developers to inflate their GPQA scores by using majority voting over N samples (illustrated in the sketch after the table), we find no evidence of that here.

Model Name           Self-reported GPQA (%)   Epoch GPQA (%)   Epoch SE (pp)   z statistic   p-value
GPT-5                85.7                     84.5             2.2              0.55         0.59
Opus 4.1             74.9                     73.2             3.2              0.53         0.60
DeepSeek R1          71.5                     71.7             3.2             -0.06         0.95
Llama 4 Maverick     69.8                     67.0             2.8              1.00         0.32
Grok 4               87.5                     87.0             2.0              0.25         0.80
Gemini 2.5 Pro       86.4                     84.8             2.6              0.62         0.54
Qwen 3               71.1                     70.7             2.7              0.15         0.88
Mistral Large 2407   48.6                     51.3             2.7             -1.00         0.32
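
To make the majority-voting concern concrete, the sketch below shows how sampling each question several times and grading only the most common answer (sometimes called maj@N or self-consistency) can produce a substantially higher score than averaging over single samples. The answer distribution is simulated and purely hypothetical; it is not based on any developer’s actual reporting practice.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
N = 16  # samples drawn per question

# Hypothetical multiple-choice model: picks the correct option (index 0) with
# probability 0.6, otherwise one of three distractors uniformly at random.
probs = [0.6, 0.4 / 3, 0.4 / 3, 0.4 / 3]
answers = rng.choice(4, size=(198, N), p=probs)  # 198 GPQA Diamond questions

# Standard scoring: average accuracy over individual samples.
single_sample_accuracy = (answers == 0).mean()

# Majority-vote scoring: grade only the most common answer for each question.
majority_answers = np.array([Counter(row).most_common(1)[0][0] for row in answers])
majority_accuracy = (majority_answers == 0).mean()

print(f"average over single samples: {single_sample_accuracy:.1%}")
print(f"majority vote over {N} samples: {majority_accuracy:.1%}")
```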

Assumptions and limitations

Several assumptions and limitations affect the interpretation of this analysis:

  • Coverage limitations: The dataset only includes one model per lab (the highest-performing one available in our internal evaluation), so it does not cover the full distribution of models or reporting practices.
  • Benchmark scope: The analysis is restricted to GPQA Diamond. Results may not generalize to other benchmarks with different properties or levels of noise.
  • Sampling assumptions: We treat Epoch’s evaluation as the population mean with a known standard error; any systematic biases in our evaluation methodology would not be captured.