AI Developers Accurately Report GPQA Diamond Scores for Recent Models

Despite strong incentives to overstate performance, AI developers accurately report GPQA Diamond scores for their recent top models, according to our new analysis. We compared scores reported by developers to the scores from our independent, standardized evaluations on GPQA Diamond, which tests models on expert-level science questions. While the scores developers reported often differed from our own, they consistently fell well within our confidence intervals.

There is some inherent randomness in evaluating LLMs: asking the same question repeatedly may produce different answers. To account for this, we test the models multiple times on each question and aggregate the results. We believe this lets us estimate a model’s true performance, typically to within 4–6 percentage points, with 90% confidence. In other words, we are 90% confident that a model’s performance falls within the ranges shown in the chart above, accounting for the natural variability in the evaluations.

Published

September 19, 2025

Learn more

Overview

Our approach compares self-reported GPQA Diamond benchmark scores from leading AI developers against independent evaluations conducted internally by Epoch. We define two key quantities: (1) self-reported accuracy, taken directly from model release papers, blog posts, or documentation; and (2) Epoch evaluation accuracy, measured through standardized internal runs with controlled sampling. We treat Epoch’s evaluations as a strong estimate of each model’s actual accuracy and variability, and compare all reported scores against them.

This allows us to quantify whether self-reported scores are consistent with our independent estimates. Broadly, we find that most developers’ reported scores align closely with Epoch’s evaluations, suggesting that systematic overstatement is rare.

Data

The dataset comes from Epoch AI’s Benchmarking Hub, covering frontier models released between 2023 and 2025. For each developer we track, we selected its best-performing model for which we also ran an internal GPQA evaluation. Self-reported scores were collected primarily from official model release papers, developer blog posts, or Hugging Face model documentation. The exception is Mistral Large 2407: Mistral did not report GPQA results for this model, so we used the results of a third-party evaluator, which the company endorsed.

Epoch’s evaluation scores are pre-computed values from our Benchmarking Hub, based on sample sizes large enough to keep the standard error around 2–3 percentage points. The dataset therefore represents a curated subset of models, rather than the full universe of released models.
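
For intuition, the sketch below shows one simple way an accuracy estimate, its standard error, and a 90% confidence interval can be computed from repeated runs on each question. The per-question results are simulated, and the exact procedure behind our pre-computed values may differ.

```python
import numpy as np

# Simulated results: rows = GPQA Diamond's 198 questions, columns = repeated runs
# (1 = correct, 0 = incorrect). Hypothetical data, not Epoch's actual evaluations.
rng = np.random.default_rng(0)
results = rng.binomial(1, 0.85, size=(198, 8))

# Aggregate: average over runs within each question, then over questions.
question_means = results.mean(axis=1)
accuracy = question_means.mean()

# Standard error of the overall accuracy, clustering by question.
se = question_means.std(ddof=1) / np.sqrt(len(question_means))

# 90% confidence interval (two-sided, normal approximation).
half_width = 1.645 * se
print(f"estimated accuracy: {accuracy:.1%} ± {half_width:.1%} (90% CI)")
```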

Analysis

Our method applies a one-sample z-test at the α = 0.05 level for each model. Specifically, we treat Epoch’s GPQA score as the baseline population mean, and use Epoch’s standard error, calculated following the method detailed in a 2024 paper, as the population standard deviation. For each model, we test the null hypothesis that the developer’s self-reported GPQA score is drawn from the same distribution as Epoch’s estimate against the alternative that it is not.
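
As a concrete example, here is a minimal sketch of this test for a single model, using the GPT-5 figures from the table below (SciPy is assumed for the normal CDF):

```python
from scipy.stats import norm

# GPT-5 figures from the table below (percentage points).
self_reported = 85.7   # developer-reported GPQA Diamond score
epoch_score = 84.5     # Epoch's independent evaluation (baseline mean)
epoch_se = 2.2         # standard error of Epoch's estimate

# One-sample z-test: is the reported score consistent with Epoch's estimate?
z = (self_reported - epoch_score) / epoch_se
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.2f}")  # z ≈ 0.55, p ≈ 0.59, well above 0.05
```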

The results show that for all of the top models tested, the computed p-values are well above 0.05, indicating that their self-reported GPQA scores do not differ significantly from Epoch’s baseline scores. While it is technically possible for developers to inflate their GPQA scores by using majority voting over N samples (illustrated in the sketch after the table), we find no evidence of that here.

Model Name           Self-reported GPQA (%)   Epoch GPQA (%)   Epoch SE (pp)   z statistic   p-value
GPT-5                85.7                     84.5             2.2              0.55         0.59
Opus 4.1             74.9                     73.2             3.2              0.53         0.60
DeepSeek R1          71.5                     71.7             3.2             -0.06         0.95
Llama 4 Maverick     69.8                     67.0             2.8              1.00         0.32
Grok 4               87.5                     87.0             2.0              0.25         0.80
Gemini 2.5 Pro       86.4                     84.8             2.6              0.62         0.54
Qwen 3               71.1                     70.7             2.7              0.15         0.88
Mistral Large 2407   48.6                     51.3             2.7             -1.00         0.32
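
To make the majority-voting concern concrete, the sketch below shows how sampling each question several times and grading only the most common answer (sometimes called maj@N or self-consistency) can produce a substantially higher score than averaging over single samples. The answer distribution is simulated and purely hypothetical; it is not based on any developer’s actual reporting practice.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
N = 16  # samples drawn per question

# Hypothetical multiple-choice model: picks the correct option (index 0) with
# probability 0.6, otherwise one of three distractors uniformly at random.
probs = [0.6, 0.4 / 3, 0.4 / 3, 0.4 / 3]
answers = rng.choice(4, size=(198, N), p=probs)  # 198 GPQA Diamond questions

# Standard scoring: average accuracy over individual samples.
single_sample_accuracy = (answers == 0).mean()

# Majority-vote scoring: grade only the most common answer for each question.
majority_answers = np.array([Counter(row).most_common(1)[0][0] for row in answers])
majority_accuracy = (majority_answers == 0).mean()

print(f"average over single samples: {single_sample_accuracy:.1%}")
print(f"majority vote over {N} samples: {majority_accuracy:.1%}")
```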

Assumptions and limitations

Several assumptions and limitations affect the interpretation of this analysis:

  • Coverage limitations: The dataset only includes one model per lab (the highest-performing one available in our internal evaluation), so it does not cover the full distribution of models or reporting practices.
  • Benchmark scope: The analysis is restricted to GPQA Diamond. Results may not generalize to other benchmarks with different properties or levels of noise.
  • Sampling assumptions: We treat Epoch’s evaluation as the population mean with a known standard error; any systematic biases in our evaluation methodology would not be captured.