Data Insight
Sep. 19, 2025

AI developers accurately report GPQA Diamond scores for recent models

By Jaeho Lee and Yafah Edelman

Despite strong incentives to overstate performance, AI developers accurately report GPQA Diamond scores for their recent top models, according to our new analysis. We compared scores reported by developers to the scores from our independent, standardized evaluations on GPQA Diamond, which tests models on expert-level science questions. While the developers’ findings often differed from our own, they consistently fell well within our confidence intervals.

There is some inherent randomness in evaluating LLMs: asking the same question repeatedly may produce different answers. To account for this, we test the models multiple times on each question and aggregate the results. We believe this approach lets us estimate a model's true performance, typically to within 4-6 percentage points, with 90% confidence. This means we are 90% confident that a model's performance falls within the ranges shown in the above chart, accounting for the natural variability in the evaluations.
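The aggregation described above can be sketched as follows. The run counts, scores, and normal-approximation interval here are illustrative assumptions for exposition, not Epoch's exact procedure:

```python
import statistics


def accuracy_ci(run_accuracies, z=1.645):
    """90% confidence interval for a model's benchmark accuracy.

    run_accuracies: accuracy on the full question set from each
    independent evaluation run (values between 0 and 1).
    Uses a normal approximation on the mean across runs (illustrative).
    """
    mean = statistics.mean(run_accuracies)
    # Standard error of the mean across independent runs.
    se = statistics.stdev(run_accuracies) / len(run_accuracies) ** 0.5
    return mean - z * se, mean + z * se


# Hypothetical accuracies from five independent runs of one model:
runs = [0.71, 0.74, 0.69, 0.73, 0.72]
lo, hi = accuracy_ci(runs)
```

With tighter run-to-run agreement (or more runs) the interval narrows; the 4-6 percentage-point widths quoted above reflect the variability Epoch actually observed.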

Epoch's work is free to use, distribute, and reproduce provided the source and authors are credited under the Creative Commons BY license.

Learn more about this graph

Our approach compares self-reported GPQA Diamond benchmark scores from leading AI developers against independent evaluations conducted internally by Epoch. We define two key quantities: (1) self-reported accuracy, taken directly from model release papers, blog posts, or documentation; and (2) Epoch evaluation accuracy, measured through standardized internal runs with controlled sampling. We treat Epoch's evaluations as a reliable estimate of the models' actual accuracy and variability, and compare all reported scores against them.

This allows us to quantify whether self-reported scores are consistent with our independent estimates. Broadly, we find that most developers' reported scores align closely with Epoch's evaluations, suggesting that systematic overstatement is rare.
