Our approach compares self-reported GPQA Diamond benchmark scores from leading AI developers against independent evaluations conducted internally by Epoch. We define two key quantities: (1) self-reported accuracy, taken directly from model release papers, blog posts, or documentation; and (2) Epoch evaluation accuracy, measured through standardized internal runs with controlled sampling. We treat Epoch's evaluations as a strong estimate of each model's true accuracy and its variation, and compare all reported scores against this baseline.
This allows us to quantify whether self-reported scores are consistent with our independent estimates. Broadly, we find that most developers' reported scores align closely with Epoch's evaluations, suggesting that systematic overestimation is rare.
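To make this kind of consistency check concrete, the sketch below shows one simple way to test whether a reported score falls within the expected variation of an independent estimate, using a binomial standard error over GPQA Diamond's 198 questions. The function name, the example inputs, and the 95% interval are illustrative assumptions; this is not Epoch's actual analysis pipeline.

```python
import math

def consistency_check(reported_acc, epoch_acc, n_questions=198, z=1.96):
    """Check whether a self-reported accuracy is consistent with an
    independent estimate, using a simple binomial standard error.

    Illustrative sketch only; the real analysis may model variation differently.
    """
    # Standard error of benchmark accuracy under a binomial model,
    # treating the independent measurement as the reference estimate.
    se = math.sqrt(epoch_acc * (1 - epoch_acc) / n_questions)
    gap = reported_acc - epoch_acc
    return {
        "gap": gap,
        "standard_error": se,
        "within_interval": abs(gap) <= z * se,
    }

# Hypothetical example: a model reported at 0.83 vs. an independent estimate of 0.80.
print(consistency_check(reported_acc=0.83, epoch_acc=0.80))
```

Under this toy model, a three-point gap on 198 questions sits within roughly two standard errors of the independent estimate, so it would not by itself indicate over-reporting.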

