Data Insight
Nov. 27, 2024
Updated Feb. 7, 2025

Accuracy increases with estimated training compute

By Jean-Stanislas Denain

GPQA Diamond and MATH Level 5 accuracies increase with estimated training compute. For GPQA Diamond, below 10^24 FLOP most models struggle to rise above random-chance performance, or even fall below it because they fail to understand the question formatting. Past 10^24 FLOP, accuracy increases by around 12 percentage points with every 10x increase in compute.


On MATH Level 5, models with higher compute estimates also tend to score higher: accuracy increases by around 17 percentage points with every 10x increase in pretraining compute. However, the trend is much noisier than for GPQA Diamond.
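These trends amount to a simple log-linear model: accuracy rises by a fixed number of percentage points per tenfold increase in training compute. A minimal sketch of that arithmetic is below; only the ~12 and ~17 point-per-decade slopes come from the trends above, while the reference compute levels and reference accuracies are illustrative placeholders, not values reported here.

```python
import math

def projected_accuracy(compute_flop, slope_pp_per_decade,
                       ref_compute_flop, ref_accuracy_pp):
    """Log-linear trend: accuracy gains `slope_pp_per_decade` percentage
    points for every 10x increase in training compute past a reference point."""
    decades = math.log10(compute_flop / ref_compute_flop)
    return ref_accuracy_pp + slope_pp_per_decade * decades

# GPQA Diamond: ~12 pp per 10x of compute past 10^24 FLOP
# (reference accuracy of 25 is an illustrative placeholder)
gpqa_at_1e25 = projected_accuracy(1e25, slope_pp_per_decade=12,
                                  ref_compute_flop=1e24, ref_accuracy_pp=25)

# MATH Level 5: ~17 pp per 10x of compute
# (reference values are illustrative placeholders)
math5_at_1e25 = projected_accuracy(1e25, slope_pp_per_decade=17,
                                   ref_compute_flop=1e24, ref_accuracy_pp=30)
```

Under these assumed reference points, one extra decade of compute projects roughly 12 and 17 additional percentage points on the two benchmarks, respectively.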

On both benchmarks, more recent models such as DeepSeek-R1, Phi-4, and Mistral Small 3 outperform older models trained with the same amount of compute, highlighting the role of algorithmic progress. Finally, note that these trends exclude many of the top-performing models, such as OpenAI's o1, for which we lack compute estimates.

Epoch's work is free to use, distribute, and reproduce provided the source and authors are credited under the Creative Commons BY license.
