Epoch's work is free to use, distribute, and reproduce provided the source and authors are credited under the Creative Commons BY license.
Using data from Epoch’s Benchmarking Hub, we compute the average length of model responses across GPQA Diamond, MATH Level 5, and OTIS Mock AIME 2024-2025, and plot the trend over time. Output lengths from reasoning models have grown significantly faster than those from non-reasoning models (5x per year vs. 2.2x per year). Unsurprisingly, we also find that reasoning models use far more tokens than non-reasoning models.
Code for our analysis is available here.
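For readers who want a feel for what a figure like "5x per year" means, the following is a minimal sketch of one common way to estimate such a growth factor: fit a straight line to the logarithm of average response length against time, then exponentiate the slope. This is an illustration under stated assumptions, not the analysis code referenced above. The function name `annual_growth_factor` is hypothetical, and the data points are synthetic values constructed to grow at exactly 5x per year; they are not taken from the Benchmarking Hub.

```python
import numpy as np


def annual_growth_factor(years, avg_tokens):
    """Fit log10(tokens) = a + b * year and return 10**b.

    10**b is the multiplicative change in average response length per year.
    """
    years = np.asarray(years, dtype=float)
    log_tokens = np.log10(np.asarray(avg_tokens, dtype=float))
    # Center the years so the least-squares fit is numerically well conditioned.
    slope, _intercept = np.polyfit(years - years.mean(), log_tokens, 1)
    return float(10 ** slope)


# Synthetic, made-up data: release dates as decimal years, and average
# output lengths that grow by exactly 5x per year from an arbitrary base.
release_years = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
synthetic_tokens = 1000.0 * 5.0 ** (release_years - 2023.0)

print(annual_growth_factor(release_years, synthetic_tokens))
```

Because the synthetic points lie exactly on an exponential curve, the fit recovers a factor of roughly 5 by construction. Real per-model averages would scatter around the fitted line, so an estimate from real data should be reported with an uncertainty range.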
Explore this data
Benchmark results featuring the performance of leading AI models on challenging tasks.