LLM responses to benchmark questions are getting longer over time

The length of responses generated by language models to benchmark questions has increased over time for both reasoning and non-reasoning models. According to our internal benchmarks, reasoning models’ responses are growing considerably faster (5x per year) than those from non-reasoning models (2.2x per year). Reasoning models also produce longer responses overall: currently around 8x more tokens on average than non-reasoning models.

Reasoning models have been shown to produce more reasoning tokens as the amount of RL training is scaled up. The explanation for longer responses from non-reasoning models is less clear, but it may reflect trends toward extended context lengths, judges’ preferences for longer responses when generating RLHF data, or a growing use of context distillation from Chain of Thought prompting. Finally, while today’s models are cleanly separable into reasoning vs. non-reasoning, there are indications that future models (like GPT-5) may blur this line.

Published: April 17, 2025

Last updated: April 17, 2025

Overview

Using data from Epoch’s Benchmarking Hub, we identify the average length of model responses across GPQA Diamond, MATH Level 5, and OTIS Mock AIME 2024-2025, and plot the trend over time. We observe that output lengths from reasoning models have grown significantly faster than their non-reasoning counterparts (5x per year, vs 2.2x). Unsurprisingly, we also find that reasoning models use far more tokens than non-reasoning models.

Code for our analysis is available here.

Data

Our data comes from Epoch’s Benchmarking Hub. We focus on model responses to the GPQA Diamond, MATH Level 5, and OTIS Mock AIME 2024-2025 benchmarks, since these have the most coverage and elicit similar response lengths. Across these three benchmarks, we have data on 77 unique models, evaluated a total of 235 times. Of these models, 11 are reasoning models, 65 are non-reasoning models, and 1 (Claude 3.7 Sonnet) is evaluated both with and without extended reasoning.

Some reasoning models are offered with varying levels of “effort”. For these models, we used only benchmark runs with the highest level of reasoning effort each model is capable of. Across OpenAI’s reasoning models, moving from “medium” to “high” effort resulted in a 1.6x increase in output tokens.
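As a rough illustration of this filtering step, a minimal sketch might look like the following (assuming a pandas DataFrame of benchmark runs with hypothetical columns `model`, `reasoning_effort`, and `output_tokens`; the actual Benchmarking Hub schema may differ):

```python
import pandas as pd

# Hypothetical effort ordering; models without an effort setting get rank 0.
EFFORT_RANK = {"low": 0, "medium": 1, "high": 2}

def keep_highest_effort_runs(runs: pd.DataFrame) -> pd.DataFrame:
    """Keep only the runs at the highest reasoning effort each model supports."""
    runs = runs.copy()
    runs["effort_rank"] = runs["reasoning_effort"].map(EFFORT_RANK).fillna(0)
    max_rank = runs.groupby("model")["effort_rank"].transform("max")
    return runs.loc[runs["effort_rank"] == max_rank].drop(columns="effort_rank")
```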

Analysis

We perform log-linear regressions of output token length on time, fitting reasoning and non-reasoning models separately. To generate confidence intervals, we bootstrap by resampling with replacement (n=500), then take the 5th and 95th percentile slopes and predictions across samples. Our results were as follows, with 90% confidence intervals in parentheses:

|                      | GPQA Diamond              | MATH Level 5              | OTIS Mock AIME 2024-2025  | Combined                  |
|----------------------|---------------------------|---------------------------|---------------------------|---------------------------|
| Non-reasoning models | 1.9x per year (1.6 - 2.3) | 1.9x per year (1.7 - 2.2) | 2.9x per year (2.3 - 3.9) | 2.2x per year (2.0 - 2.6) |
| Reasoning models     | 10x per year (4 - 24)     | 3x per year (1 - 9)       | 4x per year (2 - 6)       | 5x per year (2 - 13)      |
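
For concreteness, the regression and bootstrap procedure described above can be sketched roughly as follows. This is illustrative only (the real analysis code is linked above), and column names such as `release_date` and `output_tokens` are assumptions:

```python
import numpy as np
import pandas as pd

def fit_growth_rate(dates: pd.Series, tokens: pd.Series) -> float:
    """OLS fit of log(output tokens) on time; returns the implied growth factor per year."""
    years = (dates - dates.min()).dt.days / 365.25
    slope, _ = np.polyfit(years, np.log(tokens), 1)
    return float(np.exp(slope))  # e.g. 2.2 means 2.2x growth per year

def bootstrap_ci(df: pd.DataFrame, n_boot: int = 500, seed: int = 0) -> np.ndarray:
    """90% CI from the 5th and 95th percentile growth rates across bootstrap resamples."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(df), size=len(df))  # resample rows with replacement
        sample = df.iloc[idx]
        rates.append(fit_growth_rate(sample["release_date"], sample["output_tokens"]))
    return np.percentile(rates, [5, 95])
```

In the actual analysis, a fit like this would be run separately on the reasoning and non-reasoning subsets, per benchmark as well as on the combined data.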

While results for our non-reasoning model subset are quite stable, the slope for reasoning models is quite sensitive: omitting individual models can shift the overall trend substantially.
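
One simple way to probe this sensitivity is a leave-one-out refit, reusing the hypothetical `fit_growth_rate` helper from the sketch above:

```python
import pandas as pd

def leave_one_out_rates(df: pd.DataFrame) -> pd.Series:
    """Refit the growth rate with each model dropped in turn, to see how much
    any single model moves the fitted trend."""
    rates = {}
    for model in df["model"].unique():
        subset = df[df["model"] != model]
        rates[model] = fit_growth_rate(subset["release_date"], subset["output_tokens"])
    return pd.Series(rates).sort_values()
```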

Assumptions

  • We assume that exponential growth is an appropriate model for the trend in output response length. To test this, we compared against a linear model and found that, for both reasoning and non-reasoning models, the exponential fit was weakly (but not statistically significantly) preferred over the linear fit (a rough sketch of one way to run such a comparison appears after this list).
  • We assume that trends in output length for the three benchmarks we focused on are indicative of trends among other benchmarks. We intend to examine results for other internal benchmark evaluations as they become available.
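
The text above does not specify how the exponential and linear fits were compared; as one possible approach (an assumption on our part, not necessarily the criterion actually used), the two models can be compared on the original token scale with an information criterion, again using the hypothetical `release_date` and `output_tokens` columns:

```python
import numpy as np
import pandas as pd

def aic(y, y_hat, k: int = 2) -> float:
    """Akaike information criterion for a least-squares fit with k parameters."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    return n * np.log(rss / n) + 2 * k

def compare_fits(dates: pd.Series, tokens: pd.Series) -> dict:
    """Rough comparison of a linear fit of tokens on time vs. an exponential (log-linear) fit."""
    years = (dates - dates.min()).dt.days / 365.25
    linear_pred = np.polyval(np.polyfit(years, tokens, 1), years)
    exp_pred = np.exp(np.polyval(np.polyfit(years, np.log(tokens), 1), years))
    return {"linear_aic": aic(tokens, linear_pred), "exponential_aic": aic(tokens, exp_pred)}
```

A lower AIC indicates a better fit after penalizing model parameters; since the exponential predictions are back-transformed from log space, this should be read only as a rough comparison.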