LLMs now accept longer inputs, and the best models can use them more effectively
Today’s LLMs can only ingest a limited number of tokens per query; this limit is often called the context window. Since mid-2023, the longest LLM context windows have grown by about 30x per year. Their ability to use that input effectively is improving even faster: on two long-context benchmarks, the input length where top models reach 80% accuracy has risen by over 250x in the past 9 months.
Still, headroom remains. Many long-context tasks are likely more difficult than these benchmarks, and some relevant content does not fit into even the longest available windows.
Learn more
Data
We collect context window size from Artificial Analysis. Model release dates are from Epoch’s AI Models database. This yields data for 123 models.
The two long-context benchmarks we use are Fiction.liveBench, which measures narrative comprehension, and MRCR, which measures the ability to retrieve context-dependent information. For MRCR we use the 2-needle setting. Data for Fiction.liveBench is available on Epoch’s Benchmarking Hub. MRCR data is from Context Arena. Thanks to the creators of both benchmarks, kas and Dillon Uzar, respectively, for their assistance in accessing this data.
Benchmark data is only available for newer models. We have data from Fiction.liveBench for 37 models and MRCR for 49 models, with 30 models in the intersection.
Analysis
For calculating the trend in frontier context window lengths, we first identify the models whose context windows were among the 10 longest at the date of their release. We then fit a linear model with release date as the independent variable and log of context window length as the dependent variable. The R² for this fit is 0.76. The regression coefficient translates to an annual growth rate of 30x. A bootstrap resample (n=1000) gives a 90% confidence interval for this growth rate of 10x-50x.
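To make the procedure concrete, here is a minimal sketch of the frontier selection, regression, and bootstrap. It assumes a hypothetical pandas DataFrame `models` with columns `release_date` (a datetime) and `context_window` (tokens); the column names and helper functions are illustrative rather than taken from our published code.

```python
# Sketch of the frontier-trend fit, assuming a hypothetical DataFrame `models`
# with columns "release_date" (datetime) and "context_window" (tokens).
import numpy as np
import pandas as pd
from scipy import stats

def frontier_mask(df: pd.DataFrame, col: str, k: int = 10) -> pd.Series:
    """Flag models whose value of `col` was among the k largest at their release date."""
    df = df.sort_values("release_date")
    flags = []
    for i in range(len(df)):
        seen = df[col].iloc[: i + 1]                  # models released up to this date
        flags.append(df[col].iloc[i] >= seen.nlargest(k).min())
    return pd.Series(flags, index=df.index)

def annual_growth(df: pd.DataFrame, col: str) -> float:
    """OLS of log10(col) on release date, expressed as a growth factor per year."""
    years = (df["release_date"] - df["release_date"].min()).dt.days / 365.25
    slope = stats.linregress(years, np.log10(df[col])).slope
    return 10 ** slope

frontier = models[frontier_mask(models, "context_window")]
print(annual_growth(frontier, "context_window"))      # the fit reported above gives ~30x per year

# Bootstrap (n=1000) for a 90% confidence interval on the growth rate.
rng = np.random.default_rng(0)
boot = [annual_growth(frontier.iloc[rng.integers(0, len(frontier), len(frontier))], "context_window")
        for _ in range(1000)]
print(np.percentile(boot, [5, 95]))                   # reported interval: roughly 10x to 50x
```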
For calculating the 80% accuracy input length, we proceed as follows. For each model, on each benchmark, we fit a linear model with log of input length as the independent variable and accuracy as the dependent variable. On average, this fit has an R² of 0.79. Qualitatively, we see the expected pattern: accuracy declines as input grows. While a linear fit has the conceptual drawback that the predicted accuracy can fall above 100% or below 0%, we found it to be a simple and effective estimator for this use case. We use these fits to interpolate the input length at which the model would score 80% on the benchmark. We discard models whose interpolated lengths are less than 50 tokens: they cannot perform the task even on short inputs.
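As an illustration of this interpolation, the sketch below fits the same kind of linear model for a single model on a single benchmark; the function name and example numbers are ours.

```python
# Sketch of the 80% accuracy interpolation for one model on one benchmark.
import numpy as np

def input_length_at_accuracy(lengths, accuracies, target=0.8, min_tokens=50):
    """Fit accuracy ≈ a + b·log10(length) and invert the fit at the target accuracy.

    Returns None when the interpolated length is below `min_tokens`, i.e. the
    model cannot perform the task even on short inputs.
    """
    x = np.log10(np.asarray(lengths, dtype=float))
    y = np.asarray(accuracies, dtype=float)
    b, a = np.polyfit(x, y, 1)                 # slope, intercept
    if b == 0:
        return None
    length = 10 ** ((target - a) / b)
    return length if length >= min_tokens else None

# Hypothetical results: accuracy degrading from 95% at 1K tokens to 55% at 128K.
print(input_length_at_accuracy([1_000, 8_000, 32_000, 128_000],
                               [0.95, 0.88, 0.74, 0.55]))   # ≈ 10,500 tokens for this fit
```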
For models where we have data for both benchmarks, we average the resulting 80% accuracy input lengths.
We treat models that only have data for one benchmark differently. Because the benchmarks have different distributions of 80% accuracy input lengths, we prefer to interpolate missing data rather than simply use the one available datapoint. This is justified by the benchmarks' 80% accuracy input lengths being reasonably correlated: the R² of a log-log fit is 0.37.
So, for models with data from only one benchmark, we use this fit to interpolate a value for the other benchmark. We then average the two values, one per benchmark, to get an aggregate 80% accuracy input length.
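A sketch of this imputation-and-averaging step, assuming two dicts mapping model names to per-benchmark 80% accuracy input lengths in tokens; the names are illustrative, and taking a simple arithmetic mean of the two lengths is our reading of “average”.

```python
# Sketch of the cross-benchmark imputation and averaging. Assumes two dicts
# mapping model name -> 80% accuracy input length (tokens), one per benchmark.
import numpy as np

def aggregate_lengths(fiction: dict, mrcr: dict) -> dict:
    shared = sorted(set(fiction) & set(mrcr))
    x = np.log10([fiction[m] for m in shared])
    y = np.log10([mrcr[m] for m in shared])
    # Log-log fits in both directions, so either benchmark can be imputed from the other.
    b_fm, a_fm = np.polyfit(x, y, 1)           # predict log10(MRCR) from log10(Fiction.liveBench)
    b_mf, a_mf = np.polyfit(y, x, 1)           # predict log10(Fiction.liveBench) from log10(MRCR)

    aggregate = {}
    for m in set(fiction) | set(mrcr):
        f, r = fiction.get(m), mrcr.get(m)
        if f is None:
            f = 10 ** (a_mf + b_mf * np.log10(r))
        elif r is None:
            r = 10 ** (a_fm + b_fm * np.log10(f))
        aggregate[m] = (f + r) / 2             # arithmetic mean of the two lengths (our assumption)
    return aggregate
```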
We again identify the models whose 80% accuracy input lengths were among the 10 longest at the date of their release, and fit the same regression of log length on release date. The R² for this fit is 0.59. The annual growth rate implied by this fit is 3000x, but with a wide range: a bootstrap resample (n=1000) gives a 90% confidence interval of 200x to 20,000x. This is both because our benchmark data spans less than a year, and because only a few models have pushed the 80% accuracy input length into 6-digit territory. We can thus say that, by our metric, models’ long-context abilities are probably growing faster than overall context window length, but the growth rate is difficult to characterize precisely.
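The same frontier selection and trend fit sketched above for context window lengths can be reused here, assuming a hypothetical DataFrame `lengths80` holding each model's aggregate 80% accuracy input length:

```python
# Reuse of the helpers sketched above, now on a hypothetical DataFrame `lengths80`
# with columns "release_date" and "length_80" (aggregate 80% accuracy input length).
frontier80 = lengths80[frontier_mask(lengths80, "length_80")]
print(annual_growth(frontier80, "length_80"))   # point estimate ~3000x per year, with a wide bootstrap CI
```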
Reference lines were chosen to reflect a variety of documents that would plausibly be useful to have in context at once. We avoided corpora that would likely appear in pre-training data, e.g. all of Wikipedia. Some of the examples we give (e.g., the PyTorch codebase) likely do appear in pre-training, but they are indicative of cases that do not (e.g., large private codebases). To estimate token counts, we find public estimates of words, pages, or lines of code. We use heuristics of 250-500 words per page depending on the source, 4/3 tokens per word, and 10 tokens per line of code.
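For concreteness, here is a minimal sketch of these token-count heuristics; the 400 words-per-page default is our own choice within the 250-500 range stated above.

```python
# Sketch of the token-count heuristics used for the reference lines.
def tokens_from_words(words: float) -> float:
    return words * 4 / 3                      # 4/3 tokens per word

def tokens_from_pages(pages: float, words_per_page: float = 400) -> float:
    return tokens_from_words(pages * words_per_page)   # 250-500 words/page depending on source

def tokens_from_loc(lines: float) -> float:
    return lines * 10                         # 10 tokens per line of code

print(f"{tokens_from_pages(300):,.0f}")       # a 300-page book: ~160,000 tokens
```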
Code for our analysis is available here.
Assumptions
We exercised judgement in selecting which MRCR setting to use when calculating the 80% accuracy input length. Specifically, there are three MRCR settings: 2-, 4-, and 8-needle. In each setting, the model must retrieve a passage (“needle”) from a conversation, but the settings differ in how many distractor passages are present: 1, 3, or 7. Naturally, higher-needle settings are more challenging.
We chose to use the 2-needle setting, both because Context Arena has the most data available for it and because it shows a wider spread of scores across models. However, even the model with the highest score on our metric, Gemini 2.5 Pro (06-05), only scores above 80% at the 8K input length in the 8-needle setting.
Additionally, the maximum input length tested by Fiction.liveBench is 192K tokens. Most models fall below 80% accuracy by that point, but a few remain above. While these models still show degradation as input lengths grow up to that point, our extrapolation of the input length at which they would fall below 80% is likely somewhat less reliable.
Thus, we must emphasize that our effectiveness metric is far from the ultimate test of all long-context abilities. Rather, it shows the limits of today’s models on two benchmarks that are moderately challenging and reasonably reflective of capabilities relevant for some real-world long-context tasks.