For calculating the trend in frontier context window lengths, we first identify the models whose context windows were among the 10 longest at the date of their release. We then fit a linear model with release date as the independent variable and log of context window length as the dependent variable. The R² for this fit is 0.76. The regression coefficient translates to an annual growth rate of 30x. A bootstrap resample (n=1000) gives a 90% confidence interval for this growth rate of 10x-50x.
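The fit-and-bootstrap procedure can be sketched as follows. The release dates and context window lengths below are hypothetical placeholders for illustration, not our actual data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: release dates (decimal years) and context window
# lengths (tokens) for models in the top 10 at their release date.
release_year = np.array([2022.5, 2023.0, 2023.4, 2023.9, 2024.2, 2024.6])
context_len = np.array([8e3, 32e3, 1e5, 2e5, 1e6, 2e6])

def annual_growth(x, y):
    """Fit log10(y) ~ x and return the implied per-year growth factor."""
    slope, _intercept = np.polyfit(x, np.log10(y), 1)
    return 10 ** slope

point_estimate = annual_growth(release_year, context_len)

# Bootstrap: resample (date, length) pairs with replacement 1000 times,
# refit each resample, and take the 5th/95th percentiles as a 90% CI.
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(release_year), size=len(release_year))
    boot.append(annual_growth(release_year[idx], context_len[idx]))
ci_low, ci_high = np.percentile(boot, [5, 95])
```

Because the trend is fit in log space, exponentiating the slope gives the multiplicative growth per year directly.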
For calculating the 80% accuracy input length, we proceed as follows. For each model, on each benchmark, we fit a linear model with log of input length as the independent variable and accuracy as the dependent variable. On average, this fit has an R² of 0.79. Qualitatively, we see the expected pattern: accuracy declines as input grows. While a linear fit has the conceptual drawback that its predictions can fall above 100% or below 0%, we found it to be a simple and effective estimator for this use case. We use these fits to interpolate the input length at which the model would score 80% on the benchmark. We discard models whose interpolated lengths are less than 50 tokens: they cannot perform the task even on short inputs.
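Per model and benchmark, the fit-and-invert step looks like this; the accuracy numbers below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical results for one model on one benchmark:
# accuracy at each tested input length.
input_tokens = np.array([1e3, 4e3, 16e3, 64e3, 256e3])
accuracy = np.array([0.98, 0.95, 0.88, 0.74, 0.55])

# Fit accuracy ~ log10(input length), then invert the fitted line to
# find the input length at which the model would score 80%.
slope, intercept = np.polyfit(np.log10(input_tokens), accuracy, 1)
len_80 = 10 ** ((0.80 - intercept) / slope)

# Discard models that cannot do the task even on short inputs.
if len_80 < 50:
    len_80 = None
```

Inverting the line rather than interpolating between raw datapoints smooths over non-monotonic noise in the per-length accuracies.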
For models where we have data for both benchmarks, we average the resulting 80% accuracy input lengths.
We treat models that only have data for one benchmark differently. Because the benchmarks have different distributions of 80% accuracy input lengths, we prefer to interpolate the missing datapoint rather than simply use the one we have. This is justified by the benchmarks' 80% accuracy input lengths being reasonably correlated: a log-log fit has an R² of 0.37.
So, for models where only one benchmark is present, we use this fit to interpolate a value for the other benchmark. We then average the two values (one observed, one interpolated) to get an aggregate 80% accuracy input length.
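The imputation step above can be sketched as follows. The benchmark values are hypothetical, and the arithmetic averaging at the end is our reading of "average the two values":

```python
import numpy as np

# Hypothetical 80% accuracy input lengths (tokens) for models that have
# results on both benchmarks.
bench_a = np.array([2e3, 8e3, 3e4, 9e4, 4e5])
bench_b = np.array([1e3, 5e3, 1e4, 6e4, 2e5])

# Log-log fit: predict benchmark B's length from benchmark A's.
slope, intercept = np.polyfit(np.log10(bench_a), np.log10(bench_b), 1)

def impute_b_from_a(a_value):
    """Interpolate a model's missing benchmark-B length from its A length."""
    return 10 ** (intercept + slope * np.log10(a_value))

# A model with only a benchmark-A result: average the observed A value
# with the imputed B value to get the aggregate length.
a_only = 5e4
aggregate = (a_only + impute_b_from_a(a_only)) / 2
```

Imputing through the fit, rather than copying the single datapoint, corrects for the benchmarks' different typical difficulty.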
We again identify the models whose 80% accuracy input lengths were among the 10 longest at the date of their release, and fit a linear model with release date as the independent variable and log of the 80% accuracy input length as the dependent variable. The R² for this fit is 0.59. The annual growth rate implied by this fit is 3000x, but with a wide range: a bootstrap resample (n=1000) gives a 90% confidence interval of 200x to 20,000x. This is both because our benchmark data spans less than a year, and because only a few models have pushed the 80% accuracy input length into 6-digit territory. We can thus say that, by our metric, models' long-context abilities are probably growing faster than overall context window length, but the growth rate is difficult to characterize precisely.
Reference lines were chosen to reflect a variety of documents that would plausibly be useful to have in context at once. We avoided corpora that likely appear in pre-training data, e.g. all of Wikipedia. Some of the examples given (e.g., the PyTorch codebase) likely do appear in pre-training data, but they stand in for similar cases that do not (e.g., large private codebases). To estimate token counts, we find public estimates of words, pages, or lines of code, then apply heuristics of 250-500 words per page depending on the source, 4/3 tokens per word, and 10 tokens per line of code.
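These heuristics reduce to simple arithmetic; the 300-page book and 100,000-line codebase below are hypothetical examples, not reference lines from our figure:

```python
# Token-count heuristics from the text: 250-500 words per page
# (chosen per source), 4/3 tokens per word, 10 tokens per line of code.
def tokens_from_pages(pages, words_per_page=400):
    # words_per_page is picked within the 250-500 range per source.
    return pages * words_per_page * 4 / 3

def tokens_from_loc(lines_of_code):
    return lines_of_code * 10

# Hypothetical examples: a 300-page book and a 100,000-line codebase.
book_tokens = tokens_from_pages(300)    # 160,000 tokens
repo_tokens = tokens_from_loc(100_000)  # 1,000,000 tokens
```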
Code for our analysis is available here.