Training open-weight models is becoming more data intensive
The ratio of training data to active parameters in open-weight LLMs has grown 3.1x per year since 2022. Recent models have been trained with 20 times more data per parameter than the optimal ratio suggested by the 2022 Chinchilla scaling laws. Our analysis focuses on open-weight models, for which information on training tokens and parameter counts is more readily available.
This trend could be driven by economic incentives: models trained at higher tokens-per-parameter ratios can achieve comparable performance with fewer parameters, making them less expensive to serve at inference time. Open-weight developers may also favor scaling data rather than parameters to keep their models small enough for users to run on their own infrastructure.
Published: August 01, 2025
Overview
We explore trends in the number of tokens per active parameter used to train notable open-weight language models. Tokens per parameter is the total number of training tokens (dataset size multiplied by the number of epochs) divided by the number of parameters activated on a forward pass. Our analysis shows a clear upward trend: the average tokens per parameter was approximately 10 in 2022 and climbed to around 300 by 2025.
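As a worked illustration of this definition, the sketch below uses hypothetical round numbers rather than figures for any model in our database:

```python
# Worked example of the tokens-per-parameter definition above.
# All numbers are hypothetical, chosen only to illustrate the arithmetic.
dataset_size = 3e12      # 3 trillion unique tokens in the dataset
epochs = 1               # one pass over the data
active_params = 10e9     # 10 billion parameters activated per forward pass

training_tokens = dataset_size * epochs            # total training tokens
tokens_per_param = training_tokens / active_params
print(tokens_per_param)  # 300.0
```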
However, it is important to note that this trend may not hold for closed models, which include many current frontier models. We lack public data to estimate their token-to-parameter ratios.
Code for this analysis is available here.
Data
We use Epoch AI’s Notable Models database and pull relevant fields including publication date, number of parameters, estimated training compute, and training dataset size. To focus on models trained from scratch, we exclude non‑language systems as well as any fine‑tuned, continually trained, or distilled variants, since their token‑per‑parameter ratios reflect downstream adaptations rather than pre‑training dynamics.
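The filtering step might look like the sketch below, which assumes a CSV export with hypothetical column names rather than the database's actual schema:

```python
import pandas as pd

# Illustrative filtering step. The column names below ("domain", "training_type",
# etc.) are assumptions for this sketch, not the database's actual schema.
df = pd.read_csv("notable_models.csv")

# Keep language models trained from scratch: drop non-language systems and
# fine-tuned, continually trained, or distilled variants.
trained_from_scratch = ~df["training_type"].isin(["fine-tuned", "continual", "distilled"])
models = df.loc[(df["domain"] == "Language") & trained_from_scratch,
                ["publication_date", "parameters",
                 "training_compute_flop", "dataset_size_tokens"]]
```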
For transformer-based models that lack reported dataset sizes but have compute estimates (C) and active parameter counts (N) available, we infer their tokens-per-active-parameter ratio by rearranging the relation C = 6ND to:
\[ \frac{D}{N} = \frac{C}{6N^2}. \]
We then include these estimated values of tokens per parameter in our overall trend analysis.
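A minimal sketch of this imputation, using hypothetical compute and parameter values rather than entries from the database:

```python
import numpy as np

# Imputation sketch with hypothetical values of C (training compute, FLOP)
# and N (active parameter counts).
C = np.array([8.4e22, 1.0e24])
N = np.array([7.0e9, 7.0e10])

tokens_per_param = C / (6 * N**2)   # D/N = C / (6 N^2), from C = 6 N D
print(tokens_per_param)             # roughly [286, 34]
```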
Analysis
We fit an exponential growth model by performing a linear regression on log(tokens per parameter) against model publication date. The resulting linear fit is statistically significant and shows a positive correlation between tokens per active parameter and model publication date.
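The fit can be sketched as follows on synthetic data generated to resemble the observed trend; the values are illustrative and not drawn from the database:

```python
import numpy as np
from scipy import stats

# Synthetic data: 33 hypothetical models spread over 2022-2025 with a
# roughly 3x/year trend in tokens per parameter plus lognormal noise.
rng = np.random.default_rng(0)
dates = np.linspace(2022.0, 2025.5, 33)
tokens_per_param = 10 * 3.1 ** (dates - 2022.0) * rng.lognormal(0.0, 0.5, size=33)

# Linear regression of log(tokens per parameter) against publication date.
fit = stats.linregress(dates, np.log(tokens_per_param))
print(f"annual growth factor: {np.exp(fit.slope):.1f}x  (p = {fit.pvalue:.2g})")
```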
To generate confidence intervals, we used bootstrap sampling with replacement (500 resamples). For each bootstrap sample, we resampled the 33 observations with replacement, refit the exponential growth model, and collected the resulting slope estimates. We then took the 5th and 95th percentiles of the bootstrap slope distribution to construct a 90% confidence interval.
The annual growth factor derived from the bootstrap median is 3.1x per year, with a 90% confidence interval of [2.1x, 4.9x].
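A sketch of the bootstrap procedure on the same synthetic data as the fit above; the resulting interval illustrates the method, not the reported estimate:

```python
import numpy as np
from scipy import stats

# Same synthetic data as in the regression sketch (33 hypothetical models).
rng = np.random.default_rng(0)
dates = np.linspace(2022.0, 2025.5, 33)
log_tpp = np.log(10 * 3.1 ** (dates - 2022.0) * rng.lognormal(0.0, 0.5, size=33))

n_boot, n = 500, len(dates)
growth = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)                    # resample with replacement
    refit = stats.linregress(dates[idx], log_tpp[idx])  # refit on the resample
    growth[i] = np.exp(refit.slope)                     # annual growth factor

lo, hi = np.percentile(growth, [5, 95])                 # 90% confidence interval
print(f"median {np.median(growth):.1f}x, 90% CI [{lo:.1f}x, {hi:.1f}x]")
```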
Assumptions
The reliability of our findings is directly tied to the accuracy and completeness of the training dataset size, parameter count, and training compute estimates in our Notable Models database.
An important limitation of our analysis is the lack of data on closed models. Many do not disclose key details such as training compute, parameter counts, or cumulative training tokens, which prevents us from estimating their tokens‑per‑parameter ratios.