The size of datasets used to train language models doubles approximately every eight months

Across all domains of ML, models are being trained on more and more data. In language modeling, datasets are growing at a rate of about 3x per year. The largest models currently use datasets containing tens of trillions of words. The largest public datasets are roughly ten times larger than this; for example, Common Crawl contains hundreds of trillions of words before filtering.
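As a sanity check, the headline figure follows directly from the growth rate: a 3x-per-year trend implies a doubling time of ln(2)/ln(3) ≈ 0.63 years, or roughly 7.6 months, which rounds to the "approximately every eight months" stated above. A minimal sketch of the arithmetic:

```python
import math

# A 3x-per-year trend means size(t) = size(0) * 3**t, with t in years.
# The doubling time T solves 3**T = 2, i.e. T = log(2) / log(3).
annual_growth_factor = 3.0
doubling_time_years = math.log(2) / math.log(annual_growth_factor)
print(f"Doubling time: {doubling_time_years * 12:.1f} months")  # ~7.6 months
```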

Published: June 19, 2024
Last updated: February 21, 2025

Epoch’s work is free to use, distribute, and reproduce under the Creative Commons Attribution (CC BY) license, provided the source and authors are credited.