The size of datasets used to train language models doubles approximately every eight months

Across all domains of ML, models are being trained on ever more data. In language modeling, training datasets grow at roughly 3x per year. The largest models currently use datasets containing tens of trillions of words, while the largest public datasets are roughly ten times larger; Common Crawl, for example, contains hundreds of trillions of words before filtering.
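As a quick sanity check on the headline figure, a 3x-per-year growth rate can be converted into a doubling time; the sketch below does this arithmetic in Python (the 3x rate is from the text, and the eight-month doubling time follows from it):

```python
import math

# An annual growth factor g implies log2(g) doublings per year,
# so the doubling time in months is 12 / log2(g).
annual_growth_factor = 3.0
doublings_per_year = math.log(annual_growth_factor, 2)
doubling_time_months = 12 / doublings_per_year
print(round(doubling_time_months, 1))  # ~7.6 months, i.e. roughly eight months
```

This confirms that "3x per year" and "doubling approximately every eight months" describe the same growth rate.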

Published: June 19, 2024

Last updated: March 25, 2025