Data Insight
Jun. 19, 2024

The size of datasets used to train language models doubles approximately every six months

By Robi Rahman and David Owen

Across machine learning, models are trained on ever larger datasets. In language modeling, training datasets are growing at a rate of 3.7x per year, equivalent to doubling roughly every six months. The largest models currently use datasets containing tens of trillions of words. The largest public datasets are about ten times larger still: Common Crawl, for example, contains hundreds of trillions of words before filtering.
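A quick back-of-the-envelope sketch of how the 3.7x annual growth rate translates into the roughly six-month doubling time in the headline (the growth factor is the only input; the rest is just the standard doubling-time formula):

```python
import math

# Reported dataset growth rate for language models: 3.7x per year.
annual_growth = 3.7

# If size multiplies by `annual_growth` each year, it doubles every
# log(2) / log(annual_growth) years.
doubling_time_years = math.log(2) / math.log(annual_growth)
doubling_time_months = 12 * doubling_time_years

print(round(doubling_time_months, 1))  # ~6.4 months
```

So a 3.7x-per-year growth rate corresponds to a doubling time of about 6.4 months, consistent with "approximately every six months."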

Epoch's work is free to use, distribute, and reproduce under the Creative Commons Attribution (CC BY) license, provided the source and authors are credited.
