We fit log-linear models against time for each of hardware FLOP/s, hardware quantity, and training time. To estimate confidence intervals, we bootstrap over our models with replacement (n=500 resamples). Our results are as follows:
**Estimated trend in training compute and its underlying drivers (growth per year)**

| Variable | 10th percentile | Median | 90th percentile |
|---|---|---|---|
| Hardware FLOP/s | 1.36x | 1.41x | 1.48x |
| Hardware quantity | 1.50x | 1.69x | 1.91x |
| Training time | 1.37x | 1.53x | 1.71x |
| Combined product* | 3.58x | 4.27x | 4.93x |
| Training compute | 3.74x | 4.17x | 4.62x |
\* We multiply the three preceding variables for models where all three are recorded, then fit an exponential trend to the resulting product series.
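A minimal sketch of the fitting procedure described above. The data here are hypothetical (12 models growing roughly 1.4x per year in hardware FLOP/s); the function names and noise model are our own, not taken from the original analysis code:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_growth_factor(years, values):
    # OLS fit of log10(value) against year; exponentiating the slope
    # gives the multiplicative growth factor per year
    slope, _ = np.polyfit(years, np.log10(values), 1)
    return 10.0 ** slope

def bootstrap_growth_factors(years, values, n_boot=500):
    # Resample models with replacement and refit (n=500, as in the text)
    n = len(years)
    draws = rng.choice(n, size=(n_boot, n), replace=True)
    return np.array([fit_growth_factor(years[i], values[i]) for i in draws])

# Hypothetical data: ~1.4x/year growth in hardware FLOP/s, with lognormal noise
years = np.linspace(2018, 2024, 12)
flops = 1e14 * 1.4 ** (years - 2018) * 10 ** rng.normal(0, 0.05, size=12)

factors = bootstrap_growth_factors(years, flops)
p10, med, p90 = np.percentile(factors, [10, 50, 90])
```

The bootstrap percentiles of the fitted growth factor correspond to the 10th-percentile, median, and 90th-percentile columns in the table.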
The product of the individual median slopes (roughly 3.6x per year) is somewhat lower than the overall compute trend (4.17x per year); this gap results from different patterns of missing data in each variable. However, we find that the combined product trend aligns well with a direct estimate of the trend on recorded training compute. We statistically test the difference between the two slopes across bootstrap samples, finding no statistically significant difference (90% CI: -0.4 to 0.7).
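The slope-difference test can be sketched as follows. The bootstrap arrays here are simulated stand-ins (in the real analysis, both trends are refit on each bootstrap sample); their centers are chosen to match the reported medians of 4.27x and 4.17x:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-ins for the two bootstrap distributions of annual
# growth factors (combined product vs. directly fitted training compute)
combined = rng.normal(4.27, 0.4, size=500)
direct = rng.normal(4.17, 0.3, size=500)

# Difference of slopes on each bootstrap draw; if the 90% interval
# contains zero, the difference is not significant at that level
diffs = combined - direct
lo, hi = np.percentile(diffs, [5, 95])
significant = not (lo <= 0.0 <= hi)
```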
After obtaining estimates of slopes for each component, we calculate the multiplicative contributions of each trend to compute scaling as the logarithm of each trend’s slope divided by the sum of the logarithms of all slopes. This is done for each bootstrap sample, producing the percentages shown in the plot.
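Using the median slopes from the table, the contribution calculation above works out as follows (per-bootstrap-sample in the real analysis; shown here once on the medians):

```python
import numpy as np

# Median annual growth factors from the table above
slopes = {
    "hardware_flops": 1.41,
    "hardware_quantity": 1.69,
    "training_time": 1.53,
}

# Contribution of each factor = log(slope) / sum of log(slopes);
# the base of the logarithm cancels, so any base gives the same shares
log_slopes = {k: np.log(v) for k, v in slopes.items()}
total = sum(log_slopes.values())
contributions = {k: v / total for k, v in log_slopes.items()}
# hardware quantity contributes ~41%, training time ~33%, FLOP/s ~27%
```

The shares sum to one by construction, since the log of the combined growth factor is the sum of the individual log-slopes.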
In addition to the three main factors identified above, we consider hardware utilization as a potential fourth category. Only a small subset of our models have hardware utilization estimates (n=22); among these models there is a slight positive trend of 1.1x per year (90% CI: 1.0 - 1.2). As a factor driving growth, this would contribute 7% (90% CI: 0% - 12%). Due to the small sample size and the modest magnitude of this trend, we omit potential contributions from hardware utilization from this analysis.
We next run a sensitivity analysis, perturbing N in our selection of the top N models. We find that estimates of the slope of training compute scaling, as well as the slopes and relative contributions of each underlying trend, remain relatively stable for N = {5, 10, 15}, with no statistically significant differences. The most notable difference appears in the top-5 subset, where hardware quantity has a relatively larger contribution than in the top-10 subset (median: 0.48 vs. 0.40).
Finally, we test the impact of different choices of starting year for our analysis. Overall, results are highly similar when using initial years between 2017 and 2019; the most significant change is that hardware FLOP/s grows somewhat faster in the 2017-present data (1.6x per year, vs. 1.4x from 2018 onward).
Code for all analysis is available at this Colab notebook.