Distributed Training Simulator

FLOP Utilization Rates of Training Runs

How does utilization scale with the size of the training run?

Configurations

We obtain higher utilization rates than would be observed in practice for several reasons: we omit attention layers from the architecture, we assume that hardware in the training cluster never fails and that all operations are completed synchronously, and we make some further simplifying assumptions.
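As a concrete illustration of the idealized accounting these assumptions imply, the sketch below computes utilization as useful training FLOP divided by the FLOP a failure-free, fully synchronous cluster could deliver in the same wall-clock time. The function name and all numbers are hypothetical and are not outputs of the simulator.

def flop_utilization(
    useful_flop: float,         # FLOP actually spent on the forward/backward passes
    num_gpus: int,              # GPUs in the (assumed failure-free) cluster
    peak_flops_per_gpu: float,  # peak throughput of one GPU, in FLOP/s
    compute_time_s: float,      # time spent on computation
    comm_time_s: float,         # time spent on (synchronous) communication
) -> float:
    # Utilization = useful FLOP / (peak cluster FLOP/s * wall-clock time).
    # With fully synchronous operations, wall-clock time is simply compute
    # time plus communication time; overlapping the two would raise this.
    wall_clock_s = compute_time_s + comm_time_s
    peak_cluster_flops = num_gpus * peak_flops_per_gpu
    return useful_flop / (peak_cluster_flops * wall_clock_s)

# Illustrative numbers only: 1,000 GPUs at 3e14 FLOP/s peak, 80 s of compute
# and 20 s of communication while performing 2.4e19 useful FLOP.
print(flop_utilization(2.4e19, 1_000, 3e14, 80.0, 20.0))  # -> 0.8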

How are training runs optimally parallelized?

Parallelism strategies
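As a rough illustration of what "optimally parallelized" can mean, the sketch below brute-forces every (data, tensor, pipeline) split of a fixed GPU count and keeps the one with the smallest estimated step time. The cost model estimate_step_time and its coefficients are invented for illustration and are not the simulator's actual model.

from itertools import product
import math

def estimate_step_time(data: int, tensor: int, pipeline: int) -> float:
    # Toy cost model (all coefficients invented): splitting the model shrinks
    # per-GPU compute but adds communication and pipeline-bubble overhead.
    compute = 1.0 / (tensor * pipeline)
    tensor_comm = 0.03 * (tensor - 1)        # activation all-reduces
    pipeline_bubble = 0.05 * (pipeline - 1)  # pipeline fill/drain time
    grad_sync = 0.01 * math.log2(data)       # gradient all-reduce across replicas
    return compute + tensor_comm + pipeline_bubble + grad_sync

def best_parallelism(num_gpus: int):
    # Enumerate every (data, tensor, pipeline) factorization of num_gpus and
    # keep the split with the smallest estimated step time.
    best, best_time = None, float("inf")
    for tensor, pipeline in product(range(1, num_gpus + 1), repeat=2):
        if num_gpus % (tensor * pipeline):
            continue
        data = num_gpus // (tensor * pipeline)
        step_time = estimate_step_time(data, tensor, pipeline)
        if step_time < best_time:
            best, best_time = (data, tensor, pipeline), step_time
    return best, best_time

print(best_parallelism(64))  # e.g. ((8, 4, 2), 0.295) under this toy model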

Maximum feasible training compute

What is the largest training run that can feasibly be carried out at appreciable levels of utilization?
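One hypothetical way to frame this question in code: given a utilization threshold, bisect for the largest training compute at which simulated utilization still clears it. simulate_utilization, its decay curve, and the 10% threshold in the example are placeholders, not the simulator's behavior.

import math

def simulate_utilization(training_flop: float) -> float:
    # Placeholder for the simulator: utilization that decays as the run
    # grows past roughly 1e26 FLOP (shape and constants are invented).
    return 0.5 / (1.0 + (training_flop / 1e26) ** 0.5)

def max_feasible_compute(threshold: float, lo: float = 1e24, hi: float = 1e32) -> float:
    # Bisect in log space for the largest compute whose utilization still
    # clears the threshold, assuming utilization falls monotonically.
    lo_log, hi_log = math.log10(lo), math.log10(hi)
    for _ in range(60):
        mid_log = 0.5 * (lo_log + hi_log)
        if simulate_utilization(10 ** mid_log) >= threshold:
            lo_log = mid_log   # still feasible: search higher
        else:
            hi_log = mid_log   # infeasible: search lower
    return 10 ** lo_log

print(f"{max_feasible_compute(threshold=0.10):.2e} FLOP")  # ~1.6e27 under this placeholder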

[Interactive table: maximum feasible training compute by configuration, GPU model, and type]
