Distributed Training Simulator
FLOP Utilization Rates of Training Runs
How does utilization scale with the size of the training run?
Configurations
We obtain higher utilization rates than would be observed in practice for several reasons: we omit attention layers from the architecture, we assume that hardware in the training cluster never fails and that all operations complete synchronously, and we make some further simplifying assumptions.
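As an illustration of what the utilization figure measures, here is a minimal Python sketch of an idealized model FLOP utilization (MFU) calculation; the function name and the example numbers are assumptions for exposition, not values taken from the simulator.

```python
def ideal_mfu(model_flop_per_step: float,
              step_time_seconds: float,
              num_gpus: int,
              peak_flop_per_gpu: float) -> float:
    """Idealized MFU: useful model FLOP per second divided by peak hardware FLOP per second.

    Assumes no hardware failures and fully synchronous execution, so the step
    time includes communication stalls but no failure or straggler overhead.
    """
    achieved_flop_per_second = model_flop_per_step / step_time_seconds
    peak_flop_per_second = num_gpus * peak_flop_per_gpu
    return achieved_flop_per_second / peak_flop_per_second


# Hypothetical example: a 1e17-FLOP step taking 1 s on 256 GPUs rated at
# 1e15 FLOP/s each gives an MFU of roughly 0.39 (39%).
print(ideal_mfu(1e17, 1.0, 256, 1e15))
```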
How are training runs optimally parallelized?
Parallelism strategies
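One way to picture how an "optimal" parallelization might be chosen, assuming the cluster is split into data-, tensor-, and pipeline-parallel groups, is a brute-force search over factorizations of the GPU count against a step-time cost model. The sketch below is illustrative only; the cost model is a user-supplied placeholder, not the simulator's own.

```python
from itertools import product


def candidate_splits(num_gpus: int):
    """Yield (data, tensor, pipeline) parallel degrees whose product is num_gpus."""
    for dp, tp in product(range(1, num_gpus + 1), repeat=2):
        if num_gpus % (dp * tp) == 0:
            yield dp, tp, num_gpus // (dp * tp)


def best_split(num_gpus: int, step_time_estimate):
    """Return the split minimizing a user-supplied cost model.

    step_time_estimate(dp, tp, pp) -> estimated seconds per training step.
    """
    return min(candidate_splits(num_gpus),
               key=lambda split: step_time_estimate(*split))


# Usage with a toy (hypothetical) cost model that penalizes heavy tensor
# and pipeline parallelism more than data parallelism:
print(best_split(64, lambda dp, tp, pp: 1.0 / dp + 0.3 * tp + 0.2 * pp))
```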
Maximum feasible training compute
What is the largest training run that is feasible at appreciable levels of utilization?
| Configuration | GPU | Model Type | Maximum feasible training compute |
|---|---|---|---|
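For the last column, a back-of-the-envelope sketch: maximum feasible training compute scales as GPU count × peak per-GPU throughput × achieved utilization × training time. The helper below and its example numbers are assumptions for illustration, not outputs of the simulator.

```python
def max_training_compute(num_gpus: int,
                         peak_flop_per_gpu: float,
                         utilization: float,
                         training_months: float) -> float:
    """Total training compute (FLOP) under a constant achieved utilization."""
    seconds = training_months * 30 * 24 * 3600  # approximate month length
    return num_gpus * peak_flop_per_gpu * utilization * seconds


# Hypothetical example: 100,000 GPUs at 1e15 FLOP/s peak, 40% utilization,
# trained for 3 months gives roughly 3e26 FLOP.
print(max_training_compute(100_000, 1e15, 0.4, 3))
```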
Settings
Learn more
Data Movement Bottlenecks to Large-Scale Model Training: Scaling Past 1e28 FLOP
Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a “latency wall” at 2e31 FLOP. We may hit these in ~3 years. Aggressive batch size scaling could potentially overcome these limits.
Introducing the Distributed Training Interactive Simulator
We introduce an interactive tool that simulates distributed training runs of large language models under ideal conditions.