Trends in Machine Learning Hardware
FLOP/s performance in 47 ML hardware accelerators doubled every 2.3 years. Switching from FP32 to tensor-FP16 led to a further 10x performance increase. Memory capacity and bandwidth doubled every 4 years.
Resources
This report was originally published on Nov 09, 2023. For the latest research and updates on this subject, please see: Data on Machine Learning Hardware.
Executive summary
We study the computational performance of GPUs across different number representations, as well as memory capacity, memory bandwidth, and interconnect bandwidth, using a dataset of 47 ML accelerators (GPUs and other AI chips) commonly used in ML experiments from 2010 to 2023, plus 1,948 additional GPUs released from 2006 to 2021. Our main findings are:
- Lower-precision number formats like 16-bit floating point (FP16) and 8-bit integer (INT8), combined with specialized tensor core units, can provide order-of-magnitude performance improvements for machine learning workloads compared to the traditionally used 32-bit floating point (FP32). For example, based on a limited number of data points, we estimate that tensor-FP16 provides roughly a 10x speedup over FP32.
- Since the overall performance of large hardware clusters for state-of-the-art ML training and inference depends on factors beyond computational performance alone, we also investigate memory capacity, memory bandwidth, and interconnect bandwidth, and find that:
  - Memory capacity doubles every ~4 years and memory bandwidth every ~4.1 years. Both have grown more slowly than computational performance, which doubles every ~2.3 years, a gap commonly described as the memory wall.
  - The latest ML hardware often comes with proprietary chip-to-chip interconnects (e.g., NVIDIA's NVLink or the ICI used in Google's TPUs) that offer higher communication bandwidth between chips than PCI Express (PCIe). For example, NVLink in the H100 supports about 7x the bandwidth of PCIe 5.0.
- Key hardware performance metrics and their improvement rates found in the analysis include: computational performance [FLOP/s] doubling every 2.3 years for both ML and general GPUs; computational price-performance [FLOP per $] doubling every 2.1 years for ML GPUs and 2.5 years for general GPUs; and energy efficiency [FLOP/s per Watt] doubling every 3.0 years for ML GPUs and 2.7 years for general GPUs.
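To relate the different ways growth rates are reported below (doubling times, 10x times, and orders of magnitude per year), the following minimal sketch shows the conversion; it is an illustration we add here, not part of the original analysis:

```python
import math

def ooms_per_year(doubling_time_years: float) -> float:
    """Convert a doubling time into orders of magnitude (OOMs) of growth per year."""
    return math.log10(2) / doubling_time_years

def years_per_10x(doubling_time_years: float) -> float:
    """Convert a doubling time into the time required for a 10x increase."""
    return doubling_time_years / math.log10(2)

# Example: FLOP/s (FP32) doubles every ~2.3 years.
print(ooms_per_year(2.3))   # ~0.13 OOMs per year
print(years_per_10x(2.3))   # ~7.6 years per 10x
```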
| Metric | Specification and unit | Growth rate | Datapoint of highest performance | N |
|---|---|---|---|---|
| Computational performance | FLOP/s (FP32) | 2x every 2.3 [2.1; 2.6] years; 10x every 7.7 [6.9; 8.6] years; 0.13 [0.15; 0.12] OOMs per year | ~90 TFLOP/s (~9e13 FLOP/s), NVIDIA L40 | 45 |
| Computational performance | FLOP/s (tensor-FP32) | NA1 | ~495 TFLOP/s (~4.95e14 FLOP/s), NVIDIA H100 SXM | 7 |
| Computational performance | FLOP/s (tensor-FP16) | NA | ~990 TFLOP/s (~9.9e14 FLOP/s), NVIDIA H100 SXM | 8 |
| Computational performance | OP/s (INT8) | NA | ~1980 TOP/s (~1.98e15 OP/s), NVIDIA H100 SXM | 10 |
| Computational price-performance | FLOP per $ (FP32) | 2x every 2.1 [1.6; 2.91] years; 10x every 7 [5; 9] years; 0.14 [0.18; 0.10] OOMs per year | ~4.2 exaFLOP per $ (~4.2e18 FLOP per $), AMD Radeon RX 7900 XTX | 33 |
| Computational energy-efficiency | FLOP/s per Watt (FP32) | 2x every 3.0 [2.7; 3.3] years; 10x every 10 [9; 11] years; 0.10 [0.11; 0.09] OOMs per year | ~302 GFLOP/s per W (~3e11 FLOP/s per W), NVIDIA L40 | 43 |
| Memory capacity | DRAM capacity (Byte) | 2x every 4 [3; 6] years; 10x every 13 [10; 19] years; 0.08 [0.10; 0.05] OOMs per year | ~128 GB (~1.28e11 B), AMD Radeon Instinct MI250X | 47 |
| Memory bandwidth | DRAM bandwidth (Byte/s) | 2x every 4 [3; 5] years; 10x every 14 [11; 17] years; 0.07 [0.09; 0.06] OOMs per year | ~3.3 TB/s (~3.3e12 B/s), NVIDIA H100 SXM | 47 |
| Interconnect bandwidth | Chip-to-chip communication bandwidth (Byte/s) | NA | ~900 GB/s (~9e11 B/s), NVIDIA H100 | 45 |
Introduction
Advances in machine learning over the last decade have in large part been the result of scaling up the amount of computational resources (compute) used for training (Sevilla et al., 2022), and advancements in hardware performance have played a modest role in this progress. Increased investment in ML R&D (Cottier, 2023) has led to scaled-up hardware infrastructure, moving from a small number of chips to massive supercomputers.
This article provides an overview of trends in computational performance across a variety of number precisions and specialized components, such as tensor cores. Furthermore, we analyze additional performance factors such as memory capacity, memory bandwidth, and interconnect bandwidth. Overall, we want to provide a holistic picture of all ML hardware specifications and components that jointly determine practical hardware performance, especially in the era of large ML models.
Throughout this work, we compare hardware using peak performance figures for each metric, which we source from the specification sheets published by hardware producers.2 Typically, only a fraction of the specified peak computational performance is utilized in practice, depending on the workload and on constraints imposed by other specifications, such as memory capacity and bandwidth. For example, according to Leland et al., 2016, utilization for common supercomputing workloads might be between 5% and 20%, while in ML training it might range between 20% and 70%, depending on the size of the model, how it is parallelized, and other factors (Sevilla et al., 2022). Nevertheless, peak performance serves as a useful upper bound and a standard basis for comparison across different hardware accelerators and generations.
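As a rough illustration of what such utilization figures mean, the sketch below estimates achieved FLOP/s and utilization from a training run's throughput, using the common ~6 FLOP per parameter per token approximation; all concrete numbers are hypothetical and not drawn from our dataset:

```python
# Hypothetical example: estimating hardware utilization during ML training.
peak_flops_per_chip = 9.9e14   # peak tensor-FP16 performance of one accelerator, FLOP/s
n_chips = 1024                 # accelerators in the training cluster
n_params = 70e9                # model parameters
tokens_per_second = 1.0e6      # measured training throughput of the whole cluster

# Common approximation: training requires ~6 FLOP per parameter per token.
achieved_flops = 6 * n_params * tokens_per_second

utilization = achieved_flops / (peak_flops_per_chip * n_chips)
print(f"Utilization: {utilization:.0%}")  # ~41% with these made-up numbers
```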
Terminology
- Number representation: We differentiate number representation along three dimensions (see the sketch after this list):
  - Bit-length/precision: the number of bits used to store a number, typically ranging from 4 to 64 bits.
  - Number format: a specific layout of bits, e.g., integer or floating point. The number format typically includes the bit-length, as in FP32; however, we treat the bit layout and the bit-length as separate dimensions in this piece.3
  - Computation unit: whether a dedicated matrix multiplication unit is used. In this piece, we only differentiate between tensor and non-tensor.
- Hardware accelerator: refers to a chip that accelerates ML workloads, e.g., GPU or TPU. We use the terms chip and hardware accelerator interchangeably as general terms and GPU and TPU when referring to the specialized accelerator.
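To make these three dimensions concrete, here is a small illustrative encoding of our own (not part of the report's dataset or terminology beyond what is defined above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NumberRepresentation:
    """A number representation, split along the three dimensions defined above."""
    bit_length: int      # bits used to store one value, typically 4 to 64
    number_format: str   # bit layout, e.g. "floating point" or "integer"
    tensor_unit: bool    # whether a dedicated matrix multiplication unit is used

# Illustrative examples, named as in this piece:
FP32 = NumberRepresentation(32, "floating point", tensor_unit=False)
TENSOR_FP16 = NumberRepresentation(16, "floating point", tensor_unit=True)
INT8 = NumberRepresentation(8, "integer", tensor_unit=True)
```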
Dataset
We compiled hardware specifications from two key datasets. The first includes 1,948 GPUs released between 2006 and 2021, based on Sun et al., 2019; we refer to it as the general GPU dataset, as it consists primarily of general-purpose GPUs (general GPUs) not commonly used in ML training. The second includes 47 ML hardware accelerators released since 2010, such as NVIDIA GPUs and Google TPUs, that were commonly used in notable ML experiments (as defined in Sevilla et al., 2022). We curated the latter dataset ourselves and refer to it as the ML hardware dataset or, in short, the ML dataset (ML GPUs). This dataset is publicly available in our datasheet.
Trends of primary performance metrics
In this section, we present trends for different number representations, memory capabilities, computational price-performance, and energy efficiency. For each metric, we explain its relevance for ML development and deployment, present our findings, and briefly discuss their implications.
Number representations
The numeric representation used for calculations strongly influences computational performance. More specifically, the number of bits per value determines arithmetic density (operations per chip area per second).4 In recent years, hardware manufacturers have introduced specialized lower-precision number formats for ML applications. While FP64 has been common in high-performance computing,5 FP32 performance has been the focus of most consumer applications for the last 15 or so years.
Number formats with less precision have become more prevalent in recent years, since low precision is sufficient for both developing and deploying ML models (Dettmers et al., 2022; Gupta et al., 2015; Courbariaux et al., 2014). According to Rodriguez, 2020, FP32 remains the most widely adopted number format for both ML training and inference today, with industry increasingly transitioning to lower-precision formats such as the 16-bit standard floating-point format FP16 and Google's bfloat16 (BF16) for certain training and inference tasks, as well as the integer format INT8 for select inference workloads.6 Other notable emerging number formats include the 4-bit integer format INT4 and the NVIDIA-developed 19-bit floating-point format TF32.7
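As a minimal, self-contained illustration of the precision and range trade-offs involved (our own example using NumPy's FP32 and FP16 types, not taken from the report):

```python
import numpy as np

# Fewer bits means coarser rounding: FP32 keeps ~7 significant decimal digits,
# FP16 only ~3.
x = np.float32(3.14159265)
y = np.float16(3.14159265)
print(x, y)  # 3.1415927 3.14

# Dynamic range also shrinks: FP16 overflows far earlier than FP32.
print(np.finfo(np.float32).max)  # ~3.4e38
print(np.finfo(np.float16).max)  # 65504.0
```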
Computational performance for FP32 and FP16
Historically, the computational performance trend for FP32 precision has been remarkably regular over nearly two decades, with a doubling time of 2.3 years, closely in line with the rate associated with Moore's Law. Over the last few years, and especially since 2016, hardware with dedicated support for FP16 precision has emerged, increasing absolute computational performance thanks to the reduced bit length.
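For readers who want to reproduce doubling-time estimates of this kind from the public datasheet, a minimal sketch of the approach is a least-squares fit of log10(FLOP/s) against release year; the data points below are placeholders, not values from our dataset:

```python
import numpy as np

# Placeholder (release year, peak FP32 FLOP/s) pairs; substitute rows from the datasheet.
years = np.array([2010.0, 2013.0, 2016.0, 2019.0, 2022.0])
flops = np.array([1.0e12, 3.0e12, 1.0e13, 3.0e13, 1.0e14])

# Fit log10(FLOP/s) = slope * year + intercept; the slope is growth in OOMs per year.
slope, intercept = np.polyfit(years, np.log10(flops), deg=1)

doubling_time_years = np.log10(2) / slope
print(f"{slope:.2f} OOMs per year, doubling every {doubling_time_years:.1f} years")
```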