
Understanding GPU Architecture and Memory: From GDDR6 to HBM3 and Beyond


Overview of GPU Architecture and Functionality

GPUs (Graphics Processing Units) are essential components in modern computing, particularly for gaming and artificial intelligence. The term dates back to the 1990s, when Sony used it for the graphics chip in the original PlayStation; NVIDIA later refined and popularized the technology and became the industry leader. GPUs are built for massive parallelism, which is exactly what rendering complex graphics and accelerating computationally intensive workloads, such as training large language models (LLMs), require.

GPU Design and Components

1. GPU Memory Module — The VRAM

Unlike CPUs, which rely on system RAM, GPUs use dedicated video RAM (VRAM) to keep data close to the compute units. VRAM such as GDDR6 (6th-generation Graphics Double Data Rate memory) is soldered onto the graphics card's printed circuit board (PCB) very close to the GPU die, and this proximity speeds up data transfer. A GDDR6 chip exposes a 32-pin interface split across 2 channels; at roughly 16 Gbit/s per pin, each 16-pin channel delivers about 32 GB/s.

1.1 What is DRAM?

DRAM (Dynamic Random Access Memory) is the underlying technology for both CPU RAM and GPU VRAM. Each bit is stored as charge on a capacitor paired with a transistor; because the charge leaks away, refresh circuitry must periodically rewrite it, hence the term "dynamic." DDR5 (Double Data Rate 5) is the current standard for CPU system memory, offering high performance and the low latency that general-purpose computing needs.

1.2 What is SGRAM?

SGRAM (Synchronous Graphics RAM) is designed specifically for graphics cards, with GDDR6 as the current standard. DDR and GDDR share a common origin but diverge in their design goals: GDDR6 prioritizes high throughput over low latency, which suits large volumes of data processed in parallel. Conceptually, it is a freight train optimized for volume rather than a bullet train optimized for quick, frequent trips.

1.3 GDDR VRAM Explained in Detail

GDDR memory chips connect directly to the GPU through the memory interface, and each pin is effectively a wire contributing to the overall bandwidth. GDDR6 has 32 pins across 2 channels and performs 8 data transfers per clock cycle on each pin (double data rate combined with quad pumping), which again works out to roughly 32 GB/s per channel, bandwidth that is crucial for fast data processing.

1.4 Calculating GPU Memory Bandwidth Intuitively

Memory bandwidth is the maximum rate at which data can move between the GPU and its VRAM. It can be estimated with the formula Bandwidth = Clock * Bus Width * Data Rate. For instance, a 1750 MHz memory clock with a 128-bit bus and 8 transfers per clock gives 1750 MHz * 128 bits * 8 ≈ 1.8 Tbit/s, or roughly 224 GB/s. A "boost clock" raises the clock, and therefore the effective bandwidth, further when thermal and power conditions allow.

Advanced Memory Technologies: HBM

1.5 What is HBM VRAM in a GPU?

HBM (High-Bandwidth Memory) offers far greater bandwidth than GDDR6: a stack exposes 1024 pins spread across 8 channels. Even at about 2 Gbit/s per pin, that works out to roughly 256 GB/s per stack, well above a single GDDR6 channel. HBM gets there with a much wider bus and a 2.5D architecture in which memory dies are stacked vertically and linked by Through-Silicon Vias (TSVs), a design that reduces latency and power consumption while increasing performance. HBM dominates data-center GPUs, particularly for AI workloads such as the models behind ChatGPT, and is produced by leaders like SK Hynix, Samsung, and Micron.
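To make this arithmetic concrete, here is a minimal Python sketch of the Bandwidth = Clock * Bus Width * Data Rate formula applied to the figures quoted above. The clock values used for the GDDR6 channel and the HBM stack are illustrative assumptions consistent with the per-pin rates in the text, not the specification of any particular card.

```python
def memory_bandwidth_gb_s(clock_mhz: float, bus_width_bits: int, transfers_per_clock: int) -> float:
    """Peak bandwidth in GB/s: clock * bus width * data rate, converted from bits to bytes."""
    bits_per_second = clock_mhz * 1e6 * bus_width_bits * transfers_per_clock
    return bits_per_second / 8 / 1e9

# GDDR6 example from section 1.4: 1750 MHz clock, 128-bit bus, 8 transfers per clock.
print(memory_bandwidth_gb_s(1750, 128, 8))   # ~224 GB/s

# One 16-pin GDDR6 channel at 16 Gbit/s per pin (assuming a 2000 MHz clock with 8 transfers).
print(memory_bandwidth_gb_s(2000, 16, 8))    # ~32 GB/s per channel

# One HBM stack: 1024-bit bus at 2 Gbit/s per pin (assuming a 1000 MHz clock with 2 transfers).
print(memory_bandwidth_gb_s(1000, 1024, 2))  # ~256 GB/s per stack
```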
Cooling Mechanisms

Higher clock speeds and intense parallel processing generate significant heat, so effective cooling is essential. Common approaches include fans, liquid cooling, and advanced heat-sink designs.

GPU Computation Cores

2.1 CUDA Core vs. Tensor Core

GPUs contain thousands of computational cores, in contrast to the handful found in CPUs. NVIDIA GPUs organize them into Streaming Multiprocessors (SMs), each containing CUDA cores and Tensor cores. CUDA cores handle regular arithmetic, executing roughly one operation per clock cycle. Tensor cores, introduced with the V100 GPU, specialize in matrix multiplication, the workhorse of deep learning: they multiply 4x4 FP16 matrices and accumulate the result into an FP32 output matrix, which significantly accelerates mixed-precision training (a small numerical sketch of this operation follows section 2.2).

2.2 GPU Operations — A FLOP Show

GPU performance is measured in TeraFLOP/s (trillions of floating-point operations per second). Matrix operations consist mostly of multiplications and additions, which hardware fuses into a single Fused Multiply-Add (FMA) instruction that counts as two floating-point operations. Knowing how many FMAs an SM can perform per clock, the number of SMs, and the clock rate is enough to estimate the peak FLOP/s, as the sketch below shows.
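As a worked example of that peak-FLOP/s arithmetic, the sketch below multiplies FMAs per SM per clock by the SM count and the clock rate, counting each FMA as two floating-point operations. The A100-class figures used here (108 SMs, 64 FP32 FMA units per SM, a 1.41 GHz boost clock) are an illustrative assumption rather than numbers taken from this article.

```python
def peak_tflops(num_sms: int, fma_per_sm_per_clock: int, clock_ghz: float) -> float:
    """Peak throughput in TFLOP/s; each fused multiply-add counts as 2 floating-point operations."""
    flops_per_second = num_sms * fma_per_sm_per_clock * 2 * clock_ghz * 1e9
    return flops_per_second / 1e12

# Illustrative A100-class FP32 figures: 108 SMs, 64 FP32 FMAs per SM per clock, 1.41 GHz boost clock.
print(peak_tflops(num_sms=108, fma_per_sm_per_clock=64, clock_ghz=1.41))  # ~19.5 TFLOP/s
```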
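To illustrate the mixed-precision multiply-accumulate that section 2.1 attributes to Tensor cores, here is a small NumPy sketch: FP16 inputs are multiplied and the result is accumulated into an FP32 matrix. This is only a CPU-side emulation of the numerics, not how Tensor cores are actually programmed.

```python
import numpy as np

# 4x4 FP16 inputs, as in the Tensor core operation D = A @ B + C described above.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4)).astype(np.float16)
b = rng.standard_normal((4, 4)).astype(np.float16)
c = rng.standard_normal((4, 4)).astype(np.float32)  # FP32 accumulator

# Upcast the FP16 inputs and accumulate the products in FP32,
# emulating the mixed-precision behavior used in training.
d = a.astype(np.float32) @ b.astype(np.float32) + c
print(d.dtype)  # float32
```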
LLM Operations in a GPU

Large language models (LLMs) are memory-intensive rather than compute-bound. The challenge lies in efficiently moving data between VRAM and the on-chip shared memory (SRAM) used by the GPU's processing cores. During both training and inference, data must constantly shuttle between these memory levels because the compute cores need frequent access to it. HBM's faster transfer rates and wider bus streamline this movement, which is why it has become indispensable for large-scale AI deployments.

Linking GPUs for LLM Training

3.1 Generic Concepts on Linking Processors

In data centers, servers are connected through Network Interface Cards (NICs) using RDMA (Remote Direct Memory Access), which lets one machine access another's memory without involving the remote CPU. RoCE (RDMA over Converged Ethernet) carries RDMA traffic over standard Ethernet networks. Switches, including Top-of-Rack (ToR) and spine switches, form the network backbone that keeps this communication efficient.

3.2 Linking GPUs via Proprietary Technology like NVLink

NVIDIA's NVLink provides high-speed, direct GPU-to-GPU connections, while NVSwitch extends that connectivity across large clusters. NVLink was introduced with the P100 GPU and has since evolved to connect up to 256 H100 GPUs. Each H100 attaches to NVSwitch3 chips through 18 NVLink 4.0 links, providing the ultra-high bandwidth needed to train trillion-parameter models.

3.3 Linking GPUs via RoCE in a Rail-Optimized Topology

For even larger clusters, Meta uses a rail-optimized topology to link over 100,000 H100 GPUs. Each of the 8 GPUs in a DGX server is assigned an index, and GPUs sharing the same index across servers are connected via RDMA to a dedicated rail switch. Spine switches handle inter-rail communication, minimizing latency and maximizing efficiency. Recent research suggests that spine switches may be unnecessary for certain AI training workloads, which would further reduce energy consumption.

3.4 Linking GPUs via RoCE in a Rail-Only Topology

In a rail-only topology, GPUs are grouped into high-bandwidth (HB) domains in which they communicate over NVLink. These domains are then interconnected purely through the rail switches, removing the spine layer entirely. This streamlines data transfer and reduces both complexity and energy requirements for massive GPU clusters.

Evaluation by Industry Insiders

Industry experts highlight the critical role of HBM in meeting the vast memory demands of AI models: its higher bandwidth and lower latency compared to GDDR6 make it indispensable in the data center. Companies like NVIDIA and Meta continue to push GPU performance and efficiency, with particular focus on network topologies and cooling. The energy consumption of large GPU clusters remains a significant concern, prompting exploration of nuclear power and other sustainable energy sources. NVIDIA's dominance in the GPU market stems partly from its early investment and continuous improvement in GPU architecture; the introduction of Tensor cores and the evolution of NVLink have solidified its position in AI and high-performance computing. SK Hynix, Samsung, and Micron remain pivotal to HBM production and thus to the rapid build-out of AI infrastructure.

This summary captures the essentials of GPU architecture, covering memory types, cooling mechanisms, computational cores, and network topologies, and explains how GPUs support modern computing tasks, particularly in AI and gaming.
