AMD Unveils CDNA 4: Boosting Matrix Multiplication and Maintaining Vector Performance Lead Over Nvidia
AMD recently unveiled its latest compute-oriented GPU architecture, CDNA 4, a moderate update over the previous CDNA 3. The new architecture is designed to improve AMD's matrix multiplication performance with lower-precision data types, which are crucial for machine learning workloads, while maintaining and building on AMD's existing strengths in vector operations and general compute.

The core of CDNA 4 is its chiplet design, a strategy that has proven highly effective in AMD's CPU products. CDNA 4 adopts a similar setup, with eight Accelerator Complex Dies (XCDs) sitting atop four base dies that together implement 256 MB of memory-side cache. AMD's Infinity Fabric provides coherent memory access across the system, allowing seamless communication between the chiplets. This approach is more aggressive than Nvidia's recent shift toward multi-die strategies, which breaks from its traditional monolithic designs.

Compared to the CDNA 3-based MI300X, the CDNA 4-powered MI355X has a slightly reduced Compute Unit (CU) count per XCD and more disabled CUs to improve yields. The new GPU compensates with higher clock speeds, maintaining a competitive edge in overall throughput. In contrast to Nvidia's B200, both the MI355X and MI300X are larger GPUs built from a greater number of basic building blocks. However, Nvidia's B200, through its mature software ecosystem and tensor cores, can often match or come close to AMD's performance in specific machine learning tasks.

One of the most significant improvements in CDNA 4 is the enhanced Local Data Share (LDS), a software-managed scratchpad shared by a workgroup's threads. LDS capacity has grown from 64 KB to 160 KB per CU, and read bandwidth has doubled to 256 bytes per clock. These changes let kernels keep more data close to the execution units, improving data locality and reducing memory latency.
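The capacity increase translates directly into occupancy headroom. A minimal back-of-the-envelope sketch, assuming LDS is the only limiting resource (real occupancy also depends on register use and wavefront slots, which this ignores):

```python
# LDS-limited occupancy per Compute Unit.
# Figures from the article: 64 KB of LDS per CU on CDNA 3, 160 KB on CDNA 4.
# The 16 KB per-workgroup allocation is just an illustrative choice.

def lds_limited_workgroups(lds_capacity_kb: int, alloc_per_workgroup_kb: int) -> int:
    """How many workgroups fit on a CU if LDS is the only limiting resource."""
    return lds_capacity_kb // alloc_per_workgroup_kb

cdna3 = lds_limited_workgroups(64, 16)   # 4 workgroups per CU
cdna4 = lds_limited_workgroups(160, 16)  # 10 workgroups per CU
print(f"CDNA 3: {cdna3} workgroups/CU, CDNA 4: {cdna4} workgroups/CU")
```

The same arithmetic explains why a kernel with a heavy LDS footprint no longer starves the CU of parallel workgroups to hide latency with.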
Kernels can now allocate more LDS without sacrificing as much occupancy, enabling more efficient use of resources. For instance, a kernel that allocates 16 KB of LDS can run ten workgroups per CU on CDNA 4, compared to four on CDNA 3.

In addition to the capacity increase, AMD introduced read-with-transpose LDS instructions, which transpose matrix tiles as they are read, turning inefficient strided column accesses into contiguous row accesses. This feature is particularly beneficial for matrix multiplication, where data layout can significantly affect performance.

Even with these enhancements, CDNA 4 CUs still have less data storage within the GPU cores than Nvidia's Blackwell SMs. Each Blackwell SM offers a 256 KB block of storage that can be partitioned between L1 cache and Shared Memory. While AMD has 40 MB of LDS capacity across the GPU, Nvidia's B200 can allocate up to 33 MB of Shared Memory at the 228 KB setting, while still retaining 28 KB per SM for L1 caching.

The MI355X's memory subsystem has also been upgraded, integrating HBM3E for greater bandwidth and capacity. The move brings total memory bandwidth to 8 TB/s and capacity to 288 GB, surpassing Nvidia's B200, which tops out at 7.7 TB/s and 180 GB. The extra bandwidth relieves pressure on the memory system, especially when large datasets are involved, and improves the compute-to-bandwidth balance from 0.03 bytes/FLOP on the MI300X to 0.05 bytes/FLOP on the MI355X, though it still trails Blackwell's 0.10 bytes/FLOP.

Overall, CDNA 4 represents a strategic refinement rather than a radical departure from CDNA 3, much as Zen 4 made incremental improvements to Zen 3's already successful formula.
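The layout problem that read-with-transpose instructions target is easy to demonstrate. A short NumPy sketch, illustrative only: it models memory strides on the host, not the actual LDS hardware behavior:

```python
import numpy as np

# In a row-major matrix, a row is contiguous in memory but a column is strided.
# Matrix multiplication consumes rows of one operand and columns of the other,
# so one operand is naturally read with a poor access pattern. CDNA 4's
# read-with-transpose LDS instructions do this reshuffle in hardware; here we
# just show the stride difference that makes it worthwhile.

a = np.arange(16, dtype=np.float32).reshape(4, 4)  # row-major by default

row_elem_step = a.strides[1]  # bytes between consecutive elements in a row: 4
col_elem_step = a.strides[0]  # bytes between consecutive elements in a column: 16
print(f"row step: {row_elem_step} B, column step: {col_elem_step} B")

# After an explicit transpose, the old columns become contiguous rows.
at = np.ascontiguousarray(a.T)
print(at.strides)  # (16, 4): columns of `a` are now unit-stride rows
```

Reading down a column jumps a full row length per element, while the transposed copy makes those same values unit-stride, which is exactly the access pattern wide vector loads want.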
By optimizing matrix multiplication throughput and expanding memory bandwidth and capacity, AMD aims to solidify its position in high-performance computing and machine learning workloads. The new architecture also reflects AMD's confidence in the CDNA 3 foundation, as evidenced by the MI300A's top ranking on the TOP500 supercomputer list. This gradual evolution looks prudent given that track record.

While Nvidia remains a formidable competitor, especially in low-precision matrix operations, AMD's higher core count and clock speeds give it a significant edge in vector compute. With CDNA 4, AMD balances its resources across a broader spectrum of computational demands, making the architecture a versatile choice for both general compute and specialized machine learning tasks. The aggressive chiplet design and enhanced memory features suggest AMD is committed to leveraging its strengths in a rapidly evolving GPU market.