Quartet: Native FP4 Training Can Be Optimal for Large Language Models

The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, improving both computational throughput and energy efficiency. In particular, NVIDIA's recent Blackwell architecture supports extremely low-precision operations, specifically FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training in which all major computations (e.g., in linear layers) are performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify Quartet as a "near-optimal" low-precision training technique in terms of accuracy versus computation. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it achieves state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
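To make the FP4 setting concrete, the sketch below illustrates generic round-to-nearest quantization onto the FP4 (E2M1) value grid with a per-block scale. This is only an illustration of the number format, not Quartet's actual training recipe or kernel implementation; the block size of 32 and the max-abs scaling rule are assumptions made for this example.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantize_fp4_blockwise(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Round-to-nearest quantization of a 1-D tensor onto the FP4 (E2M1) grid,
    with one shared scale per block of `block_size` elements (an illustrative
    choice; hardware scaling formats and Quartet's scheme may differ)."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Scale each block so its largest magnitude maps to the largest FP4 value (6.0).
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 6.0, 1.0)
    scaled = blocks / scale

    # Round each scaled magnitude to the nearest E2M1 grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_E2M1_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * FP4_E2M1_GRID[idx]

    # Rescale back and drop any padding.
    return (quantized * scale).reshape(-1)[: len(x)]


if __name__ == "__main__":
    w = np.random.randn(1024).astype(np.float32)
    w_q = quantize_fp4_blockwise(w)
    print("mean squared quantization error:", np.mean((w - w_q) ** 2))
```

In an FP4 training pipeline, a quantizer of this kind would be applied to the inputs of matrix multiplications (weights, activations, and gradients) so that the hardware can execute the products in 4-bit arithmetic; the paper's contribution lies in how these quantization and optimization choices are made so that accuracy is preserved end to end.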