Quartet: Native FP4 Training Can Be Optimal for Large Language Models

The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, improving both computational throughput and energy efficiency. In particular, NVIDIA's recent Blackwell architecture supports extremely low-precision operations, specifically FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training in which all major computations (e.g., in linear layers) are performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify Quartet as a "near-optimal" low-precision training technique in terms of accuracy versus computation. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it achieves state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
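To make the FP4 setting concrete, the sketch below illustrates generic round-to-nearest quantization onto the FP4 (E2M1) value grid with a per-block scale. This is only an illustration of the number format, not Quartet's actual training recipe or kernel implementation; the block size of 32 and the max-abs scaling rule are assumptions made for this example.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def quantize_fp4_blockwise(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Round-to-nearest quantization of a 1-D tensor onto the FP4 (E2M1) grid,
    with one shared scale per block of `block_size` elements (an illustrative
    choice; hardware scaling formats and Quartet's scheme may differ)."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Scale each block so its largest magnitude maps to the largest FP4 value (6.0).
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 6.0, 1.0)
    scaled = blocks / scale

    # Round each scaled magnitude to the nearest E2M1 grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_E2M1_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * FP4_E2M1_GRID[idx]

    # Rescale back and drop any padding.
    return (quantized * scale).reshape(-1)[: len(x)]


if __name__ == "__main__":
    w = np.random.randn(1024).astype(np.float32)
    w_q = quantize_fp4_blockwise(w)
    print("mean squared quantization error:", np.mean((w - w_q) ** 2))
```

In an FP4 training pipeline, a quantizer of this kind would be applied to the inputs of matrix multiplications (weights, activations, and gradients) so that the hardware can execute the products in 4-bit arithmetic; the paper's contribution lies in how these quantization and optimization choices are made so that accuracy is preserved end to end.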