
Scaling Law for Quantization-Aware Training

Chen, Mengzhao; Zhang, Chaoyi; Liu, Jing; Zeng, Yutao; Xue, Zeyue; Liu, Zhiheng; Li, Yunshui; Ma, Jin; Huang, Jie; Zhou, Xun; Luo, Ping
Publication date: May 22, 2025
Abstract

Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.
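
To make the "quantization group size" factor in the abstract concrete, the following is a minimal sketch (not the paper's implementation) of simulated symmetric group-wise 4-bit quantization, using mean-squared reconstruction error as a simple proxy for quantization error. The function name `groupwise_quant_error`, the symmetric rounding scheme, and the example group sizes are illustrative assumptions.

```python
import numpy as np

def groupwise_quant_error(x: np.ndarray, n_bits: int = 4, group_size: int = 128) -> float:
    """Simulated (fake) symmetric group-wise quantization error.

    Flattens `x`, splits it into groups of `group_size`, quantizes each
    group to `n_bits` with a per-group scale, dequantizes, and returns
    the mean-squared error as a proxy for quantization error.
    """
    flat = x.reshape(-1)
    pad = (-len(flat)) % group_size
    if pad:
        flat = np.concatenate([flat, np.zeros(pad, dtype=flat.dtype)])
    groups = flat.reshape(-1, group_size)

    qmax = 2 ** (n_bits - 1) - 1                      # 7 for 4-bit symmetric
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    dq = q * scale                                    # dequantized values

    return float(np.mean((groups - dq) ** 2))

# Coarser granularity (larger group_size) shares one scale across more
# values, so outliers inflate the scale and the error grows -- the
# qualitative trend described in the abstract.
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)
for g in (32, 128, 512):
    print(f"group_size={g}: mse={groupwise_quant_error(w, n_bits=4, group_size=g):.6f}")
```

The same error proxy can be computed separately for weights and for activations captured at a layer such as FC2, which mirrors the abstract's decomposition of W4A4 error into weight and activation components.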