Comprehensive Guide to Quantisation Methods for Large Language Models
Quantisation is the process of reducing the numerical precision used to represent a model's weights and activations, for example storing them as 8-bit integers instead of 32-bit floating-point numbers. This transformation significantly reduces model size, lowers memory usage, and accelerates inference, often with minimal impact on accuracy. For large language models (LLMs), which can contain billions of parameters, quantisation is essential for deployment on devices with limited computational resources, such as mobile phones, edge devices, or consumer-grade GPUs.

To illustrate, consider a model weight stored as a 32-bit float (4 bytes). Quantising it to an 8-bit integer (1 byte) shrinks the model by 75%, while inference speed improves thanks to reduced memory bandwidth and faster integer arithmetic. This efficiency gain is critical for real-world applications where latency, power consumption, and hardware constraints matter. Over time, researchers have developed multiple quantisation strategies, broadly categorised into two main types: post-training and training-time methods.

Post-Training Quantisation (PTQ) is the simplest and most widely used approach: a pre-trained model is quantised without any further training. The process typically includes calibrating the model on a small set of representative input data to determine the scaling factors and zero points for each layer. PTQ is fast and cheap to apply, making it ideal when retraining is impractical or a model must be deployed quickly. However, because no fine-tuning occurs after quantisation, accuracy loss can be more pronounced, especially in complex or sensitive layers. Several advanced PTQ techniques mitigate this. For example, Layer-wise Quantisation adjusts quantisation parameters per layer based on the distribution of its weights and activations.
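The calibration step described above, deriving a scaling factor and zero point from observed value ranges and then mapping floats to 8-bit integers, can be sketched in a few lines. This is a minimal illustration of the asymmetric affine scheme assuming simple min/max calibration; the function names are illustrative, not from any particular library:

```python
import numpy as np

def calibrate(x, num_bits=8):
    """Derive scale and zero point from the observed min/max of x
    (asymmetric affine scheme, unsigned integer range)."""
    qmin, qmax = 0, 2**num_bits - 1
    x_min = min(float(x.min()), 0.0)  # keep 0.0 exactly representable
    x_max = max(float(x.max()), 0.0)
    scale = max((x_max - x_min) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantise(x, scale, zero_point, num_bits=8):
    """Map float values to integers: q = round(x / scale) + zero_point."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2**num_bits - 1).astype(np.uint8)

def dequantise(q, scale, zero_point):
    """Recover an approximation of the original floats."""
    return (q.astype(np.float32) - zero_point) * scale

# Example: one weight tensor drops from 4 bytes to 1 byte per value,
# and the reconstruction error is bounded by the quantisation step.
w = np.random.randn(4, 4).astype(np.float32)
s, z = calibrate(w)
w_int8 = quantise(w, s, z)
w_approx = dequantise(w_int8, s, z)
```

In a full PTQ pipeline the same calibration would be run per layer (or per channel) over a batch of representative inputs, rather than on a single tensor as shown here.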
Dynamic Quantisation quantises weights ahead of time but computes quantisation parameters for activations on the fly for each input, avoiding the need for a calibration dataset at the cost of some runtime overhead. Mixed Precision Quantisation goes further by assigning the most accurate representation (e.g., 16-bit) to the most sensitive layers while using lower precision elsewhere.

Quantisation-Aware Training (QAT) takes a different approach. Instead of quantising after training, QAT simulates quantisation during the training process, allowing the model to learn to compensate for quantisation errors and retain more accuracy after conversion. QAT typically inserts quantisation and dequantisation ("fake quantisation") operations into the model graph during training, so the network adapts to the reduced precision. While more accurate than PTQ, QAT requires access to the original training data and retraining, making it more resource-intensive.

Beyond these two main categories, newer methods have been developed to balance performance, efficiency, and accuracy. For instance, SmoothQuant quantises both weights and activations in a way that preserves model behavior across layers, especially in models with imbalanced activation distributions. Another method, Quantisation with Feedback (QF), uses a feedback loop to iteratively refine quantisation parameters based on model output, improving robustness. Additionally, emerging techniques such as Group-wise Quantisation, and Low-Rank Adaptation (LoRA) used in conjunction with quantisation, allow even more efficient fine-tuning of quantised models. These approaches are particularly useful when a model must be adapted to a specific task without full retraining.

In practice, the choice of quantisation method depends on the use case. For deployment on edge devices or mobile applications, PTQ is often preferred for its speed and simplicity. For high-accuracy requirements in server environments, QAT or hybrid approaches may be better.
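The fake-quantisation operation that QAT inserts into the model graph can be sketched as a quantise-then-dequantise round trip, so that downstream computation sees the same rounding error real int8 inference would introduce. This is a minimal sketch assuming a symmetric per-tensor scheme; the function names are illustrative, and a real QAT setup would implement this inside a training framework, using a straight-through estimator so gradients flow through the rounding step:

```python
import numpy as np

def fake_quant(x, num_bits=8):
    """Quantise then immediately dequantise x (symmetric per-tensor
    scheme), so the caller sees int8-style rounding error while the
    tensor stays in floating point. In QAT, backpropagation treats
    this op as the identity (straight-through estimator)."""
    qmax = 2**(num_bits - 1) - 1          # e.g. 127 for 8 bits
    scale = max(float(np.max(np.abs(x))), 1e-8) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def qat_linear(x, w):
    """Toy linear layer forward pass with fake-quantised weights.
    (A full QAT setup would fake-quantise activations as well.)"""
    return x @ fake_quant(w)
```

Because the output of `fake_quant` is still a float tensor, the rest of training proceeds unchanged; only after training are the weights converted to true integers for deployment.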
The trend is toward smarter, adaptive quantisation strategies that dynamically adjust precision based on context, model layer, or input data. Ultimately, quantisation remains a cornerstone of making LLMs practical and accessible. As hardware evolves and model complexity grows, the development of more sophisticated, accurate, and efficient quantisation methods will continue to play a vital role in the advancement of AI.
