
Model Quantization

Quantization is a technique for reducing the size, memory footprint, and computational requirements of a neural network model. A common approach, weight quantization, converts the weights (and sometimes the activations) of a network from high-precision floating point numbers to a lower-precision format, such as 16-bit or 8-bit integers. Converting a model's weights from a standard floating point data type (e.g., 32-bit float) to a lower-precision data type (e.g., 8-bit integer) shrinks the model, lowers its memory requirements, and speeds up inference by reducing computational complexity. This makes large models (e.g., LLMs) easier to deploy on edge devices with limited compute and memory resources.
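As a concrete illustration, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in NumPy. The helper names (quantize_int8, dequantize) and the choice of the [-127, 127] target range are illustrative assumptions, not any particular library's API:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization of FP32 weights to INT8 (illustrative)."""
    # The scale maps the largest absolute weight onto the INT8 range [-127, 127].
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights from the INT8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max quantization error:", np.abs(w - w_hat).max())
```

Each INT8 weight occupies one byte instead of four, at the cost of a small rounding error bounded by half the scale.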

Floating Point Representation

Among the various data types, floating point numbers are the most widely used in deep learning because they can represent a wide range of values with high precision. A floating point number is typically stored using n bits, which are divided into three components (decoded in the sketch after this list):

  1. Sign: The sign bit indicates whether the number is positive or negative. It uses one bit, where 0 indicates a positive number and 1 indicates a negative number.
  2. Exponent: The exponent is a group of bits representing the power to which the base (usually 2 in binary representation) is raised. The exponent can be positive or negative, allowing the number to represent very large or very small values.
  3. Significand/mantissa: The remaining bits store the significand, also known as the mantissa, which holds the significant digits of the number. The precision of the number depends largely on the length of the significand.
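The following small sketch uses Python's standard struct module to unpack these three fields from an FP32 value (1 sign bit, 8 exponent bits, 23 significand bits); the fp32_fields helper is an illustrative name, not a library function:

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Split an FP32 value into its 1 sign, 8 exponent, and 23 mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                # 1 bit: 0 = positive, 1 = negative
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits of the significand's fraction
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_fields(-6.5)
# -6.5 = -1.625 * 2^2, so the unbiased exponent is 2 (stored as 2 + 127 = 129)
print(sign, exponent - 127, hex(mantissa))  # -> 1 2 0x500000
```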

Some of the most commonly used data types in deep learning are float32 (FP32) and float16 (FP16):

FP32 is often referred to as “full precision” (4 bytes per value), while FP16 is referred to as “half precision” (2 bytes). The INT8 data type goes further, storing each weight in a single byte: an 8-bit representation can encode 2⁸ = 256 distinct values.
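To make the size difference concrete, the following sketch compares the per-weight storage of the three formats with NumPy (the naive INT8 cast here is only for the size comparison, not an accurate quantization scheme):

```python
import numpy as np

w32 = np.random.randn(1_000_000).astype(np.float32)            # "full precision"
w16 = w32.astype(np.float16)                                    # "half precision"
w8 = np.round(w32 / np.abs(w32).max() * 127).astype(np.int8)   # naive INT8 cast

for name, arr in [("FP32", w32), ("FP16", w16), ("INT8", w8)]:
    print(f"{name}: {arr.itemsize} byte(s)/weight, {arr.nbytes / 1e6:.1f} MB total")
# FP32: 4 bytes -> 4.0 MB, FP16: 2 bytes -> 2.0 MB, INT8: 1 byte -> 1.0 MB
```

For a million weights, moving from FP32 to INT8 cuts storage by a factor of four.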
