A brief introduction to Machine Learning models quantization
25 May 2023 by David Corvoysier
Even before the development of Large Language Models (LLMs), the increasing memory and computing requirements of Deep Neural Networks (DNNs) have been a concern.
Functionally, DNNs are graphs of arithmetic operations: the inputs are fed at the stem and the chain of operations produces the outputs at the head.
From an implementation perspective, the operations are performed on floating point numbers, which are a digital representation of decimal numbers composed of a mantissa and an exponent:
\[x = mantissa . 2^{exponent}\]
The 32-bit floating point representation is the most common, as it can represent numbers in a range that is sufficient for most operations. The float32 mantissa is composed of 24 bits (including the sign), and the exponent is 8 bits.
Each operation performed at an operating node in the inference device requires its inputs to be transferred from either a static memory location or the previous processing nodes.
The cost of these transfers adds up with the cost of the operations themselves.
The DNN terminology for operation data is “weights” for static inputs and “activations” for dynamic inputs/outputs.
Note: the outputs of an operation are designated as “activations” even if the operation is not actually an activation function.
The process of representing the n-bit weights and activations of a DNN with a smaller number of bits is called quantization^{1}.
It is typically used in DNNs to “quantize” float32 weights and activations into 8-bit integers.
This brings several benefits:
- reducing the weights to 8-bit requires 4 times less memory on the device to store them,
- reducing the activations to 8-bit reduces the amount of data exchanged between nodes, which impacts latency,
- using 8-bit instead of 32-bit inputs for an operation improves vectorization (multiple data processed at the same time for a single operation),
- all standard integer arithmetic operations but the division are faster than their floating point counterparts,
- GPU devices may include specific mechanisms to process 8-bit inputs (like NVIDIA's 8-bit Tensor Cores).
A mathematical formulation of linear quantization
The most widespread type of quantization is the linear or affine quantization scheme first introduced in TensorFlow Lite^{2}.
The representation of a linearly quantized number is composed of:
- an integer mantissa,
- a float scale,
- an integer zero-point.
The scale is used to project the integer numbers back into a float representation.
The zero-point corresponds to the value that zero takes in the target representation.
The represented float value is then:
\[x = (mantissa - zeropoint) . scale\]
If we compare that formula with the floating point representation, we can immediately see that each floating point number can be represented exactly with the same mantissa, a scale equal to $2^{exponent}$ and a null zero-point.
Of course this representation would be very inefficient, because it would require two integers and a float to represent each number.
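As a minimal sketch of the representation described above, the following Python projects a (mantissa, scale, zero-point) tuple back to a float, and rewrites a float as such a tuple with a power-of-two scale. The function names `dequantize` and `float_as_linear` are illustrative and do not come from any library, and the second one assumes the value fits in a 24-bit mantissa.

```python
import math

def dequantize(mantissa: int, scale: float, zeropoint: int) -> float:
    """Project an integer mantissa back into a float value."""
    return (mantissa - zeropoint) * scale

def float_as_linear(x: float, mantissa_bits: int = 24):
    """Rewrite x as (mantissa, scale, zeropoint) with a power-of-two scale.

    Assumes x is exactly representable with `mantissa_bits` bits of mantissa;
    otherwise int() truncates the extra bits.
    """
    m, e = math.frexp(x)                    # x == m * 2**e, with 0.5 <= |m| < 1
    mantissa = int(m * 2 ** mantissa_bits)  # integer mantissa
    scale = 2.0 ** (e - mantissa_bits)      # scale derived from the exponent
    return mantissa, scale, 0               # null zero-point

print(dequantize(14, 0.5, 10))              # (14 - 10) * 0.5
mantissa, scale, zeropoint = float_as_linear(0.15625)
print(dequantize(mantissa, scale, zeropoint) == 0.15625)
```

Storing three numbers per value is exactly the inefficiency mentioned above: the tuple only becomes compact once the scale and zero-point are shared by many values.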
Applicability of quantization to Machine Learning
When quantizing Machine Learning models, one can take advantage of the fact that training produces weights and activations that stay within reasonably stable ranges for a given operation.
This comes from several empirical techniques used to improve convergence:
- weights initialization^{3},
- weights and/or activation regularization^{4},
- explicit normalization layers^{5}.
This means that the weights and activations tensors for a specific operation can be represented using the same scale and zero-point, thus leading to a very compact representation.
Note: this is why quantization is often categorized as a form of compression, although unlike most compression techniques, it produces numbers that can be directly used for arithmetic operations.
There are various subtypes of quantization.
The first two subtypes are related to the dimensions of the scale and zero-point:
- per-tensor quantization uses a single scalar value for scale and zero-point for a whole tensor of weights or activations,
- per-axis quantization uses a vector of scales and zero-points whose length corresponds to a single axis of the tensor (typically the channels or embeddings axis).
The next two subtypes are related to the symmetry of the resulting quantized numbers:
- symmetric quantization assumes that the quantization range is symmetric, which leads to a zero-point equal to zero and a signed integer representation of the values,
- asymmetric quantization does not assume anything, and the zero-point is typically non-null.
Weights are typically quantized symmetrically per-axis.
Activations are typically quantized asymmetrically, most of the time per-tensor.
Quantizing a float tensor
The first step to quantize a float tensor is to choose the quantization range, i.e. the minimum and maximum float values one wants to represent: $[Min, Max]$.
Since the weights are constant tensors, they are typically quantized using the minimum and maximum values of the tensor, globally or along the channel axis.
Evaluating the quantization range of the activations is more difficult, as they depend on the inputs of the previous operation. Their range is therefore evaluated globally inside a model, as explained in the next section.
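The two ways of choosing the range of a weight tensor can be sketched as follows. This is only an illustration on nested Python lists: the helper names are made up for the example, and the channels are assumed to be on the last axis.

```python
def tensor_range(weights):
    """Single (Min, Max) pair for the whole tensor (per-tensor)."""
    flat = [w for row in weights for w in row]
    return min(flat), max(flat)

def channel_ranges(weights):
    """One (Min, Max) pair per column, i.e. along the last axis (per-axis)."""
    return [(min(col), max(col)) for col in zip(*weights)]

# Toy 2x3 weight tensor: 3 output channels with very different magnitudes.
weights = [[0.5, -2.0, 0.1],
           [-0.25, 4.0, -0.05]]
print(tensor_range(weights))
print(channel_ranges(weights))
```

In this toy tensor the first and last channels span a much narrower range than the tensor as a whole, which is why a per-axis range leaves them more integer levels to work with.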
For a target bit width of $n$ for the mantissa, one evaluates the scale as:
\[scale = \frac{Max - Min}{2^n - 1}\]
The zero-point is then deduced from the scale to make sure that $Min$ is mapped to the lowest integer value and $Max$ to the highest integer value.
This leads to the following formulas for signed/unsigned representations:
- unsigned: $zeropoint = round(\frac{-Min}{scale})$
- signed: $zeropoint = round(\frac{-Min}{scale}) - 2^{n - 1}$
The quantization of a float tensor is then:
\[mantissa = saturate(round(\frac{x}{scale}) + zeropoint)\]
Again, the saturation depends on the signedness of the target representation:
- unsigned: $[0, 2^n - 1]$,
- signed: $[-2^{n - 1}, 2^{n - 1} - 1]$.
Note that the zero-point always has the same signedness as the mantissa.
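The formulas of this section can be put together in a few lines of Python. This is a sketch rather than a reference implementation: the function names are illustrative, the range is assumed to be non-empty ($Min < Max$), and Python's built-in `round` breaks ties towards the even integer, a detail the formulas above leave open.

```python
def quantization_params(min_val, max_val, n=8, signed=False):
    """Evaluate (scale, zeropoint) for the float range [min_val, max_val]."""
    scale = (max_val - min_val) / (2 ** n - 1)
    zeropoint = round(-min_val / scale)
    if signed:
        zeropoint -= 2 ** (n - 1)
    return scale, zeropoint

def quantize(values, scale, zeropoint, n=8, signed=False):
    """mantissa = saturate(round(x / scale) + zeropoint) for each x."""
    if signed:
        low, high = -(2 ** (n - 1)), 2 ** (n - 1) - 1
    else:
        low, high = 0, 2 ** n - 1
    return [min(max(round(x / scale) + zeropoint, low), high) for x in values]

values = [-1.0, 0.0, 0.5, 1.55]

# Unsigned 8-bit: Min is mapped to 0 and Max to 255.
scale, zp = quantization_params(min(values), max(values))
print(zp, quantize(values, scale, zp))

# Signed 8-bit: Min is mapped to -128 and Max to 127.
scale_s, zp_s = quantization_params(min(values), max(values), signed=True)
print(zp_s, quantize(values, scale_s, zp_s, signed=True))
```

Values outside the chosen range are not rejected: they are simply saturated to the lowest or highest integer.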
Quantizing a Machine Learning Model
As mentioned before, a Machine Learning model uses two types of tensors: weights and activations.
The static weights need to be quantized only once, each weight tensor producing three new static tensors for the mantissa, scale and zero-point.
Since weights can contain positive and negative values, they are typically quantized into int8.
[Diagram: the constant float32 Weights tensor is split into three constant tensors: Weights (int8), scale (float32) and zeropoint (int8).]
The dynamic activations on the other hand need to be quantized on-the-fly by inserting the quantization operations in the graph:
- evaluate the quantization range,
- quantize.
The evaluation of the quantization range is costly because it requires a full scan of the activations tensor, which is a bottleneck for parallel processing.
For that reason, the activations quantization ranges are often evaluated before the inference on a selected number of samples: this is called the calibration of the quantized model.
Note: the operations that clip their outputs like the bounded ReLU are an exception and don’t require an explicit calibration, since the exact range of their outputs is known in advance.
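A calibration pass can be sketched as below. The text does not say which statistic is collected, so this example assumes the simplest strategy, a running minimum and maximum over the calibration samples; `calibrate` and `toy_layer` are hypothetical names, the latter standing in for an operation of the float model.

```python
def calibrate(operation, samples):
    """Return the (Min, Max) range of the outputs of `operation` over samples."""
    min_val, max_val = float("inf"), float("-inf")
    for sample in samples:
        outputs = operation(sample)
        min_val = min(min_val, min(outputs))
        max_val = max(max_val, max(outputs))
    return min_val, max_val

# Hypothetical operation standing in for a layer of the float model.
def toy_layer(inputs):
    return [2.0 * x - 1.0 for x in inputs]

calibration_samples = [[0.0, 0.5], [1.0, 2.0]]
print(calibrate(toy_layer, calibration_samples))
```

The resulting range is then frozen and used to derive a constant scale and zero-point for that activation, so that no scan is needed at inference time.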
After calibration, each activation float variable is mapped to an integer variable and two static tensors.
[Diagram: the variable float32 Activations tensor is mapped to a variable (u)int8 Activations tensor and two constant tensors: scale (float32) and zeropoint ((u)int8).]
Note: the activations can be quantized into either int8 or uint8. It is simpler to quantize them to uint8 if they correspond to the output of a ReLU operation, since the zero-point will in that case be 0.
Conceptually, the resulting graph is a clone of the original graph where all compatible operations are replaced by a version that operates on tuples of (mantissa, scale, zeropoint).
Separating the constant and variable tensors, this leads to the following graphs:
[Diagram, float graph: Inputs (float32, variable) and Weights (float32, constant) feed Matmul, which produces Outputs (float32, variable).]
[Diagram, quantized graph: QMatMul takes the Inputs tuple (mantissa (u)int8 variable, scale float32 constant, zeropoint (u)int8 constant), the Weights tuple (mantissa int8, scale float32, zeropoint int8, all constant) and the outputs scale (float32, constant) and zeropoint ((u)int8, constant), and produces Outputs ((u)int8, variable).]
Quantized linear operations
Most basic Machine Learning operations can be performed using integer arithmetic, which makes them compatible with linearly quantized inputs.
This does not mean however that one can just replace all floating point operations by an equivalent integer operation: the scale and zero-point of all weights and activations must be taken into account to produce an equivalent graph.
Also, there are two important restrictions with respect to the quantization of the inputs:
- additions between the integer mantissas of inputs can only be performed if they are in the same scale,
- operations that combine the integer mantissas of input channels can only be performed if the channels are in the same scale, i.e. if the inputs are quantized per-tensor.
Note: in another post I explain how it is possible to add two inputs quantized with different scales by adding an explicit alignment operation beforehand.
From an implementation perspective, operations accepting linearly quantized inputs are very specific to each device.
In the next section, I will detail a possible implementation of a quantized matrix multiplication.
Wrap-up example: a quantized matrix multiplication
Let’s consider a simple matrix multiplication of an $X(I, J)$ input by a $W(J, K)$ set of weights:
$Y = X.W$
Since the matrix multiplication multiplies all inputs along the dimension of length $J$ and adds them, $X$ cannot be quantized per-axis, because it would lead to the addition of quantized numbers that are not in the same scale.
There is no such restriction on $W$, since the filters along $K$ are all applied independently.
After quantization of the weights per-axis and calibration of the inputs per-tensor, we obtain:
$X \approx X_s * (X_q - X_{zp})$, with $X_s()$, $X_q(I, J)$, $X_{zp}()$
$W \approx W_s * (W_q - W_{zp})$, with $W_s(K)$, $W_q(J, K)$, $W_{zp}(K)$
We can also approximate the outputs per-axis, assuming that the next operation does not require per-tensor inputs.
$Y \approx Y_s * (Y_q - Y_{zp})$, with $Y_s(K)$, $Y_q(I, K)$, $Y_{zp}(K)$
The operation is summarized on the graph below (note that the intermediate integer output Y_q can be implicit):
[Diagram: QMatMul takes X_q, X_s, X_zp, W_q, W_s, W_zp, Y_s and Y_zp as inputs and produces Y through the intermediate (Y_q).]
Going through the graph step by step:
- evaluate the matrix multiplication of the quantized inputs to produce float outputs
$O = X_s * (X_q - X_{zp}) . W_s * (W_q - W_{zp})$
- quantize the float outputs to obtain 8-bit integer outputs
$Y_q = saturate(round(\frac{O}{Y_s}) + Y_{zp})$
- convert back the 8-bit integer outputs to float outputs
$Y \approx Y_s * (Y_q - Y_{zp})$
Since $X_s$ is a scalar, and $W_s$ has the same dimension as the last dimension of the outputs, the first operation can also be written:
$O = (X_s * W_s) * (X_q - X_{zp}) . (W_q - W_{zp})$
This means that the matrix multiplication can be operated equivalently on integer values, and the result is a quantized integer number with a scale corresponding to the product of the inputs and weights scales and a null zero-point.
The quantized sequence of operations is then to:
- evaluate the matrix multiplication of the 8-bit integer inputs to produce n-bit integer outputs
$O_q = (X_q - X_{zp}) . (W_q - W_{zp})$
- convert the n-bit integer outputs to float outputs
$O = (X_s * W_s) * O_q$
- quantize the float outputs to obtain 8-bit integer outputs
$Y_q = saturate(round(\frac{O}{Y_s}) + Y_{zp})$
- convert back the 8-bit integer outputs to float outputs
$Y \approx Y_s * (Y_q - Y_{zp})$
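The four steps of this sequence can be traced on a small example. The sketch below uses plain Python lists, assumes int8 outputs, and picks power-of-two scales so that every intermediate value is exact; the names `int_matmul`, `saturate` and `qmatmul` are illustrative and do not refer to any device API.

```python
def int_matmul(x_q, x_zp, w_q, w_zp):
    """O_q = (X_q - X_zp) . (W_q - W_zp) using integer arithmetic only."""
    return [[sum((x_q[i][j] - x_zp) * (w_q[j][k] - w_zp[k])
                 for j in range(len(w_q)))
             for k in range(len(w_q[0]))]
            for i in range(len(x_q))]

def saturate(value, low=-128, high=127):
    """Clip an integer to the int8 range."""
    return min(max(value, low), high)

def qmatmul(x_q, x_s, x_zp, w_q, w_s, w_zp, y_s, y_zp):
    # 1. integer matmul producing n-bit integer outputs
    o_q = int_matmul(x_q, x_zp, w_q, w_zp)
    # 2. convert the n-bit integer outputs to float outputs
    o = [[x_s * w_s[k] * v for k, v in enumerate(row)] for row in o_q]
    # 3. quantize the float outputs to 8-bit integers
    y_q = [[saturate(round(v / y_s[k]) + y_zp[k]) for k, v in enumerate(row)]
           for row in o]
    # 4. convert the 8-bit integer outputs back to float outputs
    y = [[y_s[k] * (v - y_zp[k]) for k, v in enumerate(row)] for row in y_q]
    return o_q, y_q, y

# Inputs are quantized per-tensor (scalar scale and zero-point),
# weights and outputs per-axis (one scale and zero-point per column).
o_q, y_q, y = qmatmul(
    x_q=[[4, 6], [2, 10]], x_s=0.5, x_zp=2,
    w_q=[[1, -2], [3, 4]], w_s=[0.25, 0.5], w_zp=[0, 0],
    y_s=[0.25, 0.5], y_zp=[-10, 5])
print(o_q, y_q, y)
```

With these particular scales the final floats match the float matrix multiplication exactly; with arbitrary scales they would only approximate it.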
The question that should immediately arise at this stage is: why do we need another quantization operation after the matrix multiplication, since we already have a quantized output?
The reason is simply the bit width of the outputs: we need an explicit quantization to make sure that the results of the integer matrix multiplication fit in 8 bits.
Note: when the operation is followed by a bias addition, the biases are typically quantized to 32-bit with a scale precisely equal to $X_s * W_s$ so that they can be added directly to the outputs before quantizing.
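The bias trick from the note can be illustrated as follows. This is a sketch under simplifying assumptions: the helper names are hypothetical, the zero-point of the bias is taken to be null, and no 32-bit saturation is applied since Python integers are unbounded.

```python
def quantize_bias(bias, x_s, w_s):
    """Quantize each bias with the scale x_s * w_s[k] of its output channel."""
    return [round(b / (x_s * w_s[k])) for k, b in enumerate(bias)]

def add_bias(o_q, bias_q):
    """Add the integer biases to the integer matmul outputs, channel by channel."""
    return [[v + bias_q[k] for k, v in enumerate(row)] for row in o_q]

x_s, w_s = 0.5, [0.25, 0.5]
o_q = [[14, 12], [24, 32]]  # example integer matmul outputs, in scale x_s * w_s
bias_q = quantize_bias([0.5, -1.0], x_s, w_s)
print(bias_q, add_bias(o_q, bias_q))
```

Because the biases share the scale of the integer outputs, the addition needs no alignment step.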
Going one step further and replacing $O$, since $Y_s$ has the same shape as $X_s * W_s$, we can fold the conversion to float into the quantization step and write directly:
- evaluate the matrix multiplication of the integer inputs to produce n-bit integer outputs
$O_q = (X_q - X_{zp}) . (W_q - W_{zp})$
- quantize the n-bit integer outputs to obtain 8-bit integer outputs
$Y_q = saturate(round(\frac{X_s * W_s}{Y_s} * O_q) + Y_{zp})$
- convert back the 8-bit integer outputs to float outputs
$Y \approx Y_s * (Y_q - Y_{zp})$
This reveals that we can directly ‘downscale’ the integer outputs of the operation with a folded scale $F_s = \frac{Y_s}{X_s * W_s}$.
The downscaling operation can be implemented as a float division and a round.
Note: I will detail in another post an implementation using only integer arithmetic.
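A float-based downscale along these lines could look like the sketch below. The function name `downscale` is illustrative, int8 outputs are assumed, and the scales are chosen as powers of two so that the divisions are exact.

```python
def downscale(o_q, f_s, y_zp, low=-128, high=127):
    """Y_q = saturate(round(O_q / F_s) + Y_zp), with per-axis F_s and Y_zp."""
    return [[min(max(round(v / f_s[k]) + y_zp[k], low), high)
             for k, v in enumerate(row)]
            for row in o_q]

x_s, w_s, y_s = 0.5, [0.25, 0.5], [0.25, 0.5]
# Folded scale F_s = Y_s / (X_s * W_s), one value per output channel.
f_s = [y_s[k] / (x_s * w_s[k]) for k in range(len(y_s))]

o_q = [[14, 12], [24, 32]]  # example n-bit integer matmul outputs
print(f_s, downscale(o_q, f_s, y_zp=[-10, 5]))

# Large integer outputs are saturated to the int8 bounds.
print(downscale([[1000, -1000]], f_s, y_zp=[0, 0]))
```

Since the folded scale is constant, it can be computed once when the model is quantized rather than at every inference.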
The simplified graph can be summarized below:
[Diagram: IntMatMul takes X_q, X_zp, W_q and W_zp; its output feeds Downscale, which also takes F_s and Y_zp and produces Y.]
This can be further simplified by removing the zero-points if we assume a symmetric quantization.
[Diagram: IntMatMul takes X_q and W_q; its output feeds Downscale, which also takes Y_s and produces Y.]
Note: the quantized matrix multiplication can be implemented in very different ways on devices that do not have an efficient implementation of the integer matrix multiplication.
References

1. Yunchao Gong, Liu Liu, Ming Yang, Lubomir Bourdev, “Compressing Deep Convolutional Networks using Vector Quantization”, arxiv, 2014.
2. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”, arxiv, 2017.
3. Stone Yun, Alexander Wong, “Where Should We Begin? A Low-Level Exploration of Weight Initialization Impact on Quantized Behaviour of Deep Neural Networks”, arxiv, 2020.
4. Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Stephen Gou, Phil Blunsom, Ahmet Üstün, Sara Hooker, “Intriguing Properties of Quantization at Scale”, arxiv, 2023.
5. Elaina Teresa Chai, “Analysis of quantization and normalization effects in deep neural networks”, stanford, 2021.