Resolve quantization scales after an operation

29 May 2023 by David Corvoysier

As explained in my introduction to Machine Learning quantization, the inputs, weights and outputs of a quantized operation are each quantized with a different scale.

In the same post, I explain how these scales can be folded into a single output scale, allowing the operation to be performed on the integer mantissa of the quantized inputs and weights:

$scale_{folded} = \frac{scale_{out}}{scale_{in} . scale_{w}}$
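
As a quick illustration, here is a minimal Python sketch of this folding, using purely hypothetical scale values:

```python
# Hypothetical per-tensor scales, for illustration only
scale_in = 0.05    # inputs scale
scale_w = 0.002    # weights scale
scale_out = 0.1    # outputs scale

# Fold the three scales into a single output scale
scale_folded = scale_out / (scale_in * scale_w)
print(scale_folded)  # ~1000.0 (up to float rounding)
```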

In another post, I explain how heterogeneous input scales can be converted to a fixed-point representation and aligned before the operation, resulting in yet another implicit scale, expressed as a power-of-two, that needs to be applied to the output scale.

In this post, I explain how these output scales can be applied using integer arithmetic only.

Reminder: how output scales are applied in a quantized graph

As a general principle, the last step of a quantized operation is a downscale to reduce the output bitwidth.

When applied to float outputs, the general formula for the downscale is:

$outputs_{uint8} = saturate(round(\frac{outputs_{float32}}{scale_{out}}) + zp_{out})$

For a quantized output of scale $scale_{out}$ and zero-point $zp_{out}$.
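
In Python, this float downscale can be sketched as follows (using NumPy; the function name and the uint8 target range are choices made for this example):

```python
import numpy as np

def downscale_float(outputs_float32, scale_out, zp_out):
    """Quantize float32 outputs to uint8 with an output scale and zero-point."""
    # Divide by the output scale, round to nearest, add the zero-point,
    # then saturate to the uint8 range [0, 255]
    q = np.round(outputs_float32 / scale_out) + zp_out
    return np.clip(q, 0, 255).astype(np.uint8)
```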

As explained in my quantization introduction, some compatible operations can be applied directly on the integer mantissa of the quantized inputs and weights, folding the inputs and weights scales into the output scale.

The downscale operation then becomes:

$outputs_{uint8} = saturate(round(\frac{outputs_{int32}}{scale_{folded}}) + zp_{out})$

with $scale_{folded} = \frac{scale_{out}}{scale_{in} . scale_{w}}$

This operation still requires a division and a rounding that are not easily implemented using integer arithmetic operators.
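
For reference, here is a sketch of this downscale applied to the int32 accumulators, still relying on float division and rounding (the function name is illustrative; the integer-only equivalent is derived below):

```python
import numpy as np

def downscale_reference(acc_int32, scale_in, scale_w, scale_out, zp_out):
    """Downscale int32 accumulators to uint8, still using float arithmetic."""
    # Fold the inputs, weights and outputs scales into a single scale
    scale_folded = scale_out / (scale_in * scale_w)
    # Divide, round, add the zero-point and saturate to the uint8 range
    q = np.round(acc_int32 / scale_folded) + zp_out
    return np.clip(q, 0, 255).astype(np.uint8)
```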

Use the fixed-point reciprocal of the folded scale to obtain rescaled fixed-point outputs

The idea is to convert the scale to a fixed-point representation, so that integer arithmetic operators can be used to obtain a fixed-point representation of the downscaled outputs.

Since the fixed-point division is a lossy operation, instead of dividing by the folded output scale, we can multiply by its reciprocal $\frac{1}{scale_{folded}}$.

The first step is to obtain a fixed-point representation of the reciprocal of the folded scale:

$rec_{folded} = \text{to\_fixed\_point}(\frac{scale_{in}.scale_{w}}{scale_{out}}) = rec_{int} . 2^{-fracbits_{rec}}$

You can refer to this fixed-point conversion algorithm for an example of how we can convert the scale to a fixed-point representation.
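
Below is a minimal sketch of such a conversion. It is not necessarily the exact algorithm from the linked post, and the 32-bit mantissa budget is an arbitrary choice:

```python
import numpy as np

def to_fixed_point(x, bits=32):
    """Return (mantissa, frac_bits) such that x ~= mantissa * 2**-frac_bits,
    with the mantissa using at most `bits` bits."""
    # frexp decomposes x = m * 2**e with 0.5 <= m < 1
    _, e = np.frexp(x)
    frac_bits = int(bits - 1 - e)
    mantissa = int(np.round(x * 2.0 ** frac_bits))
    return mantissa, frac_bits

# Reciprocal of the folded scale (illustrative value: scale_folded = 1000)
rec_int, fracbits_rec = to_fixed_point(1.0 / 1000)
print(rec_int, fracbits_rec)  # 1099511628 40
```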

Then the rescaled outputs are simply evaluated as:

$outputs_{int32} = outputs_{int32}.rec_{folded}$
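
Since $rec_{folded} = rec_{int} . 2^{-fracbits_{rec}}$, only the integer mantissa $rec_{int}$ is actually multiplied: the power-of-two factor remains implicit. A sketch with illustrative accumulator values, using a 64-bit product to avoid overflowing int32:

```python
import numpy as np

# Illustrative int32 accumulators and fixed-point reciprocal
# (~1/1000 with 40 fractional bits, as computed in the previous sketch)
acc_int32 = np.array([1234, -5678, 250000], dtype=np.int32)
rec_int, fracbits_rec = 1099511628, 40

# Multiply by the integer mantissa only: the result is the mantissa of a
# fixed-point number with an implicit scale of 2**-fracbits_rec
outputs_fixed = acc_int32.astype(np.int64) * rec_int
```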

Reduce the precision of the fixed-point rescaled outputs using a rounded right-shift

The rescaled outputs are represented as a fixed-point number with an implicit scale of $2^{-fracbits_{rec}}$.

To obtain the actual 8-bit integer values corresponding to the original downscale operation, we must apply this implicit scale.

We use the rounded right-shift operation described in the fixed-point introduction post:

$outputs_{int8} = (outputs_{int32} + 2^{fracbits_{rec} - 1}) \gg fracbits_{rec}$

Then we can apply the zero-point:

$outputs_{uint8} = saturate(outputs_{int8} + zp_{out})$
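
Putting the three steps together, here is a sketch of the complete integer-only downscale (function and variable names are illustrative):

```python
import numpy as np

def downscale_integer(acc_int32, rec_int, fracbits_rec, zp_out):
    """Integer-only downscale of int32 accumulators to uint8."""
    # Fixed-point rescale: multiply by the reciprocal mantissa
    # (64-bit product to avoid overflowing int32)
    rescaled = acc_int32.astype(np.int64) * rec_int
    # Rounded right-shift: add half of the implicit scale before shifting
    rounded = (rescaled + (1 << (fracbits_rec - 1))) >> fracbits_rec
    # Apply the zero-point and saturate to the uint8 range
    return np.clip(rounded + zp_out, 0, 255).astype(np.uint8)

# With the illustrative values used above (scale_folded = 1000, zp_out = 128),
# this should match the float reference up to the finite precision of rec_int
# and half-way rounding differences
acc = np.array([1234, -5678, 250000], dtype=np.int32)
print(downscale_integer(acc, rec_int=1099511628, fracbits_rec=40, zp_out=128))
# [129 122 255]
```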
