# Fixed-point representation for quantization

26 May 2023 by David Corvoysier

As explained in my introduction to Machine Learning quantization, the quantization of an ML model produces a graph of operations applied to quantized tensors.

Quantized tensors are actually integer tensors that share the same float scale and integer zero-point.

The implementation of the quantized operations is device-specific.

One of the main design decisions is how the input, weight and output float scales are propagated and applied in the quantized graph.

In two other posts I will explain how it is possible to use integer arithmetic operators for that purpose if the scales are represented as fixed-point numbers.

This post is a brief introduction to the fixed-point representation and to the fixed-point arithmetic operators.

## Fixed-point representation

Before the introduction of the floating point representation, decimal values were expressed using a fixed-point representation.

This representation also uses a mantissa and an exponent, but the latter is implicit: it defines the number of bits in the mantissa dedicated to the fractional part of the number.

The minimum non-zero value that can be represented for a given number of fractional bits is $2^{-fracbits}$.

For instance, with three fractional bits, the smallest float number that can be represented is $2^{-3} = 0.125$.

Below is an example of an unsigned 8-bit fixed-point number with 4 fractional bits.

```
.-----------------------------------.
|  0   1   0   1  |  1   1   1   0  |
|-----------------+-----------------|
|  integer bits   | fractional bits |
|-----------------+-----------------|
|  3   2   1   0  | -1  -2  -3  -4  |
'-----------------------------------'
```

The decimal value of that number is: $2^{2} + 2^{0} + 2^{-1} + 2^{-2} + 2^{-3} = 5.875$
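The same value can be recovered programmatically from the mantissa and the implicit number of fractional bits:

```python
# Decode the 8-bit fixed-point number from the diagram above:
# the mantissa is 0b01011110 and there are 4 fractional bits
mantissa = 0b01011110
frac_bits = 4
value = mantissa * 2.0 ** -frac_bits
print(value)  # 5.875
```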

The precision of the representation is directly related to the number of fractional bits.

Below are some more examples of π represented with unsigned 8-bit fixed-point numbers using different numbers of fractional bits:

| float | frac_bits | mantissa | binary |
|---|---|---|---|
| 3.140625 | 6 | 201 | 11001001 |
| 3.15625 | 5 | 101 | 01100101 |
| 3.125 | 4 | 50 | 00110010 |
| 3.125 | 3 | 25 | 00011001 |
| 3.25 | 2 | 13 | 00001101 |
| 3.0 | 1 | 6 | 00000110 |
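The rows of the table above can be reproduced by rounding π to the nearest multiple of each power-of-two scale, a small sketch using Python's `math.pi`:

```python
import math

# Quantize pi with the power-of-two scale implied by each number of fractional bits
rows = []
for frac_bits in range(6, 0, -1):
    mantissa = round(math.pi * 2 ** frac_bits)
    value = mantissa * 2.0 ** -frac_bits
    rows.append((value, frac_bits, mantissa, f"{mantissa:08b}"))
    print(f"{value} | {frac_bits} | {mantissa} | {mantissa:08b}")
```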

## Obtaining a fixed-point representation of a float

As a reminder, a float number is represented as:

\[x = mantissa \cdot 2^{exponent}\]

Our goal here is to obtain a fixed-point representation of $x$.

Technically, we could directly take the float mantissa, but it is 24-bit, with a high risk of overflows in the downstream fixed-point operations.

For the range of numbers used in Machine Learning, an 8-bit mantissa is usually enough to accurately represent a `float32` number.

As a consequence, we only need to keep the 8 most significant bits of the mantissa, which effectively means quantizing the float to 8-bit with the power-of-two scale that minimizes the precision loss.

This can be achieved in several ways depending on the level of abstraction you are comfortable with: below is an algorithm relying only on high-level mathematical operations.

```python
import numpy as np


def to_fixed_point(x, bitwidth, signed=True):
    """Convert a number to a FixedPoint representation

    The representation is composed of a mantissa and an implicit exponent expressed as
    a number of fractional bits, so that:

    x ~= mantissa . 2 ** -frac_bits

    The mantissa is an integer whose bitwidth and signedness are specified as parameters.

    Args:
        x: the source number or array
        bitwidth: the bitwidth of the mantissa
        signed: whether the mantissa is signed or not

    Returns:
        the mantissa and the number of fractional bits
    """
    if not isinstance(x, np.ndarray):
        x = np.array(x)
    # Evaluate the number of bits available for the mantissa
    mantissa_bits = bitwidth - 1 if signed else bitwidth
    # Evaluate the number of bits required to represent the whole part of x,
    # as the power of two enclosing the absolute value of x
    # Note that it can be negative if x < 0.5
    whole_bits = np.ceil(np.log2(np.abs(x))).astype(np.int32)
    # Deduce the number of bits required for the fractional part of x
    # Note that it can be negative if the whole part exceeds the mantissa
    frac_bits = mantissa_bits - whole_bits
    # Evaluate the 'scale', which is the smallest value that can be represented (as 1)
    scale = 2. ** -frac_bits
    # Evaluate the minimum and maximum values for the mantissa
    mantissa_min = -2 ** mantissa_bits if signed else 0
    mantissa_max = 2 ** mantissa_bits - 1
    # Evaluate the mantissa by quantizing x with the scale, clipping to the min and max
    mantissa = np.clip(np.round(x / scale), mantissa_min, mantissa_max).astype(np.int32)
    return mantissa, frac_bits
```
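As a sanity check, the same sequence of steps can be traced by hand for a single scalar, assuming $x = \pi$ and a signed 8-bit target:

```python
import math

# Walk through the algorithm for x = pi with a signed 8-bit mantissa
x = math.pi
mantissa_bits = 8 - 1                      # one bit is reserved for the sign
whole_bits = math.ceil(math.log2(abs(x)))  # 2 bits for the whole part of pi
frac_bits = mantissa_bits - whole_bits     # 5 bits left for the fractional part
scale = 2.0 ** -frac_bits
mantissa = max(-2 ** mantissa_bits, min(round(x / scale), 2 ** mantissa_bits - 1))
print(mantissa, frac_bits)  # 101 5, i.e. 101 * 2 ** -5 = 3.15625
```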

The algorithm above produces a fixed-point representation of $x$ such that:

\[x_{float} \approx x_{int} \cdot 2^{-x_{fracbits}}\]

## Fixed-point addition (or subtraction)

The reason why the fixed-point representation comes to mind for quantization is that it has exactly the same restriction regarding the addition of numbers: they must be expressed using the same number of fractional bits.

The addition can then be performed directly on the underlying integer.

The resulting sum is a fixed-point number with the same fractional bits. It is exact unless it overflows.

What is really interesting here is that the alignment of fixed-point numbers is trivial: it can just be performed using a left bitshift.

Example:

The fixed-point (value, fractional bits) pairs below represent the following float values:

$a: (84, 3) = 84 * 2^{-3} = 10.5$

$b: (113, 4) = 113 * 2^{-4} = 7.0625$

Before summing $a$ and $b$, we need to shift $a$ to the left to align it with $b$:

$s = a + b = (84 \ll 1) + 113 = 168 + 113 = 281$

The sum is a fixed-point number with 4 fractional bits:

$s: (281, 4) = 281 * 2^{-4} = 17.5625$
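The addition above can be sketched in a few lines, using the same (mantissa, fractional bits) pairs as the example:

```python
# Align a to b's 4 fractional bits with a left shift, then add the mantissas
a_mantissa, a_frac = 84, 3    # 10.5
b_mantissa, b_frac = 113, 4   # 7.0625
shift = b_frac - a_frac
s_mantissa = (a_mantissa << shift) + b_mantissa
s_frac = b_frac
print(s_mantissa, s_frac, s_mantissa * 2.0 ** -s_frac)  # 281 4 17.5625
```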

## Fixed-point multiplication

The multiplication of two fixed-point numbers can be performed directly on the underlying integer numbers.

The resulting product is a fixed-point number with a number of fractional bits corresponding to the sum of the fractional bits of the inputs. It is exact unless it overflows.

Example:

Going back to our two numbers:

$a: (84, 3) = 84 * 2^{-3} = 10.5$

$b: (113, 4) = 113 * 2^{-4} = 7.0625$

Their fixed-point product is:

$p = a.b = (84 . 113, 3 + 4) = (9492, 7) = 74.15625$
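In code, the product is just an integer multiplication of the mantissas while the fractional bits add up:

```python
# Multiply the mantissas and add the fractional bits
a_mantissa, a_frac = 84, 3    # 10.5
b_mantissa, b_frac = 113, 4   # 7.0625
p_mantissa = a_mantissa * b_mantissa
p_frac = a_frac + b_frac
print(p_mantissa, p_frac, p_mantissa * 2.0 ** -p_frac)  # 9492 7 74.15625
```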

## Fixed-point downscale

The mantissa of the resulting product of two fixed-point numbers can grow very quickly, which would eventually lead to an overflow when chaining multiple operations.

It is therefore common to ‘downscale’ the result of a multiplication using a right-shift.

Example:

Going back to our previous product:

$p = a.b = (84 . 113, 3 + 4) = (9492, 7) = 74.15625$

It can be downscaled to fit in 8-bit by shifting right and adjusting the fractional bits:

$downscale(p) = p \gg 6 = (148, 1) = 74$

Note that the right-shift operation always performs a floor, which may lead to a loss of precision.

For that reason, it is often implemented as a ‘rounded’ right-shift by adding $2^{n-1}$ before shifting by $n$.

Note: this is mathematically equivalent to adding $0.5$ to $\frac{x}{2^{n}}$ before taking its floor.
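A minimal sketch of such a rounded right-shift (the helper name `rounded_rshift` is mine, not from any particular library):

```python
def rounded_rshift(x, n):
    # Add 2 ** (n - 1) before shifting: equivalent to floor(x / 2 ** n + 0.5)
    return (x + (1 << (n - 1))) >> n

# Downscaling the previous product: here flooring and rounding coincide
print(9492 >> 6, rounded_rshift(9492, 6))  # 148 148
# A case where they differ: 5 / 2 = 2.5 rounds up to 3 instead of flooring to 2
print(5 >> 1, rounded_rshift(5, 1))  # 2 3
```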

## Fixed-point division

The division of two fixed-point numbers can be performed directly on the underlying integer numbers.

The resulting quotient is a fixed-point number with a number of fractional bits corresponding to the difference between the fractional bits of the inputs. It is usually not exact.

Example:

Going back to our two numbers:

$a: (84, 3) = 84 * 2^{-3} = 10.5$

$b: (113, 4) = 113 * 2^{-4} = 7.0625$

Their fixed-point division is:

$p = \frac{b}{a} = (\frac{113}{84}, 4 - 3) = (1, 1) = 0.5$

A possible mitigation is to left-shift the dividend before the division to increase its precision: the resulting quotient will in turn have an increased precision.

$b: (113, 4) \ll 3 = (113 \ll 3, 4 + 3) = (904, 7) = 904 * 2^{-7} = 7.0625$

$p = \frac{b}{a} = (\frac{904}{84}, 7 - 3) = (10, 4) = 0.625$
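Both the naive division and the pre-shifted variant above can be checked with integer arithmetic:

```python
# Divide the mantissas and subtract the fractional bits
a_mantissa, a_frac = 84, 3    # 10.5
b_mantissa, b_frac = 113, 4   # 7.0625
# Naive integer division loses most of the precision
naive = (b_mantissa // a_mantissa, b_frac - a_frac)
# Pre-shifting the dividend by 3 bits preserves more fractional precision
shift = 3
precise = ((b_mantissa << shift) // a_mantissa, b_frac + shift - a_frac)
print(naive, precise)  # (1, 1) (10, 4) i.e. 0.5 and 0.625
```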
