Floating-point basics
Updated: 2024-07-25
Created: 2024-07
- What every computer scientist should know about floating-point arithmetic (1, 2).
- Floating-point arithmetic (1).
- A gigantic mess with properties very different from those of real numbers. They are fractions, written in scientific notation (see the sketch below):
	- MeN: M × 10^N
	- Me-N: M × 10^(-N) = M / 10^N
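A minimal Python sketch (mine, not from the notes) of the e-notation and of the fact that every finite float is an exact fraction; Python's float is an IEEE-754 binary64.

```python
from fractions import Fraction

# MeN notation: the literal 15e2 means 15 * 10**2, 15e-2 means 15 / 10**2.
print(15e2)     # 1500.0
print(15e-2)    # 0.15

# Every finite float is an exact fraction with a power-of-two denominator,
# which is why the decimal 0.1 cannot be stored exactly.
print((0.1).as_integer_ratio())   # (3602879701896397, 36028797018963968)
print(Fraction(0.1))              # 3602879701896397/36028797018963968
print(0.1 + 0.2 == 0.3)           # False
```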
- N is the exponent (of 10 in the notation above, of 2 in the usual binary formats) and is represented straightforwardly as a base-2 integer.
- M is called the mantissa (significand); its representation is not straightforward, and that matters a lot.
- The mantissa can be normalised or de-normalised (also called subnormal).
- The mantissa is an integer, but not necessarily in base-2 (e.g. the IEEE-754 decimal formats); the exponent is almost always in base-2. See the decomposition sketch below.
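A small Python sketch (assumed, not part of the notes) that pulls the mantissa and exponent out of a binary64 float and shows a subnormal value.

```python
import math
import sys

x = 6.5
# frexp returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1 (normalised form).
m, e = math.frexp(x)
print(m, e, math.ldexp(m, e))   # 0.8125 3 6.5

# The hexadecimal form shows the binary mantissa and the exponent directly.
print(x.hex())                  # 0x1.a000000000000p+2

# Smallest positive normalised binary64, and the subnormal range below it.
print(sys.float_info.min)       # 2.2250738585072014e-308
print(5e-324)                   # smallest positive subnormal (de-normalised)
print(5e-324 / 2)               # 0.0: rounds to zero, nothing smaller is representable
```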
- Standard formats are 32, 64 and 128 bits. Small vectors (64, 128, 256, 512 bits) can be extremely fast, if the data allows it. See the sketch after this list.
- Storage size need not be the same as operation (register) size: some Intel CPUs have 80-bit registers and operations, whose results do not match plain IEEE-754 32/64-bit arithmetic.
- For variational/relaxation algorithms there are 8- and 16-bit formats (typically on GPUs).
- A little-known format is the log-based representation, where the mantissa is the logarithm of the represented number. Addition and subtraction are slow and lose precision; multiplication and division are fast and keep precision. Very well suited to signal processing and control systems. Typically 8 and 16 bits.
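A minimal Python sketch (assumed) of how the same value degrades when stored in narrower IEEE-754 formats; struct's "f" and "e" codes are binary32 and binary16. The 80-bit and 128-bit formats are not directly reachable from the standard library.

```python
import struct

def roundtrip(value, fmt):
    """Pack into the given IEEE-754 storage format and unpack back to binary64."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

x = 0.1                     # held as binary64 by Python
print(x)                    # 0.1
print(roundtrip(x, "<f"))   # 0.10000000149011612 (binary32, ~7 decimal digits)
print(roundtrip(x, "<e"))   # 0.0999755859375 (binary16, ~3 decimal digits)
```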
- Addition and subtraction with very different exponents (the smaller operand can be absorbed entirely).
- Different rounding modes.
- Comparisons need to be approximate, with a scaled tolerance.
- Underflow and overflow.
- NaN propagation and comparisons (NaN compares unequal even to itself).
- Infinities are less troublesome. The sketch below exercises most of these.
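A short Python sketch (mine, not from the notes) that triggers these hazards with binary64 floats; rounding modes are left out because Python does not expose them directly.

```python
import math

# Very different exponents: the 1.0 is absorbed completely.
print(1e16 + 1.0 == 1e16)                 # True

# Approximate comparison with a scaled (relative) tolerance.
a, b = 0.1 + 0.2, 0.3
print(a == b)                             # False
print(math.isclose(a, b, rel_tol=1e-9))   # True

# Overflow goes to infinity; underflow goes through subnormals to zero.
print(1e308 * 10)                         # inf
print(1e-300 / 1e20)                      # roughly 1e-320, a subnormal
print(1e-300 / 1e200)                     # 0.0

# NaN propagates through arithmetic and compares unequal even to itself.
nan = float("nan")
print(nan + 1.0, nan == nan)              # nan False
```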
- These are modulo-N rational numbers, not real numbers.
- The order of operations matters a lot (addition is not associative); see the summation sketch below.
- Vector/matrix (block) layouts aligned to 64, 128, 256, 512 bits can give huge speed improvements by using short-vector registers and instructions, where all elements of each short vector are processed in parallel.
- Fused multiply-add is a big deal for variational algorithms.
- Some algorithms are non-obvious (e.g. compensated summation).
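A Python sketch (assumed, not from the notes) of the non-associativity and of one non-obvious algorithm, Kahan compensated summation; math.fsum is the correctly rounded reference.

```python
import math
import random

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False: not associative

def kahan_sum(values):
    """Compensated summation: the rounding error stays tiny instead of growing with n."""
    total = 0.0
    c = 0.0                   # running compensation for lost low-order bits
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y   # what was lost when y was added to total
        total = t
    return total

random.seed(0)
data = [random.uniform(0.0, 1.0) for _ in range(1_000_000)]
exact = math.fsum(data)               # correctly rounded reference sum
print(abs(sum(data) - exact))         # naive left-to-right summation drifts
print(abs(kahan_sum(data) - exact))   # typically far smaller (often 0.0)
```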