Floating-point basics
Updated: 2024-07-25
Created: 2024-07
- What every computer scientist should know about floating-point arithmetic (1, 2).
- Floating-point arithmetic (1).
- A gigantic mess with properties very different from those of real numbers. They are fractions: MeN is M / 10^(-N) (= M × 10^N), and Me-N is M / 10^N (= M × 10^(-N)). See the frexp sketch after this list for the same split in code.
- N is the exponent (of 10) and is represented straightforwardly as a base-2 integer.
- M is called the mantissa; its representation is not straightforward, and that matters a lot.
- Mantissa can be normalised or de-normalised (or subnormal); see the subnormal probe after this list.
- Mantissa is an integer but not necessarily in base-2 (the
exponent is almost always in base-2).
- Standard formats are 32, 64, 128 bits. Small vectors (64,
128, 256, 512 bits) can be extremely fast, if the data allows
it.
- Storage size need not be the same as operation (register) size. In some cases Intel CPUs have 80-bit (x87) registers and operations which do not follow the IEEE-754 storage formats (see the size/precision probe after this list).
- For variational/relaxation algorithms there are 8- and 16-bit formats (typically on GPUs).
- A little-known format is the log-based representation, where the mantissa is the logarithm of the represented number. Addition and subtraction are slow and lose precision; multiplication and division are fast and keep precision. Very suited to signal processing and control systems. Typically 8 and 16 bits. See the log-number sketch after this list.
- Addition and subtraction with very different exponents lose precision: the smaller operand's bits are shifted out before the operation (see the absorption example after this list).
- Different rounding modes (see the fenv.h example after this list).
- Comparisons need to be approximate, with a scaled tolerance (see the nearly_equal sketch after this list).
- Underflow and overflow (example after this list).
- NaN propagation and comparisons (example after this list).
- Infinities are less troublesome.
- These are modulo-N rational numbers.
- The order of operations matters a lot (not associative); see the associativity example after this list.
- Vector/matrix (block) layouts aligned to 64, 128, 256, 512 bits can give huge speed improvements by using short-vector registers and instructions, where all elements of each short-vector are processed in parallel (see the SSE2 sketch after this list).
- Fused multiply-add is a big deal for variational algorithms (see the fma example after this list).
- Some algorithms are non-obvious, e.g. compensated (Kahan) summation (see the sketch after this list).
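
A minimal C sketch of the mantissa/exponent split. It uses base 2 (as binary formats do) rather than the base-10 MeN notation above; frexp and ldexp are standard C99 math.h, nothing else is assumed.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 6.25;                 /* 0.78125 * 2^3 */
    int exponent;
    /* frexp splits x into mantissa * 2^exponent with 0.5 <= |mantissa| < 1 */
    double mantissa = frexp(x, &exponent);
    printf("%g = %.17g * 2^%d\n", x, mantissa, exponent);

    /* The mantissa is stored as an integer: scale it up by the precision
       (53 bits for double, including the implicit leading bit). */
    printf("integer mantissa: %.0f\n", ldexp(mantissa, 53));
    return 0;
}
```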
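
A short probe of normalised vs de-normalised (subnormal) doubles; DBL_MIN, nextafter and fpclassify are standard C99.

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    double smallest_normal = DBL_MIN;              /* smallest normalised double */
    double smallest_sub    = nextafter(0.0, 1.0);  /* smallest subnormal double */

    printf("smallest normal:    %g (%s)\n", smallest_normal,
           fpclassify(smallest_normal) == FP_NORMAL ? "normal" : "other");
    printf("smallest subnormal: %g (%s)\n", smallest_sub,
           fpclassify(smallest_sub) == FP_SUBNORMAL ? "subnormal" : "other");

    /* Subnormals trade precision for extra range near zero: this result is
       still nonzero, but carries fewer significant bits than a normal double. */
    printf("DBL_MIN / 2^40 = %g\n", smallest_normal / 0x1p40);
    return 0;
}
```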
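
A quick probe of storage sizes and of the x87 extended format mentioned above. The printed numbers are platform-dependent; the expectation that long double is the 80-bit x87 format (64 mantissa bits, padded to 16 bytes of storage) is an assumption about a typical x86-64 Linux/GCC build, not a universal rule.

```c
#include <float.h>
#include <stdio.h>

int main(void) {
    printf("float:       %zu bytes, %d mantissa bits\n", sizeof(float), FLT_MANT_DIG);
    printf("double:      %zu bytes, %d mantissa bits\n", sizeof(double), DBL_MANT_DIG);
    printf("long double: %zu bytes, %d mantissa bits\n", sizeof(long double), LDBL_MANT_DIG);
    /* LDBL_MANT_DIG == 64 indicates the 80-bit x87 extended format: more
       precision in the register/type than in the 64-bit storage format. */
    return 0;
}
```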
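
An illustrative sketch of the log-based idea, not a production format: the stored value is log2 of the magnitude, so multiply/divide become add/subtract of logs, while add/subtract need a correction term (which is where real implementations lose precision). All names here (lns_t, lns_mul, lns_add) are made up for the example; real formats store the log as a small 8/16-bit fixed-point number rather than a double.

```c
#include <math.h>
#include <stdio.h>

/* Toy log-number representation: value = sign * 2^log2mag (zero not handled). */
typedef struct { int sign; double log2mag; } lns_t;

static lns_t lns_from(double x) {
    lns_t r = { (x < 0) ? -1 : 1, log2(fabs(x)) };
    return r;
}
static double lns_to(lns_t a) { return a.sign * exp2(a.log2mag); }

/* Multiplication is cheap: just add the logs. */
static lns_t lns_mul(lns_t a, lns_t b) {
    lns_t r = { a.sign * b.sign, a.log2mag + b.log2mag };
    return r;
}

/* Addition (same signs only here) needs log2(1 + 2^d): slower, and the
   table/approximation used for this term is the source of precision loss. */
static lns_t lns_add(lns_t a, lns_t b) {
    if (a.log2mag < b.log2mag) { lns_t t = a; a = b; b = t; }
    lns_t r = { a.sign, a.log2mag + log2(1.0 + exp2(b.log2mag - a.log2mag)) };
    return r;
}

int main(void) {
    lns_t a = lns_from(3.0), b = lns_from(5.0);
    printf("3 * 5 = %g\n", lns_to(lns_mul(a, b)));
    printf("3 + 5 = %g\n", lns_to(lns_add(a, b)));
    return 0;
}
```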
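
The absorption example for addition with very different exponents, in plain standard C:

```c
#include <stdio.h>

int main(void) {
    /* 1e16 already needs all 53 mantissa bits of a double, so when 1.0 is
       aligned to 1e16's exponent its contribution falls below the last
       representable bit and rounding discards it. */
    double big = 1e16;
    double sum = (big + 1.0) - big;
    printf("(1e16 + 1.0) - 1e16 = %g\n", sum);   /* prints 0, not 1 */
    return 0;
}
```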
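
The rounding-mode example, using the standard C99 fenv.h interface. Assumptions: FE_UPWARD/FE_DOWNWARD are defined on the target (they are optional in the standard), and the compiler does not constant-fold the divisions; the volatile operands and the FENV_ACCESS pragma are there to discourage that, though some compilers ignore the pragma.

```c
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON   /* we change the rounding mode at run time */

int main(void) {
    volatile double a = 1.0, b = 3.0;

    fesetround(FE_TONEAREST);
    printf("to nearest: %.17g\n", a / b);
    fesetround(FE_UPWARD);
    printf("upward:     %.17g\n", a / b);
    fesetround(FE_DOWNWARD);
    printf("downward:   %.17g\n", a / b);

    fesetround(FE_TONEAREST);  /* restore the default mode */
    return 0;
}
```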
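
A sketch of an approximate comparison with a tolerance scaled to the operands' magnitude. The helper name nearly_equal and the tolerance values are choices made for the example, not a standard API.

```c
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* True if a and b differ by at most rel_tol * max(|a|, |b|); abs_tol is a
   floor so that comparisons near zero do not become impossibly strict. */
static bool nearly_equal(double a, double b, double rel_tol, double abs_tol) {
    double diff  = fabs(a - b);
    double scale = fmax(fabs(a), fabs(b));
    return diff <= fmax(rel_tol * scale, abs_tol);
}

int main(void) {
    double x = 0.1 + 0.2;
    printf("0.1 + 0.2 == 0.3     -> %d\n", x == 0.3);                         /* 0 */
    printf("nearly_equal(x, 0.3) -> %d\n", nearly_equal(x, 0.3, 1e-12, 1e-15)); /* 1 */
    return 0;
}
```

The scaled (relative) part is what makes the test meaningful for both huge and tiny operands; a fixed absolute epsilon alone is either too loose or too strict depending on magnitude.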
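
The underflow/overflow example (standard C; 0x1p30 and 0x1p60 are C99 hexadecimal float constants):

```c
#include <float.h>
#include <stdio.h>

int main(void) {
    double huge = DBL_MAX;
    double tiny = DBL_MIN;

    printf("DBL_MAX * 2      = %g   (overflow to +inf)\n", huge * 2.0);
    printf("DBL_MIN / 2^30   = %g   (gradual underflow: subnormal)\n", tiny / 0x1p30);
    printf("DBL_MIN / 2^60   = %g   (underflow all the way to 0)\n", tiny / 0x1p60);
    return 0;
}
```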
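
The NaN propagation/comparison example (standard C; the volatile zero just keeps the compiler from complaining about a literal 0/0):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile double zero = 0.0;
    double n = zero / zero;                  /* 0/0 produces a quiet NaN */

    printf("n == n   -> %d\n", n == n);      /* 0: NaN is unequal to everything, itself included */
    printf("n < 1.0  -> %d\n", n < 1.0);     /* 0: ordered comparisons with NaN are false */
    printf("isnan(n) -> %d\n", isnan(n));    /* 1: the portable way to test for NaN */
    printf("n + 1.0  -> %g\n", n + 1.0);     /* nan: NaN propagates through arithmetic */
    return 0;
}
```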
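
The associativity example: same three numbers, two groupings, two answers (standard C).

```c
#include <stdio.h>

int main(void) {
    double a = 1e20, b = -1e20, c = 1.0;

    printf("(a + b) + c = %g\n", (a + b) + c);   /* 1: a + b cancels exactly first */
    printf("a + (b + c) = %g\n", a + (b + c));   /* 0: c is absorbed into b first */
    return 0;
}
```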
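
A minimal illustration of the short-vector point, assuming an x86-64 target where the SSE2 intrinsics below are baseline and C11 aligned_alloc is available (as on Linux/glibc; other platforms need their own aligned allocator). Each _mm_add_pd adds two doubles in parallel from 16-byte-aligned storage, which is the mechanism behind the speed-ups from aligned block layouts.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    enum { N = 8 };                       /* multiple of the 2-wide vector */
    double *a = aligned_alloc(16, N * sizeof(double));
    double *b = aligned_alloc(16, N * sizeof(double));
    double *c = aligned_alloc(16, N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 10.0 * i; }

    /* Two doubles per 128-bit register, added by one instruction. */
    for (int i = 0; i < N; i += 2) {
        __m128d va = _mm_load_pd(&a[i]);  /* aligned load: needs the 16-byte layout */
        __m128d vb = _mm_load_pd(&b[i]);
        _mm_store_pd(&c[i], _mm_add_pd(va, vb));
    }

    for (int i = 0; i < N; i++) printf("%g ", c[i]);
    printf("\n");
    free(a); free(b); free(c);
    return 0;
}
```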
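
The fma example: fma rounds once, a separate multiply-then-add rounds twice, and the difference is visible. fma is C99 math.h (link with -lm); the volatile intermediate keeps the compiler's contraction settings from quietly fusing the "separate" version.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + 0x1p-27;           /* 1 + 2^-27, exactly representable */

    volatile double p = a * a;          /* separately rounded product: the 2^-54 term is lost here */
    double separate = p - 1.0;
    double fused    = fma(a, a, -1.0);  /* single rounding of the exact a*a - 1 */

    printf("a*a - 1.0     = %.17g\n", separate);
    printf("fma(a,a,-1.0) = %.17g\n", fused);
    return 0;
}
```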
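
One concrete instance of a non-obvious algorithm, chosen here as an example (the note itself does not name one): Kahan compensated summation, which recovers the low-order bits that plain summation throws away.

```c
#include <stdio.h>

/* Kahan compensated summation: c carries the rounding error of each step. */
static double kahan_sum(const double *x, int n) {
    double sum = 0.0, c = 0.0;
    for (int i = 0; i < n; i++) {
        double y = x[i] - c;      /* correct the next term by the carried error */
        double t = sum + y;       /* big + small: low bits of y may be lost here... */
        c = (t - sum) - y;        /* ...but this recovers what was lost */
        sum = t;
    }
    return sum;
}

int main(void) {
    double x[] = { 1e16, 1.0, 1.0, 1.0, 1.0, -1e16 };
    int n = sizeof x / sizeof x[0];

    double naive = 0.0;
    for (int i = 0; i < n; i++) naive += x[i];

    printf("naive: %g\n", naive);              /* 0: the 1.0 terms are absorbed */
    printf("kahan: %g\n", kahan_sum(x, n));    /* 4: the exact answer */
    return 0;
}
```

Note that the three-step dance in the loop only works because the operations are not reassociated; aggressive fast-math compiler flags can optimise the compensation away.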