Floating-point basics
Updated: 2024-07-25
Created: 2024-07
- What every computer scientist should know about floating-point arithmetic (1, 2).
- Floating-point arithmetic (1).
- A gigantic mess with properties very different from those of real numbers. They are fractions, written in scientific notation (see the sketch below):
	- MeN: M × 10^N
	- Me-N: M × 10^(-N) = M / 10^N
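A minimal Python sketch (mine, not from the notes) of the e-notation and of the fact that every finite float is an exact fraction; Python's float is an IEEE-754 binary64.

```python
from fractions import Fraction

# MeN notation: the literal 15e2 means 15 * 10**2, 15e-2 means 15 / 10**2.
print(15e2)     # 1500.0
print(15e-2)    # 0.15

# Every finite float is an exact fraction with a power-of-two denominator,
# which is why the decimal 0.1 cannot be stored exactly.
print((0.1).as_integer_ratio())   # (3602879701896397, 36028797018963968)
print(Fraction(0.1))              # 3602879701896397/36028797018963968
print(0.1 + 0.2 == 0.3)           # False
```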
- N is the exponent (of 10 in the notation above, of 2 in the usual binary formats) and is represented straightforwardly as a base-2 integer.
- M is called the mantissa (significand); its representation is not straightforward, and that matters a lot.
- The mantissa can be normalised or de-normalised (also called subnormal).
- The mantissa is an integer, but not necessarily in base-2 (e.g. the IEEE-754 decimal formats); the exponent is almost always in base-2. See the decomposition sketch below.
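A small Python sketch (assumed, not part of the notes) that pulls the mantissa and exponent out of a binary64 float and shows a subnormal value.

```python
import math
import sys

x = 6.5
# frexp returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1 (normalised form).
m, e = math.frexp(x)
print(m, e, math.ldexp(m, e))   # 0.8125 3 6.5

# The hexadecimal form shows the binary mantissa and the exponent directly.
print(x.hex())                  # 0x1.a000000000000p+2

# Smallest positive normalised binary64, and the subnormal range below it.
print(sys.float_info.min)       # 2.2250738585072014e-308
print(5e-324)                   # smallest positive subnormal (de-normalised)
print(5e-324 / 2)               # 0.0: rounds to zero, nothing smaller is representable
```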
- Standard formats are 32, 64 and 128 bits. Small vectors (64, 128, 256, 512 bits) can be extremely fast, if the data allows it. See the sketch after this list.
- Storage size need not be the same as operation (register) size: some Intel CPUs have 80-bit registers and operations, whose results do not match plain IEEE-754 32/64-bit arithmetic.
- For variational/relaxation algorithms there are 8- and 16-bit formats (typically on GPUs).
- A little-known format is the log-based representation, where the mantissa is the logarithm of the represented number. Addition and subtraction are slow and lose precision; multiplication and division are fast and keep precision. Very well suited to signal processing and control systems. Typically 8 and 16 bits.
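A minimal Python sketch (assumed) of how the same value degrades when stored in narrower IEEE-754 formats; struct's "f" and "e" codes are binary32 and binary16. The 80-bit and 128-bit formats are not directly reachable from the standard library.

```python
import struct

def roundtrip(value, fmt):
    """Pack into the given IEEE-754 storage format and unpack back to binary64."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

x = 0.1                     # held as binary64 by Python
print(x)                    # 0.1
print(roundtrip(x, "<f"))   # 0.10000000149011612 (binary32, ~7 decimal digits)
print(roundtrip(x, "<e"))   # 0.0999755859375 (binary16, ~3 decimal digits)
```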
- Addition and subtraction with very different exponents (the smaller operand can be absorbed entirely).
- Different rounding modes.
- Comparisons need to be approximate, with a scaled tolerance.
- Underflow and overflow.
- NaN propagation and comparisons (NaN compares unequal even to itself).
- Infinities are less troublesome. The sketch below exercises most of these.
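A short Python sketch (mine, not from the notes) that triggers these hazards with binary64 floats; rounding modes are left out because Python does not expose them directly.

```python
import math

# Very different exponents: the 1.0 is absorbed completely.
print(1e16 + 1.0 == 1e16)                 # True

# Approximate comparison with a scaled (relative) tolerance.
a, b = 0.1 + 0.2, 0.3
print(a == b)                             # False
print(math.isclose(a, b, rel_tol=1e-9))   # True

# Overflow goes to infinity; underflow goes through subnormals to zero.
print(1e308 * 10)                         # inf
print(1e-300 / 1e20)                      # roughly 1e-320, a subnormal
print(1e-300 / 1e200)                     # 0.0

# NaN propagates through arithmetic and compares unequal even to itself.
nan = float("nan")
print(nan + 1.0, nan == nan)              # nan False
```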
- These are modulo-N rational numbers, not real numbers.
- The order of operations matters a lot (addition is not associative); see the summation sketch below.
- Vector/matrix (block) layouts aligned to 64, 128, 256, 512 bits can give huge speed improvements by using short-vector registers and instructions, where all elements of each short vector are processed in parallel.
- Fused multiply-add is a big deal for variational algorithms.
- Some algorithms are non-obvious (e.g. compensated summation).
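A Python sketch (assumed, not from the notes) of the non-associativity and of one non-obvious algorithm, Kahan compensated summation; math.fsum is the correctly rounded reference.

```python
import math
import random

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False: not associative

def kahan_sum(values):
    """Compensated summation: the rounding error stays tiny instead of growing with n."""
    total = 0.0
    c = 0.0                   # running compensation for lost low-order bits
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y   # what was lost when y was added to total
        total = t
    return total

random.seed(0)
data = [random.uniform(0.0, 1.0) for _ in range(1_000_000)]
exact = math.fsum(data)               # correctly rounded reference sum
print(abs(sum(data) - exact))         # naive left-to-right summation drifts
print(abs(kahan_sum(data) - exact))   # typically far smaller (often 0.0)
```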