Floating-point basics examples

Gotchas

What every computer scientist should know about floating-point arithmetic.
Acta Numerica: Floating-point arithmetic.
Addition and subtraction with very different exponents.
Different rounding modes.
Comparisons need to be approximate, with a scaled tolerance.
Underflow and overflow.
NaN propagation and comparisons.
Infinities are less troublesome.
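
A few of these gotchas can be reproduced in a handful of lines. The following is only a sketch in Python, assuming IEEE 754 double precision (which CPython uses on common platforms); the variable names are illustrative.

```python
import math
import sys

# Addition with very different exponents: the small term is absorbed.
big = 1e16
absorbed = (big + 1.0) == big
print(absorbed)                      # True

# Overflow: going past the largest finite double gives infinity.
overflowed = sys.float_info.max * 2.0
print(overflowed == math.inf)        # True

# Underflow: below the smallest normal number precision degrades
# gradually (subnormals), and eventually the result rounds to zero.
subnormal = sys.float_info.min / 2.0
smallest = 5e-324                    # smallest positive subnormal double
flushed = smallest / 2.0
print(subnormal > 0.0, flushed == 0.0)   # True True
```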

Generalities

https://stackoverflow.com/a/872762
For single precision...

If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^23. Any X larger than this limit leads to the distance between floating point numbers being greater than 0.5.

If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^13. Any X larger than this limit leads to the distance between floating point numbers being greater than 0.0005.

For double precision...

If you want an accuracy of +/-0.5 (or 2^-1), the maximum size that the number can be is 2^52. Any X larger than this limit leads to the distance between floating point numbers being greater than 0.5.

If you want an accuracy of +/-0.0005 (about 2^-11), the maximum size that the number can be is 2^42. Any X larger than this limit leads to the distance between floating point numbers being greater than 0.0005.
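
The limits quoted above can be checked by measuring the gap between neighbouring floats. This is a rough sketch: `spacing32` is a hypothetical helper built on `struct`, and `math.ulp` requires Python 3.9 or later.

```python
import math
import struct

def spacing32(x: float) -> float:
    """Gap from x (rounded to single precision) to the next single above it.

    Assumes x is positive and finite.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    here = struct.unpack("<f", struct.pack("<I", bits))[0]
    above = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return above - here

# Single precision: the gap is 0.5 just below 2^23 and 1.0 from 2^23 on.
print(spacing32(2.0**23 - 1), spacing32(2.0**23))    # 0.5 1.0
# Single precision: the gap is 2^-11 just below 2^13 and 2^-10 from 2^13 on.
print(spacing32(2.0**13 - 1) == 2.0**-11, spacing32(2.0**13) == 2.0**-10)

# Double precision: the same pattern at 2^52 and 2^42.
print(math.ulp(2.0**52 - 1), math.ulp(2.0**52))      # 0.5 1.0
print(math.ulp(2.0**42 - 1) == 2.0**-11, math.ulp(2.0**42) == 2.0**-10)
```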
https://stackoverflow.com/a/873367

For floating-point integers (I'll give my answer in terms of IEEE double-precision), every integer between 1 and 2^53 is exactly representable. Beyond 2^53, integers that are exactly representable are spaced apart by increasing powers of two.
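
A quick way to see the 2^53 boundary, sketched in Python (whose `float` is assumed to be an IEEE double, and whose `int` is arbitrary precision, so the comparisons below are exact):

```python
limit = 2**53

# Every integer just below the limit survives a round trip through float.
exact_below = all(int(float(n)) == n for n in range(limit - 1000, limit + 1))
print(exact_below)                       # True

# limit + 1 is not representable: it rounds back down to limit.
print(int(float(limit + 1)) == limit)    # True

# Beyond the limit only every second integer is representable.
print(int(float(limit + 2)) == limit + 2)  # True
```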

Representation

Floating-Point Arithmetic: Issues and Limitations
https://docs.python.org/3/tutorial/floatingpoint.html
In base 2, 1/10 is the infinitely repeating fraction
```
0.0001100110011001100110011001100110011001100110011...
```
Stop at any finite number of bits, and you get an approximation. On most machines today, floats are approximated using a binary fraction with the numerator using the first 53 bits starting with the most significant bit and with the denominator as a power of two. In the case of 1/10, the binary fraction is 3602879701896397 / 2 ** 55 which is close to but not exactly equal to the true value of 1/10.

[...] On most machines, if Python were to print the true decimal value of the binary approximation stored for 0.1, it would have to display:
```
>>> 0.1
0.1000000000000000055511151231257827021181583404541015625
```
That is more digits than most people find useful, so Python keeps the number of digits manageable by displaying a rounded value instead:
```
>>> 1 / 10
0.1
```
Just remember, even though the printed result looks like the exact value of 1/10, the actual stored value is the nearest representable binary fraction.
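
The stored value can be inspected exactly with the standard `fractions` and `decimal` modules. A small sketch, assuming IEEE double precision:

```python
from decimal import Decimal
from fractions import Fraction

# The exact rational value of the double nearest to 0.1.
numerator, denominator = (0.1).as_integer_ratio()
print(numerator, denominator == 2**55)   # 3602879701896397 True

# The same value written out in decimal, with no rounding.
exact_decimal = Decimal(0.1)
print(exact_decimal)

# How far the stored value is from the true 1/10.
error = Fraction(0.1) - Fraction(1, 10)
print(error == Fraction(1, 5 * 2**55))   # True
```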

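Normalisation (before adding, the smaller operand is shifted to the exponent of the larger one, and the digits that no longer fit are lost):

1.23456789e-9 + 0.87654321e-2
 =>  0.00000012e-2 + 0.87654321e-2
 =>  0.87654333e-2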
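
The loss of digits during exponent alignment can be imitated with Python's `decimal` module limited to 8 significant digits. This is only an illustration: `decimal` rounds the exact sum rather than truncating the shifted operand, although for these operands the outcome is the same.

```python
from decimal import Context, Decimal

ctx = Context(prec=8)            # keep 8 significant decimal digits

small = Decimal("1.23456789e-9")
large = Decimal("0.87654321e-2")

total = ctx.add(small, large)
print(total)                     # 0.0087654333

# Only the leading digits "12" of the small operand made it into the sum.
kept = ctx.subtract(total, large)
print(kept)                      # 1.2E-9
```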

Simple surprises

Approximation examples:

$ perl -e 'printf ("%0.26f\n",0.1)'
0.10000000000000000555111512

$ perl -e 'printf ("%0.26f\n",0.3)'
0.29999999999999998889776975

$ perl -e 'printf ("%0.26f\n",0.3000000000000001)'
0.30000000000000009992007222
$ perl -e 'printf ("%0.26f\n",0.30000000000000001)'
0.29999999999999998889776975

Note that this is different from printing 0.3:

$ perl -e 'printf ("%0.26f\n",0.1+0.2)'
0.30000000000000004440892099

$ perl -e 'printf ("%0.26F\n",0.00000000000000001+0.1)'
0.10000000000000001942890293
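
The outputs above make more sense once the literals are placed on the grid of representable doubles. A sketch in Python, assuming IEEE doubles and Python 3.9 or later for `math.nextafter`:

```python
import math

one_up = math.nextafter(0.3, math.inf)      # next double above 0.3
two_up = math.nextafter(one_up, math.inf)   # and the one after that

# 0.30000000000000001 is closer to the double for 0.3 than to its neighbour.
print(0.30000000000000001 == 0.3)           # True

# 0.1 + 0.2 lands on the next double above 0.3.
print(0.1 + 0.2 == one_up)                  # True

# 0.3000000000000001 is two doubles above 0.3.
print(0.3000000000000001 == two_up)         # True

print(f"{0.1 + 0.2:.26f}")                  # 0.30000000000000004440892099
```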

Equality

More surprises:

$ perl -e 'printf ("%d\n",(0.1+0.2) == 0.3)'
0

$ perl -e 'printf ("%d\n",(1.0+2.0) == 3.0)'
1

$ perl -e 'printf ("%0.26f\n",(0.1+0.2) - 0.3)'
0.00000000000000005551115123

$ perl -e 'printf ("%d\n",((0.1+0.2)*10.0) == 3.0)'
0

INF sort of works, NAN is weird (mode of knowledge):

#include <stdio.h>

int main(void)
{
  float inf = 1.0f/0.0f;
  float nan = 0.0f/0.0f;

  printf("1.0f/0.0f: %0.26f\n",inf);
  printf("inf == inf: %d\n", inf == inf);
  printf("inf != inf: %d\n", inf != inf);
  printf("inf > inf: %d\n", inf > inf);
  printf("inf < inf: %d\n", inf < inf);

  /* The sign printed for NaN varies by platform. */
  printf("\n0.0f/0.0f: %0.26f\n",nan);
  printf("nan == nan: %d\n", nan == nan);
  printf("nan != nan: %d\n", nan != nan);
  printf("nan > nan: %d\n", nan > nan);
  printf("nan < nan: %d\n", nan < nan);

  return 0;
}

1.0f/0.0f: inf
inf == inf: 1
inf != inf: 0
inf > inf: 0
inf < inf: 0

0.0f/0.0f: -nan
nan == nan: 0
nan != nan: 1
nan > nan: 0
nan < nan: 0
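
A sketch of similar checks in Python, assuming IEEE semantics for `float`. For NaN, `!=` is the one comparison that evaluates to true, so `math.isnan` is the reliable test; NaN also propagates through arithmetic.

```python
import math

inf = math.inf
nan = math.nan

# Infinity compares like a very large number and absorbs finite terms.
print(inf == inf, inf > 1e308, inf + 1.0 == inf)   # True True True

# NaN is unordered: every comparison is false except !=.
print(nan == nan, nan < 1.0, nan > 1.0)            # False False False
print(nan != nan)                                  # True

# NaN propagates, and some operations on infinities produce it.
propagated = nan * 0.0 + 1.0
from_inf = inf - inf
print(math.isnan(propagated), math.isnan(from_inf))  # True True
```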

Using scaled tolerance (compare the absolute difference against a relative tolerance multiplied by the mean magnitude of the two sides):

$ perl -e 'my ($a,$b,$c)=(0.1,0.00000003,0.09999997); printf ("%0.26f\n",($a-($b+$c)))'
0.00000000000000001387778781

$ perl -e 'my ($a,$b,$c)=(0.1,0.00000003,0.09999997); printf ("%d\n",abs($a-($b+$c)) < (0.000000000000001*($a+$b+$c)/2))'
1

$ perl -e 'my ($a,$b,$c)=(0.1,0.00000003,0.09999997); printf ("%d\n",abs($a-($b+$c)) < (0.0000000000000001*($a+$b+$c)/2))'
0
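
The same idea in Python, as a minimal sketch: `nearly_equal` is a hypothetical helper that scales the tolerance by the mean magnitude of the operands, and the standard library's `math.isclose` offers a comparable relative test.

```python
import math

def nearly_equal(x: float, y: float, rel_tol: float) -> bool:
    """True if x and y differ by at most rel_tol times their mean magnitude."""
    scale = (abs(x) + abs(y)) / 2.0
    return abs(x - y) <= rel_tol * scale

print(0.1 + 0.2 == 0.3)                              # False
print(nearly_equal(0.1 + 0.2, 0.3, 1e-15))           # True
print(nearly_equal(0.1 + 0.2, 0.3, 1e-17))           # False: tolerance too tight
print(math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-15))   # True
```

A relative tolerance alone fails when both values are near zero, which is why `math.isclose` also accepts an `abs_tol` argument.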

Algorithm changes

Matrix inversion with pivoting (Gauss-Jordan elimination) is very sensitive to floating-point issues and parallax issues.
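If the norm of a matrix x is less than 1.0, then the inverse of 1 - x is given by the series 1 + x + x² + x³ + ..., which is much less sensitive.
There are sophisticated papers about solving the quadratic equation ax² + bx + c = 0, where the textbook formula loses accuracy through cancellation when b² is much larger than 4ac. A very important subject.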
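
To illustrate why the quadratic equation deserves care, here is a sketch comparing the textbook formula with a commonly used rearrangement that avoids subtracting nearly equal numbers. Both function names are illustrative, and both assume real roots, a nonzero `a`, and (for the stable variant) a nonzero `q`.

```python
import math

def roots_naive(a: float, b: float, c: float) -> tuple[float, float]:
    """Textbook formula: one root suffers cancellation when b*b >> 4*a*c."""
    disc = math.sqrt(b * b - 4.0 * a * c)
    return (-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)

def roots_stable(a: float, b: float, c: float) -> tuple[float, float]:
    """Add quantities of equal sign first, then get the other root from c/q."""
    disc = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(disc, b))
    return q / a, c / q

# x^2 - 1e8 x + 1 = 0 has roots close to 1e8 and 1e-8.
naive_small = roots_naive(1.0, -1e8, 1.0)[1]
stable_small = roots_stable(1.0, -1e8, 1.0)[1]
print(naive_small)    # roughly 7.45e-09, about 25% off
print(stable_small)   # very close to 1e-08
```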
https://softwareengineering.stackexchange.com/a/63082

A few years ago I had some spherical geometry that needed to be very accurate, and still fast. 80 bit double on PCs was not cutting it, so I added some types to the program that sorted terms before performing commutative operations. Problem solved.
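
The effect of ordering terms can be shown with a plain running sum. This sketch is not the quoted author's implementation; `running_sum` is a hypothetical helper, written as an explicit loop because the built-in `sum` may itself apply compensated summation to floats in recent Python versions.

```python
import math

def running_sum(values: list[float]) -> float:
    """Plain left-to-right accumulation with no compensation."""
    total = 0.0
    for value in values:
        total += value
    return total

values = [1e16] + [1.0] * 10

unsorted_total = running_sum(values)          # each 1.0 is absorbed by 1e16
sorted_total = running_sum(sorted(values))    # small terms accumulate first
exact_total = math.fsum(values)               # correctly rounded reference

print(unsorted_total == 1e16)                 # True
print(sorted_total == 1e16 + 10.0)            # True
print(exact_total == sorted_total)            # True
```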
