1.3. Floating Point Numbers#

Mathematically, real numbers are continuous and there are uncountably many of them. There is no way to express all real numbers exactly in a discrete system, so representing them accurately in computers is quite challenging. A clever method known as floating point arithmetic was developed and is now the standard way to implement real numbers in modern computers.[1] Since scientific computation relies on the properties of floating point numbers, we need to understand them.[1]

In floating point arithmetic, a real number is expressed in scientific notation such as \(1325.67 \times 10^{12}\). In computer languages, it is usually written as 1325.67E12 or 1325.67e12. The idea is simple: the mantissa and the exponent of the scientific notation are treated separately as two integers. For example, 1325.67E12 = 132567E10 can be expressed with the two integers 132567 and 10. The number of significant figures is determined by the size of the integer expressing the mantissa.
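To see this concretely, the following short sketch (added for illustration, not one of the numbered examples) checks that the two E-notation forms give the same number and uses math.frexp to show how python itself splits a float into a mantissa and an exponent, although in base 2 rather than base 10.

# the two forms of scientific notation describe the same real number
print(1325.67e12 == 132567e10)

# math.frexp splits a float into a mantissa in [0.5, 1) and a base-2 exponent
import math
m, e = math.frexp(1325.67e12)
print('mantissa =', m, ' exponent =', e)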

Real numbers stored in a 32-bit string are known as type float32 or single precision. One bit specifies the \(\pm\) sign, 8 bits store the exponent, and the remaining 23 bits store the mantissa (an additional implicit bit gives 24 bits of precision). The corresponding number of significant figures is \(\log_{10} 2^{24} \approx 7\). The exponent ranges roughly from \(2^{-2^7}=2^{-128}\) to \(2^{2^7-1}=2^{127}\), which is approximately \(10^{-38}\) to \(10^{+38}\). A strange feature of the standard floating point arithmetic is that there are two different zeros, \(+0.0\) and \(-0.0\). This is not a bug. Inside the computer they have distinct bit representations, although they compare as equal.
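The following minimal check (an added sketch, not one of the numbered examples) shows that the two zeros carry different sign bits even though they compare as equal.

import math

pos_zero = +0.0
neg_zero = -0.0
print(pos_zero == neg_zero)          # True: the two zeros compare as equal
print(neg_zero)                      # prints -0.0, so the sign bit is stored
print(math.copysign(1.0, neg_zero))  # -1.0: copysign picks up the negative sign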

Real numbers stored in a 64-bit string are known as type float64 (simply float in python) or double precision. It uses 52 bits for the mantissa (53 bits of precision including the implicit bit) and 11 bits for the exponent, as shown in Fig. 1.1. The largest integer the mantissa can express is \(2^{53}\) = 9,007,199,254,740,992, which corresponds to about 16 significant figures. The exponent ranges roughly from \(2^{-2^{10}} = 2^{-1024} \approx 10^{-308}\) to \(2^{2^{10}-1} = 2^{1023} \approx 10^{308}\).[2][3]
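A quick sketch (added here for illustration) confirms this precision: sys.float_info reports the mantissa size, and beyond \(2^{53}\) consecutive integers can no longer be distinguished in float64.

import sys

print(sys.float_info.mant_dig)            # 53 bits of precision in the mantissa
print(sys.float_info.dig)                 # 15 decimal digits are always reliable
print(float(2**53) == float(2**53 + 1))   # True: the difference of 1 is lost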

Usually, single precision (float32) is not accurate enough for computational physics, so we should use double precision (float64). In the python core, double precision (python float) is the default, and the math package uses the python float. Double precision (float64) is also the default in numpy.[4][5] Unlike the integer types, python float and numpy float64 are effectively the same, so we don't need to worry about conversion between python and numpy.
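As a quick check (an added sketch), numpy.float64 is implemented as a subclass of the python float, so the two can be mixed freely.

import numpy as np

x = 1.0                 # python float
y = np.float64(1.0)     # numpy float64
print(isinstance(y, float))     # True: float64 is a subclass of python float
print(np.array([1.0]).dtype)    # float64 is the default floating point dtype
print(type(x + y))              # mixing them works without explicit conversion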

Some advanced computers are equipped with a special arithmetic engine capable of 128-bit floating point arithmetic. The default size of a floating point number in python is 64 bits and its type is just float, which is good enough for most numerical calculations in physics. In numpy, we can use both float32 and float64 (the default is float64). [2]
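The following short sketch (added for illustration) compares the two precisions in numpy: the single precision value of 1/3 carries only about 7 significant figures, while the double precision value carries about 16.

import numpy as np

x32 = np.float32(1.0) / np.float32(3.0)   # single precision
x64 = np.float64(1.0) / np.float64(3.0)   # double precision
print('float32:', x32)
print('float64:', x64)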

../_images/double-float.png

Fig. 1.1 64-bit string for floating point expression. The last bit is used for the sign and 11 bits from \(b_{52}\) to \(b_{62}\) express the exponent. The remaining 52 bits express the mantissa.#

1.3.1. The range of floating points#

As discussed above, floating point numbers have a finite range determined by the size of the bit string. In most computers, the range is

| Type   | Minimum value            | Maximum value            |
|--------|--------------------------|--------------------------|
| single | 1.175494351E-38          | 3.402823466E+38          |
| double | 2.2250738585072014E-308  | 1.7976931348623158E+308  |

You don’t have to memorize these numbers since python knows them. The following example extracts this information from python.


Example 1.4.1: Range of floating point numbers

Let us try to find the largest and smallest positive numbers in your computer system. The float_info class in the sys package gives the range of the python float, and the same information in numpy can be obtained with np.finfo. Both should output the same values.

# float in python core
import sys

python_fmin = sys.float_info.min
python_fmax = sys.float_info.max

# float in numpy
import numpy as np

numpy_fmin = np.finfo(np.float64).tiny
numpy_fmax = np.finfo(np.float64).max

print("python float type =",type(python_fmax))
print(" numpy float type =",type(numpy_fmax))
print()
print("python smallest positive float =",python_fmin)
print(" numpy smallest positive float =",numpy_fmin)
print()
print("python largest float =",python_fmax)
print(" numpy largest float =",numpy_fmax)
python float type = <class 'float'>
 numpy float type = <class 'numpy.float64'>

python smallest positive float = 2.2250738585072014e-308
 numpy smallest positive float = 2.2250738585072014e-308

python largest float = 1.7976931348623157e+308
 numpy largest float = 1.7976931348623157e+308

1.3.2. Special value “inf”#

If a value exceeds the maximum, python outputs “inf”. Although it is not a real infinity, python treats it as one. Getting inf means your calculation failed due to overflow.


Example 1.4.2: number above fmax

Find the outcome when the result of a calculation is larger than the maximum floating point value.

print("2 x fmax =", 2*fmax)
2 x fmax = inf

1.3.3. Special value “nan”#

If a mathematical operation is undefined, python outputs “nan”, which stands for “not a number”.


Example 1.4.3: Infinity - Infinity

\(\infty - \infty\) is not zero; the result is undefined. If the result of a mathematical operation is undefined, python declares that it is not a number, or nan. Let us compute \(2 \times \infty - \infty\). The floating point expression of \(\infty\) is given by float('inf'), which is the number inf. However, float('inf')*2 - float('inf') is nan, which is not a number.

x = float('inf')
print('x =',x)
print('x*2-x =',x*2-x)
x = inf
x*2-x = nan

1.3.4. Overflow errors#

When the output of a calculation exceeds the maximum floating point value (fmax), normally you get inf. In some cases, python issues an overflow error instead. Usually, an overflow error happens when you pass too large a value to a function; for example, exp(1000) causes an overflow error. Since inf is just a number and not an error, the computation is not interrupted. However, an overflow error is serious and the computation stops there. There is no universal mitigation of overflow errors; you need to find a better algorithm. The following example explains how to evaluate the factorial of a large integer.
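The following minimal sketch (added here, not one of the numbered examples) shows the difference: numpy.exp returns inf with a runtime warning and keeps going, while math.exp raises an overflow error that would stop the computation if not caught.

import math
import numpy as np

print(np.exp(1000))            # inf: numpy warns about overflow but continues
try:
    math.exp(1000)             # raises OverflowError: math range error
except OverflowError as err:
    print('OverflowError:', err)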


Example 1.4.4: Factorial

The factorial of a large integer is astronomically large, for example 1000!. Let us try to compute 100! first. While the python core can deal with arbitrarily large integers, it does not provide mathematical functions. For mathematical computation, the numpy package is a common choice, but it cannot deal with huge integers beyond what we discussed in Section 1.2. Therefore we resort to the math package, which can handle arbitrarily large integers in mathematical computation.

# Here we use math package to compute a large factorial.
import math

# 100! using math factorial function

x=math.factorial(100)
print('100! =',x)
100! = 93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000

It is too large to store in the 64-bit string discussed in Section 1.2, and our brain cannot comprehend such a value. It is better to express it approximately in scientific notation, which in python is the floating point expression.

# converting x=100! to float

print('Approximate value in float =',float(x))
Approximate value in float = 9.332621544394415e+157

So, \(100! \approx 9.33 \times 10^{157}\). It fits in the 64-bit floating point expression. Now, try 1000!

x=math.factorial(1000)
print('1000! =',x)
1000! = 402387260077093773543702433923003985719374864210714632543799910429938512398629020592044208486969404800479988610197196058631666872994808558901323829669944590997424504087073759918823627727188732519779505950995276120874975462497043601418278094646496291056393887437886487337119181045825783647849977012476632889835955735432513185323958463075557409114262417474349347553428646576611667797396668820291207379143853719588249808126867838374559731746136085379534524221586593201928090878297308431392844403281231558611036976801357304216168747609675871348312025478589320767169132448426236131412508780208000261683151027341827977704784635868170164365024153691398281264810213092761244896359928705114964975419909342221566832572080821333186116811553615836546984046708975602900950537616475847728421889679646244945160765353408198901385442487984959953319101723355556602139450399736280750137837615307127761926849034352625200015888535147331611702103968175921510907788019393178114194545257223865541461062892187960223838971476088506276862967146674697562911234082439208160153780889893964518263243671616762179168909779911903754031274622289988005195444414282012187361745992642956581746628302955570299024324153181617210465832036786906117260158783520751516284225540265170483304226143974286933061690897968482590125458327168226458066526769958652682272807075781391858178889652208164348344825993266043367660176999612831860788386150279465955131156552036093988180612138558600301435694527224206344631797460594682573103790084024432438465657245014402821885252470935190620929023136493273497565513958720559654228749774011413346962715422845862377387538230483865688976461927383814900140767310446640259899490222221765904339901886018566526485061799702356193897017860040811889729918311021171229845901641921068884387121855646124960798722908519296819372388642614839657382291123125024186649353143970137428531926649875337218940694281434118520158014123344828015051399694290153483077644569099073152433278288269864602789864321139083506217095002597389863554277196742822248757586765752344220207573630569498825087968928162753848863396909959826280956121450994871701244516461260379029309120889086942028510640182154399457156805941872748998094254742173582401063677404595741785160829230135358081840096996372524230560855903700624271243416909004153690105933983835777939410970027753472000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

which is huge and practically useless. Let us convert it to float.

print('Approximate value in float =',float(x))
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
Cell In[19], line 1
----> 1 print('Approximate value in float =',float(x))

OverflowError: int too large to convert to float

The direct conversion of 1000! to a floating point number is hopeless. We need to find its scientific notation \(1000! \approx a \times 10^{b}\) by hand. In order to find the mantissa \(a\) and the exponent \(b\), first we evaluate \(\log N!\) as follows.

\[\begin{split} \begin{eqnarray} y &=& \log(N!) = \log(1 \cdot 2 \cdot 3 \cdots (N-1) \cdot N) \\ &=& \log(1)+\log(2)+\log(3)+\cdots + \log(N-1)+\log(N) \end{eqnarray} \end{split}\]

where we assume the base is 10, that is \(\log = \log_{10}\). Notice that \(\log 1000 = 3\), so \(\log N!\) is just a sum of small numbers. Once we find \(y\), the factorial can be obtained from \(N! = 10^y\). Next we split \(y\) into the whole number \(k=\lfloor y \rfloor\) and the residual \(\delta=y - \lfloor y \rfloor\). Now we have \(N! = 10^{k+\delta} = 10^\delta \times 10^k\), and thus the mantissa is \(10^\delta\) and the power is \(k\).

import math
x = math.factorial(1000)
y = math.log10(x)
b = math.floor(y)
a = 10**(y-b)

print("power=",b)
print("mantissa=",a)
power= 2567
mantissa= 4.023872600773956

which tells us that \(1000! \approx 4.0239 \times 10^{2567}\). You can see that this number is far bigger than the maximum of float64.
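As a side note, the same power and mantissa can be obtained directly from the sum of logarithms in the equation above, without ever forming the huge integer. The following is a minimal sketch of that approach (added for illustration); it reproduces the result above up to small rounding errors in the sum.

import math

# log10(1000!) evaluated as a sum of small logarithms
y = sum(math.log10(n) for n in range(1, 1001))
b = math.floor(y)       # power
a = 10**(y - b)         # mantissa
print("power=", b)
print("mantissa=", a)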


Last modified on 02/09/2024 by R. Kawai.