1.3. Floating Point Numbers#

Mathematically, real numbers are continuous and there are uncountably many of them. There is no way to express all real numbers exactly in a discrete system, so representing them accurately in computers is quite challenging. A clever method known as floating point arithmetic was developed and is now the standard way to implement real numbers in modern computers.[1] Since scientific computation relies on the properties of floating point numbers, we need to understand them.[1]

In floating point arithmetic, a real number is expressed in scientific notation such as \(1325.67 \times 10^{12}\). In computer languages, it is usually written as 1325.67E12 or 1325.67e12. The idea is simple: the mantissa and the exponent of the scientific notation are treated separately as two integers. For example, 1325.67E12 = 132567E10 can be expressed with the two integers 132567 and 10. The number of significant figures is determined by the size of the integer expressing the mantissa.
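To see this concretely, the following short sketch (added for illustration, not one of the numbered examples) checks that the two E-notation forms give the same number and uses math.frexp to show how python itself splits a float into a mantissa and an exponent, although in base 2 rather than base 10.

# the two forms of scientific notation describe the same real number
print(1325.67e12 == 132567e10)

# math.frexp splits a float into a mantissa in [0.5, 1) and a base-2 exponent
import math
m, e = math.frexp(1325.67e12)
print('mantissa =', m, ' exponent =', e)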

Real numbers stored in a 32-bit string are known as type float32 or single precision. One bit specifies the \(\pm\) sign, 8 bits store the exponent, and the remaining 23 bits store the mantissa (an additional implicit bit gives 24 bits of precision). The corresponding number of significant figures is \(\log_{10} 2^{24} \approx 7\). The exponent ranges roughly from \(2^{-2^7}=2^{-128}\) to \(2^{2^7-1}=2^{127}\), which is approximately \(10^{-38}\) to \(10^{+38}\). A strange feature of the standard floating point arithmetic is that there are two different zeros, \(+0.0\) and \(-0.0\). This is not a bug. Inside the computer they have distinct bit representations, although they compare as equal.
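The following minimal check (an added sketch, not one of the numbered examples) shows that the two zeros carry different sign bits even though they compare as equal.

import math

pos_zero = +0.0
neg_zero = -0.0
print(pos_zero == neg_zero)          # True: the two zeros compare as equal
print(neg_zero)                      # prints -0.0, so the sign bit is stored
print(math.copysign(1.0, neg_zero))  # -1.0: copysign picks up the negative sign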

Real numbers stored in a 64-bit string are known as type float64 (simply float in python) or double precision. It uses 52 bits for the mantissa (53 bits of precision including the implicit bit) and 11 bits for the exponent, as shown in Fig. 1.1. The largest integer the mantissa can express is \(2^{53}\) = 9,007,199,254,740,992, which corresponds to about 16 significant figures. The exponent ranges roughly from \(2^{-2^{10}} = 2^{-1024} \approx 10^{-308}\) to \(2^{2^{10}-1} = 2^{1023} \approx 10^{308}\).[2][3]
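A quick sketch (added here for illustration) confirms this precision: sys.float_info reports the mantissa size, and beyond \(2^{53}\) consecutive integers can no longer be distinguished in float64.

import sys

print(sys.float_info.mant_dig)            # 53 bits of precision in the mantissa
print(sys.float_info.dig)                 # 15 decimal digits are always reliable
print(float(2**53) == float(2**53 + 1))   # True: the difference of 1 is lost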

Usually, single precision (float32) is not accurate enough for computational physics, so we should use double precision (float64). In the python core, double precision (python float) is the default, and the math package uses the python float. Double precision (float64) is also the default in numpy.[4][5] Unlike the integer types, python float and numpy float64 are effectively the same, so we don't need to worry about conversion between python and numpy.
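As a quick check (an added sketch), numpy.float64 is implemented as a subclass of the python float, so the two can be mixed freely.

import numpy as np

x = 1.0                 # python float
y = np.float64(1.0)     # numpy float64
print(isinstance(y, float))     # True: float64 is a subclass of python float
print(np.array([1.0]).dtype)    # float64 is the default floating point dtype
print(type(x + y))              # mixing them works without explicit conversion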

Some advanced computers are equipped with a special arithmetic engine capable of 128-bit floating point arithmetic. The default size of a floating point number in python is 64 bits and its type is just float, which is good enough for most numerical calculations in physics. In numpy, we can use both float32 and float64 (the default is float64). [2]
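The following short sketch (added for illustration) compares the two precisions in numpy: the single precision value of 1/3 carries only about 7 significant figures, while the double precision value carries about 16.

import numpy as np

x32 = np.float32(1.0) / np.float32(3.0)   # single precision
x64 = np.float64(1.0) / np.float64(3.0)   # double precision
print('float32:', x32)
print('float64:', x64)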

../_images/double-float.png

Fig. 1.1 64-bit string for floating point expression. The last bit is used for the sign and 11 bits from \(b_{52}\) to \(b_{62}\) express the exponent. The remaining 52 bits express the mantissa.#

1.3.1. The range of floating points#

As discussed above, floating point numbers have a finite range determined by the size of the bit string. In most computers, the range is

| Type   | Minimum value            | Maximum value            |
|--------|--------------------------|--------------------------|
| single | 1.175494351E-38          | 3.402823466E+38          |
| double | 2.2250738585072014E-308  | 1.7976931348623158E+308  |

You don’t have to memorize these numbers since python knows them. The following example extracts this information from python.


Example 1.4.1: Range of floating point numbers

Let us try to find the largest and smallest positive numbers in your computer system. The float_info class in the sys package gives the range of the python float, and the same information in numpy can be obtained with np.finfo. Both should output the same values.

# float in python core
import sys

python_fmin = sys.float_info.min
python_fmax = sys.float_info.max

# float in numpy
import numpy as np

numpy_fmin = np.finfo(np.float64).tiny
numpy_fmax = np.finfo(np.float64).max

print("python float type =",type(python_fmax))
print(" numpy float type =",type(numpy_fmax))
print()
print("python smallest positive float =",python_fmin)
print(" numpy smallest positive float =",numpy_fmin)
print()
print("python largest float =",python_fmax)
print(" numpy largest float =",numpy_fmax)
python float type = <class 'float'>
 numpy float type = <class 'numpy.float64'>

python smallest positive float = 2.2250738585072014e-308
 numpy smallest positive float = 2.2250738585072014e-308

python largest float = 1.7976931348623157e+308
 numpy largest float = 1.7976931348623157e+308

1.3.2. Special value “inf”#

If a value exceeds the maximum, python outputs “inf”. Although it is not a real infinity, python treats it as one. Getting inf means your calculation failed due to overflow.


Example 1.4.2: number above fmax

Find the outcome when the result of a calculation is larger than the maximum floating point value.

print("2 x fmax =", 2*fmax)
2 x fmax = inf

1.3.3. Special value “nan”#

If a mathematical operation is undefined, python outputs “nan”, which stands for “not a number”.


Example 1.4.3: Infinity - Infinity

\(\infty - \infty\) is not zero; the result is undefined. If the result of a mathematical operation is undefined, python declares that it is not a number, or nan. Let us compute \(2 \times \infty - \infty\). The floating point expression of \(\infty\) is given by float('inf'), which is the number inf. However, float('inf')*2 - float('inf') is nan, which is not a number.

x = float('inf')
print('x =',x)
print('x*2-x =',x*2-x)
x = inf
x*2-x = nan

1.3.4. Overflow errors#

When the output of a calculation exceeds the maximum floating point value (fmax), normally you get inf. In some cases, python issues an overflow error instead. Usually, an overflow error happens when you pass too large a value to a function; for example, exp(1000) causes an overflow error. Since inf is just a number and not an error, the computation is not interrupted. However, an overflow error is serious and the computation stops there. There is no universal mitigation of overflow errors; you need to find a better algorithm. The following example explains how to evaluate the factorial of a large integer.
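The following minimal sketch (added here, not one of the numbered examples) shows the difference: numpy.exp returns inf with a runtime warning and keeps going, while math.exp raises an overflow error that would stop the computation if not caught.

import math
import numpy as np

print(np.exp(1000))            # inf: numpy warns about overflow but continues
try:
    math.exp(1000)             # raises OverflowError: math range error
except OverflowError as err:
    print('OverflowError:', err)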


Example 1.4.4: Factorial

The factorial of a large integer is astronomically large, for example 1000!. Let us try to compute 100! first. While the python core can deal with arbitrarily large integers, it does not provide mathematical functions. For mathematical computation, the numpy package is a common choice, but it cannot deal with huge integers beyond what we discussed in Section 1.2. Therefore we resort to the math package, which can handle arbitrarily large integers in mathematical computation.

# Here we use math package to compute a large factorial.
import math

# 100! using math factorial function

x=math.factorial(100)
print('100! =',x)
100! = 93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000

It is too large to store in the 64-bit string discussed in Section 1.2, and our brain cannot comprehend such a value. It is better to express it approximately in scientific notation, which in python is the floating point expression.

# converting x=100! to float

print('Approximate value in float =',float(x))
Approximate value in float = 9.332621544394415e+157

So, \(100! \approx 9.33 \times 10^{157}\). It fits in the 64-bit floating point expression. Now, try 1000!

x=math.factorial(1000)
print('1000! =',x)
1000! = 402387260077093773543702433923003985719374864210714632543799910429938512398629020592044208486969404800479988610197196058631666872994808558901323829669944590997424504087073759918823627727188732519779505950995276120874975462497043601418278094646496291056393887437886487337119181045825783647849977012476632889835955735432513185323958463075557409114262417474349347553428646576611667797396668820291207379143853719588249808126867838374559731746136085379534524221586593201928090878297308431392844403281231558611036976801357304216168747609675871348312025478589320767169132448426236131412508780208000261683151027341827977704784635868170164365024153691398281264810213092761244896359928705114964975419909342221566832572080821333186116811553615836546984046708975602900950537616475847728421889679646244945160765353408198901385442487984959953319101723355556602139450399736280750137837615307127761926849034352625200015888535147331611702103968175921510907788019393178114194545257223865541461062892187960223838971476088506276862967146674697562911234082439208160153780889893964518263243671616762179168909779911903754031274622289988005195444414282012187361745992642956581746628302955570299024324153181617210465832036786906117260158783520751516284225540265170483304226143974286933061690897968482590125458327168226458066526769958652682272807075781391858178889652208164348344825993266043367660176999612831860788386150279465955131156552036093988180612138558600301435694527224206344631797460594682573103790084024432438465657245014402821885252470935190620929023136493273497565513958720559654228749774011413346962715422845862377387538230483865688976461927383814900140767310446640259899490222221765904339901886018566526485061799702356193897017860040811889729918311021171229845901641921068884387121855646124960798722908519296819372388642614839657382291123125024186649353143970137428531926649875337218940694281434118520158014123344828015051399694290153483077644569099073152433278288269864602789864321139083506217095002597389863554277196742822248757586765752344220207573630569498825087968928162753848863396909959826280956121450994871701244516461260379029309120889086942028510640182154399457156805941872748998094254742173582401063677404595741785160829230135358081840096996372524230560855903700624271243416909004153690105933983835777939410970027753472000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

which is huge and practically useless. Let us convert it to float.

print('Approximate value in float =',float(x))
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
Cell In[19], line 1
----> 1 print('Approximate value in float =',float(x))

OverflowError: int too large to convert to float

The direct conversion of 1000! to a floating point number is hopeless. We need to find its scientific notation \(1000! \approx a \times 10^{b}\) by hand. In order to find the mantissa \(a\) and the exponent \(b\), first we evaluate \(\log N!\) as follows.

\[\begin{split} \begin{eqnarray} y &=& \log(N!) = \log(1 \cdot 2 \cdot 3 \cdots (N-1) \cdot N) \\ &=& \log(1)+\log(2)+\log(3)+\cdots + \log(N-1)+\log(N) \end{eqnarray} \end{split}\]

where we assume the base is 10, that is \(\log = \log_{10}\). Notice that \(\log 1000 = 3\), so \(\log N!\) is just a sum of small numbers. Once we find \(y\), the factorial can be obtained from \(N! = 10^y\). Next we split \(y\) into the whole number \(k=\lfloor y \rfloor\) and the residual \(\delta=y - \lfloor y \rfloor\). Now we have \(N! = 10^{k+\delta} = 10^\delta \times 10^k\), and thus the mantissa is \(10^\delta\) and the power is \(k\).

import math
x = math.factorial(1000)
y = math.log10(x)
b = math.floor(y)
a = 10**(y-b)

print("power=",b)
print("mantissa=",a)
power= 2567
mantissa= 4.023872600773956

which tells us that \(1000! \approx 4.0239 \times 10^{2567}\). You can see that this number is far bigger than the maximum of float64.
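As a side note, the same power and mantissa can be obtained directly from the sum of logarithms in the equation above, without ever forming the huge integer. The following is a minimal sketch of that approach (added for illustration); it reproduces the result above up to small rounding errors in the sum.

import math

# log10(1000!) evaluated as a sum of small logarithms
y = sum(math.log10(n) for n in range(1, 1001))
b = math.floor(y)       # power
a = 10**(y - b)         # mantissa
print("power=", b)
print("mantissa=", a)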


Last modified on 02/09/2024 by R. Kawai.