Numerical Computing in Python

(C) 2016 Steve Phelps

Overview

  • Floating-point representation
  • Arrays and Matrices with numpy
  • Basic plotting with matplotlib
  • Pseudo-random variates with numpy.random

Representing continuous values

  • Digital computers are inherently discrete.

  • Real numbers $x \in R$ cannot always be represented exactly in a digital computer.

  • They are stored in a format called floating-point.

  • IEEE Standard 754 specifies a universal format across different implementations.

    • As always there are deviations from the standard.
  • There are two standard sizes of floating-point numbers: 32-bit and 64-bit.

  • 64-bit numbers are called double precision, are sometimes called double values.

  • IEEE floating-point calculations are performed in hardware on modern computers.

  • How can we represent aribitrary real values using only 32 bits?

Fixed-point verses floating-point

  • One way we could discretise continuous values is to represent them as two integers $x$ and $y$.

  • The final value is obtained by e.g. $r = x + y \times 10^{-5}$.

  • So the number $500.4421$ would be represented as the tuple $x = 500$, $y = 44210$.

  • The exponent $5$ is fixed for all computations.

  • This number represents the precision with which we can represent real values.

  • It corresponds to the where we place we place the decimal point.

  • This scheme is called fixed precision.

  • It is useful in certain circumstances, but suffers from many problems.

  • In practice, we use variable precision, also known as floating point.

Scientific Notation

  • Humans also use a form of floating-point representation.

  • In Scientific notation, all numbers are written in the form $m \times 10^n$.

  • When represented in ASCII, we abbreviate this <m>e<n>, for example 6.72e+11 = $6.72 \times 10^{11}$.

  • The integer $m$ is called the significand or mantissa.

  • The integer $n$ is called the exponent.

  • The integer $10$ is the base.

Scientific Notation in Python

  • Python uses Scientific notation when it displays floating-point numbers:
In [1]:
print 672000000000.0
6.72e+11
  • Note that internally, the value is not represented exactly like this.

  • Scientific notation is a convention for writing or rendering numbers, not representing them digitally.

Floating-point representation

  • Floating point numbers use a base of $2$ instead of $10$.

  • Additionally, the mantissa are exponent are stored in binary.

  • The mantissa uses two's complement to represent positive and negative numbers.

    • One bit is reserved as the sign-bit: 1 for negative, 0 for positive values.
  • The mantissa is normalised, so we assume that it starts with the digit $1$ (which is not stored).

Bias

  • We also need to represent signed exponents.

  • The exponent does not use two's complement.

  • Instead a bias value is subtracted from the stored value to obtain the final stored exponent.

  • Double-precision values use a bias of $1023$, and single-precision uses a bias value of $127$.

  • For example, in double-precision the exponent $1022$ would be encoded as $1$.

  • The stored exponent values $0$ and $1024$ are reserved for special values- discussed later.

Double and single precision formats

The number of bits allocated to represent each integer component of a float is given below:

Format Sign Exponent Mantissa Total
single 1 8 23 32
double 1 11 52 64
  • Python normally works 64-bit precision.

  • numpy allows us to specify the type when storing data in arrays.

  • This is particularly useful for big data where we may need to be careful about the storage requirements of our data-set.

Loss of precision

  • We cannot represent every value in floating-point.

  • Consider single-precision (32-bit).

  • Let's try to represent $4,039,944,879$.

  • As a binary integer we write this:

11110000 11001100 10101010 10101111

  • This already takes up 32-bits.

  • The mantissa only allows us to store 24-bit integers.

  • So we have to round. We store it as:

+1.1110000 11001100 10101011e+31

  • Which gives us

+11110000 11001100 10101011 0000000

$= 4,039,944,960$

Ranges of floating-point values

In single precision arithmetic, we cannot represent the following values:

  • Negative numbers less than $-(2-2^{-23}) \times 2^{127}$

  • Negative numbers greater than $-2^{-149}$

  • Positive numbers less than $2^{-149}$

  • Positive numbers greater than $(2-2^{-23}) \times 2^{127}$

Attempting to represent these numbers results in overflow or underflow.

Effective floating-point range

Format Binary Decimal
single $\pm (2-2^{-23}) \times 2^{127}$ $\approx \pm 10^{38.53}$
double $\pm (2-2^{-52}) \times 2^{1023}$ $\approx \pm 10^{308.25}$

Representing Zero

  • Zero cannot be represented straightforwardly because we assume that all mantissa values start with the digit 1.

  • Zero is stored as a special-case, by setting mantissa and exponent both to zero.

  • The sign-bit can either be set or unset, so there are distinct positive and negative representations of zero.

Zero in Python

In [2]:
x = +0.0
x
Out[2]:
0.0
In [3]:
y = -0.0
y
Out[3]:
-0.0
  • However, these are considered equal:
In [4]:
x == y
Out[4]:
True

Infinity

  • Positive overflow results in a special value of infinity (in Python inf).

  • This is stored with an exponent consiting of all 1s, and a mantissa of all 0s.

  • The sign-bit allows us to differentiate between negative and positive overflow: $-\infty$ and $+\infty$.

  • This allows us to carry on calculating past an overflow event.

Infinity in Python

In [5]:
x = 1e300 * 1e100
x
Out[5]:
inf
In [6]:
x = x + 1
x
Out[6]:
inf

Negative infinity in Python

In [7]:
x > 0
Out[7]:
True
In [8]:
y = -x
y
Out[8]:
-inf
In [9]:
y < x
Out[9]:
True

Not A Number (NaN)

  • Some mathematical operations on real numbers do not map onto real numbers.

  • These results are represented using the special value to NaN which represents "not a (real) number".

  • NaN is represented by an exponent of all 1s, and a non-zero mantissa.

NaN in Python

In [10]:
from numpy import sqrt, inf, isnan, nan
x = sqrt(-1)
x
-c:2: RuntimeWarning: invalid value encountered in sqrt
Out[10]:
nan
In [11]:
y = inf - inf
y
Out[11]:
nan

Comparing nan values in Python

  • Beware of comparing nan values
In [12]:
x == y
Out[12]:
False
  • To test whether a value is nan use the isnan function:
In [13]:
isnan(x)
Out[13]:
True

NaN is not the same as None

  • None represents a missing value.

  • NaN represents an invalid floating-point value.

  • These are fundamentally different entities:

In [14]:
nan is None
Out[14]:
False
In [15]:
isnan(None)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-7d785ead34f4> in <module>()
----> 1 isnan(None)

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Numerical Methods

  • In e.g. simulation models or quantiative analysis we typically repeatedly update numerical values inside long loops.

  • Programs such as these implement numerical algorithms.

  • It is very easy to introduce bugs into code like this.

Numerical stability

  • The round-off error associated with a result can be compounded in a loop.

  • If the error increases as we go round the loop, we say the algorithm is numerically unstable.

  • Mathematicians design numerically stable algorithms using numerical analysis.

Catastrophic Cancellation

  • Suppose we have two real values $x$ and $y = x + \epsilon$.

  • $\epsilon$ is very small and $x$ is very large.

  • $x$ has an exact floating point representation

  • However, because of lack of precision $x$ and $y$ have the same floating point representation.

    • i.e. they are represented as the same sequence of 64-bits

Relative and absolute error

  • The absolute error $r = x - y = (x + \epsilon) - x$.

  • The relative error is $R = r/x = (x + \epsilon)/x - 1 = \epsilon/x$.

  • So the relative error is tiny because $x$ is large relative to $\epsilon$.

  • Ok, but how about the error in computing the difference between these values.

  • If we perform this in floating-point we will get 0.

  • The correct value is $(x + \epsilon) - x = \epsilon$.

  • The relative error is infinite!

Catastrophic Cancellation in Python

In [16]:
x = 3.141592653589793
x
Out[16]:
3.141592653589793
In [17]:
y = 6.022e23
x = (x + y) - y
In [18]:
x
Out[18]:
0.0
  • Avoid subtracting two nearly-equal numbers.

  • Especially in a loop!

  • Better-yet use a well-validated existing implementation in the form of a numerical library.

Importing numpy

  • Functions for numerical computiing are provided by a separate module called numpy.

  • Before we use the numpy module we must import it.

  • By convention, we import numpy using the alias np.

  • Once we have done this we can prefix the functions in the numpy library using the prefix np.

In [19]:
import numpy as np
  • We can now use the functions defined in this package by prefixing them with np.

Arrays

  • Arrays represent a collection of values.

  • In contrast to lists:

    • arrays typically have a fixed length
      • they can be resized, but this involves an expensive copying process.
    • and all values in the array are of the same type.
      • typically we store floating-point values.
  • Like lists:

    • arrays are mutable;
    • we can change the elements of an existing array.

Arrays in numpy

  • Arrays are provided by the numpy module.

  • The function array() creates an array given a list.

In [20]:
import numpy as np
x = np.array([0, 1, 2, 3, 4])
x
Out[20]:
array([0, 1, 2, 3, 4])

Array indexing

  • We can index an array just like a list
In [21]:
x[4]
Out[21]:
4
In [22]:
x[4] = 2
x
Out[22]:
array([0, 1, 2, 3, 2])

Arrays are not lists

  • Although this looks a bit like a list of numbers, it is a fundamentally different type of value:
In [23]:
type(x)
Out[23]:
numpy.ndarray
  • For example, we cannot append to the array:
In [24]:
x.append(5)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-24-2f1e4fff5cec> in <module>()
----> 1 x.append(5)

AttributeError: 'numpy.ndarray' object has no attribute 'append'

Functions over arrays

  • When we use arithmetic operators on arrays, we create a new array with the result of applying the operator to each element.
In [25]:
y = x * 2
y
Out[25]:
array([0, 2, 4, 6, 4])
  • The same goes for numerical functions:
In [26]:
x = np.array([-1, 2, 3, -4])
y = abs(x)
y
Out[26]:
array([1, 2, 3, 4])

Vectorized functions

  • Note that not every function automatically works with arrays.

  • Functions that have been written to work with arrays of numbers are called vectorized functions.

  • Most of the functions in numpy are already vectorized.

  • You can create a vectorized version of any other function using the higher-order function numpy.vectorize().

vectorize example

In [27]:
def myfunc(x):
    if x >= 0.5:
        return x
    else:
        return 0.0
    
fv = np.vectorize(myfunc)
In [28]:
x = np.arange(0, 1, 0.1)
x
Out[28]:
array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])
In [29]:
fv(x)
Out[29]:
array([ 0. ,  0. ,  0. ,  0. ,  0. ,  0.5,  0.6,  0.7,  0.8,  0.9])

Populating Arrays

  • To populate an array with a range of values we use the np.arange() function:
In [30]:
x = np.arange(0, 10)
print x
[0 1 2 3 4 5 6 7 8 9]
  • We can also use floating point increments.
In [31]:
x = np.arange(0, 1, 0.1)
print x
[ 0.   0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9]

Basic Plotting

  • We will use a module called matplotlib to plot some simple graphs.

  • This module has a nested module called pyplot.

  • By convention we import this with the alias plt.

  • This module provides functions which are very similar to MATLAB plotting commands.

In [32]:
import matplotlib.pyplot as plt
%matplotlib inline

y = x*2 + 5
plt.plot(x, y)
Out[32]:
[<matplotlib.lines.Line2D at 0x7f33229deb90>]

Plotting a sine curve

In [33]:
from numpy import pi, sin

x = np.arange(0, 2*pi, 0.01)
y = sin(x)
plt.plot(x, y)
Out[33]:
[<matplotlib.lines.Line2D at 0x7f33228e8d50>]

Multi-dimensional data

  • Numpy arrays can hold multi-dimensional data.

  • To create a multi-dimensional array, we can pass a list of lists to the array() function:

In [34]:
import numpy as np

x = np.array([[1,2], [3,4]])
x
Out[34]:
array([[1, 2],
       [3, 4]])

Arrays containing arrays

  • A multi-dimensional array is an array of an arrays.

  • The outer array holds the rows.

  • Each row is itself an array:

In [35]:
x[0]
Out[35]:
array([1, 2])
In [36]:
x[1]
Out[36]:
array([3, 4])
  • So the element in the second row, and first column is:
In [37]:
x[1][0]
Out[37]:
3

Matrices

  • We can create a matrix from a multi-dimensional array.
In [38]:
M = np.matrix(x)
M
Out[38]:
matrix([[1, 2],
        [3, 4]])

Plotting multi-dimensional with matrices

  • If we supply a matrix to plot() then it will plot the y-values taken from the columns of the matrix (notice the transpose in the example below).
In [39]:
from numpy import pi, sin, cos

x = np.arange(0, 2*pi, 0.01)
y = sin(x)
ax = plt.plot(x, np.matrix([sin(x), cos(x)]).T)

Performance

  • When we use numpy matrices in Python the corresponding functions are linked with libraries written in C and FORTRAN.

  • For example, see the BLAS (Basic Linear Algebra Subprograms) library.

  • These libraries are very fast, and can be configured so that operations are performed in parallel on multiple CPU cores, or GPU hardware.

Matrix Operators

  • Once we have a matrix, we can perform matrix computations.

  • To compute the transpose and inverse use the T and I attributes:

To compute the transpose $M^{T}$

In [40]:
M.T
Out[40]:
matrix([[1, 3],
        [2, 4]])

To compute the inverse $M^{-1}$

In [41]:
M.I
Out[41]:
matrix([[-2. ,  1. ],
        [ 1.5, -0.5]])

Matrix Dimensions

  • The total number of elements, and the dimensions of the array:
In [42]:
M.size
Out[42]:
4
In [43]:
M.shape
Out[43]:
(2, 2)
In [44]:
len(M.shape)
Out[44]:
2

Creating Matrices from strings

  • We can also create arrays directly from strings, which saves some typing:
In [45]:
I2 = np.matrix('2 0; 0 2')
I2
Out[45]:
matrix([[2, 0],
        [0, 2]])
  • The semicolon starts a new row.

Matrix Multiplication

Now that we have two matrices, we can perform matrix multiplication:

In [46]:
M * I2
Out[46]:
matrix([[2, 4],
        [6, 8]])

Matrix Indexing

In [47]:
M[:,1]
Out[47]:
matrix([[2],
        [4]])

Slices are references

  • If we use this is an assignment, we create a reference to the sliced elements, not a copy.
In [48]:
V = M[:,1]  # This does not make a copy of the elements!
V
Out[48]:
matrix([[2],
        [4]])
In [49]:
M[0,1] = -2
V
Out[49]:
matrix([[-2],
        [ 4]])

Copying matrices and vectors

  • To copy a matrix, or a slice of its elements, use the function np.copy():
In [50]:
M = np.matrix('1 2; 3 4')
V = np.copy(M[:,1])  # This does copy the elements.
V
Out[50]:
array([[2],
       [4]])
In [51]:
M[0,1] = -2
V
Out[51]:
array([[2],
       [4]])

Sums

One way we could sum a vector or matrix is to use a for loop.

In [52]:
vector = np.arange(0.0, 100.0, 10.0)
vector
Out[52]:
array([  0.,  10.,  20.,  30.,  40.,  50.,  60.,  70.,  80.,  90.])
In [53]:
result = 0.0
for x in vector:
    result = result + x
result
Out[53]:
450.0
  • This is not the most efficient way to compute a sum.

Efficient sums

  • Instead of using a for loop, we can use a numpy function sum().

  • This function is written in the C language, and is very fast.

In [54]:
vector = np.array([0, 1, 2, 3, 4])
print np.sum(vector)
10

Summing rows and columns

  • When dealing with multi-dimensional data, the 'sum()' function has a named-argument axis which allows us to specify whether to sum along, each rows or columns.
In [55]:
matrix = np.matrix('1 2 3; 4 5 6; 7 8 9')
print matrix
[[1 2 3]
 [4 5 6]
 [7 8 9]]
  • To sum along rows:
In [56]:
np.sum(matrix, axis=0)
Out[56]:
matrix([[12, 15, 18]])
  • To sum along columns:
In [57]:
np.sum(matrix, axis=1)
Out[57]:
matrix([[ 6],
        [15],
        [24]])

Cumulative sums

  • Suppose we want to compute $y_n = \sum_{i=1}^{n} x_i$ where $\mathbf{x}$ is a vector.
In [58]:
import numpy as np
x = np.array([0, 1, 2, 3, 4])
y = np.cumsum(x)
print y
[ 0  1  3  6 10]

Cumulative sums along rows and columns

In [59]:
x = np.matrix('1 2 3; 4 5 6; 7 8 9')
print x
[[1 2 3]
 [4 5 6]
 [7 8 9]]
In [60]:
y = np.cumsum(x)
np.cumsum(x, axis=0)
Out[60]:
matrix([[ 1,  2,  3],
        [ 5,  7,  9],
        [12, 15, 18]])
In [61]:
np.cumsum(x, axis=1)
Out[61]:
matrix([[ 1,  3,  6],
        [ 4,  9, 15],
        [ 7, 15, 24]])

Cumulative products

  • Similarly we can compute $y_n = \Pi_{i=1}^{n} x_i$ using cumprod():
In [62]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
np.cumprod(x)
Out[62]:
array([  1,   2,   6,  24, 120])
  • We can compute cummulative products along rows and columns using the axis parameter, just as with the cumsum() example.

Generating (pseudo) random numbers

  • The nested module numpy.random contains functions for generating random numbers from different probability distributions.
In [63]:
from numpy.random import normal, uniform, exponential, randint
  • Suppose that we have a random variable $\epsilon \sim N(0, 1)$.

  • In Python we can draw from this distribution like so:

In [64]:
epsilon = normal()
print epsilon
-0.906592753436
  • If we execute another call to the function, we will make a new draw from the distribution:
In [65]:
epsilon = normal()
print epsilon
-0.0841335075421

Pseudo-random numbers

  • Strictly speaking, these are not random numbers.

  • They rely on an initial state value called the seed.

  • If we know the seed, then we can predict with total accuracy the rest of the sequence, given any "random" number.

  • Nevertheless, statistically they behave like independently and identically-distributed values.

    • Statistical tests for correlation and auto-correlation give insignificant results.
  • For this reason they called pseudo-random numbers.

  • The algorithms for generating them are called Pseudo-Random Number Generators (PRNGs).

  • Some applications, such as cryptography, require genuinely unpredictable sequences.

    • never use a standard PRNG for these applications!

Managing seed values

  • In some applications we need to reliably reproduce the same sequence of pseudo-random numbers that were used.

  • We can specify the seed value at the beginning of execution to achieve this.

  • Use the function seed() in the numpy.random module.

Setting the seed

In [66]:
from numpy.random import seed

seed(5)
In [67]:
normal()
Out[67]:
0.44122748688504143
In [68]:
normal()
Out[68]:
-0.33087015189408764
In [69]:
seed(5)
In [70]:
normal()
Out[70]:
0.44122748688504143
In [71]:
normal()
Out[71]:
-0.33087015189408764

Drawing multiple variates

  • To generate more than number, we can specify the size parameter:
In [72]:
normal(size=10)
Out[72]:
array([ 2.43077119, -0.25209213,  0.10960984,  1.58248112, -0.9092324 ,
       -0.59163666,  0.18760323, -0.32986996, -1.19276461, -0.20487651])
  • If you are generating very many variates, this will be much faster than using a for loop
  • We can also specify more than one dimension:
In [73]:
normal(size=(5,5))
Out[73]:
array([[-0.35882895,  0.6034716 , -1.66478853, -0.70017904,  1.15139101],
       [ 1.85733101, -1.51117956,  0.64484751, -0.98060789, -0.85685315],
       [-0.87187918, -0.42250793,  0.99643983,  0.71242127,  0.05914424],
       [-0.36331088,  0.00328884, -0.10593044,  0.79305332, -0.63157163],
       [-0.00619491, -0.10106761, -0.05230815,  0.24921766,  0.19766009]])

Histograms

  • We can plot a histograms of randomly-distributed data using the hist() function from matplotlib:
In [74]:
import matplotlib.pyplot as plt
%matplotlib inline

data = normal(size=10000)
ax = plt.hist(data)

Computing histograms as matrices

  • The function histogram() in the numpy module will count frequencies into bins and return the result as a 2-dimensional array.
In [75]:
import numpy as np
np.histogram(data)
Out[75]:
(array([  23,  136,  618, 1597, 2626, 2635, 1620,  599,  130,   16]),
 array([-3.59780883, -2.87679609, -2.15578336, -1.43477063, -0.71375789,
        0.00725484,  0.72826758,  1.44928031,  2.17029304,  2.89130578,
        3.61231851]))

Summary statistics

  • We can compute the summary statistics of a sample of values using the numpy functions mean() and var() to compute the sample mean $\bar{X}$ and sample variance $\sigma_{X}^2$ .
In [76]:
np.mean(data)
Out[76]:
-0.00045461080333495795
In [77]:
np.var(data)
Out[77]:
1.0016048722546327
  • These functions also have an axis parameter to compute mean and variances of columns or rows of a multi-dimensional data-set.

Summary statistics with nan values

  • If the data contains nan values, then the summary statistics will also be nan.
In [78]:
from numpy import nan
import numpy as np
data = np.array([1, 2, 3, 4, nan])
np.mean(data)
Out[78]:
nan
  • To omit nan values from the calculation, use the functions nanmean() and nanvar():
In [79]:
np.nanmean(data)
Out[79]:
2.5

Discrete random numbers

  • The randint() function in numpy.random can be used to draw from a uniform discrete probability distribution.

  • It takes two parameters: the low value (inclusive), and the high value (exclusive).

  • So to simulate one roll of a die, we would use the following Python code.

In [80]:
die_roll = randint(0, 6) + 1
die_roll
Out[80]:
4
  • Just as with the normal() function, we can generate an entire sequence of values.

  • To simulate a Bernoulli process with $n=20$ trials:

In [81]:
bernoulli_trials = randint(0, 2, size = 20)
bernoulli_trials
Out[81]:
array([1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0])

Acknowledgements

The earlier sections of this notebook were adapted from an article on floating-point numbers written by Steve Hollasch.