(c) 2019 Steve Phelps
Topics: numpy, matplotlib, numpy.random
Digital computers are inherently discrete.
Real numbers $x \in \mathbb{R}$ cannot always be represented exactly in a digital computer.
They are stored in a format called floating-point.
IEEE Standard 754 specifies a universal format across different implementations.
There are two standard sizes of floating-point numbers: 32-bit and 64-bit.
64-bit numbers are called double precision, and are sometimes referred to as doubles.
IEEE floating-point calculations are performed in hardware on modern computers.
How can we represent arbitrary real values using only 32 bits?
One way we could discretise continuous values is to represent them as two integers $x$ and $y$.
The final value is then obtained as, for example, $r = x + y \times 10^{-5}$.
So the number $500.4421$ would be represented as the tuple $x = 500$, $y = 44210$.
The exponent $5$ is fixed for all computations.
This number represents the precision with which we can represent real values.
It corresponds to where we place the decimal point.
This scheme is called fixed precision.
It is useful in certain circumstances, but suffers from many problems, in particular it can only represent a very limited range of values.
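For illustration, here is a minimal sketch of this fixed-point scheme in Python (the helper names to_fixed and from_fixed are illustrative, not a standard API, and for simplicity only non-negative values are handled):
# Encode a non-negative real number as the pair (x, y) described above,
# with 5 decimal digits of precision.
def to_fixed(r):
    x = int(r)                     # integer part
    y = round((r - x) * 10 ** 5)   # fractional part in units of 10^-5
    return x, y

# Decode the pair back into a single value: r = x + y * 10^-5.
def from_fixed(x, y):
    return x + y * 10 ** -5

x, y = to_fixed(500.4421)
print(x, y)              # 500 44210
print(from_fixed(x, y))  # 500.4421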
In practice, we use variable precision, also known as floating point.
Humans also use a form of floating-point representation.
In scientific notation, all numbers are written in the form $m \times 10^n$.
When represented in ASCII, we abbreviate this as <m>e<n>; for example, 6.72e+11 $= 6.72 \times 10^{11}$.
The number $m$ is called the significand or mantissa.
The integer $n$ is called the exponent.
The integer $10$ is the base.
print(672000000000000000.0)
6.72e+17
Note that internally, the value is not represented exactly like this.
Scientific notation is a convention for writing or rendering numbers, not representing them digitally.
Floating-point numbers use a base of $2$ instead of $10$.
We therefore represent floating-point numbers as $m \times 2^e$.
The integers $m$ (mantissa) and $e$ (exponent) are stored in binary.
The sign of the number is held in a separate sign bit, so the mantissa itself represents an unsigned magnitude.
The mantissa is normalised, so we assume that it starts with the digit $1$ (which is not stored).
We also need to represent signed exponents.
The exponent does not use two's complement.
Instead a bias value is subtracted from the stored exponent ($s$) to obtain the final value ($e$).
Double-precision values use a bias of $b = 1023$, and single-precision uses a bias value of $b = 127$.
The actual exponent is given by $e = s - b$ where $s$ is the stored exponent.
The stored exponent values consisting of all 0s and all 1s ($s = 0$ and $s = 2047$ in double precision; $s = 0$ and $s = 255$ in single precision) are reserved for special values, discussed later.
The stored exponent $s$ is represented in binary without using a sign bit.
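We can inspect these fields directly in Python. The following sketch uses the standard struct module to view a double as its raw 64 bits and extract the sign, stored exponent, and mantissa (the function name fields is ours, not a standard API):
import struct

def fields(x):
    # Reinterpret the 8 bytes of a double as a 64-bit unsigned integer.
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign = bits >> 63                        # 1 bit
    stored_exponent = (bits >> 52) & 0x7FF   # 11 bits: s
    mantissa = bits & ((1 << 52) - 1)        # 52 bits (leading 1 not stored)
    return sign, stored_exponent, mantissa

sign, s, m = fields(1.0)
print(sign, s, s - 1023, m)   # 0 1023 0 0, i.e. 1.0 = +1.0 x 2^0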
The number of bits allocated to represent each integer component of a float is given below:
Format | Sign | Exponent | Mantissa | Total |
---|---|---|---|---|
single | 1 | 8 | 23 | 32 |
double | 1 | 11 | 52 | 64 |
By default, Python uses 64-bit precision.
We can specify alternative precision by using the numpy numeric data types.
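For example, np.float32 gives a single-precision value, and converting it back to double precision makes the extra rounding error visible:
import numpy as np

x64 = np.float64(0.1)   # double precision (the Python default)
x32 = np.float32(0.1)   # single precision
print(x64)              # 0.1
print(np.float64(x32))  # 0.10000000149011612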
We cannot represent every value in floating-point.
Consider single-precision (32-bit).
Let's try to represent $4,039,944,879$.
11110000 11001100 10101010 10101111
This already takes up 32 bits.
The mantissa only allows us to store 24-bit integers.
So we have to round. We store it as:
+1.1110000 11001100 10101011 $\times 2^{31}$
+11110000 11001100 10101011 00000000
$= 4,039,944,960$
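We can check this rounding in Python by converting the integer to single precision and back:
import numpy as np

n = 4039944879
print(int(np.float32(n)))   # 4039944960, the nearest single-precision value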
In single precision arithmetic, we cannot represent the following values:
- Negative numbers less than $-(2-2^{-23}) \times 2^{127}$
- Negative numbers greater than $-2^{-149}$
- Positive numbers less than $2^{-149}$
- Positive numbers greater than $(2-2^{-23}) \times 2^{127}$
Attempting to represent these numbers results in overflow or underflow.
Format | Largest magnitude | Decimal approximation |
---|---|---|
single | $\pm (2-2^{-23}) \times 2^{127}$ | $\approx \pm 10^{38.53}$ |
double | $\pm (2-2^{-52}) \times 2^{1023}$ | $\approx \pm 10^{308.25}$ |
import sys
sys.float_info.max
1.7976931348623157e+308
sys.float_info.min
2.2250738585072014e-308
sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)
With a fixed number of bits, we have to choose between representing a wide range of magnitudes, or representing values to a high precision.
These are conflicting objectives: every bit allocated to the exponent widens the range, but leaves one fewer bit of precision in the mantissa.
Floating-point addresses this dilemma by allowing the precision to vary ("float") according to the magnitude of the number we are trying to represent.
Floating-point numbers are unevenly spaced over the real number line.
The precision decreases as we increase the magnitude.
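The function np.spacing() returns the gap between a value and the next representable float, so we can see the spacing grow with magnitude:
import numpy as np

print(np.spacing(1.0))    # ~2.2e-16
print(np.spacing(1e6))    # ~1.2e-10
print(np.spacing(1e12))   # ~1.2e-04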
Zero cannot be represented straightforwardly because we assume that all mantissa values start with the digit 1.
Zero is stored as a special-case, by setting mantissa and exponent both to zero.
The sign-bit can either be set or unset, so there are distinct positive and negative representations of zero.
x = +0.0
x
0.0
y = -0.0
y
-0.0
x == y
True
Positive overflow results in a special value of infinity (in Python, inf).
This is stored with an exponent consisting of all 1s, and a mantissa of all 0s.
The sign-bit allows us to differentiate between negative and positive overflow: $-\infty$ and $+\infty$.
This allows us to carry on calculating past an overflow event.
x = 1e300 * 1e100
x
inf
x = x + 1
x
inf
x > 0
True
y = -x
y
-inf
y < x
True
Some mathematical operations on real numbers do not map onto real numbers.
These results are represented using the special value NaN, which stands for "not a (real) number".
NaN is represented by an exponent of all 1s, and a non-zero mantissa.
NaN in Python
from numpy import sqrt, inf, isnan, nan
x = sqrt(-1)
x
/home/sphelps/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in sqrt
nan
y = inf - inf
y
nan
nan values in Python
nan values are not equal to each other:
x == y
False
To test whether a value is nan, use the isnan function:
isnan(x)
True
NaN is not the same as None
None represents a missing value.
NaN represents an invalid floating-point value.
These are fundamentally different entities:
nan is None
False
isnan(None)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-4a0142ec4134> in <module>()
----> 1 isnan(None)
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
x = 0.1 + 0.2
x == 0.3
False
x
0.30000000000000004
Consider a floating-point number $x_{fp}$ which represents a real number $x \in \mathbb{R}$.
In general, we cannot precisely represent the real number; that is $x_{fp} \neq x$.
The absolute error $r$ is $r = x - x_{fp}$.
The relative error $R$ is: $$R = \frac{x - x_{fp}}{x}$$
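For example, treating $0.3$ as the intended real value and $0.1 + 0.2$ as its floating-point approximation:
x = 0.3             # the intended real value
x_fp = 0.1 + 0.2    # the floating-point result
r = x - x_fp        # absolute error
R = (x - x_fp) / x  # relative error
print(r)            # ~ -5.6e-17
print(R)            # ~ -1.9e-16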
In simulation models or quantitative analysis, for example, we typically update numerical values repeatedly inside long loops.
Programs such as these implement numerical algorithms.
It is very easy to introduce bugs into code like this.
The round-off error associated with a result can be compounded in a loop.
If the error increases as we go round the loop, we say the algorithm is numerically unstable.
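A simple illustration: repeatedly adding $0.1$, which has no exact binary representation, accumulates round-off error over the loop:
total = 0.0
for _ in range(10000):
    total = total + 0.1   # each addition introduces a tiny round-off error
print(total)              # 1000.0000000001588, not exactly 1000
print(total == 1000.0)    # False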
Suppose we have two real values $x$, and $y = x + \epsilon$.
$\epsilon$ is very small and $x$ is very large.
$x$ has an exact floating-point representation.
However, because of the limited precision, $x$ and $y$ have the same floating-point representation; i.e. they are represented by the same sequence of 64 bits.
Consider what happens when we compute $y - x$ in floating-point.
Catastrophic cancellation results in a very large relative error.
If we calculate $y - x$ in floating-point we will obtain the result 0.
The correct value is $(x + \epsilon) - x = \epsilon$.
The relative error is $$R = \frac{\epsilon - 0}{\epsilon} = 1.$$
That is, the relative error is $100\%$.
This can result in catastrophe.
x = 3.141592653589793
x
3.141592653589793
y = 6.022e23
x = (x + y) - y
x
0.0
z = x + y
z
6.022e+23
The above result is still inaccurate with an absolute error $r \approx \pi$.
However, the relative error is $R \approx \pi / (6.022 \times 10^{23}) \approx 5 \times 10^{-24}$, which is tiny.
You can hardly ever eliminate absolute rounding error when using floating-point.
The best we can do is to take steps to minimise error, and prevent it from increasing as the calculation progresses.
Cancellation can be catastrophic, because it can greatly increase the relative error in your calculation.
Avoid subtracting two nearly-equal numbers, especially inside a loop!
Better yet, use a well-validated existing implementation in the form of a numerical library.
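For example, the Python standard library provides math.fsum(), a well-validated compensated summation routine:
import math

values = [0.1] * 10000
print(sum(values))        # 1000.0000000001588: naive left-to-right summation
print(math.fsum(values))  # 1000.0: error-compensated summation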
import numpy as np
Arrays represent a collection of values.
In contrast to lists: arrays have a fixed size, and every element must have the same type.
Like lists: elements can be accessed by index and modified in place.
Arrays in numpy
Arrays are provided by the numpy module.
The function array() creates an array given a list.
import numpy as np
x = np.array([0, 1, 2, 3, 4])
x
array([0, 1, 2, 3, 4])
x[4]
4
x[4] = 2
x
array([0, 1, 2, 3, 2])
type(x)
numpy.ndarray
x.append(5)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-7e52d4acf950> in <module>()
----> 1 x.append(5)
AttributeError: 'numpy.ndarray' object has no attribute 'append'
We can create a sequence of numbers using the np.arange() function:
x = np.arange(0, 10)
print(x)
[0 1 2 3 4 5 6 7 8 9]
x = np.arange(0, 1, 0.1)
print(x)
[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
y = x * 2
y
array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8])
x = np.array([-1, 2, 3, -4])
y = abs(x)
y
array([1, 2, 3, 4])
Note that not every function automatically works with arrays.
Functions that have been written to work with arrays of numbers are called vectorized functions.
Most of the functions in numpy are already vectorized.
You can create a vectorized version of any other function using the higher-order function numpy.vectorize().
A vectorize example
def myfunc(x):
    if x >= 0.5:
        return x
    else:
        return 0.0
fv = np.vectorize(myfunc)
x = np.arange(0, 1, 0.1)
x
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
fv(x)
array([0. , 0. , 0. , 0. , 0. , 0.5, 0.6, 0.7, 0.8, 0.9])
Because of finite precision we need to take great care when comparing floating-point values.
The numpy function allclose() can be used to test equality of floating-point numbers within a relative tolerance.
It is a vectorized function so it will work with arrays as well as single floating-point values.
x = 0.1 + 0.2
y = 0.3
x == y
False
np.allclose(x, y)
True
Plotting with matplotlib
We will use a module called matplotlib to plot some simple graphs.
This module has a nested module called pyplot.
By convention we import this with the alias plt.
This module provides functions which are very similar to MATLAB plotting commands.
import matplotlib.pyplot as plt
x = np.arange(0, 1, 0.1)
y = x*2 + 5
plt.plot(x, y)
plt.xlabel('$x$')
plt.ylabel('$y = 2x + 5$')
plt.title('Linear plot')
plt.show()
from numpy import pi, sin
x = np.arange(0, 2*pi, 0.01)
y = sin(x)
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()
Numpy arrays can hold multi-dimensional data.
To create a multi-dimensional array, we can pass a list of lists to the array() function:
import numpy as np
x = np.array([[1,2], [3,4]])
x
array([[1, 2], [3, 4]])
A multi-dimensional array is an array of arrays.
The outer array holds the rows.
Each row is itself an array:
x[0]
array([1, 2])
x[1]
array([3, 4])
x[1][0]
3
M = np.matrix(x)
M
matrix([[1, 2], [3, 4]])
If we pass a matrix to plot(), then it will plot the y-values taken from the columns of the matrix (notice the transpose in the example below).
from numpy import pi, sin, cos
x = np.arange(0, 2*pi, 0.01)
y = sin(x)
ax = plt.plot(x, np.matrix([sin(x), cos(x)]).T)
plt.show()
When we use numpy matrices in Python, the corresponding functions are linked with libraries written in C and FORTRAN.
For example, see the BLAS (Basic Linear Algebra Subprograms) library.
These libraries are very fast.
Vectorised code can be more easily ported to frameworks like TensorFlow so that operations are performed in parallel using GPU hardware.
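As a rough illustration (timings vary by machine), we can compare a pure Python loop with the vectorized np.sum():
import time
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Sum using a pure Python loop.
start = time.perf_counter()
total = 0.0
for v in x:
    total += v
loop_time = time.perf_counter() - start

# Sum using the vectorized C implementation.
start = time.perf_counter()
total = np.sum(x)
vectorized_time = time.perf_counter() - start

print(loop_time, vectorized_time)  # the vectorized version is typically far faster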
To compute the transpose $M^{T}$:
M.T
matrix([[1, 3], [2, 4]])
To compute the inverse $M^{-1}$:
M.I
matrix([[-2. , 1. ], [ 1.5, -0.5]])
M.size
4
M.shape
(2, 2)
len(M.shape)
2
I2 = np.matrix('2 0; 0 2')
I2
matrix([[2, 0], [0, 2]])
Now that we have two matrices, we can perform matrix multiplication:
M * I2
matrix([[2, 4], [6, 8]])
M[:,1]
matrix([[2], [4]])
V = M[:,1] # This does not make a copy of the elements!
V
matrix([[2], [4]])
M[0,1] = -2
V
matrix([[-2], [ 4]])
To copy the elements, use np.copy():
M = np.matrix('1 2; 3 4')
V = np.copy(M[:,1]) # This does copy the elements.
V
array([[2], [4]])
M[0,1] = -2
V
array([[2], [4]])
One way we could sum a vector or matrix is to use a for loop.
vector = np.arange(0.0, 100.0, 10.0)
vector
array([ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90.])
result = 0.0
for x in vector:
    result = result + x
result
450.0
Instead of using a for loop, we can use the numpy function sum().
This function is written in the C language, and is very fast.
vector = np.array([0, 1, 2, 3, 4])
print(np.sum(vector))
10
The sum() function has an optional parameter axis which allows us to specify whether to sum along rows or columns.
matrix = np.matrix('1 2 3; 4 5 6; 7 8 9')
print(matrix)
[[1 2 3] [4 5 6] [7 8 9]]
np.sum(matrix, axis=0)
matrix([[12, 15, 18]])
np.sum(matrix, axis=1)
matrix([[ 6], [15], [24]])
import numpy as np
x = np.array([0, 1, 2, 3, 4])
y = np.cumsum(x)
print(y)
[ 0 1 3 6 10]
x = np.matrix('1 2 3; 4 5 6; 7 8 9')
print(x)
[[1 2 3] [4 5 6] [7 8 9]]
y = np.cumsum(x)
np.cumsum(x, axis=0)
matrix([[ 1, 2, 3], [ 5, 7, 9], [12, 15, 18]])
np.cumsum(x, axis=1)
matrix([[ 1, 3, 6], [ 4, 9, 15], [ 7, 15, 24]])
We can compute cumulative products using the cumprod() function:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
np.cumprod(x)
array([ 1, 2, 6, 24, 120])
The cumprod() function also takes an axis parameter, just as with the cumsum() example.
The module numpy.random contains functions for generating random numbers from different probability distributions.
from numpy.random import normal, uniform, exponential, randint
Suppose that we have a random variable $\epsilon \sim N(0, 1)$.
In Python we can draw from this distribution like so:
epsilon = normal()
epsilon
0.1465312427787133
epsilon = normal()
epsilon
-2.3135050810408773
Strictly speaking, these are not random numbers.
They rely on an initial state value called the seed.
If we know the seed, then we can predict with total accuracy every subsequent "random" number in the sequence.
Nevertheless, statistically they behave like independent and identically-distributed values.
For this reason they are called pseudo-random numbers.
The algorithms for generating them are called Pseudo-Random Number Generators (PRNGs).
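To make this concrete, here is a minimal sketch of one classic PRNG algorithm, a linear congruential generator; the constants are the well-known Numerical Recipes values, and numpy itself uses a different, higher-quality algorithm (the Mersenne Twister):
# A linear congruential generator: each state is a deterministic
# function of the previous one, so the seed fixes the whole sequence.
def lcg(seed, n, a=1664525, c=1013904223, m=2 ** 32):
    state = seed
    for _ in range(n):
        state = (a * state + c) % m
        yield state / m   # scale to the interval [0, 1)

print(list(lcg(seed=5, n=3)))
print(list(lcg(seed=5, n=3)))  # the same seed reproduces the same sequence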
Some applications, such as cryptography, require genuinely unpredictable sequences.
In some applications we need to reliably reproduce the same sequence of pseudo-random numbers that was used previously.
We can specify the seed value at the beginning of execution to achieve this.
Use the function seed() in the numpy.random module.
from numpy.random import seed
seed(5)
normal()
0.44122748688504143
normal()
-0.33087015189408764
seed(5)
normal()
0.44122748688504143
normal()
-0.33087015189408764
We can generate many values in a single call using the size parameter:
normal(size=10)
array([ 2.43077119, -0.25209213, 0.10960984, 1.58248112, -0.9092324 , -0.59163666, 0.18760323, -0.32986996, -1.19276461, -0.20487651])
normal(size=(5,5))
array([[-0.35882895, 0.6034716 , -1.66478853, -0.70017904, 1.15139101], [ 1.85733101, -1.51117956, 0.64484751, -0.98060789, -0.85685315], [-0.87187918, -0.42250793, 0.99643983, 0.71242127, 0.05914424], [-0.36331088, 0.00328884, -0.10593044, 0.79305332, -0.63157163], [-0.00619491, -0.10106761, -0.05230815, 0.24921766, 0.19766009]])
We can plot a histogram of the generated values using the hist() function from matplotlib:
import matplotlib.pyplot as plt
data = normal(size=10000)
ax = plt.hist(data)
plt.title('Histogram of normally distributed data ($n=10^4$)')
plt.show()
The function histogram() in the numpy module will count frequencies into bins and return the result as a tuple of two arrays: the counts and the bin edges.
import numpy as np
np.histogram(data)
(array([ 23, 136, 618, 1597, 2626, 2635, 1620, 599, 130, 16]), array([-3.59780883, -2.87679609, -2.15578336, -1.43477063, -0.71375789, 0.00725484, 0.72826758, 1.44928031, 2.17029304, 2.89130578, 3.61231851]))
np.mean(data)
-0.00045461080333497925
np.var(data)
1.0016048722546331
We can also use the axis parameter to compute the means and variances of the columns or rows of a multi-dimensional data-set.
nan values
If the data contains nan values, then the descriptive statistics will also be nan.
from numpy import nan
import numpy as np
data = np.array([1, 2, 3, 4, nan])
np.mean(data)
nan
To exclude nan values from the calculation, use the functions nanmean() and nanvar():
np.nanmean(data)
2.5
The randint() function in numpy.random can be used to draw from a uniform discrete probability distribution.
It takes two parameters: the low value (inclusive), and the high value (exclusive).
So to simulate one roll of a die, we would use the following Python code.
die_roll = randint(0, 6) + 1
die_roll
4
Just as with the normal() function, we can generate an entire sequence of values.
To simulate a Bernoulli process with $n=20$ trials:
bernoulli_trials = randint(0, 2, size = 20)
bernoulli_trials
array([1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0])
Acknowledgements
The early sections of this notebook were adapted from an online article by Steve Hollasch.