Scientific Python Quickstart

ANU

This is a fast-paced, hands-on introduction to scientific computing with Python, contained in a Jupyter notebook. The main focus will be on introducing Python's four most important scientific libraries: NumPy, SciPy, Pandas and Matplotlib.

If you don't know how to use this notebook, you should first work through this page.

A slower, more detailed and more systematic treatment of Python for scientific applications can be found at quant-econ.net. But this notebook is a good place to start for those who like to learn by doing.

Here's some information on the version of Python that I'm using:

In [241]:
import sys
print(sys.version)
3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Dec  7 2015, 11:16:01) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]

Basic NumPy

Perhaps the single most important scientific library for Python is NumPy. NumPy provides foundational data structures and routines on which many other libraries rely.

In [242]:
import numpy as np     # Import library and give it alias np
print(np.__version__)  #  The version I'm using
1.10.2

NumPy defines a basic data type called an array (actually a numpy.ndarray)

In [243]:
a = np.zeros(3)     # Create an array of zeros
a                   # Print a
Out[243]:
array([ 0.,  0.,  0.])
In [244]:
type(a)
Out[244]:
numpy.ndarray

Note that array data must be homogeneous

The most important data types are:

  • float64: 64 bit floating point number
  • float32: 32 bit floating point number
  • int64: 64 bit integer
  • int32: 32 bit integer
  • bool: 8 bit True or False

There are also dtypes to represent complex numbers, unsigned integers, etc

On most machines, the default dtype for arrays is float64

In [245]:
a = np.zeros(3)
type(a[1])
Out[245]:
numpy.float64
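
If you want something other than the default, the dtype can be passed explicitly when the array is created. Here's a small unexecuted sketch:

In [ ]:
z = np.zeros(3, dtype=int)      # Integer zeros (int64 on most platforms)
w = np.zeros(3, dtype=complex)  # Complex zeros (complex128)
print(z.dtype, w.dtype)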

When we create an array such as

In [246]:
z = np.zeros(10)

z is a "flat" array with no dimension --- neither row nor column vector:

In [247]:
z.shape
Out[247]:
(10,)

Here the shape tuple has only one element, which is the length of the array (tuples with one element end with a comma)
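
If the trailing comma looks odd, recall that in Python it's the comma, not the parentheses, that makes a 1-tuple:

In [ ]:
type((10,))  # tuple
type((10))   # int --- parentheses alone don't create a tuple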

To give it dimension, we can change the shape attribute

For example, let's make it a column vector

In [248]:
z.shape = (10, 1)
In [249]:
z
Out[249]:
array([[ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.]])
In [250]:
z = np.zeros(4)
z.shape = (2, 2)
z
Out[250]:
array([[ 0.,  0.],
       [ 0.,  0.]])

Creating arrays

Creating empty arrays --- initializing memory:

In [251]:
z = np.empty(3)
z
Out[251]:
array([ 0.,  0.,  0.])

These are just garbage numbers --- whatever was in those memory slots

Here's how to make a regular grid sequence

In [252]:
z = np.linspace(2, 4, 5)  # From 2 to 4, with 5 elements
z
Out[252]:
array([ 2. ,  2.5,  3. ,  3.5,  4. ])

Creating an array of ones

In [253]:
z = np.ones(3)
z
Out[253]:
array([ 1.,  1.,  1.])
In [254]:
z = np.identity(2)
z
Out[254]:
array([[ 1.,  0.],
       [ 0.,  1.]])

Arrays can be made from Python lists or tuples

In [255]:
z = np.array([10, 20]) 
z
Out[255]:
array([10, 20])
In [256]:
z = np.array((10, 20), dtype=float) 
z
Out[256]:
array([ 10.,  20.])
In [257]:
z = np.array([[1, 2], [3, 4]])         # 2D array from a list of lists
z
Out[257]:
array([[1, 2],
       [3, 4]])

Array indexing

In [258]:
z = np.linspace(1, 2, 5)
z
Out[258]:
array([ 1.  ,  1.25,  1.5 ,  1.75,  2.  ])
In [259]:
z[0]  # First element --- Python sequences are zero based, like C, Java, etc.
Out[259]:
1.0
In [260]:
z[-1]  # Special syntax for last element
Out[260]:
2.0
In [261]:
z[0:2]  # Meaning: Two elements, starting from element 0
Out[261]:
array([ 1.  ,  1.25])
In [262]:
z = np.array([[1, 2], [3, 4]])
z
Out[262]:
array([[1, 2],
       [3, 4]])
In [263]:
z[0, 0]
Out[263]:
1
In [264]:
z[0,:]  # First row
Out[264]:
array([1, 2])
In [265]:
z[:,0]  # First column
Out[265]:
array([1, 3])
In [266]:
z = np.linspace(2, 4, 5)
z
Out[266]:
array([ 2. ,  2.5,  3. ,  3.5,  4. ])
In [267]:
d = np.array([0, 1, 1, 0, 0], dtype=bool)
d
Out[267]:
array([False,  True,  True, False, False], dtype=bool)
In [268]:
z[d]
Out[268]:
array([ 2.5,  3. ])

Array methods

In [269]:
A = np.array((4, 3, 2, 1))
A
Out[269]:
array([4, 3, 2, 1])
In [270]:
A.sort()
In [271]:
A
Out[271]:
array([1, 2, 3, 4])
In [272]:
A.mean()
Out[272]:
2.5
In [273]:
A.sum()
Out[273]:
10
In [274]:
A.max()
Out[274]:
4
In [275]:
A.cumsum()
Out[275]:
array([ 1,  3,  6, 10])
In [276]:
A.var()
Out[276]:
1.25
In [277]:
A.shape = (2, 2)
A
Out[277]:
array([[1, 2],
       [3, 4]])
In [278]:
A.T  # Transpose, equivalent to A.transpose()
Out[278]:
array([[1, 3],
       [2, 4]])

Operations on arrays

Standard arithmetic operations on arrays act elementwise

In [279]:
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
In [280]:
a + b
Out[280]:
array([ 6,  8, 10, 12])
In [281]:
a - b
Out[281]:
array([-4, -4, -4, -4])
In [282]:
a + 10
Out[282]:
array([11, 12, 13, 14])
In [283]:
a.shape = 2, 2
b.shape = 2, 2
In [284]:
a
Out[284]:
array([[1, 2],
       [3, 4]])
In [285]:
b
Out[285]:
array([[5, 6],
       [7, 8]])
In [286]:
a * b # Pointwise multiplication!!
Out[286]:
array([[ 5, 12],
       [21, 32]])
In [287]:
np.dot(a, b) # Matrix multiplication
Out[287]:
array([[19, 22],
       [43, 50]])

For Python $\geq 3.5$ and NumPy $\geq 1.10$, the @ operator also works.

In [288]:
a @ b
Out[288]:
array([[19, 22],
       [43, 50]])

I'll continue to use np.dot below for the benefit of those who are using older versions. But in my opinion the @ operator is much nicer.

Comparisons

In [289]:
z = np.array([2, 3])
y = np.array([2, 3])
z == y
Out[289]:
array([ True,  True], dtype=bool)
In [290]:
y[0] = 3
z == y
Out[290]:
array([False,  True], dtype=bool)
In [291]:
z = np.linspace(0, 10, 5)
z
Out[291]:
array([  0. ,   2.5,   5. ,   7.5,  10. ])
In [292]:
z > 3
Out[292]:
array([False, False,  True,  True,  True], dtype=bool)
In [293]:
z[z > 3]  # Conditional extraction
Out[293]:
array([  5. ,   7.5,  10. ])
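
Conditions can also be combined, using the elementwise operators & and | rather than Python's and / or keywords (which don't work on arrays). For example:

In [ ]:
z[(z > 3) & (z < 8)]  # Elements strictly between 3 and 8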

Matplotlib

Matplotlib is an outstanding plotting and visualization library for Python that interacts nicely with NumPy. Here are a few quick examples. We'll see more below when we discuss the SciPy library.

In [294]:
import matplotlib.pyplot as plt  # Import main functionality

Display figures in this browser window rather than having them open up separately:

In [295]:
%matplotlib inline 

Create something to plot

In [296]:
x = np.linspace(-2, 2, 100)
y = x**2
In [297]:
fig, ax = plt.subplots()  # Create axes and figure window
ax.plot(x, y, 'b-')
Out[297]:
[<matplotlib.lines.Line2D at 0x7f77a347d048>]

Here's a slightly more complex plot

In [298]:
y3 = x**3
fig, ax = plt.subplots()  # Create axes and figure window
ax.plot(x, y, 'b-', lw=2, alpha=0.8, label='$x^2$')
ax.plot(x, y3, 'g-', lw=2, alpha=0.8, label='$x^3$')
ax.legend(loc='lower right')
Out[298]:
<matplotlib.legend.Legend at 0x7f77a228fcf8>

SciPy

Let's just cover some simple examples --- references for further reading are below

Statistics and distributions

Let's use scipy.stats to generate some data from the Beta distribution

In [299]:
from scipy.stats import beta
q = beta(5, 5)      # Beta(a, b), with a = b = 5
obs = q.rvs(2000)   # 2000 observations

Now let's histogram it and compare it to the original density

In [300]:
fig, ax = plt.subplots()
ax.hist(obs, bins=40, normed=True)  # 'normed' is called 'density' in newer Matplotlib
grid = np.linspace(0.01, 0.99, 100)
ax.plot(grid, q.pdf(grid), 'k-', linewidth=2)
Out[300]:
[<matplotlib.lines.Line2D at 0x7f77a22d2ef0>]

Other methods

In [301]:
type(q)
Out[301]:
scipy.stats._distn_infrastructure.rv_frozen
In [302]:
dir(q)  # Let's see all its methods
Out[302]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'a',
 'args',
 'b',
 'cdf',
 'dist',
 'entropy',
 'expect',
 'interval',
 'isf',
 'kwds',
 'logcdf',
 'logpdf',
 'logpmf',
 'logsf',
 'mean',
 'median',
 'moment',
 'pdf',
 'pmf',
 'ppf',
 'random_state',
 'rvs',
 'sf',
 'stats',
 'std',
 'var']
In [303]:
q.cdf(0.5)
Out[303]:
0.50000000000000011
In [304]:
q.pdf(0.5)
Out[304]:
2.4609375000000009
In [305]:
q.mean()
Out[305]:
0.5
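
Another handy method from the list above is ppf, the quantile function (the inverse of cdf). An unexecuted example:

In [ ]:
q.ppf(0.5)  # Median --- equal to 0.5 here, by symmetry of Beta(5, 5)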

Basic linear regression:

In [306]:
from scipy.stats import linregress
n = 100
alpha, beta, sigma = 1, 2, 1.5  # NB: this rebinds the name beta imported from scipy.stats above
x = np.random.randn(n)  # n standard normals
y = alpha + beta * x + sigma * np.random.randn(n)
beta_hat, alpha_hat, r_value, p_value, std_err = linregress(x, y)
print("gradient = {}".format(beta_hat))
print("intercept = {}".format(alpha_hat))
gradient = 1.874172371477785
intercept = 1.4035329152756753

Let's plot this with data and line of best fit

In [307]:
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, y, 'bo', alpha=0.6, label='observations')
xgrid = np.linspace(-3, 3, 2)
ax.plot(xgrid, alpha_hat + beta_hat * xgrid, 'k-', lw=2, alpha=0.8, label='best fit')
ax.grid()
ax.legend(loc='upper left')
Out[307]:
<matplotlib.legend.Legend at 0x7f77a21026a0>

Roots and fixed points

Let's choose an arbitrary function to work with

In [308]:
fig, ax = plt.subplots()
def f(x):
    return np.sin(4 * (x - 0.25)) + x + x**20 - 1
x = np.linspace(0, 1, 100)
ax.plot(x, f(x))
ax.plot(x, 0 * x)
Out[308]:
[<matplotlib.lines.Line2D at 0x7f77a2129c18>]
In [309]:
from scipy.optimize import bisect  # Bisection algorithm --- slow but robust
bisect(f, 0, 1)
Out[309]:
0.4082935042797544
In [310]:
from scipy.optimize import newton  #  Newton's method --- fast but less robust
newton(f, 0.2)   # Start the search at initial condition x = 0.2
Out[310]:
0.40829350427935679
In [311]:
newton(f, 0.7)   # Start the search at x = 0.7 instead 
Out[311]:
0.70017000000002816

Here we see that the algorithm gets it wrong --- newton is fast but not robust

Let's try a hybrid method

In [312]:
from scipy.optimize import brentq
brentq(f, 0, 1) # Hybrid method
Out[312]:
0.40829350427936706
In [313]:
timeit bisect(f, 0, 1)
10000 loops, best of 3: 64.7 µs per loop
In [314]:
timeit newton(f, 0.2)
100000 loops, best of 3: 14.2 µs per loop
In [315]:
timeit brentq(f, 0, 1)
100000 loops, best of 3: 16.1 µs per loop

Note that the hybrid method is robust but still quite fast...
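
As for the fixed points promised in the section title: a fixed point of $g$ is just a root of $x \mapsto g(x) - x$, so any of the routines above can be used. SciPy also provides a dedicated routine, scipy.optimize.fixed_point; here's a small sketch, not executed above:

In [ ]:
from scipy.optimize import fixed_point
fixed_point(lambda x: np.sqrt(x + 1), 1.0)  # Golden ratio, approx 1.618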

Numerical optimization and integration

In [316]:
from scipy.optimize import fminbound
fminbound(lambda x: x**2, -1, 2)  # Search in [-1, 2]
Out[316]:
0.0
In [317]:
from scipy.integrate import quad
integral, error = quad(lambda x: x**2, 0, 1)
integral
Out[317]:
0.33333333333333337

Linear Algebra

Let's look at some of the most common routines from linear and matrix algebra

In [318]:
import scipy.linalg as la

We'll experiment with matrices

$$ A = \begin{bmatrix} 2 & -1 \\ 3 & 0 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} 1 \\ 1 \end{bmatrix} $$
In [319]:
A = [[2, -1],
     [3, 0]]
A = np.array(A) # Convert from list to NumPy array
b = np.ones((2, 1))  # Shape is 2 x 1
In [320]:
A
Out[320]:
array([[ 2, -1],
       [ 3,  0]])
In [321]:
b
Out[321]:
array([[ 1.],
       [ 1.]])
In [322]:
x = la.solve(A, b)  # Solve for x in Ax = b
print(x)
[[ 0.33333333]
 [-0.33333333]]

Let's check that $Ax = b$

In [323]:
np.dot(A, x)
Out[323]:
array([[ 1.],
       [ 1.]])

We can also invert directly (although, for solving $Ax = b$, la.solve is generally faster and more accurate than computing the inverse)

In [324]:
la.inv(A)
Out[324]:
array([[ 0.        ,  0.33333333],
       [-1.        ,  0.66666667]])
In [325]:
np.dot(A, la.inv(A))  # Should be the identity
Out[325]:
array([[ 1.,  0.],
       [ 0.,  1.]])

Let's compute the eigenvalues and eigenvectors

In [326]:
eigvals, eigvecs = la.eig(A)
In [327]:
print("eigenvalues = {}".format(eigvals))
eigenvalues = [ 1.+1.41421356j  1.-1.41421356j]
In [328]:
print("first eigenvector = {}".format(eigvecs[:, 0]))
first eigenvector = [ 0.28867513+0.40824829j  0.86602540+0.j        ]
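
As a quick sanity check, each pair should satisfy $A v = \lambda v$; an unexecuted verification:

In [ ]:
lam, v = eigvals[0], eigvecs[:, 0]
np.allclose(np.dot(A, v), lam * v)  # Should evaluate to True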

More information

For more detail on these and related routines, see the SciPy documentation and the lectures at quant-econ.net.

Pandas

Pandas is a very popular library for working with data sets. In pandas, data is held in a dataframe, which is kind of like a spreadsheet

In [329]:
import pandas as pd
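
To get a feel for the spreadsheet analogy, note that a dataframe can be built directly from a dict mapping column names to columns; a minimal sketch:

In [ ]:
pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})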

Let's start by writing a test data set to the present working directory, so we can read it back in as a dataframe using pandas. We use an IPython magic to write the data from a cell to a file:

In [330]:
%%file test_data.csv
"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"
Overwriting test_data.csv
In [331]:
%ls ./*.csv # Check it's there
./test_data.csv
In [332]:
df = pd.read_csv('./test_data.csv')
In [333]:
df
Out[333]:
         country  country isocode  year          POP       XRAT           tcgdp         cc         cg
0      Argentina              ARG  2000    37335.653   0.999500   295072.218690  75.716805   5.578804
1      Australia              AUS  2000    19053.186   1.724830   541804.652100  67.759026   6.720098
2          India              IND  2000  1006300.297  44.941600  1728144.374800  64.575551  14.072206
3         Israel              ISR  2000     6114.570   4.077330   129253.894230  64.436451  10.266688
4         Malawi              MWI  2000    11801.505  59.543808     5026.221784  74.707624  11.658954
5   South Africa              ZAF  2000    45064.098   6.939830   227242.369490  72.718710   5.726546
6  United States              USA  2000   282171.957   1.000000  9898700.000000  72.347054   6.032454
7        Uruguay              URY  2000     3219.793  12.099592    25255.961693  78.978740   5.108068

Let's try that again but this time using the country as the index column

In [334]:
df = pd.read_csv('./test_data.csv', index_col='country')
In [335]:
df
Out[335]:
               country isocode  year          POP       XRAT           tcgdp         cc         cg
country
Argentina                  ARG  2000    37335.653   0.999500   295072.218690  75.716805   5.578804
Australia                  AUS  2000    19053.186   1.724830   541804.652100  67.759026   6.720098
India                      IND  2000  1006300.297  44.941600  1728144.374800  64.575551  14.072206
Israel                     ISR  2000     6114.570   4.077330   129253.894230  64.436451  10.266688
Malawi                     MWI  2000    11801.505  59.543808     5026.221784  74.707624  11.658954
South Africa               ZAF  2000    45064.098   6.939830   227242.369490  72.718710   5.726546
United States              USA  2000   282171.957   1.000000  9898700.000000  72.347054   6.032454
Uruguay                    URY  2000     3219.793  12.099592    25255.961693  78.978740   5.108068

Let's drop the year since it's not very informative

In [336]:
df.drop(['year'], axis=1, inplace=True)
df
Out[336]:
               country isocode          POP       XRAT           tcgdp         cc         cg
country
Argentina                  ARG    37335.653   0.999500   295072.218690  75.716805   5.578804
Australia                  AUS    19053.186   1.724830   541804.652100  67.759026   6.720098
India                      IND  1006300.297  44.941600  1728144.374800  64.575551  14.072206
Israel                     ISR     6114.570   4.077330   129253.894230  64.436451  10.266688
Malawi                     MWI    11801.505  59.543808     5026.221784  74.707624  11.658954
South Africa               ZAF    45064.098   6.939830   227242.369490  72.718710   5.726546
United States              USA   282171.957   1.000000  9898700.000000  72.347054   6.032454
Uruguay                    URY     3219.793  12.099592    25255.961693  78.978740   5.108068

Let's add a column for GDP per capita

In [337]:
df['GDP percap'] = df['tcgdp'] / df['POP']
In [338]:
df
Out[338]:
               country isocode          POP       XRAT           tcgdp         cc         cg  GDP percap
country
Argentina                  ARG    37335.653   0.999500   295072.218690  75.716805   5.578804    7.903229
Australia                  AUS    19053.186   1.724830   541804.652100  67.759026   6.720098   28.436433
India                      IND  1006300.297  44.941600  1728144.374800  64.575551  14.072206    1.717325
Israel                     ISR     6114.570   4.077330   129253.894230  64.436451  10.266688   21.138673
Malawi                     MWI    11801.505  59.543808     5026.221784  74.707624  11.658954    0.425897
South Africa               ZAF    45064.098   6.939830   227242.369490  72.718710   5.726546    5.042648
United States              USA   282171.957   1.000000  9898700.000000  72.347054   6.032454   35.080382
Uruguay                    URY     3219.793  12.099592    25255.961693  78.978740   5.108068    7.843971

Let's sort the whole data frame by GDP per capita

In [339]:
df.sort_values(by='GDP percap', inplace=True)
In [340]:
df
Out[340]:
               country isocode          POP       XRAT           tcgdp         cc         cg  GDP percap
country
Malawi                     MWI    11801.505  59.543808     5026.221784  74.707624  11.658954    0.425897
India                      IND  1006300.297  44.941600  1728144.374800  64.575551  14.072206    1.717325
South Africa               ZAF    45064.098   6.939830   227242.369490  72.718710   5.726546    5.042648
Uruguay                    URY     3219.793  12.099592    25255.961693  78.978740   5.108068    7.843971
Argentina                  ARG    37335.653   0.999500   295072.218690  75.716805   5.578804    7.903229
Israel                     ISR     6114.570   4.077330   129253.894230  64.436451  10.266688   21.138673
Australia                  AUS    19053.186   1.724830   541804.652100  67.759026   6.720098   28.436433
United States              USA   282171.957   1.000000  9898700.000000  72.347054   6.032454   35.080382

Now we'll plot per capita GDP using the dataframe's plot method

In [341]:
df['GDP percap'].plot(kind='bar')
Out[341]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f77a207cda0>

Exercises

Here are two exercises. Feel free to consult documentation, such as can be found here. The solutions are below. The cell with "solution below" is meant to push them below your line of sight and save you from temptation.

Exercise 1

Generate 10000 data points from the exponential distribution with density

$$ f(x; \alpha) = \alpha \exp(-\alpha x) \qquad (x > 0, \alpha > 0) $$

using scipy.stats and taking $\alpha = 0.5$. Then, after looking up the maximum likelihood estimator of $\alpha$, compute the estimate given your data and check that it is in fact close to $\alpha$.

In [342]:
# Put your solution here

Exercise 2

Using the same data set, implement maximum likelihood again, but this time pretending that you don't know the analytical expression for the maximum likelihood estimator. Set up the log likelihood function and maximize it numerically using a routine from scipy.optimize.

In [343]:
# Put your solution here

Solutions

In [344]:
# Print some nonsense to partially hide solutions
filler_text = "solution below\n" * 25
print(filler_text)
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below

Solution to Exercise 1

After checking the docs for the exponential distribution, we proceed as follows

In [345]:
from scipy.stats import expon
alpha = 0.5
n = 10000
ep = expon(scale=1.0/alpha)  # scale controls the exponential parameter
x = ep.rvs(n)

Let's check we've got the right distribution here

In [346]:
fig, ax = plt.subplots(figsize=(8, 5))
xmin, xmax = 0.001, 10.0
ax.set_xlim(xmin, xmax)
ax.hist(x, normed=True, bins=40, alpha=0.3)  # 'normed' is called 'density' in newer Matplotlib
grid = np.linspace(xmin, xmax, 200)
ax.plot(grid, ep.pdf(grid), 'g-', lw=2, label='true density')
ax.legend()
Out[346]:
<matplotlib.legend.Legend at 0x7f77a2118048>

It's well-known that the MLE of $\alpha$ is $1/\bar x$ where $\bar x$ is the mean of the sample. Let's check that it is indeed close to $\alpha$.
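
To see why, write the log likelihood of the sample as

$$ \ell(\alpha) = n \log \alpha - \alpha \sum_{i=1}^n x_i $$

Setting $\ell'(\alpha) = n/\alpha - \sum_{i=1}^n x_i = 0$ and solving gives $\hat{\alpha} = n / \sum_i x_i = 1/\bar{x}$. (The same function, negated, is what we minimize numerically in the solution to Exercise 2 below.)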

In [347]:
alpha_mle = 1.0 / x.mean()
print("max likelihood estimate of alpha is {}".format(alpha_mle))
max likelihood estimate of alpha is 0.5042739418869412

Solution to Exercise 2

We set up minus the log likelihood and minimize it numerically:

In [348]:
s = x.sum()
def neg_loglike(a):
    "Minus the log likelihood function for exponential"
    return - n * np.log(a) + a * s

Minimize over a reasonable parameter space

In [349]:
from scipy.optimize import fminbound
fminbound(neg_loglike, 0.01, 10.0)
Out[349]:
0.5042730725954131

This is very close to the value of the analytical maximum likelihood estimator we computed in Exercise 1.
