Scientific Python Quickstart¶

John Stachurski ¶

ANU¶

This is a fast-paced, hands-on introduction to scientific computing with Python, contained in a Jupyter notebook. The main focus will be on introducing Python's four most important scientific libraries: NumPy, Scipy, Pandas and Matplotlib.

If you don't know how to use this notebook you need to first work through this page.

A slower, more detailed and more systematic treatment of Python for scientific applications can be found at quant-econ.net. But this notebook is a good place to start for those who like to learn by doing.

Here's some information on the version of Python that I'm using:

In [241]:

import sys
print(sys.version)

3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Dec  7 2015, 11:16:01) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]

Basic NumPy¶

Perhaps the single most important scientific library for Python is NumPy. NumPy provides foundational data structures and routines on which many other libraries rely.

In [242]:

import numpy as np     # Import library and give it alias np
print(np.__version__)  #  The version I'm using

1.10.2

NumPy defines a basic data type called an array (actually a numpy.ndarray)

In [243]:

a = np.zeros(3)     # Create an array of zeros
a                   # Print a

Out[243]:

array([ 0.,  0.,  0.])

In [244]:

type(a)

Out[244]:

numpy.ndarray

Note that array data must be homogeneous

The most important data types are:

float64: 64 bit floating point number
float32: 32 bit floating point number
int64: 64 bit integer
int32: 32 bit integer
bool: 8 bit True or False

There are also dtypes to represent complex numbers, unsigned integers, etc

On most machines, the default dtype for arrays is float64

In [245]:

a = np.zeros(3)
type(a[1])

Out[245]:

numpy.float64

When we create an array such as

In [246]:

z = np.zeros(10)

z is a "flat" array with no dimension--- neither row nor column vector:

In [247]:

 z.shape

Out[247]:

(10,)

Here the shape tuple has only one element, which is the length of the array (tuples with one element end with a comma)

To give it dimension, we can change the shape attribute

For example, let's make it a column vector

In [248]:

z.shape = (10, 1)

In [249]:

Out[249]:

array([[ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.],
       [ 0.]])

In [250]:

z = np.zeros(4)
z.shape = (2, 2)
z

Out[250]:

array([[ 0.,  0.],
       [ 0.,  0.]])

Creating arrays¶

Creating empty arrays --- initializing memory:

In [251]:

z = np.empty(3)
z

Out[251]:

array([ 0.,  0.,  0.])

These are just garbage numbers --- whatever was in those memory slots

Here's how to make a regular gird sequence

In [252]:

z = np.linspace(2, 4, 5)  # From 2 to 4, with 5 elements
z

Out[252]:

array([ 2. ,  2.5,  3. ,  3.5,  4. ])

Creating an array of ones

In [253]:

z = np.ones(3)
z

Out[253]:

array([ 1.,  1.,  1.])

In [254]:

z = np.identity(2)
z

Out[254]:

array([[ 1.,  0.],
       [ 0.,  1.]])

Arrays can be made from Python lists or tuples

In [255]:

z = np.array([10, 20]) 
z

Out[255]:

array([10, 20])

In [256]:

z = np.array((10, 20), dtype=float) 
z

Out[256]:

array([ 10.,  20.])

In [257]:

z = np.array([[1, 2], [3, 4]])         # 2D array from a list of lists
z

Out[257]:

array([[1, 2],
       [3, 4]])

Array indexing¶

In [258]:

z = np.linspace(1, 2, 5)
z

Out[258]:

array([ 1.  ,  1.25,  1.5 ,  1.75,  2.  ])

In [259]:

z[0]  # First element --- Python sequences are zero based, like C, Java, etc.

Out[259]:

1.0

In [260]:

z[-1]  # Special syntax for last element

Out[260]:

2.0

In [261]:

z[0:2]  # Meaning: Two elements, starting from element 0

Out[261]:

array([ 1.  ,  1.25])

In [262]:

z = np.array([[1, 2], [3, 4]])
z

Out[262]:

array([[1, 2],
       [3, 4]])

In [263]:

z[0, 0]

Out[263]:

In [264]:

z[0,:]  # First row

Out[264]:

array([1, 2])

In [265]:

z[:,0]  # First column

Out[265]:

array([1, 3])

In [266]:

z = np.linspace(2, 4, 5)
z

Out[266]:

array([ 2. ,  2.5,  3. ,  3.5,  4. ])

In [267]:

d = np.array([0, 1, 1, 0, 0], dtype=bool)
d

Out[267]:

array([False,  True,  True, False, False], dtype=bool)

In [268]:

z[d]

Out[268]:

array([ 2.5,  3. ])

Array methods¶

In [269]:

A = np.array((4, 3, 2, 1))
A

Out[269]:

array([4, 3, 2, 1])

In [270]:

A.sort()

In [271]:

Out[271]:

array([1, 2, 3, 4])

In [272]:

A.mean()

Out[272]:

2.5

In [273]:

A.sum()

Out[273]:

In [274]:

A.max()

Out[274]:

In [275]:

A.cumsum()

Out[275]:

array([ 1,  3,  6, 10])

In [276]:

A.var()

Out[276]:

1.25

In [277]:

A.shape = (2, 2)
A

Out[277]:

array([[1, 2],
       [3, 4]])

In [278]:

A.T  # Transpose, equivalent to A.transpose()

Out[278]:

array([[1, 3],
       [2, 4]])

Operations on arrays¶

Standard arithmetic operations on arrays act elementwise

In [279]:

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

In [280]:

a + b

Out[280]:

array([ 6,  8, 10, 12])

In [281]:

a - b

Out[281]:

array([-4, -4, -4, -4])

In [282]:

a + 10

Out[282]:

array([11, 12, 13, 14])

In [283]:

a.shape = 2, 2
b.shape = 2, 2

In [284]:

Out[284]:

array([[1, 2],
       [3, 4]])

In [285]:

Out[285]:

array([[5, 6],
       [7, 8]])

In [286]:

a * b # Pointwise multiplication!!

Out[286]:

array([[ 5, 12],
       [21, 32]])

In [287]:

np.dot(a, b) # Matrix multiplication

Out[287]:

array([[19, 22],
       [43, 50]])

For Python $\geq 3.5$ and NumPy $\geq 1.1$ the @ operator also works.

In [288]:

a @ b

Out[288]:

array([[19, 22],
       [43, 50]])

I'll continue to use np.dot below for the benefit of those who are using older versions. But in my opinion the @ operator is much nicer.

Comparisons¶

In [289]:

z = np.array([2, 3])
y = np.array([2, 3])
z == y

Out[289]:

array([ True,  True], dtype=bool)

In [290]:

y[0] = 3
z == y

Out[290]:

array([False,  True], dtype=bool)

In [291]:

z = np.linspace(0, 10, 5)
z

Out[291]:

array([  0. ,   2.5,   5. ,   7.5,  10. ])

In [292]:

z > 3

Out[292]:

array([False, False,  True,  True,  True], dtype=bool)

In [293]:

z[z > 3]  # Conditional extraction

Out[293]:

array([  5. ,   7.5,  10. ])

Matplotlib¶

Matplotlib is an outstanding plotting and visualization library for Python that interacts nicely with NumPy. Here are a few quick examples. We'll see more below when we discuss the SciPy library.

In [294]:

import matplotlib.pyplot as plt  # Import main functionality

Display figures in this browser window rather than having them open up separately:

In [295]:

%matplotlib inline

Create something to plot

In [296]:

x = np.linspace(-2, 2, 100)
y = x**2

In [297]:

fig, ax = plt.subplots()  # Create axes and figure window
ax.plot(x, y, 'b-')

Out[297]:

[<matplotlib.lines.Line2D at 0x7f77a347d048>]

Here's a slightly more complex plot

In [298]:

y3 = x**3
fig, ax = plt.subplots()  # Create axes and figure window
ax.plot(x, y, 'b-', lw=2, alpha=0.8, label='$x^2$')
ax.plot(x, y3, 'g-', lw=2, alpha=0.8, label='$x^3$')
ax.legend(loc='lower right')

Out[298]:

<matplotlib.legend.Legend at 0x7f77a228fcf8>

SciPy¶

Let's just cover some simple examples --- references for further reading are below

Statistics and distributions¶

Let's use scipy.stats to generate some data from the Beta distribution

In [299]:

from scipy.stats import beta
q = beta(5, 5)      # Beta(a, b), with a = b = 5
obs = q.rvs(2000)   # 2000 observations

Now let's histogram it and compare it to the original density

In [300]:

fig, ax = plt.subplots()
ax.hist(obs, bins=40, normed=True)
grid = np.linspace(0.01, 0.99, 100)
ax.plot(grid, q.pdf(grid), 'k-', linewidth=2)

Out[300]:

[<matplotlib.lines.Line2D at 0x7f77a22d2ef0>]

Other methods

In [301]:

type(q)

Out[301]:

scipy.stats._distn_infrastructure.rv_frozen

In [302]:

dir(q)  # Let's see all its methods

Out[302]:

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'a',
 'args',
 'b',
 'cdf',
 'dist',
 'entropy',
 'expect',
 'interval',
 'isf',
 'kwds',
 'logcdf',
 'logpdf',
 'logpmf',
 'logsf',
 'mean',
 'median',
 'moment',
 'pdf',
 'pmf',
 'ppf',
 'random_state',
 'rvs',
 'sf',
 'stats',
 'std',
 'var']

In [303]:

q.cdf(0.5)

Out[303]:

0.50000000000000011

In [304]:

q.pdf(0.5)

Out[304]:

2.4609375000000009

In [305]:

q.mean()

Out[305]:

0.5

Basic linear regression:

In [306]:

from scipy.stats import linregress
n = 100
alpha, beta, sigma = 1, 2, 1.5
x = np.random.randn(n)  # n standard normals
y = alpha + beta * x + sigma * np.random.randn(n)
beta_hat, alpha_hat, r_value, p_value, std_err = linregress(x, y)
print("gradient = {}".format(beta_hat))
print("intercept = {}".format(alpha_hat))

gradient = 1.874172371477785
intercept = 1.4035329152756753

Let's plot this with data and line of best fit

In [307]:

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, y, 'bo', alpha=0.6, label='observations')
xgrid = np.linspace(-3, 3, 2)
ax.plot(xgrid, alpha_hat + beta_hat * xgrid, 'k-', lw=2, alpha=0.8, label='best fit')
ax.grid()
ax.legend(loc='upper left')

Out[307]:

<matplotlib.legend.Legend at 0x7f77a21026a0>

Roots and fixed points¶

Let's choose an arbitrary function to work with

In [308]:

fig, ax = plt.subplots()
def f(x):
     return np.sin(4 * (x - 0.25)) + x + x**20 - 1
x = np.linspace(0, 1, 100)
ax.plot(x, f(x))
ax.plot(x, 0 * x)

Out[308]:

[<matplotlib.lines.Line2D at 0x7f77a2129c18>]

In [309]:

from scipy.optimize import bisect  # Bisection algorithm --- slow but robust
bisect(f, 0, 1)

Out[309]:

0.4082935042797544

In [310]:

from scipy.optimize import newton  #  Newton's method --- fast but less robust
newton(f, 0.2)   # Start the search at initial condition x = 0.2

Out[310]:

0.40829350427935679

In [311]:

newton(f, 0.7)   # Start the search at x = 0.7 instead 

Out[311]:

0.70017000000002816

Here we see that the algorithm gets it wrong --- newton is fast but not robust

Let's try a hybrid method

In [312]:

from scipy.optimize import brentq
brentq(f, 0, 1) # Hybrid method

Out[312]:

0.40829350427936706

In [313]:

timeit bisect(f, 0, 1)

10000 loops, best of 3: 64.7 µs per loop

In [314]:

timeit newton(f, 0.2)

100000 loops, best of 3: 14.2 µs per loop

In [315]:

timeit brentq(f, 0, 1)

100000 loops, best of 3: 16.1 µs per loop

Note that the hybrid method is robust but still quite fast...

Numerical optimization and integration¶

In [316]:

from scipy.optimize import fminbound
fminbound(lambda x: x**2, -1, 2)  # Search in [-1, 2]

Out[316]:

0.0

In [317]:

from scipy.integrate import quad
integral, error = quad(lambda x: x**2, 0, 1)
integral

Out[317]:

0.33333333333333337

Linear Algebra¶

Let's look at some of the most common routines from linear and matrix algebra

In [318]:

import scipy.linalg as la

We'll experiment with matrices

$$ A = \begin{bmatrix} 2 & -1 \\ 3 & 0 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} 1 \\ 1 \end{bmatrix} $$

In [319]:

A = [[2, -1],
     [3, 0]]
A = np.array(A) # Convert from list to NumPy array
b = np.ones((2, 1))  # Shape is 2 x 1

In [320]:

Out[320]:

array([[ 2, -1],
       [ 3,  0]])

In [321]:

Out[321]:

array([[ 1.],
       [ 1.]])

In [322]:

x = la.solve(A, b)  # Solve for x in Ax = b
print(x)

[[ 0.33333333]
 [-0.33333333]]

Let's check that $Ax = b$

In [323]:

np.dot(A, x)

Out[323]:

array([[ 1.],
       [ 1.]])

We can also invert directly

In [324]:

la.inv(A)

Out[324]:

array([[ 0.        ,  0.33333333],
       [-1.        ,  0.66666667]])

In [325]:

np.dot(A, la.inv(A))  # Should be the identity

Out[325]:

array([[ 1.,  0.],
       [ 0.,  1.]])

Let's compute the eigenvalues and eigenvectors

In [326]:

eigvals, eigvecs = la.eig(A)

In [327]:

print("eigenvalues = {}".format(eigvals))

eigenvalues = [ 1.+1.41421356j  1.-1.41421356j]

In [328]:

print("first eigenvector = {}".format(eigvecs[:, 0]))

first eigenvector = [ 0.28867513+0.40824829j  0.86602540+0.j        ]

More information¶

linear algebra: http://docs.scipy.org/doc/scipy/reference/linalg.html
numerical integration: http://docs.scipy.org/doc/scipy/reference/integrate.html
interpolation: http://docs.scipy.org/doc/scipy/reference/interpolate.html
optimization: http://docs.scipy.org/doc/scipy/reference/optimize.html
distributions and random number generation: http://docs.scipy.org/doc/scipy/reference/stats.html
signal processing: http://docs.scipy.org/doc/scipy/reference/signal.html

Pandas¶

Pandas is a very popular library for working with data sets. In pandas, data is held in a dataframe, which is kind of like a spread sheet

In [329]:

import pandas as pd

Let's start by writing a test data set to the present working directory, so we can read it back in as a dataframe using pandas. We use an IPython magic to write the data from a cell to a file:

In [330]:

%%file test_data.csv
"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"

Overwriting test_data.csv

In [331]:

%ls ./*.csv # Check it's there

./test_data.csv

In [332]:

df = pd.read_csv('./test_data.csv')

In [333]:

df

Out[333]:

	country	country isocode	year	POP	XRAT	tcgdp	cc	cg
0	Argentina	ARG	2000	37335.653	0.999500	295072.218690	75.716805	5.578804
1	Australia	AUS	2000	19053.186	1.724830	541804.652100	67.759026	6.720098
2	India	IND	2000	1006300.297	44.941600	1728144.374800	64.575551	14.072206
3	Israel	ISR	2000	6114.570	4.077330	129253.894230	64.436451	10.266688
4	Malawi	MWI	2000	11801.505	59.543808	5026.221784	74.707624	11.658954
5	South Africa	ZAF	2000	45064.098	6.939830	227242.369490	72.718710	5.726546
6	United States	USA	2000	282171.957	1.000000	9898700.000000	72.347054	6.032454
7	Uruguay	URY	2000	3219.793	12.099592	25255.961693	78.978740	5.108068

Let's try that again but this time using the country as the index column

In [334]:

df = pd.read_csv('./test_data.csv', index_col='country')

In [335]:

df

Out[335]:

	country isocode	year	POP	XRAT	tcgdp	cc	cg
country
Argentina	ARG	2000	37335.653	0.999500	295072.218690	75.716805	5.578804
Australia	AUS	2000	19053.186	1.724830	541804.652100	67.759026	6.720098
India	IND	2000	1006300.297	44.941600	1728144.374800	64.575551	14.072206
Israel	ISR	2000	6114.570	4.077330	129253.894230	64.436451	10.266688
Malawi	MWI	2000	11801.505	59.543808	5026.221784	74.707624	11.658954
South Africa	ZAF	2000	45064.098	6.939830	227242.369490	72.718710	5.726546
United States	USA	2000	282171.957	1.000000	9898700.000000	72.347054	6.032454
Uruguay	URY	2000	3219.793	12.099592	25255.961693	78.978740	5.108068

Let's drop the year since it's not very informative

In [336]:

df.drop(['year'], axis=1, inplace=True)
df

Out[336]:

	country isocode	POP	XRAT	tcgdp	cc	cg
country
Argentina	ARG	37335.653	0.999500	295072.218690	75.716805	5.578804
Australia	AUS	19053.186	1.724830	541804.652100	67.759026	6.720098
India	IND	1006300.297	44.941600	1728144.374800	64.575551	14.072206
Israel	ISR	6114.570	4.077330	129253.894230	64.436451	10.266688
Malawi	MWI	11801.505	59.543808	5026.221784	74.707624	11.658954
South Africa	ZAF	45064.098	6.939830	227242.369490	72.718710	5.726546
United States	USA	282171.957	1.000000	9898700.000000	72.347054	6.032454
Uruguay	URY	3219.793	12.099592	25255.961693	78.978740	5.108068

Let's add a column for GDP per capita

In [337]:

df['GDP percap'] = df['tcgdp'] / df['POP']

In [338]:

df

Out[338]:

	country isocode	POP	XRAT	tcgdp	cc	cg	GDP percap
country
Argentina	ARG	37335.653	0.999500	295072.218690	75.716805	5.578804	7.903229
Australia	AUS	19053.186	1.724830	541804.652100	67.759026	6.720098	28.436433
India	IND	1006300.297	44.941600	1728144.374800	64.575551	14.072206	1.717325
Israel	ISR	6114.570	4.077330	129253.894230	64.436451	10.266688	21.138673
Malawi	MWI	11801.505	59.543808	5026.221784	74.707624	11.658954	0.425897
South Africa	ZAF	45064.098	6.939830	227242.369490	72.718710	5.726546	5.042648
United States	USA	282171.957	1.000000	9898700.000000	72.347054	6.032454	35.080382
Uruguay	URY	3219.793	12.099592	25255.961693	78.978740	5.108068	7.843971

Let's sort the whole data frame by GDP per capita

In [339]:

df.sort_values(by='GDP percap', inplace=True)

In [340]:

df

Out[340]:

	country isocode	POP	XRAT	tcgdp	cc	cg	GDP percap
country
Malawi	MWI	11801.505	59.543808	5026.221784	74.707624	11.658954	0.425897
India	IND	1006300.297	44.941600	1728144.374800	64.575551	14.072206	1.717325
South Africa	ZAF	45064.098	6.939830	227242.369490	72.718710	5.726546	5.042648
Uruguay	URY	3219.793	12.099592	25255.961693	78.978740	5.108068	7.843971
Argentina	ARG	37335.653	0.999500	295072.218690	75.716805	5.578804	7.903229
Israel	ISR	6114.570	4.077330	129253.894230	64.436451	10.266688	21.138673
Australia	AUS	19053.186	1.724830	541804.652100	67.759026	6.720098	28.436433
United States	USA	282171.957	1.000000	9898700.000000	72.347054	6.032454	35.080382

Now we'll plot per capital GDP using the dataframe's plot method

In [341]:

df['GDP percap'].plot(kind='bar')

Out[341]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f77a207cda0>

Exercises¶

Here are two exercises. Feel free to consult documentation such as can be found here. The solutions are below. The cell with "solution below" is mean to push them below your line of sight and save you from temptation.

Exercise 1¶

Generate 10000 data points from the exponential distribution with density

$$ f(x; \alpha) = \alpha \exp(-\alpha x) \qquad (x > 0, \alpha > 0) $$

using scipy.stats and taking $\alpha = 0.5$. Then, after looking up the maximum likelihood estimator of $\alpha$, compute the estimate given your data and check that it is in fact close to $\alpha$.

In [342]:

# Put your solution here

Exercise 2¶

Using the same data set, implement maximum likelihood again, but this time pretending that you don't know the analytical expression for the maximum likelihood estimator. Set up the log likelihood function and maximize it numerically using a routine from scipy.optimize.

In [343]:

# Put your solution here

Solutions¶

In [344]:

# Print some nonsense to partially hide solutions
filler_text = "solution below\n" * 25
print(filler_text)

solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below
solution below

Solution to Exercise 1¶

After checking the docs for the exponential distribution we proceed as follows

In [345]:

from scipy.stats import expon
alpha = 0.5
n = 10000
ep = expon(scale=1.0/alpha)  # scale controls the exponential parameter
x = ep.rvs(n)

Let's check we've got the right distribution here

In [346]:

fig, ax = plt.subplots(figsize=(8, 5))
xmin, xmax = 0.001, 10.0
ax.set_xlim(xmin, xmax)
ax.hist(x, normed=True, bins=40, alpha=0.3)
grid = np.linspace(xmin, xmax, 200)
ax.plot(grid, ep.pdf(grid), 'g-', lw=2, label='true density')
ax.legend()

Out[346]:

<matplotlib.legend.Legend at 0x7f77a2118048>

It's well-known that the MLE of $\alpha$ is $1/\bar x$ where $\bar x$ is the mean of the sample. Let's check that it is indeed close to $\alpha$.

In [347]:

alpha_mle = 1.0 / x.mean()
print("max likelihood estimate of alpha is {}".format(alpha_mle))

max likelihood estimate of alpha is 0.5042739418869412

In [348]:

s = x.sum()
def neg_loglike(a):
    "Minus the log likelihood function for exponential"
    return - n * np.log(a) + a * s

Minimize over a reasonable parameter space

In [349]:

from scipy.optimize import fminbound
fminbound(neg_loglike, 0.01, 10.0)

Out[349]:

0.5042730725954131

This is very close to the analytical value of the max likelihood estimator we got in exercise 1

In [ ]: