Statistical Computing Languages

Python

### 21st November 2014

In [1]:
# rerun Fernando's script to load in css
%run talktools

The history saving thread hit an unexpected error (DatabaseError('database disk image is malformed',)).History will not be written to the database.

My Python History (Personal)
1. Background as an Undergraduate Mechanical Engineer (1991-1994)
2. Learnt BBC Basic (1983-1990), Lotus 1-2-3 (1990-1991), Quick Basic & Assembler (1991-1993), MATLAB (1994)
3. Worked as Field Engineer on Oil Rigs, tried learning Borland C++ (1994-1996)
4. PhD Course: Learnt C and MATLAB (1997-2000).
5. Started as a lecturer in Computer Science at Sheffield 2001
1. Felt obliged to learn Java
2. Read Bruce Eckel's book "Thinking in Java"
3. All scripts in Book were read using Python.
Why Python?

[From Guido van Rossum's blog](http://python-history.blogspot.co.uk/2009/01/introduction-and-overview.html)

Python is currently one of the most popular dynamic programming languages, along with Perl, Tcl, PHP, and newcomer Ruby. Although it is often viewed as a "scripting" language, it is really a general purpose programming language along the lines of Lisp or Smalltalk (as are the others, by the way). Today, Python is used for everything from throw-away scripts to large scalable web servers that provide uninterrupted service 24x7. It is used for GUI and database programming, client- and server-side web programming, and application testing. It is used by scientists writing applications for the world's fastest supercomputers and by children first learning to program.

Ease of Use

Python has the style of a scripting language (cf perl, bash)

In [2]:
# Open my web page and look for links (modified by code on Guido's blog)
import re
import urllib

regex = re.compile(r'href="([^"]+\.html)"')

def matcher(url, max=10):
"Print the first several URL references in a given url."
hits = regex.findall(data)
return hits

hits = matcher("http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/index.html")
for hit in hits:
print hit

/people/N.Lawrence/inaugural.html
/people/N.Lawrence/lectureNotes.html
/people/N.Lawrence/research.html
http://ml.dcs.shef.ac.uk/sitran/talks.html
/people/N.Lawrence/software.html
/people/N.Lawrence/miscellany.html
/people/N.Lawrence/notes.html
./thematic08.html
kappenball.html
inaugural.html
http://ml.dcs.shef.ac.uk/gpss/past_meetings.html
./inaugural.html
http://maths.dept.shef.ac.uk/maths/staff_info_452.html
http://staffwww.dcs.sheffield.ac.uk/people/N.Fusi/index.html
http://staffwww.dcs.sheffield.ac.uk/people/A.Damianou/index.html
./3PhaseData.html
./inaugural.html
./com4509_6509_2012.html
../A.Cowling/campus_only/teach/com3310.html
http://ttic.uchicago.edu/~rurtasun/tutorials/GP_tutorial.html
./mlss2012.html
http://www.kyb.mpg.de/nc/employee/details/bs.html
./matweave.html
./thematic.html
./licsb.html
./com4509_6509_2011.html
campus_only/lectureNotes.html
./interspeech_tutorial.html
./icml_tutorial.html
./thematic08.html
./thematic.html
http://www.cs.ucl.ac.uk/people/D.Barber.html
http://www.well.ox.ac.uk/~mtitsias/publications.html
./chi.html

Well Developed Object Orientated Language
• Python has the advantage that it was designed in the mid-late 1990s
• Not in the mid-late 1970s or 1980s (cf S, MATLAB)
• As a result it has a coherent and functional object system.
• It also has some attributes of functional programming.
• It provides an object system without imposing one (cf java).
In [3]:
import pandas as pd
import numpy as np

class my_data(pd.DataFrame):
def pronounce(self):
print "Extending existing objects is easy, this encourages code reuse and good programming practice."

a = my_data(np.zeros((3, 3)))
a.pronounce()
a.describe()

Extending existing objects is easy, this encourages code reuse and good programming practice.

Out[3]:
0 1 2
count 3 3 3
mean 0 0 0
std 0 0 0
min 0 0 0
25% 0 0 0
50% 0 0 0
75% 0 0 0
max 0 0 0
Development as a Numerical Platform
• Around turn of millenium, several researchers were looking for alternatives for MATLAB.
• Python seemed appropriate because of
• interactive shell,
• friendly scripting nature,
• well supported modern programming language
• Some, like me, realised lack of plotting, array types, numerical computing, fully functional shell was prohibitive
• Others, like Fernando Perez, John Hunter, Travis Oliphant set about adding this funcitonality.
• Targeted more at scientific computing rather than statistical computing.
• More recently Wes McKinney added pandas: data anlaysis in python.
• This gives data frame capabilities like in R. Proving very popular for data analysis.
• The scipy stack catpures a range of scientific computing facilities (FFT, linear algebra etc)
• scikit-learn encodes models from a machine learning perspective.
• statsmodels encodes models from a statistics perspective.
• The Jupyter notebook and the notebook viewer provide an excellent platform for sharing analysis and code.
• As a research group we moved from an entire MATLAB code base to Python.
• See blog post about it here
Software Distributions
• Code is all BSD licensed (cf R and octave which are GPL licensed)
• Companies deliver optimised python distributions for scientific computing (Enthought's Canopy and Continuum Analytics's Anaconda)
• I use Canopy, but we have Anaconda installed (free) on the University managed desktop in Sheffield (free). Commercial firms pay a charge for the full version.
Python in Action: Regression
• Like most modern languages, python is perhaps more about learning the libraries than the syntax.
• Here we use pandas to load in data.
In [4]:
import pandas as pd  # first we import the pandas library

# gender is stored as male/female. We convert it to 0/1
df['Gender'] = df['Sex'].map( {'Female': 0, 'Male': 1} ).astype(int)
df.describe()

Out[4]:
OI Age Gender
count 101.00000 101.000000 101.000000
mean 6.18495 48.940594 0.198020
std 3.17320 14.413759 0.400495
min 0.98000 -10.000000 0.000000
25% 3.75000 39.000000 0.000000
50% 5.65000 52.000000 0.000000
75% 7.90000 60.000000 0.000000
max 16.50000 73.000000 1.000000
matplotlib
• Plotting is done using the matplotlib library.
In [5]:
# import a plotting library: matplotlib
import matplotlib.pyplot as plt
# ask for plots to appear in the notebook
%matplotlib inline
fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(df.Age, df.OI)
ax.set_xlabel('Age')
ax.set_ylabel('OI')

Out[5]:
<matplotlib.text.Text at 0x10cc23950>
Removing Negative Age Values
• The plot allows us to spot the negative age value. Let's remove it.
• pandas allows for removal by indexing according to 'truth values'.
• These can also be used to make different colour plots.
In [6]:
df = df[df['Age']>0]
fig, ax = plt.subplots(figsize=(10, 7))
for symbol, sex in zip(['bo', 'ro'], ['Male', 'Female']):
lines= ax.semilogy(df.Age[df.Sex==sex], df.OI[df.Sex==sex], symbol,linewidth=3,markersize=10)
ax.set_xlabel('Age').set_fontsize(18)
ax.set_ylabel('OI').set_fontsize(18)
ax.set_title('Plot of age against OI on a log scale').set_fontsize(20)


Regression Fit
First example using statsmodels.

In [7]:
import statsmodels.formula.api as sm
df['Eins'] = 1
Y = np.log(df.OI)
Y.name = 'np.log(OI)'
X = df[['Age','Gender', 'Eins']]
result = sm.OLS( Y, X ).fit()
result.summary()
#import statsmodels.graphics.regressionplots as rp
#rp.plot_fit(results, 1)

Out[7]:
Dep. Variable: R-squared: np.log(OI) 0.262 OLS 0.247 Least Squares 17.23 Sun, 23 Nov 2014 3.96e-07 12:01:08 -61.671 100 129.3 97 137.2 2 nonrobust
coef std err t P>|t| [95.0% Conf. Int.] 0.0162 0.004 4.603 0.000 0.009 0.023 0.3189 0.116 2.757 0.007 0.089 0.548 0.8292 0.178 4.666 0.000 0.477 1.182
 Omnibus: Durbin-Watson: 3.016 2.007 0.221 2.772 -0.408 0.25 2.975 200

statsmodels seems designed for those with experience of statistics and R.

In [8]:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit regression model (using the natural log for the dependent variable)
results = smf.ols('np.log(OI)  ~ Age + Sex', data=df).fit()

# Inspect the results
results.summary()

Out[8]:
Dep. Variable: R-squared: np.log(OI) 0.262 OLS 0.247 Least Squares 17.23 Sun, 23 Nov 2014 3.96e-07 12:01:16 -61.671 100 129.3 97 137.2 2 nonrobust
coef std err t P>|t| [95.0% Conf. Int.] 0.8292 0.178 4.666 0.000 0.477 1.182 0.3189 0.116 2.757 0.007 0.089 0.548 0.0162 0.004 4.603 0.000 0.009 0.023
 Omnibus: Durbin-Watson: 3.016 2.007 0.221 2.772 -0.408 0.25 2.975 200
• There is even provision for loading in R datasets.
In [9]:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

dat = sm.datasets.get_rdataset("Guerry", "HistData").data

# Fit regression model (using the natural log of one of the regressors)
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()

# Inspect the results
results.summary()

Out[9]:
Dep. Variable: R-squared: Lottery 0.348 OLS 0.333 Least Squares 22.20 Sun, 23 Nov 2014 1.90e-08 12:01:22 -379.82 86 765.6 83 773.0 2 nonrobust
coef std err t P>|t| [95.0% Conf. Int.] 246.4341 35.233 6.995 0.000 176.358 316.510 -0.4889 0.128 -3.832 0.000 -0.743 -0.235 -31.3114 5.977 -5.239 0.000 -43.199 -19.424
 Omnibus: Durbin-Watson: 3.713 2.019 0.156 3.394 -0.487 0.183 3.003 702
sklearn
• In machine learning the scikit-learn would be more familiar.
In [10]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

X, y = df[['Age', 'Gender']], np.log(df['OI']).values[:, None]

model.fit( X, y )

score = '{0:.3f}'.format( model.score( X, y ) )
print model.coef_

[[ 0.01620833  0.31889832]]

My Own Preference
• When teaching my own preference is to derive the maximum likelihood solution and have students implement it.
• Solve $$\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}$$ for $\boldsymbol{\beta}$.
In [15]:
import numpy as np
beta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
beta = pd.DataFrame(beta, index=X.columns)
beta

Out[15]:
0
Age 0.016208
Gender 0.318898
Eins 0.829203

Although Darren Wilkinson pointed out to me at the meeting, that this is more correctly done with QR decomposition (see this page for a description).

$$\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}$$$$(\mathbf{Q}\mathbf{R})^\top (\mathbf{Q}\mathbf{R})\boldsymbol{\beta} = (\mathbf{Q}\mathbf{R})^\top \mathbf{y}$$$$\mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{y}$$$$\mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{y}$$$$\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{y}$$

So if we let $\mathbf{z} = \mathbf{Q}^\top \mathbf{y}$, $$\mathbf{R} \boldsymbol{\beta} =\mathbf{z}$$

In [16]:
import numpy as np
import scipy as sp
Q, R = np.linalg.qr(X)
beta = sp.linalg.solve_triangular(R, np.dot(Q.T, y))
beta = pd.DataFrame(beta, index=X.columns)
beta

Out[16]:
0
Age 0.016208
Gender 0.318898
Eins 0.829203