
Data Analysis Tools: Python and Jupyter¶

This is a Jupyter Notebook, a browser-based interface that works with a wide variety of programming languages. It's great for playing around with data, taking notes and making plots. If used properly, the notebook can be included directly as a supplement to your manuscript, making all your data processing transparent and reproducible!

Jupyter allows you to keep all your notes, code and plots in one place. The workspace is divided into 'cells', which can be either 'Markdown' (like this one, for taking notes) or 'Code'.

The main idea:¶

Not Like this!

For Example, in Python:¶

In [1]:

# first, import a few useful packages
import numpy as np  # the 'numerical python' package
import pandas as pd  # a 'spreadsheet' type library for data handling

import matplotlib.pyplot as plt  # for plotting

# tell plots to display inline
%matplotlib inline

plt.rcParams['figure.dpi'] = 100  # make the displayed figures slightly bigger.

Read in some data¶

In [2]:

df = pd.read_excel('data.xls')

In [3]:

# this uses pandas' built in plotting functions, rather than matplotlib directly
ax = df.plot(x='x', y='y', kind='scatter')

Looks like a periodic function with an underlying polynomial trend...
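If you don't have `data.xls` to hand, a stand-in dataset with the same general shape can be generated to follow along. This is only a sketch: the coefficients, noise level, x range, random seed and the name `df_demo` are all made up for illustration, and are not the values behind the original file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.xls: a sine wave on top of a quadratic trend.
# Every number below is invented for illustration.
rng = np.random.default_rng(42)  # fixed seed so the example is repeatable
x = np.linspace(0, 20, 200)
y = 3.0 * np.sin(x) + 1.0 + 0.5 * x + 0.05 * x**2 + rng.normal(0, 0.5, x.size)

df_demo = pd.DataFrame({'x': x, 'y': y})  # same column names as the real data
print(df_demo.head())  # show the first five rows
```

Assigning this to `df` instead of `df_demo` should let the rest of the cells run unchanged, although the fitted numbers will of course differ from the real data.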

Fitting Data¶

There are heaps of ways to fit models to data in Python. This example uses the scipy (Scientific Python) package, but other notable tools include:

statsmodels (general statistics)
sklearn (machine-learning)
PyMC3 (Python Monte Carlo)
ggplot (a Python port of R's ggplot2, for plotting and basic model fitting).

In [4]:

from scipy.optimize import curve_fit

Define a function that we want to fit:

$y = S \sin(x) + p_0 + p_1 x + p_2 x^2$

In [5]:

def fitfn(x, s, p0, p1, p2):
    return s * np.sin(x) + p0 + p1 * x + p2 * x**2

Fit the model!

In [6]:

p, cov = curve_fit(f=fitfn, xdata=df.x, ydata=df.y)
# note: `curve_fit` returns both the parameters and the covariance matrix of the parameters.
#       I've assigned them separate variable names here by specifying two variables on the
#       left of the `=`: the parameters are called `p`, and the matrix is called `cov`.
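The covariance matrix is useful for more than bookkeeping. A common convention is to take the square root of its diagonal as a one-standard-deviation uncertainty on each parameter. The sketch below assumes a well-behaved fit and uses invented demonstration data (the names `x_demo`, `y_demo`, `true_p`, `p_demo` and `cov_demo` are hypothetical); `fitfn` is repeated so the block runs on its own.

```python
import numpy as np
from scipy.optimize import curve_fit

def fitfn(x, s, p0, p1, p2):
    return s * np.sin(x) + p0 + p1 * x + p2 * x**2

# Made-up demonstration data (not the contents of data.xls).
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 20, 200)
true_p = (3.0, 1.0, 0.5, 0.05)  # hypothetical 'true' parameters
y_demo = fitfn(x_demo, *true_p) + rng.normal(0, 0.5, x_demo.size)

p_demo, cov_demo = curve_fit(f=fitfn, xdata=x_demo, ydata=y_demo)

# The diagonal of the covariance matrix holds the variance of each parameter,
# so its square root gives a one-standard-deviation uncertainty.
perr = np.sqrt(np.diag(cov_demo))

for name, val, err in zip(['s', 'p0', 'p1', 'p2'], p_demo, perr):
    print('{}: {:.3f} +/- {:.3f}'.format(name, val, err))
```

The off-diagonal elements of the matrix describe how the parameters co-vary, which this simple summary ignores.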

Plot the results

In [7]:

# let's see how well it's done
ax = df.plot(x='x', y='y', kind='scatter')  # make a scatter plot

pred = fitfn(df.x, *p)  # calculate the fitted model
# note: using `*p` like this 'unpacks' the four values of `p`
# to populate the four parameters that `fitfn` is expecting.

ax.plot(df.x, pred, c='r')  # plot the model fit.

# an example of number formatting in text.
plt.text(.03, .97, '{:.1f} sin(x) + {:.1f} + {:.1f} x + {:.1f} $x^2$'.format(*p),
         va='top', ha='left', transform=ax.transAxes)  # write the fit equation on the plot

Out[7]:

<matplotlib.text.Text at 0x7f2e6d55ec18>
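The `*p` unpacking and the `.format()` number formatting used in the cell above can be tried in isolation. A minimal sketch, using a hypothetical helper `describe` and invented parameter values:

```python
def describe(s, p0, p1, p2):
    # build the same style of label that is written on the plot
    return '{:.1f} sin(x) + {:.1f} + {:.1f} x + {:.1f} x^2'.format(s, p0, p1, p2)

params = [3.02, 1.27, 0.48, 0.11]  # invented values, for illustration only

# passing each value by hand...
label_a = describe(params[0], params[1], params[2], params[3])
# ...is equivalent to unpacking the whole list with `*`
label_b = describe(*params)

print(label_b)  # 3.0 sin(x) + 1.3 + 0.5 x + 0.1 x^2
```

`{:.1f}` rounds each number to one decimal place; changing the `1` changes the number of decimals shown.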

Residuals¶

In [8]:

yr = df.y - pred  # calculate the residuals

# This example doesn't use pandas' built-in plotting functions.
# We're using matplotlib directly here, as it offers more customisable plots.

# create a figure with three axes
fig, (fax, rax, hax) = plt.subplots(1, 3, figsize=[10,3])
# note the syntax on the left of the `=` sign - multiple variables
# are assigned at once!

# plot the data and the best fit line
fax.scatter(df.x, df.y)
fax.plot(df.x, pred, c='r')
fax.set_ylabel('y')
fax.set_xlabel('x')

# plot the residuals
rax.scatter(df.x, yr)
rax.set_xlabel('x')

# plot a histogram of the residuals
hax.hist(yr, orientation='horizontal')
hax.set_xlabel('n')

# a loop! This cycles through the residual axes [rax, hax], and
# adds a y label and a dashed line at zero.
for ax in [rax, hax]:
    ax.axhline(0, c='k', ls='dashed')  # dashed line at zero
    ax.set_ylabel('Residual ($y_{obs} - y_{pred}$)')  # y label (note use of TeX syntax inside $ for subscripts)

    
# calculate goodness of fit
SStot = np.sum((df.y - df.y.mean())**2)  # total sum of squares
SSres = np.sum(yr**2)  # residual sum of squares
R2 = 1 - (SSres / SStot)  # calculate R2

# write R2 on plot
fax.text(.02, .98, '$R^2$: {:.3f}'.format(R2), va='top', ha='left', transform=fax.transAxes)

# shift around the axes so they look nice
fig.tight_layout()
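The goodness-of-fit calculation in the cell above can be wrapped in a small reusable function. This is a sketch of the same $R^2 = 1 - SS_{res} / SS_{tot}$ formula; the name `r_squared` and the toy numbers are hypothetical.

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_obs - y_pred)**2)        # residual sum of squares
    ss_tot = np.sum((y_obs - y_obs.mean())**2)  # total sum of squares
    return 1 - ss_res / ss_tot

y_obs = [1.0, 2.0, 3.0, 4.0]  # toy observations
print(r_squared(y_obs, y_obs))                 # a perfect prediction gives 1.0
print(r_squared(y_obs, [2.5, 2.5, 2.5, 2.5]))  # predicting the mean gives 0.0
```

With the notebook's variables, `r_squared(df.y, pred)` should give the same value as `R2`.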

Want to try this at home?¶

Complete Beginner¶

If you're completely new to Python, the easiest way in is to download and install a pre-packaged version that contains everything you need and will 'just work', like Continuum Analytics 'Anaconda'. Download the Python 3.6 version from that link, and everything should work. It even comes with a nice graphical interface to start Jupyter and other Python apps, and manage which packages you have installed.

Getting started with Python¶

Python is a famously simple programming language. It still has a learning curve, but it's easier than most. A good place to start is Codecademy, which teaches you the very basics of the language and syntax.

Getting started with Jupyter¶

I could waffle here, but this is more useful.

Pro Tip: There's a list of Keyboard Shortcuts in the Help menu at the top.

Windows Users¶

Some people have been struggling to launch Jupyter from the 'Anaconda Navigator'. I'm not sure why this is (possibly something to do with how Windows handles web browsers?), but here is a simple workaround:

Go to the start menu, and search for 'Anaconda Prompt'. Open it. This is different from a normal Windows command prompt, because it starts within your Anaconda 'virtual environment'.
In the terminal that appears, navigate to the folder you want to work in (using cd), and type jupyter notebook to start Jupyter.
Leave this terminal window open in the background - this is where Jupyter is actually running; the browser is just an interface. Close it when you're done.
Reconsider your decision to use Windows. Anything that isn't 'point and click' is easier on Mac or Linux :D

Python Data Science Handbook¶

This book by Jake VanderPlas is excellent, and is entirely written in Jupyter Notebooks! Read on for lists of various useful packages.

Some useful packages¶

Python is a 'modular' language - it's not like Matlab, which includes pretty much everything 'out of the box'. This is necessary because Python is a general-purpose language, which can be used for anything from data analysis to making toast(?!). If Python came with everything it can do 'in the box', it would be unfeasibly massive.

This means that most of your work in Python will rely on packages, of which there were at least 117,819 at the time of writing. The first step to doing anything useful in Python is finding the best package for the job. Google and colleagues are good resources here!

Some of the most useful packages that you might routinely use in data science are:

General calculations¶

numpy - efficient numeric operations and handling of multi-dimensional arrays.
theano - GPU accelerated multi-dimensional array operations.
scipy - the Scientific Python library. The wider SciPy ecosystem includes numpy and matplotlib, but there is also a dedicated scipy library containing all sorts of scientifically useful functions.

Data manipulation / import¶

pandas - a 'Data Frame' library for dealing with tabular data. Includes lots of useful data import/export functions and Excel integration.
pytables - a package for dealing with hierarchical datasets (based on the HDF5 data format).

Plotting¶

matplotlib - a plotting library originally based on Matlab's plotting functions. A good place to start. The Matplotlib Gallery contains a wide array of example plot types and the code to make them.
Bokeh - a plotting library built around interactivity, in the style of D3.js. Particularly good for making plots to display on websites.
Plotly - another library geared towards interactivity and sharing plots online. Some of their services require money, though...
ggplot - A Python re-write of R's popular ggplot2 library. Lots of convenient plotting functions, which would be a good place to start if you're an R user.
and many more

Statistics / Analysis¶

scipy.stats - the statistics part of the scipy library
statsmodels - a wide variety of statistical functions.
scikit-learn - a machine-learning focussed library with an emphasis on data classification. Excellent for Principal Component Analysis and multi-dimensional data clustering analysis.

Image Manipulation¶

scikit-image - image processing and manipulation tools.

Monte-Carlo Simulation¶

PyMC3 - a framework for Monte-Carlo computation.

If you've installed Anaconda, as instructed above, it comes pre-installed with many of the most useful packages for data science. If you find you need one that isn't installed (you get an error when you try to import it), you can install most of them either through the Anaconda Navigator, or in a terminal by typing conda install XXX, where XXX is the name of the package you want.
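Rather than waiting for an import error, you can check up front which packages are available. A minimal sketch using the standard library's `importlib`; the helper name `is_installed` and the `wanted` list are just examples to edit.

```python
import importlib.util

# Package names to check; edit this list to suit your own work.
wanted = ['numpy', 'scipy', 'pandas', 'matplotlib']

def is_installed(name):
    # find_spec returns None when a top-level package cannot be found
    return importlib.util.find_spec(name) is not None

for name in wanted:
    if is_installed(name):
        print('{}: installed'.format(name))
    else:
        print('{}: missing (try: conda install {})'.format(name, name))
```

Note that the name you import is not always the name you install: scikit-learn, for example, is imported as `sklearn`.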

To actually use these packages, you need to import them at the top of your Python code, for example:

import numpy as np

This imports the numpy package and makes all of its functions accessible under the name np in your Python session. In a Jupyter Notebook, you can access the functions by typing np. followed by the [Tab] key to list all the functions available.
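As a quick illustration of working through the `np` name, the sketch below calls two numpy functions and then lists some of the names the package exposes, which is roughly what the [Tab] key shows interactively. The variable names `values` and `public` are arbitrary.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

print(np.mean(values))  # the average of the array
print(np.sqrt(values))  # element-wise square root

# dir() lists the names inside a package; skip the private ones
# (those starting with an underscore).
public = [n for n in dir(np) if not n.startswith('_')]
print(len(public), 'public names, for example:', public[:5])
```

Any name works after `as`, but `np` is the convention you'll see in almost all numpy examples.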