This is a Jupyter Notebook, a browser-based interface that works with a wide variety of programming languages. It's great for playing around with data, taking notes and making plots. If used properly, the notebook can be included directly as a supplement to your manuscript, making all your data processing transparent and reproducible!
Jupyter allows you to keep all your notes, code and plots in one place. The workspace is divided into 'cells', which can be either 'Markdown' (like this one, for taking notes) or 'Code'.
# first, import a few useful packages
import numpy as np # the 'numerical python' package
import pandas as pd # a 'spreadsheet' type library for data handling
import matplotlib.pyplot as plt # for plotting
# tell plots to display inline
%matplotlib inline
plt.rcParams['figure.dpi'] = 100 # make the displayed figures slightly bigger.
df = pd.read_excel('data.xls')
# this uses pandas' built in plotting functions, rather than matplotlib directly
ax = df.plot(x='x', y='y', kind='scatter')
Looks like a periodic function with an underlying polynomial trend...
There are heaps of ways to fit models to data in Python. This example uses the scipy (Scientific Python) package, but other notable tools include:
statsmodels (general statistics)
sklearn (machine-learning)
PyMC3 (Python Monte Carlo)
ggplot2 (plotting and basic model fitting).
from scipy.optimize import curve_fit
Define a function that we want to fit:
$y = S \sin(x) + p_0 + p_1 x + p_2 x^2$
def fitfn(x, s, p0, p1, p2):
    return s * np.sin(x) + p0 + p1 * x + p2 * x**2
Fit the model!
p, cov = curve_fit(f=fitfn, xdata=df.x, ydata=df.y)
# note: `curve_fit` returns both the parameters and the covariance matrix of the parameters.
# I've assigned them separate variable names here by specifying two variables on the
# left of the `=`: the parameters are called `p`, and the matrix is called `cov`.
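The covariance matrix is worth keeping: the square roots of its diagonal give one-standard-deviation uncertainties for each fitted parameter. Here's a minimal sketch of that idea, assuming synthetic data in place of data.xls (the 'true' parameter values, noise level and random seed are made up for illustration, and `perr` is just a name chosen here):

```python
import numpy as np
from scipy.optimize import curve_fit

def fitfn(x, s, p0, p1, p2):
    """Sine wave on top of a second-order polynomial."""
    return s * np.sin(x) + p0 + p1 * x + p2 * x**2

# synthetic data standing in for data.xls (all values are illustrative assumptions)
rng = np.random.default_rng(42)
true_p = (3.0, 1.0, 0.5, 0.1)
x = np.linspace(0, 20, 200)
y = fitfn(x, *true_p) + rng.normal(0, 0.5, x.size)

# fit the model, keeping both the parameters and their covariance matrix
p, cov = curve_fit(f=fitfn, xdata=x, ydata=y)

# 1-sigma uncertainty on each parameter: square root of the covariance diagonal
perr = np.sqrt(np.diag(cov))

for name, val, err in zip(['s', 'p0', 'p1', 'p2'], p, perr):
    print('{}: {:.2f} +/- {:.2f}'.format(name, val, err))
```

If the fit is behaving, the recovered parameters should sit within a few uncertainties of the values used to generate the data.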
Plot the results
# let's see how well it's done
ax = df.plot(x='x', y='y', kind='scatter') # make a scatter plot
pred = fitfn(df.x, *p) # calculate the fitted model
# note: using `*p` like this 'unpacks' the four values of `p`
# to populate the four parameters that `fitfn` is expecting.
ax.plot(df.x, pred, c='r') # plot the model fit.
# an example of number formatting in text.
plt.text(.03, .97, '{:.1f} sin(x) + {:.1f} + {:.1f} x + {:.1f} $x^2$'.format(*p),
va='top', ha='left', transform=ax.transAxes) # write the fit equation on the plot
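The `*` unpacking and the `{:.1f}` number formatting used above can be tried on their own, away from the plot. A small sketch, where `describe` and `values` are hypothetical names made up for this illustration:

```python
def describe(a, b, c):
    """Return a short string built from three positional arguments."""
    return '{} {} {}'.format(a, b, c)

values = [1.234, 5.678, 9.1011]

# these two calls are equivalent: `*values` spreads the list
# across the three positional parameters
explicit = describe(values[0], values[1], values[2])
unpacked = describe(*values)

# `.format` accepts unpacked arguments too; `{:.1f}` rounds to one decimal place
label = '{:.1f} sin(x) + {:.1f} x'.format(*values[:2])
print(label)
```

The number of unpacked values has to match the number of parameters (or `{}` placeholders) waiting to receive them.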
yr = df.y - pred # calculate the residuals
# This example doesn't use pandas' built-in plotting functions.
# We're using matplotlib directly here, as it offers more customisable plots.
# create a figure with three axes
fig, (fax, rax, hax) = plt.subplots(1, 3, figsize=[10,3])
# note the syntax on the left of the `=` sign - multiple variables
# are assigned at once!
# plot the data and the best fit line
fax.scatter(df.x, df.y)
fax.plot(df.x, pred, c='r')
fax.set_ylabel('y')
fax.set_xlabel('x')
# plot the residuals
rax.scatter(df.x, yr)
rax.set_xlabel('x')
# plot a histogram of the residuals
hax.hist(yr, orientation='horizontal')
hax.set_xlabel('n')
# a loop! This cycles through the residual axes [rax, hax], and
# adds a y label and a dashed line at zero.
for ax in [rax, hax]:
    ax.axhline(0, c='k', ls='dashed') # dashed line at zero
    ax.set_ylabel('Residual ($y_{obs} - y_{pred}$)') # y label (note use of TeX syntax inside $ for subscripts)
# calculate goodness of fit
SStot = np.sum((df.y - df.y.mean())**2) # total sum of squares
SSres = np.sum(yr**2) # residual sum of squares
R2 = 1 - (SSres / SStot) # calculate R2
# write R2 on plot
fax.text(.02, .98, '$R^2$: {:.3f}'.format(R2), va='top', ha='left', transform=fax.transAxes)
# shift around the axes so they look nice
fig.tight_layout()
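To get a feel for what $R^2$ is telling you, it helps to check the two limiting cases. This is a small sketch using a hypothetical helper, `r_squared` (not part of any package), and four made-up data points:

```python
import numpy as np

def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_obs - y_pred)**2)        # residual sum of squares
    ss_tot = np.sum((y_obs - y_obs.mean())**2)  # total sum of squares
    return 1 - ss_res / ss_tot

y_obs = np.array([1.0, 2.0, 3.0, 4.0])

# a model that reproduces the data exactly
perfect = r_squared(y_obs, y_obs)

# a 'model' that only ever predicts the mean of the data
baseline = r_squared(y_obs, np.full_like(y_obs, y_obs.mean()))

print(perfect, baseline)
```

A perfect model gives 1, and a model that does no better than the mean gives 0, so values close to 1 mean the model explains most of the variance in the data.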
If you're completely new to Python, the easiest way in is to download and install a pre-packaged version that contains everything you need and will 'just work', like Continuum Analytics' 'Anaconda'. Download the Python 3.6 version from that link, and everything should work. It even comes with a nice graphical interface to start Jupyter and other Python apps, and to manage which packages you have installed.
Python is a famously simple programming language. It still has a learning curve, but it's easier than most. A good place to start is Codecademy, which teaches you the very basics of the language and its syntax.
I could waffle here, but this is more useful.
Pro Tip: There's a list of Keyboard Shortcuts in the Help menu at the top.
Some people have been struggling to launch Jupyter from the 'Anaconda Navigator'. I'm not sure why this is (possibly something to do with how Windows handles web browsers?), but there's a simple workaround: open a terminal, navigate to the folder you want to work in (using cd), and type jupyter notebook to start Jupyter.
This book by Jake VanderPlas is excellent, and is entirely written in Jupyter Notebooks! Read on for lists of various useful packages.
Python is a 'modular' language - it's not like Matlab, which includes pretty much everything 'out of the box'. This is necessary because Python is a general-purpose language, which can be used for anything from data analysis to making toast(?!). If Python came with everything it can do 'in the box', it would be unfeasibly massive.
This means that most of your work in Python will rely on packages, of which there are at least 117,819. The first step to doing anything useful in Python is finding the best package for the job. Google and colleagues are good resources here!
Some of the most useful packages that you might routinely use in data science are:
scipy library with all sorts of scientifically useful functions in.
scipy library
If you've installed Anaconda, as instructed above, it comes pre-installed with many of the most useful packages for data science. If you find you need one that isn't installed (you get an error when you try to import it), you can install most of them either through the Anaconda Navigator, or in a terminal by typing conda install XXX, where XXX is the name of the package you want.
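If you'd rather check whether a package is available before importing it (instead of waiting for the error), Python's standard library can tell you. A minimal sketch, where `is_installed` is a hypothetical helper written for this example and the last package name is deliberately made up:

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be found by Python's import system."""
    return importlib.util.find_spec(package_name) is not None

for name in ['json', 'numpy', 'not_a_real_package_xyz']:
    if is_installed(name):
        status = 'installed'
    else:
        status = 'missing - try `conda install {}`'.format(name)
    print('{}: {}'.format(name, status))
```

Bear in mind that the name you import isn't always the name you install: for example, sklearn is installed as scikit-learn.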
To actually use these packages, you need to import them at the top of your Python code, for example:
import numpy as np
This will import all the functions in the numpy package, and make them accessible within the np variable in your Python session. In a Jupyter Notebook, you can access the functions by typing 'np.' followed by the [Tab] key to list all the functions available.
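The same pattern works for any package. The sketch below uses the built-in math package as a stand-in for numpy (so it runs without installing anything), with `m` as an arbitrary short name, and uses `dir` to list roughly what Tab completion would show you:

```python
import math as m  # `m` is now shorthand for the math package

# everything in the package is reached through the `m.` prefix
root = m.sqrt(16)

# roughly what Tab completion shows: the public names defined in the package
public_names = [name for name in dir(m) if not name.startswith('_')]

print(root)
print(public_names[:5])
```

The short name after `as` is entirely up to you, but sticking to the common conventions (np, pd, plt) makes your code easier for other people to read.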