Scientific plotting in Python

A little overview

Marianne Corvellec, Physics PhD
~ PyLadies MTL ~ 2014-10-02

Read data into a dataframe

My favourite library for data analysis is pandas. (Its dataframes are similar to R's data frames.)

In [1]:
import pandas as pd

Material by Software Carpentry: http://software-carpentry.org/v5

In [2]:
pd.read_csv('https://raw.githubusercontent.com/swcarpentry/bc/master/intermediate/python/A1_mosquito_data.csv')
Out[2]:
year temperature rainfall mosquitos
0 2001 80 157 150
1 2002 85 252 217
2 2003 86 154 153
3 2004 87 159 158
4 2005 74 292 243
5 2006 75 283 237
6 2007 80 214 190
7 2008 85 197 181
8 2009 74 231 200
9 2010 74 207 184
In [3]:
data1 = pd.read_csv('https://raw.githubusercontent.com/swcarpentry/bc/master/intermediate/python/A1_mosquito_data.csv')
In [4]:
type(data1)
Out[4]:
pandas.core.frame.DataFrame

Dataframes are very popular data structures in data analysis.
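
If the URL is not reachable, the same table can be rebuilt offline. The sketch below parses the ten rows printed in Out[2] from an in-memory string. The names csv_text and data1_offline are illustrative, and the comma-separated layout with these header names is an assumption about the original file.

```python
import io

import pandas as pd

# The ten rows shown in Out[2], retyped as CSV text (assumed file layout).
csv_text = """year,temperature,rainfall,mosquitos
2001,80,157,150
2002,85,252,217
2003,86,154,153
2004,87,159,158
2005,74,292,243
2006,75,283,237
2007,80,214,190
2008,85,197,181
2009,74,231,200
2010,74,207,184
"""

# read_csv accepts any file-like object, not only paths and URLs.
data1_offline = pd.read_csv(io.StringIO(csv_text))

print(data1_offline.shape)
print(data1_offline.dtypes)
```

Every column is parsed as an integer type, since the file holds whole numbers only.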

In [5]:
data1['year']  # Access a column by its name
Out[5]:
0    2001
1    2002
2    2003
3    2004
4    2005
5    2006
6    2007
7    2008
8    2009
9    2010
Name: year, dtype: int64

This is a series (one-dimensional data structure).

In [6]:
type(data1['year'])
Out[6]:
pandas.core.series.Series
In [7]:
data1['year'].ndim
Out[7]:
1

Use the double-bracket syntax [[ ]] to keep a dataframe.

In [8]:
type(data1[['year']])
Out[8]:
pandas.core.frame.DataFrame
In [9]:
data1[['year']].ndim
Out[9]:
2

This is useful for selecting multiple columns. (Mind the vocabulary: ndim counts the axes of the object, rows and columns, not the number of columns selected.)

In [10]:
data1.columns
Out[10]:
Index([u'year', u'temperature', u'rainfall', u'mosquitos'], dtype='object')
In [11]:
data1[['year', 'temperature']]
Out[11]:
year temperature
0 2001 80
1 2002 85
2 2003 86
3 2004 87
4 2005 74
5 2006 75
6 2007 80
7 2008 85
8 2009 74
9 2010 74
In [13]:
data1.iloc[:, [0, 2]]  # Access columns by their integer positions
Out[13]:
year rainfall
0 2001 157
1 2002 252
2 2003 154
3 2004 159
4 2005 292
5 2006 283
6 2007 214
7 2008 197
8 2009 231
9 2010 207
In [14]:
data1[1:3]  # Access rows by slicing
Out[14]:
year temperature rainfall mosquitos
1 2002 85 252 217
2 2003 86 154 153
In [15]:
data1[1:6:3]
Out[15]:
year temperature rainfall mosquitos
1 2002 85 252 217
4 2005 74 292 243

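data1[1] raises a KeyError: with a single integer in plain brackets, pandas looks for a column labelled 1 rather than the second row, so what it would mean is ambiguous.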
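
For a single row, the explicit indexers remove the ambiguity. A minimal sketch, using a three-row stand-in frame (small is a hypothetical name) built from the first rows shown above:

```python
import pandas as pd

small = pd.DataFrame({
    'year': [2001, 2002, 2003],
    'temperature': [80, 85, 86],
})

row_by_position = small.iloc[1]  # second row, counting from 0
row_by_label = small.loc[1]      # row whose index label is 1

# With the default integer index, position and label coincide.
print(row_by_position['year'])

# Plain brackets with a single integer look for a *column* named 1.
try:
    small[1]
except KeyError:
    print('small[1] raises KeyError')
```

Slices in plain brackets, as in data1[1:3], keep working because pandas always interprets them as row selections.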

In [16]:
data1.mean()  # Analyze your data
Out[16]:
year           2005.5
temperature      80.0
rainfall        214.6
mosquitos       191.3
dtype: float64
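
Column means are only a first look. As one possible next step, the sketch below computes summary statistics with describe() and the correlation between rainfall and mosquito counts with corr(), using the values printed in Out[2]. The names df, summary and r are illustrative.

```python
import pandas as pd

# Rainfall and mosquito counts copied from Out[2].
df = pd.DataFrame({
    'rainfall':  [157, 252, 154, 159, 292, 283, 214, 197, 231, 207],
    'mosquitos': [150, 217, 153, 158, 243, 237, 190, 181, 200, 184],
})

summary = df.describe()                     # count, mean, std, quartiles, ...
r = df['rainfall'].corr(df['mosquitos'])    # Pearson correlation coefficient

print(summary.loc[['mean', 'min', 'max']])
print(r)
```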

Visualize data with various chart types

The traditional Python plotting library is matplotlib.

In [17]:
%matplotlib inline
import matplotlib.pyplot as plt
In [18]:
plt.plot(data1['year'], data1['temperature'])
Out[18]:
[<matplotlib.lines.Line2D at 0x7fbf38dcd1d0>]
In [19]:
# What the heck? The years on the x axis are plain integers, not dates
type(data1['year'][0])
Out[19]:
numpy.int64
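
If a real time axis is wanted, the integer years can be converted to timestamps first. A minimal sketch, assuming each year should map to 1 January of that year; years and dates are illustrative names:

```python
import pandas as pd

years = pd.Series([2001, 2002, 2003], name='year')

# Parse the years as four-digit strings; each becomes a Timestamp on 1 January.
dates = pd.to_datetime(years.astype(str), format='%Y')

print(dates.dtype)
print(dates.iloc[0])
```

matplotlib accepts such a series as x values and formats the tick labels as dates.
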
In [20]:
ax = plt.subplot()

line1 = ax.plot(data1['year'], data1['temperature'], linewidth=3)
bar2 = ax.bar(data1['year'], data1['rainfall'], color='y', align='center')
point3 = ax.plot(data1['year'], data1['mosquitos'], 'r^')

# Add some text for labels, title
ax.set_xlabel('Year')
ax.set_title('Mosquitoes')
ax.legend((line1[0], bar2[0]), ('Temperature', 'Rainfall') )

plt.show()

ggplot is a Python port of R's ggplot2 (it looks nicer).

In [21]:
from ggplot import *
In [22]:
gg = ggplot(data1, aes(x='year', y='mosquitos')) + \
    geom_point(colour='red', shape='^', size=45) + \
    geom_line(aes(x='year', y='temperature'), colour='blue', size=3) + \
    geom_bar(aes(x='year', y='rainfall'), stat="identity", fill='yellow', alpha=0.5) + \
    ggtitle("Mosquitoes") + \
    ylab("")
In [23]:
gg.draw()
Out[23]:

Scientists want heatmaps and contours and histograms...

In [24]:
import numpy as np

data2 = np.random.rand(10, 10) * 2 + np.sin(6 * np.arange(10) / (2 * np.pi))

plt.imshow(data2)
Out[24]:
<matplotlib.image.AxesImage at 0x7fbf2f426c10>
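
The expression for data2 relies on NumPy broadcasting: the length-10 sine vector is added to every row of the 10 x 10 random matrix, so each column is shifted by its own constant. A sketch with a seeded generator; the seed and the names noise, offset and demo are assumptions made for reproducibility, not part of the talk:

```python
import numpy as np

rng = np.random.RandomState(0)                     # fixed seed, for repeatable output
noise = rng.rand(10, 10) * 2                       # uniform values in [0, 2)
offset = np.sin(6 * np.arange(10) / (2 * np.pi))   # one value per column
demo = noise + offset                              # shapes (10, 10) + (10,) -> (10, 10)

print(demo.shape)
# Subtracting the noise recovers the same offset vector in every row.
print(np.allclose(demo - noise, offset))
```

Because the sine term lies in [-1, 1], some entries can be negative, which is why a later cell sets negative values to zero.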

Visualize the grid itself: pcolormesh draws the array as a mesh of cells, one per element.

In [25]:
plt.pcolormesh(data2)
Out[25]:
<matplotlib.collections.QuadMesh at 0x7fbf2ec93390>
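
One visible difference between the two calls is orientation: by default imshow puts row 0 at the top, while pcolormesh puts it at the bottom. The sketch below checks this through the y-axis limits. The Agg backend line is an assumption so that the code runs without a display (in the notebook, %matplotlib inline plays that role), and grid is an illustrative array.

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

grid = np.arange(12).reshape(3, 4)

fig, (ax_im, ax_mesh) = plt.subplots(1, 2)
ax_im.imshow(grid)
ax_mesh.pcolormesh(grid)

# imshow inverts the y axis (first limit larger than the second);
# pcolormesh keeps it increasing upwards.
print(ax_im.get_ylim())
print(ax_mesh.get_ylim())
```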

prettyplotlib makes more sense: its defaults convey information.

In [26]:
import prettyplotlib as ppl

ppl.pcolormesh(data2)
Out[26]:
In [27]:
data2pos = np.copy(data2)
data2pos[data2 < 0] = 0
ppl.pcolormesh(data2pos)
Out[27]:
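
The copy-then-mask pattern above sets every negative entry to zero. np.clip does the same in one call; a small sketch on a made-up 2 x 2 array (values is illustrative, not data from the talk):

```python
import numpy as np

values = np.array([[-0.5, 0.2],
                   [1.5, -0.1]])

# Boolean-mask approach, as in the cell above.
masked = np.copy(values)
masked[values < 0] = 0

# Equivalent one-liner: clip from below at 0, no upper bound.
clipped = np.clip(values, 0, None)

print(np.array_equal(masked, clipped))
```

Both versions leave the original array untouched, so data2 remains available for the later cells.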

seaborn is dedicated to statistical data visualization. And it's gorgeous.

In [28]:
import seaborn as sns

One-dimensional distributions:

In [29]:
sns.distplot(data2[9, :], color='blue')
Out[29]:
<matplotlib.axes.AxesSubplot at 0x7fbf2e4f6310>
In [30]:
sns.distplot(data2[:, 5], color='blue')
Out[30]:
<matplotlib.axes.AxesSubplot at 0x7fbf2e429910>

Two-dimensional distributions:

Say we have 10 random variables, given by data2[0, :], data2[1, :], etc.

Let's consider the correlation matrix for these 10 random variables.

In [31]:
corrmat = np.corrcoef(data2)
# The argument is an array in which each row represents a variable
# and each column a single observation of all those variables (see docs).

As expected, this matrix is symmetric and values on the diagonal are all 1.

In [32]:
ppl.pcolormesh(corrmat)
Out[32]:
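
The two properties stated above (symmetry, ones on the diagonal) can also be checked numerically rather than by eye. A sketch on a seeded random array; sample and c are illustrative names, and the seed is an assumption for reproducibility:

```python
import numpy as np

rng = np.random.RandomState(1)
sample = rng.rand(10, 10)      # 10 variables (rows), 10 observations each
c = np.corrcoef(sample)

print(np.allclose(c, c.T))            # symmetric
print(np.allclose(np.diag(c), 1.0))   # every variable correlates perfectly with itself
```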

With seaborn, we can get this directly (take the transpose of your array though):

In [33]:
sns.corrplot(data2.transpose())
Out[33]:
<matplotlib.axes.AxesSubplot at 0x7fbf2e2b3450>
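
The transpose comes from a difference in convention: np.corrcoef treats rows as variables, whereas column-oriented tools such as pandas (and, as the call above suggests, seaborn) treat columns as variables. A sketch of the equivalence using pandas; arr is an illustrative seeded array, not data2 itself:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2)
arr = rng.rand(10, 10)

by_rows = np.corrcoef(arr)                    # rows are the variables
by_cols = pd.DataFrame(arr.T).corr().values   # columns are the variables, hence the transpose

print(np.allclose(by_rows, by_cols))
```
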
In [34]:
corrmat  # Check numerical values
Out[34]:
array([[ 1.        ,  0.01337291,  0.78986118,  0.28298983,  0.67401585,
         0.26980475,  0.69650458,  0.4333922 ,  0.53762671,  0.62455138],
       [ 0.01337291,  1.        ,  0.24745786,  0.58477688,  0.5492957 ,
         0.41951684,  0.45357242,  0.65273709,  0.37824987,  0.6787956 ],
       [ 0.78986118,  0.24745786,  1.        ,  0.571733  ,  0.87725046,
         0.59298078,  0.83514514,  0.46487452,  0.63753963,  0.75904482],
       [ 0.28298983,  0.58477688,  0.571733  ,  1.        ,  0.7507271 ,
         0.62549639,  0.72967285,  0.6612755 ,  0.43232519,  0.76635867],
       [ 0.67401585,  0.5492957 ,  0.87725046,  0.7507271 ,  1.        ,
         0.69694054,  0.84276372,  0.58846042,  0.70434996,  0.81274116],
       [ 0.26980475,  0.41951684,  0.59298078,  0.62549639,  0.69694054,
         1.        ,  0.32776562,  0.40948753,  0.62602867,  0.54446305],
       [ 0.69650458,  0.45357242,  0.83514514,  0.72967285,  0.84276372,
         0.32776562,  1.        ,  0.68919873,  0.60270067,  0.83108509],
       [ 0.4333922 ,  0.65273709,  0.46487452,  0.6612755 ,  0.58846042,
         0.40948753,  0.68919873,  1.        ,  0.6044282 ,  0.76875362],
       [ 0.53762671,  0.37824987,  0.63753963,  0.43232519,  0.70434996,
         0.62602867,  0.60270067,  0.6044282 ,  1.        ,  0.55113825],
       [ 0.62455138,  0.6787956 ,  0.75904482,  0.76635867,  0.81274116,
         0.54446305,  0.83108509,  0.76875362,  0.55113825,  1.        ]])

Some image processing with scikit-image...

In [35]:
from skimage.filter.rank import median  # newer scikit-image releases name this module skimage.filters.rank
from skimage.morphology import disk
from skimage import img_as_ubyte
In [36]:
image_data = np.copy(data2)
image_data[data2 > 0.99] = 1
image_data[data2 < -0.99] = -1
In [37]:
ppl.pcolormesh(image_data)
Out[37]:
In [38]:
image_data = img_as_ubyte(image_data)
/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py:107: UserWarning: Possible precision loss when converting from float64 to uint8
  "%s to %s" % (dtypeobj_in, dtypeobj))
In [39]:
filtered = median(image_data, disk(2))
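
disk(2) is the neighbourhood over which each median is taken. The sketch below rebuilds such a footprint with plain NumPy, assuming it contains every pixel whose distance from the centre is at most the radius. disk_footprint is a hypothetical helper written for illustration, not a scikit-image function.

```python
import numpy as np

def disk_footprint(radius):
    """Return a (2r+1, 2r+1) array of 0/1 marking pixels within `radius` of the centre."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return (x ** 2 + y ** 2 <= radius ** 2).astype(np.uint8)

fp = disk_footprint(2)
print(fp)
print(fp.sum())  # number of pixels that enter each median
```

On a 10 x 10 image this 5 x 5 neighbourhood is large, which is consistent with the heavily smoothed result shown below.
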
In [40]:
plt.imshow(filtered, vmin=0, vmax=255)
Out[40]:
<matplotlib.image.AxesImage at 0x7fbf2b0a9f50>
In [41]:
filtered
Out[41]:
array([[255, 255, 255, 255, 242, 239, 242, 255, 255, 255],
       [255, 255, 255, 255, 255, 242, 242, 255, 255, 255],
       [255, 255, 255, 255, 242, 209, 209, 255, 255, 255],
       [255, 255, 255, 255, 255, 155, 168, 255, 255, 255],
       [255, 255, 255, 255, 168, 155, 168, 255, 255, 255],
       [243, 255, 255, 255, 168,  51, 168, 255, 255, 255],
       [243, 255, 255, 255, 114,  51,  51, 255, 255, 255],
       [255, 255, 255, 255, 114,  51,  51, 255, 255, 255],
       [255, 255, 255, 255, 153,  98,  98, 255, 255, 255],
       [255, 255, 255, 255, 186,  51,  98, 255, 255, 255]], dtype=uint8)
In [42]:
fig, ax = plt.subplots()
ax.plot(filtered)
Out[42]:
[<matplotlib.lines.Line2D at 0x7fbf2afe3390>,
 <matplotlib.lines.Line2D at 0x7fbf2afe3610>,
 <matplotlib.lines.Line2D at 0x7fbf2afe3850>,
 <matplotlib.lines.Line2D at 0x7fbf2afe3a10>,
 <matplotlib.lines.Line2D at 0x7fbf2afe3bd0>,
 <matplotlib.lines.Line2D at 0x7fbf2afe3d90>,
 <matplotlib.lines.Line2D at 0x7fbf2b12e590>,
 <matplotlib.lines.Line2D at 0x7fbf2afed150>,
 <matplotlib.lines.Line2D at 0x7fbf2afed310>,
 <matplotlib.lines.Line2D at 0x7fbf2afed4d0>]
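
The list of ten Line2D objects appears because plot, given a 2-D array, draws one line per column against the row index. A small sketch with a 10 x 3 array; table is illustrative, and the Agg backend is an assumption so that the code runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

table = np.arange(30).reshape(10, 3)  # 10 rows, 3 columns

fig, ax = plt.subplots()
lines = ax.plot(table)

print(len(lines))  # one Line2D per column
# Each line holds one column, plotted against the row index 0..9.
print(np.array_equal(lines[1].get_ydata(), table[:, 1]))
```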

Interactivity with plotly...

In [43]:
import plotly.plotly as py
In [44]:
py.sign_in('DemoAccount', 'lr1c37zw81')
In [45]:
py.iplot_mpl(fig, strip_style = True)

Thank you!

In [46]:
# CSS styling within IPython notebook
from IPython.display import HTML, display
import urllib2
display(HTML(urllib2.urlopen('https://raw.githubusercontent.com/plotly/python-user-guide/master/custom.css').read()))