Intermediate Scientific Python

This lesson assumes some programming background, but not necessarily with Python; it therefore moves at a faster pace than a novice lesson.

We will cover the following:

  1. Reading and manipulating tabular data
  2. Visualization and plotting
  3. Modularization and documentation
  4. Defensive programming
  5. Writing command line scripts with Python

This lesson is based heavily on the Software Carpentry python-intermediate-mosquitoes lesson, and uses its datasets. It also includes some mixed-in components from the python-novice-inflammation lesson.

Getting started

We are interested in understanding the relationship between the weather and the number of mosquitos occurring in a particular year, so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites, we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem come from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:

In [1]:
%cat A1_mosquito_data.csv | head -n 5
year,temperature,rainfall,mosquitos
2001,80,157,150
2002,85,252,217
2003,86,154,153
2004,87,159,158

And we have five files to work with:

In [2]:
%ls *.csv
A1_mosquito_data.csv  A3_mosquito_data.csv  B2_mosquito_data.csv
A2_mosquito_data.csv  B1_mosquito_data.csv

Note: commands preceded with a % are known as "magics". These are specific to the notebook environment; they are not Python commands.

Since this is tabular data, our tool of choice is pandas, a library that provides special data structures for doing fast numerical operations on tabular data. Internally, pandas uses numpy to do the heavy lifting.

In [3]:
import pandas as pd

Importing a library in this way allows us to use the components defined inside of it. First, we'll read in a single dataset using the pandas.read_csv function:

In [4]:
pd.read_csv('A1_mosquito_data.csv')
Out[4]:
year temperature rainfall mosquitos
0 2001 80 157 150
1 2002 85 252 217
2 2003 86 154 153
3 2004 87 159 158
4 2005 74 292 243
5 2006 75 283 237
6 2007 80 214 190
7 2008 85 197 181
8 2009 74 231 200
9 2010 74 207 184

This reads the CSV from disk and deserializes it into a pandas.DataFrame. But unless we attach a name to the DataFrame the function returns, we cannot keep working with the object in memory.

In [5]:
data = pd.read_csv('A1_mosquito_data.csv')
In [6]:
data
Out[6]:
year temperature rainfall mosquitos
0 2001 80 157 150
1 2002 85 252 217
2 2003 86 154 153
3 2004 87 159 158
4 2005 74 292 243
5 2006 75 283 237
6 2007 80 214 190
7 2008 85 197 181
8 2009 74 231 200
9 2010 74 207 184

Now we can refer to the DataFrame directly, without having to read it from disk every time.

A bit about names. If I attach the name weight_kg to the value 55:

In [7]:
weight_kg = 55

I can then use this value to calculate the same weight in weight_lb:

In [8]:
weight_lb = 2.2 * weight_kg
In [9]:
weight_lb
Out[9]:
121.00000000000001

If I change my weight_kg to 66, what is the current value of weight_lb?

In [10]:
weight_kg = 66
In [11]:
weight_lb
Out[11]:
121.00000000000001

There's no change. This is because the name weight_lb is attached to a value, in this case a floating point number. The value has no concept of how it came to be.

In [12]:
weight_lb = 2.2 * weight_kg
In [13]:
weight_lb
Out[13]:
145.20000000000002

Back to our DataFrame. A DataFrame is an object. We can see what type of object it is with the Python builtin, type:

In [14]:
type(data)
Out[14]:
pandas.core.frame.DataFrame

It turns out that in Python, everything is an object. We'll see what this means as we go, but the most important aspect of this is that in Python we have names, and we assign these to objects. Any name can point to any object, and more than one name can point to a single object.
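A quick illustration of this point with plain Python: two names can refer to the same object, so a change made through one name is visible through the other.

```python
a = [1, 2, 3]
b = a          # b now points to the same list object as a

b.append(4)    # modify the object through the name b

# the change is visible through a, since both names point to one object
print(a)       # [1, 2, 3, 4]
print(a is b)  # True: both names refer to the same object
```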

Anyway, a DataFrame allows us to get at individual components of our tabular data. We can get single columns like:

In [15]:
data['year']
Out[15]:
0    2001
1    2002
2    2003
3    2004
4    2005
5    2006
6    2007
7    2008
8    2009
9    2010
Name: year, dtype: int64

Or multiple columns with:

In [16]:
data[['rainfall', 'temperature']]
Out[16]:
rainfall temperature
0 157 80
1 252 85
2 154 86
3 159 87
4 292 74
5 283 75
6 214 80
7 197 85
8 231 74
9 207 74

Slicing can be used to get back subsets of rows:

In [17]:
data[0:2]
Out[17]:
year temperature rainfall mosquitos
0 2001 80 157 150
1 2002 85 252 217

Python indices are 0-based, meaning counting goes as 0, 1, 2, 3...; this means that the first row is row 0, the second row is row 1, etc. It's best to refer to row 0 as the "zeroth row" to avoid confusion.

This slice should be read as "get the 0th element up to and not including the 2nd element". The "not including" is important, and the cause of much initial frustration. It does take some getting used to.
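The same half-open convention applies to any Python sequence, so a plain list makes a quick sandbox for checking slice behavior:

```python
letters = ['a', 'b', 'c', 'd', 'e']

print(letters[0:2])  # ['a', 'b'] -- elements 0 and 1, not element 2
print(letters[2:5])  # ['c', 'd', 'e']

# a handy consequence: slices n:m and m:p join with no overlap and no gap
print(letters[0:2] + letters[2:5])  # the whole list again
```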

What if we want a single row?

In [18]:
data[1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1875             try:
-> 1876                 return self._engine.get_loc(key)
   1877             except KeyError:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()

KeyError: 1

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-18-c805864c0d75> in <module>()
----> 1 data[1]

/usr/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1990             return self._getitem_multilevel(key)
   1991         else:
-> 1992             return self._getitem_column(key)
   1993 
   1994     def _getitem_column(self, key):

/usr/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1997         # get column
   1998         if self.columns.is_unique:
-> 1999             return self._get_item_cache(key)
   2000 
   2001         # duplicate columns & possible reduce dimensionality

/usr/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1343         res = cache.get(item)
   1344         if res is None:
-> 1345             values = self._data.get(item)
   1346             res = self._box_item_values(item, values)
   1347             cache[item] = res

/usr/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3223 
   3224             if not isnull(item):
-> 3225                 loc = self.items.get_loc(item)
   3226             else:
   3227                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1876                 return self._engine.get_loc(key)
   1877             except KeyError:
-> 1878                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1879 
   1880         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()

KeyError: 1

For a DataFrame, this is ambiguous, since a single value is interpreted as a column name. We can only get at rows by slicing at the top level:

In [19]:
data[1:2]
Out[19]:
year temperature rainfall mosquitos
1 2002 85 252 217

Or we could use .iloc:

In [20]:
data.iloc[1]
Out[20]:
year           2002
temperature      85
rainfall        252
mosquitos       217
Name: 1, dtype: int64

Getting a single row in this way returns a Series:

In [21]:
type(data.iloc[1])
Out[21]:
pandas.core.series.Series

A Series is a 1-D column of values, all having the same datatype. Since each of our columns holds integers, we got a Series with dtype int64 this time. If we had columns with, e.g., strings, then we'd get back dtype object, which is a catchall for pandas.
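To see the object dtype in action, here is a minimal sketch with a made-up DataFrame (the column names here are illustrative, not from our dataset):

```python
import pandas as pd

df = pd.DataFrame({'site': ['A1', 'A2', 'B1'],   # strings
                   'count': [150, 217, 153]})    # integers

print(df['site'].dtype)   # object -- pandas' catchall for strings
print(df['count'].dtype)  # int64
```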

We can also get the data in our Series as a raw numpy array:

In [22]:
type(data.iloc[1].values)
Out[22]:
numpy.ndarray

Pandas is a relatively young library, but it's built on top of the venerable numpy array, which makes it possible to do fast numerical work in Python. A Series is basically a 1-D numpy array with the ability to select by labeled indices.
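For example, the row we pulled out above is a Series whose labels are the column names, so we can select from it by label. A minimal sketch with a hand-built Series (constructed directly rather than pulled from the file):

```python
import pandas as pd

row = pd.Series({'year': 2002, 'temperature': 85,
                 'rainfall': 252, 'mosquitos': 217})

print(row['rainfall'])              # select a single value by label
print(row[['year', 'mosquitos']])   # or several labels at once
```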

Subsetting data

More usefully than simple slicing, we can use boolean indexing to subselect our data. Say we want only data for years beyond 2005?

In [23]:
data[data['year'] > 2005]
Out[23]:
year temperature rainfall mosquitos
5 2006 75 283 237
6 2007 80 214 190
7 2008 85 197 181
8 2009 74 231 200
9 2010 74 207 184

There's no magic here; we get a boolean index directly from a comparison:

In [24]:
gt_2005 = data['year'] > 2005
gt_2005
Out[24]:
0    False
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
9     True
Name: year, dtype: bool

And using this Series of bools will then give only the rows for which the Series had True:

In [25]:
data[gt_2005]
Out[25]:
year temperature rainfall mosquitos
5 2006 75 283 237
6 2007 80 214 190
7 2008 85 197 181
8 2009 74 231 200
9 2010 74 207 184

This is the same behavior as with numpy arrays: most binary operators, such as +, *, >, and &, work element-wise. With a single value on one side (such as 2005), we get the result of the operation for each element.
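Because comparisons produce boolean Series, we can also combine them element-wise with & (and) and | (or). Note the parentheses, which are required because & binds more tightly than the comparison operators. A sketch using a small hand-built subset of the data:

```python
import pandas as pd

data = pd.DataFrame({'year': [2001, 2002, 2003, 2004],
                     'rainfall': [157, 252, 154, 159]})

# rows where year is after 2001 AND rainfall is below 200
mask = (data['year'] > 2001) & (data['rainfall'] < 200)
print(data[mask])  # the 2003 and 2004 rows
```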

A DataFrame is an object, and objects have methods. These are functions that are part of the object itself, often doing operations on the object's data. One of these is DataFrame.mean:

In [26]:
data.mean()
Out[26]:
year           2005.5
temperature      80.0
rainfall        214.6
mosquitos       191.3
dtype: float64

We get back the mean value of each column as a single Series. There's more like this:

In [27]:
data.max()
Out[27]:
year           2010
temperature      87
rainfall        292
mosquitos       243
dtype: int64

There's also DataFrame.describe, which gives common descriptive statistics of the whole DataFrame:

In [28]:
data.describe()
Out[28]:
year temperature rainfall mosquitos
count 10.00000 10.000000 10.000000 10.00000
mean 2005.50000 80.000000 214.600000 191.30000
std 3.02765 5.456902 50.317216 33.23335
min 2001.00000 74.000000 154.000000 150.00000
25% 2003.25000 74.250000 168.500000 163.75000
50% 2005.50000 80.000000 210.500000 187.00000
75% 2007.75000 85.000000 246.750000 212.75000
max 2010.00000 87.000000 292.000000 243.00000

This is, itself, a DataFrame:

In [29]:
data.describe()['temperature']
Out[29]:
count    10.000000
mean     80.000000
std       5.456902
min      74.000000
25%      74.250000
50%      80.000000
75%      85.000000
max      87.000000
Name: temperature, dtype: float64

Documentation is a key part of Python's design. In the notebook, you can get a quick look at the docs for a given Python function or method with:

In [30]:
data.describe?

Or more generally (built-in Python behavior):

In [31]:
help(data.describe)
Help on method describe in module pandas.core.generic:

describe(percentiles=None, include=None, exclude=None) method of pandas.core.frame.DataFrame instance
    Generate various summary statistics, excluding NaN values.
    
    Parameters
    ----------
    percentiles : array-like, optional
        The percentiles to include in the output. Should all
        be in the interval [0, 1]. By default `percentiles` is
        [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
    include, exclude : list-like, 'all', or None (default)
        Specify the form of the returned result. Either:
    
        - None to both (default). The result will include only
          numeric-typed columns or, if none are, only categorical columns.
        - A list of dtypes or strings to be included/excluded.
          To select all numeric types use numpy numpy.number. To select
          categorical objects use type object. See also the select_dtypes
          documentation. eg. df.describe(include=['O'])
        - If include is the string 'all', the output column-set will
          match the input one.
    
    Returns
    -------
    summary: NDFrame of summary statistics
    
    Notes
    -----
    The output DataFrame index depends on the requested dtypes:
    
    For numeric dtypes, it will include: count, mean, std, min,
    max, and lower, 50, and upper percentiles.
    
    For object dtypes (e.g. timestamps or strings), the index
    will include the count, unique, most common, and frequency of the
    most common. Timestamps also include the first and last items.
    
    For mixed dtypes, the index will be the union of the corresponding
    output types. Non-applicable entries will be filled with NaN.
    Note that mixed-dtype outputs can only be returned from mixed-dtype
    inputs and appropriate use of the include/exclude arguments.
    
    If multiple values have the highest count, then the
    `count` and `most common` pair will be arbitrarily chosen from
    among those with the highest count.
    
    The include, exclude arguments are ignored for Series.
    
    See Also
    --------
    DataFrame.select_dtypes


Challenge: obtain the standard deviation of mosquito count for years in which the rainfall was greater than 200:

One way we could do this is to first grab the "mosquitos" column, then use a boolean index obtained from comparing the "rainfall" column to 200. We can then call the std method of the resulting Series:

In [32]:
data['mosquitos'][data['rainfall'] > 200].std()
Out[32]:
24.587937421969063

Note that this is a key part of the power of pandas objects: operations for subsetting and calculating descriptive statistics can often be stacked to great effect.
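An equivalent way to express the same selection is with .loc, which takes the boolean row index and the column label in a single step. This is a sketch of an alternative, not the only correct answer; for self-containment it rebuilds the relevant columns of the A1 dataset from the values shown above rather than re-reading the file:

```python
import pandas as pd

# the A1 dataset typed in directly (values as shown in the table above);
# only the columns needed for this calculation are included
data = pd.DataFrame({
    'rainfall':  [157, 252, 154, 159, 292, 283, 214, 197, 231, 207],
    'mosquitos': [150, 217, 153, 158, 243, 237, 190, 181, 200, 184],
})

# boolean row selection and column selection in one .loc call
result = data.loc[data['rainfall'] > 200, 'mosquitos'].std()
print(result)  # ~24.59, matching the answer above
```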


What if we know our temperatures are in Fahrenheit, but we want them in Celsius? We can convert them.

In [33]:
(data['temperature'] - 32) * 5 / 9
Out[33]:
0    26.666667
1    29.444444
2    30.000000
3    30.555556
4    23.333333
5    23.888889
6    26.666667
7    29.444444
8    23.333333
9    23.333333
Name: temperature, dtype: float64

This gives us back a new Series object. If we want to change the values in our existing DataFrame to be these, we can just set the column to this new Series:

In [34]:
data['temperature'] = (data['temperature'] - 32) * 5 / 9
In [35]:
data['temperature']
Out[35]:
0    26.666667
1    29.444444
2    30.000000
3    30.555556
4    23.333333
5    23.888889
6    26.666667
7    29.444444
8    23.333333
9    23.333333
Name: temperature, dtype: float64

Similarly, it's also possible to add new columns. We could add one giving us, e.g. the ratio of rainfall to mosquitos:

In [36]:
data['rainfall / mosquitos'] = data['rainfall'] / data['mosquitos']
In [37]:
data
Out[37]:
year temperature rainfall mosquitos rainfall / mosquitos
0 2001 26.666667 157 150 1.046667
1 2002 29.444444 252 217 1.161290
2 2003 30.000000 154 153 1.006536
3 2004 30.555556 159 158 1.006329
4 2005 23.333333 292 243 1.201646
5 2006 23.888889 283 237 1.194093
6 2007 26.666667 214 190 1.126316
7 2008 29.444444 197 181 1.088398
8 2009 23.333333 231 200 1.155000
9 2010 23.333333 207 184 1.125000

It's probably a bit silly to have such a column. So, let's get rid of it:

In [38]:
del data['rainfall / mosquitos']
In [39]:
data
Out[39]:
year temperature rainfall mosquitos
0 2001 26.666667 157 150
1 2002 29.444444 252 217
2 2003 30.000000 154 153
3 2004 30.555556 159 158
4 2005 23.333333 292 243
5 2006 23.888889 283 237
6 2007 26.666667 214 190
7 2008 29.444444 197 181
8 2009 23.333333 231 200
9 2010 23.333333 207 184

Don't like the order of your columns? We can make a DataFrame with a different column order by selecting them out in the order we want:

In [40]:
data[['rainfall', 'year', 'mosquitos', 'temperature']]
Out[40]:
rainfall year mosquitos temperature
0 157 2001 150 26.666667
1 252 2002 217 29.444444
2 154 2003 153 30.000000
3 159 2004 158 30.555556
4 292 2005 243 23.333333
5 283 2006 237 23.888889
6 214 2007 190 26.666667
7 197 2008 181 29.444444
8 231 2009 200 23.333333
9 207 2010 184 23.333333

We can even have duplicates:

In [41]:
data[['rainfall', 'year', 'mosquitos', 'temperature', 'year']]
Out[41]:
rainfall year mosquitos temperature year
0 157 2001 150 26.666667 2001
1 252 2002 217 29.444444 2002
2 154 2003 153 30.000000 2003
3 159 2004 158 30.555556 2004
4 292 2005 243 23.333333 2005
5 283 2006 237 23.888889 2006
6 214 2007 190 26.666667 2007
7 197 2008 181 29.444444 2008
8 231 2009 200 23.333333 2009
9 207 2010 184 23.333333 2010

Remember: this returns a copy. It does not change our existing DataFrame:

In [42]:
data
Out[42]:
year temperature rainfall mosquitos
0 2001 26.666667 157 150
1 2002 29.444444 252 217
2 2003 30.000000 154 153
3 2004 30.555556 159 158
4 2005 23.333333 292 243
5 2006 23.888889 283 237
6 2007 26.666667 214 190
7 2008 29.444444 197 181
8 2009 23.333333 231 200
9 2010 23.333333 207 184

Plotting data from a DataFrame

Just as pandas is the de-facto library for working with tabular data, matplotlib is the de-facto library for producing high-quality plots from numerical data. As the saying goes, matplotlib makes easy things easy and hard things possible.

In [43]:
import matplotlib.pyplot as plt

We need to tell the notebook to render plots in the notebook itself:

In [44]:
%matplotlib inline

Before we use matplotlib explicitly, pandas objects feature convenience methods for common plotting operations. These use matplotlib internally, so everything we learn later about matplotlib objects applies to these. For example, we can plot both "temperature" and "mosquitos" as a function of "year":

In [45]:
data.plot(x='year', y=['temperature', 'mosquitos'])
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f339711ce80>

Let's load a larger dataset:

In [46]:
data = pd.read_csv('A2_mosquito_data.csv')

This dataset has 51 rows, instead of just 10:

In [47]:
len(data)
Out[47]:
51
In [48]:
data.plot(x='year', y=['temperature', 'mosquitos'])
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f33970c72e8>

There are other convenience methods for different ways of plotting the data. For example, we can get a kernel-density estimate:

In [49]:
data['mosquitos'].plot.kde()
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3397030160>

These convenience methods are great, but for more complex plots we can, and should, use matplotlib directly. We can make a multi-paneled figure giving both "temperature" and "rainfall" for each "year."

In [50]:
fig = plt.figure(figsize=(4, 6))

ax1 = fig.add_subplot(2, 1, 1)

ax1.plot(data['year'], data['temperature'], 'ro-')

ax1.set_xlabel('Year')
ax1.set_ylabel('Temperature')

ax2 = fig.add_subplot(2, 1, 2)

ax2.plot(data['year'], data['rainfall'], 'bs-')

ax2.set_xlabel('Year')
ax2.set_ylabel('Rainfall')
Out[50]:
<matplotlib.text.Text at 0x7f3390d64ef0>
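When building multi-panel figures like this, two finishing touches are often useful: fig.tight_layout() prevents axis labels from overlapping between panels, and fig.savefig writes the figure to a file. A sketch using a few values from the A1 dataset; the output filename is illustrative:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; in the notebook, %matplotlib inline handles rendering
import matplotlib.pyplot as plt

years = [2001, 2002, 2003]   # a few values from A1_mosquito_data.csv
temps = [80, 85, 86]
rain = [157, 252, 154]

fig = plt.figure(figsize=(4, 6))

ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(years, temps, 'ro-')
ax1.set_ylabel('Temperature')

ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(years, rain, 'bs-')
ax2.set_xlabel('Year')
ax2.set_ylabel('Rainfall')

fig.tight_layout()          # adjust spacing so labels don't collide
fig.savefig('panels.png')   # write the figure out as a PNG
```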

Challenge: plot the relationship between the number of mosquitos and temperature and the number of mosquitos and rainfall.

We can do this in a similar way as we did above.

In [51]:
fig = plt.figure(figsize=(4, 6))

ax1 = fig.add_subplot(2, 1, 1)

ax1.plot(data['temperature'], data['mosquitos'], 'ro')

ax1.set_xlabel('Temperature')
ax1.set_ylabel('Mosquitos')

ax2 = fig.add_subplot(2, 1, 2)

ax2.plot(data['rainfall'], data['mosquitos'], 'bs')

ax2.set_xlabel('Rainfall')
ax2.set_ylabel('Mosquitos')
Out[51]:
<matplotlib.text.Text at 0x7f3390d15978>

Note that omitting the line character from the format string (e.g. 'ro' instead of 'ro-') turns off interpolation lines, since we want to just plot the points, and don't care about their order.


From this one dataset we see what looks like a linear relationship between mosquitos and rainfall, but not really any relationship between mosquitos and temperature. We'd like to quantify this further by applying a statistical model to the data, but we also want to do this same treatment to all our datasets. We need to learn a few more things, including loops, before continuing further.

Repetition with loops

First, an aside on strings. Programming languages generally deal in a few basic data types, including integers, floating-point numbers, and characters. A string is a sequence of characters, and we can manipulate strings easily with Python. For example:

In [52]:
word = 'bird'

We have a string 'bird' that we've assigned the name word. We can examine the string's type, and indeed that of any object in Python, with the type builtin.

In [54]:
type(word)
Out[54]:
str

We can also access its characters (it's a sequence!) with indexing:

In [55]:
word[0]
Out[55]:
'b'
In [56]:
word[1]
Out[56]:
'i'

And slicing:

In [57]:
word[1:]
Out[57]:
'ird'

If I want to output each letter of word on a new line, I could do:

In [59]:
print(word[0])
print(word[1])
print(word[2])
print(word[3])
b
i
r
d

But this doesn't scale to words of arbitrary length. Instead, let's let the computer do the boring work of iterating with a for loop:

In [44]:
for letter in word:
    print(letter)
b
i
r
d

This same bit of code will work for any word, including ridiculously long ones. Like this gem:

In [61]:
word = 'supercallifragilisticexpialidocious'
print(word)
supercallifragilisticexpialidocious
In [62]:
for letter in word:
    print(letter)
s
u
p
e
r
c
a
l
l
i
f
r
a
g
i
l
i
s
t
i
c
e
x
p
i
a
l
i
d
o
c
i
o
u
s

We can do more with loops than iterate through the elements of a sequence. A common use-case is building up a count or sum. For example, counting the number of letters in word:

In [63]:
counter = 0
for letter in word:
    counter = counter + 1

print(counter)
35
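The same accumulator pattern builds sums as well as counts; for example, summing a list of numbers (the numbers here are arbitrary):

```python
numbers = [3, 1, 4, 1, 5]

total = 0
for n in numbers:
    total = total + n   # add each element to the running total

print(total)  # 14
```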

For this very particular use there's a Python built-in, len, which is guaranteed to be more efficient. Use it instead:

In [64]:
len(word)
Out[64]:
35

It works on any sequence! For a pandas.DataFrame, it gives the number of rows:

In [66]:
len(data)
Out[66]:
51
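A few quick demonstrations of len across sequence types:

```python
print(len('bird'))        # 4 characters in a string
print(len([10, 20, 30]))  # 3 elements in a list
print(len((1, 2)))        # 2 elements in a tuple
```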

Lists: ordered collections of other objects (even other lists)

Loops work best when they have something to iterate over. In Python, the list is a built-in data structure that serves as a container for a sequence of other objects. We can make an empty list with:

In [100]:
stuff = list()
In [101]:
print(stuff)
[]

And we can append things to the end of the list, in this case a bunch of random strings:

In [102]:
stuff.append('marklar')
In [103]:
stuff.append('chewbacca')
In [104]:
stuff.append('chickenfingers')
In [105]:
stuff
Out[105]:
['marklar', 'chewbacca', 'chickenfingers']

Just like with the word example, we can iterate through the elements of the list:

In [106]:
for item in stuff:
    print(item)
marklar
chewbacca
chickenfingers

And slicing works too! Slicing with negative numbers counts backwards from the end of the list. The slice stuff[-2:] could be read as "get the 2nd-to-last element of stuff, through the last element":

In [107]:
for item in stuff[-2:]:
    print(item)
chewbacca
chickenfingers

So indexing with -1 gives the last element:

In [108]:
stuff[-1]
Out[108]:
'chickenfingers'

And consistently, but uselessly, slicing from and to the same index yields an empty list. Remember, for a slice n:m, it reads as "get all elements starting with element n and up to but not including element m."

In [109]:
stuff[0:0]
Out[109]:
[]

Lists are mutable: that is, their elements can be changed. For example, we can replace 'chewbacca' with 'hansolo':

In [110]:
stuff[1]
Out[110]:
'chewbacca'
In [111]:
stuff[1] = 'hansolo'
In [112]:
stuff[1]
Out[112]:
'hansolo'

...or replace 'hansolo' with a DataFrame:

In [113]:
stuff[1] = data
In [114]:
stuff[1]
Out[114]:
year temperature rainfall mosquitos
0 1960 82 200 180
1 1961 70 227 194
2 1962 89 231 207
3 1963 74 114 121
4 1964 78 147 140
5 1965 85 151 148
6 1966 86 172 162
7 1967 75 106 112
8 1968 70 276 230
9 1969 86 165 162
10 1970 83 222 198
11 1971 78 297 247
12 1972 87 288 248
13 1973 76 286 239
14 1974 86 231 202
15 1975 90 284 243
16 1976 76 190 175
17 1977 87 257 225
18 1978 88 128 133
19 1979 87 218 199
20 1980 81 206 184
21 1981 74 175 160
22 1982 85 202 187
23 1983 71 130 126
24 1984 80 225 200
25 1985 72 196 173
26 1986 76 261 222
27 1987 85 111 121
28 1988 83 247 210
29 1989 86 137 142
30 1990 82 159 152
31 1991 77 172 160
32 1992 74 280 231
33 1993 70 291 238
34 1994 77 126 125
35 1995 89 191 178
36 1996 83 298 248
37 1997 80 282 237
38 1998 86 219 195
39 1999 72 143 134
40 2000 79 262 221
41 2001 85 189 175
42 2002 86 205 186
43 2003 72 195 173
44 2004 78 148 146
45 2005 71 262 219
46 2006 88 255 226
47 2007 79 262 221
48 2008 73 198 176
49 2009 86 215 187
50 2010 87 127 129
In [115]:
print(stuff)
['marklar',     year  temperature  rainfall  mosquitos
0   1960           82       200        180
1   1961           70       227        194
2   1962           89       231        207
3   1963           74       114        121
4   1964           78       147        140
5   1965           85       151        148
6   1966           86       172        162
7   1967           75       106        112
8   1968           70       276        230
9   1969           86       165        162
10  1970           83       222        198
11  1971           78       297        247
12  1972           87       288        248
13  1973           76       286        239
14  1974           86       231        202
15  1975           90       284        243
16  1976           76       190        175
17  1977           87       257        225
18  1978           88       128        133
19  1979           87       218        199
20  1980           81       206        184
21  1981           74       175        160
22  1982           85       202        187
23  1983           71       130        126
24  1984           80       225        200
25  1985           72       196        173
26  1986           76       261        222
27  1987           85       111        121
28  1988           83       247        210
29  1989           86       137        142
30  1990           82       159        152
31  1991           77       172        160
32  1992           74       280        231
33  1993           70       291        238
34  1994           77       126        125
35  1995           89       191        178
36  1996           83       298        248
37  1997           80       282        237
38  1998           86       219        195
39  1999           72       143        134
40  2000           79       262        221
41  2001           85       189        175
42  2002           86       205        186
43  2003           72       195        173
44  2004           78       148        146
45  2005           71       262        219
46  2006           88       255        226
47  2007           79       262        221
48  2008           73       198        176
49  2009           86       215        187
50  2010           87       127        129, 'chickenfingers']

We can also add a list as an element of itself:

In [116]:
stuff.append(stuff)

So now stuff[-1] points to the same list the name stuff does:

In [117]:
stuff[-1] is stuff
Out[117]:
True

It's important to recognize that these are exactly the same list. Modifications to stuff[-1] will be seen in stuff. These are just names, and in this case they refer to the same object in memory.
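When a genuinely independent list is wanted, make an explicit copy: slicing the whole list (or calling list.copy) creates a new object, so later changes to the original don't propagate. A small sketch:

```python
original = ['marklar', 'chewbacca']
alias = original     # same object: changes show up under both names
clone = original[:]  # a new list with the same elements

original.append('chickenfingers')

print(alias)  # ['marklar', 'chewbacca', 'chickenfingers']
print(clone)  # ['marklar', 'chewbacca'] -- unaffected
print(clone is original)  # False: a distinct object
```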

Lists can contain lists can contain lists can contain lists...and of course any number of other python objects. Even functions can be included as elements:

In [118]:
stuff.append(len)
In [119]:
stuff
Out[119]:
['marklar',     year  temperature  rainfall  mosquitos
 0   1960           82       200        180
 1   1961           70       227        194
 2   1962           89       231        207
 3   1963           74       114        121
 4   1964           78       147        140
 5   1965           85       151        148
 6   1966           86       172        162
 7   1967           75       106        112
 8   1968           70       276        230
 9   1969           86       165        162
 10  1970           83       222        198
 11  1971           78       297        247
 12  1972           87       288        248
 13  1973           76       286        239
 14  1974           86       231        202
 15  1975           90       284        243
 16  1976           76       190        175
 17  1977           87       257        225
 18  1978           88       128        133
 19  1979           87       218        199
 20  1980           81       206        184
 21  1981           74       175        160
 22  1982           85       202        187
 23  1983           71       130        126
 24  1984           80       225        200
 25  1985           72       196        173
 26  1986           76       261        222
 27  1987           85       111        121
 28  1988           83       247        210
 29  1989           86       137        142
 30  1990           82       159        152
 31  1991           77       172        160
 32  1992           74       280        231
 33  1993           70       291        238
 34  1994           77       126        125
 35  1995           89       191        178
 36  1996           83       298        248
 37  1997           80       282        237
 38  1998           86       219        195
 39  1999           72       143        134
 40  2000           79       262        221
 41  2001           85       189        175
 42  2002           86       205        186
 43  2003           72       195        173
 44  2004           78       148        146
 45  2005           71       262        219
 46  2006           88       255        226
 47  2007           79       262        221
 48  2008           73       198        176
 49  2009           86       215        187
 50  2010           87       127        129, 'chickenfingers', [...], <function len>]

Lists also have their own methods, which usually alter the list itself:

In [120]:
stuff.reverse()
In [121]:
stuff
Out[121]:
[<function len>,
 [...],
 'chickenfingers',
     year  temperature  rainfall  mosquitos
 0   1960           82       200        180
 1   1961           70       227        194
 2   1962           89       231        207
 3   1963           74       114        121
 4   1964           78       147        140
 5   1965           85       151        148
 6   1966           86       172        162
 7   1967           75       106        112
 8   1968           70       276        230
 9   1969           86       165        162
 10  1970           83       222        198
 11  1971           78       297        247
 12  1972           87       288        248
 13  1973           76       286        239
 14  1974           86       231        202
 15  1975           90       284        243
 16  1976           76       190        175
 17  1977           87       257        225
 18  1978           88       128        133
 19  1979           87       218        199
 20  1980           81       206        184
 21  1981           74       175        160
 22  1982           85       202        187
 23  1983           71       130        126
 24  1984           80       225        200
 25  1985           72       196        173
 26  1986           76       261        222
 27  1987           85       111        121
 28  1988           83       247        210
 29  1989           86       137        142
 30  1990           82       159        152
 31  1991           77       172        160
 32  1992           74       280        231
 33  1993           70       291        238
 34  1994           77       126        125
 35  1995           89       191        178
 36  1996           83       298        248
 37  1997           80       282        237
 38  1998           86       219        195
 39  1999           72       143        134
 40  2000           79       262        221
 41  2001           85       189        175
 42  2002           86       205        186
 43  2003           72       195        173
 44  2004           78       148        146
 45  2005           71       262        219
 46  2006           88       255        226
 47  2007           79       262        221
 48  2008           73       198        176
 49  2009           86       215        187
 50  2010           87       127        129,
 'marklar']

list.pop removes and returns the element at the given index:

In [122]:
stuff.pop(-2)
Out[122]:
    year  temperature  rainfall  mosquitos
0   1960           82       200        180
1   1961           70       227        194
2   1962           89       231        207
3   1963           74       114        121
4   1964           78       147        140
..   ...          ...       ...        ...
46  2006           88       255        226
47  2007           79       262        221
48  2008           73       198        176
49  2009           86       215        187
50  2010           87       127        129

[51 rows x 4 columns]

...while list.remove removes the first instance of the given element in the list:

In [123]:
stuff.remove('chickenfingers')

So now our list looks like:

In [124]:
stuff
Out[124]:
[<function len>, [...], 'marklar']
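The same list methods are easier to see on a small throwaway list; a quick sketch (the contents here are made up for illustration):

```python
# Demonstrate common list methods on a small example list.
stuff = ['a', 'b', 'c', 'd']

stuff.reverse()           # reverses in place, returns None
print(stuff)              # ['d', 'c', 'b', 'a']

item = stuff.pop(-2)      # removes and returns the element at index -2
print(item)               # 'b'
print(stuff)              # ['d', 'c', 'a']

stuff.remove('c')         # removes the first occurrence of 'c'
print(stuff)              # ['d', 'a']

stuff.append('e')         # adds an element to the end
print(stuff)              # ['d', 'a', 'e']
```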

Plot all the data

Now that we know how to do loops and use lists in Python, we'll use these to make a plot for each dataset. First, we'll import a module called glob:

In [125]:
import glob

We can use this to grab all the filenames matching a "globbing" pattern:

In [126]:
glob.glob("*.csv")
Out[126]:
['A1_mosquito_data.csv',
 'A3_mosquito_data.csv',
 'B2_mosquito_data.csv',
 'A2_mosquito_data.csv',
 'B1_mosquito_data.csv']

"Globbing" is the name of the shell-completion patterns common for shells like sh and bash.

Challenge: write a for-loop that produces a plot of mosquito population vs. temperature and mosquito population vs. rainfall for each dataset we have.

We can grab the block of code we wrote previously, and loop through it for each file obtained by globbing:

In [127]:
import glob

files = glob.glob('*.csv')

for file in files:
    
    print(file)
    data = pd.read_csv(file)
    
    fig = plt.figure(figsize=(4, 6))

    ax1 = fig.add_subplot(2, 1, 1)

    ax1.plot(data['temperature'], data['mosquitos'], 'ro')

    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')

    ax2 = fig.add_subplot(2, 1, 2)

    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')

    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')
    
    # tell matplotlib to render the plot *now*
    plt.show()
A1_mosquito_data.csv
A3_mosquito_data.csv
B2_mosquito_data.csv
A2_mosquito_data.csv
B1_mosquito_data.csv

It looks like there's a similar pattern for each area we have data for. We'd like to quantify more closely the relationship between mosquito population and each of these variables.

Applying a statistical model to the dataset

There are many statistics packages in Python, but one of the most well-developed and widely-used is statsmodels:

In [128]:
import statsmodels.api as sm

We'll use our second dataset for the initial stab at fitting a statistical model to our data:

In [129]:
data = pd.read_csv('A2_mosquito_data.csv')

And let's convert the temperatures to Celsius, for good measure:

In [130]:
data['temperature'] = (data['temperature'] - 32) * 5 / 9

We can build an ordinary least-squares fit to the data using statsmodels.OLS. Remember we can check the docs with:

In [131]:
sm.OLS?

We want to fit the relationship between mosquito population and, for a start, rainfall:

In [132]:
regr_results = sm.OLS(data['mosquitos'], data['rainfall']).fit()
In [133]:
regr_results.summary()
Out[133]:
                            OLS Regression Results
==============================================================================
Dep. Variable:              mosquitos   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.996
Method:                 Least Squares   F-statistic:                 1.386e+04
Date:                Mon, 09 May 2016   Prob (F-statistic):           8.69e-63
Time:                        23:02:14   Log-Likelihood:                -196.25
No. Observations:                  51   AIC:                             394.5
Df Residuals:                      50   BIC:                             396.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
rainfall       0.8811      0.007    117.744      0.000         0.866     0.896
==============================================================================
Omnibus:                        4.422   Durbin-Watson:                   1.878
Prob(Omnibus):                  0.110   Jarque-Bera (JB):                1.959
Skew:                          -0.085   Prob(JB):                        0.376
Kurtosis:                       2.055   Cond. No.                         1.00
==============================================================================

So, this gives us a lot of information. But is this what we want? Note that sm.OLS does not add an intercept term automatically, so this fit is forced through the origin.

We actually want to quantify which variable correlates more with the mosquito population. One way we can build a multivariate model with statsmodels is to use a formula:

In [134]:
regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
regr_results.summary()
Out[134]:
                            OLS Regression Results
==============================================================================
Dep. Variable:              mosquitos   R-squared:                       0.997
Model:                            OLS   Adj. R-squared:                  0.997
Method:                 Least Squares   F-statistic:                     7889.
Date:                Mon, 09 May 2016   Prob (F-statistic):           3.68e-61
Time:                        23:02:15   Log-Likelihood:                -111.54
No. Observations:                  51   AIC:                             229.1
Df Residuals:                      48   BIC:                             234.9
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                  coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      17.5457      2.767      6.341      0.000        11.983    23.109
temperature     0.8719      0.092      9.457      0.000         0.687     1.057
rainfall        0.6967      0.006    125.385      0.000         0.686     0.708
==============================================================================
Omnibus:                        1.651   Durbin-Watson:                   1.872
Prob(Omnibus):                  0.438   Jarque-Bera (JB):                0.906
Skew:                          -0.278   Prob(JB):                        0.636
Kurtosis:                       3.343   Cond. No.                     1.92e+03
==============================================================================

Each of these results can be accessed as attributes of the Results object:

In [135]:
print(regr_results.params)
Intercept      17.545739
temperature     0.871943
rainfall        0.696717
dtype: float64
In [136]:
print(regr_results.rsquared)
0.996966873691
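The R-squared value reported here is simply one minus the ratio of residual variance to total variance, so it can be computed by hand. A quick sketch with made-up numbers (not our mosquito data):

```python
# R-squared = 1 - SS_residual / SS_total, where SS_total is the variance of
# the measured values around their mean, and SS_residual is the variance of
# the measurements around the model's predictions.
y = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

mean_y = sum(y) / len(y)
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_resid = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

r_squared = 1 - ss_resid / ss_total
print(r_squared)   # 0.98
```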

Our model "explains" most of the data. Nice! Let's plot the predicted population of mosquitos from our fitted model against the measured population to see how well we did:

In [137]:
import matplotlib.pyplot as plt

figure = plt.figure()

ax = figure.add_subplot(1,1,1)

parameters = regr_results.params
predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall']
ax.plot(predicted, data['mosquitos'], 'ro')

ax.set_xlabel('predicted mosquito population')
ax.set_ylabel('measured mosquito population')
Out[137]:
<matplotlib.text.Text at 0x7f338d98c908>

The more linear this plot is, the better our model fits the data. Also, as you might have guessed from the plots we made previously, rainfall turns out to be a better predictor than average temperature for how many mosquitos we expect to see.

In [138]:
regr_results.tvalues
Out[138]:
Intercept        6.341405
temperature      9.456614
rainfall       125.385116
dtype: float64

Challenge: for each dataset, print the t-value for temperature and rainfall, and plot the statsmodels results along with plots of the relationships between mosquitos and the two variables separately.

We can take our block of code we worked on previously, and make a three-panel plot instead of just a two-panel one:

In [139]:
filenames = glob.glob('*.csv')

for filename in filenames:
    print(filename)
    data = pd.read_csv(filename)
    
    # convert temperatures to celsius from fahrenheit
    data['temperature'] = (data['temperature'] - 32) * 5/9
    
    # perform fit
    regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
    print(regr_results.tvalues)
    
    fig = plt.figure(figsize=(6, 9))

    # plot predicted vs. measured mosquito populations from fitted model
    parameters = regr_results.params
    predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall']
    
    ax0 = fig.add_subplot(3, 1, 1)
    ax0.plot(predicted, data['mosquitos'], 'gd')
    
    ax0.set_xlabel('predicted mosquito population')
    ax0.set_ylabel('measured mosquito population')
   
    # plot population vs. temperature
    ax1 = fig.add_subplot(3, 1, 2)

    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')

    # plot population vs. rainfall
    ax2 = fig.add_subplot(3, 1, 3)

    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')

    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')

    plt.show()
A1_mosquito_data.csv
Intercept       3.589733
temperature     3.456159
rainfall       56.825601
dtype: float64
A3_mosquito_data.csv
Intercept       9.658621
temperature    10.881139
rainfall       70.798256
dtype: float64
B2_mosquito_data.csv
Intercept        8.778676
temperature     11.647932
rainfall       125.016860
dtype: float64
A2_mosquito_data.csv
Intercept        6.341405
temperature      9.456614
rainfall       125.385116
dtype: float64
B1_mosquito_data.csv
Intercept      11.016743
temperature    13.396049
rainfall       43.072783
dtype: float64

For all areas, it looks as if rainfall is indeed a stronger predictor of mosquito population than temperature. But what if this is just the beginning of the number of datasets we have to analyze? Are we going to just copy this block of code around each time we want to analyze new data?

No. That would be awful.

Writing functions: we need to break things up

We have code blocks we've been copying and pasting around a lot, and this is fine for a while. But once we've built some useful things, we want to re-use them, and we don't want to maintain many versions of the same thing lying around.

In [140]:
def square(x):
    x_squared = x ** 2
    return x_squared
In [141]:
square(3)
Out[141]:
9
In [142]:
square(4)
Out[142]:
16

Functions are also composable:

In [143]:
square(square(3))
Out[143]:
81

We can improve on this a bit by adding documentation. In Python, we can write a docstring for our function directly in its definition:

In [144]:
def square(x):
    """Return the square of the value, `x`.
    
    Parameters
    ----------
    x : float
        Number to obtain the square of.
    
    Returns
    -------
    square : float
        The square of `x`.
    
    """
    x_squared = x ** 2
    return x_squared

And this docstring is now what we see when we use the builtin help system:

In [145]:
help(square)
Help on function square in module __main__:

square(x)
    Return the square of the value, `x`.
    
    Parameters
    ----------
    x : float
        Number to obtain the square of.
    
    Returns
    -------
    square : float
        The square of `x`.

Or alternatively, in the notebook:

In [146]:
square?

Challenge: write your Fahrenheit-to-Celsius conversion as a function that takes temperatures in Fahrenheit and returns temperatures in Celsius.

This is one way we could write it:

In [147]:
def fahrenheit_to_celsius(temp):
    """Convert temperature in fahrenheit to celsius.
    
    Parameters
    ----------
    temp : float or array_like
        Temperature(s) in fahrenheit.
    
    Returns
    -------
    float or array_like
        Temperature(s) in celsius.
    
    """
    return (temp - 32) * 5 / 9

Note that this same function will work for pandas.Series objects, numpy arrays, and single numbers. Writing functions in this way takes advantage of the fact that these objects are roughly duck-typed: they all behave as if they were single numbers, so doing arithmetic operations with them yields the expected result.
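To see this duck-typing in action, compare the same function applied to a single number and to a whole numpy array (a minimal sketch; it assumes numpy is installed):

```python
import numpy as np

def fahrenheit_to_celsius(temp):
    """Convert temperature(s) in fahrenheit to celsius."""
    return (temp - 32) * 5 / 9

# A single number works...
print(fahrenheit_to_celsius(212))     # 100.0

# ...and so does a numpy array, element-wise, with no change to the code:
# arithmetic operators are defined for arrays just as they are for numbers.
temps = np.array([32.0, 212.0, 98.6])
print(fahrenheit_to_celsius(temps))
```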



Challenge: write a function that prints the t-value for temperature and rainfall, and plots the statsmodels results along with plots of the relationships between mosquitos and the two variables separately.

Let's do our model building and plotting in a single function, returning a useful panel plot:

In [148]:
def analyze(data):
    """Return panel plot of mosquito population vs. temperature, rainfall.
    
    Also prints t-values for temperature and rainfall.
    
    Panel plot gives: 
        1. comparison of modeled values of mosquito population vs. actual values
        2. mosquito population vs. average temperature
        3. mosquito population vs. total rainfall
    
    Parameters
    ----------
    data : DataFrame
        DataFrame giving columns for average temperature, 
        total rainfall, and mosquito population during mosquito
        breeding season for each year.
    
    Returns
    -------
    Figure
        :mod:`matplotlib.figure.Figure` object giving panel plot.
    
    """
    # perform fit
    regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
    print(regr_results.tvalues)
    
    fig = plt.figure(figsize=(6, 9))

    # plot prediction from fitted model against measured mosquito population
    parameters = regr_results.params
    predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall']
    
    ax0 = fig.add_subplot(3, 1, 1)
    ax0.plot(predicted, data['mosquitos'], 'go')
    
    ax0.set_xlabel('predicted mosquito population')
    ax0.set_ylabel('measured mosquito population')
   
    # plot population vs. temperature
    ax1 = fig.add_subplot(3, 1, 2)

    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')

    # plot population vs. rainfall
    ax2 = fig.add_subplot(3, 1, 3)

    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')

    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')

    return fig

Note that since this is Python, and everything is an object, we can return our Figure object. This is generally a useful thing to do for plotting functions, since in different contexts you may want to modify the resulting Figure before rendering any plots of it.


We can now re-write our big loop from before, and it should look a lot simpler.

In [149]:
filenames = glob.glob('*.csv')

for filename in filenames:
    print(filename)
    data = pd.read_csv(filename)
    
    # convert temperatures to celsius
    data['temperature'] = fahrenheit_to_celsius(data['temperature'])
    
    # get t-values; return plots
    analyze(data)
    
    plt.show()
A1_mosquito_data.csv
Intercept       3.589733
temperature     3.456159
rainfall       56.825601
dtype: float64
A3_mosquito_data.csv
Intercept       9.658621
temperature    10.881139
rainfall       70.798256
dtype: float64
B2_mosquito_data.csv
Intercept        8.778676
temperature     11.647932
rainfall       125.016860
dtype: float64
A2_mosquito_data.csv
Intercept        6.341405
temperature      9.456614
rainfall       125.385116
dtype: float64
B1_mosquito_data.csv
Intercept      11.016743
temperature    13.396049
rainfall       43.072783
dtype: float64

Our for loop went from being long and complicated to simple and easy to understand. This is part of the power of functions. Because the human brain can only keep about $7 \pm 2$ concepts in working memory at any given time, functions allow us to abstract and build more complicated code without making it more difficult to understand. Instead of having to worry about how analyze does its thing, we need only know its inputs and its outputs to wield it effectively. It can be re-used without copying and pasting, and we can maintain and improve it easily since it is defined in a single place.

Since we may want to use these functions later in other notebooks or code, we can put their definitions in a file and import it. Open a text editor and copy the function definitions (and the imports they need) into a file called datastuff.py, which looks like this:

In [150]:
%cat datastuff.py
import statsmodels.api as sm
import matplotlib.pyplot as plt


def fahrenheit_to_celsius(temp):
    """Convert temperature in fahrenheit to celsius.
    
    Parameters
    ----------
    temp : float or array_like
        Temperature(s) in fahrenheit.
        
    Returns
    -------
    float or array_like
        Temperature(s) in celsius.

    """
    try:
        newtemps = (temp - 32) * 5/9
    except TypeError:
        newtemps = []
        for value in temp:
            newtemps.append(fahrenheit_to_celsius(value))

    return newtemps


def analyze(data):
    """Return panel plot of mosquito population vs. temperature, rainfall.
    
    Also prints t-values for temperature and rainfall.
    
    Panel plot gives: 
        1. comparison of modeled values of mosquito population vs. actual values
        2. mosquito population vs. average temperature
        3. mosquito population vs. total rainfall
    
    Parameters
    ----------
    data : DataFrame
        DataFrame giving columns for average temperature, 
        total rainfall, and mosquito population during mosquito
        breeding season for each year.
    
    Returns
    -------
    Figure
        :mod:`matplotlib.figure.Figure` object giving panel plot.
    
    """
    # perform fit
    regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
    print(regr_results.tvalues)
    
    fig = plt.figure(figsize=(6, 9))

    # plot prediction from fitted model against measured mosquito population
    parameters = regr_results.params
    predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall']
    
    ax0 = fig.add_subplot(3, 1, 1)
    ax0.plot(predicted, data['mosquitos'], 'go')
    
    ax0.set_xlabel('predicted mosquito population')
    ax0.set_ylabel('measured mosquito population')
   
    # plot population vs. temperature
    ax1 = fig.add_subplot(3, 1, 2)

    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')

    # plot population vs. rainfall
    ax2 = fig.add_subplot(3, 1, 3)

    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')

    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')

    return fig

Congratulations: you just made your first Python module! We can import it directly:

In [151]:
import datastuff

And we can see our documentation for these functions and use them as you'd expect:

In [152]:
datastuff.fahrenheit_to_celsius?
In [153]:
datastuff.analyze(data)
Intercept      11.016743
temperature    13.396049
rainfall       43.072783
dtype: float64
Out[153]:

Note: if you make modifications to your module, they will not show up in this session, even if you re-run import datastuff. You will either need to restart your Python session, or explicitly tell the Python session to rebuild the module object by re-reading the file:

In [154]:
import importlib
In [155]:
importlib.reload(datastuff)
Out[155]:
<module 'datastuff' from '/home/alter/Documents/SWC/Events/2016_UCSF/swc_ucsf_intermediate_data/python/datastuff.py'>

This is because it's more efficient for the Python interpreter to build a module object once from a module file, and then hand back that same object for each subsequent import it encounters, instead of building a whole new one every time. It's faster, and it requires less memory.
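We can see this caching directly in sys.modules, the dictionary where the interpreter keeps modules it has already built. A sketch using a standard-library module:

```python
import importlib
import json
import sys

# After the first import, the module object lives in sys.modules...
print('json' in sys.modules)        # True

# ...and any later import just hands back the very same object.
import json as json_again
print(json_again is json)           # True
print(sys.modules['json'] is json)  # True

# importlib.reload re-executes the module file and refreshes that same
# object in place, rather than creating a new one.
reloaded = importlib.reload(json)
print(reloaded is json)             # True
```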

An aside: combining DataFrames

Although we didn't need to do it for these examples, pandas is generally more useful when we can pack as much of our data as is reasonable into a single DataFrame. Since all our datasets have the same form, we can probably gain by combining them.

In [156]:
import pandas as pd
import glob

We will use the pandas.concat function to make a single DataFrame from all our others:

In [157]:
filenames = glob.glob('*.csv')
datas = []
keys = []
for filename in filenames:
    keys.append(filename.split('_')[0])
    datas.append(pd.read_csv(filename))
    
df = pd.concat(datas, keys=keys, names=['location'])

We now have a "multi-index" DataFrame:

In [158]:
df
Out[158]:
year temperature rainfall mosquitos
location
A1 0 2001 80 157 150
1 2002 85 252 217
2 2003 86 154 153
3 2004 87 159 158
4 2005 74 292 243
5 2006 75 283 237
6 2007 80 214 190
7 2008 85 197 181
8 2009 74 231 200
9 2010 74 207 184
A3 0 1960 76 191 93
1 1961 73 249 109
2 1962 81 112 73
3 1963 78 113 72
4 1964 81 159 89
5 1965 87 222 109
6 1966 72 103 68
7 1967 77 176 92
8 1968 89 236 114
9 1969 88 283 128
10 1970 89 151 89
11 1971 71 121 72
12 1972 88 267 124
13 1973 85 211 102
14 1974 75 101 67
15 1975 72 173 85
16 1976 74 254 117
17 1977 85 155 90
18 1978 89 158 92
19 1979 86 117 75
... ... ... ... ... ...
B1 21 1981 77 115 77
22 1982 71 267 106
23 1983 89 164 99
24 1984 75 175 90
25 1985 70 210 95
26 1986 72 214 99
27 1987 79 123 81
28 1988 74 195 96
29 1989 84 129 90
30 1990 80 276 111
31 1991 73 147 82
32 1992 73 140 80
33 1993 78 285 115
34 1994 71 280 107
35 1995 80 180 95
36 1996 90 280 119
37 1997 86 166 100
38 1998 82 285 117
39 1999 75 215 101
40 2000 86 212 101
41 2001 72 138 83
42 2002 84 130 85
43 2003 74 166 87
44 2004 87 280 121
45 2005 85 130 88
46 2006 83 102 77
47 2007 85 163 94
48 2008 83 190 99
49 2009 79 151 88
50 2010 88 295 122

214 rows × 4 columns

The outer index gives the area name ("location") each row came from. This allows us to select rows on the basis of area:

In [159]:
df.loc[('A1', 1)]
Out[159]:
year           2002
temperature      85
rainfall        252
mosquitos       217
Name: (A1, 1), dtype: int64

More powerfully, this allows us to get aggregates over all areas:

In [160]:
df[df['year'] > 2005]['temperature'].mean()
Out[160]:
81.040000000000006

Or get descriptive statistics directly as a function of location:

In [161]:
df.mean(level='location')
Out[161]:
year temperature rainfall mosquitos
location
A1 2005.5 80.000000 214.600000 191.300000
A3 1985.0 80.333333 193.274510 97.647059
B2 1985.0 79.764706 196.823529 196.058824
A2 1985.0 80.392157 207.039216 185.235294
B1 1985.0 79.627451 204.588235 99.509804

Challenge: obtain the minimum rainfall for each location given that the temperature was between 75F and 90F.

This is best read backwards. First, let's filter our rows so we only get those for the temperature range we want:

In [162]:
df[(df['temperature'] > 75) & (df['temperature'] < 90)]
Out[162]:
year temperature rainfall mosquitos
location
A1 0 2001 80 157 150
1 2002 85 252 217
2 2003 86 154 153
3 2004 87 159 158
6 2007 80 214 190
7 2008 85 197 181
A3 0 1960 76 191 93
2 1962 81 112 73
3 1963 78 113 72
4 1964 81 159 89
5 1965 87 222 109
7 1967 77 176 92
8 1968 89 236 114
9 1969 88 283 128
10 1970 89 151 89
12 1972 88 267 124
13 1973 85 211 102
17 1977 85 155 90
18 1978 89 158 92
19 1979 86 117 75
21 1981 80 153 87
23 1983 80 145 82
27 1987 83 261 117
28 1988 89 280 127
29 1989 81 271 121
30 1990 77 188 94
31 1991 80 218 105
33 1993 83 211 105
34 1994 77 224 108
36 1996 86 247 115
... ... ... ... ... ...
B1 5 1965 76 272 113
6 1966 77 140 87
7 1967 87 207 106
8 1968 86 266 115
9 1969 79 159 92
11 1971 79 252 108
13 1973 82 281 116
14 1974 80 295 120
15 1975 86 166 99
16 1976 87 298 123
17 1977 79 184 95
19 1979 86 157 95
21 1981 77 115 77
23 1983 89 164 99
27 1987 79 123 81
29 1989 84 129 90
30 1990 80 276 111
33 1993 78 285 115
35 1995 80 180 95
37 1997 86 166 100
38 1998 82 285 117
40 2000 86 212 101
42 2002 84 130 85
44 2004 87 280 121
45 2005 85 130 88
46 2006 83 102 77
47 2007 85 163 94
48 2008 83 190 99
49 2009 79 151 88
50 2010 88 295 122

151 rows × 4 columns

Then, we can grab the minimum rainfall for these:

In [163]:
df[(df['temperature'] > 75) & (df['temperature'] < 90)]['rainfall'].min()
Out[163]:
100

But we wanted minimum rainfall for each location; we can tell Series.min to split the data by location:

In [164]:
df[(df['temperature'] > 75) & (df['temperature'] < 90)]['rainfall'].min(level='location')
Out[164]:
location
A1    154
A3    112
B2    100
A2    111
B1    102
Name: rainfall, dtype: int64

We don't have time during this lesson to get into groupbys, but they are such a central part of pandas that we'd be remiss if we excluded them. A groupby allows us to split our rows according to the values in one or more columns. For example, we could ask for the minimum rainfall measured across all areas for each year:

In [165]:
df.groupby('year')['rainfall'].min()
Out[165]:
year
1960    129
1961    135
1962    112
1963    113
1964    147
1965    151
1966    103
1967    106
1968    138
1969    147
1970    151
1971    121
1972    153
1973    100
1974    101
1975    134
1976    116
1977    155
1978    128
1979    117
1980    138
1981    115
1982    202
1983    130
1984    151
1985    152
1986    214
1987    111
1988    195
1989    129
1990    159
1991    147
1992    140
1993    211
1994    126
1995    180
1996    187
1997    123
1998    127
1999    123
2000    212
2001    112
2002    130
2003    132
2004    119
2005    130
2006    102
2007    163
2008    186
2009    151
2010    127
Name: rainfall, dtype: int64

This should be read as "group the data by year, then get the minimum rainfall for each group." If we want to do something more complex for each group of rows, we can use a for loop:

In [166]:
for name, group in df.groupby('year'):
    print(name, group['rainfall'].min())
1960 129
1961 135
1962 112
1963 113
1964 147
1965 151
1966 103
1967 106
1968 138
1969 147
1970 151
1971 121
1972 153
1973 100
1974 101
1975 134
1976 116
1977 155
1978 128
1979 117
1980 138
1981 115
1982 202
1983 130
1984 151
1985 152
1986 214
1987 111
1988 195
1989 129
1990 159
1991 147
1992 140
1993 211
1994 126
1995 180
1996 187
1997 123
1998 127
1999 123
2000 212
2001 112
2002 130
2003 132
2004 119
2005 130
2006 102
2007 163
2008 186
2009 151
2010 127

To learn more about groupbys, check out the pandas docs on the "split-apply-combine" approach to working with datasets with groupby: http://pandas.pydata.org/pandas-docs/stable/groupby.html
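To make the "split-apply-combine" idea concrete without pandas, here is the same "minimum per group" computation sketched in plain Python with a dictionary (the data values here are made up):

```python
# (year, rainfall) pairs standing in for rows of a table.
rows = [
    (1960, 129), (1960, 200),
    (1961, 135), (1961, 227),
]

# Split: collect the rainfall values into one list per year.
groups = {}
for year, rainfall in rows:
    groups.setdefault(year, []).append(rainfall)

# Apply + combine: reduce each group with min(), keyed by year.
minima = {year: min(values) for year, values in groups.items()}
print(minima)   # {1960: 129, 1961: 135}
```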

Conditionals

We've already used boolean operations to filter rows on DataFrames, but more broadly we must be able to write code that makes decisions based on the inputs it is given.

We can use conditionals. To demonstrate how these work, we'll use absurd toy examples. Say we have some number:

In [167]:
num = 37

We can ask whether that number is greater than 100 with a conditional like this:

In [169]:
if num > 100:
    print('greater')
else:
    print('not greater')
not greater

We could add some more sophistication to the conditional by checking for equality:

In [170]:
if num > 100:
    print('greater')
elif num == 100:
    print('equal!')
else:
    print('not greater')
not greater

...and we could test it:

In [171]:
num = 100
In [172]:
if num > 100:
    print('greater')
elif num == 100:
    print('equal!')
else:
    print('not greater')
equal!

Conditionals can include boolean operators such as and:

In [173]:
if (1 > 0) and (-1 > 0):
    print('both parts are true')
else:
    print('one part is not true')
one part is not true

A "truth table" for and, given two boolean values s1 and s2, each either True or False gives:

s1      s2      s1 and s2
True    True    True
True    False   False
False   True    False
False   False   False

There's also or:

In [174]:
if (1 > 0) or (-1 > 0):
    print('at least one part is true')
else:
    print('no part is true')
at least one part is true

s1      s2      s1 or s2
True    True    True
True    False   True
False   True    True
False   False   False

And not:

In [176]:
if (1 > 0) and not (-1 > 0):
    print('both parts are true')
else:
    print('one part is not true')
both parts are true

s1      not s1
True    False
False   True
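Since comparisons evaluate to the boolean values True and False, we can check all three truth tables directly in Python:

```python
# Exhaustively print the truth tables for and / or.
for s1 in (True, False):
    for s2 in (True, False):
        print(s1, s2, '->', s1 and s2, s1 or s2)

# not simply flips a boolean value.
print(not True)    # False
print(not False)   # True

# and is True only when both operands are; or when at least one is.
assert (True and False) == False
assert (True or False) == True
```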

Challenge: rewrite your Fahrenheit to Celsius converter with an if-else statement such that it can also take a list of values in addition to arrays/single numbers.

There are perhaps other ways to do this, but if we want to specifically handle the case of lists of temperatures, we can use the builtin isinstance to check that we are dealing with a list:

In [177]:
def fahrenheit_to_celsius(temp):
    """Convert temperature in fahrenheit to celsius.
    
    Parameters
    ----------
    temp : float or array_like
        Temperature(s) in fahrenheit.
        
    Returns
    -------
    float or array_like
        Temperature(s) in celsius.

    """
    if isinstance(temp, list):
        newtemps = []
        for value in temp:
            newtemps.append(fahrenheit_to_celsius(value))
    else:
        newtemps = (temp - 32) * 5/9

    return newtemps

This function takes advantage of recursion for the case of a list, calling itself on each value in the list and converting those values accordingly.
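As a quick check of the idea, here is the same list-handling logic written a little more compactly with a list comprehension (the temperature values are made up):

```python
def fahrenheit_to_celsius(temp):
    """Convert temperature(s) in fahrenheit to celsius."""
    if isinstance(temp, list):
        # Recurse on each element, so nested lists also convert correctly.
        return [fahrenheit_to_celsius(value) for value in temp]
    return (temp - 32) * 5 / 9

print(fahrenheit_to_celsius(212))               # 100.0
print(fahrenheit_to_celsius([32, 212]))         # [0.0, 100.0]
print(fahrenheit_to_celsius([32, [68, 212]]))   # [0.0, [20.0, 100.0]]
```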

Errors and exceptions

Encountering errors is part of coding. If you are coding, you will hit errors. The important thing to remember is that errors that are loud are the best kind, because they usually give hints as to what the problem is. In Python, errors are known as exceptions, and there are a few common varieties. Familiarity with these varieties helps to more quickly identify the root of the problem, which means we can more quickly fix it.

We want to run a function inside a module called errors_01.py; it's inside our scripts directory:

In [180]:
%ls scripts/
argv_list.py  errors_01.py  errors_02.py  plot_rand_mp.py*

Let's try to import it:

In [181]:
import errors_01
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-181-56eb787402eb> in <module>()
----> 1 import errors_01

ImportError: No module named 'errors_01'

We got an ImportError. This exception tells us that Python cannot find a module we are trying to import. To troubleshoot why, we should first look at sys.path:

In [182]:
import sys
sys.path
Out[182]:
['',
 '/home/alter/.local/lib/python3.5/site-packages/scandir-1.1-py3.5-linux-x86_64.egg',
 '/home/alter/Library/datreant/datreant.core/src',
 '/home/alter/Library/datreant/datreant.data/src',
 '/home/alter/Library/asciitree',
 '/usr/lib/python35.zip',
 '/usr/lib/python3.5',
 '/usr/lib/python3.5/plat-linux',
 '/usr/lib/python3.5/lib-dynload',
 '/home/alter/.local/lib/python3.5/site-packages',
 '/usr/lib/python3.5/site-packages',
 '/usr/lib/python3.5/site-packages/gtk-2.0',
 '/home/alter/.local/lib/python3.5/site-packages/IPython/extensions',
 '/home/alter/.ipython']

The Python interpreter looks at each location in this list, in order, when we try to import a module by name. Since our scripts directory is not in this list, errors_01 is not found. We can add the directory to the list, though, to make errors_01 findable:

In [183]:
sys.path.append('scripts/')
In [184]:
sys.path
Out[184]:
['',
 '/home/alter/.local/lib/python3.5/site-packages/scandir-1.1-py3.5-linux-x86_64.egg',
 '/home/alter/Library/datreant/datreant.core/src',
 '/home/alter/Library/datreant/datreant.data/src',
 '/home/alter/Library/asciitree',
 '/usr/lib/python35.zip',
 '/usr/lib/python3.5',
 '/usr/lib/python3.5/plat-linux',
 '/usr/lib/python3.5/lib-dynload',
 '/home/alter/.local/lib/python3.5/site-packages',
 '/usr/lib/python3.5/site-packages',
 '/usr/lib/python3.5/site-packages/gtk-2.0',
 '/home/alter/.local/lib/python3.5/site-packages/IPython/extensions',
 '/home/alter/.ipython',
 'scripts/']

Note that this change will only be present in this session. It is not permanent. This is a perfectly fine method, though, for using a module you've built yourself. Let's import errors_01 now:

In [185]:
import errors_01
  File "scripts/errors_01.py", line 7
    print ice_creams[3]
                   ^
SyntaxError: Missing parentheses in call to 'print'

We get a different exception now: a SyntaxError. This is the Python interpreter's way of telling us that we are using poor grammar, and it doesn't understand what we are telling it. In this case, it tells us exactly what's wrong:

SyntaxError: Missing parentheses in call to 'print'

If we make the suggested fix in, say, errors_02.py (we copied the file here for demo purposes, but normally we would fix it and version control it with git):

In [186]:
%cat scripts/errors_02.py
def favorite_ice_cream():
    ice_creams = [
        "chocolate",
        "vanilla",
        "strawberry"
    ]
    print(ice_creams[3])
In [187]:
import errors_02

This one imports just fine. Let's run errors_02.favorite_ice_cream:

In [188]:
errors_02.favorite_ice_cream()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-188-8c34ad3792fd> in <module>()
----> 1 errors_02.favorite_ice_cream()

/home/alter/Documents/SWC/Events/2016_UCSF/swc_ucsf_intermediate_data/python/scripts/errors_02.py in favorite_ice_cream()
      5         "strawberry"
      6     ]
----> 7     print(ice_creams[3])

IndexError: list index out of range

This time we get an IndexError. Let's dissect this. The first line shows where the error originates in our code. In this case it's the only line in the cell. The next set of lines shows where the error originates at the source: that is, inside the definition of errors_02.favorite_ice_cream in line 7 of the module errors_02. The last line gives us a hint as to the nature of the IndexError: we appear to be using an index for an element of a list that doesn't exist.

There are other types of exceptions, but these are some of the most common. You need not know all of them, but having some knowledge of the basic varieties is valuable for understanding what's going wrong, and then how to fix it!
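To get a feel for the common varieties, here is a small sketch that deliberately triggers several of them and prints each one's name and message (the triggering expressions are purely illustrative):

```python
# Deliberately trigger some common exception types; each lambda
# raises only when called, so we can catch them one at a time.
triggers = [
    lambda: undefined_variable,    # NameError: name is not defined
    lambda: [1, 2, 3][10],         # IndexError: index past end of list
    lambda: {"a": 1}["b"],         # KeyError: missing dictionary key
    lambda: "text" + 5,            # TypeError: incompatible types
    lambda: int("not a number"),   # ValueError: right type, bad value
]

for trigger in triggers:
    try:
        trigger()
    except Exception as err:
        print(type(err).__name__, "-", err)
```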

Aside: catching exceptions

In addition to being useful tools for debugging, exceptions are catchable. That is, you can write code in such a way that it is able to recover from failure or exceptional cases. As an example, let's rewrite our Fahrenheit to Celsius converter one last time:

In [189]:
def fahrenheit_to_celsius(temp):
    """Convert temperature in fahrenheit to celsius.
    
    Parameters
    ----------
    temp : float or array_like
        Temperature(s) in fahrenheit.
        
    Returns
    -------
    float or array_like
        Temperatures in celsius.

    """
    try:
        newtemps = (temp - 32) * 5/9
    except TypeError:
        newtemps = []
        for value in temp:
            newtemps.append(fahrenheit_to_celsius(value))

    return newtemps

This function can take lists just like the one with the if-else and isinstance check, but it turns the problem on its head. It uses a try-except block instead, first trying to use temp as a single number or array, and when that fails with a TypeError (as it would if temp is a list), it then handles temp as if it were a list.

This is known as "ask forgiveness" style: we do the most common thing, and if that fails, we then do something else. This is in contrast to "ask permission" style, as exhibited in the if-else implementation: we first checked if we had a funny case, and if not, did the common thing.

Which you choose is often a matter of style, though there are performance benefits to "ask forgiveness" when the odd cases truly are rare; when the other cases a function must deal with are almost equally common, it is better to use if-else statements to direct flow.
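To make the contrast concrete, here is a minimal sketch of the same conversion written both ways (simplified versions without the docstrings above):

```python
def celsius_lbyl(temp):
    # "Ask permission" (look before you leap): check the type first.
    if isinstance(temp, list):
        return [celsius_lbyl(t) for t in temp]
    return (temp - 32) * 5 / 9

def celsius_eafp(temp):
    # "Ask forgiveness": try the common case, recover on failure.
    try:
        return (temp - 32) * 5 / 9
    except TypeError:
        return [celsius_eafp(t) for t in temp]

print(celsius_lbyl(212))        # 100.0
print(celsius_eafp([32, 212]))  # [0.0, 100.0]
```

Both give identical results; they differ only in whether the uncommon case is detected up front or handled when it actually causes a failure.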

Defensive programming: being skeptical of your code

Say we write a function that does something a bit obtuse (heh): Given the coordinates of the lower-left and upper-right corners of a rectangle, it normalizes the rectangle so that its lower-left coordinate is at the origin (0, 0), and its longest side is 1.0 units long:

In [190]:
def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin, 
    and 1.0 units long on its longest axis.
    
    :Arguments:
        *rect*
            a tuple of size four, giving (x0, y0, x1, y1), the (x,y)
            positions of the lower-left and upper-right corner
            
    :Returns:
        *newrect*
            a tuple of size four, giving the new coordinates of our rectangle's
            vertices
    
    """
    x0, y0, x1, y1 = rect
    
    assert x0 < x1, 'Invalid x coordinates'
    assert y0 < y1, 'Invalid y coordinates'
    
    dx = x1 - x0
    dy = y1 - y0
    
    if dx > dy:
        scaled = float(dx) / dy
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0
        
    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'
        
    return (0, 0, upper_x, upper_y)

Notice that we added a nice documentation string at the top that explicitly states the function's inputs (arguments) and outputs (returns), so when we use it we don't need to mind its details anymore. Also, at the beginning of the function we have assert statements that check the validity of the inputs, while at the end we have assert statements that check the validity of the outputs.

These statements codify what our documentation says about what the inputs should mean and what the output should look like, guarding against cases where the function is given inputs it wasn't designed to handle.

Let's test the function!

In [191]:
normalize_rectangle((3, 4, 4, 6))
Out[191]:
(0, 0, 0.5, 1.0)

Looks good. First point is at the origin; longest side is 1.0.

In [192]:
normalize_rectangle((0.0, 1.0, 2.0, 5.0))
Out[192]:
(0, 0, 0.5, 1.0)

Nice!

In [193]:
normalize_rectangle((0.0, 0.0, 5.0, 1.0))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-193-a41732464e3f> in <module>()
----> 1 normalize_rectangle((0.0, 0.0, 5.0, 1.0))

<ipython-input-190-a7edcf11432e> in normalize_rectangle(rect)
     30 
     31     assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
---> 32     assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'
     33 
     34     return (0, 0, upper_x, upper_y)

AssertionError: Calculated upper Y coordinate invalid

Oh...what happened here? The inputs should be fine, but an assertion caught something strange in the output. Looking closely at the code, we spot a mistake in the first branch of the if: when dx > dy we compute scaled = float(dx) / dy, when we should be scaling by float(dy) / dx. Let's fix that:

In [194]:
def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin, 
    and 1.0 units long on its longest axis.
    
    :Arguments:
        *rect*
            a tuple of size four, giving (x0, y0, x1, y1), the (x,y)
            positions of the lower-left and upper-right corner
            
    :Returns:
        *newrect*
            a tuple of size four, giving the new coordinates of our rectangle's
            vertices
    
    """
    x0, y0, x1, y1 = rect
    
    assert x0 < x1, 'Invalid x coordinates'
    assert y0 < y1, 'Invalid y coordinates'
    
    dx = x1 - x0
    dy = y1 - y0
    
    if dx > dy:
        scaled = float(dy) / dx
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0
        
    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'
        
    return (0, 0, upper_x, upper_y)

And now we get:

In [195]:
normalize_rectangle((0.0, 0.0, 5.0, 1.0))
Out[195]:
(0, 0, 1.0, 0.2)
In [196]:
normalize_rectangle((2, 3, 4, 6))
Out[196]:
(0, 0, 0.6666666666666666, 1.0)

That looks better.

What if we were to try feeding in only three values instead of four?

In [197]:
normalize_rectangle((2, 4, 5))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-197-c5d8e9954719> in <module>()
----> 1 normalize_rectangle((2, 4, 5))

<ipython-input-194-d4ce9e9a4cfa> in normalize_rectangle(rect)
     14 
     15     """
---> 16     x0, y0, x1, y1 = rect
     17 
     18     assert x0 < x1, 'Invalid x coordinates'

ValueError: not enough values to unpack (expected 4, got 3)

That's a rather unhelpful error message. We can also add an assertion that checks that the input is long enough, and gives a message indicating what the problem is:

In [198]:
def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin, 
    and 1.0 units long on its longest axis.
    
    :Arguments:
        *rect*
            a tuple of size four, giving (x0, y0, x1, y1), the (x,y)
            positions of the lower-left and upper-right corner
            
    :Returns:
        *newrect*
            a tuple of size four, giving the new coordinates of our rectangle's
            vertices
    
    """
    assert len(rect) == 4, "Rectangle must have 4 elements"

    x0, y0, x1, y1 = rect
    
    assert x0 < x1, 'Invalid x coordinates'
    assert y0 < y1, 'Invalid y coordinates'
    
    dx = x1 - x0
    dy = y1 - y0
    
    if dx > dy:
        scaled = float(dy) / dx
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0
        
    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'
        
    return (0, 0, upper_x, upper_y)
In [199]:
normalize_rectangle((2, 4, 5))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-199-c5d8e9954719> in <module>()
----> 1 normalize_rectangle((2, 4, 5))

<ipython-input-198-e9732fa1861f> in normalize_rectangle(rect)
     14 
     15     """
---> 16     assert len(rect) == 4, "Rectangle must have 4 elements"
     17 
     18     x0, y0, x1, y1 = rect

AssertionError: Rectangle must have 4 elements

Now when we use this function six months later, we won't have to scratch our heads too long when this sort of thing happens! Defensive programming in this way makes code more robust, easier to debug, and easier to maintain. This is especially important for scientific code: incorrect results can occur silently and end up getting published, perhaps only getting uncovered years later. Much embarrassment and misinformation can be avoided by being skeptical of our code, and by using tools such as assert statements to make sure it keeps doing what we think it should.

Writing Python scripts

Running Python code from a notebook is great, but we often need to run Python code from the command line, perhaps on files as input data. We can turn our module datastuff.py into a simple script by adding some meat at the bottom:

In [ ]:
if __name__ == '__main__':

    import pandas as pd
    import sys

    for dataset in sys.argv[1:]:
        print(dataset)

        data = pd.read_csv(dataset)
        data['temperature'] = fahrenheit_to_celsius(
                data['temperature'])

        fig = analyze_mosquitos(data)

        fig.savefig(dataset + "_plot.pdf")

The if __name__ == '__main__' check at the start of this block says that the code beneath it should only be executed when the module is run as a script. When Python runs a module directly, it sets that module's __name__ to '__main__'; when the module is imported, __name__ is the module's own name instead, so the block is skipped.
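To see the mechanism on its own, here is a hypothetical stand-alone module (call it demo.py; the name and contents are ours for illustration, not part of datastuff.py):

```python
# demo.py -- a hypothetical module illustrating __name__.

def greet():
    return "hello"

# This line always runs, whether the module is imported or executed.
print("__name__ is", __name__)

if __name__ == '__main__':
    # Runs only with `python demo.py`, not on `import demo`.
    print(greet())
```

Running python demo.py prints both lines; import demo prints only the first, with __name__ set to demo.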

Adding this allows us to run the script on any number of input files as:

python datastuff.py A1_mosquito_data.csv A2_mosquito_data.csv

or even:

python datastuff.py *.csv

The script loops over each filename given as an argument, converts the temperature column to Celsius, then generates and saves a PDF of the analysis figures for that dataset.

sys.argv is a list having the name of the script as its zeroth element, then each argument to the script in order from the first to the last. We do sys.argv[1:] to slice out only those arguments so we can iterate through them. This is a quick-and-dirty way to parse arguments, but if you're interested in writing more sophisticated command-line utilities, definitely use argparse.
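For reference, a minimal argparse version of the same idea might look like this (the --outdir option is purely illustrative and not part of datastuff.py):

```python
import argparse

def build_parser():
    """Build the command-line interface for the script."""
    parser = argparse.ArgumentParser(
        description="Analyze mosquito datasets.")
    # nargs="+" requires at least one positional argument.
    parser.add_argument("datasets", nargs="+",
                        help="CSV files to analyze")
    parser.add_argument("--outdir", default=".",
                        help="directory to write output PDFs into")
    return parser

if __name__ == '__main__':
    args = build_parser().parse_args()
    for dataset in args.datasets:
        print("would analyze", dataset, "->", args.outdir)
```

Beyond parsing, argparse generates --help output and clear error messages for missing or malformed arguments for free, which slicing sys.argv does not.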