This lesson assumes some programming background, though not necessarily with Python. Note that it moves at a faster pace than a novice lesson.
We will learn how to do the following: read and explore tabular data with pandas, plot it with matplotlib, fit a simple statistical model with statsmodels, write loops, lists, functions, and modules of our own, and make sense of Python's errors and exceptions.
This lesson is based heavily on the Software Carpentry python-intermediate-mosquitoes lesson, and uses its datasets. It also includes some mixed-in components from the python-novice-inflammation lesson.
We are interested in understanding the relationship between the weather and the number of mosquitos occurring in a particular year, so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites, we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem come from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:
%cat A1_mosquito_data.csv | head -n 5
year,temperature,rainfall,mosquitos
2001,80,157,150
2002,85,252,217
2003,86,154,153
2004,87,159,158
And we have five files to work with:
%ls *.csv
A1_mosquito_data.csv A3_mosquito_data.csv B2_mosquito_data.csv A2_mosquito_data.csv B1_mosquito_data.csv
Note: commands preceded with a % are known as "magics". These are specific to the notebook environment; they are not Python commands.
import pandas as pd
Importing a library in this way allows us to use the components defined inside of it. First, we'll read in a single dataset using the pandas.read_csv function:
pd.read_csv('A1_mosquito_data.csv')
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 |
2 | 2003 | 86 | 154 | 153 |
3 | 2004 | 87 | 159 | 158 |
4 | 2005 | 74 | 292 | 243 |
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
This reads the CSV from disk and deserializes it into a pandas.DataFrame. But unless we attach a name to the DataFrame the function returns, we cannot keep working with the object in memory.
data = pd.read_csv('A1_mosquito_data.csv')
data
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 |
2 | 2003 | 86 | 154 | 153 |
3 | 2004 | 87 | 159 | 158 |
4 | 2005 | 74 | 292 | 243 |
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
Now we can refer to the DataFrame directly, without having to read it from disk every time.
A bit about names. If I attach the name weight_kg to the value 55:
weight_kg = 55
I can then use this value to calculate the same weight in pounds, weight_lb:
weight_lb = 2.2 * weight_kg
weight_lb
121.00000000000001
If I change my weight_kg to 66, what is the current value of weight_lb?
weight_kg = 66
weight_lb
121.00000000000001
There's no change. This is because the name weight_lb is attached to a value, in this case a floating-point number. The value has no concept of how it came to be.
weight_lb = 2.2 * weight_kg
weight_lb
145.20000000000002
Back to our DataFrame. A DataFrame is an object. We can see what type of object it is with the Python built-in type:
type(data)
pandas.core.frame.DataFrame
It turns out that in Python, everything is an object. We'll see what this means as we go, but the most important aspect of this is that in Python we have names, and we assign these to objects. Any name can point to any object, and more than one name can point to a single object.
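A quick illustration of this (a throwaway sketch, not part of our mosquito analysis): two names attached to the same list object see each other's changes:

a = [1, 2, 3]
b = a            # b is a second name for the very same list object
b.append(4)
print(a)         # [1, 2, 3, 4] -- the change is visible through both names
print(a is b)    # True: one object, two names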
Anyway, a DataFrame allows us to get at individual components of our tabular data. We can get single columns like:
data['year']
0    2001
1    2002
2    2003
3    2004
4    2005
5    2006
6    2007
7    2008
8    2009
9    2010
Name: year, dtype: int64
Or multiple columns with:
data[['rainfall', 'temperature']]
rainfall | temperature | |
---|---|---|
0 | 157 | 80 |
1 | 252 | 85 |
2 | 154 | 86 |
3 | 159 | 87 |
4 | 292 | 74 |
5 | 283 | 75 |
6 | 214 | 80 |
7 | 197 | 85 |
8 | 231 | 74 |
9 | 207 | 74 |
Slicing can be used to get back subsets of rows:
data[0:2]
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 |
Python indices are 0-based, meaning counting goes as 0, 1, 2, 3...; this means that the first row is row 0, the second row is row 1, etc. It's best to refer to row 0 as the "zeroth row" to avoid confusion.
This slice should be read as "get the 0th element up to and not including the 2nd element". The "not including" is important, and the cause of much initial frustration. It does take some getting used to.
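As a quick sketch with a plain Python list (made-up values, just for illustration):

years = [2001, 2002, 2003, 2004]
print(years[0:2])   # [2001, 2002]: element 0 up to, but not including, element 2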
What if we want a single row?
data[1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1875             try:
-> 1876                 return self._engine.get_loc(key)
   1877             except KeyError:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()

KeyError: 1

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-18-c805864c0d75> in <module>()
----> 1 data[1]

/usr/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1990             return self._getitem_multilevel(key)
   1991         else:
-> 1992             return self._getitem_column(key)
   1993 
   1994     def _getitem_column(self, key):

/usr/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1997         # get column
   1998         if self.columns.is_unique:
-> 1999             return self._get_item_cache(key)
   2000 
   2001         # duplicate columns & possible reduce dimensionality

/usr/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1343         res = cache.get(item)
   1344         if res is None:
-> 1345             values = self._data.get(item)
   1346             res = self._box_item_values(item, values)
   1347             cache[item] = res

/usr/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3223 
   3224         if not isnull(item):
-> 3225             loc = self.items.get_loc(item)
   3226         else:
   3227             indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   1876                 return self._engine.get_loc(key)
   1877             except KeyError:
-> 1878                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1879 
   1880         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()

KeyError: 1
For a DataFrame, this is ambiguous, since a single value is interpreted as a column name. We can only get at rows by slicing at the top level:
data[1:2]
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
1 | 2002 | 85 | 252 | 217 |
Or we could use .iloc:
data.iloc[1]
year           2002
temperature      85
rainfall        252
mosquitos       217
Name: 1, dtype: int64
Getting a single row in this way returns a Series:
type(data.iloc[1])
pandas.core.series.Series
A Series is a 1-D column of values, all having the same datatype. Since each of our columns held integers, we got a Series with dtype int64 this time. If we had columns with, e.g., strings, then we'd get back dtype object, which is a catchall for pandas.
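A minimal sketch of this (with made-up values, not from our dataset):

print(pd.Series([1, 2, 3]).dtype)          # int64
print(pd.Series(['a', 'b', 'c']).dtype)    # object: the catchall dtype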
We can also get the data in our Series as a raw numpy array:
type(data.iloc[1].values)
numpy.ndarray
Pandas is a relatively young library, but it's built on top of the venerable numpy array, which makes it possible to do fast numerical work in Python. A Series is basically a 1-D numpy array with the ability to select by labeled indices:
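For instance (a sketch using our data), we can pull a single value out of a Series by its label with .loc, or by its position with .iloc:

temps = data['temperature']
print(temps.loc[3])    # select by index *label*
print(temps.iloc[3])   # select by *position*; the same here, since the labels run 0-9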
More usefully than simple slicing, we can use boolean indexing to subselect our data. Say we want only the data for years after 2005:
data[data['year'] > 2005]
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
There's no magic here; we get a boolean index directly from a comparison:
gt_2005 = data['year'] > 2005
gt_2005
0    False
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
9     True
Name: year, dtype: bool
And using this Series of bools will then give only the rows for which the Series had True:
data[gt_2005]
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
This is the same behavior as numpy arrays: most binary operators, such as +, *, >, and &, work element-wise. With a single value on one side (such as 2005), we get the result of the operation for each element.
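Because comparisons are element-wise, we can also combine boolean Series with & (element-wise and) or | (element-wise or) to build compound filters; a sketch (note the parentheses, which are needed because of operator precedence):

recent_and_warm = (data['year'] > 2005) & (data['temperature'] > 75)
data[recent_and_warm]   # only rows where both conditions hold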
A DataFrame is an object, and objects have methods. These are functions that are part of the object itself, often doing operations on the object's data. One of these is DataFrame.mean:
data.mean()
year           2005.5
temperature      80.0
rainfall        214.6
mosquitos       191.3
dtype: float64
We get back the mean value of each column as a single Series. There's more like this:
data.max()
year           2010
temperature      87
rainfall        292
mosquitos       243
dtype: int64
There's also DataFrame.describe, which gives common descriptive statistics of the whole DataFrame:
data.describe()
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
count | 10.00000 | 10.000000 | 10.000000 | 10.00000 |
mean | 2005.50000 | 80.000000 | 214.600000 | 191.30000 |
std | 3.02765 | 5.456902 | 50.317216 | 33.23335 |
min | 2001.00000 | 74.000000 | 154.000000 | 150.00000 |
25% | 2003.25000 | 74.250000 | 168.500000 | 163.75000 |
50% | 2005.50000 | 80.000000 | 210.500000 | 187.00000 |
75% | 2007.75000 | 85.000000 | 246.750000 | 212.75000 |
max | 2010.00000 | 87.000000 | 292.000000 | 243.00000 |
This is, itself, a DataFrame:
data.describe()['temperature']
count    10.000000
mean     80.000000
std       5.456902
min      74.000000
25%      74.250000
50%      80.000000
75%      85.000000
max      87.000000
Name: temperature, dtype: float64
Documentation is a key part of Python's design. In the notebook, you can get a quick look at the docs for a given Python function or method with:
data.describe?
Or more generally (built-in Python behavior):
help(data.describe)
Help on method describe in module pandas.core.generic:

describe(percentiles=None, include=None, exclude=None) method of pandas.core.frame.DataFrame instance
    Generate various summary statistics, excluding NaN values.

    Parameters
    ----------
    percentiles : array-like, optional
        The percentiles to include in the output. Should all
        be in the interval [0, 1]. By default `percentiles` is
        [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
    include, exclude : list-like, 'all', or None (default)
        Specify the form of the returned result. Either:

        - None to both (default). The result will include only
          numeric-typed columns or, if none are, only categorical columns.
        - A list of dtypes or strings to be included/excluded.
          To select all numeric types use numpy numpy.number. To select
          categorical objects use type object. See also the select_dtypes
          documentation. eg. df.describe(include=['O'])
        - If include is the string 'all', the output column-set will
          match the input one.

    Returns
    -------
    summary: NDFrame of summary statistics

    Notes
    -----
    The output DataFrame index depends on the requested dtypes:

    For numeric dtypes, it will include: count, mean, std, min,
    max, and lower, 50, and upper percentiles.

    For object dtypes (e.g. timestamps or strings), the index will include
    the count, unique, most common, and frequency of the most common.
    Timestamps also include the first and last items.

    For mixed dtypes, the index will be the union of the corresponding
    output types. Non-applicable entries will be filled with NaN.
    Note that mixed-dtype outputs can only be returned from mixed-dtype
    inputs and appropriate use of the include/exclude arguments.

    If multiple values have the highest count, then the
    `count` and `most common` pair will be arbitrarily chosen from
    among those with the highest count.

    The include, exclude arguments are ignored for Series.

    See Also
    --------
    DataFrame.select_dtypes
Say we want the standard deviation of the mosquito counts, but only for years in which the rainfall exceeded 200. One way we could do this is to first grab the "mosquitos" column, then use a fancy index obtained from comparing the "rainfall" column to 200. We can then call the std method of the resulting Series:
data['mosquitos'][data['rainfall'] > 200].std()
24.587937421969063
Note that this is a key part of the power of pandas objects: operations for subsetting and calculating descriptive statistics can often be stacked to great effect.
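As another sketch in this stacked style, we could ask for the mean mosquito count in the hotter years, all in one expression:

data[data['temperature'] > 80]['mosquitos'].mean()   # filter rows, select a column, aggregate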
What if we know our temperatures are in fahrenheit, but we want them in celsius? We can convert them.
(data['temperature'] - 32) * 5 / 9
0    26.666667
1    29.444444
2    30.000000
3    30.555556
4    23.333333
5    23.888889
6    26.666667
7    29.444444
8    23.333333
9    23.333333
Name: temperature, dtype: float64
This gives us back a new Series object. If we want to change the values in our existing DataFrame to be these, we can just set the column to this new Series:
data['temperature'] = (data['temperature'] - 32) * 5 / 9
data['temperature']
0    26.666667
1    29.444444
2    30.000000
3    30.555556
4    23.333333
5    23.888889
6    26.666667
7    29.444444
8    23.333333
9    23.333333
Name: temperature, dtype: float64
Similarly, it's also possible to add new columns. We could add one giving us, e.g. the ratio of rainfall to mosquitos:
data['rainfall / mosquitos'] = data['rainfall'] / data['mosquitos']
data
year | temperature | rainfall | mosquitos | rainfall / mosquitos | |
---|---|---|---|---|---|
0 | 2001 | 26.666667 | 157 | 150 | 1.046667 |
1 | 2002 | 29.444444 | 252 | 217 | 1.161290 |
2 | 2003 | 30.000000 | 154 | 153 | 1.006536 |
3 | 2004 | 30.555556 | 159 | 158 | 1.006329 |
4 | 2005 | 23.333333 | 292 | 243 | 1.201646 |
5 | 2006 | 23.888889 | 283 | 237 | 1.194093 |
6 | 2007 | 26.666667 | 214 | 190 | 1.126316 |
7 | 2008 | 29.444444 | 197 | 181 | 1.088398 |
8 | 2009 | 23.333333 | 231 | 200 | 1.155000 |
9 | 2010 | 23.333333 | 207 | 184 | 1.125000 |
It's probably a bit silly to have such a column. So, let's get rid of it:
del data['rainfall / mosquitos']
data
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
0 | 2001 | 26.666667 | 157 | 150 |
1 | 2002 | 29.444444 | 252 | 217 |
2 | 2003 | 30.000000 | 154 | 153 |
3 | 2004 | 30.555556 | 159 | 158 |
4 | 2005 | 23.333333 | 292 | 243 |
5 | 2006 | 23.888889 | 283 | 237 |
6 | 2007 | 26.666667 | 214 | 190 |
7 | 2008 | 29.444444 | 197 | 181 |
8 | 2009 | 23.333333 | 231 | 200 |
9 | 2010 | 23.333333 | 207 | 184 |
Don't like the order of your columns? We can make a DataFrame with a different column order by selecting them out in the order we want:
data[['rainfall', 'year', 'mosquitos', 'temperature']]
rainfall | year | mosquitos | temperature | |
---|---|---|---|---|
0 | 157 | 2001 | 150 | 26.666667 |
1 | 252 | 2002 | 217 | 29.444444 |
2 | 154 | 2003 | 153 | 30.000000 |
3 | 159 | 2004 | 158 | 30.555556 |
4 | 292 | 2005 | 243 | 23.333333 |
5 | 283 | 2006 | 237 | 23.888889 |
6 | 214 | 2007 | 190 | 26.666667 |
7 | 197 | 2008 | 181 | 29.444444 |
8 | 231 | 2009 | 200 | 23.333333 |
9 | 207 | 2010 | 184 | 23.333333 |
We can even have duplicates:
data[['rainfall', 'year', 'mosquitos', 'temperature', 'year']]
rainfall | year | mosquitos | temperature | year | |
---|---|---|---|---|---|
0 | 157 | 2001 | 150 | 26.666667 | 2001 |
1 | 252 | 2002 | 217 | 29.444444 | 2002 |
2 | 154 | 2003 | 153 | 30.000000 | 2003 |
3 | 159 | 2004 | 158 | 30.555556 | 2004 |
4 | 292 | 2005 | 243 | 23.333333 | 2005 |
5 | 283 | 2006 | 237 | 23.888889 | 2006 |
6 | 214 | 2007 | 190 | 26.666667 | 2007 |
7 | 197 | 2008 | 181 | 29.444444 | 2008 |
8 | 231 | 2009 | 200 | 23.333333 | 2009 |
9 | 207 | 2010 | 184 | 23.333333 | 2010 |
Remember: this returns a copy. It does not change our existing DataFrame:
data
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
0 | 2001 | 26.666667 | 157 | 150 |
1 | 2002 | 29.444444 | 252 | 217 |
2 | 2003 | 30.000000 | 154 | 153 |
3 | 2004 | 30.555556 | 159 | 158 |
4 | 2005 | 23.333333 | 292 | 243 |
5 | 2006 | 23.888889 | 283 | 237 |
6 | 2007 | 26.666667 | 214 | 190 |
7 | 2008 | 29.444444 | 197 | 181 |
8 | 2009 | 23.333333 | 231 | 200 |
9 | 2010 | 23.333333 | 207 | 184 |
Just as pandas is the de facto library for working with tabular data, matplotlib is the de facto library for producing high-quality plots from numerical data. A saying often goes that matplotlib makes easy things easy and hard things possible when it comes to making plots.
import matplotlib.pyplot as plt
We need to tell the notebook to render plots in the notebook itself:
%matplotlib inline
Before we use matplotlib explicitly: pandas objects feature convenience methods for common plotting operations. These use matplotlib internally, so everything we learn later about matplotlib objects applies to these. For example, we can plot both "temperature" and "mosquitos" as a function of "year":
data.plot(x='year', y=['temperature', 'mosquitos'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f339711ce80>
Let's load a larger dataset:
data = pd.read_csv('A2_mosquito_data.csv')
This dataset has 51 rows, instead of just 10:
len(data)
51
data.plot(x='year', y=['temperature', 'mosquitos'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f33970c72e8>
There are other convenience methods for different ways of plotting the data. For example, we can get a kernel-density estimate:
data['mosquitos'].plot.kde()
<matplotlib.axes._subplots.AxesSubplot at 0x7f3397030160>
These convenience methods are great, but for more complex plots we can, and should, use matplotlib directly. We can make a multi-paneled figure giving both "temperature" and "rainfall" for each "year":
fig = plt.figure(figsize=(4, 6))
ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(data['year'], data['temperature'], 'ro-')
ax1.set_xlabel('Year')
ax1.set_ylabel('Temperature')
ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(data['year'], data['rainfall'], 'bs-')
ax2.set_xlabel('Year')
ax2.set_ylabel('Rainfall')
<matplotlib.text.Text at 0x7f3390d64ef0>
We can plot the mosquito population against temperature and against rainfall in a similar way as we did above.
fig = plt.figure(figsize=(4, 6))
ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(data['temperature'], data['mosquitos'], 'ro')
ax1.set_xlabel('Temperature')
ax1.set_ylabel('Mosquitos')
ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
ax2.set_xlabel('Rainfall')
ax2.set_ylabel('Mosquitos')
<matplotlib.text.Text at 0x7f3390d15978>
Note that leaving the line code ('-') off the end of the format string turns off the interpolation lines between points, since here we want to just plot the points and don't care about their order.
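The format strings decompose like this (a sketch with throwaway data, not from our analysis): a color code ('r' red, 'b' blue, 'g' green), a marker code ('o' circles, 's' squares, 'd' diamonds), and an optional line code ('-'):

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
x = [1, 2, 3]           # throwaway data for illustration
y = [2, 1, 3]
ax.plot(x, y, 'ro')     # red circle markers, no connecting line
ax.plot(x, y, 'bs-')    # blue squares, with the connecting line back on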
From this one dataset we see what looks like a linear relationship between mosquitos and rainfall, but not really any relationship between mosquitos and temperature. We'd like to quantify this further by applying a statistical model to the data, but we also want to do this same treatment to all our datasets. We need to learn a few more things, including loops, before continuing further.
First, an aside on strings. Most programming languages deal in a few basic data types, including integers, floating-point numbers, and characters. A string is a sequence of characters, and we can manipulate them easily with Python. For example:
word = 'bird'
We have a string 'bird' that we've assigned the name word. We can examine the string's type, and indeed that of any object in Python, with the type built-in.
type(word)
str
We can also access its characters (it's a sequence!) with indexing:
word[0]
'b'
word[1]
'i'
And slicing:
word[1:]
'ird'
If I want to output each letter of word on a new line, I could do:
print(word[0])
print(word[1])
print(word[2])
print(word[3])
b
i
r
d
But this doesn't scale to words of arbitrary length. Instead, let's let the computer do the boring work of iterating, with a for loop:
for letter in word:
    print(letter)
b
i
r
d
This same bit of code will work for any word, including ridiculously long ones. Like this gem:
word = 'supercallifragilisticexpialidocious'
print(word)
supercallifragilisticexpialidocious
for letter in word:
    print(letter)
s
u
p
e
r
c
a
l
l
i
f
r
a
g
i
l
i
s
t
i
c
e
x
p
i
a
l
i
d
o
c
i
o
u
s
We can do more with loops than iterate through the elements of a sequence. A common use case is building a sum. For example, counting the number of letters in word:
counter = 0
for letter in word:
    counter = counter + 1
print(counter)
35
For this very particular use, there's a Python built-in, which is guaranteed to be more efficient. Use it instead:
len(word)
35
It works on any sequence! For a pandas.DataFrame it gives the number of rows:
len(data)
51
Loops work best when they have something to iterate over. In Python, the list is a built-in data structure that serves as a container for a sequence of other objects. We can make an empty list with:
stuff = list()
print(stuff)
[]
And we can append things to the end of the list, in this case a bunch of random strings:
stuff.append('marklar')
stuff.append('chewbacca')
stuff.append('chickenfingers')
stuff
['marklar', 'chewbacca', 'chickenfingers']
Just like with the word example, we can iterate through the elements of the list:
for item in stuff:
    print(item)
marklar
chewbacca
chickenfingers
And slicing works too! Slicing with negative numbers counts backwards from the end of the list. The slice stuff[-2:] could be read as "get the 2nd-to-last element of stuff, through the last element":
for item in stuff[-2:]:
    print(item)
chewbacca
chickenfingers
So indexing with -1 gives the last element:
stuff[-1]
'chickenfingers'
And consistently, but uselessly, slicing from and to the same index yields nothing. Remember, a slice n:m reads as "get all elements starting with element n and up to but not including element m."
stuff[0:0]
[]
Lists are mutable: that is, their elements can be changed. For example, we can replace 'chewbacca' with 'hansolo':
stuff[1]
'chewbacca'
stuff[1] = 'hansolo'
stuff[1]
'hansolo'
...or replace 'hansolo' with a DataFrame:
stuff[1] = data
stuff[1]
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
0 | 1960 | 82 | 200 | 180 |
1 | 1961 | 70 | 227 | 194 |
2 | 1962 | 89 | 231 | 207 |
3 | 1963 | 74 | 114 | 121 |
4 | 1964 | 78 | 147 | 140 |
5 | 1965 | 85 | 151 | 148 |
6 | 1966 | 86 | 172 | 162 |
7 | 1967 | 75 | 106 | 112 |
8 | 1968 | 70 | 276 | 230 |
9 | 1969 | 86 | 165 | 162 |
10 | 1970 | 83 | 222 | 198 |
11 | 1971 | 78 | 297 | 247 |
12 | 1972 | 87 | 288 | 248 |
13 | 1973 | 76 | 286 | 239 |
14 | 1974 | 86 | 231 | 202 |
15 | 1975 | 90 | 284 | 243 |
16 | 1976 | 76 | 190 | 175 |
17 | 1977 | 87 | 257 | 225 |
18 | 1978 | 88 | 128 | 133 |
19 | 1979 | 87 | 218 | 199 |
20 | 1980 | 81 | 206 | 184 |
21 | 1981 | 74 | 175 | 160 |
22 | 1982 | 85 | 202 | 187 |
23 | 1983 | 71 | 130 | 126 |
24 | 1984 | 80 | 225 | 200 |
25 | 1985 | 72 | 196 | 173 |
26 | 1986 | 76 | 261 | 222 |
27 | 1987 | 85 | 111 | 121 |
28 | 1988 | 83 | 247 | 210 |
29 | 1989 | 86 | 137 | 142 |
30 | 1990 | 82 | 159 | 152 |
31 | 1991 | 77 | 172 | 160 |
32 | 1992 | 74 | 280 | 231 |
33 | 1993 | 70 | 291 | 238 |
34 | 1994 | 77 | 126 | 125 |
35 | 1995 | 89 | 191 | 178 |
36 | 1996 | 83 | 298 | 248 |
37 | 1997 | 80 | 282 | 237 |
38 | 1998 | 86 | 219 | 195 |
39 | 1999 | 72 | 143 | 134 |
40 | 2000 | 79 | 262 | 221 |
41 | 2001 | 85 | 189 | 175 |
42 | 2002 | 86 | 205 | 186 |
43 | 2003 | 72 | 195 | 173 |
44 | 2004 | 78 | 148 | 146 |
45 | 2005 | 71 | 262 | 219 |
46 | 2006 | 88 | 255 | 226 |
47 | 2007 | 79 | 262 | 221 |
48 | 2008 | 73 | 198 | 176 |
49 | 2009 | 86 | 215 | 187 |
50 | 2010 | 87 | 127 | 129 |
print(stuff)
['marklar', year temperature rainfall mosquitos 0 1960 82 200 180 1 1961 70 227 194 2 1962 89 231 207 3 1963 74 114 121 4 1964 78 147 140 5 1965 85 151 148 6 1966 86 172 162 7 1967 75 106 112 8 1968 70 276 230 9 1969 86 165 162 10 1970 83 222 198 11 1971 78 297 247 12 1972 87 288 248 13 1973 76 286 239 14 1974 86 231 202 15 1975 90 284 243 16 1976 76 190 175 17 1977 87 257 225 18 1978 88 128 133 19 1979 87 218 199 20 1980 81 206 184 21 1981 74 175 160 22 1982 85 202 187 23 1983 71 130 126 24 1984 80 225 200 25 1985 72 196 173 26 1986 76 261 222 27 1987 85 111 121 28 1988 83 247 210 29 1989 86 137 142 30 1990 82 159 152 31 1991 77 172 160 32 1992 74 280 231 33 1993 70 291 238 34 1994 77 126 125 35 1995 89 191 178 36 1996 83 298 248 37 1997 80 282 237 38 1998 86 219 195 39 1999 72 143 134 40 2000 79 262 221 41 2001 85 189 175 42 2002 86 205 186 43 2003 72 195 173 44 2004 78 148 146 45 2005 71 262 219 46 2006 88 255 226 47 2007 79 262 221 48 2008 73 198 176 49 2009 86 215 187 50 2010 87 127 129, 'chickenfingers']
We can even add the list as an element to itself:
stuff.append(stuff)
So now stuff[-1] points to the same list the name stuff does:
stuff[-1] is stuff
True
It's important to recognize that these are exactly the same list. Modifications to stuff[-1] will be seen in stuff. These are just names, and in this case they refer to the same object in memory.
Lists can contain lists can contain lists can contain lists...and of course any number of other Python objects. Even functions can be included as elements:
stuff.append(len)
stuff
['marklar', year temperature rainfall mosquitos 0 1960 82 200 180 1 1961 70 227 194 2 1962 89 231 207 3 1963 74 114 121 4 1964 78 147 140 5 1965 85 151 148 6 1966 86 172 162 7 1967 75 106 112 8 1968 70 276 230 9 1969 86 165 162 10 1970 83 222 198 11 1971 78 297 247 12 1972 87 288 248 13 1973 76 286 239 14 1974 86 231 202 15 1975 90 284 243 16 1976 76 190 175 17 1977 87 257 225 18 1978 88 128 133 19 1979 87 218 199 20 1980 81 206 184 21 1981 74 175 160 22 1982 85 202 187 23 1983 71 130 126 24 1984 80 225 200 25 1985 72 196 173 26 1986 76 261 222 27 1987 85 111 121 28 1988 83 247 210 29 1989 86 137 142 30 1990 82 159 152 31 1991 77 172 160 32 1992 74 280 231 33 1993 70 291 238 34 1994 77 126 125 35 1995 89 191 178 36 1996 83 298 248 37 1997 80 282 237 38 1998 86 219 195 39 1999 72 143 134 40 2000 79 262 221 41 2001 85 189 175 42 2002 86 205 186 43 2003 72 195 173 44 2004 78 148 146 45 2005 71 262 219 46 2006 88 255 226 47 2007 79 262 221 48 2008 73 198 176 49 2009 86 215 187 50 2010 87 127 129, 'chickenfingers', [...], <function len>]
Lists also have their own methods, which usually alter the list itself:
stuff.reverse()
stuff
[<function len>, [...], 'chickenfingers', year temperature rainfall mosquitos 0 1960 82 200 180 1 1961 70 227 194 2 1962 89 231 207 3 1963 74 114 121 4 1964 78 147 140 5 1965 85 151 148 6 1966 86 172 162 7 1967 75 106 112 8 1968 70 276 230 9 1969 86 165 162 10 1970 83 222 198 11 1971 78 297 247 12 1972 87 288 248 13 1973 76 286 239 14 1974 86 231 202 15 1975 90 284 243 16 1976 76 190 175 17 1977 87 257 225 18 1978 88 128 133 19 1979 87 218 199 20 1980 81 206 184 21 1981 74 175 160 22 1982 85 202 187 23 1983 71 130 126 24 1984 80 225 200 25 1985 72 196 173 26 1986 76 261 222 27 1987 85 111 121 28 1988 83 247 210 29 1989 86 137 142 30 1990 82 159 152 31 1991 77 172 160 32 1992 74 280 231 33 1993 70 291 238 34 1994 77 126 125 35 1995 89 191 178 36 1996 83 298 248 37 1997 80 282 237 38 1998 86 219 195 39 1999 72 143 134 40 2000 79 262 221 41 2001 85 189 175 42 2002 86 205 186 43 2003 72 195 173 44 2004 78 148 146 45 2005 71 262 219 46 2006 88 255 226 47 2007 79 262 221 48 2008 73 198 176 49 2009 86 215 187 50 2010 87 127 129, 'marklar']
list.pop removes the element at the given index and returns it:
stuff.pop(-2)
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
0 | 1960 | 82 | 200 | 180 |
1 | 1961 | 70 | 227 | 194 |
2 | 1962 | 89 | 231 | 207 |
3 | 1963 | 74 | 114 | 121 |
4 | 1964 | 78 | 147 | 140 |
5 | 1965 | 85 | 151 | 148 |
6 | 1966 | 86 | 172 | 162 |
7 | 1967 | 75 | 106 | 112 |
8 | 1968 | 70 | 276 | 230 |
9 | 1969 | 86 | 165 | 162 |
10 | 1970 | 83 | 222 | 198 |
11 | 1971 | 78 | 297 | 247 |
12 | 1972 | 87 | 288 | 248 |
13 | 1973 | 76 | 286 | 239 |
14 | 1974 | 86 | 231 | 202 |
15 | 1975 | 90 | 284 | 243 |
16 | 1976 | 76 | 190 | 175 |
17 | 1977 | 87 | 257 | 225 |
18 | 1978 | 88 | 128 | 133 |
19 | 1979 | 87 | 218 | 199 |
20 | 1980 | 81 | 206 | 184 |
21 | 1981 | 74 | 175 | 160 |
22 | 1982 | 85 | 202 | 187 |
23 | 1983 | 71 | 130 | 126 |
24 | 1984 | 80 | 225 | 200 |
25 | 1985 | 72 | 196 | 173 |
26 | 1986 | 76 | 261 | 222 |
27 | 1987 | 85 | 111 | 121 |
28 | 1988 | 83 | 247 | 210 |
29 | 1989 | 86 | 137 | 142 |
30 | 1990 | 82 | 159 | 152 |
31 | 1991 | 77 | 172 | 160 |
32 | 1992 | 74 | 280 | 231 |
33 | 1993 | 70 | 291 | 238 |
34 | 1994 | 77 | 126 | 125 |
35 | 1995 | 89 | 191 | 178 |
36 | 1996 | 83 | 298 | 248 |
37 | 1997 | 80 | 282 | 237 |
38 | 1998 | 86 | 219 | 195 |
39 | 1999 | 72 | 143 | 134 |
40 | 2000 | 79 | 262 | 221 |
41 | 2001 | 85 | 189 | 175 |
42 | 2002 | 86 | 205 | 186 |
43 | 2003 | 72 | 195 | 173 |
44 | 2004 | 78 | 148 | 146 |
45 | 2005 | 71 | 262 | 219 |
46 | 2006 | 88 | 255 | 226 |
47 | 2007 | 79 | 262 | 221 |
48 | 2008 | 73 | 198 | 176 |
49 | 2009 | 86 | 215 | 187 |
50 | 2010 | 87 | 127 | 129 |
...while list.remove removes the first instance of the given element in the list:
stuff.remove('chickenfingers')
So now our list looks like:
stuff
[<function len>, [...], 'marklar']
Now that we know how to do loops and use lists in Python, we'll use these to make a plot for each dataset. First, we'll import a module called glob:
import glob
We can use this to grab all the filenames matching a "globbing" pattern:
glob.glob("*.csv")
['A1_mosquito_data.csv', 'A3_mosquito_data.csv', 'B2_mosquito_data.csv', 'A2_mosquito_data.csv', 'B1_mosquito_data.csv']
"Globbing" is the name of the shell-completion patterns common for shells like sh
and bash
.
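Patterns can be more selective than *.csv; for example, a sketch matching only the "A" sites:

glob.glob('A*_mosquito_data.csv')   # matches the A1, A2, and A3 files (order not guaranteed)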
We can grab the block of code we wrote previously, and loop through it for each file obtained by globbing:
import glob
files = glob.glob('*.csv')
for file in files:
    print(file)
    data = pd.read_csv(file)

    fig = plt.figure(figsize=(4, 6))

    ax1 = fig.add_subplot(2, 1, 1)
    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')

    ax2 = fig.add_subplot(2, 1, 2)
    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')

    # tell matplotlib to render the plot *now*
    plt.show()
A1_mosquito_data.csv
A3_mosquito_data.csv
B2_mosquito_data.csv
A2_mosquito_data.csv
B1_mosquito_data.csv
It looks like there's a similar pattern for each area we have data for. We'd like to quantify more closely the relationship between mosquito population and each of these variables.
There are many statistics packages in Python, but one of the most well-developed and widely used is statsmodels:
import statsmodels.api as sm
We'll use our second dataset for the initial stab at fitting a statistical model to our data:
data = pd.read_csv('A2_mosquito_data.csv')
And let's convert the temperatures to Celsius, for good measure:
data['temperature'] = (data['temperature'] - 32) * 5 / 9
We can build an ordinary least-squares fit to the data using statsmodels.OLS. Remember, we can check the docs with:
sm.OLS?
We want to fit the relationship between mosquito population and, for a start, rainfall:
regr_results = sm.OLS(data['mosquitos'], data['rainfall']).fit()
regr_results.summary()
Dep. Variable: | mosquitos | R-squared: | 0.996 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.996 |
Method: | Least Squares | F-statistic: | 1.386e+04 |
Date: | Mon, 09 May 2016 | Prob (F-statistic): | 8.69e-63 |
Time: | 23:02:14 | Log-Likelihood: | -196.25 |
No. Observations: | 51 | AIC: | 394.5 |
Df Residuals: | 50 | BIC: | 396.4 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
rainfall | 0.8811 | 0.007 | 117.744 | 0.000 | 0.866 0.896 |
Omnibus: | 4.422 | Durbin-Watson: | 1.878 |
---|---|---|---|
Prob(Omnibus): | 0.110 | Jarque-Bera (JB): | 1.959 |
Skew: | -0.085 | Prob(JB): | 0.376 |
Kurtosis: | 2.055 | Cond. No. | 1.00 |
So, this gives us a lot of information. But is this what we want?
We actually want to quantify which variable correlates more with the mosquito population. One way we can build a multivariate model with statsmodels is to use a formula:
regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
regr_results.summary()
Dep. Variable: | mosquitos | R-squared: | 0.997 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.997 |
Method: | Least Squares | F-statistic: | 7889. |
Date: | Mon, 09 May 2016 | Prob (F-statistic): | 3.68e-61 |
Time: | 23:02:15 | Log-Likelihood: | -111.54 |
No. Observations: | 51 | AIC: | 229.1 |
Df Residuals: | 48 | BIC: | 234.9 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 17.5457 | 2.767 | 6.341 | 0.000 | 11.983 23.109 |
temperature | 0.8719 | 0.092 | 9.457 | 0.000 | 0.687 1.057 |
rainfall | 0.6967 | 0.006 | 125.385 | 0.000 | 0.686 0.708 |
Omnibus: | 1.651 | Durbin-Watson: | 1.872 |
---|---|---|---|
Prob(Omnibus): | 0.438 | Jarque-Bera (JB): | 0.906 |
Skew: | -0.278 | Prob(JB): | 0.636 |
Kurtosis: | 3.343 | Cond. No. | 1.92e+03 |
Each of these results can be accessed as attributes of the Results object:
print(regr_results.params)
Intercept      17.545739
temperature     0.871943
rainfall        0.696717
dtype: float64
print(regr_results.rsquared)
0.996966873691
Our model "explains" most of the data. Nice! Let's plot the predicted population of mosquitos from our fitted model against the measured population to see how well we did:
import matplotlib.pyplot as plt
figure = plt.figure()
ax = figure.add_subplot(1,1,1)
parameters = regr_results.params
predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall']
ax.plot(predicted, data['mosquitos'], 'ro')
ax.set_xlabel('predicted mosquito population')
ax.set_ylabel('measured mosquito population')
<matplotlib.text.Text at 0x7f338d98c908>
The more linear this plot is, the better our model fits the data. Also, it turns out that, as you might have guessed from the plots we made previously, rainfall is a better predictor than average temperature for how many mosquitoes we expect to see. The t-values of the fit make this plain:
regr_results.tvalues
Intercept        6.341405
temperature      9.456614
rainfall       125.385116
dtype: float64
We can take our block of code we worked on previously, and make a three-panel plot instead of just a two-panel one:
filenames = glob.glob('*.csv')
for filename in filenames:
    print(filename)
    data = pd.read_csv(filename)

    # convert temperatures to celsius from fahrenheit
    data['temperature'] = (data['temperature'] - 32) * 5/9

    # perform fit
    regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
    print(regr_results.tvalues)

    fig = plt.figure(figsize=(6, 9))

    # plot predicted vs. measured mosquito populations from fitted model
    parameters = regr_results.params
    predicted = (parameters['Intercept'] +
                 parameters['temperature'] * data['temperature'] +
                 parameters['rainfall'] * data['rainfall'])

    ax0 = fig.add_subplot(3, 1, 1)
    ax0.plot(predicted, data['mosquitos'], 'gd')
    ax0.set_xlabel('predicted mosquito population')
    ax0.set_ylabel('measured mosquito population')

    # plot population vs. temperature
    ax1 = fig.add_subplot(3, 1, 2)
    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')

    # plot population vs. rainfall
    ax2 = fig.add_subplot(3, 1, 3)
    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')

    plt.show()
A1_mosquito_data.csv
Intercept       3.589733
temperature     3.456159
rainfall       56.825601
dtype: float64

A3_mosquito_data.csv
Intercept       9.658621
temperature    10.881139
rainfall       70.798256
dtype: float64

B2_mosquito_data.csv
Intercept        8.778676
temperature     11.647932
rainfall       125.016860
dtype: float64

A2_mosquito_data.csv
Intercept        6.341405
temperature      9.456614
rainfall       125.385116
dtype: float64

B1_mosquito_data.csv
Intercept      11.016743
temperature    13.396049
rainfall       43.072783
dtype: float64
For all areas, it looks as if rainfall is indeed a stronger predictor of mosquito population than temperature. But what if this is just the beginning of the number of datasets we have to analyze? Are we going to copy this block of code around each time we want to analyze new data?
No. That would be awful.
We have code blocks we've been copying and pasting around a lot, and this is fine for a while. But once we've built some useful things, we want to re-use them, and we don't want to maintain many versions of the same thing lying around. The solution: write a function. Here's a simple one that squares a number:
def square(x):
    x_squared = x ** 2
    return x_squared
square(3)
9
square(4)
16
Functions are also composable:
square(square(3))
81
We can improve on this a bit by adding documentation. In Python, we can write a docstring for our function directly in its definition:
def square(x):
    """Return the square of the value, `x`.

    Parameters
    ----------
    x : float
        Number to obtain the square of.

    Returns
    -------
    square : float
        The square of `x`.

    """
    x_squared = x ** 2
    return x_squared
And this docstring is now what we see when we use the builtin help system:
help(square)
Help on function square in module __main__: square(x) Return the square of the value, `x`. Parameters ---------- x : float Number to obtain the square of. Returns ------- square : float The square of `x`.
Or alternatively, in the notebook:
square?
Let's write a function to convert temperatures from fahrenheit to celsius, like we did earlier with our DataFrame column. This is one way we could write it:
def fahrenheit_to_celsius(temp):
    """Convert temperature in fahrenheit to celsius.

    Parameters
    ----------
    temp : float or array_like
        Temperature(s) in fahrenheit.

    Returns
    -------
    float or array_like
        Temperature(s) in celsius.

    """
    return (temp - 32) * 5 / 9
Note that this same function will work for pandas.Series objects, numpy arrays, and single numbers. Writing functions in this way takes advantage of the fact that these objects are roughly duck-typed: they all behave as if they were single numbers, so doing arithmetic operations with them yields the expected result.
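A sketch of the same function handling different types (our temperature column is already in celsius at this point, so this is purely to show the mechanics):

print(fahrenheit_to_celsius(212))                  # a single number: 100.0
print(fahrenheit_to_celsius(data['temperature']))  # a Series: converted element-wise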
Let's do our model building and plotting in a single function, returning a useful panel plot:
def analyze(data):
    """Return panel plot of mosquito population vs. temperature, rainfall.

    Also prints t-values for temperature and rainfall.

    Panel plot gives:
        1. comparison of modeled values of mosquito population vs. actual values
        2. mosquito population vs. average temperature
        3. mosquito population vs. total rainfall

    Parameters
    ----------
    data : DataFrame
        DataFrame giving columns for average temperature,
        total rainfall, and mosquito population during mosquito
        breeding season for each year.

    Returns
    -------
    Figure
        :mod:`matplotlib.figure.Figure` object giving panel plot.

    """
    # perform fit
    regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
    print(regr_results.tvalues)

    fig = plt.figure(figsize=(6, 9))

    # plot prediction from fitted model against measured mosquito population
    parameters = regr_results.params
    predicted = (parameters['Intercept'] +
                 parameters['temperature'] * data['temperature'] +
                 parameters['rainfall'] * data['rainfall'])

    ax0 = fig.add_subplot(3, 1, 1)
    ax0.plot(predicted, data['mosquitos'], 'go')
    ax0.set_xlabel('predicted mosquito population')
    ax0.set_ylabel('measured mosquito population')

    # plot population vs. temperature
    ax1 = fig.add_subplot(3, 1, 2)
    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')

    # plot population vs. rainfall
    ax2 = fig.add_subplot(3, 1, 3)
    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')

    return fig
Note that since this is Python, and everything is an object, we can return our Figure object. This is generally a useful thing to do for plotting functions, since in different contexts you may want to modify the resulting Figure before rendering any plots of it.
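For example (a sketch; the title and filename here are made up), we could decorate and save the returned Figure:

fig = analyze(data)
fig.suptitle('Site A2')          # add an overall title after the fact
fig.savefig('A2_panels.png')     # render the figure to a file on disk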
We can now re-write our big loop from before, and it should look a lot simpler.
filenames = glob.glob('*.csv')
for filename in filenames:
    print(filename)
    data = pd.read_csv(filename)

    # convert temperatures to celsius
    data['temperature'] = fahrenheit_to_celsius(data['temperature'])

    # get t-values; return plots
    analyze(data)
    plt.show()
A1_mosquito_data.csv
Intercept       3.589733
temperature     3.456159
rainfall       56.825601
dtype: float64

A3_mosquito_data.csv
Intercept       9.658621
temperature    10.881139
rainfall       70.798256
dtype: float64

B2_mosquito_data.csv
Intercept        8.778676
temperature     11.647932
rainfall       125.016860
dtype: float64

A2_mosquito_data.csv
Intercept        6.341405
temperature      9.456614
rainfall       125.385116
dtype: float64

B1_mosquito_data.csv
Intercept      11.016743
temperature    13.396049
rainfall       43.072783
dtype: float64
Our for loop went from being long and complicated to simple and easy-to-understand. This is part of the power of functions. Because the human brain can only keep about $7 \pm 2$ concepts in working memory at any given time, functions allow us to abstract and build more complicated code without making it more difficult to understand. Instead of having to worry about how analyze does its thing, we need only know its inputs and its outputs to wield it effectively. It can be re-used without copying and pasting, and we can maintain and improve it easily, since it is defined in a single place.
Since we may want to use these functions later in other notebooks or code, we can put their definitions in a file and import it. We open a text editor and copy the function definitions (and the imports they need) into a file called datastuff.py, which looks like this:
%cat datastuff.py
import statsmodels.api as sm
import matplotlib.pyplot as plt


def fahrenheit_to_celsius(temp):
    """Convert temperature in fahrenheit to celsius.

    Parameters
    ----------
    temp : float or array_like
        Temperature(s) in fahrenheit.

    Returns
    -------
    float or array_like
        Temperature(s) in celsius.

    """
    try:
        newtemps = (temp - 32) * 5/9
    except TypeError:
        newtemps = []
        for value in temp:
            newtemps.append(fahrenheit_to_celsius(value))
    return newtemps


def analyze(data):
    """Return panel plot of mosquito population vs. temperature, rainfall.

    Also prints t-values for temperature and rainfall.

    Panel plot gives:
        1. comparison of modeled values of mosquito population vs. actual values
        2. mosquito population vs. average temperature
        3. mosquito population vs. total rainfall

    Parameters
    ----------
    data : DataFrame
        DataFrame giving columns for average temperature,
        total rainfall, and mosquito population during mosquito
        breeding season for each year.

    Returns
    -------
    Figure
        :mod:`matplotlib.figure.Figure` object giving panel plot.

    """
    # perform fit
    regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
    print(regr_results.tvalues)

    fig = plt.figure(figsize=(6, 9))

    # plot prediction from fitted model against measured mosquito population
    parameters = regr_results.params
    predicted = (parameters['Intercept'] +
                 parameters['temperature'] * data['temperature'] +
                 parameters['rainfall'] * data['rainfall'])

    ax0 = fig.add_subplot(3, 1, 1)
    ax0.plot(predicted, data['mosquitos'], 'go')
    ax0.set_xlabel('predicted mosquito population')
    ax0.set_ylabel('measured mosquito population')

    # plot population vs. temperature
    ax1 = fig.add_subplot(3, 1, 2)
    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')

    # plot population vs. rainfall
    ax2 = fig.add_subplot(3, 1, 3)
    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')

    return fig
Congratulations: you just made your first Python module! We can import it directly:
import datastuff
And we can see our documentation for these functions and use them as we'd expect:
datastuff.fahrenheit_to_celsius?
datastuff.analyze(data)
Intercept      11.016743
temperature    13.396049
rainfall       43.072783
dtype: float64
Note: if you make modifications to your module, they will not show up in this session, even if you rerun import datastuff. You will either need to restart your Python session, or explicitly tell Python to rebuild the module object by re-reading the file:
import importlib
importlib.reload(datastuff)
<module 'datastuff' from '/home/alter/Documents/SWC/Events/2016_UCSF/swc_ucsf_intermediate_data/python/datastuff.py'>
This is because it's more efficient for the Python interpreter to build a module object once from a module file, and then reuse that same object for every subsequent import encountered, instead of building a whole new one. It's faster, and requires less memory.
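We can see this cache directly: imported modules live in the sys.modules dictionary, and a repeated import statement just looks the module up there instead of re-executing the file. A sketch:

import sys
print('datastuff' in sys.modules)   # True: the module object is cached
import datastuff                    # reuses the cached object; the file is not re-read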
Although we didn't need to do it for these examples, pandas is generally more useful when we can pack as much of our data as is reasonable into a single DataFrame. Since all our datasets have the same form, we can probably gain by combining them.
import pandas as pd
import glob
We will use the pandas.concat function to make a single DataFrame from all our others:
filenames = glob.glob('*.csv')
datas = []
keys = []
for filename in filenames:
    keys.append(filename.split('_')[0])
    datas.append(pd.read_csv(filename))
df = pd.concat(datas, keys=keys, names=['location'])
We now have a "multi-index" DataFrame:
df
year | temperature | rainfall | mosquitos | ||
---|---|---|---|---|---|
location | |||||
A1 | 0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 | |
2 | 2003 | 86 | 154 | 153 | |
3 | 2004 | 87 | 159 | 158 | |
4 | 2005 | 74 | 292 | 243 | |
5 | 2006 | 75 | 283 | 237 | |
6 | 2007 | 80 | 214 | 190 | |
7 | 2008 | 85 | 197 | 181 | |
8 | 2009 | 74 | 231 | 200 | |
9 | 2010 | 74 | 207 | 184 | |
A3 | 0 | 1960 | 76 | 191 | 93 |
1 | 1961 | 73 | 249 | 109 | |
2 | 1962 | 81 | 112 | 73 | |
3 | 1963 | 78 | 113 | 72 | |
4 | 1964 | 81 | 159 | 89 | |
5 | 1965 | 87 | 222 | 109 | |
6 | 1966 | 72 | 103 | 68 | |
7 | 1967 | 77 | 176 | 92 | |
8 | 1968 | 89 | 236 | 114 | |
9 | 1969 | 88 | 283 | 128 | |
10 | 1970 | 89 | 151 | 89 | |
11 | 1971 | 71 | 121 | 72 | |
12 | 1972 | 88 | 267 | 124 | |
13 | 1973 | 85 | 211 | 102 | |
14 | 1974 | 75 | 101 | 67 | |
15 | 1975 | 72 | 173 | 85 | |
16 | 1976 | 74 | 254 | 117 | |
17 | 1977 | 85 | 155 | 90 | |
18 | 1978 | 89 | 158 | 92 | |
19 | 1979 | 86 | 117 | 75 | |
... | ... | ... | ... | ... | ... |
B1 | 21 | 1981 | 77 | 115 | 77 |
22 | 1982 | 71 | 267 | 106 | |
23 | 1983 | 89 | 164 | 99 | |
24 | 1984 | 75 | 175 | 90 | |
25 | 1985 | 70 | 210 | 95 | |
26 | 1986 | 72 | 214 | 99 | |
27 | 1987 | 79 | 123 | 81 | |
28 | 1988 | 74 | 195 | 96 | |
29 | 1989 | 84 | 129 | 90 | |
30 | 1990 | 80 | 276 | 111 | |
31 | 1991 | 73 | 147 | 82 | |
32 | 1992 | 73 | 140 | 80 | |
33 | 1993 | 78 | 285 | 115 | |
34 | 1994 | 71 | 280 | 107 | |
35 | 1995 | 80 | 180 | 95 | |
36 | 1996 | 90 | 280 | 119 | |
37 | 1997 | 86 | 166 | 100 | |
38 | 1998 | 82 | 285 | 117 | |
39 | 1999 | 75 | 215 | 101 | |
40 | 2000 | 86 | 212 | 101 | |
41 | 2001 | 72 | 138 | 83 | |
42 | 2002 | 84 | 130 | 85 | |
43 | 2003 | 74 | 166 | 87 | |
44 | 2004 | 87 | 280 | 121 | |
45 | 2005 | 85 | 130 | 88 | |
46 | 2006 | 83 | 102 | 77 | |
47 | 2007 | 85 | 163 | 94 | |
48 | 2008 | 83 | 190 | 99 | |
49 | 2009 | 79 | 151 | 88 | |
50 | 2010 | 88 | 295 | 122 |
214 rows × 4 columns
The outer index gives the area name ("location") the rows came from. This allows us to select rows on the basis of area:
df.loc[('A1', 1)]
year           2002
temperature      85
rainfall        252
mosquitos       217
Name: (A1, 1), dtype: int64
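Indexing with just the outer label gives every row for that location (a sketch; output omitted):

df.loc['A1']   # all ten rows for location A1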
More powerfully, this allows us to get aggregates over all areas:
df[df['year'] > 2005]['temperature'].mean()
81.040000000000006
Or get descriptive statistics directly as a function of location:
df.mean(level='location')
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
location | ||||
A1 | 2005.5 | 80.000000 | 214.600000 | 191.300000 |
A3 | 1985.0 | 80.333333 | 193.274510 | 97.647059 |
B2 | 1985.0 | 79.764706 | 196.823529 | 196.058824 |
A2 | 1985.0 | 80.392157 | 207.039216 | 185.235294 |
B1 | 1985.0 | 79.627451 | 204.588235 | 99.509804 |
Say we want the minimum rainfall at each location, but only for years with average temperatures between 75 and 90. The expression we'll build is best read backwards. First, let's filter our rows so we only get those in the temperature range we want:
df[(df['temperature'] > 75) & (df['temperature'] < 90)]
year | temperature | rainfall | mosquitos | ||
---|---|---|---|---|---|
location | |||||
A1 | 0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 | |
2 | 2003 | 86 | 154 | 153 | |
3 | 2004 | 87 | 159 | 158 | |
6 | 2007 | 80 | 214 | 190 | |
7 | 2008 | 85 | 197 | 181 | |
A3 | 0 | 1960 | 76 | 191 | 93 |
2 | 1962 | 81 | 112 | 73 | |
3 | 1963 | 78 | 113 | 72 | |
4 | 1964 | 81 | 159 | 89 | |
5 | 1965 | 87 | 222 | 109 | |
7 | 1967 | 77 | 176 | 92 | |
8 | 1968 | 89 | 236 | 114 | |
9 | 1969 | 88 | 283 | 128 | |
10 | 1970 | 89 | 151 | 89 | |
12 | 1972 | 88 | 267 | 124 | |
13 | 1973 | 85 | 211 | 102 | |
17 | 1977 | 85 | 155 | 90 | |
18 | 1978 | 89 | 158 | 92 | |
19 | 1979 | 86 | 117 | 75 | |
21 | 1981 | 80 | 153 | 87 | |
23 | 1983 | 80 | 145 | 82 | |
27 | 1987 | 83 | 261 | 117 | |
28 | 1988 | 89 | 280 | 127 | |
29 | 1989 | 81 | 271 | 121 | |
30 | 1990 | 77 | 188 | 94 | |
31 | 1991 | 80 | 218 | 105 | |
33 | 1993 | 83 | 211 | 105 | |
34 | 1994 | 77 | 224 | 108 | |
36 | 1996 | 86 | 247 | 115 | |
... | ... | ... | ... | ... | ... |
B1 | 5 | 1965 | 76 | 272 | 113 |
6 | 1966 | 77 | 140 | 87 | |
7 | 1967 | 87 | 207 | 106 | |
8 | 1968 | 86 | 266 | 115 | |
9 | 1969 | 79 | 159 | 92 | |
11 | 1971 | 79 | 252 | 108 | |
13 | 1973 | 82 | 281 | 116 | |
14 | 1974 | 80 | 295 | 120 | |
15 | 1975 | 86 | 166 | 99 | |
16 | 1976 | 87 | 298 | 123 | |
17 | 1977 | 79 | 184 | 95 | |
19 | 1979 | 86 | 157 | 95 | |
21 | 1981 | 77 | 115 | 77 | |
23 | 1983 | 89 | 164 | 99 | |
27 | 1987 | 79 | 123 | 81 | |
29 | 1989 | 84 | 129 | 90 | |
30 | 1990 | 80 | 276 | 111 | |
33 | 1993 | 78 | 285 | 115 | |
35 | 1995 | 80 | 180 | 95 | |
37 | 1997 | 86 | 166 | 100 | |
38 | 1998 | 82 | 285 | 117 | |
40 | 2000 | 86 | 212 | 101 | |
42 | 2002 | 84 | 130 | 85 | |
44 | 2004 | 87 | 280 | 121 | |
45 | 2005 | 85 | 130 | 88 | |
46 | 2006 | 83 | 102 | 77 | |
47 | 2007 | 85 | 163 | 94 | |
48 | 2008 | 83 | 190 | 99 | |
49 | 2009 | 79 | 151 | 88 | |
50 | 2010 | 88 | 295 | 122 |
151 rows × 4 columns
Then, we can grab the minimum rainfall for these:
df[(df['temperature'] > 75) & (df['temperature'] < 90)]['rainfall'].min()
100
But we wanted minimum rainfall for each location; we can tell Series.min to split the data by location:
df[(df['temperature'] > 75) & (df['temperature'] < 90)]['rainfall'].min(level='location')
location
A1    154
A3    112
B2    100
A2    111
B1    102
Name: rainfall, dtype: int64
We don't have time during this lesson to get deep into groupbys, but they are such a central part of pandas that we'd be remiss if we excluded them. A groupby allows us to split our rows according to the values in one or more columns. For example, we could ask for the minimum rainfall measured across all areas for each year:
df.groupby('year')['rainfall'].min()
year
1960    129
1961    135
1962    112
1963    113
1964    147
1965    151
1966    103
1967    106
1968    138
1969    147
1970    151
1971    121
1972    153
1973    100
1974    101
1975    134
1976    116
1977    155
1978    128
1979    117
1980    138
1981    115
1982    202
1983    130
1984    151
1985    152
1986    214
1987    111
1988    195
1989    129
1990    159
1991    147
1992    140
1993    211
1994    126
1995    180
1996    187
1997    123
1998    127
1999    123
2000    212
2001    112
2002    130
2003    132
2004    119
2005    130
2006    102
2007    163
2008    186
2009    151
2010    127
Name: rainfall, dtype: int64
This should be read as "group the data by year, then get the minimum rainfall for each group." If we want to do something more complex for each group of rows, we can use a for loop:
for name, group in df.groupby('year'):
    print(name, group['rainfall'].min())
1960 129
1961 135
1962 112
1963 113
1964 147
1965 151
1966 103
1967 106
1968 138
1969 147
1970 151
1971 121
1972 153
1973 100
1974 101
1975 134
1976 116
1977 155
1978 128
1979 117
1980 138
1981 115
1982 202
1983 130
1984 151
1985 152
1986 214
1987 111
1988 195
1989 129
1990 159
1991 147
1992 140
1993 211
1994 126
1995 180
1996 187
1997 123
1998 127
1999 123
2000 212
2001 112
2002 130
2003 132
2004 119
2005 130
2006 102
2007 163
2008 186
2009 151
2010 127
To learn more about groupbys, check out the pandas docs on the "split-apply-combine" approach to working with datasets: http://pandas.pydata.org/pandas-docs/stable/groupby.html
We've already used boolean operations to filter rows of DataFrames, but more broadly, we must be able to write code that makes decisions based on the inputs it is given.
We can use conditionals. To demonstrate how these work, we'll use absurd toy examples. Say we have some number:
num = 37
We can ask whether that number is greater than 100 with a conditional like this:
if num > 100:
    print('greater')
else:
    print('not greater')
not greater
We could add some more sophistication to the conditional by checking for equality:
if num > 100:
    print('greater')
elif num == 100:
    print('equal!')
else:
    print('not greater')
not greater
...and we could test it:
num = 100
if num > 100:
    print('greater')
elif num == 100:
    print('equal!')
else:
    print('not greater')
equal!
Conditionals can include boolean operators, such as and:
if (1 > 0) and (-1 > 0):
    print('both parts are true')
else:
    print('one part is not true')
one part is not true
A "truth table" for and
, given two boolean values s1
and s2
, each either True
or False
gives:
s1 | s2 | s1 and s2 |
---|---|---|
True | True | True |
True | False | False |
False | True | False |
False | False | False |
There's also or:
if (1 > 0) or (-1 > 0):
    print('at least one part is true')
else:
    print('no part is true')
at least one part is true
s1 | s2 | s1 or s2 |
---|---|---|
True | True | True |
True | False | True |
False | True | True |
False | False | False |
And not:
if (1 > 0) and not (-1 > 0):
    print('both parts are true')
else:
    print('one part is not true')
both parts are true
s1 | not s1 |
---|---|
True | False |
False | True |
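These operators aren't restricted to conditionals; they simply produce boolean values, as a quick sketch shows:

print((1 > 0) and (-1 > 0))   # False
print((1 > 0) or (-1 > 0))    # True
print(not (-1 > 0))           # True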
There are perhaps other ways to do this, but if we want our fahrenheit_to_celsius function to also handle lists of temperatures, we can use the built-in isinstance to check whether we are dealing with a list:
def fahrenheit_to_celsius(temp):
    """Convert temperature in fahrenheit to celsius.

    Parameters
    ----------
    temp : float or array_like
        Temperature(s) in fahrenheit.

    Returns
    -------
    float or array_like
        Temperature(s) in celsius.

    """
    if isinstance(temp, list):
        newtemps = []
        for value in temp:
            newtemps.append(fahrenheit_to_celsius(value))
    else:
        newtemps = (temp - 32) * 5/9
    return newtemps
This function takes advantage of recursion for the case of a list, calling itself on each value in the list and converting those values accordingly.
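A nice side effect of the recursive call is that even nested lists convert correctly; a sketch with made-up temperatures:

print(fahrenheit_to_celsius([32, 212, [122, 104]]))   # [0.0, 100.0, [50.0, 40.0]]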
Encountering errors is part of coding. If you are coding, you will hit errors. The important thing to remember is that errors that are loud are the best kind, because they usually give hints as to what the problem is. In Python, errors are known as exceptions, and there are a few common varieties. Familiarity with these varieties helps to more quickly identify the root of the problem, which means we can more quickly fix it.
We want to run a function inside a module called errors_01.py; it's inside our scripts directory:
%ls scripts/
argv_list.py errors_01.py errors_02.py plot_rand_mp.py*
Let's try to import it:
import errors_01
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-181-56eb787402eb> in <module>()
----> 1 import errors_01

ImportError: No module named 'errors_01'
We got an ImportError. This exception tells us that Python cannot find a module we are trying to import. To troubleshoot why, we should first look at sys.path:
import sys
sys.path
['', '/home/alter/.local/lib/python3.5/site-packages/scandir-1.1-py3.5-linux-x86_64.egg', '/home/alter/Library/datreant/datreant.core/src', '/home/alter/Library/datreant/datreant.data/src', '/home/alter/Library/asciitree', '/usr/lib/python35.zip', '/usr/lib/python3.5', '/usr/lib/python3.5/plat-linux', '/usr/lib/python3.5/lib-dynload', '/home/alter/.local/lib/python3.5/site-packages', '/usr/lib/python3.5/site-packages', '/usr/lib/python3.5/site-packages/gtk-2.0', '/home/alter/.local/lib/python3.5/site-packages/IPython/extensions', '/home/alter/.ipython']
The Python interpreter looks at each location in this list, in order, when we try to import a module by name. Since our scripts directory is not in this list, errors_01 is not found. We can add the directory to the list, though, to make errors_01 findable:
sys.path.append('scripts/')
sys.path
['', '/home/alter/.local/lib/python3.5/site-packages/scandir-1.1-py3.5-linux-x86_64.egg', '/home/alter/Library/datreant/datreant.core/src', '/home/alter/Library/datreant/datreant.data/src', '/home/alter/Library/asciitree', '/usr/lib/python35.zip', '/usr/lib/python3.5', '/usr/lib/python3.5/plat-linux', '/usr/lib/python3.5/lib-dynload', '/home/alter/.local/lib/python3.5/site-packages', '/usr/lib/python3.5/site-packages', '/usr/lib/python3.5/site-packages/gtk-2.0', '/home/alter/.local/lib/python3.5/site-packages/IPython/extensions', '/home/alter/.ipython', 'scripts/']
Note that this change will only be present in this session; it is not permanent. This is a perfectly fine method, though, for using a module you've built yourself. Let's import errors_01 now:
import errors_01
File "scripts/errors_01.py", line 7 print ice_creams[3] ^ SyntaxError: Missing parentheses in call to 'print'
We get a different exception now: a SyntaxError. This is the Python interpreter's way of telling us that we are using poor grammar, and it doesn't understand what we are telling it. In this case, it tells us exactly what's wrong:
SyntaxError: Missing parentheses in call to 'print'
If we make the suggested fix in, say, errors_02.py (we copied the file here for demo purposes, but normally we would fix the original and version control it with git):
%cat scripts/errors_02.py
def favorite_ice_cream():
    ice_creams = [
        "chocolate",
        "vanilla",
        "strawberry"
    ]
    print(ice_creams[3])
import errors_02
This one imports just fine, since importing only defines the function; the buggy line won't run until the function is called. Let's run errors_02.favorite_ice_cream:
errors_02.favorite_ice_cream()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-188-8c34ad3792fd> in <module>()
----> 1 errors_02.favorite_ice_cream()

/home/alter/Documents/SWC/Events/2016_UCSF/swc_ucsf_intermediate_data/python/scripts/errors_02.py in favorite_ice_cream()
      5         "strawberry"
      6     ]
----> 7     print(ice_creams[3])

IndexError: list index out of range
This time we get an IndexError. Let's dissect this. The first line shows where the error originates in our code; in this case it's the only line in the cell. The next set of lines shows where the error originates at the source: that is, inside the definition of errors_02.favorite_ice_cream, on line 7 of the module errors_02. The last line gives us a hint as to the nature of the IndexError: we are asking for an element of the list that doesn't exist (the list holds three flavors, at indices 0 through 2, so index 3 is out of range).
There are other types of exceptions, but these are some of the most common. You need not know all of them, but some knowledge of the basic varieties is valuable for understanding what's going wrong, and then how to fix it!
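For reference, here are a few of the varieties you're most likely to meet, each triggered deliberately (run one line at a time, since each raises immediately; the messages in the comments are paraphrased):

int('mosquito')    # ValueError: the string isn't a valid integer
'2' + 2            # TypeError: can't mix str and int like this
mosquitos          # NameError: this name was never defined
{}['rainfall']     # KeyError: no such key in the dictionary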
In addition to being useful tools for debugging, exceptions are catchable. That is, you can write code in such a way that it is able to recover from failure or exceptional cases. As an example, let's rewrite our Fahrenheit to Celsius converter one last time:
def fahrenheit_to_celsius(temp):
    """Convert temperature in fahrenheit to celsius.

    Parameters
    ----------
    temp : float or array_like
        Temperature(s) in fahrenheit.

    Returns
    -------
    float or array_like
        Temperatures in celsius.

    """
    try:
        newtemps = (temp - 32) * 5/9
    except TypeError:
        newtemps = []
        for value in temp:
            newtemps.append(fahrenheit_to_celsius(value))

    return newtemps
This function can take lists just like the one with the if-else and isinstance check, but it turns the problem on its head. It uses a try-except block instead, first trying to use temp as a single number or array, and when that fails with a TypeError (as it would if temp is a list), it then handles temp as if it were a list.
This is known as "ask forgiveness" style: we do the most common thing, and if that fails, we then do something else. This is in contrast to "ask permission" style, as exhibited in the if-else implementation: we first checked if we had a funny case, and if not, did the common thing.
Which you choose is often a matter of style, though there are performance benefits to "ask forgiveness" when the odd cases truly are rare; when the cases a function must handle are all roughly equally common, it is better to use if-else statements to direct the flow.
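A quick check that this version behaves the same way on both kinds of input:

fahrenheit_to_celsius(50)
10.0

fahrenheit_to_celsius([32, 50, 212])
[0.0, 10.0, 100.0]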
Say we write a function that does something a bit obtuse (heh): Given the coordinates of the lower-left and upper-right corners of a rectangle, it normalizes the rectangle so that its lower-left coordinate is at the origin (0, 0), and its longest side is 1.0 units long:
def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin,
    and 1.0 units long on its longest axis.

    :Arguments:
        *rect*
            a tuple of size four, giving (x0, y0, x1, y1), the (x,y)
            positions of the lower-left and upper-right corner

    :Returns:
        *newrect*
            a tuple of size four, giving the new coordinates of our rectangle's
            vertices

    """
    x0, y0, x1, y1 = rect

    assert x0 < x1, 'Invalid x coordinates'
    assert y0 < y1, 'Invalid y coordinates'

    dx = x1 - x0
    dy = y1 - y0

    if dx > dy:
        scaled = float(dx) / dy
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'

    return (0, 0, upper_x, upper_y)
Notice that we added a nice documentation string at the top that explicitly states the function's inputs (arguments) and outputs (returns), so when we use it we don't need to mind its details anymore. Also, at the beginning of the function we have assert statements that check the validity of the inputs, while at the end of the function we have assert statements that check the validity of the outputs.
These statements codify what our documentation says about what the inputs should mean and what the output should look like, guarding against cases where the function is given inputs it wasn't designed to handle.
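The general form is assert condition, message: if the condition is falsy, Python raises an AssertionError carrying the message; if it holds, nothing happens at all. A minimal illustration (made up for this lesson):

assert 1 + 1 == 2, 'arithmetic is broken'   # passes silently
assert 1 + 1 == 3, 'arithmetic is broken'   # raises AssertionError: arithmetic is broken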
Let's test the function!
normalize_rectangle((3, 4, 4, 6))
(0, 0, 0.5, 1.0)
Looks good. First point is at the origin; longest side is 1.0.
normalize_rectangle((0.0, 1.0, 2.0, 5.0))
(0, 0, 0.5, 1.0)
Nice!
normalize_rectangle((0.0, 0.0, 5.0, 1.0))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-193-a41732464e3f> in <module>()
----> 1 normalize_rectangle((0.0, 0.0, 5.0, 1.0))

<ipython-input-190-a7edcf11432e> in normalize_rectangle(rect)
     30 
     31     assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
---> 32     assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'
     33 
     34     return (0, 0, upper_x, upper_y)

AssertionError: Calculated upper Y coordinate invalid
Oh...what happened here? The inputs should be fine, but an assertion caught something strange in the output. Looking closely at the code, we spot a mistake in line 26: in the dx > dy branch we compute scaled = float(dx) / dy, which is greater than 1 for a wide rectangle, when we really want float(dy) / dx:
def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin,
    and 1.0 units long on its longest axis.

    :Arguments:
        *rect*
            a tuple of size four, giving (x0, y0, x1, y1), the (x,y)
            positions of the lower-left and upper-right corner

    :Returns:
        *newrect*
            a tuple of size four, giving the new coordinates of our rectangle's
            vertices

    """
    x0, y0, x1, y1 = rect

    assert x0 < x1, 'Invalid x coordinates'
    assert y0 < y1, 'Invalid y coordinates'

    dx = x1 - x0
    dy = y1 - y0

    if dx > dy:
        scaled = float(dy) / dx
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'

    return (0, 0, upper_x, upper_y)
And now we get:
normalize_rectangle((0.0, 0.0, 5.0, 1.0))
(0, 0, 1.0, 0.2)
normalize_rectangle((2, 3, 4, 6))
(0, 0, 0.6666666666666666, 1.0)
That looks better.
What if we were to try feeding in only three values instead of four?
normalize_rectangle((2, 4, 5))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-197-c5d8e9954719> in <module>()
----> 1 normalize_rectangle((2, 4, 5))

<ipython-input-194-d4ce9e9a4cfa> in normalize_rectangle(rect)
     14 
     15     """
---> 16     x0, y0, x1, y1 = rect
     17 
     18     assert x0 < x1, 'Invalid x coordinates'

ValueError: not enough values to unpack (expected 4, got 3)
That's a rather unhelpful error message. We can also add an assertion that checks that the input is long enough, and gives a message indicating what the problem is:
def normalize_rectangle(rect):
    """Normalizes a rectangle so that it is at the origin,
    and 1.0 units long on its longest axis.

    :Arguments:
        *rect*
            a tuple of size four, giving (x0, y0, x1, y1), the (x,y)
            positions of the lower-left and upper-right corner

    :Returns:
        *newrect*
            a tuple of size four, giving the new coordinates of our rectangle's
            vertices

    """
    assert len(rect) == 4, "Rectangle must have 4 elements"

    x0, y0, x1, y1 = rect

    assert x0 < x1, 'Invalid x coordinates'
    assert y0 < y1, 'Invalid y coordinates'

    dx = x1 - x0
    dy = y1 - y0

    if dx > dy:
        scaled = float(dy) / dx
        upper_x, upper_y = 1.0, scaled
    else:
        scaled = float(dx) / dy
        upper_x, upper_y = scaled, 1.0

    assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid'
    assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid'

    return (0, 0, upper_x, upper_y)
normalize_rectangle((2, 4, 5))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-199-c5d8e9954719> in <module>()
----> 1 normalize_rectangle((2, 4, 5))

<ipython-input-198-e9732fa1861f> in normalize_rectangle(rect)
     14 
     15     """
---> 16     assert len(rect) == 4, "Rectangle must have 4 elements"
     17 
     18     x0, y0, x1, y1 = rect

AssertionError: Rectangle must have 4 elements
Now when we use this function six months later, we won't have to scratch our heads too long when this sort of thing happens! Defensive programming in this way makes code more robust, easier to debug, and easier to maintain. This is especially important for scientific code, since incorrect results can occur silently and end up getting published, perhaps only getting uncovered years later. Much embarrassment and misinformation can be avoided by being skeptical of our code, using tools such as assert statements to make sure it keeps doing what we think it should.
Running Python code from a notebook is great, but we often need to run Python code from the command line, perhaps on files as input data. We can turn our module datastuff.py into a simple script by adding some meat at the bottom:
if __name__ == '__main__':
    import pandas as pd
    import sys

    for dataset in sys.argv[1:]:
        print(dataset)
        data = pd.read_csv(dataset)
        data['temperature'] = fahrenheit_to_celsius(
                data['temperature'])

        fig = analyze_mosquitos(data)
        fig.savefig(dataset + "_plot.pdf")
The if __name__ == '__main__' block at the bottom says that this code should only be executed when the module is run as a script. It works because Python sets a module's __name__ to '__main__' only when the file is executed directly; when the module is imported instead, __name__ is set to the module's own name.
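You can see this for yourself with a one-line module; the file name here is made up for illustration:

# whoami.py (hypothetical)
print(__name__)

Running python whoami.py prints __main__, while import whoami from the interpreter prints whoami instead.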
Adding this allows us to run the script on any number of input files as:
python datastuff.py A1_mosquito_data.csv A2_mosquito_data.csv
or even:
python datastuff.py *.csv
It will loop through each filename given as an argument, convert the temperature column to Celsius, then generate and save a PDF of the analysis figure for that dataset.
sys.argv is a list having the name of the script as its zeroth element, then each argument to the script in order from first to last. We take sys.argv[1:] to slice out only those arguments so we can iterate through them. This is a quick-and-dirty way to parse arguments, but if you're interested in writing more sophisticated command-line utilities, definitely use argparse.
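As a taste of what that looks like, here is a minimal sketch of the same command-line interface written with argparse; the description string and variable names are ours, not part of the lesson's script:

import argparse

# build a parser that accepts one or more CSV file paths
parser = argparse.ArgumentParser(description='Analyze mosquito datasets.')
parser.add_argument('datasets', nargs='+', help='CSV files to analyze')
args = parser.parse_args()

for dataset in args.datasets:
    print(dataset)

Among other things, argparse gives you a usage message for free: running python datastuff.py --help prints the arguments and their descriptions.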