#!/usr/bin/env python # coding: utf-8 # # Intermediate Scientific Python # # This lesson assumes some programming background, but not necessarily with Python. It moves at a faster pace than a novice lesson, however. # # We will learn how to do the following: # # 1. Reading and manipulating tabular data # 2. Visualization and plotting # 3. Modularization and documentation # 4. Defensive programming # 5. Writing command line scripts with Python # # This lesson is based heavily on the [Software Carpentry](http://software-carpentry.org/) [``python-intermediate-mosquitoes``](https://github.com/swcarpentry/python-intermediate-mosquitoes) lesson, and uses its datasets. It also includes some mixed-in components from the [``python-novice-inflammation``](https://github.com/swcarpentry/python-novice-inflammation) lesson. # ## Getting started # We are interested in understanding the relationship between the weather and the number of mosquitos occuring in a particular year so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem comes from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like: # In[1]: get_ipython().run_line_magic('cat', 'A1_mosquito_data.csv | head -n 5') # And we have five files to work with: # In[2]: get_ipython().run_line_magic('ls', '*.csv') # **Note**: commands preceded with a `%` are known as "magics". These are specfic to the notebook environment. They are *not* Python commands. # Since this is tabular data, our tool of choice is [``pandas``](http://pandas.pydata.org/), a library that provides special data structures for doing fast numerical operations on tabular data. Internally, ``pandas`` uses [``numpy``](http://www.numpy.org/) to do the heavy lifting. # In[3]: import pandas as pd # Importing a library in this way allows us to use the components defined inside of it. First, we'll read in a single dataset using the ``pandas.read_csv`` function: # In[4]: pd.read_csv('A1_mosquito_data.csv') # This reads the CSV from disk and deserializes it into a [``pandas.DataFrame``](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe). But unless we attach a name to the `DataFrame` the function returns, we cannot keep working with the object in memory. # In[5]: data = pd.read_csv('A1_mosquito_data.csv') # In[6]: data # Now we can refer to the `DataFrame` directly, without having to read it from disk every time. # A bit about names. If I attach the name ``weight_kg`` to the value ``55``: # In[7]: weight_kg = 55 # I can then use this value to calculate the same weight in ``weight_lb``: # In[8]: weight_lb = 2.2 * weight_kg # In[9]: weight_lb # If I change my `weight_kg` to 66, what is the current value of `weight_lb`? # In[10]: weight_kg = 66 # In[11]: weight_lb # There's no change. This is because the name `weight_lb` is attached to a value, in this case a floating point number. The value has no concept of how it came to be. # In[12]: weight_lb = 2.2 * weight_kg # In[13]: weight_lb # Back to our `DataFrame`. A `DataFrame` is an **object**. We can see what type of object it is with the Python builtin, ``type``: # In[14]: type(data) # It turns out that in Python, **everything** is an object. We'll see what this means as we go, but the most important aspect of this is that in Python we have **names**, and we assign these to **objects**. Any name can point to any object, and more than one name can point to a single object. # Anyway, a ``DataFrame`` allows us to get at individual components of our tabular data. We can get single columns like: # In[15]: data['year'] # Or multiple columns with: # In[16]: data[['rainfall', 'temperature']] # Slicing can be used to get back subsets of rows: # In[17]: data[0:2] # Python indices are 0-based, meaning counting goes as 0, 1, 2, 3...; this means that the first row is row 0, the second row is row 1, etc. It's best to refer to row 0 as the "zeroth row" to avoid confusion. # # This slice should be read as "get the 0th element up to and not including the 2nd element". The "not including" is important, and the cause of much initial frustration. It does take some getting used to. # What if we want a single row? # In[18]: data[1] # For a DataFrame, this is ambiguous, since a single value is interpreted as a column name. We can only get at rows by slicing at the top level: # In[19]: data[1:2] # Or we could use `.iloc`: # In[20]: data.iloc[1] # Getting a single row in this way returns a `Series`: # In[21]: type(data.iloc[1]) # A `Series` is a 1-D column of values, having all the same datatype. Since each of the datatypes of our columns were integers, we got a `Series` with dtype `int64` this time. If we had columns with, e.g. strings, then we'd get back dtype `object`, which is a catchall for ``pandas``. # We can also get the data in our ``Series`` as a raw ``numpy`` array: # In[22]: type(data.iloc[1].values) # Pandas is a relatively young library, but it's built on top of the venerable ``numpy`` array, which makes it possible to do fast numerical work in Python. A `Series` is basically a 1-D ``numpy`` array with the ability to select by labeled indices: # ## Subsetting data # More usefully than simple slicing, we can use boolean indexing to subselect our data. Say we want only data for years beyond 2005? # In[23]: data[data['year'] > 2005] # There's no magic here; we get a boolean index directly from a comparison: # In[24]: gt_2005 = data['year'] > 2005 gt_2005 # And using this `Series` of bools will then give only the rows for which the `Series` had `True`: # In[25]: data[gt_2005] # This is the same behavior as ``numpy`` arrays: using most binary operators, such as ``+``, ``*``, ``>``, ``&``, work element-wise. With a single value on one side (such as ``2005``), we get the result of the operation for each element. # A ``DataFrame`` is an *object*, and objects have **methods**. These are functions that are *part of* the object itself, often doing operations on the object's data. One of these is ``DataFrame.mean``: # In[26]: data.mean() # We get back the mean value of each column as a single ``Series``. There's more like this: # In[27]: data.max() # There's also ``DataFrame.describe``, which gives common descriptive statistics of the whole `DataFrame`: # In[28]: data.describe() # This is, itself, a ``DataFrame``: # In[29]: data.describe()['temperature'] # Documentation is a key part of Python's design. In the notebook, you can get a quick look at the docs for a given Python function or method with: # In[30]: get_ipython().run_line_magic('pinfo', 'data.describe') # Or more generally (built-in Python behavior): # In[31]: help(data.describe) # -------------- # ### Challenge: obtain the standard deviation of mosquito count for years in which the rainfall was greater than 200: # One way we could do this is to first grab the ``"mosquitos"`` column, then use a fancy index obtained from comparing the ``"rainfall"`` column to ``200``. We can then call the ``std`` method of the resulting ``Series``: # In[32]: data['mosquitos'][data['rainfall'] > 200].std() # Note that this is a key part of the power of ``pandas`` objects: operations for subsetting and calculating descriptive statistics can often be stacked to great effect. # # ----------------------- # What if we know our temperatures are in fahrenheit, but we want them in celsius? We can convert them. # In[33]: (data['temperature'] - 32) * 5 / 9 # This gives us back a new ``Series`` object. If we want to change the values in our existing ``DataFrame`` to be these, we can just set the column to this new ``Series``: # In[34]: data['temperature'] = (data['temperature'] - 32) * 5 / 9 # In[35]: data['temperature'] # Similarly, it's also possible to add new columns. We could add one giving us, e.g. the ratio of rainfall to mosquitos: # In[36]: data['rainfall / mosquitos'] = data['rainfall'] / data['mosquitos'] # In[37]: data # It's probably a bit silly to have such a column. So, let's get rid of it: # In[38]: del data['rainfall / mosquitos'] # In[39]: data # Don't like the order of your columns? We can make a ``DataFrame`` with a different column order by selecting them out in the order we want: # In[40]: data[['rainfall', 'year', 'mosquitos', 'temperature']] # We can even have duplicates: # In[41]: data[['rainfall', 'year', 'mosquitos', 'temperature', 'year']] # Remember: this returns a copy. It does not change our existing ``DataFrame``: # In[42]: data # ## Plotting data from a DataFrame # Just as ``pandas`` is the de-facto library for working with tabular data, ``matplotlib`` is the de-facto library for producing high-quality plots from numerical data. A saying often goes that matplotlib makes easy things easy and hard things possible when it comes to making plots. # In[43]: import matplotlib.pyplot as plt # We need to tell the notebook to render plots in the notebook itself: # In[44]: get_ipython().run_line_magic('matplotlib', 'inline') # Before we use ``matplotlib`` explicitly, ``pandas`` objects feature convenience methods for common plotting operations. These use ``matplotlib`` internally, so everything we learn later about ``matplotlib`` objects applies to these. For example, we can plot both ``"temperature"`` and ``"mosquitos"`` as a function of ``"year"``: # In[45]: data.plot(x='year', y=['temperature', 'mosquitos']) # Let's load a larger dataset: # In[46]: data = pd.read_csv('A2_mosquito_data.csv') # This dataset has 51 rows, instead of just 10: # In[47]: len(data) # In[48]: data.plot(x='year', y=['temperature', 'mosquitos']) # There are [other convenience methods](http://pandas.pydata.org/pandas-docs/stable/visualization.html) for different ways of plotting the data. For example, we can get a kernel-density estimate: # In[49]: data['mosquitos'].plot.kde() # These convenience methods are great, but for more complex plots we can, and should, use ``matplotlib`` directly. We can make a multi-paneled figure giving both "temperature" and "rainfall" for each "year." # In[50]: fig = plt.figure(figsize=(4, 6)) ax1 = fig.add_subplot(2, 1, 1) ax1.plot(data['year'], data['temperature'], 'ro-') ax1.set_xlabel('Year') ax1.set_ylabel('Temperature') ax2 = fig.add_subplot(2, 1, 2) ax2.plot(data['year'], data['rainfall'], 'bs-') ax2.set_xlabel('Year') ax2.set_ylabel('Rainfall') # ---------- # ### Challenge: plot the relationship between the number of mosquitos and temperature and the number of mosquitos and rainfall. # We can do this in a similar way as we did above. # In[51]: fig = plt.figure(figsize=(4, 6)) ax1 = fig.add_subplot(2, 1, 1) ax1.plot(data['temperature'], data['mosquitos'], 'ro') ax1.set_xlabel('Temperature') ax1.set_ylabel('Mosquitos') ax2 = fig.add_subplot(2, 1, 2) ax2.plot(data['rainfall'], data['mosquitos'], 'bs') ax2.set_xlabel('Rainfall') ax2.set_ylabel('Mosquitos') # Note that the *linestyle* code on the right can be used to turn off interpolation lines, since we want to just plot the points, and don't care about their order. # # ------------- # From this one dataset we see what looks like a linear relationship between mosquitos and rainfall, but not really any relationship between mosquitos and temperature. We'd like to quantify this further by applying a statistical model to the data, but we also want to do this same treatment to **all** our datasets. We need to learn a few more things, including **loops**, before continuing further. # ## Repetition with loops # First, an aside with strings. Most programming languages generally deal in a few basic data types, including integers, floating-point numbers, and characters. A string is a sequence of characters, and we can manipulate them easily with python. For example: # In[52]: word = 'bird' # We have a string ``'lead'`` that we've assigned the name ``word``. We can examine the string's type, and indeed any object in python, with the ``type`` builtin. # In[54]: type(word) # We can also access its characters (it's a sequence!) with indexing: # In[55]: word[0] # In[56]: word[1] # And slicing: # In[57]: word[1:] # If I want to output each letter of ``word`` on a new line, I could do: # In[59]: print(word[0]) print(word[1]) print(word[2]) print(word[3]) # But this doesn't scale to words of arbitrary length. Instead, let's let the computer do the boring work of iterating with a ``for`` loop: # In[44]: for letter in word: print letter # This same bit of code will work for *any* ``word``, including ridiculously long ones. Like this gem: # In[61]: word = 'supercallifragilisticexpialidocious' print(word) # In[62]: for letter in word: print(letter) # We can do more with loops than iterate through the elements of a sequence. A common use-case is building a sum. For example, getting the number of letters in ``word``: # In[63]: counter = 0 for letter in word: counter = counter + 1 print(counter) # For this very particular use, there's a python built-in, which is guaranteed to be more efficient. Use it instead: # In[64]: len(word) # It works on any sequence! For ``pandas.DataFrame``s it gives the length of the number of rows: # In[66]: len(data) # # Lists: ordered collections of other objects (even other lists) # Loops work best when they have something to iterate over. In python, the ``list`` is a built-in data structure that serves as a container for a sequence of other objects. We can make an empty list with: # In[100]: stuff = list() # In[101]: print(stuff) # And we can append things to the end of the list, in this case a bunch of random strings: # In[102]: stuff.append('marklar') # In[103]: stuff.append('chewbacca') # In[104]: stuff.append('chickenfingers') # In[105]: stuff # Just like with the ``word`` example, we can iterate through the elements of the list: # In[106]: for item in stuff: print(item) # And slicing works too! Slicing with negative numbers counts backwards from the end of the list. The slice ``stuff[-2:]`` could be read as "get the 2nd-to-last element of ``stuff``, through the last element": # In[107]: for item in stuff[-2:]: print(item) # So indexing with ``-1`` gives the last element: # In[108]: stuff[-1] # And consistently, but uselessly, slicing from and to the same index yeilds nothing. Remember, for a slice ``n:m``, it reads as "get all elements starting with element ``n`` and up to but not including element ``m``." # In[109]: stuff[0:0] # Lists are mutable: that is, their elements can be changed. For example, we can replace ``'chewbacca'`` with ``'hansolo'``: # In[110]: stuff[1] # In[111]: stuff[1] = 'hansolo' # In[112]: stuff[1] # ...or replace ``'hansolo'`` with a ``DataFrame``: # In[113]: stuff[1] = data # In[114]: stuff[1] # In[115]: print(stuff) # We also add a list as an element to itself: # In[116]: stuff.append(stuff) # So now ``stuff[-1]`` points to the same list the name ``stuff`` does: # In[117]: stuff[-1] is stuff # It's important to recognize that these are exactly the same list. Modifications to ``stuff[-1]`` will be seen in ``stuff``. These are just names, and in this case they refer to the same object in memory. # # Lists can contain lists can contain lists can contain lists...and of course any number of other python objects. Even functions can be included as elements: # In[118]: stuff.append(len) # In[119]: stuff # Lists also have their own methods, which usually alter the list itself: # In[120]: stuff.reverse() # In[121]: stuff # ``list.pop`` removes the element at the given index: # In[122]: stuff.pop(-2) # ...while ``list.remove`` removes the first instance of the given element in the list: # In[123]: stuff.remove('chickenfingers') # So now our list looks like: # In[124]: stuff # ## Plot all the data # Now that we know how to do loops and use lists in Python, we'll use these to make a plot for each dataset. First, we'll import a module called ``glob``: # In[125]: import glob # We can use this to grab all the filenames matching a "globbing" pattern: # In[126]: glob.glob("*.csv") # "Globbing" is the name of the shell-completion patterns common for shells like ``sh`` and ``bash``. # ### Challenge: write a for-loop that produces a plot of mosquito population vs. temperature and mosquito population vs. rainfall for each dataset we have. # We can grab the block of code we wrote previously, and loop through it for each file obtained by globbing: # In[127]: import glob files = glob.glob('*.csv') for file in files: print(file) data = pd.read_csv(file) fig = plt.figure(figsize=(4, 6)) ax1 = fig.add_subplot(2, 1, 1) ax1.plot(data['temperature'], data['mosquitos'], 'ro') ax1.set_xlabel('Temperature') ax1.set_ylabel('Mosquitos') ax2 = fig.add_subplot(2, 1, 2) ax2.plot(data['rainfall'], data['mosquitos'], 'bs') ax2.set_xlabel('Rainfall') ax2.set_ylabel('Mosquitos') # tell matplotlib to render the plot *now* plt.show() # It looks like there's a similar pattern for each area we have data for. We'd like to quantify more closely the relationship between mosquito population and each of these variables. # ## Applying a statistical model to the dataset # There are many statistics packages in Python, but one of the most well-developed and widely-used is [``statsmodels``](http://statsmodels.sourceforge.net/): # In[128]: import statsmodels.api as sm # We'll use our second dataset for the initial stab at fitting a statistical model to our data: # In[129]: data = pd.read_csv('A2_mosquito_data.csv') # And let's convert the temperatures to Celsius, for good measure: # In[130]: data['temperature'] = (data['temperature'] - 32) * 5 / 9 # We can build an ordinary least-squares fit to the data using ``statsmodels.OLS``. Remember we can check the docs with: # In[131]: get_ipython().run_line_magic('pinfo', 'sm.OLS') # We want to fit the relationship between mosquito population and, for a start, rainfall: # In[132]: regr_results = sm.OLS(data['mosquitos'], data['rainfall']).fit() # In[133]: regr_results.summary() # So, this gives us a lot of information. But is this what we want? # # We actually want to quantify which variable correlates more with the mosquito population. One way we can build a multivariate model with statsmodels is to use a **formula**: # In[134]: regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit() regr_results.summary() # Each of these results can be accessed as **attributes** of the ``Results`` object: # In[135]: print(regr_results.params) # In[136]: print(regr_results.rsquared) # Our model "explains" most of the data. Nice! Let's plot the predicted population of mosquitos from our fitted model against the measured population to see how well we did: # In[137]: import matplotlib.pyplot as plt figure = plt.figure() ax = figure.add_subplot(1,1,1) parameters = regr_results.params predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall'] ax.plot(predicted, data['mosquitos'], 'ro') ax.set_xlabel('predicted mosquito population') ax.set_ylabel('measured mosquito population') # The more linear this plot is, the better our model fits the data. Also, it turns out that, as you might have guessed from the plots we made previously, rainfall is the better predictor than average temperature for how many mosquitoes we expect to see. # In[138]: regr_results.tvalues # ---------------- # ### Challenge: for each dataset, print the t-value for temperature and rainfall, and plot the statsmodel results along with plots for the relationships between mosquitos and the two variables separately. # We can take our block of code we worked on previously, and make a three-panel plot instead of just a two-panel one: # In[139]: filenames = glob.glob('*.csv') for filename in filenames: print(filename) data = pd.read_csv(filename) # convert temperatures to celsius from fahrenheit data['temperature'] = (data['temperature'] - 32) * 5/9 # perform fit regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit() print(regr_results.tvalues) fig = plt.figure(figsize=(6, 9)) # plot predicted vs. measured mosquito populations from fitted model parameters = regr_results.params predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall'] ax0 = fig.add_subplot(3, 1, 1) ax0.plot(predicted, data['mosquitos'], 'gd') ax0.set_xlabel('predicted mosquito population') ax0.set_ylabel('measured mosquito population') # plot population vs. temperature ax1 = fig.add_subplot(3, 1, 2) ax1.plot(data['temperature'], data['mosquitos'], 'ro') ax1.set_xlabel('Temperature') ax1.set_ylabel('Mosquitos') # plot population vs. rainfall ax2 = fig.add_subplot(3, 1, 3) ax2.plot(data['rainfall'], data['mosquitos'], 'bs') ax2.set_xlabel('Rainfall') ax2.set_ylabel('Mosquitos') plt.show() # For all areas, it looks as if rainfall is indeed the stronger predictor of mosquito population than temperature. But what if this is just the beginning of the number of datasets we have to analyze? Are we going to just copy this block of code around each time we want to analyze new data? # # No. That would be awful. # ## Writing functions: we need to break things up # We have code blocks we've been copying and pasting around a lot, and this is fine for a while. But once we've built some useful things, we want to re-use them, and we don't want to maintain many versions of the same thing lying around. # In[140]: def square(x): x_squared = x ** 2 return x_squared # In[141]: square(3) # In[142]: square(4) # Functions are also **composable**: # In[143]: square(square(3)) # We can improve on this a bit by adding **documentation**. In Python, we can write a **docstring** for our function directly in its definition: # In[144]: def square(x): """Return the square of the value, `x`. Parameters ---------- x : float Number to obtain the square of. Returns ------- square : float The square of `x`. """ x_squared = x ** 2 return x_squared # And this docstring is now what we see when we use the builtin help system: # In[145]: help(square) # Or alternatively, in the notebook: # In[146]: get_ipython().run_line_magic('pinfo', 'square') # --------------- # ### Challenge: write your fahrenheit to celsius conversion as a function that takes temperatures in fahrenheit and converts them to temperatures in celsius # This is one way we could write it: # In[147]: def fahrenheit_to_celsius(temp): """Convert temperature in fahrenheit to celsius. Parameters ---------- temp : float or array_like Temperature(s) in fahrenheit. Returns ------- float or array_like Temperature(s) in celsius. """ return (temp - 32) * 5 / 9 # Note that this same function will work for ``pandas.Series`` objects, ``numpy`` arrays, and single numbers. Writing functions in this way takes advantage of the fact that these objects are roughly [**duck-typed**](https://en.wikipedia.org/wiki/Duck_typing): they all behave as if they were single numbers, so doing arithmetic operations with them yields the expected result. # # ------ # ---------------- # ### Challenge: write a function that prints the t-value for temperature and rainfall, and plots the statsmodel results along with plots for the relationships between mosquitos and the two variables separately. # Let's do our model building and plotting in a single function, returning a useful panel plot: # In[148]: def analyze(data): """Return panel plot of mosquito population vs. temperature, rainfall. Also prints t-values for temperature and rainfall. Panel plot gives: 1. comparison of modeled values of mosquito population vs. actual values 2. mosquito population vs. average temperature 3. mosquito population vs. total rainfall Parameters ---------- data : DataFrame DataFrame giving columns for average temperature, total rainfall, and mosquito population during mosquito breeding season for each year. Returns ------- Figure :mod:`matplotlib.figure.Figure` object giving panel plot. """ # perform fit regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit() print(regr_results.tvalues) fig = plt.figure(figsize=(6, 9)) # plot prediction from fitted model against measured mosquito population parameters = regr_results.params predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall'] ax0 = fig.add_subplot(3, 1, 1) ax0.plot(predicted, data['mosquitos'], 'go') ax0.set_xlabel('predicted mosquito population') ax0.set_ylabel('measured mosquito population') # plot population vs. temperature ax1 = fig.add_subplot(3, 1, 2) ax1.plot(data['temperature'], data['mosquitos'], 'ro') ax1.set_xlabel('Temperature') ax1.set_ylabel('Mosquitos') # plot population vs. rainfall ax2 = fig.add_subplot(3, 1, 3) ax2.plot(data['rainfall'], data['mosquitos'], 'bs') ax2.set_xlabel('Rainfall') ax2.set_ylabel('Mosquitos') return fig # Note that since this is Python, and everything is an object, we can return our ``Figure`` object. This is generally a useful thing to do for plotting functions, since in different contexts you may want to modify the resulting ``Figure`` before rendering any plots of it. # # -------------- # We can now re-write our big loop from before, and it should look **a lot** simpler. # In[149]: filenames = glob.glob('*.csv') for filename in filenames: print(filename) data = pd.read_csv(filename) # convert temperatures to celsius data['temperature'] = fahrenheit_to_celsius(data['temperature']) # get t-values; return plots analyze(data) plt.show() # Our for loop went from being long and complicated to simple and easy-to-understand. This is part of the power of functions. Because the human brain can only keep about $7 \pm 2$ concepts in working memory at any given time, functions allow us to abstract and build more complicated code without making it more difficult to understand. Instead of having to worry about how filter_data does its thing, we need only know its inputs and its outputs to wield it effectively. It can be re-used without copying and pasting, and we can maintain it and improve it easily since it can be defined in a single place. # # Since we may want to use these functions later in other notebooks/code, we can put their definitions in a file and import it. Opening a text editor and copying the function definitions (and the imports they need) to a file datastuff.py, that looks like this: # In[150]: get_ipython().run_line_magic('cat', 'datastuff.py') # Congratulations: you just made your first Python module! We can import it directly: # In[151]: import datastuff # And we can see our documentation for these functions and use them as you'd expect: # In[152]: get_ipython().run_line_magic('pinfo', 'datastuff.fahrenheit_to_celsius') # In[153]: datastuff.analyze(data) # **Note**: if you make modifications to your module, the will not show up in this session, even if you rerun ``import datastuff``. You will either need to restart your Python session, or explicitly tell the Python session to rebuild the module object by re-reading the file: # In[154]: import importlib # In[155]: importlib.reload(datastuff) # The reason this is the case is because it's more efficient for the Python interpreter to build a module object once from a module file, and then with each import encountered use that same object instead of building a whole new one. It's faster, and requires less memory, for this to be the case. # ## An aside: combining DataFrames # Although we didn't need to do it for these examples, ``pandas`` is generally more useful when we can pack as much of our data as is reasonable into a single ``DataFrame``. Since all our datasets have the same form, we can probably gain by combining them. # In[156]: import pandas as pd import glob # We will use the ``pandas.concat`` function to make a single ``DataFrame`` from all our others: # In[157]: filenames = glob.glob('*.csv') datas = [] keys = [] for filename in filenames: keys.append(filename.split('_')[0]) datas.append(pd.read_csv(filename)) df = pd.concat(datas, keys=keys, names=['location']) # We now have a "multi-index" ``DataFrame``: # In[158]: df # The outer index gives the area name ("location") the rows came from. These allow us to select rows on the basis of area: # In[159]: df.loc[('A1', 1)] # More powerfully, this allows us to get aggregates over all areas: # In[160]: df[df['year'] > 2005]['temperature'].mean() # Or get descriptive statistics directly as a function of location: # In[161]: df.mean(level='location') # ------------- # ### Challenge: obtain the minimum rainfall for each location given that the temperature was between 75F and 90F. # This is best read backwards. First, let's filter our rows so we only get those for the temperature range we want: # In[162]: df[(df['temperature'] > 75) & (df['temperature'] < 90)] # Then, we can grab the minimum rainfall for these: # In[163]: df[(df['temperature'] > 75) & (df['temperature'] < 90)]['rainfall'].min() # But we wanted minimum rainfall **for each location**; we can tell ``Series.min`` to split the data by location: # In[164]: df[(df['temperature'] > 75) & (df['temperature'] < 90)]['rainfall'].min(level='location') # -------------- # We don't have time during this lesson to get into ``groupby``s, but they are such a central part of ``pandas`` that we'd be remiss if we excluded them. A ``groupby`` allows us to split our rows according to values in one or more column. For example, we could ask for the minimum rainfall measured across all areas for each year: # In[165]: df.groupby('year')['rainfall'].min() # This should be read as "group the data by year, then get the minimum rainfall for each group." If we want to do something more complex for each group of rows, we can use a ``for`` loop: # In[166]: for name, group in df.groupby('year'): print(name, group['rainfall'].min()) # To learn more about ``groupby``s, check out the ``pandas`` docs on the "split-apply-combine" approach to working with datasets with ``groupby``: http://pandas.pydata.org/pandas-docs/stable/groupby.html # ## Conditionals # We've already used boolean operations to filter rows on ``DataFrames``, but more broadly we must be able to write code that makes decisions based on the inputs it is given. # We can use conditionals. To demonstrate how these work, we'll use absurd toy examples. Say we have some number: # In[167]: num = 37 # We can ask whether that number is greater than 100 with a conditional like this: # In[169]: if num > 100: print('greater') else: print('not greater') # We could add some more sophisticatioin to the conditional by checking for equality: # In[170]: if num > 100: print('greater') elif num == 100: print('equal!') else: print('not greater') # ...and we could test it: # In[171]: num = 100 # In[172]: if num > 100: print('greater') elif num == 100: print('equal!') else: print('not greater') # Conditionals can include boolean operators such as ``and``: # In[173]: if (1 > 0) and (-1 > 0): print('both parts are true') else: print('one part is not true') # A "truth table" for ``and``, given two boolean values ``s1`` and ``s2``, each either ``True`` or ``False`` gives: # | s1 | s2 | s1 and s2 | # | ----- | ----- | --------- | # | True | True | True | # | True | False | False | # | False | True | False | # | False | False | False | # There's also ``or``: # In[174]: if (1 > 0) or (-1 > 0): print('at least one part is true') else: print('no part is true') # | s1 | s2 | s1 or s2 | # | ----- | ----- | --------- | # | True | True | True | # | True | False | True | # | False | True | True | # | False | False | False | # And ``not``: # In[176]: if (1 > 0) and not (-1 > 0): print('both parts are true') else: print('one part is not true') # | s1 | not s1 | # | ----- | ------ | # | True | False | # | False | True | # ------- # ### Challenge: rewrite your Fahrenheit to Celsius converter with an if-else statement such that it can also take a list of values in addition to arrays/single numbers. # There are perhaps other ways to do this, but if we want to specifically handle the case of **lists** of temperatures, we can use the builtin ``isinstance`` to check that we are dealing with a list: # In[177]: def fahrenheit_to_celsius(temp): """Convert temperature in fahrenheit to celsius. Parameters ---------- temp : float or array_like Temperature(s) in fahrenheit. Returns ------- float or array_like Temperatures in celsius. bubbles """ if isinstance(temp, list): newtemps = [] for value in temp: newtemps.append(fahrenheit_to_celsius(value)) else: newtemps = (temp - 32) * 5/9 return newtemps # This function takes advantage of recursion for the case of a list, calling itself on each value in the list and converting those values accordingly. # ## Errors and exceptions # Encountering errors is part of coding. If you are coding, you will hit errors. The important thing to remember is that errors that are loud are the best kind, because they usually give hints as to what the problem is. In Python, errors are known as **exceptions**, and there are a few common varieties. Familiarity with these varieties helps to more quickly identify the root of the problem, which means we can more quickly fix it. # We want to run a function inside a module called ``errors_01.py``; it's inside our ``scripts`` directory: # In[180]: get_ipython().run_line_magic('ls', 'scripts/') # Let's try to import it: # In[181]: import errors_01 # We got an ``ImportError``. This exception tells us that Python cannot find a module we are trying to import. To troubleshoot why, we should first look at ``sys.path``: # In[182]: import sys sys.path # The Python interpreter looks at each location in this list, in order, when we try to import a module by name. Since our ``scripts`` directory is not in this list, ``errors_01`` is not found. We can add the directory to the list, though, to make ``errors_01`` findable: # In[183]: sys.path.append('scripts/') # In[184]: sys.path # Note that this change will only be present in this session. It is not permanent. This is a perfectly fine method, though, for using a module you've built yourself. Let's import ``errors_01`` now: # In[185]: import errors_01 # We get a different exception now: a ``SyntaxError``. This the Python interpreter's way of telling us that we are using poor grammar, and it doesn't understand what we are telling it. In this case, it tells us exactly what's wrong: # # SyntaxError: Missing parentheses in call to 'print' # # If we make the suggested fix in, say, ``errors_02.py`` (we copied the file here for demo purposes, but normally we would fix it and [version control it with git](https://git-scm.com/)): # In[186]: get_ipython().run_line_magic('cat', 'scripts/errors_02.py') # In[187]: import errors_02 # This one imports just fine. Let's run ``errors_02.favorite_ice_cream``: # In[188]: errors_02.favorite_ice_cream() # This time we get an ``IndexError``. Let's dissect this. The first line shows where the error originates in our code. In this case it's the only line in the cell. The next set of lines shows where the error originates at the source: that is, inside the definition of ``errors_02.favorite_ice_cream`` in line 7 of the module ``errors_02``. The last line gives us a hint as to the nature of the ``IndexError``: we appear to be using an index for an element of a list that doesn't exist. # There are [other types of exceptions](https://docs.python.org/3/library/exceptions.html), but these are some of the most common. You need not know all of them, but having some knowledge of the basic varieties is valuable for understanding what's going wroing, and then how to fix it! # ### Aside: catching exceptions # In addition to being useful tools for debugging, exceptions are **catchable**. That is, you can write code in such a way that it is able to recover from failure or exceptional cases. As an example, let's rewrite our Fahrenheit to Celsius converter one last time: # In[189]: def fahrenheit_to_celsius(temp): """Convert temperature in fahrenheit to celsius. Parameters ---------- temp : float or array_like Temperature(s) in fahrenheit. Returns ------- float or array_like Temperatures in celsius. bubbles """ try: newtemps = (temp - 32) * 5/9 except TypeError: newtemps = [] for value in temp: newtemps.append(fahrenheit_to_celsius(value)) return newtemps # This function can take lists just like the one with the ``if-else`` and ``isinstance`` check, but it turns the problem on its head. It uses a ``try-except`` block instead, first trying to use ``temp`` as a single number or array, and when that fails with a ``TypeError`` (as it would if ``temp`` is a list), it then handles ``temp`` as if it were a list. # # This is known as "ask forgiveness" style: we do the most common thing, and if that fails, we then do something else. This is in contrast to "ask permission" style, as exhibited in the ``if-else`` implementation: we first checked if we had a funny case, and if not, did the common thing. # # Which you choose is often a matter of style, though there are performance benefits to "ask forgiveness" when the odd cases truly are rare; when the other cases a function must deal with are almost equally common, it is better to use ``if-else`` statements to direct flow. # ## Defensive programming: being skeptical of your code # Say we write a function that does something a bit obtuse (heh): Given the coordinates of the lower-left and upper-right corners of a rectangle, it normalizes the rectangle so that its lower-left coordinate is at the origin (0, 0), and its longest side is 1.0 units long: # In[190]: def normalize_rectangle(rect): """Normalizes a rectangle so that it is at the origin, and 1.0 units long on its longest axis. :Arguments: *rect* a tuple of size four, giving (x0, y0, x1, y1), the (x,y) positions of the lower-left and upper-right corner :Returns: *newrect* a tuple of size four, giving the new coordinates of our rectangle's vertices """ x0, y0, x1, y1 = rect assert x0 < x1, 'Invalid x coordinates' assert y0 < y1, 'Invalid y coordinates' dx = x1 - x0 dy = y1 - y0 if dx > dy: scaled = float(dx) / dy upper_x, upper_y = 1.0, scaled else: scaled = float(dx) / dy upper_x, upper_y = scaled, 1.0 assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid' assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid' return (0, 0, upper_x, upper_y) # Notice that we added a nice documentation string at the top that explicitly states the functions inputs (arguments) and outputs (returns), so when we use it we don't need to mind its details anymore. Also, at the beginning of the function we have ``assert`` statements that check the validity of the inputs, while at the end of the function we have ``assert`` statements that check the validity of the outputs. # # These statements codify what our documentation says about what the inputs should mean and what the output should look like, guarding against cases where the function is given inputs it wasn't designed to handle. # Let's test the function! # In[191]: normalize_rectangle((3, 4, 4, 6)) # Looks good. First point is at the origin; longest side is 1.0. # In[192]: normalize_rectangle((0.0, 1.0, 2.0, 5.0)) # Nice! # In[193]: normalize_rectangle((0.0, 0.0, 5.0, 1.0)) # Oh...what happened here? The inputs should be fine, but an assertion caught something strange in the output. Looking closely at the code, we spot a mistake in line 26: # In[194]: def normalize_rectangle(rect): """Normalizes a rectangle so that it is at the origin, and 1.0 units long on its longest axis. :Arguments: *rect* a tuple of size four, giving (x0, y0, x1, y1), the (x,y) positions of the lower-left and upper-right corner :Returns: *newrect* a tuple of size four, giving the new coordinates of our rectangle's vertices """ x0, y0, x1, y1 = rect assert x0 < x1, 'Invalid x coordinates' assert y0 < y1, 'Invalid y coordinates' dx = x1 - x0 dy = y1 - y0 if dx > dy: scaled = float(dy) / dx upper_x, upper_y = 1.0, scaled else: scaled = float(dx) / dy upper_x, upper_y = scaled, 1.0 assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid' assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid' return (0, 0, upper_x, upper_y) # And now we get: # In[195]: normalize_rectangle((0.0, 0.0, 5.0, 1.0)) # In[196]: normalize_rectangle((2, 3, 4, 6)) # That looks better. # What if we were to try feeding in only three values instead of four? # In[197]: normalize_rectangle((2, 4, 5)) # That's a rather unhelpful error message. We can also add an assertion that checks that the input is long enough, and gives a message indicating what the problem is: # In[198]: def normalize_rectangle(rect): """Normalizes a rectangle so that it is at the origin, and 1.0 units long on its longest axis. :Arguments: *rect* a tuple of size four, giving (x0, y0, x1, y1), the (x,y) positions of the lower-left and upper-right corner :Returns: *newrect* a tuple of size four, giving the new coordinates of our rectangle's vertices """ assert len(rect) == 4, "Rectangle must have 4 elements" x0, y0, x1, y1 = rect assert x0 < x1, 'Invalid x coordinates' assert y0 < y1, 'Invalid y coordinates' dx = x1 - x0 dy = y1 - y0 if dx > dy: scaled = float(dy) / dx upper_x, upper_y = 1.0, scaled else: scaled = float(dx) / dy upper_x, upper_y = scaled, 1.0 assert 0 < upper_x <= 1.0, 'Calculated upper X coordinate invalid' assert 0 < upper_y <= 1.0, 'Calculated upper Y coordinate invalid' return (0, 0, upper_x, upper_y) # In[199]: normalize_rectangle((2, 4, 5)) # Now when we use this function six months later, we won't have to scratch our heads too long when this sort of thing happens! *Defensive programming* in this way makes code more robust, easier to debug, and easier to maintain. This is especially important for scientific code, since incorrect results can occur silently and end up getting published, perhaps only getting uncovered years later. Much embarrassment and misinformation can be avoided by being skeptical of our code, using tools such as ``assert`` statements to make sure it keeps doing what we think it should. # ## Writing Python scripts # Running Python code from a notebook is great, but we often need to run Python code from the command line, perhaps on files as input data. We can turn our module ``datastuff.py`` into a simple script by adding some meat at the bottom: # In[ ]: if __name__ == '__main__': import pandas as pd import sys for dataset in sys.argv[1:]: print(dataset) data = pd.read_csv(dataset) data['temperature'] = fahrenheit_to_celsius( data['temperature']) fig = analyze_mosquitos(data) fig.savefig(dataset + "_plot.pdf") # The ``if __name__ == '__main__'`` statement at the top says that this code should only be executed if the module is being run as a script. It's a common check that the current namespace is `__main__`, as it would only be if the code were being executed as a script. # # Adding this allows us to run the script on any number of input files as: # # python datastuff.py A1_mosquito_data.csv A2_mosquito_data.csv # # or even: # # python datastuff.py *.csv # # It will loop through each filename given as arguments, convert the temperature column to Celsius, then generate and save a PDF of the analysis figures for that dataset. # # ``sys.argv`` is a list having the name of the script as its zeroth element, then each argument to the script in order from the first to the last. We do ``sys.argv[1:]`` to slice out only those arguments so we can iterate through them. This is a quick-and-dirty way to parse arguments, but if you're interested in writing more sophisticated command-line utilities, definitely use [``argparse``](https://docs.python.org/3/library/argparse.html).