Scientific Programming II

Justin Kitzes

In this second lesson, we'll apply the core elements of programming languages to a common scientific task: reading a data table, performing analysis on its contents, and saving the results. As we go through these different stages of analysis, we'll note how each task relates to one of the elements from the last lesson (we'll note these as E1, for example, for element 1, "a thing").

For this lesson, we'll use two important scientific Python packages. numpy is the main package providing numerical analysis functions, and pandas is designed to make it easy to work with tabular data.

In [ ]:
# Import pandas and set up inline plotting
import pandas as pd
%matplotlib inline

1. Reading and examining a data table

Unless you're generating your own data by simulation (as in our previous logistic growth function), most scientific analyses begin with loading an external data set.

For this lesson, we'll use data from the North American Breeding Bird Survey. As part of this survey, volunteers have driven cars along fixed routes once a year for the past forty years, stopping periodically along the way and counting all of the birds that they see at each stop. The particular data tables that we'll work with today summarize the number of birds of many different species that were counted along routes in the state of California. The large table contains forty years of data for all sighted species, while the small table is a subset of the large table.

You can download and play with this data yourself at:

Pardieck, K.L., D.J. Ziolkowski Jr., M.-A.R. Hudson. 2015. North American Breeding Bird Survey Dataset 1966 - 2014, version 2014.0. U.S. Geological Survey, Patuxent Wildlife Research Center http://www.pwrc.usgs.gov/BBS/RawData/.

Tip: It's often a good idea to take a large data set and extract a small portion of it to use while building and testing your analysis code. Small data sets can be analyzed faster and allow you to see, visually, what the "right answer" should be when you write code to perform analysis. Determining whether your function gives the right answer on a small data set is the core idea behind unit testing, which we'll discuss later.
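
As a rough sketch of this tip, the example below cuts a large table down to a few rows and checks a result that is easy to work out by hand. The table `big` and its counts are made up for illustration, and `head` is just one of several ways to take a subset.

```python
import pandas as pd

# Hypothetical "large" table: made-up counts for 1,000 species in one year
big = pd.DataFrame(
    {"2010": range(1000)},
    index=[f"species_{i}" for i in range(1000)],
)

# Keep only the first three rows while developing the analysis
small = big.head(3)

# On three rows the right answer is easy to work out by hand: 0 + 1 + 2 = 3
small_total = small["2010"].sum()
assert small_total == 3
```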

In [ ]:
# You can use the exclamation point symbol (the "bang") to run a shell command
# Let's use cat to see the contents of the small data table
In [ ]:
# Read the small table using pandas
# The read_csv function (E2) in pandas creates a thing (E1) called a DataFrame
In [ ]:
# Now let's look at the contents of our data frame "thing" (E1)
In [ ]:
# Like other "things" in Python, a data frame is an object
# The object contains methods that operate on it (E2)
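
A minimal sketch of the reading and inspecting steps, under a few assumptions: it uses `pd.read_csv` with `index_col=0` so that species names become the row labels, and it reads from an in-memory string (via `io.StringIO`) instead of the real file so that it runs on its own. The three-species table and the name `bird_demo` are made up for illustration and are not real survey counts.

```python
import io

import pandas as pd

# Hypothetical stand-in for a small bird table: made-up counts, not survey data
csv_text = """species,2010,2011,2012
Spotted Owl,2,0,1
Barred Owl,0,0,3
Great Gray Owl,4,5,0
"""

# index_col=0 makes the first column (species names) the row labels
bird_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(bird_demo)          # the whole table
print(bird_demo.shape)    # an attribute holding (number of rows, number of columns)
print(bird_demo.head(2))  # a method (E2) that returns the first two rows
```

With a file on disk, the same call would take the file name in place of `io.StringIO(csv_text)`. Note that the year column labels are read in as strings, not numbers.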

A data frame can be conceptualized as a kind of "thing", like we have above, that we can move around and perform operations on. However, it also shares some characteristics in common with a collection of things (E3) because we can use indexes and slicing to pull out subsets of the data table.

There are two main ways that we can select rows and columns from our table: using the labels for the rows and columns or using numeric indexes for the row and column locations. Below we'll focus on label names - check out the pandas help for iloc to learn about using numeric indexes.

In [ ]:
# Look at the table again and think about it as a collection (E3)
In [ ]:
# Use the loc method to pull out rows and columns by name
# Like a matrix, the row goes first, then the column
In [ ]:
# You can use ranges of names, similar to what we saw before for lists
In [ ]:
# You can also use lists of names
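
The label-based selections described above might look like the following sketch. The table `bird_demo` and its counts are made up for illustration, with year labels stored as strings.

```python
import pandas as pd

# Hypothetical table with made-up counts; rows are species, columns are years
bird_demo = pd.DataFrame(
    {"2010": [2, 0, 4], "2011": [0, 0, 5], "2012": [1, 3, 0]},
    index=["Spotted Owl", "Barred Owl", "Great Gray Owl"],
)

# Single value: row label first, then column label
one_count = bird_demo.loc["Spotted Owl", "2010"]

# Range of labels: unlike list slicing, BOTH endpoints are included
year_range = bird_demo.loc["Barred Owl", "2010":"2011"]

# Lists of labels pick out arbitrary rows and columns
subset = bird_demo.loc[["Spotted Owl", "Great Gray Owl"], ["2010", "2012"]]

print(one_count)
print(year_range)
print(subset)
```

One difference from lists is worth remembering: a label range with `loc` includes its last label, while a numeric list slice stops just before its end index.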

2. Perform analysis

Once we have our data table read in, we generally want to perform some sort of analysis with it. Let's presume that we want to calculate the mean number of individuals sighted per species in each year. However, we only want the average over the species that were actually sighted in the state that year, ignoring species with counts of zero (this is a fairly common analysis in ecology).

Conceptually, one way to approach this problem is to imagine looping through (E4) all of the years (that is, the columns of the data frame) one by one. For each year, we want to count the number of species present, sum their counts, and divide the sum of the counts by the number of species seen. We should record this information in some other sort of collection (E3) - we'll use another data frame.

In [ ]:
# First, let's set up a new data frame to hold the result of our calculation
# We'll get the column names from the bird table, then use DataFrame to make a new df
In [ ]:
# Next, let's figure out how we would do our analysis for one year, say 2010
In [ ]:
# Our final calculation code could look like this
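
One way the single-year calculation could be sketched is shown below. The table `bird_demo`, its counts, the row label `"mean"`, and the variable names are all assumptions made for illustration. The sketch handles one year only.

```python
import pandas as pd

# Hypothetical table with made-up counts; rows are species, columns are years
bird_demo = pd.DataFrame(
    {"2010": [2, 0, 4], "2011": [0, 0, 5], "2012": [1, 3, 0]},
    index=["Spotted Owl", "Barred Owl", "Great Gray Owl"],
)

# Empty results table: one row, with the same year columns as the bird table
result = pd.DataFrame(index=["mean"], columns=bird_demo.columns)

year = "2010"
counts = bird_demo[year]       # all species counts for this year
present = counts[counts > 0]   # keep only species that were actually sighted

n_species = len(present)            # number of species present
total = present.sum()               # total individuals across those species
mean_count = total / n_species      # mean count per species present

# Store the answer in the results table, selecting the cell by label
result.loc["mean", year] = mean_count
print(result)
```

For 2010 in this made-up table, two species are present with counts of 2 and 4, so the mean works out by hand to 6 / 2 = 3.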

Exercise 1 - Calculating mean counts for every year

  1. Put the above calculation code into a for loop (E4) that loops over all years, calculating the mean count of birds per species present each year, and stores the result in a new empty data frame.
  2. Put all of the code that you just wrote into a new function (E6) that takes, as an argument, a data frame of bird counts (like bird_sm) and returns the result data frame. Test it with bird_sm to make sure that it works.

Bonus:

  1. Using label-based indexing, create a new data frame that has the same years but only includes these three species: Spotted Owl, Barred Owl, Great Gray Owl. Try running your function using this new smaller data frame and look at the results. Do you see a result that you may not want?
  2. Add an if-else statement (E5) that checks for the problem that you just uncovered and takes some reasonable action when it occurs.
In [ ]:
 
In [ ]:
 

3. Save the results

Now that we've managed to generate some useful results, we want to save them somewhere on our computer for later use. There are two broad types of outputs that we might want to save, tables and plots, and we'll use the built-in methods for data frames to do both.

Getting a plot to look just right can take a very long time. Here we'll just use the pandas default styles. For more help on plotting, have a look at the extra lesson on matplotlib.

In [ ]:
# First we make sure that we've saved our results table
In [ ]:
# Data frames have a method to save themselves as a csv file - easy!
In [ ]:
# Data frames also have a method to plot their contents
# There's one trick though - by default they put the rows on the x axis and columns on the y
# We want the reverse, so we need to transpose our data frame before plotting it
In [ ]:
# With a few extra steps, we can save the plot
# This code looks strange, since we haven't talked about the details of matplotlib
# At this stage, it's best to just use it as a recipe
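
A sketch of both saving steps, under stated assumptions: the one-row table `result` holds made-up values, the files go to a temporary scratch folder so that nothing in your working directory is overwritten, and the `Agg` backend line is only there so that the example runs outside a notebook (inside this notebook, `%matplotlib inline` already takes care of the display).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside the notebook
import pandas as pd

# Hypothetical one-row results table with made-up values
result = pd.DataFrame(
    {"2010": [3.0], "2011": [5.0], "2012": [2.0]}, index=["mean"]
)

# Scratch folder for the output files (an assumption for this sketch)
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "birds_results.csv")
pdf_path = os.path.join(out_dir, "birds_results.pdf")

# Save the table as a csv file
result.to_csv(csv_path)

# Transpose so that years become the rows, which pandas places on the x axis
ax = result.T.plot()

# plot returns a matplotlib Axes; its parent Figure knows how to save itself
fig = ax.get_figure()
fig.savefig(pdf_path)
```

The file extension passed to `savefig` determines the output format, so changing `.pdf` to `.png` would save an image file instead.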

Exercise 2 - A complete analysis

  1. Using all of the code that we wrote above, write code in the cell below that performs the following steps (this will form a complete analysis that runs without the rest of this notebook):
    • Import the pandas package
    • Read the birds_sm.csv table
    • Define a function to perform the analysis (just copy the one you wrote in Exercise 1)
    • Use that function to make a results dataframe for the birds_sm.csv data
    • Save the resulting table as birds_results.csv
    • Save a plot of the result as birds_results.pdf
  2. To test that your cell works on its own, go to the Jupyter menu bar, under Kernel, and choose "Restart Kernel". This will restart your notebook, so that everything that you've run so far (all the variables stored in memory, in particular) is erased. Run the cell below, and make sure it works correctly.
  3. Instead of birds_sm.csv, make your cell use birds_lg.csv and see what the saved results look like. If necessary, modify your code and variable names so that all you have to do is change two letters (sm to lg) in one place in the code to make this change.
In [ ]: