%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
In this exercise we will do a simple time series analysis. The data is taken from an experiment that measures the growth of bacteria (E. coli) in a 96-well microplate. The growth is measured in OD (optical density) over time in seconds.
The data file is in CSV format (comma separated values). The first row in the file holds the measurement times. The next 96 rows are the OD values in each well at each time point.
a) Start by loading the data using the loadtxt function in numpy. Note that in order to load a CSV file you must give it the proper delimiter argument (see lecture 7).
After you load the data, put the first row of the data in a variable called t (for time) and the rest of the rows in a variable called OD. Make sure (assert) that OD has 96 rows and that the number of columns in OD is equal to the length of t.
The data can be found at https://raw.githubusercontent.com/Py4Life/TAU2015/master/bacterial_growth.csv.
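The loading step in (a) can be sketched as follows. This is a minimal illustration that assumes a small synthetic table written to an in-memory buffer instead of the real file; the 5 time points and the random OD values are made up, and only the layout (one time row followed by 96 well rows) follows the description above.

```python
import io

import numpy as np

# Synthetic stand-in for the real file: 1 time row + 96 well rows, 5 time points.
rng = np.random.default_rng(0)
n_wells, n_times = 96, 5
fake = np.vstack([np.arange(n_times) * 600.0, rng.random((n_wells, n_times))])

# Write the table as comma separated text into an in-memory buffer.
buffer = io.StringIO()
np.savetxt(buffer, fake, delimiter=",")
buffer.seek(0)

# loadtxt splits on whitespace by default, so a CSV needs delimiter=",".
data = np.loadtxt(buffer, delimiter=",")
t = data[0, :]    # first row: measurement times
OD = data[1:, :]  # remaining rows: one row per well

assert OD.shape[0] == 96
assert OD.shape[1] == len(t)
```

With the real data, the buffer would be replaced by the filename (or URL) of the CSV file.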
b) Plot all the growth curves - one per well, or per row in the data. Matplotlib will assign a different color to each line you plot. Note that Matplotlib expects the length of x and y to be equal, but the length of OD is 96 (the number of wells, not the number of time points). To fix this you can transpose OD.
Don't forget to label the x and y axes.
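A minimal sketch of the transposed plot, assuming synthetic random values in place of the real t and OD. The Agg backend line is only there so the sketch runs without a display; in the notebook, %matplotlib inline takes its place. The axis label texts are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in the notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in: 7 time points, 96 wells.
t = np.linspace(0, 3600, 7)
OD = np.random.default_rng(1).random((96, t.size))

fig, ax = plt.subplots()
# OD has shape (96, 7); OD.T has shape (7, 96), so each column is one well
# and matplotlib draws one line per column against t.
lines = ax.plot(t, OD.T)
ax.set_xlabel("Time (seconds)")
ax.set_ylabel("OD")
```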
c) Now we want to present an aggregated version of the previous plot. In the next plot we will plot the mean and SEM (standard error of the mean) of the OD values across the wells at each time point. We will present the result as a line with errorbars, where the length of the errorbars is given by the SEM. Reminders:
Use assert to make sure you get what you expect: because we take the mean over all the wells, we expect the result of the mean to have the same length as t.
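The aggregation in (c) might look like the sketch below. It assumes synthetic OD values and uses the sample standard deviation (ddof=1) divided by the square root of the number of wells as the SEM; the choice of ddof is an assumption, not something the exercise specifies.

```python
import numpy as np

# Synthetic stand-in for the real measurements: 96 wells, 4 time points.
t = np.array([0.0, 600.0, 1200.0, 1800.0])
OD = np.tile(np.arange(96, dtype=float)[:, None], (1, t.size))

# axis=0 aggregates over the wells (rows), leaving one value per time point.
mean_OD = OD.mean(axis=0)
sem_OD = OD.std(axis=0, ddof=1) / np.sqrt(OD.shape[0])

assert mean_OD.shape == t.shape
assert sem_OD.shape == t.shape
```

The two arrays can then be passed to plt.errorbar(t, mean_OD, yerr=sem_OD) to draw the line with errorbars.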
d) Finally, we want to check the distributions of the maximum and minimum OD values in each well (row of data). To do this, we will calculate the maximum and minimum OD over time in each well and plot two histograms, one for the maximum OD and one for the minimum OD.
Use subplots to place the two histograms in one figure.
You can draw a histogram with a .bar plot, but a better, easier way to do this is with the hist function. Make sure you use enough bins to make the plot interesting, but not too many.
Use assert to make sure you aggregated on the right axis: check the length of the aggregation result against your expectation.
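A possible shape for (d), again on synthetic data. The bin count of 20 and the figure size are arbitrary choices, and the Agg backend line is only needed outside the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in the notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in: 96 wells, 10 time points.
OD = np.random.default_rng(2).random((96, 10))

# axis=1 aggregates over time (columns), leaving one value per well.
max_OD = OD.max(axis=1)
min_OD = OD.min(axis=1)
assert len(max_OD) == 96
assert len(min_OD) == 96

# Two panels side by side, one histogram in each.
fig, (ax_max, ax_min) = plt.subplots(1, 2, figsize=(8, 3))
ax_max.hist(max_OD, bins=20)
ax_max.set_xlabel("Maximum OD")
ax_min.hist(min_OD, bins=20)
ax_min.set_xlabel("Minimum OD")
```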
In this question we will learn how to use the very useful split-apply-combine paradigm in Pandas and how to use it to create sophisticated plots with very little coding.
Start by reading this blog post by Brian Connelly, which describes the data and functions we will use.
0) Start by loading the dataset from this URL: http://bconnelly.net/wp-content/uploads/2013/10/TradeoffData.csv.
This data set contains the fitness of a flocculated strain of Escherichia coli relative to a non-flocculated strain when grown alone in either spatially-structured (dish) or spatially-unstructured (tube) environments.
Use the read_csv function in pandas, which can open CSV files from the local filesystem using a filename or from a remote resource using a URL. After reading the file into a variable called data (in your research consider using a less generic name for the DataFrame variable), view it by calling the method head to see the first few rows in data.
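As a sketch of this step, the snippet below reads a tiny made-up CSV from an in-memory buffer. The column names Group, Treatment and RelativeFitness follow the exercise text; the group labels G1 and G2 and all the numbers are invented for illustration and are not the real data.

```python
import io

import pandas as pd

# Made-up stand-in for the real file; values are not real measurements.
csv_text = """Group,Treatment,RelativeFitness
G1,Dish,1.0
G1,Dish,3.0
G1,Tube,0.5
G1,Tube,1.5
G2,Dish,2.0
G2,Dish,4.0
G2,Tube,1.0
G2,Tube,2.0
"""

# read_csv accepts a filename, a URL, or a file-like object such as this buffer.
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())  # head() shows the first 5 rows by default
```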
a) First, you should group the data by the Treatment variable and call the describe method on the grouped data to see a textual summary of the RelativeFitness distribution in each Treatment (Dish and Tube).
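The split-apply-combine step in (a) can be sketched on a small made-up table (the group labels and numbers are invented; only the column names come from the exercise):

```python
import pandas as pd

# Made-up stand-in for the real data set.
data = pd.DataFrame({
    "Group": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2"],
    "Treatment": ["Dish", "Dish", "Tube", "Tube", "Dish", "Dish", "Tube", "Tube"],
    "RelativeFitness": [1.0, 3.0, 0.5, 1.5, 2.0, 4.0, 1.0, 2.0],
})

# Split by Treatment, then summarize RelativeFitness within each group.
summary = data.groupby("Treatment")["RelativeFitness"].describe()
print(summary)
```

The result has one row per Treatment and one column per summary statistic (count, mean, std, min, quartiles, max).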
b) Next, we want to plot a summary of the distribution of RelativeFitness in each of the Treatments.
Here we aim at getting a plot of the mean or median of the RelativeFitness together with some measure of the variance in the data. This can be achieved with a boxplot, violinplot, whiskerplot or a regular plot with errorbars.
So - plot either a boxplot, violinplot or a plot with errorbars of the data. A boxplot will show the median, quartiles and outliers; the violinplot will show the entire distribution of values; the errorbar plot will show the mean and the standard deviation; factorplots are seaborn's version of the errorbar plot (factorplot was renamed catplot in seaborn 0.9).
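One of the options above, the errorbar plot, could be sketched with plain Matplotlib as follows. The data are made up, the Agg backend line is only needed outside the notebook, and the marker style and cap size are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in the notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the real data set.
data = pd.DataFrame({
    "Treatment": ["Dish", "Dish", "Dish", "Dish", "Tube", "Tube", "Tube", "Tube"],
    "RelativeFitness": [1.0, 3.0, 2.0, 4.0, 0.5, 1.5, 1.0, 2.0],
})

# Mean and standard deviation of RelativeFitness per Treatment.
stats = data.groupby("Treatment")["RelativeFitness"].agg(["mean", "std"])

# One point per Treatment, with the standard deviation as the errorbar.
x = np.arange(len(stats))
fig, ax = plt.subplots()
ax.errorbar(x, stats["mean"], yerr=stats["std"], fmt="o", capsize=4)
ax.set_xticks(x)
ax.set_xticklabels(stats.index)
ax.set_xlabel("Treatment")
ax.set_ylabel("RelativeFitness")
```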
c) We now want to check if the variance between Groups in the same Treatment is large and if the Treatment had the same effect on all Groups.
Do a new grouping, this time by both Group and Treatment, and print the result of the describe method.
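Grouping by two variables works the same way as grouping by one, with a list of column names. A minimal sketch on made-up data (labels and numbers are invented):

```python
import pandas as pd

# Made-up stand-in for the real data set.
data = pd.DataFrame({
    "Group": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2"],
    "Treatment": ["Dish", "Dish", "Tube", "Tube", "Dish", "Dish", "Tube", "Tube"],
    "RelativeFitness": [1.0, 3.0, 0.5, 1.5, 2.0, 4.0, 1.0, 2.0],
})

# Passing a list of columns creates one group per (Group, Treatment) pair.
summary = data.groupby(["Group", "Treatment"])["RelativeFitness"].describe()
print(summary)
```

The rows of the result are indexed by a two-level index, so a single combination can be looked up with a tuple such as ("G1", "Dish").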
d) Now use the sns.FacetGrid function to create a faceted plot of the distributions of RelativeFitness. Each facet should be similar to the plot you made in (b) (but you are free to choose a different plot type if you want to practice it!). Facet on either column (col) or row (row) to make a wide or long plot.
Create two figures - in the first you facet according to Treatment and group by Group, and in the second vice-versa, facet by Group and group by Treatment.
For clarity and bonus points, use the hue argument of FacetGrid and set it to the same variable as you facet by.
Note that you may have to set the value of the argument size in FacetGrid (renamed height in seaborn 0.9 and later) to a number larger than the default 3.
e) Finally, we want to save a file with the mean and standard deviation of RelativeFitness for each of the Groups and Treatments. Use the aggregate method of the DataFrameGroupBy object created by groupby and give it the names of the required functions - np.mean for the mean and np.std for the standard deviation.
Save the result to a CSV file using the to_csv method of the DataFrame object created by aggregate. The filename should be agg.csv.
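A sketch of (e) on made-up data. Two things here are assumptions rather than part of the exercise: the string names "mean" and "std" are used in place of np.mean and np.std (pandas' std is the sample standard deviation, ddof=1, whereas np.std called directly on an array defaults to ddof=0), and the file is written to a temporary directory so the sketch does not touch the working directory.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the real data set.
data = pd.DataFrame({
    "Group": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2"],
    "Treatment": ["Dish", "Dish", "Tube", "Tube", "Dish", "Dish", "Tube", "Tube"],
    "RelativeFitness": [1.0, 3.0, 0.5, 1.5, 2.0, 4.0, 1.0, 2.0],
})

# Mean and standard deviation of RelativeFitness per (Group, Treatment) pair.
grouped = data.groupby(["Group", "Treatment"])["RelativeFitness"]
agg = grouped.aggregate(["mean", "std"])

# Hypothetical scratch directory; in the exercise the file goes next to the notebook.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "agg.csv")
agg.to_csv(path)
```

Reading the file back with pd.read_csv(path, index_col=[0, 1]) restores the two-level Group/Treatment index.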