Plotting of data in pandas is handled by an external Python module called matplotlib. Like pandas, it is a large library with a venerable history (first released in 2003), so we couldn't hope to cover all its functionality in this course. To see the wide range of possibilities matplotlib offers, see its example gallery.
While working through these examples you will likely find it very useful to refer to the matplotlib documentation.
First we import pandas
in the same way as we did previously.
import pandas as pd
from pandas import Series, DataFrame
Throughout this section we will also need to use some mathematical functions such as $\sin$ and $\cos$. They are provided by numpy
, or numerical Python:
import numpy as np
Some matplotlib functionality is provided directly through pandas (such as the plot() method, as we will see) but for much of it you need to import the matplotlib interface itself.
The most common interface to matplotlib is its pyplot
module which provides a way to create figures and display them in the notebook. By convention this is imported as plt
.
import matplotlib.pyplot as plt
Once we have imported matplotlib we can start calling its functions. Functions called on the plt module act on matplotlib's global state (the current figure and axes), so their effects persist from that point on in the script.
We first need to import some data to plot. Let's start with the data from the pandas section (available from cetml1659on.dat) and import it into a DataFrame
:
df = pd.read_csv(
'cetml1659on.dat', # file name
skiprows=6, # skip header
sep=r'\s+', # whitespace separated (raw string keeps the backslash literal)
na_values=['-99.9', '-99.99'], # NaNs
)
df.head()
| | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | YEAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1659 | 3.0 | 4.0 | 6.0 | 7.0 | 11.0 | 13.0 | 16.0 | 16.0 | 13.0 | 10.0 | 5.0 | 2.0 | 8.87 |
1660 | 0.0 | 4.0 | 6.0 | 9.0 | 11.0 | 14.0 | 15.0 | 16.0 | 13.0 | 10.0 | 6.0 | 5.0 | 9.10 |
1661 | 5.0 | 5.0 | 6.0 | 8.0 | 11.0 | 14.0 | 15.0 | 15.0 | 13.0 | 11.0 | 8.0 | 6.0 | 9.78 |
1662 | 5.0 | 6.0 | 6.0 | 8.0 | 11.0 | 15.0 | 15.0 | 15.0 | 13.0 | 11.0 | 6.0 | 3.0 | 9.52 |
1663 | 1.0 | 1.0 | 5.0 | 7.0 | 10.0 | 14.0 | 15.0 | 15.0 | 13.0 | 10.0 | 7.0 | 5.0 | 8.63 |
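To see what each read_csv() argument does without needing the real file, here is a minimal sketch that parses a small in-memory string instead. The header text, the name `sample` and the second data row are made up for illustration; the second row deliberately uses the -99.9 and -99.99 placeholders so that they are turned into NaN.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: two header lines, then the data
text = """\
Made-up description line 1
Made-up description line 2
JAN FEB YEAR
1659 3.0 4.0 8.87
1660 0.0 -99.9 -99.99
"""

sample = pd.read_csv(
    io.StringIO(text),
    skiprows=2,                     # skip the two description lines
    sep=r'\s+',                     # columns are separated by whitespace
    na_values=['-99.9', '-99.99'],  # placeholders that mean "missing"
)

# The data rows have one more field than the column header,
# so pandas uses the first field (the year) as the index.
print(sample)
```

The same mechanism is what turns the year into the index of df above.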
Pandas integrates matplotlib directly, so any DataFrame can be plotted simply by calling the plot() method on one of its columns. This returns a plot object (a matplotlib Axes), which we save as the variable year_plot. We can then manipulate this object, for example by setting the axis labels with the year_plot.set_ylabel() method, before displaying it with plt.show().
year_plot = df['YEAR'].plot()
year_plot.set_ylabel(r'Temperature ($^\circ$C)')
plt.show()
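The object returned by plot() is an ordinary matplotlib Axes, so any Axes method can be used on it. The sketch below assumes nothing about the data file: it rebuilds the YEAR column from the five rows shown by df.head() above (the name `yearly` is just for this example) and sets a few more properties.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.axes import Axes

# The YEAR values from the five rows printed by df.head()
yearly = pd.Series(
    [8.87, 9.10, 9.78, 9.52, 8.63],
    index=[1659, 1660, 1661, 1662, 1663],
    name='YEAR',
)

year_plot = yearly.plot()
print(isinstance(year_plot, Axes))  # the plot object is a matplotlib Axes

# Any Axes method is available on the returned object
year_plot.set_xlabel('Year')
year_plot.set_ylabel(r'Temperature ($^\circ$C)')
year_plot.set_title('Yearly mean temperature')
# plt.show() would now display the figure
```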
Try making two plot() calls with different months (January and July for example) before calling show().
While it's useful to be able to quickly plot any data we have in front of us, matplotlib's power comes from its configurability. Let's experiment with a dataset and see how much we can change the plot.
We'll start with a simple DataFrame
containing two columns, one with the values of a cosine, the other with the values of a sine.
X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
data = {'cos': np.cos(X), 'sin': np.sin(X)}
trig = DataFrame(index=X, data=data)
trig.plot()
plt.show()
You can see that it has plotted the sine and cosine curves between $-\pi$ and $\pi$. Now, let's go through and see how we can affect the display of this plot.
As a first step, we want to have the cosine in blue and the sine in red, with a slightly thicker line for both of them. To do this we need separate calls to the plot method, one for each column in our DataFrame.
Also, to be explicit about where we want things plotted, we will create a Figure object containing a single subplot using plt.subplots(), and do our drawing on that subplot. Confusingly, in matplotlib subplots are also referred to as Axes, so we name our subplot reference ax. We can pass this Axes object to our pandas plotting function to tell it where it should plot the data:
fig, ax = plt.subplots()
trig["cos"].plot(color="blue", linewidth=2.5, linestyle="-", ax=ax)
trig["sin"].plot(color="red", linewidth=2.5, linestyle="-", ax=ax)
plt.show()
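Passing ax= becomes more obviously useful once a figure holds more than one subplot. As a small sketch (the side-by-side layout, the figure size and the names ax_cos and ax_sin are choices made for this example), plt.subplots() can create two Axes and each column can be sent to its own one:

```python
import matplotlib.pyplot as plt
import numpy as np
from pandas import DataFrame

X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
trig = DataFrame(index=X, data={'cos': np.cos(X), 'sin': np.sin(X)})

# One Figure holding two Axes, side by side
fig, (ax_cos, ax_sin) = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Each pandas plot call is told which Axes to draw on
trig["cos"].plot(color="blue", ax=ax_cos)
trig["sin"].plot(color="red", ax=ax_sin)
ax_cos.set_title("cos")
ax_sin.set_title("sin")
# plt.show() would now display both subplots in one figure
```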
The current limits of the figure are a bit too tight, so we add some space around the data to make every point clearly visible.
fig, ax = plt.subplots()
trig["cos"].plot(color="blue", linewidth=2.5, linestyle="-", ax=ax)
trig["sin"].plot(color="red", linewidth=2.5, linestyle="-", ax=ax)
### New code
ax.set_xlim(trig.index.min() * 1.1, trig.index.max() * 1.1)
ax.set_ylim(trig.cos.min() * 1.1, trig.cos.max() * 1.1)
### End of new code
plt.show()
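Multiplying the minimum and maximum by 1.1 works here because the data is symmetric around zero; for data that is all positive it would move the lower limit the wrong way. One alternative, sketched below with a hypothetical helper called padded_limits, is to pad by a fraction of the data range instead. Note that this gives slightly different limits from the code above.

```python
import numpy as np


def padded_limits(values, fraction=0.1):
    """Return (low, high) widened by `fraction` of the data range on each side."""
    low, high = float(np.min(values)), float(np.max(values))
    pad = (high - low) * fraction
    return low - pad, high + pad


print(padded_limits([0, 10]))  # (-1.0, 11.0)
print(padded_limits([-1, 1]))  # (-1.2, 1.2)
```

With the trig DataFrame it could be used as ax.set_xlim(*padded_limits(trig.index)) and ax.set_ylim(*padded_limits(trig["cos"])).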
Current ticks are not ideal because they do not show the interesting values ($\pm\pi$,$\pm\frac{\pi}{2}$) for sine and cosine. We’ll change them such that they show only these values.
fig, ax = plt.subplots()
trig["cos"].plot(color="blue", linewidth=2.5, linestyle="-", ax=ax)
trig["sin"].plot(color="red", linewidth=2.5, linestyle="-", ax=ax)
ax.set_xlim(trig.index.min() * 1.1, trig.index.max() * 1.1)
ax.set_ylim(trig.cos.min() * 1.1, trig.cos.max() * 1.1)
### New code
ax.set_xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
ax.set_yticks([-1, 0, +1])
### End of new code
plt.show()
Ticks are now properly placed but their labels are not very explicit. We could guess that 3.142 is $\pi$ but it would be better to make it explicit. Having set the tick values, we can provide a corresponding label for each one with set_xticklabels() and set_yticklabels(). Note that we’ll use LaTeX to allow for nice rendering of the labels.
fig, ax = plt.subplots()
trig["cos"].plot(color="blue", linewidth=2.5, linestyle="-", ax=ax)
trig["sin"].plot(color="red", linewidth=2.5, linestyle="-", ax=ax)
ax.set_xlim(trig.index.min() * 1.1, trig.index.max() * 1.1)
ax.set_ylim(trig.cos.min() * 1.1, trig.cos.max() * 1.1)
ax.set_xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
ax.set_yticks([-1, 0, +1])
### New code
ax.set_xticklabels([r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$'])
ax.set_yticklabels([r'$-1$', r'$0$', r'$+1$'])
### End of new code
plt.show()
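Typing the labels by hand is fine for five ticks, but they can also be generated. The following is only a sketch: pi_half_label is a hypothetical helper written for this example (it is not part of matplotlib) that builds the LaTeX label for the tick at n times $\frac{\pi}{2}$.

```python
import numpy as np


def pi_half_label(n):
    """LaTeX label for the tick at n * pi / 2 (hypothetical helper)."""
    if n == 0:
        return r'$0$'
    sign = '-' if n < 0 else '+'
    n = abs(n)
    if n % 2 == 0:
        # Even multiples of pi/2 are whole multiples of pi
        coeff = '' if n == 2 else str(n // 2)
        return rf'${sign}{coeff}\pi$'
    coeff = '' if n == 1 else str(n)
    return rf'${sign}{coeff}\pi/2$'


multiples = range(-2, 3)
ticks = [n * np.pi / 2 for n in multiples]
labels = [pi_half_label(n) for n in multiples]
print(labels)
```

The two lists can then be passed to ax.set_xticks(ticks) and ax.set_xticklabels(labels), giving the same labels as the hand-written version.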
Spines are the lines connecting the axis tick marks and noting the boundaries of the data area. They can be placed at arbitrary positions and until now, they were on the border of the axis. We’ll change that since we want to have them in the middle. Since there are four of them (top/bottom/left/right), we’ll discard the top and right by setting their color to none and we’ll move the bottom and left ones to coordinate 0 in data space coordinates.
fig, ax = plt.subplots()
trig["cos"].plot(color="blue", linewidth=2.5, linestyle="-", ax=ax)
trig["sin"].plot(color="red", linewidth=2.5, linestyle="-", ax=ax)
ax.set_xlim(trig.index.min() * 1.1, trig.index.max() * 1.1)
ax.set_ylim(trig.cos.min() * 1.1, trig.cos.max() * 1.1)
ax.set_xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
ax.set_yticks([-1, 0, +1])
ax.set_xticklabels([r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$'])
ax.set_yticklabels([r'$-1$', r'$0$', r'$+1$'])
### New code
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
### End of new code
plt.show()
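The same spine changes can be written a little more compactly. This sketch hides the top and right spines with set_visible(False) rather than colouring them 'none', and uses the 'zero' shorthand that set_position() accepts in place of ('data', 0). It draws a single cosine curve with plain matplotlib to keep the example short.

```python
import matplotlib.pyplot as plt
import numpy as np

X = np.linspace(-np.pi, np.pi, 256)
fig, ax = plt.subplots()
ax.plot(X, np.cos(X), color="blue")

# Hide the two spines we do not want
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)

# Move the remaining two so that they cross at the data origin
for side in ('bottom', 'left'):
    ax.spines[side].set_position('zero')
# plt.show() would now display the figure
```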
Let’s add a legend in the upper left corner. Pandas already labels each line with the name of its column (the text that will be used in the legend box), so this only requires calling ax.legend() with the location we want.
fig, ax = plt.subplots()
trig["cos"].plot(color="blue", linewidth=2.5, linestyle="-", ax=ax)
trig["sin"].plot(color="red", linewidth=2.5, linestyle="-", ax=ax)
ax.set_xlim(trig.index.min() * 1.1, trig.index.max() * 1.1)
ax.set_ylim(trig.cos.min() * 1.1, trig.cos.max() * 1.1)
ax.set_xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
ax.set_yticks([-1, 0, +1])
ax.set_xticklabels([r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$'])
ax.set_yticklabels([r'$-1$', r'$0$', r'$+1$'])
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
### New code
ax.legend(loc='upper left')
### End of new code
plt.show()
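If the column names are not what you want to show in the legend, the label keyword argument of the plot call overrides them. A minimal sketch, where the label texts "cosine" and "sine" are just examples:

```python
import matplotlib.pyplot as plt
import numpy as np
from pandas import DataFrame

X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
trig = DataFrame(index=X, data={'cos': np.cos(X), 'sin': np.sin(X)})

fig, ax = plt.subplots()
# label= replaces the column name in the legend box
trig["cos"].plot(color="blue", label="cosine", ax=ax)
trig["sin"].plot(color="red", label="sine", ax=ax)
legend = ax.legend(loc='upper left')

print([text.get_text() for text in legend.get_texts()])
```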
Let’s annotate some interesting points using the annotate command. We chose the $\frac{2}{3}\pi$ value and we want to annotate both the sine and the cosine. We’ll first draw a marker on the curve as well as a straight dotted line. Then, we’ll use the annotate command to display some text with an arrow.
fig, ax = plt.subplots()
trig["cos"].plot(color="blue", linewidth=2.5, linestyle="-", ax=ax)
trig["sin"].plot(color="red", linewidth=2.5, linestyle="-", ax=ax)
ax.set_xlim(trig.index.min() * 1.1, trig.index.max() * 1.1)
ax.set_ylim(trig.cos.min() * 1.1, trig.cos.max() * 1.1)
ax.set_xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
ax.set_yticks([-1, 0, +1])
ax.set_xticklabels([r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$'])
ax.set_yticklabels([r'$-1$', r'$0$', r'$+1$'])
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
ax.legend(loc='upper left')
### New code
t = 2 * np.pi / 3
ax.plot([t, t], [0, np.cos(t)], color='blue', linewidth=2.5, linestyle="--")
ax.scatter([t, ], [np.cos(t), ], 50, color='blue')
ax.annotate(r'$\cos(\frac{2\pi}{3})=-\frac{1}{2}$',
xy=(t, np.cos(t)), xycoords='data',
xytext=(-90, -50), textcoords='offset points', fontsize=16,
arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
ax.plot([t, t],[0, np.sin(t)], color='red', linewidth=2.5, linestyle="--")
ax.scatter([t, ],[np.sin(t), ], 50, color='red')
ax.annotate(r'$\sin(\frac{2\pi}{3})=\frac{\sqrt{3}}{2}$',
xy=(t, np.sin(t)), xycoords='data',
xytext=(+10, +30), textcoords='offset points', fontsize=16,
arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))
### End of new code
plt.show()
Now that you know how to make different modifications to your plots, we can make some of these changes to our temperature data.
You can take any plot you've created within Jupyter and save it to a file on disk using the plt.savefig() function, or the equivalent savefig() method of the Figure object as in the code below. You give it the name of the file to create and it will use whatever format is specified by the file extension. Note that you must save the figure before you show() it, otherwise it will not create the file correctly.
fig, ax = plt.subplots()
trig.plot(ax=ax)
fig.savefig('my_fig.png')
You can then display the figure in a Markdown cell in Jupyter with ![](my_fig.png)
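As a sketch of how the file name controls the output format, the example below saves the same figure as both PNG and PDF. The temporary directory is only there so the example does not leave files behind, and the dpi and bbox_inches values are illustrative choices rather than required settings.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

X = np.linspace(-np.pi, np.pi, 256)
fig, ax = plt.subplots()
ax.plot(X, np.sin(X))

# Write into a temporary directory so nothing is left in the working folder
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'my_fig.png')
pdf_path = os.path.join(out_dir, 'my_fig.pdf')

# The extension picks the format; dpi sets the resolution of the PNG and
# bbox_inches='tight' trims the empty border around the plot
fig.savefig(png_path, dpi=150, bbox_inches='tight')
fig.savefig(pdf_path, bbox_inches='tight')

print(sorted(os.listdir(out_dir)))
```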
warm_winter_year = df['JAN'].idxmax()
warm_winter_temp = df['JAN'].max()
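The two lines above find the year with the warmest January and its temperature. One possible use for them, sketched here under the assumption that we only have the five JAN values shown by df.head(), is to point at that year with the annotate command from the trigonometry example. The annotation text and offsets are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The JAN values from the five rows printed by df.head()
jan = pd.Series(
    [3.0, 0.0, 5.0, 5.0, 1.0],
    index=[1659, 1660, 1661, 1662, 1663],
    name='JAN',
)

warm_winter_year = jan.idxmax()  # index label of the first maximum
warm_winter_temp = jan.max()

fig, ax = plt.subplots()
jan.plot(ax=ax)
ax.set_ylabel(r'Temperature ($^\circ$C)')
ax.annotate('Warmest January',
            xy=(warm_winter_year, warm_winter_temp), xycoords='data',
            xytext=(-60, -40), textcoords='offset points',
            arrowprops=dict(arrowstyle="->"))
# plt.show() would now display the figure
```

When several years share the maximum, idxmax() returns the first of them.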
Of course, Matplotlib can plot more than just line graphs. One of the other most common plot types is a bar chart. Let's work towards plotting a bar chart of the average temperature per decade.
Let's start by adding a new column to the data frame which represents the decade. We create it by taking the index (which is a list of years), converting each element to a string and then replacing the fourth character with a '0'
.
years = Series(df.index, index=df.index).apply(str)
decade = years.apply(lambda x: x[:3]+'0')
df['decade'] = decade
df.head()
| | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | YEAR | decade |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1659 | 3.0 | 4.0 | 6.0 | 7.0 | 11.0 | 13.0 | 16.0 | 16.0 | 13.0 | 10.0 | 5.0 | 2.0 | 8.87 | 1650 |
1660 | 0.0 | 4.0 | 6.0 | 9.0 | 11.0 | 14.0 | 15.0 | 16.0 | 13.0 | 10.0 | 6.0 | 5.0 | 9.10 | 1660 |
1661 | 5.0 | 5.0 | 6.0 | 8.0 | 11.0 | 14.0 | 15.0 | 15.0 | 13.0 | 11.0 | 8.0 | 6.0 | 9.78 | 1660 |
1662 | 5.0 | 6.0 | 6.0 | 8.0 | 11.0 | 15.0 | 15.0 | 15.0 | 13.0 | 11.0 | 6.0 | 3.0 | 9.52 | 1660 |
1663 | 1.0 | 1.0 | 5.0 | 7.0 | 10.0 | 14.0 | 15.0 | 15.0 | 13.0 | 10.0 | 7.0 | 5.0 | 8.63 | 1660 |
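The string manipulation above is one way to get the decade. Because the index holds integer years, integer division is another; the sketch below compares the two on a handful of example years (the names decade_int and decade_str are only for this illustration). The integer version produces numbers rather than strings.

```python
import pandas as pd

years = pd.Index([1659, 1660, 1661, 1669, 1670])

# Integer arithmetic: drop the last digit, then scale back up
decade_int = (years // 10) * 10

# The string approach used above, for comparison
decade_str = pd.Series(years, index=years).apply(str).apply(lambda x: x[:3] + '0')

print(list(decade_int))
print(list(decade_str))
```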
Once we have our decade column, we can use the pandas groupby() method to gather our data by decade and then aggregate it by taking the mean of each decade.
by_decade = df.groupby('decade')
agg = by_decade.aggregate('mean')  # pass the name; newer pandas warns when given np.mean
agg.head()
decade | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC | YEAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1650 | 3.00 | 4.00 | 6.00 | 7.00 | 11.00 | 13.00 | 16.00 | 16.00 | 13.00 | 10.00 | 5.00 | 2.00 | 8.870 |
1660 | 2.60 | 4.00 | 5.10 | 7.70 | 10.60 | 14.50 | 16.00 | 15.70 | 13.30 | 10.00 | 6.30 | 3.80 | 9.157 |
1670 | 3.25 | 2.35 | 4.50 | 7.25 | 11.05 | 14.40 | 15.80 | 15.25 | 12.40 | 8.95 | 5.20 | 2.45 | 8.607 |
1680 | 2.50 | 2.80 | 4.80 | 7.40 | 11.45 | 14.00 | 15.45 | 14.90 | 12.70 | 9.55 | 5.45 | 4.05 | 8.785 |
1690 | 1.89 | 2.49 | 3.99 | 6.79 | 9.60 | 13.44 | 15.27 | 14.65 | 11.93 | 8.64 | 5.26 | 3.31 | 8.134 |
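To make the split and aggregate steps concrete, here is a small sketch on just the five years shown by df.head(), keeping only the YEAR column. The name `small` is made up for this example, and because it holds only four years of the 1660s its mean for that decade differs from the full table above.

```python
import pandas as pd

# YEAR and decade values for the five rows printed by df.head()
small = pd.DataFrame(
    {
        'YEAR': [8.87, 9.10, 9.78, 9.52, 8.63],
        'decade': ['1650', '1660', '1660', '1660', '1660'],
    },
    index=[1659, 1660, 1661, 1662, 1663],
)

by_decade = small.groupby('decade')

# aggregate('mean') and the mean() shortcut give the same result
agg_small = by_decade.aggregate('mean')
same = by_decade.mean()

print(agg_small)
```

The 1650 group contains a single year, so its mean is just that year's value, while the 1660 group averages the remaining four.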
At this point, agg
is a standard Pandas DataFrame
so we can plot it like any other, by putting .bar
after the plot
call:
ax = agg["YEAR"].plot.bar()
ax.set_ylabel(r'Temperature ($^\circ$C)')
plt.show()
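Bar charts accept the same kind of customisation as line plots. The sketch below uses the first five decade means of the YEAR column copied from agg.head() above; the colour, label rotation and y-axis limits are illustrative choices. Narrowing the y-axis makes the small differences between decades visible, at the cost of bars that no longer start at zero.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First five decade means of the YEAR column, copied from agg.head()
decade_means = pd.Series(
    [8.870, 9.157, 8.607, 8.785, 8.134],
    index=['1650', '1660', '1670', '1680', '1690'],
    name='YEAR',
)

ax = decade_means.plot.bar(color='steelblue', rot=45)
ax.set_ylim(7, 10)  # zoom in on the range the data occupies
ax.set_xlabel('Decade')
ax.set_ylabel(r'Temperature ($^\circ$C)')
# plt.show() would now display the figure
```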
Continue to the [next section](numpy arrays.ipynb).