Pandas is a Python library that helps you load, analyze and visualize data. It plays especially well with Jupyter Notebook, taking advantage of the notebook format to display data in easy-to-read ways. The data types that come with Pandas are kind of like super-charged lists and dictionaries, with built-in functionality for common tasks in statistics and data analysis that have the potential to run faster than their equivalents written with more familiar Python data types.
The purpose of this tutorial is to give you a taste for how Pandas works. By the end of the tutorial, you'll be able to use Pandas to load some data from a CSV file into a Pandas data frame and use Pandas' data visualization functions to draw a handful of simple graphs. The tutorial is aimed at people who know at least a little bit about how regular Python data types (like lists and dictionaries) work.
To fully take advantage of the capabilities of Pandas, you need to import not just Pandas but a handful of other libraries:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 25)
plt.style.use('ggplot')
plt.rcParams["figure.figsize"] = (10, 4)
The first line is what's known as an IPython magic; it tells the notebook server to display plots inline. The next three lines import Pandas (using the as clause to shorten its name to pd) and two other libraries, numpy and matplotlib, in case we need them. The final two lines set some options to make our plots look prettier.
Whenever you start a new notebook and want to use Pandas, it's a good idea to just copy and paste the code from that cell and make it the first cell in your own notebook.
Let's look at a couple of Pandas data types.
The Series data type in Pandas is like a Python list, in that it stores a sequence of values. But it has a few extra goodies that make it appealing for data analysis. One way to create a Series is to pass a Python list to pd.Series():
s = pd.Series([5, 5, 5, 10, 10, 12, 15, 15, 23, 27, 30])
s
0 5 1 5 2 5 3 10 4 10 5 12 6 15 7 15 8 23 9 27 10 30 dtype: int64
Unlike a plain Python list, a Series supports arithmetic operations directly. So, for example, you can multiply an entire Series by 0.5:
s * 0.5
0 2.5 1 2.5 2 2.5 3 5.0 4 5.0 5 6.0 6 7.5 7 7.5 8 11.5 9 13.5 10 15.0 dtype: float64
... or create a Series with 100 added to each entry from the original Series:
s + 100
0 105 1 105 2 105 3 110 4 110 5 112 6 115 7 115 8 123 9 127 10 130 dtype: int64
Series support a variety of statistical operations through methods. To get the smallest value in a Series:
s.min()
5
The greatest value:
s.max()
30
The arithmetic mean:
s.mean()
14.272727272727273
Various other operations are supported as well:
s.median()
12.0
s.mode()
0 5 dtype: int64
s.std() # standard deviation
8.866689450870703
s.quantile(0.75) # 75th percentile
19.0
The .describe() method gives you some quick insight into the statistical properties of the series as a whole:
s.describe()
count 11.000000 mean 14.272727 std 8.866689 min 5.000000 25% 7.500000 50% 12.000000 75% 19.000000 max 30.000000 dtype: float64
Every Series object has a .plot() method that will display a plot of the data contained in the series. Very easy!
s.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1175014e0>
By default, you get a line plot, but the .plot() method can take a named parameter kind that allows you to specify different types of plots. There's a full list in the Pandas documentation, but just to demonstrate, here's a bar graph from our test series:
s.plot(kind="bar")
<matplotlib.axes._subplots.AxesSubplot at 0x12b9567b8>
A horizontal bar chart:
s.plot(kind="barh")
<matplotlib.axes._subplots.AxesSubplot at 0x12bdd8898>
A pie chart:
s.plot(kind="pie")
<matplotlib.axes._subplots.AxesSubplot at 0x12bf8e4a8>
The default behavior of a Series is to use integers as indices: if you initialize a Series with just a list, then the indices start at 0 and go up to the length of the list (minus 1). But the indices of a Series can be essentially any data type. You can specify the values and indices in a Series by passing them as a dictionary, or as two lists (values first, indices second):
planet_moons = pd.Series(
[0, 0, 1, 2, 69, 62, 27, 14],
['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune'])
planet_moons
Mercury 0 Venus 0 Earth 1 Mars 2 Jupiter 69 Saturn 62 Uranus 27 Neptune 14 dtype: int64
All the various statistical operations still work, e.g.:
planet_moons.mean()
21.875
Plots work as well:
planet_moons.plot(kind="barh")
<matplotlib.axes._subplots.AxesSubplot at 0x12c109128>
Even if indices are integers, they don't have to be sequential integers. A good example of this is what happens when you use the .value_counts() method, which returns a new Series with totals for each unique value (like a Counter object):
s_counts = s.value_counts()
s_counts
5 3 15 2 10 2 30 1 12 1 27 1 23 1 dtype: int64
We'll get back to why this is important in a second...
To get a particular value from a Series, you can use the square bracket syntax familiar to you from Python lists and dictionaries:
s[0]
5
s[4]
10
Using the slice operator gives you a new Series representing the corresponding slice:
s[1:4]
1 5 2 5 3 10 dtype: int64
This syntax works for Series with non-integer indices as well:
planet_moons["Neptune"]
14
Somewhat weirdly, you can use slice syntax with non-integer indices. This is something you can do with a Pandas Series that you definitely can't do with a regular list or dictionary:
planet_moons["Mercury":"Jupiter"]
Mercury 0 Venus 0 Earth 1 Mars 2 Jupiter 69 dtype: int64
Even Series with non-integer indices let you use numerical indices, which refer to the item at that position in the series' order:
planet_moons[5]
62
planet_moons[:4]
Mercury 0 Venus 0 Earth 1 Mars 2 dtype: int64
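One subtlety worth noting: slicing by label includes the endpoint, while slicing by position excludes it (as with regular Python lists). A quick sketch using the same series:

```python
import pandas as pd

planet_moons = pd.Series(
    [0, 0, 1, 2, 69, 62, 27, 14],
    ['Mercury', 'Venus', 'Earth', 'Mars',
     'Jupiter', 'Saturn', 'Uranus', 'Neptune'])

by_label = planet_moons["Mercury":"Mars"]   # label slice: "Mars" IS included
by_position = planet_moons[0:3]             # positional slice: position 3 is NOT
print(len(by_label), len(by_position))      # 4 3
```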
Where this gets even weirder is with Series that have non-consecutive integer indices. Recall the result of .value_counts() for our original Series s:
s_counts
5 3 15 2 10 2 30 1 12 1 27 1 23 1 dtype: int64
It's not clear what the expression s_counts[5] should evaluate to: the item at numerical position 5 in the Series, or the value for the index label 5. Let's see what happens:
s_counts[5]
3
It looks like the Series gives us the value for the index label 5 (i.e., not the value for the label 27, which is the one at numerical position 5). Weird! To avoid this ambiguity, you can use the .iloc attribute, which always uses numerical position:
s_counts.iloc[5]
1
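.iloc has a label-based counterpart, .loc, which always treats whatever is in the brackets as an index label. Using both on s_counts makes the distinction explicit; a small sketch:

```python
import pandas as pd

s = pd.Series([5, 5, 5, 10, 10, 12, 15, 15, 23, 27, 30])
s_counts = s.value_counts()

by_label = s_counts.loc[5]      # label-based: the count for the value 5
by_position = s_counts.iloc[5]  # position-based: the sixth item in order
print(by_label, by_position)    # 3 1
```

Using .loc and .iloc consistently makes your intent clear and sidesteps the ambiguity of bare square brackets.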
Another way to get portions of a Series is to "select" items from it. Series values support an unusual syntax where you can put a list inside of the square bracket indexing syntax, and in that list you can specify which fields in particular you want. So for example:
planet_moons[ ["Jupiter", "Saturn"] ]
Jupiter 69 Saturn 62 dtype: int64
Very weird, right? But it's also quite handy in certain circumstances. You can also pass a list of Boolean values (i.e., True or False), in which case you'll receive a new Series that only has values for the items in the original series that correspond with a True value in the list. That's confusing to explain, but easy to understand if you see it in action:
planet_moons
Mercury 0 Venus 0 Earth 1 Mars 2 Jupiter 69 Saturn 62 Uranus 27 Neptune 14 dtype: int64
planet_moons[ [False, False, False, True, False, False, False, True] ]
Mars 2 Neptune 14 dtype: int64
This feature is of limited utility on its own, but there's another bit of functionality that the Series value gives you that works alongside it. The same way that you can multiply a Series, or add a constant to a Series, you can also use a relational operator on a Series. When you do so, you get back a Series that has True for every item that passed the test and False for every item that failed. For example:
planet_moons < 20
Mercury True Venus True Earth True Mars True Jupiter False Saturn False Uranus False Neptune True dtype: bool
If you combine these two features, you can write an expression that returns a Series with only those items that meet particular criteria. For example, the following expression gives us only those planets that have fewer than twenty known moons:
planet_moons[planet_moons < 20]
Mercury 0 Venus 0 Earth 1 Mars 2 Neptune 14 dtype: int64
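You can also combine several tests with the element-wise operators & (and) and | (or); note that each comparison needs its own parentheses, and that Python's plain and/or won't work here. A small sketch:

```python
import pandas as pd

planet_moons = pd.Series(
    [0, 0, 1, 2, 69, 62, 27, 14],
    ['Mercury', 'Venus', 'Earth', 'Mars',
     'Jupiter', 'Saturn', 'Uranus', 'Neptune'])

# planets with at least one moon, but fewer than twenty
some_moons = planet_moons[(planet_moons > 0) & (planet_moons < 20)]
print(some_moons.index.tolist())   # ['Earth', 'Mars', 'Neptune']
```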
I wanted to discuss the Series data type because you'll see it again and again when you're working with Pandas, and it's important to understand what it is and what it can do. But for the most part when you're working with Pandas, you'll be working with a data type called the DataFrame. A DataFrame is sort of like a spreadsheet, consisting of rows and columns. As with Series, the rows and columns can have labels (i.e., the items have names, like they do in the planet_moons Series above) and can also be indexed purely by position.
You can create a DataFrame by passing in a dictionary, where the keys of the dictionary are the column labels and the values are lists of individual values for each row. Here I'm creating a very simple DataFrame for the longest rivers in the world, including their names, their lengths (in kilometers), their drainage areas (in square kilometers) and their average discharges (in cubic meters per second):
river_data = {
"Name": ["Amazon", "Nile", "Yangtze", "Mississippi"],
"Length": [6992, 6835, 6300, 6275],
"Drainage area": [7050000, 3254555, 1800000, 2980000],
"Discharge": [209000, 2800, 31900, 16200]
}
river_df = pd.DataFrame(river_data)
Evaluating the DataFrame in Jupyter Notebook displays the data in a nice, clean HTML table:
river_df
Discharge | Drainage area | Length | Name | |
---|---|---|---|---|
0 | 209000 | 7050000 | 6992 | Amazon |
1 | 2800 | 3254555 | 6835 | Nile |
2 | 31900 | 1800000 | 6300 | Yangtze |
3 | 16200 | 2980000 | 6275 | Mississippi |
You'll notice that the order of the columns is (probably) not the order we typed them in. That's because when you initialize a DataFrame with a dictionary, older versions of Pandas sort the keys alphabetically by default. (Recent versions preserve the dictionary's order, but being explicit never hurts.) If you want to specify a particular order, use the columns named parameter, with a list of the column labels in the order you want:
river_df = pd.DataFrame(river_data, columns=["Name", "Length", "Drainage area", "Discharge"])
river_df
Name | Length | Drainage area | Discharge | |
---|---|---|---|---|
0 | Amazon | 6992 | 7050000 | 209000 |
1 | Nile | 6835 | 3254555 | 2800 |
2 | Yangtze | 6300 | 1800000 | 31900 |
3 | Mississippi | 6275 | 2980000 | 16200 |
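A dictionary of lists isn't the only way to build a DataFrame. If your data arrives row by row, you can also pass a list of dictionaries, one per row. A sketch using a subset of the river data:

```python
import pandas as pd

# the same idea, built row by row from a list of dictionaries
rows = [
    {"Name": "Amazon",      "Length": 6992},
    {"Name": "Nile",        "Length": 6835},
    {"Name": "Yangtze",     "Length": 6300},
    {"Name": "Mississippi", "Length": 6275},
]
rivers = pd.DataFrame(rows)
print(rivers.shape)   # (4, 2): four rows, two columns
```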
Just as with a Series, you can plot the data in a DataFrame right away using the .plot() method. By default, all columns are plotted and labelled by their row number, which isn't super useful, but it is easy:
river_df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x12b6586a0>
That graph doesn't make any sense: the columns are on wildly different scales, and the X axis is just the row number, which tells us nothing about the rivers.
To fix this, we can pass a couple of parameters to the .plot() method:
- the y named parameter, which specifies the column (or columns) to plot; and
- the x named parameter, which specifies the column whose values should label the X axis.
Combining these, we can get a nice bar chart of our rivers' discharges, showing that the amount of water put out by the Amazon is truly tremendous:
river_df.plot(kind="bar", x="Name", y="Discharge")
<matplotlib.axes._subplots.AxesSubplot at 0x12c2dac18>
When you're working with DataFrames, sometimes you want to isolate an individual row or column as a Series. In other cases, you want to construct a new DataFrame based on a subset of rows or columns from the original DataFrame. Or, you might just want to get a single value at the intersection of a row and column. In other words, there are three different operations, which we can think about in terms of the types involved:
- DataFrame → Series (i.e., get a column or row)
- DataFrame → DataFrame (i.e., filter a DataFrame based on rows or columns that meet particular criteria)
- DataFrame → single value (i.e., get a number, string, etc. from a particular row/column intersection)
We'll talk about these one by one below.
Getting a Series from a column of a DataFrame is easy: just use the label of the column in square brackets after the DataFrame:
river_df["Length"]
0 6992 1 6835 2 6300 3 6275 Name: Length, dtype: int64
With the resulting series, you can do any of the statistical operations discussed earlier for Series:
river_df["Length"].max()
6992
You can even plot the series, though it's not terribly useful because we're missing the names of the rivers:
river_df["Length"].plot(kind="bar")
<matplotlib.axes._subplots.AxesSubplot at 0x12c5d06d8>
Getting an individual row as a Series is also possible. Just use the .iloc[] attribute with the numerical index of the row inside the brackets:
river_df.iloc[2]
Name Yangtze Length 6300 Drainage area 1800000 Discharge 31900 Name: 2, dtype: object
You can use the indexing syntax to give you a new DataFrame that includes only particular columns and rows from the original DataFrame. If you wanted a new DataFrame that only includes particular columns, then pass a list of the columns you want inside the square bracket indexing syntax:
name_length_df = river_df[["Name", "Length"]]
type(name_length_df)
pandas.core.frame.DataFrame
name_length_df
Name | Length | |
---|---|---|
0 | Amazon | 6992 |
1 | Nile | 6835 |
2 | Yangtze | 6300 |
3 | Mississippi | 6275 |
Weirdly, you can use this syntax to get a new DataFrame with just a single column, which is different from a Series object:
river_df[["Name"]]
Name | |
---|---|
0 | Amazon |
1 | Nile |
2 | Yangtze |
3 | Mississippi |
river_df["Name"]
0 Amazon 1 Nile 2 Yangtze 3 Mississippi Name: Name, dtype: object
To get a new DataFrame with just a subset of rows from the original DataFrame, you can use slice syntax with either row labels or numbers. So to get the second and third rows:
a_few_rivers_df = river_df[1:3]
type(a_few_rivers_df)
pandas.core.frame.DataFrame
a_few_rivers_df
Name | Length | Drainage area | Discharge | |
---|---|---|---|---|
1 | Nile | 6835 | 3254555 | 2800 |
2 | Yangtze | 6300 | 1800000 | 31900 |
Just as with Series values, you can use a list of Boolean (i.e., True or False) values to select particular rows from a DataFrame:
river_df[ [True, False, False, True] ]
Name | Length | Drainage area | Discharge | |
---|---|---|---|---|
0 | Amazon | 6992 | 7050000 | 209000 |
3 | Mississippi | 6275 | 2980000 | 16200 |
You can get a Boolean Series for any column of a DataFrame by using square brackets to get the column and then applying a comparison operator:
river_df["Discharge"] > 30000
0 True 1 False 2 True 3 False Name: Discharge, dtype: bool
Combine the two, and you can write an expression that creates a new DataFrame with only the rows from the original DataFrame that match a particular criterion:
river_df[river_df["Discharge"] > 30000]
Name | Length | Drainage area | Discharge | |
---|---|---|---|---|
0 | Amazon | 6992 | 7050000 | 209000 |
2 | Yangtze | 6300 | 1800000 | 31900 |
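As with Series, you can combine several row tests with & and | (each comparison wrapped in parentheses). For example, here's a sketch selecting rivers that are both long and high-discharge:

```python
import pandas as pd

river_data = {
    "Name": ["Amazon", "Nile", "Yangtze", "Mississippi"],
    "Length": [6992, 6835, 6300, 6275],
    "Drainage area": [7050000, 3254555, 1800000, 2980000],
    "Discharge": [209000, 2800, 31900, 16200],
}
river_df = pd.DataFrame(river_data,
                        columns=["Name", "Length", "Drainage area", "Discharge"])

# rivers longer than 6500 km AND with discharge above 10000 m^3/s
big = river_df[(river_df["Length"] > 6500) & (river_df["Discharge"] > 10000)]
print(big["Name"].tolist())   # ['Amazon']
```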
Okay, enough playtime, let's work with some real data! Let's load up this Beijing PM2.5 data set. Download the CSV file using this link and save it in the same folder as your Jupyter Notebook. The data describes several years of hourly weather and pollution readings in Beijing. The people who produced the data also wrote a paper on it:
The paper has some technical content, but overall it's very readable and giving it a skim will help you understand the data a bit better.
Pandas makes it very easy to use data in CSV format. Just use the read_csv() function and pass it the filename of your data:
df = pd.read_csv("./PRSA_data_2010.1.1-2014.12.31.csv")
Pandas does a good job of guessing the correct data types for the values in the CSV file. (If Pandas gets it wrong, though, don't lose hope: here's a good overview of strategies you can use to clean it up.)
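If the guessing does go wrong, read_csv() has parameters that let you steer it; na_values, for instance, lists extra strings to treat as missing. Here's a minimal sketch using io.StringIO to stand in for a file on disk (the column names here are made up for illustration):

```python
import io
import pandas as pd

# io.StringIO lets read_csv treat an in-memory string as a file,
# which is handy for experiments like this one
csv_text = "year,temp\n2010,-11\n2010,missing\n2011,-9\n"

# na_values tells Pandas which extra strings to treat as NaN
demo = pd.read_csv(io.StringIO(csv_text), na_values=["missing"])
print(demo["temp"].isna().sum())   # 1
```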
Let's take a look at the DataFrame we ended up with:
df
No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2010 | 1 | 1 | 0 | NaN | -21 | -11.0 | 1021.0 | NW | 1.79 | 0 | 0 |
1 | 2 | 2010 | 1 | 1 | 1 | NaN | -21 | -12.0 | 1020.0 | NW | 4.92 | 0 | 0 |
2 | 3 | 2010 | 1 | 1 | 2 | NaN | -21 | -11.0 | 1019.0 | NW | 6.71 | 0 | 0 |
3 | 4 | 2010 | 1 | 1 | 3 | NaN | -21 | -14.0 | 1019.0 | NW | 9.84 | 0 | 0 |
4 | 5 | 2010 | 1 | 1 | 4 | NaN | -20 | -12.0 | 1018.0 | NW | 12.97 | 0 | 0 |
5 | 6 | 2010 | 1 | 1 | 5 | NaN | -19 | -10.0 | 1017.0 | NW | 16.10 | 0 | 0 |
6 | 7 | 2010 | 1 | 1 | 6 | NaN | -19 | -9.0 | 1017.0 | NW | 19.23 | 0 | 0 |
7 | 8 | 2010 | 1 | 1 | 7 | NaN | -19 | -9.0 | 1017.0 | NW | 21.02 | 0 | 0 |
8 | 9 | 2010 | 1 | 1 | 8 | NaN | -19 | -9.0 | 1017.0 | NW | 24.15 | 0 | 0 |
9 | 10 | 2010 | 1 | 1 | 9 | NaN | -20 | -8.0 | 1017.0 | NW | 27.28 | 0 | 0 |
10 | 11 | 2010 | 1 | 1 | 10 | NaN | -19 | -7.0 | 1017.0 | NW | 31.30 | 0 | 0 |
11 | 12 | 2010 | 1 | 1 | 11 | NaN | -18 | -5.0 | 1017.0 | NW | 34.43 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
43812 | 43813 | 2014 | 12 | 31 | 12 | 17.0 | -22 | 0.0 | 1033.0 | NW | 177.44 | 0 | 0 |
43813 | 43814 | 2014 | 12 | 31 | 13 | 11.0 | -27 | 0.0 | 1032.0 | NW | 186.38 | 0 | 0 |
43814 | 43815 | 2014 | 12 | 31 | 14 | 9.0 | -27 | 1.0 | 1032.0 | NW | 196.21 | 0 | 0 |
43815 | 43816 | 2014 | 12 | 31 | 15 | 11.0 | -26 | 1.0 | 1032.0 | NW | 205.15 | 0 | 0 |
43816 | 43817 | 2014 | 12 | 31 | 16 | 8.0 | -23 | 0.0 | 1032.0 | NW | 214.09 | 0 | 0 |
43817 | 43818 | 2014 | 12 | 31 | 17 | 9.0 | -22 | -1.0 | 1033.0 | NW | 221.24 | 0 | 0 |
43818 | 43819 | 2014 | 12 | 31 | 18 | 10.0 | -22 | -2.0 | 1033.0 | NW | 226.16 | 0 | 0 |
43819 | 43820 | 2014 | 12 | 31 | 19 | 8.0 | -23 | -2.0 | 1034.0 | NW | 231.97 | 0 | 0 |
43820 | 43821 | 2014 | 12 | 31 | 20 | 10.0 | -22 | -3.0 | 1034.0 | NW | 237.78 | 0 | 0 |
43821 | 43822 | 2014 | 12 | 31 | 21 | 10.0 | -22 | -3.0 | 1034.0 | NW | 242.70 | 0 | 0 |
43822 | 43823 | 2014 | 12 | 31 | 22 | 8.0 | -22 | -4.0 | 1034.0 | NW | 246.72 | 0 | 0 |
43823 | 43824 | 2014 | 12 | 31 | 23 | 12.0 | -21 | -3.0 | 1034.0 | NW | 249.85 | 0 | 0 |
43824 rows × 13 columns
You can see that because there are so many rows in this DataFrame (43,824!), Pandas shows only a subset. But it's enough for us to get an idea of what the DataFrame looks like.
The .info() method shows us the columns and their data types:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 43824 entries, 0 to 43823 Data columns (total 13 columns): No 43824 non-null int64 year 43824 non-null int64 month 43824 non-null int64 day 43824 non-null int64 hour 43824 non-null int64 pm2.5 41757 non-null float64 DEWP 43824 non-null int64 TEMP 43824 non-null float64 PRES 43824 non-null float64 cbwd 43824 non-null object Iws 43824 non-null float64 Is 43824 non-null int64 Ir 43824 non-null int64 dtypes: float64(4), int64(8), object(1) memory usage: 4.3+ MB
The int64, float64, etc. data types are specific to Pandas, and are not the same thing as their regular Python equivalents. (Actually, they're specific to Numpy, but that's a different story.)
Of course, Pandas can't tell us what the data in these columns mean. For that, we need to consult the documentation that accompanies the data. Copying and pasting from the web page linked to above, here are the meanings for each field:
- No: row number
- year, month, day, hour: when the reading was taken
- pm2.5: PM2.5 concentration (µg/m³)
- DEWP: dew point (°C)
- TEMP: temperature (°C)
- PRES: pressure (hPa)
- cbwd: combined wind direction
- Iws: cumulated wind speed (m/s)
- Is: cumulated hours of snow
- Ir: cumulated hours of rain
Note that these aren't universal names for these fields. You can't download a different data set from another set of researchers that records similar phenomena and expect that file to use (e.g.) TEMP as the column name for temperature.
As with Series in general, we can grab one of these columns and use .describe() to get a general overview of what data it contains:
df["pm2.5"].describe()
count 41757.000000 mean 98.613215 std 92.050387 min 0.000000 25% 29.000000 50% 72.000000 75% 137.000000 max 994.000000 Name: pm2.5, dtype: float64
This tells us, e.g., that the mean PM2.5 concentration in Beijing over the five-year period of the data was about 98.6, while the median was 72 (half of the hourly readings were above that value, half below). The highest PM2.5 recorded in the data was 994.
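Notice that the count line (41,757) is lower than the total number of rows (43,824): the pm2.5 column has missing readings, which Pandas represents as NaN and skips when computing statistics. You can count them with .isna(); a small sketch with a stand-in series, so it runs without the CSV file:

```python
import numpy as np
import pandas as pd

# a stand-in for a column with gaps: np.nan marks missing readings
readings = pd.Series([72.0, np.nan, 137.0, np.nan, 29.0])

missing = readings.isna().sum()   # Booleans sum as 0/1, so this counts NaNs
present = readings.count()        # .count() skips NaN, just like .describe()
print(missing, present)           # 2 3
```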
Looking at the plot for the pm2.5 column, you can kind of make out yearly cycles in PM2.5 concentration:
df.plot(y="pm2.5")
<matplotlib.axes._subplots.AxesSubplot at 0x128a3eb00>
We can do the same analysis with the other fields. For example, here's a plot of temperature readings for each hour:
df.plot(y="TEMP")
<matplotlib.axes._subplots.AxesSubplot at 0x12c8b3240>
Plotting these two together shows an interesting pattern, maybe:
df.plot(y=["pm2.5", "TEMP"])
<matplotlib.axes._subplots.AxesSubplot at 0x12f6d39e8>
It looks like when temperature dips, pm2.5 spikes! (There are various statistical ways to confirm this suspicion, but for now we're going to stick with drawing the graphs.)
A histogram is a kind of plot that helps you understand how data are distributed. Understanding distribution helps you better reason about how often particular values are found in your data, and helps you easily formulate hypotheses about the phenomena your data is tracking. Let's look at a histogram of temperature data in our Beijing data set, using the hist plot kind:
df.plot(kind="hist", y="TEMP")
<matplotlib.axes._subplots.AxesSubplot at 0x12eddad68>
Each bar in this graph corresponds to a "bin" of values surrounding the value on the X axis. When drawing a histogram, Pandas looks at each item in the data and puts it in the bin whose range contains it. So, for example, the graph above tells us that there are a lot of temperature readings (~8000) around 20 degrees C, but very few (fewer than 300) around 40 degrees C. You can increase the "resolution" of the histogram by providing a bins named parameter:
df.plot(kind="hist", y="TEMP", bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x13154f518>
From this graph, we might hypothesize that Beijing temperatures mostly cluster in either the 20 to 30 degrees C range or the -5 to 5 degrees C range. Temperatures above 40 degrees C or below -20 degrees C are rare. The histogram for temperatures looks very different from the histogram for PM2.5:
df.plot(kind="hist", y="pm2.5", bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x12f5f4d30>
This histogram shows that while there are a number of outliers, by far most of the PM2.5 readings are in the 0–200 range.
A scatter plot is an easy way to check whether two columns in your data set are somehow related. In a scatter plot, you select two columns, and every row in the data set becomes a point in a two-dimensional space, based on the values of those two columns in the row. You need to specify both columns using the x and y named parameters. So, for example, here's a scatter plot of temperature and dew point:
df.plot(kind="scatter", x="DEWP", y="TEMP")
<matplotlib.axes._subplots.AxesSubplot at 0x131b36eb8>
Each dot in this scatterplot represents a row from the DataFrame. (Sometimes these dots are so dense that they appear to form solid masses or lines.) This scatter plot shows that as the temperature rises, so does the dew point (as you might expect from the definition of dew point). One way to talk about this relationship is to say that the values in these two columns are correlated.
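If you want a number to go with the picture, the .corr() method computes the Pearson correlation coefficient between two Series (+1 means the columns rise together perfectly, -1 means they move in opposite directions, 0 means no linear relationship). A sketch with made-up stand-in data, since the point here is just the mechanics:

```python
import pandas as pd

# made-up readings in which dew point rises with temperature
demo = pd.DataFrame({
    "TEMP": [-10, -5, 0, 10, 20, 30],
    "DEWP": [-18, -12, -6, 2, 12, 20],
})

r = demo["TEMP"].corr(demo["DEWP"])   # Pearson correlation by default
print(r > 0.9)   # True: these columns are strongly correlated
```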
However, drawing a scatter plot of PM2.5 concentration with the cumulative wind speed shows an inverse relationship:
df.plot(kind="scatter", x="pm2.5", y="Iws")
<matplotlib.axes._subplots.AxesSubplot at 0x1344be0f0>
You might interpret this graph as an indication that in general, as the wind speed goes up, the PM2.5 concentration falls. (This is intuitively true, and the authors of the paper go into a bit more detail about this effect in particular.) A scatter plot of PM2.5 and dew point also shows a correlation:
df.plot(kind="scatter", x="DEWP", y="pm2.5")
<matplotlib.axes._subplots.AxesSubplot at 0x134f78e10>
Again, this is a Pandas tutorial, not a statistics tutorial, so take these characterizations with a grain of salt. My goal here is to show you how histograms and scatter plots are good starting points for getting a "feel" for your data and how the variables might be related.
Let's say we wanted to find out how many readings in the data had a PM2.5 concentration greater than 400. One easy way to do this is to use Boolean indexing, as discussed above. The following expression gives us a Boolean Series, with True values for every row with a PM2.5 greater than 400:
df["pm2.5"] > 400
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False ... 43812 False 43813 False 43814 False 43815 False 43816 False 43817 False 43818 False 43819 False 43820 False 43821 False 43822 False 43823 False Name: pm2.5, Length: 43824, dtype: bool
And then we can use that to subscript the DataFrame, giving us a new DataFrame with only the rows where the condition obtains:
df[df["pm2.5"] > 400]
No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
412 | 413 | 2010 | 1 | 18 | 4 | 407.0 | -13 | -11.0 | 1028.0 | NW | 7.60 | 0 | 0 |
427 | 428 | 2010 | 1 | 18 | 19 | 408.0 | -7 | -3.0 | 1025.0 | SE | 0.89 | 0 | 0 |
428 | 429 | 2010 | 1 | 18 | 20 | 435.0 | -5 | -2.0 | 1026.0 | cv | 0.89 | 0 | 0 |
429 | 430 | 2010 | 1 | 18 | 21 | 403.0 | -5 | -1.0 | 1026.0 | NW | 1.79 | 0 | 0 |
431 | 432 | 2010 | 1 | 18 | 23 | 402.0 | -6 | -1.0 | 1026.0 | cv | 0.89 | 0 | 0 |
450 | 451 | 2010 | 1 | 19 | 18 | 485.0 | -3 | 1.0 | 1018.0 | cv | 0.89 | 0 | 0 |
451 | 452 | 2010 | 1 | 19 | 19 | 426.0 | -3 | 2.0 | 1020.0 | NW | 1.79 | 0 | 0 |
452 | 453 | 2010 | 1 | 19 | 20 | 403.0 | -3 | 1.0 | 1020.0 | NW | 4.92 | 0 | 0 |
1057 | 1058 | 2010 | 2 | 14 | 1 | 980.0 | -14 | -7.0 | 1029.0 | cv | 0.89 | 0 | 0 |
1059 | 1060 | 2010 | 2 | 14 | 3 | 599.0 | -14 | -6.0 | 1030.0 | NW | 6.26 | 0 | 0 |
1876 | 1877 | 2010 | 3 | 20 | 4 | 700.0 | 2 | 6.0 | 1000.0 | NW | 4.92 | 0 | 0 |
1879 | 1880 | 2010 | 3 | 20 | 7 | 473.0 | -2 | 7.0 | 1002.0 | NW | 23.25 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
42210 | 42211 | 2014 | 10 | 25 | 18 | 440.0 | 13 | 15.0 | 1011.0 | cv | 0.89 | 0 | 0 |
42211 | 42212 | 2014 | 10 | 25 | 19 | 449.0 | 12 | 13.0 | 1011.0 | SE | 0.89 | 0 | 0 |
42212 | 42213 | 2014 | 10 | 25 | 20 | 472.0 | 14 | 15.0 | 1012.0 | SE | 2.68 | 0 | 0 |
42213 | 42214 | 2014 | 10 | 25 | 21 | 453.0 | 12 | 13.0 | 1012.0 | cv | 0.89 | 0 | 0 |
42214 | 42215 | 2014 | 10 | 25 | 22 | 413.0 | 12 | 12.0 | 1013.0 | cv | 1.78 | 0 | 0 |
42814 | 42815 | 2014 | 11 | 19 | 22 | 409.0 | -1 | 1.0 | 1020.0 | cv | 1.34 | 0 | 0 |
42981 | 42982 | 2014 | 11 | 26 | 21 | 436.0 | 1 | 3.0 | 1020.0 | cv | 0.45 | 0 | 0 |
42982 | 42983 | 2014 | 11 | 26 | 22 | 502.0 | 1 | 2.0 | 1020.0 | cv | 1.34 | 0 | 0 |
42983 | 42984 | 2014 | 11 | 26 | 23 | 522.0 | 1 | 2.0 | 1020.0 | cv | 2.23 | 0 | 0 |
42984 | 42985 | 2014 | 11 | 27 | 0 | 470.0 | 1 | 2.0 | 1021.0 | cv | 3.12 | 0 | 0 |
42985 | 42986 | 2014 | 11 | 27 | 1 | 439.0 | 0 | 2.0 | 1022.0 | NW | 3.13 | 0 | 0 |
43731 | 43732 | 2014 | 12 | 28 | 3 | 444.0 | -9 | -3.0 | 1023.0 | NW | 6.26 | 0 | 0 |
545 rows × 13 columns
Pandas tells us that there are 545 such rows. With this limited DataFrame, we can still draw plots! So, for example, if we wanted to see a temperature histogram just for these days:
df[df["pm2.5"] > 400].plot(kind="hist", y="TEMP", bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x13696ae10>
Comparing this distribution to the rows where PM2.5 is less than 400:
df[df["pm2.5"] < 400].plot(kind="hist", y="TEMP", bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x136ae00b8>
You can see that the two distributions are quite different, with the temperatures on days with high PM2.5 concentrations being lower on average.
The cbwd column indicates the prevailing wind direction, which the researchers have narrowed down to four distinct values: NE (northeast), NW (northwest), SE (southeast) and "cv" ("calm or variable"). They outline the reasons for recording the data this way in their paper. The values in this column, unlike the values in the other columns, form a discrete set rather than a continuous range of numbers. As such, Pandas will be confused if we ask for a plot:
df["cbwd"].plot()
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-570-995cbf56cb57> in <module>() ----> 1 df["cbwd"].plot() [... stack trace abbreviated ...] TypeError: Empty 'DataFrame': no numeric data to plot
The "no numeric data to plot" error is Pandas saying, "hey you wanted me to draw a graph, but there are no numbers in this field, what gives." Probably the best way to visualize discrete values is by counting them and then drawing a bar graph. As discussed earlier, the .value_counts()
method returns a Series that counts how many times each value occurs in a column:
df["cbwd"].value_counts()
SE 15290 NW 14150 cv 9387 NE 4997 Name: cbwd, dtype: int64
Plotting this data as a bar chart shows us how many times each of these discrete values were recorded:
df["cbwd"].value_counts().plot(kind="barh")
<matplotlib.axes._subplots.AxesSubplot at 0x1366ba400>
The .sort_values() method returns a new DataFrame, sorted by the column (or columns) you name with the by parameter. Sorting by pm2.5 in descending order puts the most polluted hours first; rows with missing pm2.5 readings (NaN) end up at the bottom:
sorted_df = df.sort_values(by=["pm2.5"], ascending=False)
sorted_df
No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
18049 | 18050 | 2012 | 1 | 23 | 1 | 994.0 | -24 | -12.0 | 1032.0 | NW | 4.92 | 0 | 0 |
1057 | 1058 | 2010 | 2 | 14 | 1 | 980.0 | -14 | -7.0 | 1029.0 | cv | 0.89 | 0 | 0 |
18050 | 18051 | 2012 | 1 | 23 | 2 | 972.0 | -24 | -12.0 | 1032.0 | NW | 8.05 | 0 | 0 |
26588 | 26589 | 2013 | 1 | 12 | 20 | 886.0 | -8 | -7.0 | 1023.0 | cv | 1.34 | 0 | 0 |
26590 | 26591 | 2013 | 1 | 12 | 22 | 858.0 | -10 | -9.0 | 1024.0 | cv | 0.89 | 0 | 0 |
26589 | 26590 | 2013 | 1 | 12 | 21 | 852.0 | -9 | -8.0 | 1023.0 | NE | 0.89 | 0 | 0 |
26584 | 26585 | 2013 | 1 | 12 | 16 | 845.0 | -7 | -2.0 | 1021.0 | SE | 8.95 | 0 | 0 |
26587 | 26588 | 2013 | 1 | 12 | 19 | 824.0 | -8 | -7.0 | 1022.0 | cv | 0.89 | 0 | 0 |
26585 | 26586 | 2013 | 1 | 12 | 17 | 810.0 | -7 | -4.0 | 1021.0 | SE | 9.84 | 0 | 0 |
26591 | 26592 | 2013 | 1 | 12 | 23 | 805.0 | -10 | -9.0 | 1024.0 | NW | 1.79 | 0 | 0 |
26583 | 26584 | 2013 | 1 | 12 | 15 | 802.0 | -7 | -1.0 | 1021.0 | SE | 7.16 | 0 | 0 |
1930 | 1931 | 2010 | 3 | 22 | 10 | 784.0 | -8 | 11.0 | 1013.0 | NW | 11.18 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
43281 | 43282 | 2014 | 12 | 9 | 9 | NaN | -8 | -5.0 | 1037.0 | NE | 1.79 | 0 | 0 |
43282 | 43283 | 2014 | 12 | 9 | 10 | NaN | -8 | -4.0 | 1037.0 | cv | 0.89 | 0 | 0 |
43283 | 43284 | 2014 | 12 | 9 | 11 | NaN | -8 | -3.0 | 1036.0 | NE | 1.79 | 0 | 0 |
43544 | 43545 | 2014 | 12 | 20 | 8 | NaN | -18 | -4.0 | 1031.0 | NW | 225.30 | 0 | 0 |
43545 | 43546 | 2014 | 12 | 20 | 9 | NaN | -17 | -4.0 | 1031.0 | NW | 228.43 | 0 | 0 |
43546 | 43547 | 2014 | 12 | 20 | 10 | NaN | -18 | -2.0 | 1031.0 | NW | 233.35 | 0 | 0 |
43547 | 43548 | 2014 | 12 | 20 | 11 | NaN | -17 | -1.0 | 1031.0 | NW | 239.16 | 0 | 0 |
43548 | 43549 | 2014 | 12 | 20 | 12 | NaN | -18 | 0.0 | 1030.0 | NW | 244.97 | 0 | 0 |
43549 | 43550 | 2014 | 12 | 20 | 13 | NaN | -19 | 1.0 | 1029.0 | NW | 249.89 | 0 | 0 |
43550 | 43551 | 2014 | 12 | 20 | 14 | NaN | -20 | 1.0 | 1029.0 | NW | 257.04 | 0 | 0 |
43551 | 43552 | 2014 | 12 | 20 | 15 | NaN | -20 | 2.0 | 1028.0 | NW | 262.85 | 0 | 0 |
43552 | 43553 | 2014 | 12 | 20 | 16 | NaN | -21 | 1.0 | 1028.0 | NW | 270.00 | 0 | 0 |
43824 rows × 13 columns
The .groupby() method groups rows together by the values in a given column, and an aggregation method like .mean() then computes that statistic for every column within each group. Grouping by month lets us compare the average pm2.5 and Iws readings across the months of the year:
monthly_mean_df = df.groupby("month").mean()
monthly_mean_df.plot(kind="bar", y=["pm2.5", "Iws"])
<matplotlib.axes._subplots.AxesSubplot at 0x1375a2b38>
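To see exactly what groupby is doing, here's the same pattern on a tiny stand-in frame (the numbers are made up):

```python
import pandas as pd

demo = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "pm2.5": [100.0, 200.0, 50.0, 60.0, 70.0],
})

# rows are bucketed by "month"; .mean() then averages each bucket
monthly = demo.groupby("month").mean()
print(monthly["pm2.5"].tolist())   # [150.0, 60.0]
```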