This notebook is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Start with the %matplotlib inline statement. This is required if you want to embed images in a Jupyter notebook.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Load the library with import pandas (aliased to pd above).
Read a Comma-Separated Values (CSV) data file with pandas.read_csv.
data = pd.read_csv('../../data/gapminder/gapminder_gdp_oceania.csv')
data
| country | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Australia | 10039.59564 | 10949.64959 | 12217.22686 | 14526.12465 | 16788.62948 | 18334.19751 | 19477.00928 | 21888.88903 | 23424.76683 | 26997.93657 | 30687.75473 | 34435.36744 |
1 | New Zealand | 10556.57566 | 12247.39532 | 13175.67800 | 14463.91893 | 16046.03728 | 16233.71770 | 17632.41040 | 19007.19129 | 18363.32494 | 21050.41377 | 23189.80135 | 25185.00911 |
print(data)
       country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Australia     10039.59564     10949.64959     12217.22686
1  New Zealand     10556.57566     12247.39532     13175.67800

   gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  \
0     14526.12465     16788.62948     18334.19751     19477.00928
1     14463.91893     16046.03728     16233.71770     17632.41040

   gdpPercap_1987  gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  \
0     21888.88903     23424.76683     26997.93657     30687.75473
1     19007.19129     18363.32494     21050.41377     23189.80135

   gdpPercap_2007
0     34435.36744
1     25185.00911
A DataFrame is a collection of Series: the DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.
Pandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays apply to Pandas Series and DataFrames.
What makes Pandas so attractive is its powerful interface for accessing individual records of the table, its proper handling of missing values, and its support for relational-database operations between DataFrames.
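The relationship between DataFrame, Series, and NumPy described above can be sketched on a toy table. The column names and values here are made up for illustration and are not part of the Gapminder data.

```python
import numpy as np
import pandas as pd

# Toy table; the names and numbers are illustrative only.
toy = pd.DataFrame({"height": [1.0, 4.0, 9.0], "width": [2.0, 3.0, 5.0]})

column = toy["height"]       # selecting one column yields a Series
values = column.to_numpy()   # the underlying data as a NumPy array

# NumPy functions can be applied to a Series directly; the result is a Series.
roots = np.sqrt(column)

print(type(column).__name__)  # Series
print(type(values).__name__)  # ndarray
print(roots.tolist())         # [1.0, 2.0, 3.0]
```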
Use index_col to specify that a column's values should be used as row headings.
Pass the name of the column to read_csv as its index_col parameter to do this.
data = pd.read_csv('../../data/gapminder/gapminder_gdp_oceania.csv', index_col='country')
data
country | gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Australia | 10039.59564 | 10949.64959 | 12217.22686 | 14526.12465 | 16788.62948 | 18334.19751 | 19477.00928 | 21888.88903 | 23424.76683 | 26997.93657 | 30687.75473 | 34435.36744 |
New Zealand | 10556.57566 | 12247.39532 | 13175.67800 | 14463.91893 | 16046.03728 | 16233.71770 | 17632.41040 | 19007.19129 | 18363.32494 | 21050.41377 | 23189.80135 | 25185.00911 |
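To see the effect of index_col without needing the Gapminder file on disk, here is a minimal sketch that reads a small CSV held in memory. The column names gdp_a and gdp_b and their values are hypothetical.

```python
import io

import pandas as pd

# A two-row CSV held in memory; the values are made up for illustration.
csv_text = "country,gdp_a,gdp_b\nAustralia,1.0,2.0\nNew Zealand,3.0,4.0\n"

# Without index_col the rows are numbered 0, 1, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col the named column becomes the row labels.
country_index = pd.read_csv(io.StringIO(csv_text), index_col="country")

print(list(default_index.index))    # [0, 1]
print(list(country_index.index))    # ['Australia', 'New Zealand']
print(list(country_index.columns))  # ['gdp_a', 'gdp_b']
```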
Use DataFrame.info to find out more about a dataframe.
This is a DataFrame with two rows, named 'Australia' and 'New Zealand'.
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, Australia to New Zealand
Data columns (total 12 columns):
gdpPercap_1952    2 non-null float64
gdpPercap_1957    2 non-null float64
gdpPercap_1962    2 non-null float64
gdpPercap_1967    2 non-null float64
gdpPercap_1972    2 non-null float64
gdpPercap_1977    2 non-null float64
gdpPercap_1982    2 non-null float64
gdpPercap_1987    2 non-null float64
gdpPercap_1992    2 non-null float64
gdpPercap_1997    2 non-null float64
gdpPercap_2002    2 non-null float64
gdpPercap_2007    2 non-null float64
dtypes: float64(12)
memory usage: 208.0+ bytes
The DataFrame.columns variable stores information about the dataframe's columns.
Note that this is data, not a method, like math.pi. So do not use () to try to call it.
data.columns
Index(['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967', 'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987', 'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007'], dtype='object')
Use DataFrame.T to transpose a dataframe.
Transpose (written .T) doesn't copy the data, just changes the program's view of it.
Like columns, it is a member variable.
data.T
country | Australia | New Zealand |
---|---|---|
gdpPercap_1952 | 10039.59564 | 10556.57566 |
gdpPercap_1957 | 10949.64959 | 12247.39532 |
gdpPercap_1962 | 12217.22686 | 13175.67800 |
gdpPercap_1967 | 14526.12465 | 14463.91893 |
gdpPercap_1972 | 16788.62948 | 16046.03728 |
gdpPercap_1977 | 18334.19751 | 16233.71770 |
gdpPercap_1982 | 19477.00928 | 17632.41040 |
gdpPercap_1987 | 21888.88903 | 19007.19129 |
gdpPercap_1992 | 23424.76683 | 18363.32494 |
gdpPercap_1997 | 26997.93657 | 21050.41377 |
gdpPercap_2002 | 30687.75473 | 23189.80135 |
gdpPercap_2007 | 34435.36744 | 25185.00911 |
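As a small check of what transposing does, the sketch below uses a made-up table with two rows and three columns and confirms that row labels and column names swap places.

```python
import pandas as pd

# Toy table: 2 rows (countries) by 3 columns (years); values are illustrative.
toy = pd.DataFrame(
    {"y1952": [1.0, 2.0], "y1957": [3.0, 4.0], "y1962": [5.0, 6.0]},
    index=["Australia", "New Zealand"],
)

flipped = toy.T  # an attribute, so no parentheses

print(toy.shape)              # (2, 3)
print(flipped.shape)          # (3, 2)
print(list(flipped.columns))  # ['Australia', 'New Zealand']

# The same value is reachable with row and column labels swapped.
print(toy.loc["Australia", "y1957"] == flipped.loc["y1957", "Australia"])  # True
```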
Use DataFrame.describe to get summary statistics about data.
DataFrame.describe() gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument include='all'.
data.describe()
| gdpPercap_1952 | gdpPercap_1957 | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | gdpPercap_1977 | gdpPercap_1982 | gdpPercap_1987 | gdpPercap_1992 | gdpPercap_1997 | gdpPercap_2002 | gdpPercap_2007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.00000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 | 2.000000 |
mean | 10298.085650 | 11598.522455 | 12696.452430 | 14495.021790 | 16417.33338 | 17283.957605 | 18554.709840 | 20448.040160 | 20894.045885 | 24024.175170 | 26938.778040 | 29810.188275 |
std | 365.560078 | 917.644806 | 677.727301 | 43.986086 | 525.09198 | 1485.263517 | 1304.328377 | 2037.668013 | 3578.979883 | 4205.533703 | 5301.853680 | 6540.991104 |
min | 10039.595640 | 10949.649590 | 12217.226860 | 14463.918930 | 16046.03728 | 16233.717700 | 17632.410400 | 19007.191290 | 18363.324940 | 21050.413770 | 23189.801350 | 25185.009110 |
25% | 10168.840645 | 11274.086022 | 12456.839645 | 14479.470360 | 16231.68533 | 16758.837652 | 18093.560120 | 19727.615725 | 19628.685413 | 22537.294470 | 25064.289695 | 27497.598692 |
50% | 10298.085650 | 11598.522455 | 12696.452430 | 14495.021790 | 16417.33338 | 17283.957605 | 18554.709840 | 20448.040160 | 20894.045885 | 24024.175170 | 26938.778040 | 29810.188275 |
75% | 10427.330655 | 11922.958888 | 12936.065215 | 14510.573220 | 16602.98143 | 17809.077557 | 19015.859560 | 21168.464595 | 22159.406358 | 25511.055870 | 28813.266385 | 32122.777857 |
max | 10556.575660 | 12247.395320 | 13175.678000 | 14526.124650 | 16788.62948 | 18334.197510 | 19477.009280 | 21888.889030 | 23424.766830 | 26997.936570 | 30687.754730 | 34435.367440 |
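The Oceania table contains only numeric columns, so the effect of include='all' is not visible there. A minimal sketch on a made-up mixed table (the name and value columns are hypothetical) shows the difference.

```python
import pandas as pd

# Toy table with one text column and one numeric column.
mixed = pd.DataFrame({"name": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})

numeric_only = mixed.describe()             # text column is ignored
everything = mixed.describe(include="all")  # text column is summarised too

print(list(numeric_only.columns))         # ['value']
print(list(everything.columns))           # ['name', 'value']
print(everything.loc["unique", "name"])   # 2
print(numeric_only.loc["mean", "value"])  # 2.0
```

Statistics that do not apply to a column, such as the mean of a text column, appear as NaN in the include='all' result.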
There are several ways to structure your data when manually creating a Pandas DataFrame. The example below creates a DataFrame from a Python dictionary.
manual_data = {
"x": [1, 2, 3],
"y": [2, 4, 6]
}
manual_df = pd.DataFrame(manual_data)
The dictionary keys become the column names, and each list supplies the values of its column, one per row.
manual_df
| x | y |
---|---|---|
0 | 1 | 2 |
1 | 2 | 4 |
2 | 3 | 6 |
Use the .to_csv() method to save the data to disk. If you don't want the row indices to be saved in the file, pass the argument index=False as shown below.
manual_df.to_csv("manual_data.csv", index=False)
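To make the effect of index=False concrete, this sketch asks to_csv for the CSV text as a string (which it returns when no path is given) instead of writing a file, and compares the header lines.

```python
import io

import pandas as pd

manual_df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = manual_df.to_csv()
without_index = manual_df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,x,y
print(without_index.splitlines()[0])  # x,y

# Reading the index-free text back reproduces the original table.
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.equals(manual_df))   # True
```

The leading comma in the first header marks the unnamed index column, which is what index=False leaves out.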
To access a value at the position [i, j] of a DataFrame, we have two options, depending on whether i refers to a position or to a label. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.
First, let's load the Europe subset of the Gapminder dataset so that we have more rows and columns to work with.
data = pd.read_csv(
"../../data/gapminder/gapminder_gdp_europe.csv",
index_col="country"
)
Use DataFrame.iloc[..., ...] to select values by their (entry) position.
You can specify a location by numerical index, analogous to a 2D version of character selection in strings.
data.iloc[0, 0]
1601.056136
Use DataFrame.loc[..., ...] to select values by their (entry) label.
You can specify a location by row and column name, analogous to a 2D version of dictionary keys.
data.loc["Albania", "gdpPercap_1952"]
1601.056136
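The difference between position and label becomes visible once the rows are reordered. In this sketch on a made-up three-country table, the value found by label stays the same after sorting while the value found by position changes.

```python
import pandas as pd

# Toy table; the values are illustrative only.
toy = pd.DataFrame(
    {"y1952": [10.0, 20.0, 30.0], "y1957": [11.0, 21.0, 31.0]},
    index=["Albania", "Austria", "Belgium"],
)

# Reverse the row order: Belgium, Austria, Albania.
reordered = toy.sort_index(ascending=False)

# Position 0 now points at a different row ...
print(toy.iloc[0, 0])        # 10.0
print(reordered.iloc[0, 0])  # 30.0

# ... but the label still identifies the same row.
print(toy.loc["Albania", "y1952"])        # 10.0
print(reordered.loc["Albania", "y1952"])  # 10.0
```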
Use : on its own to mean all columns or all rows.
# Would get the same result printing data.loc["Albania"] (without a second index).
data.loc["Albania", :]
gdpPercap_1952    1601.056136
gdpPercap_1957    1942.284244
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
gdpPercap_1977    3533.003910
gdpPercap_1982    3630.880722
gdpPercap_1987    3738.932735
gdpPercap_1992    2497.437901
gdpPercap_1997    3193.054604
gdpPercap_2002    4604.211737
gdpPercap_2007    5937.029526
Name: Albania, dtype: float64
Select multiple columns or rows using DataFrame.loc and a named slice.
You can slice several columns for a single row.
data.loc["Albania", "gdpPercap_1962":"gdpPercap_1972"]
gdpPercap_1962    2312.888958
gdpPercap_1967    2760.196931
gdpPercap_1972    3313.422188
Name: Albania, dtype: float64
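Note that the label slice above includes both of its endpoints, unlike positional slices. A minimal sketch on a made-up Series (the labels y1952 to y1967 are hypothetical) contrasts the two.

```python
import pandas as pd

# Toy row of values labelled by year; the numbers are illustrative.
row = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=["y1952", "y1957", "y1962", "y1967"],
)

label_slice = row.loc["y1957":"y1962"]  # both endpoints are included
position_slice = row.iloc[1:2]          # the stop position is excluded

print(list(label_slice.index))     # ['y1957', 'y1962']
print(list(position_slice.index))  # ['y1957']
```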
Method chaining allows you to establish a visual pipeline of the transformations you apply to a dataset. Below, we use .loc to subset our dataset, then apply .max() to get the maximum value in each of the three remaining columns, and then take the average of those maxima:
data \
.loc["Albania":"Poland", "gdpPercap_1962":"gdpPercap_1972"] \
.max() \
.mean()
16303.415163333333
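The same kind of pipeline can also be written inside parentheses, which avoids the trailing backslashes and leaves room for a comment on each step. This is a sketch on a made-up table, not the Gapminder data.

```python
import pandas as pd

# Toy table; the values are illustrative only.
toy = pd.DataFrame(
    {"a": [1.0, 5.0, 3.0], "b": [4.0, 2.0, 6.0], "c": [7.0, 8.0, 9.0]},
    index=["r1", "r2", "r3"],
)

result = (
    toy
    .loc["r1":"r2", "a":"b"]  # keep rows r1..r2 and columns a..b
    .max()                    # column-wise maxima: a=5.0, b=4.0
    .mean()                   # average of the maxima
)

print(result)  # 4.5
```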
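Comparing a dataframe to a value is applied element by element and returns a similarly shaped dataframe of True and False.
data_subset = data \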
.loc["Albania":"Poland", "gdpPercap_1962":"gdpPercap_1972"]
A frame full of Booleans is sometimes called a mask because of how it can be used.
mask = data_subset > 10000
data_subset[mask]
country | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 |
---|---|---|---|
Albania | NaN | NaN | NaN |
Austria | 10750.72111 | 12834.60240 | 16661.62560 |
Belgium | 10991.20676 | 13149.04119 | 16672.14356 |
Bosnia and Herzegovina | NaN | NaN | NaN |
Bulgaria | NaN | NaN | NaN |
Croatia | NaN | NaN | NaN |
Czech Republic | 10136.86713 | 11399.44489 | 13108.45360 |
Denmark | 13583.31351 | 15937.21123 | 18866.20721 |
Finland | NaN | 10921.63626 | 14358.87590 |
France | 10560.48553 | 12999.91766 | 16107.19171 |
Germany | 12902.46291 | 14745.62561 | 18016.18027 |
Greece | NaN | NaN | 12724.82957 |
Hungary | NaN | NaN | 10168.65611 |
Iceland | 10350.15906 | 13319.89568 | 15798.06362 |
Ireland | NaN | NaN | NaN |
Italy | NaN | 10022.40131 | 12269.27378 |
Montenegro | NaN | NaN | NaN |
Netherlands | 12790.84956 | 15363.25136 | 18794.74567 |
Norway | 13450.40151 | 16361.87647 | 18965.05551 |
Poland | NaN | NaN | NaN |
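To look at the mask itself, here is a small sketch on a two-row table whose values are rounded from the rows shown above. It shows that the mask holds Booleans, that masked-out entries become NaN, and that dropna can then discard incomplete rows.

```python
import pandas as pd

# Two rows with values rounded from the table above, under shortened column names.
toy = pd.DataFrame(
    {"y1962": [2312.9, 10750.7], "y1972": [3313.4, 16661.6]},
    index=["Albania", "Austria"],
)

mask = toy > 10000  # element-wise comparison gives a frame of Booleans
masked = toy[mask]  # entries where the mask is False become NaN

print(mask.loc["Albania", "y1962"])     # False
print(int(masked.isna().sum().sum()))   # 2

kept = masked.dropna()  # drop rows that contain any NaN
print(list(kept.index))  # ['Austria']
```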
The .query() method can also be used to select rows, using a syntax reminiscent of the filter rules in R's dplyr.
data_subset.query("gdpPercap_1972 > 10000 & gdpPercap_1967 > 12000")
country | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 |
---|---|---|---|
Austria | 10750.72111 | 12834.60240 | 16661.62560 |
Belgium | 10991.20676 | 13149.04119 | 16672.14356 |
Denmark | 13583.31351 | 15937.21123 | 18866.20721 |
France | 10560.48553 | 12999.91766 | 16107.19171 |
Germany | 12902.46291 | 14745.62561 | 18016.18027 |
Iceland | 10350.15906 | 13319.89568 | 15798.06362 |
Netherlands | 12790.84956 | 15363.25136 | 18794.74567 |
Norway | 13450.40151 | 16361.87647 | 18965.05551 |
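A query string like the one above corresponds to combining Boolean masks with &. The sketch below checks that equivalence on a three-row table with values rounded from the rows above and shortened, hypothetical column names.

```python
import pandas as pd

# Three rows with rounded values; column names are shortened for the sketch.
toy = pd.DataFrame(
    {"y1967": [2760.2, 12834.6, 11399.4], "y1972": [3313.4, 16661.6, 13108.5]},
    index=["Albania", "Austria", "Czech Republic"],
)

via_query = toy.query("y1972 > 10000 & y1967 > 12000")

# The same filter with explicit masks; each comparison needs its own parentheses.
via_mask = toy[(toy["y1972"] > 10000) & (toy["y1967"] > 12000)]

print(list(via_query.index))       # ['Austria']
print(via_query.equals(via_mask))  # True
```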
You can create new columns that are the result of data transformations applied to existing columns. This example illustrates how to compute the ratio between the 1962 and 1972 GDP per capita for each country:
data_subset["ratio_62_to_72"] = \
data_subset["gdpPercap_1962"] / data_subset["gdpPercap_1972"]
data_subset
country | gdpPercap_1962 | gdpPercap_1967 | gdpPercap_1972 | ratio_62_to_72 |
---|---|---|---|---|
Albania | 2312.888958 | 2760.196931 | 3313.422188 | 0.698036 |
Austria | 10750.721110 | 12834.602400 | 16661.625600 | 0.645238 |
Belgium | 10991.206760 | 13149.041190 | 16672.143560 | 0.659256 |
Bosnia and Herzegovina | 1709.683679 | 2172.352423 | 2860.169750 | 0.597756 |
Bulgaria | 4254.337839 | 5577.002800 | 6597.494398 | 0.644841 |
Croatia | 5477.890018 | 6960.297861 | 9164.090127 | 0.597756 |
Czech Republic | 10136.867130 | 11399.444890 | 13108.453600 | 0.773308 |
Denmark | 13583.313510 | 15937.211230 | 18866.207210 | 0.719981 |
Finland | 9371.842561 | 10921.636260 | 14358.875900 | 0.652686 |
France | 10560.485530 | 12999.917660 | 16107.191710 | 0.655638 |
Germany | 12902.462910 | 14745.625610 | 18016.180270 | 0.716160 |
Greece | 6017.190733 | 8513.097016 | 12724.829570 | 0.472870 |
Hungary | 7550.359877 | 9326.644670 | 10168.656110 | 0.742513 |
Iceland | 10350.159060 | 13319.895680 | 15798.063620 | 0.655154 |
Ireland | 6631.597314 | 7655.568963 | 9530.772896 | 0.695809 |
Italy | 8243.582340 | 10022.401310 | 12269.273780 | 0.671888 |
Montenegro | 4649.593785 | 5907.850937 | 7778.414017 | 0.597756 |
Netherlands | 12790.849560 | 15363.251360 | 18794.745670 | 0.680555 |
Norway | 13450.401510 | 16361.876470 | 18965.055510 | 0.709220 |
Poland | 5338.752143 | 6557.152776 | 8006.506993 | 0.666802 |
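Assigning a column to a frame that was itself sliced from another frame can trigger a SettingWithCopyWarning in some pandas versions. Two common ways to sidestep that, shown here on a made-up table, are to take an explicit copy first or to use assign, which returns a new frame.

```python
import pandas as pd

# Toy table; the values are illustrative only.
toy = pd.DataFrame({"y1962": [2.0, 3.0], "y1972": [4.0, 12.0]}, index=["A", "B"])

# Option 1: copy the subset explicitly, then add the column in place.
subset = toy.loc[:, "y1962":"y1972"].copy()
subset["ratio_62_to_72"] = subset["y1962"] / subset["y1972"]

# Option 2: assign returns a new frame and leaves the original untouched.
with_ratio = toy.assign(ratio_62_to_72=lambda df: df["y1962"] / df["y1972"])

print(subset["ratio_62_to_72"].tolist())      # [0.5, 0.25]
print(with_ratio["ratio_62_to_72"].tolist())  # [0.5, 0.25]
print(list(toy.columns))                      # ['y1962', 'y1972']
```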
matplotlib is the most widely used scientific plotting library in Python.
%matplotlib inline
import matplotlib.pyplot as plt
Simple plots are then (fairly) simple to create.
time = [1, 2, 3]
position = [2, 4, 16]
plt.plot(time, position)
plt.xlabel("Time (hr)")
plt.ylabel("Position (m)");
plt.plot(time, position, "o");
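Plotting a Pandas dataframe directly implicitly uses matplotlib.pyplot.
Before plotting, we convert the column headings from a string to an integer data type, since they represent numerical values.
data = pd.read_csv('../../data/gapminder/gapminder_gdp_oceania.csv', index_col='country')
# Remove the 'gdpPercap_' prefix; str.strip would treat it as a set of characters, not a prefix.
years = data.columns.str.replace('gdpPercap_', '', regex=False)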
data.columns = years.astype(int)
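The conversion of column headings to integer years can be sketched on a toy Index. One caveat worth knowing: str.strip removes any of the given characters from both ends rather than a prefix, so it only happens to work when the remaining text contains none of those characters at its edges. The pop_1952p label below is a contrived, hypothetical example of where the two approaches differ.

```python
import pandas as pd

columns = pd.Index(["gdpPercap_1952", "gdpPercap_2007"])

# Removing the prefix as a literal substring, then converting to integers.
years = columns.str.replace("gdpPercap_", "", regex=False).astype(int)
print(years.tolist())  # [1952, 2007]

# Contrived label whose tail shares a character with the prefix.
tricky = pd.Index(["pop_1952p"])
print(tricky.str.strip("pop_")[0])                     # 1952
print(tricky.str.replace("pop_", "", regex=False)[0])  # 1952p
```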
By default, DataFrame.plot plots with the rows as the X axis.
We can transpose the data in order to plot multiple series.
data.T.plot();
The kind= argument to DataFrame.plot is how you change the type of plot.
plt.style.use() lets you change the overall theme of your plots.
data.T.plot(kind="bar");
plt.style.use("ggplot")
data.T.plot(kind="bar")
plt.ylabel("GDP per capita");
Data can also be plotted by calling the matplotlib plot function directly.
The command is plt.plot(x, y).
years = data.columns
gdp_australia = data.loc['Australia']
plt.plot(years, gdp_australia, 'g--');
For complete control, extract the data from Pandas and use matplotlib directly.
gdp_australia = data.loc['Australia']
gdp_nz = data.loc['New Zealand']
plt.plot(years, gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g-', label='New Zealand')
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('GDP per capita ($)');
data.T.plot.scatter(x="Australia", y="New Zealand");
If you are satisfied with the plot you see you may want to save it to a file, perhaps to include it in a publication. There is a function in the matplotlib.pyplot module that accomplishes this: savefig. Calling this function, e.g. with
plt.savefig('my_figure.png')
will save the current figure to the file my_figure.png. The file format will automatically be deduced from the file name extension (other formats are pdf, ps, eps and svg).
Note that functions in plt refer to a global figure variable, and after a figure has been displayed to the screen (e.g. with plt.show) matplotlib will make this variable refer to a new empty figure. Therefore, make sure you call plt.savefig before the plot is displayed to the screen, otherwise you may find a file with an empty plot.
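The ordering described above can be sketched as a short script. The Agg backend and the temporary output directory are assumptions made so the example runs without a display; in a notebook you would normally leave the backend alone.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [2, 4, 16])
plt.xlabel("Time (hr)")
plt.ylabel("Position (m)")

# Save while the figure is still the current one ...
out_path = os.path.join(tempfile.mkdtemp(), "my_figure.png")
plt.savefig(out_path)

# ... and only then show or close it.
plt.close()

print(os.path.getsize(out_path) > 0)  # True
```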
When using dataframes, data is often generated and plotted to screen in one line, and plt.savefig seems not to be a possible approach. One possibility to save the figure to file is then to save a reference to the current figure in a local variable (with plt.gcf) and call the savefig method from that variable.
data.T.plot(kind='bar');
fig = plt.gcf() # get current figure
data.plot(kind='bar');
fig.savefig('gdp_comparison.png');
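Another option worth considering is to create the figure and axes explicitly with plt.subplots and pass the axes to DataFrame.plot through its ax parameter, so there is no doubt about which figure is saved. This is a sketch on a made-up two-country table; the Agg backend and the temporary output path are assumptions so that it runs outside a notebook.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy table with years as rows; the values are illustrative only.
toy = pd.DataFrame(
    {"Australia": [10.0, 11.0], "New Zealand": [10.5, 12.2]},
    index=[1952, 1957],
)

fig, ax = plt.subplots()     # keep explicit handles to the figure and axes
toy.plot(kind="bar", ax=ax)  # draw onto those axes instead of a new figure
ax.set_ylabel("GDP per capita")

out_path = os.path.join(tempfile.mkdtemp(), "gdp_comparison.png")
fig.savefig(out_path)        # save exactly the figure we drew on
plt.close(fig)

print(os.path.getsize(out_path) > 0)  # True
```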
Content from the following episodes of the Software Carpentry lesson Plotting and Programming in Python made available under the CC BY 4.0 license