This Jupyter notebook is for Part 3 of The Open University's Learn to code for Data Analysis course.
This notebook has all code examples and coding exercises. Remember to start by running the code in this notebook. You will need to add a code cell below each task to complete it.
You'll come across steps in the course directing you to this notebook. Once you've done each exercise, go back to the corresponding step and mark it as complete.
import warnings
warnings.simplefilter('ignore', FutureWarning)
from pandas import *
Dataframes can be constructed from scratch as follows.
headings = ['Country', 'GDP (US$)']
table = [
['UK', 2678454886796.7], # 1st row
['USA', 16768100000000.0], # 2nd row
['China', 9240270452047.0], # and so on...
['Brazil', 2245673032353.8],
['South Africa', 366057913367.1]
]
gdp = DataFrame(columns=headings, data=table)
gdp
Country | GDP (US$) | |
---|---|---|
0 | UK | 2.678455e+12 |
1 | USA | 1.676810e+13 |
2 | China | 9.240270e+12 |
3 | Brazil | 2.245673e+12 |
4 | South Africa | 3.660579e+11 |
And similarly for the life expectancy of those born in 2013...
headings = ['Country name', 'Life expectancy (years)']
table = [
['China', 75],
['Russia', 71],
['United States', 79],
['India', 66],
['United Kingdom', 81]
]
life = DataFrame(columns=headings, data=table)
life
Country name | Life expectancy (years) | |
---|---|---|
0 | China | 75 |
1 | Russia | 71 |
2 | United States | 79 |
3 | India | 66 |
4 | United Kingdom | 81 |
Create a dataframe with all five BRICS countries and their population, in thousands of inhabitants, in 2013. The values (given in the first exercise notebook) are: Brazil 200362, Russian Federation 142834, India 1252140, China 1393337, South Africa 52776.
Now go back to the course.
The following function, written in two different ways, rounds a number to the nearest million. It calls the Python function round()
which rounds a decimal number to the nearest integer. If two integers are equally near, it rounds to the even integer.
def roundToMillions (value):
result = round(value / 1000000)
return result
def roundToMillions (value):
return round(value / 1000000)
To test a function, write expressions that check for various argument values whether the function returns the expected value in each case.
roundToMillions(4567890.1) == 5
True
roundToMillions(0) == 0 # always test with zero...
True
roundToMillions(-1) == 0 # ...and negative numbers
True
roundToMillions(1499999) == 1 # test rounding to the nearest
True
The next function converts US dollars to British pounds.
def usdToGBP (usd):
return usd / 1.564768 # average rate during 2013
usdToGBP(0) == 0
True
usdToGBP(1.564768) == 1
True
usdToGBP(-1) < 0
True
roundToMillions()
to round the population to millions of inhabitants? Write a new function and test it. You need to write this function in preparation for Exercise 4.Now go back to the course.
The next function uses the full form of the conditional statement to expand the abbreviated country names UK and USA and leave other names unchanged.
def expandCountry (name):
if name == 'UK':
return 'United Kingdom'
elif name == 'USA':
return 'United States'
else:
return name
expandCountry('India') == 'India'
True
Here is the same function, written differently, using the simplest form of the conditional statement, without the elif
and else
parts.
def expandCountry (name):
if name == 'UK':
name = 'United Kingdom'
if name == 'USA':
name = 'United States'
return name
Now go back to the course.
A one-argument function can be applied to each cell in a column, in order to obtain a new column with the converted values.
gdp['Country name'] = gdp['Country'].apply(expandCountry)
gdp
Country | GDP (US$) | Country name | |
---|---|---|---|
0 | UK | 2.678455e+12 | United Kingdom |
1 | USA | 1.676810e+13 | United States |
2 | China | 9.240270e+12 | China |
3 | Brazil | 2.245673e+12 | Brazil |
4 | South Africa | 3.660579e+11 | South Africa |
Given that apply()
is a column method that returns a column, it can be chained, to apply several conversions in one go.
gdp['GDP (£m)'] = gdp['GDP (US$)'].apply(usdToGBP).apply(roundToMillions)
gdp
Country | GDP (US$) | Country name | GDP (£m) | |
---|---|---|---|---|
0 | UK | 2.678455e+12 | United Kingdom | 1711727 |
1 | USA | 1.676810e+13 | United States | 10716029 |
2 | China | 9.240270e+12 | China | 5905202 |
3 | Brazil | 2.245673e+12 | Brazil | 1435148 |
4 | South Africa | 3.660579e+11 | South Africa | 233937 |
Applying the conversion functions in a different order will lead to a different result.
gdp['GDP (US$)'].apply(roundToMillions).apply(usdToGBP).apply(round)
0 1711727 1 10716029 2 5905201 3 1435148 4 233938 Name: GDP (US$), dtype: int64
The original columns can be discarded.
headings = ['Country name', 'GDP (£m)']
gdp = gdp[headings]
gdp
Country name | GDP (£m) | |
---|---|---|
0 | United Kingdom | 1711727 |
1 | United States | 10716029 |
2 | China | 5905202 |
3 | Brazil | 1435148 |
4 | South Africa | 233937 |
Take the dataframe you created for Exercise 1, and apply to its population column the rounding function you wrote in Exercise 2.
Now go back to the course.
At this point, both tables have a common column, 'Country name', with fully expanded country names.
life
Country name | Life expectancy (years) | |
---|---|---|
0 | China | 75 |
1 | Russia | 71 |
2 | United States | 79 |
3 | India | 66 |
4 | United Kingdom | 81 |
gdp
Country name | GDP (£m) | |
---|---|---|
0 | United Kingdom | 1711727 |
1 | United States | 10716029 |
2 | China | 5905202 |
3 | Brazil | 1435148 |
4 | South Africa | 233937 |
A left join takes the rows of the left table and adds the columns of the right table.
merge(gdp, life, on='Country name', how='left')
Country name | GDP (£m) | Life expectancy (years) | |
---|---|---|---|
0 | United Kingdom | 1711727 | 81.0 |
1 | United States | 10716029 | 79.0 |
2 | China | 5905202 | 75.0 |
3 | Brazil | 1435148 | NaN |
4 | South Africa | 233937 | NaN |
A right join takes the rows from the right table, and adds the columns of the left table.
merge(gdp, life, on='Country name', how='right')
Country name | GDP (£m) | Life expectancy (years) | |
---|---|---|---|
0 | United Kingdom | 1711727.0 | 81 |
1 | United States | 10716029.0 | 79 |
2 | China | 5905202.0 | 75 |
3 | Russia | NaN | 71 |
4 | India | NaN | 66 |
An outer join takes the union of the rows, i.e. it has all the rows of the left and right joins.
merge(gdp, life, on='Country name', how='outer')
Country name | GDP (£m) | Life expectancy (years) | |
---|---|---|---|
0 | United Kingdom | 1711727.0 | 81.0 |
1 | United States | 10716029.0 | 79.0 |
2 | China | 5905202.0 | 75.0 |
3 | Brazil | 1435148.0 | NaN |
4 | South Africa | 233937.0 | NaN |
5 | Russia | NaN | 71.0 |
6 | India | NaN | 66.0 |
An inner join takes the intersection of the rows (i.e. the common rows) of the left and right joins.
gdpVsLife = merge(gdp, life, on='Country name', how='inner')
gdpVsLife
Country name | GDP (£m) | Life expectancy (years) | |
---|---|---|---|
0 | United Kingdom | 1711727 | 81 |
1 | United States | 10716029 | 79 |
2 | China | 5905202 | 75 |
Join your population dataframe (from Exercise 4) with gdpVsLife
, in four different ways, and note the differences.
Now go back to the course.
Constants are used to represent fixed values (e.g. strings and numbers) that occur frequently in a program. Constant names are conventionally written in uppercase, with underscores to separate multiple words.
GDP_USD = 'GDP (US$)'
GDP_GBP = 'GDP (£m)'
GDP_USD
'GDP (US$)'
Look through the code you wrote so far, and rewrite it using constants, when appropriate.
Now go back to the course.
It is possible to directly download data from the World Bank, for a particular time period and indicator, like the GDP in current US dollars. The indicator name is given in the URL of the webpage about the dataset.
Getting the data directly from the World Bank only works with Anaconda (or a paid CoCalc account) and requires an Internet connection. It can take some time to download the data, depending on the speed of your connection and the load on the World Bank server. Moreover, the World Bank occasionally changes the layout of the data, which could break the code in the rest of this notebook.
To avoid such problems I have saved the World Bank data into CSV files. The data is in a column with the same name as the indicator. Hence I declare the indicator names as constants, to be used later when processing the dataframe.
GDP_INDICATOR = 'NY.GDP.MKTP.CD'
gdpReset = read_csv('WB GDP 2013.csv')
LIFE_INDICATOR = 'SP.DYN.LE00.IN'
lifeReset = read_csv('WB LE 2013.csv')
The CSV files were obtained in two steps, which are shown next in commented code because we already have the CSV files.
First the data was obtained directly from the World Bank using a particular function in pandas, and indicating the desired indicator and time period. Note that you may have to install the pandas_datareader
module, using Anaconda Navigator.
# if pandas.__version__.startswith('0.23'):
# # this solves an incompatibility between pandas 0.23 and datareader 0.6
# # taken from https://stackoverflow.com/questions/50394873/
# core.common.is_list_like = api.types.is_list_like
# from pandas_datareader.wb import download
# YEAR = 2013
# gdpWB = download(indicator=GDP_INDICATOR, country='all', start=YEAR, end=YEAR)
# lifeWB = download(indicator=LIFE_INDICATOR, country='all', start=YEAR, end=YEAR)
# lifeWB.head()
The downloaded dataframe has descriptive row names instead of the usual 0, 1, 2, etc. In other words, the dataframe's index is given by the country and year instead of integers. Hence the second step was to reset the index.
# gdpReset = gdpWB.reset_index()
# lifeReset = lifeWB.reset_index()
Resetting the index put the dataframes into the usual form, which was saved to CSV files.
lifeReset.head()
country | year | SP.DYN.LE00.IN | |
---|---|---|---|
0 | Arab World | 2013 | 70.631305 |
1 | Caribbean small states | 2013 | 71.901964 |
2 | Central Europe and the Baltics | 2013 | 76.127583 |
3 | East Asia & Pacific (all income levels) | 2013 | 74.604619 |
4 | East Asia & Pacific (developing only) | 2013 | 73.657617 |
Now go back to the course.
The expression frame[m:n]
represents a dataframe with only row m
to row n-1
(or until the end if n
is omitted) of frame
.
lifeReset[0:3]
country | year | SP.DYN.LE00.IN | |
---|---|---|---|
0 | Arab World | 2013 | 70.631305 |
1 | Caribbean small states | 2013 | 71.901964 |
2 | Central Europe and the Baltics | 2013 | 76.127583 |
lifeReset[240:]
country | year | SP.DYN.LE00.IN | |
---|---|---|---|
240 | Vanuatu | 2013 | 71.669244 |
241 | Venezuela, RB | 2013 | 74.074415 |
242 | Vietnam | 2013 | 75.756488 |
243 | Virgin Islands (U.S.) | 2013 | 79.624390 |
244 | West Bank and Gaza | 2013 | 73.203341 |
245 | Yemen, Rep. | 2013 | 63.583512 |
246 | Zambia | 2013 | 59.237366 |
247 | Zimbabwe | 2013 | 55.633000 |
The first rows of World Bank dataframes are aggregated data for country groups, and are thus discarded. There were 34 country groups when I generated the CSV files, but the World Bank sometimes adds or removes groups. Therefore, if you obtained the data directly from the World Bank, you may need to discard more or fewer than 34 rows to get a dataframe that starts with Afghanistan.
gdpCountries = gdpReset[34:]
lifeCountries = lifeReset[34:]
gdpCountries.head()
country | year | NY.GDP.MKTP.CD | |
---|---|---|---|
34 | Afghanistan | 2013 | 2.045894e+10 |
35 | Albania | 2013 | 1.278103e+10 |
36 | Algeria | 2013 | 2.097035e+11 |
37 | American Samoa | 2013 | NaN |
38 | Andorra | 2013 | 3.249101e+09 |
Rows with missing data are dropped.
gdpData = gdpCountries.dropna()
lifeData = lifeCountries.dropna()
gdpData.head()
country | year | NY.GDP.MKTP.CD | |
---|---|---|---|
34 | Afghanistan | 2013 | 2.045894e+10 |
35 | Albania | 2013 | 1.278103e+10 |
36 | Algeria | 2013 | 2.097035e+11 |
38 | Andorra | 2013 | 3.249101e+09 |
39 | Angola | 2013 | 1.383568e+11 |
The year column is discarded.
COUNTRY = 'country'
headings = [COUNTRY, GDP_INDICATOR]
gdpClean = gdpData[headings]
headings = [COUNTRY, LIFE_INDICATOR]
lifeClean = lifeData[headings]
lifeClean.head()
country | SP.DYN.LE00.IN | |
---|---|---|
34 | Afghanistan | 60.028268 |
35 | Albania | 77.537244 |
36 | Algeria | 74.568951 |
39 | Angola | 51.866171 |
40 | Antigua and Barbuda | 75.778659 |
Clean the population dataframe you created in Exercise 7.
If in Exercise 7 you chose to directly get the population data from the World Bank instead of using the provided CSV file, you may need to remove more (or fewer) than 34 rows at the start of the dataframe due to changes done by the World Bank to its data reporting.
Now go back to the course.
The two dataframes can now be merged with an inner join.
gdpVsLifeAll = merge(gdpClean, lifeClean, on=COUNTRY, how='inner')
gdpVsLifeAll.head()
country | NY.GDP.MKTP.CD | SP.DYN.LE00.IN | |
---|---|---|---|
0 | Afghanistan | 2.045894e+10 | 60.028268 |
1 | Albania | 1.278103e+10 | 77.537244 |
2 | Algeria | 2.097035e+11 | 74.568951 |
3 | Angola | 1.383568e+11 | 51.866171 |
4 | Antigua and Barbuda | 1.200588e+09 | 75.778659 |
The dollars are converted to million pounds.
GDP = 'GDP (£m)'
column = gdpVsLifeAll[GDP_INDICATOR]
gdpVsLifeAll[GDP] = column.apply(usdToGBP).apply(roundToMillions)
gdpVsLifeAll.head()
country | NY.GDP.MKTP.CD | SP.DYN.LE00.IN | GDP (£m) | |
---|---|---|---|---|
0 | Afghanistan | 2.045894e+10 | 60.028268 | 13075 |
1 | Albania | 1.278103e+10 | 77.537244 | 8168 |
2 | Algeria | 2.097035e+11 | 74.568951 | 134016 |
3 | Angola | 1.383568e+11 | 51.866171 | 88420 |
4 | Antigua and Barbuda | 1.200588e+09 | 75.778659 | 767 |
The life expectancy is rounded, by applying the round()
function.
LIFE = 'Life expectancy (years)'
gdpVsLifeAll[LIFE] = gdpVsLifeAll[LIFE_INDICATOR].apply(round)
gdpVsLifeAll.head()
country | NY.GDP.MKTP.CD | SP.DYN.LE00.IN | GDP (£m) | Life expectancy (years) | |
---|---|---|---|---|---|
0 | Afghanistan | 2.045894e+10 | 60.028268 | 13075 | 60 |
1 | Albania | 1.278103e+10 | 77.537244 | 8168 | 78 |
2 | Algeria | 2.097035e+11 | 74.568951 | 134016 | 75 |
3 | Angola | 1.383568e+11 | 51.866171 | 88420 | 52 |
4 | Antigua and Barbuda | 1.200588e+09 | 75.778659 | 767 | 76 |
The original GDP and life expectancy columns are dropped.
headings = [COUNTRY, GDP, LIFE]
gdpVsLifeClean = gdpVsLifeAll[headings]
gdpVsLifeClean.head()
country | GDP (£m) | Life expectancy (years) | |
---|---|---|---|
0 | Afghanistan | 13075 | 60 |
1 | Albania | 8168 | 78 |
2 | Algeria | 134016 | 75 |
3 | Angola | 88420 | 52 |
4 | Antigua and Barbuda | 767 | 76 |
gdpVsLifeClean
with the population dataframe obtained in the previous exercise.Now go back to the course.
The Spearman rank correlation coefficient between GDP and life expectancy, and the corresponding p-value are calculated as follows.
from scipy.stats import spearmanr
gdpColumn = gdpVsLifeClean[GDP]
lifeColumn = gdpVsLifeClean[LIFE]
(correlation, pValue) = spearmanr(gdpColumn, lifeColumn)
print('The correlation is', correlation)
if pValue < 0.05:
print('It is statistically significant.')
else:
print('It is not statistically significant.')
The correlation is 0.501023238967 It is statistically significant.
Calculate the correlation between GDP and population.
Now go back to the course.
The dataframe method plot()
can also produce scatterplots. The logx
and logy
arguments set a logarithmic scale on the corresponding axis.
%matplotlib inline
gdpVsLifeClean.plot(x=GDP, y=LIFE, kind='scatter', grid=True, logx=True, figsize = (10, 4))
<matplotlib.axes._subplots.AxesSubplot at 0x1a11a13ba8>
Now go back to the course.