#!/usr/bin/env python
# coding: utf-8

# # Data Bootcamp:  Demography
# 
# **David Backus**
# 
# We love demography, specifically the dynamics of population growth and decline.  You can drill down seemingly without end, as this [terrific graphic](http://www.bloomberg.com/graphics/dataview/how-americans-die/) about causes of death suggests.    
# 
# We take a look here at the UN's [population data](http://esa.un.org/unpd/wpp/Download/Standard/Population/):  the age distribution of the population, life expectancy, fertility (the word we use for births), and mortality (deaths).  Explore the website, it's filled with interesting data.  There are other sources that cover longer time periods, and for some countries you can get detailed data on specific things (causes of death, for example).  
# 
# We use a number of countries as examples, but Japan and China are the most striking.  The code is written so that the country is easily changed.  
# 
# This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course [Data Bootcamp](http://databootcamp.nyuecon.com/).  

# ## Preliminaries
# 
# Import statements and a date check for future reference.   

# In[1]:


# import packages 
import pandas as pd                   # data management
import matplotlib.pyplot as plt       # graphics 
import matplotlib as mpl              # graphics parameters
import numpy as np                    # numerical calculations 

# IPython command, puts plots in notebook 
get_ipython().run_line_magic('matplotlib', 'inline')

# check Python version 
import datetime as dt 
import sys
print('Today is', dt.date.today())
print('What version of Python are we running? \n', sys.version, sep='') 


# In[ ]:


# ## Population by age 
# 
# We have both "estimates" of the past (1950-2015) and "projections" of the future (out to 2100).  Here we focus on the latter, specifically what the UN refers to as the medium variant:  their middle of the road projection.  It gives us a sense of how Japan's population might change over the next century. 
# 
# It takes a few seconds to read the data.  
# 
# What are the numbers?  Thousands of people in various 5-year age categories.  

# In[2]:


url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/'
url3 = 'WPP2015_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLS'
url = url1 + url2 + url3 

cols = [2, 5] + list(range(6,28))
#est = pd.read_excel(url, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])
prj = pd.read_excel(url, sheetname=1, skiprows=16, parse_cols=cols, na_values=['…'])

prj.head(3)[list(range(6))]


# In[3]:


# rename some variables 
pop = prj 
names = list(pop) 
pop = pop.rename(columns={names[0]: 'Country', 
                          names[1]: 'Year'}) 
# select country and years 
country = ['Japan']
years     = [2015, 2055, 2095]
pop = pop[pop['Country'].isin(country) & pop['Year'].isin(years)]
pop = pop.drop(['Country'], axis=1)

# set index = Year 
# divide by 1000 to convert numbers from thousands to millions
pop = pop.set_index('Year')/1000

pop.head()[list(range(8))]


# In[4]:


# transpose (T) so that index = age 
pop = pop.T
pop.head(3)


# In[5]:


ax = pop.plot(kind='bar',  
              color='blue', 
              alpha=0.5, subplots=True, sharey=True, figsize=(8,6))

for axnum in range(len(ax)):  
    ax[axnum].set_title('')
    ax[axnum].set_ylabel('Millions')
        
ax[0].set_title('Population by age', fontsize=14, loc='left')   


# In[ ]:


# **Exercise.** What do you see here?  What else would you like to know?  

# **Exercise.** Adapt the preceeding code to do the same thing for China.  Or some other country that sparks your interest.  

# In[ ]:


# ## Fertility:  aka birth rates
# 
# We might wonder, why is the population falling in Japan?  Other countries?  Well, one reason is that birth rates are falling.  Demographers call this fertility.  Here we look at the fertility using the same [UN source](http://esa.un.org/unpd/wpp/Download/Standard/Fertility/) as the previous example.  We look at two variables:  total fertility and fertility by age of mother.  In both cases we explore the numbers to date, but the same files contain projections of future fertility.  

# In[6]:


# fertility overall 
uft  = 'http://esa.un.org/unpd/wpp/DVD/Files/'
uft += '1_Indicators%20(Standard)/EXCEL_FILES/'
uft += '2_Fertility/WPP2015_FERT_F04_TOTAL_FERTILITY.XLS'

cols = [2] + list(range(5,18))
ftot = pd.read_excel(uft, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])

ftot.head(3)[list(range(6))]


# In[7]:


# rename some variables 
names = list(ftot)
f = ftot.rename(columns={names[0]: 'Country'}) 

# select countries 
countries = ['China', 'Japan', 'Germany', 'United States of America']
f = f[f['Country'].isin(countries)]

# shape
f = f.set_index('Country').T 
f = f.rename(columns={'United States of America': 'United States'})
f.tail(3)


# In[8]:


fig, ax = plt.subplots()
f.plot(ax=ax, kind='line', alpha=0.5, lw=3, figsize=(6.5, 4))
ax.set_title('Fertility (births per woman, lifetime)', fontsize=14, loc='left')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
ax.set_ylim(ymin=0)
ax.hlines(2.1, -1, 13, linestyles='dashed')
ax.text(8.5, 2.4, 'Replacement = 2.1')


# **Exercise.** What do you see here?  What else would you like to know?  

# **Exercise.** Add Canada to the figure.  How does it compare to the others?  What other countries would you be interested in?

# In[ ]:


# ## Life expectancy 
# 
# One of the bottom line summary numbers for mortality is life expectancy:  if mortaility rates fall, people live longer, on average.  Here we look at life expectancy at birth.  There are also numbers for life expectancy given than you live to some specific age; for example, life expectancy given that you survive to age 60. 

# In[9]:


# life expectancy at birth, both sexes  
ule  = 'http://esa.un.org/unpd/wpp/DVD/Files/1_Indicators%20(Standard)/EXCEL_FILES/3_Mortality/'
ule += 'WPP2015_MORT_F07_1_LIFE_EXPECTANCY_0_BOTH_SEXES.XLS'

cols = [2] + list(range(5,34))
le  = pd.read_excel(ule, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])

le.head(3)[list(range(10))]


# In[10]:


# rename some variables 
oldname = list(le)[0]
l = le.rename(columns={oldname: 'Country'}) 
l.head(3)[list(range(8))]


# In[11]:


# select countries 
countries = ['China', 'Japan', 'Germany', 'United States of America']
l = l[l['Country'].isin(countries)]

# shape
l = l.set_index('Country').T 
l = l.rename(columns={'United States of America': 'United States'})
l.tail()


# In[12]:


fig, ax = plt.subplots()
l.plot(ax=ax, kind='line', alpha=0.5, lw=3, figsize=(6, 8), grid=True)
ax.set_title('Life expectancy at birth', fontsize=14, loc='left')
ax.set_ylabel('Life expectancy in years')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)
ax.set_ylim(ymin=0)


# **Exercise.** What other countries would you like to see?  Can you add them?  The code below generates a list.  

# In[13]:


countries = le.rename(columns={oldname: 'Country'})['Country']


# **Exercise.**  Why do you think the US is falling behind?  What would you look at to verify your conjecture?

# In[ ]:


# ## Mortality:  aka death rates 
# 
# Another thing that affects the age distribution of the population is the mortality rate:  if mortality rates fall people live longer, on average.  Here we look at how mortality rates have changed over the past 60+ years.  Roughly speaking, people live an extra five years every generation.  Which is a lot.  Some of you will live to be a hundred.  (Look at the 100+ agen category over time for Japan.)  
# 
# The experts look at mortality rates by age.  The UN has a [whole page](http://esa.un.org/unpd/wpp/Download/Standard/Mortality/) devoted to mortality numbers.  We take 5-year mortality rates from the Abridged Life Table.  
# 
# The numbers are percentages of people in a given age group who die over a 5-year period.  0.1 means that 90 percent of an age group is still alive in five years.  

# In[14]:


# mortality overall 
url  = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url += '1_Indicators%20(Standard)/EXCEL_FILES/3_Mortality/'
url += 'WPP2015_MORT_F17_1_ABRIDGED_LIFE_TABLE_BOTH_SEXES.XLS'

cols = [2, 5, 6, 7, 9]
mort = pd.read_excel(url, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])
mort.tail(3)


# In[15]:


# change names 
names = list(mort)
m = mort.rename(columns={names[0]: 'Country', names[2]: 'Age', names[3]: 'Interval', names[4]: 'Mortality'})
m.head(3)


# **Comment.** At this point, we need to pivot the data.  That's not something we've done before, so take it as simply something we can do easily if we have to.  We're going to do this twice to produce different graphs:    
# 
# * Compare countries for the same period.  
# * Compare different periods for the same country.  

# In[16]:


# compare countries for most recent period
countries = ['China', 'Japan', 'Germany', 'United States of America']
mt = m[m['Country'].isin(countries) & m['Interval'].isin([5]) & m['Period'].isin(['2010-2015'])] 
print('Dimensions:', mt.shape) 

mp = mt.pivot(index='Age', columns='Country', values='Mortality')  
mp.head(3)


# In[17]:


fig, ax = plt.subplots()
mp.plot(ax=ax, kind='line', alpha=0.5, linewidth=3, 
#        logy=True, 
        figsize=(6, 4))
ax.set_title('Mortality by age', fontsize=14, loc='left')
ax.set_ylabel('Mortality Rate (log scale)')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)


# **Exercises.**
# 
# * What country's old people have the lowest mortality?
# * What do you see here for the US?  Why is our life expectancy shorter?
# * What other countries would you like to see?  Can you adapt the code to show them?  
# * Anything else cross your mind?  

# In[ ]:


# In[18]:


# compare periods for the one country -- countries[0] is China 
mt = m[m['Country'].isin([countries[0]]) & m['Interval'].isin([5])] 
print('Dimensions:', mt.shape) 

mp = mt.pivot(index='Age', columns='Period', values='Mortality')  
mp = mp[[0, 6, 12]]
mp.head(3)


# In[19]:


fig, ax = plt.subplots()
mp.plot(ax=ax, kind='line', alpha=0.5, linewidth=3, 
#        logy=True, 
        figsize=(6, 4))
ax.set_title('Mortality over time', fontsize=14, loc='left')
ax.set_ylabel('Mortality Rate (log scale)')
ax.legend(loc='best', fontsize=10, handlelength=2, labelspacing=0.15)


# **Exercise.**  What do you see?  What else would you like to know?  

# **Exercise.** Repeat this graph for the United States?  How does it compare?

# In[ ]:


# In[ ]: