Principal Component Analysis of fertility trends

In this notebook, we use principal components analysis (PCA) to analyze the time series of fertility rates in 192 countries, using data obtained from the World Bank. The main goal is to understand how the trends in fertility over time differ from country to country. This is a slightly atypical illustration of PCA because the data are time series. Methods such as functional PCA have been developed for this setting, but since the fertility data are very smooth, there is no real disadvantage to using standard PCA in this case.

The raw data are available from

http://data.worldbank.org/indicator/SP.DYN.TFRT.IN

Download the archive in csv format, and unzip it. The file of interest is called sp.dyn.tfrt.in_Indicator_en_csv_v2.csv.

In [1]:
import pandas as pd
import numpy as np
from statsmodels.sandbox.tools.tools_pca import pcasvd

First we read the data file.

In [2]:
url = "sp.dyn.tfrt.in_Indicator_en_csv_v2.csv"
data = pd.read_csv(url, skiprows=2)
print data.head()
  Country Name Country Code                            Indicator Name  \
0        Aruba          ABW  Fertility rate, total (births per woman)   
1      Andorra          AND  Fertility rate, total (births per woman)   
2  Afghanistan          AFG  Fertility rate, total (births per woman)   
3       Angola          AGO  Fertility rate, total (births per woman)   
4      Albania          ALB  Fertility rate, total (births per woman)   

   Indicator Code   1960   1961   1962   1963   1964   1965  ...     2005  \
0  SP.DYN.TFRT.IN  4.820  4.655  4.471  4.271  4.059  3.842  ...    1.769   
1  SP.DYN.TFRT.IN    NaN    NaN    NaN    NaN    NaN    NaN  ...      NaN   
2  SP.DYN.TFRT.IN  7.671  7.671  7.671  7.671  7.671  7.671  ...    6.930   
3  SP.DYN.TFRT.IN  7.316  7.354  7.385  7.410  7.425  7.430  ...    6.657   
4  SP.DYN.TFRT.IN  6.186  6.076  5.956  5.833  5.711  5.594  ...    1.919   

    2006   2007   2008   2009   2010   2011   2012  2013  Unnamed: 58  
0  1.754  1.739  1.726  1.713  1.701  1.690  1.681   NaN          NaN  
1  1.240  1.180  1.250  1.190  1.220    NaN    NaN   NaN          NaN  
2  6.702  6.456  6.196  5.928  5.659  5.395  5.141   NaN          NaN  
3  6.598  6.523  6.434  6.331  6.218  6.099  5.979   NaN          NaN  
4  1.849  1.796  1.761  1.744  1.741  1.748  1.760   NaN          NaN  

[5 rows x 59 columns]

Some entries in the data file are not countries, they are regions consisting of multiple countries. We will remove those entries before proceeding. We identified a set of substrings that identify the records that we want to remove. We print out the names we are removing to make sure we don't remove anything incorrectly.

In [3]:
for tok in ["IFC", "developing", "income", "European", "HIPC", "classified", 
    "Least", "Other", "small states", "cone", "world", "baltics"]:
    ii = [i for i in data.index if tok.lower() in data.loc[i, "Country Name"].lower()]
    print "\n".join(data.loc[ii, "Country Name"])
    data = data.drop(ii)
Sub-Saharan Africa (IFC classification)
East Asia and the Pacific (IFC classification)
Europe and Central Asia (IFC classification)
Latin America and the Caribbean (IFC classification)
Middle East and North Africa (IFC classification)
South Asia (IFC classification)
East Asia & Pacific (developing only)
Europe & Central Asia (developing only)
Latin America & Caribbean (developing only)
Middle East & North Africa (developing only)
Sub-Saharan Africa (developing only)
East Asia & Pacific (all income levels)
Europe & Central Asia (all income levels)
High income
Latin America & Caribbean (all income levels)
Low income
Lower middle income
Low & middle income
Middle East & North Africa (all income levels)
Middle income
High income: nonOECD
High income: OECD
Sub-Saharan Africa (all income levels)
Upper middle income
European Union
Heavily indebted poor countries (HIPC)
Not classified
Least developed countries: UN classification
Other small states
Caribbean small states
Pacific island small states
Small states
Southern Cone Extended
Arab World
World
Central Europe and the Baltics

Next we create a numeric data matrix containing the fertility rates, and a separate list of strings containing the country names.

In [4]:
years = range(1960, 2012)
years_str = ["%.0f" % y for y in years]
mat = data.loc[:, ["Country Name",] + years_str].dropna()
countries = mat["Country Name"].tolist()
mat = np.asarray(mat.loc[:, years_str])

There are two ways to use PCA to analyze a rectangular matrix: we can treat the rows as the "objects" and the columns as the "variables", or vice-versa. Here we will treat the fertility measures as "variables" used to measure the countries as "objects". Thus the goal will be to reduce the yearly fertility rate values to a small number of fertility rate "profiles" or "basis functions" that capture most of the variation over time in the different countries.

The mean trend is removed in PCA, but it is worthwhile taking a look at it. It shows that fertility has dropped steadily over the time period covered in this dataset. Note that the mean is calculated using a country as the unit of analysis, ignoring population size. This is also true for the PC analysis conducted below. A more sophisticated analysis might weight the countries, say by population in 1980.

In [5]:
plt.clf()
mn = mat.mean(0)
plt.plot(years, mn, '-', lw=4)
plt.grid(True)
plt.xlabel("Year", size=17)
plt.ylabel("Global mean fertility rate", size=17)
Out[5]:
<matplotlib.text.Text at 0x7f5ca1c73890>

Next we perform the PCA:

In [6]:
xreduced, scores, evals, evecs = pcasvd(mat)

Based on the eigenvalues, we see that the first PC dominates, with perhaps a small amount of meaningful variation captured in the second and third PC's.

In [7]:
plt.clf()
plt.plot(evals[0:10], lw=4)
plt.grid(True)
plt.xlabel("Component number", size=17)
plt.ylabel("Eigenvalue", size=17)
Out[7]:
<matplotlib.text.Text at 0x7f5ca19b7950>

Next we will plot the PC factors. The dominant factor is essentially flat, and negative. Countries with a positive score on the first factor will have lower overall fertility compared to the mean, countries with a negative score on the first factor will have higher overall fertility compared to the mean. The second factor is increasing over time. Countries with a positive score on the second factor will increase faster (or decrease slower) compared to the mean shown above. Countries with a negative score on the second factor will decrease faster than the mean.

In [8]:
plt.figure(figsize=(8,4))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.9])
plt.plot(years, evecs[:, 0], lw=4, alpha=0.6, label="PC 1")
plt.plot(years, evecs[:, 1], lw=4, alpha=0.6, label="PC 2")
plt.plot(years, evecs[:, 2], lw=4, alpha=0.6, label="PC 3")
ha,la = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, la, loc="center right")
leg.draw_frame(False)
plt.xlabel("Year", size=17)
plt.xlim(1960, 2010)
Out[8]:
(1960, 2010)

To better understand what is going on, we will plot the fertility trajectories for sets of countries with similar PC scores. The following convenience function produces such a plot for a specified subset of countries. The mean fertility curve is also plotted.

In [9]:
def make_plot(ix):
    """
    Plot the raw fertility trajectories for the countries with index in `ix`.
    """
    plt.figure(figsize=(9,5))
    plt.clf()
    plt.axes([0.1, 0.1, 0.7, 0.8])
    ha, lb = [], []
    for i in ix:
        a, = plt.plot(years, mat[i,:], '-')
        lb.append(countries[i][0:10])
        ha.append(a)
    a, = plt.plot(years, mn, '-', color='grey')
    ha.append(a)
    lb.append("Mean")
    leg = plt.figlegend(ha, lb, "center right")
    leg.draw_frame(False)
    plt.xlabel("Year", size=17)
    plt.xlim(1960, 2010)
    plt.ylabel("Fertility", size=17)    

First we plot the five countries with the greatest scores on PC 1. These countries have a lower overall rate of fertility than the global mean.

In [10]:
ii = np.argsort(scores[:,0])
make_plot(ii[-5:])

Next we plot the fertility trajectories for the five countries with the least (most negative) scores on PC 1. These countries have higher overall fertility than the gobal mean.

In [11]:
make_plot(ii[0:5])

Here are the five countries with the most positive scores on PC 2. These are the countries where the fertility rate increased faster than the global mean (which is decreasing).

In [12]:
ii = np.argsort(scores[:,1])
make_plot(ii[-5:])

Finally we have the five countries with the greatest scores on factor 2. These are countries that decreased faster than the global mean.

In [13]:
make_plot(ii[0:5])

We can also look at a scatterplot of the scores for the first two principal components. The scores must be uncorrelated, but there is a u-shaped pattern to the scores, which we explore further below.

In [16]:
print np.corrcoef(scores[:, 0], scores[:, 1])
plt.plot(scores[:, 0], scores[:, 1], 'o')
plt.xlabel("PC 1", size=17)
plt.ylabel("PC 2", size=17)
[[  1.00000000e+00  -2.37719025e-15]
 [ -2.37719025e-15   1.00000000e+00]]
Out[16]:
<matplotlib.text.Text at 0x7f5ca146ca10>

The points in the upper left corner of the scores scatterplot (above) have low scores for PC1 (high overall levels of fertility) and high scores for PC2 (high level of fertility increase). Note that there are no countries in the lower left corner of the plot, which would have a low overall fertility level and a high rate of increase.

In [18]:
ii = np.flatnonzero((scores[:, 0] < -15) & (scores[:, 1] > 5))
make_plot(ii)

The points at the bottom of the "U" in the scores scatterplot have average fertility levels overall, and a faster than average rate of decrease.

In [19]:
ii = np.flatnonzero((np.abs(scores[:, 0]) < 5) & (scores[:, 1] < -5))
make_plot(ii)

The large cluster of points at the right edge of the plot have lwer than average fertility and slightly slower than average decrease.

In [20]:
ii = np.flatnonzero(scores[:, 0] > 15)
make_plot(ii)