In this notebook, we use principal components analysis (PCA) to analyze the time series of fertility rates in 192 countries, using data obtained from the World Bank. The main goal is to understand how the trends in fertility over time differ from country to country. This is a slightly atypical illustration of PCA because the data are time series. Methods such as functional PCA have been developed for this setting, but since the fertility data are very smooth, there is no real disadvantage to using standard PCA in this case.
The raw data are available from
Download the archive in csv format, and unzip it. The file of interest is called
import pandas as pd import numpy as np from statsmodels.sandbox.tools.tools_pca import pcasvd
First we read the data file.
url = "sp.dyn.tfrt.in_Indicator_en_csv_v2.csv" data = pd.read_csv(url, skiprows=2) print data.head()
Country Name Country Code Indicator Name \ 0 Aruba ABW Fertility rate, total (births per woman) 1 Andorra AND Fertility rate, total (births per woman) 2 Afghanistan AFG Fertility rate, total (births per woman) 3 Angola AGO Fertility rate, total (births per woman) 4 Albania ALB Fertility rate, total (births per woman) Indicator Code 1960 1961 1962 1963 1964 1965 ... 2005 \ 0 SP.DYN.TFRT.IN 4.820 4.655 4.471 4.271 4.059 3.842 ... 1.769 1 SP.DYN.TFRT.IN NaN NaN NaN NaN NaN NaN ... NaN 2 SP.DYN.TFRT.IN 7.671 7.671 7.671 7.671 7.671 7.671 ... 6.930 3 SP.DYN.TFRT.IN 7.316 7.354 7.385 7.410 7.425 7.430 ... 6.657 4 SP.DYN.TFRT.IN 6.186 6.076 5.956 5.833 5.711 5.594 ... 1.919 2006 2007 2008 2009 2010 2011 2012 2013 Unnamed: 58 0 1.754 1.739 1.726 1.713 1.701 1.690 1.681 NaN NaN 1 1.240 1.180 1.250 1.190 1.220 NaN NaN NaN NaN 2 6.702 6.456 6.196 5.928 5.659 5.395 5.141 NaN NaN 3 6.598 6.523 6.434 6.331 6.218 6.099 5.979 NaN NaN 4 1.849 1.796 1.761 1.744 1.741 1.748 1.760 NaN NaN [5 rows x 59 columns]
Some entries in the data file are not countries, they are regions consisting of multiple countries. We will remove those entries before proceeding. We identified a set of substrings that identify the records that we want to remove. We print out the names we are removing to make sure we don't remove anything incorrectly.
for tok in ["IFC", "developing", "income", "European", "HIPC", "classified", "Least", "Other", "small states", "cone", "world", "baltics"]: ii = [i for i in data.index if tok.lower() in data.loc[i, "Country Name"].lower()] print "\n".join(data.loc[ii, "Country Name"]) data = data.drop(ii)
Sub-Saharan Africa (IFC classification) East Asia and the Pacific (IFC classification) Europe and Central Asia (IFC classification) Latin America and the Caribbean (IFC classification) Middle East and North Africa (IFC classification) South Asia (IFC classification) East Asia & Pacific (developing only) Europe & Central Asia (developing only) Latin America & Caribbean (developing only) Middle East & North Africa (developing only) Sub-Saharan Africa (developing only) East Asia & Pacific (all income levels) Europe & Central Asia (all income levels) High income Latin America & Caribbean (all income levels) Low income Lower middle income Low & middle income Middle East & North Africa (all income levels) Middle income High income: nonOECD High income: OECD Sub-Saharan Africa (all income levels) Upper middle income European Union Heavily indebted poor countries (HIPC) Not classified Least developed countries: UN classification Other small states Caribbean small states Pacific island small states Small states Southern Cone Extended Arab World World Central Europe and the Baltics
Next we create a numeric data matrix containing the fertility rates, and a separate list of strings containing the country names.
years = range(1960, 2012) years_str = ["%.0f" % y for y in years] mat = data.loc[:, ["Country Name",] + years_str].dropna() countries = mat["Country Name"].tolist() mat = np.asarray(mat.loc[:, years_str])
There are two ways to use PCA to analyze a rectangular matrix: we can treat the rows as the "objects" and the columns as the "variables", or vice-versa. Here we will treat the fertility measures as "variables" used to measure the countries as "objects". Thus the goal will be to reduce the yearly fertility rate values to a small number of fertility rate "profiles" or "basis functions" that capture most of the variation over time in the different countries.
The mean trend is removed in PCA, but it is worthwhile taking a look at it. It shows that fertility has dropped steadily over the time period covered in this dataset. Note that the mean is calculated using a country as the unit of analysis, ignoring population size. This is also true for the PC analysis conducted below. A more sophisticated analysis might weight the countries, say by population in 1980.
plt.clf() mn = mat.mean(0) plt.plot(years, mn, '-', lw=4) plt.grid(True) plt.xlabel("Year", size=17) plt.ylabel("Global mean fertility rate", size=17)
<matplotlib.text.Text at 0x7f5ca1c73890>
Next we perform the PCA:
xreduced, scores, evals, evecs = pcasvd(mat)
Based on the eigenvalues, we see that the first PC dominates, with perhaps a small amount of meaningful variation captured in the second and third PC's.
plt.clf() plt.plot(evals[0:10], lw=4) plt.grid(True) plt.xlabel("Component number", size=17) plt.ylabel("Eigenvalue", size=17)
<matplotlib.text.Text at 0x7f5ca19b7950>
Next we will plot the PC factors. The dominant factor is essentially flat, and negative. Countries with a positive score on the first factor will have lower overall fertility compared to the mean, countries with a negative score on the first factor will have higher overall fertility compared to the mean. The second factor is increasing over time. Countries with a positive score on the second factor will increase faster (or decrease slower) compared to the mean shown above. Countries with a negative score on the second factor will decrease faster than the mean.
plt.figure(figsize=(8,4)) plt.clf() plt.axes([0.1, 0.1, 0.75, 0.9]) plt.plot(years, evecs[:, 0], lw=4, alpha=0.6, label="PC 1") plt.plot(years, evecs[:, 1], lw=4, alpha=0.6, label="PC 2") plt.plot(years, evecs[:, 2], lw=4, alpha=0.6, label="PC 3") ha,la = plt.gca().get_legend_handles_labels() leg = plt.figlegend(ha, la, loc="center right") leg.draw_frame(False) plt.xlabel("Year", size=17) plt.xlim(1960, 2010)
To better understand what is going on, we will plot the fertility trajectories for sets of countries with similar PC scores. The following convenience function produces such a plot for a specified subset of countries. The mean fertility curve is also plotted.
def make_plot(ix): """ Plot the raw fertility trajectories for the countries with index in `ix`. """ plt.figure(figsize=(9,5)) plt.clf() plt.axes([0.1, 0.1, 0.7, 0.8]) ha, lb = ,  for i in ix: a, = plt.plot(years, mat[i,:], '-') lb.append(countries[i][0:10]) ha.append(a) a, = plt.plot(years, mn, '-', color='grey') ha.append(a) lb.append("Mean") leg = plt.figlegend(ha, lb, "center right") leg.draw_frame(False) plt.xlabel("Year", size=17) plt.xlim(1960, 2010) plt.ylabel("Fertility", size=17)
First we plot the five countries with the greatest scores on PC 1. These countries have a lower overall rate of fertility than the global mean.
ii = np.argsort(scores[:,0]) make_plot(ii[-5:])
Next we plot the fertility trajectories for the five countries with the least (most negative) scores on PC 1. These countries have higher overall fertility than the gobal mean.
Here are the five countries with the most positive scores on PC 2. These are the countries where the fertility rate increased faster than the global mean (which is decreasing).
ii = np.argsort(scores[:,1]) make_plot(ii[-5:])
Finally we have the five countries with the greatest scores on factor 2. These are countries that decreased faster than the global mean.
We can also look at a scatterplot of the scores for the first two principal components. The scores must be uncorrelated, but there is a u-shaped pattern to the scores, which we explore further below.
print np.corrcoef(scores[:, 0], scores[:, 1]) plt.plot(scores[:, 0], scores[:, 1], 'o') plt.xlabel("PC 1", size=17) plt.ylabel("PC 2", size=17)
[[ 1.00000000e+00 -2.37719025e-15] [ -2.37719025e-15 1.00000000e+00]]
<matplotlib.text.Text at 0x7f5ca146ca10>
The points in the upper left corner of the scores scatterplot (above) have low scores for PC1 (high overall levels of fertility) and high scores for PC2 (high level of fertility increase). Note that there are no countries in the lower left corner of the plot, which would have a low overall fertility level and a high rate of increase.
ii = np.flatnonzero((scores[:, 0] < -15) & (scores[:, 1] > 5)) make_plot(ii)
The points at the bottom of the "U" in the scores scatterplot have average fertility levels overall, and a faster than average rate of decrease.
ii = np.flatnonzero((np.abs(scores[:, 0]) < 5) & (scores[:, 1] < -5)) make_plot(ii)
The large cluster of points at the right edge of the plot have lwer than average fertility and slightly slower than average decrease.
ii = np.flatnonzero(scores[:, 0] > 15) make_plot(ii)