In this notebook, we use principal components analysis (PCA) to analyze the time series of fertility rates in 192 countries, using data obtained from the World Bank. The main goal is to understand how the trends in fertility over time differ from country to country. This is a slightly atypical illustration of PCA because the data are time series. Methods such as functional PCA have been developed for this setting, but since the fertility data are very smooth, there is no real disadvantage to using standard PCA in this case.

The raw data are available from

http://data.worldbank.org/indicator/SP.DYN.TFRT.IN

Download the archive in csv format, and unzip it. The file of interest is called `sp.dyn.tfrt.in_Indicator_en_csv_v2.csv`

.

In [1]:

```
import pandas as pd
import numpy as np
from statsmodels.sandbox.tools.tools_pca import pcasvd
```

First we read the data file.

In [2]:

```
url = "sp.dyn.tfrt.in_Indicator_en_csv_v2.csv"
data = pd.read_csv(url, skiprows=2)
print data.head()
```

Some entries in the data file are not countries, they are regions consisting of multiple countries. We will remove those entries before proceeding. We identified a set of substrings that identify the records that we want to remove. We print out the names we are removing to make sure we don't remove anything incorrectly.

In [3]:

```
for tok in ["IFC", "developing", "income", "European", "HIPC", "classified",
"Least", "Other", "small states", "cone", "world", "baltics"]:
ii = [i for i in data.index if tok.lower() in data.loc[i, "Country Name"].lower()]
print "\n".join(data.loc[ii, "Country Name"])
data = data.drop(ii)
```

Next we create a numeric data matrix containing the fertility rates, and a separate list of strings containing the country names.

In [4]:

```
years = range(1960, 2012)
years_str = ["%.0f" % y for y in years]
mat = data.loc[:, ["Country Name",] + years_str].dropna()
countries = mat["Country Name"].tolist()
mat = np.asarray(mat.loc[:, years_str])
```

There are two ways to use PCA to analyze a rectangular matrix: we can treat the rows as the "objects" and the columns as the "variables", or vice-versa. Here we will treat the fertility measures as "variables" used to measure the countries as "objects". Thus the goal will be to reduce the yearly fertility rate values to a small number of fertility rate "profiles" or "basis functions" that capture most of the variation over time in the different countries.

The mean trend is removed in PCA, but it is worthwhile taking a look at it. It shows that fertility has dropped steadily over the time period covered in this dataset. Note that the mean is calculated using a country as the unit of analysis, ignoring population size. This is also true for the PC analysis conducted below. A more sophisticated analysis might weight the countries, say by population in 1980.

In [5]:

```
plt.clf()
mn = mat.mean(0)
plt.plot(years, mn, '-', lw=4)
plt.grid(True)
plt.xlabel("Year", size=17)
plt.ylabel("Global mean fertility rate", size=17)
```

Out[5]:

Next we perform the PCA:

In [6]:

```
xreduced, scores, evals, evecs = pcasvd(mat)
```

Based on the eigenvalues, we see that the first PC dominates, with perhaps a small amount of meaningful variation captured in the second and third PC's.

In [7]:

```
plt.clf()
plt.plot(evals[0:10], lw=4)
plt.grid(True)
plt.xlabel("Component number", size=17)
plt.ylabel("Eigenvalue", size=17)
```

Out[7]:

Next we will plot the PC factors. The dominant factor is essentially flat, and negative. Countries with a positive score on the first factor will have lower overall fertility compared to the mean, countries with a negative score on the first factor will have higher overall fertility compared to the mean. The second factor is increasing over time. Countries with a positive score on the second factor will increase faster (or decrease slower) compared to the mean shown above. Countries with a negative score on the second factor will decrease faster than the mean.

In [8]:

```
plt.figure(figsize=(8,4))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.9])
plt.plot(years, evecs[:, 0], lw=4, alpha=0.6, label="PC 1")
plt.plot(years, evecs[:, 1], lw=4, alpha=0.6, label="PC 2")
plt.plot(years, evecs[:, 2], lw=4, alpha=0.6, label="PC 3")
ha,la = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, la, loc="center right")
leg.draw_frame(False)
plt.xlabel("Year", size=17)
plt.xlim(1960, 2010)
```

Out[8]:

To better understand what is going on, we will plot the fertility trajectories for sets of countries with similar PC scores. The following convenience function produces such a plot for a specified subset of countries. The mean fertility curve is also plotted.

In [9]:

```
def make_plot(ix):
"""
Plot the raw fertility trajectories for the countries with index in `ix`.
"""
plt.figure(figsize=(9,5))
plt.clf()
plt.axes([0.1, 0.1, 0.7, 0.8])
ha, lb = [], []
for i in ix:
a, = plt.plot(years, mat[i,:], '-')
lb.append(countries[i][0:10])
ha.append(a)
a, = plt.plot(years, mn, '-', color='grey')
ha.append(a)
lb.append("Mean")
leg = plt.figlegend(ha, lb, "center right")
leg.draw_frame(False)
plt.xlabel("Year", size=17)
plt.xlim(1960, 2010)
plt.ylabel("Fertility", size=17)
```

First we plot the five countries with the greatest scores on PC 1. These countries have a lower overall rate of fertility than the global mean.

In [10]:

```
ii = np.argsort(scores[:,0])
make_plot(ii[-5:])
```

Next we plot the fertility trajectories for the five countries with the least (most negative) scores on PC 1. These countries have higher overall fertility than the gobal mean.

In [11]:

```
make_plot(ii[0:5])
```

Here are the five countries with the most positive scores on PC 2. These are the countries where the fertility rate increased faster than the global mean (which is decreasing).

In [12]:

```
ii = np.argsort(scores[:,1])
make_plot(ii[-5:])
```

Finally we have the five countries with the greatest scores on factor 2. These are countries that decreased faster than the global mean.

In [13]:

```
make_plot(ii[0:5])
```

We can also look at a scatterplot of the scores for the first two principal components. The scores must be uncorrelated, but there is a u-shaped pattern to the scores, which we explore further below.

In [16]:

```
print np.corrcoef(scores[:, 0], scores[:, 1])
plt.plot(scores[:, 0], scores[:, 1], 'o')
plt.xlabel("PC 1", size=17)
plt.ylabel("PC 2", size=17)
```

Out[16]:

The points in the upper left corner of the scores scatterplot (above) have low scores for PC1 (high overall levels of fertility) and high scores for PC2 (high level of fertility increase). Note that there are no countries in the lower left corner of the plot, which would have a low overall fertility level and a high rate of increase.

In [18]:

```
ii = np.flatnonzero((scores[:, 0] < -15) & (scores[:, 1] > 5))
make_plot(ii)
```

The points at the bottom of the "U" in the scores scatterplot have average fertility levels overall, and a faster than average rate of decrease.

In [19]:

```
ii = np.flatnonzero((np.abs(scores[:, 0]) < 5) & (scores[:, 1] < -5))
make_plot(ii)
```

The large cluster of points at the right edge of the plot have lwer than average fertility and slightly slower than average decrease.

In [20]:

```
ii = np.flatnonzero(scores[:, 0] > 15)
make_plot(ii)
```