In this notebook, we use principal components analysis (PCA) to analyze the time series of fertility rates in 192 countries, using data obtained from the World Bank. The main goal is to understand how the trends in fertility over time differ from country to country. This is a slightly atypical illustration of PCA because the data are time series. Methods such as functional PCA have been developed for this setting, but since the fertility data are very smooth, there is no real disadvantage to using standard PCA in this case.

The raw data are available from

http://data.worldbank.org/indicator/SP.DYN.TFRT.IN

Download the archive in csv format, and unzip it. The file of interest is called `sp.dyn.tfrt.in_Indicator_en_csv_v2.csv`

.

In [1]:

```
import pandas as pd
import numpy as np
from statsmodels.sandbox.tools.tools_pca import pcasvd
```

First we read the data file.

In [2]:

```
url = "sp.dyn.tfrt.in_Indicator_en_csv_v2.csv"
data = pd.read_csv(url, skiprows=2)
print data.head()
```

In [3]:

```
for tok in ["IFC", "developing", "income", "European", "HIPC", "classified",
"Least", "Other", "small states", "cone", "world", "baltics"]:
ii = [i for i in data.index if tok.lower() in data.loc[i, "Country Name"].lower()]
print "\n".join(data.loc[ii, "Country Name"])
data = data.drop(ii)
```

In [4]:

```
years = range(1960, 2012)
years_str = ["%.0f" % y for y in years]
mat = data.loc[:, ["Country Name",] + years_str].dropna()
countries = mat["Country Name"].tolist()
mat = np.asarray(mat.loc[:, years_str])
```

In [5]:

```
plt.clf()
mn = mat.mean(0)
plt.plot(years, mn, '-', lw=4)
plt.grid(True)
plt.xlabel("Year", size=17)
plt.ylabel("Global mean fertility rate", size=17)
```

Out[5]:

Next we perform the PCA:

In [6]:

```
xreduced, scores, evals, evecs = pcasvd(mat)
```

In [7]:

```
plt.clf()
plt.plot(evals[0:10], lw=4)
plt.grid(True)
plt.xlabel("Component number", size=17)
plt.ylabel("Eigenvalue", size=17)
```

Out[7]:

In [8]:

```
plt.figure(figsize=(8,4))
plt.clf()
plt.axes([0.1, 0.1, 0.75, 0.9])
plt.plot(years, evecs[:, 0], lw=4, alpha=0.6, label="PC 1")
plt.plot(years, evecs[:, 1], lw=4, alpha=0.6, label="PC 2")
plt.plot(years, evecs[:, 2], lw=4, alpha=0.6, label="PC 3")
ha,la = plt.gca().get_legend_handles_labels()
leg = plt.figlegend(ha, la, loc="center right")
leg.draw_frame(False)
plt.xlabel("Year", size=17)
plt.xlim(1960, 2010)
```

Out[8]:

In [9]:

```
def make_plot(ix):
"""
Plot the raw fertility trajectories for the countries with index in `ix`.
"""
plt.figure(figsize=(9,5))
plt.clf()
plt.axes([0.1, 0.1, 0.7, 0.8])
ha, lb = [], []
for i in ix:
a, = plt.plot(years, mat[i,:], '-')
lb.append(countries[i][0:10])
ha.append(a)
a, = plt.plot(years, mn, '-', color='grey')
ha.append(a)
lb.append("Mean")
leg = plt.figlegend(ha, lb, "center right")
leg.draw_frame(False)
plt.xlabel("Year", size=17)
plt.xlim(1960, 2010)
plt.ylabel("Fertility", size=17)
```

In [10]:

```
ii = np.argsort(scores[:,0])
make_plot(ii[-5:])
```

In [11]:

```
make_plot(ii[0:5])
```

In [12]:

```
ii = np.argsort(scores[:,1])
make_plot(ii[-5:])
```

In [13]:

```
make_plot(ii[0:5])
```

In [16]:

```
print np.corrcoef(scores[:, 0], scores[:, 1])
plt.plot(scores[:, 0], scores[:, 1], 'o')
plt.xlabel("PC 1", size=17)
plt.ylabel("PC 2", size=17)
```

Out[16]:

In [18]:

```
ii = np.flatnonzero((scores[:, 0] < -15) & (scores[:, 1] > 5))
make_plot(ii)
```

In [19]:

```
ii = np.flatnonzero((np.abs(scores[:, 0]) < 5) & (scores[:, 1] < -5))
make_plot(ii)
```

In [20]:

```
ii = np.flatnonzero(scores[:, 0] > 15)
make_plot(ii)
```