In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import patsy
%pylab inline

Populating the interactive namespace from numpy and matplotlib


This is our second look at the well-being dataset from The New York Times. In Part 1 we found that the outlying counties in a regression of income against education were all college towns.

In this note, we'll examine the Principal Components of the data set. Principal Components are the eigenvectors of the variance-covariance matrix $X^TX$. This means they define a basis function, and most importantly, each PC accounts for the maximal possible variance in left in the data.

Put anoher way, let's say we want to produce a low-dimensional representation of our data. (That's basically what the NYT did in their original piece!) PCA provides the "best" representation at each dimensionality, meaning it defines the low-dimensional manifold that most closely matches the data. To make that clear, let's explore the data in more detail.

In [4]:
df = pd.read_table("data/unemployment.tsv")


## PCA¶

In [5]:
df_pca = df[['education', 'income', 'unemployment', 'disability', 'life', 'obesity']]
df_pca -= df_pca.mean()
df_pca /= df_pca.var() ** .5


There's a clear linear relationship here, which we can identify with statsmodels:

In [6]:
var = df_pca.values.T.dot(df_pca.values)

In [7]:
U, d, Vt = np.linalg.svd(df_pca)
V = Vt.T

In [8]:
values = df_pca.values.dot(V)
for i in range(6):
df['v' + str(i+1)] = values[:, i]

In [9]:
plot(d ** 2)

Out[9]:
[<matplotlib.lines.Line2D at 0xbb746ec>]

The first principal component captures most of the variance in the data set. It is also stored in the first row of the following dataframe.

In [10]:
pd.DataFrame(V.T, columns=df_pca.columns)

Out[10]:
education income unemployment disability life obesity
0 -0.407710 -0.425636 0.292399 0.402665 -0.454805 0.445109
1 0.466734 0.220235 0.737400 0.383632 -0.012688 -0.206306
2 -0.406288 -0.175295 0.594250 -0.561673 0.364932 -0.049153
3 0.106573 -0.758451 -0.104842 0.199034 0.152919 -0.582583
4 -0.298136 0.136319 -0.081146 0.554695 0.741906 0.166840
5 -0.592130 0.381800 -0.005846 0.167060 -0.293269 -0.624228

The first component has low education and income, high unemployment, and poor overall health. As we might expect, the counties that project into the negative side of this vector (i.e. better health, better education and more money) have higher ranks according to The New York Times.

In [88]:
plot(df['rank'], df['v1'], 'o')
xlabel("Rank according to the NYT")
ylabel("First Principal Component Value")
title("First principal component matches rank")

Out[88]:
<matplotlib.text.Text at 0xacf91e2c>

What about the other components? It's important to note that since principal components are eigenvectors, they are linearly uncorrelated with each other. That means that subsequent PCs capture variance that is different than the Plotting PC1 against PC2 shows that the outliers on PC1 universally occur on positive PC2 scores.

In [95]:
plot(df.v1, df.v2, 'o')
xlabel("First PC")
ylabel("Second PC")

Out[95]:
<matplotlib.text.Text at 0xacd1b4cc>

What does PC2 represent? Unemployment is the most important factor, with education being a second, less important factor. Keep in mind that unemployment is coded as percent of the population that is unemployed: we want that number to be lower. Therefore, a county that scores positive on PC2 probably has high unemployment, especially for their education level.

Sure enough, the most negative counties have very low unemployment: they tend to cluster in the middle of the country where farming or resource extraction dominate. The other extreme includes NYC, some hard-hit Californian areas, as well as well-to-do, highly educated counties where unemployment is still higher than one would expect given the population.

In [90]:
df.sort('v2')

Out[90]:
County id rank education income unemployment disability life obesity v1 v2 v3 v4 v5 v6
2647 McMullen, Texas 48311 1537 6.6 38438 2.6 0.0 77.4 39 -0.168381 -2.738321 0.371478 0.113751 -0.136297 0.271804
2633 La Salle, Texas 48283 2295 5.4 26731 4.3 1.1 77.4 47 1.706565 -2.568086 0.325654 -0.088072 0.571618 -1.014638
2683 Reagan, Texas 48383 1244 8.1 55968 2.4 0.4 77.8 42 -0.525734 -2.391547 -0.169892 -1.292065 0.471666 0.313923
2622 Kenedy, Texas 48261 1568 12.1 37206 3.3 0.0 77.1 40 -0.140570 -2.326024 0.223772 0.078278 -0.422819 -0.245008
2606 Hudspeth, Texas 48229 1749 9.6 22083 5.7 0.1 79.7 44 0.678369 -2.269859 1.424954 0.595248 0.508386 -1.481947
1679 Logan, Nebraska 31113 669 14.8 44444 2.1 0.0 78.5 36 -1.358469 -2.186498 0.012100 0.325461 -0.070672 0.194691
2049 Holmes, Ohio 39075 1741 8.3 44091 4.7 0.6 77.6 46 0.662116 -2.106735 0.314982 -1.126914 0.443911 -0.607082
2544 Crockett, Texas 48105 1608 9.6 49850 4.0 0.2 75.8 43 0.244940 -2.101557 -0.050098 -1.251244 -0.441892 0.107378
1680 Loup, Nebraska 31115 865 13.5 37708 3.2 0.0 79.6 36 -1.168023 -2.093503 0.595233 0.779039 0.239214 -0.086656
1982 Logan, North Dakota 38047 1011 12.9 47857 3.1 0.4 79.4 41 -0.819579 -2.061432 0.159372 -0.491122 0.702043 -0.348033
1572 Carter, Montana 30011 1074 13.7 38750 3.2 0.0 78.0 36 -0.879750 -2.053991 0.301426 0.601643 -0.302432 0.149769
1579 Fallon, Montana 30025 363 16.5 49732 1.9 0.0 79.0 36 -1.754850 -2.053413 -0.104699 0.048448 0.109399 0.183923
1641 Colfax, Nebraska 31037 1005 12.4 48561 3.4 0.3 79.3 41 -0.804895 -2.028275 0.270312 -0.578826 0.634607 -0.293427
1583 Garfield, Montana 30033 981 13.9 40125 3.4 0.0 79.0 37 -1.024655 -2.017318 0.471680 0.444332 0.080959 -0.098333
1621 Wibaux, Montana 30109 662 16.3 45625 2.8 0.0 79.0 38 -1.296662 -1.994643 0.137461 0.008184 0.118899 -0.224812
1658 Garfield, Nebraska 31071 753 13.3 39241 3.2 0.0 79.2 34 -1.335090 -1.978245 0.537067 0.917737 0.050290 0.317838
1984 McIntosh, North Dakota 38051 1037 18.6 36327 3.3 0.1 79.4 40 -0.857690 -1.974779 0.271421 0.393920 0.181220 -1.009332
1971 Dunn, North Dakota 38025 280 17.4 54539 1.5 0.3 79.8 37 -1.970254 -1.971905 -0.331912 -0.259209 0.606792 0.072817
1965 Burke, North Dakota 38013 572 15.5 52031 1.8 0.5 79.0 37 -1.519530 -1.966513 -0.376498 -0.153028 0.459428 0.257452
2555 Dimmit, Texas 48127 2343 9.2 32960 5.1 1.2 76.7 46 1.469598 -1.948605 0.072583 -0.370924 0.262570 -0.818500
2547 Dallam, Texas 48111 1451 10.7 42781 3.9 0.5 77.4 39 -0.198289 -1.918457 0.146004 -0.076546 -0.016097 0.206997
2609 Irion, Texas 48235 1232 10.2 48646 4.3 -0.2 76.9 38 -0.594135 -1.918011 0.447119 -0.501395 -0.499449 0.536733
2004 Steele, North Dakota 38091 490 17.7 50536 2.5 0.0 79.1 38 -1.592246 -1.909074 -0.048638 -0.272575 0.171057 -0.173237
2367 Jones, South Dakota 46075 638 14.1 44531 3.5 0.0 79.7 37 -1.329273 -1.901544 0.536091 0.208509 0.361460 -0.063817
1637 Chase, Nebraska 31029 724 14.1 43125 2.5 0.6 79.1 35 -1.247749 -1.891550 -0.038537 0.669723 0.390429 0.350631
2546 Culberson, Texas 48109 1396 16.4 33500 3.5 1.0 79.7 42 -0.165729 -1.879379 0.034677 0.454834 0.843293 -1.145783
1659 Gosper, Nebraska 31073 524 16.6 44241 2.8 0.0 78.6 35 -1.484003 -1.859797 0.110800 0.474865 -0.159214 0.194868
1975 Golden Valley, North Dakota 38033 545 20.1 35742 2.9 0.0 79.8 36 -1.478375 -1.859668 0.286786 1.011590 0.068714 -0.623555
1671 Johnson, Nebraska 31097 1364 12.3 42689 3.7 0.6 78.0 39 -0.380506 -1.857601 0.080521 0.016561 0.188887 0.030001
2744 Zapata, Texas 48505 2474 8.5 28617 6.5 1.0 76.9 48 1.898830 -1.856722 0.583528 -0.442634 0.240796 -1.259001
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1752 Hunterdon, New Jersey 34019 37 48.1 105880 7.1 0.3 81.4 26 -6.123408 2.639655 -0.919383 -1.818132 0.114297 1.010727
209 Trinity, California 6105 2167 19.7 36569 15.8 2.6 76.6 33 1.180558 2.644222 1.268836 1.101342 -0.207782 0.654502
67 Apache, Arizona 4001 2987 10.3 31615 19.6 2.7 73.0 43 4.015829 2.649024 1.831213 -0.414968 -0.853883 0.187434
1760 Somerset, New Jersey 34035 53 50.6 98571 7.4 0.4 82.0 28 -5.828805 2.652625 -0.833883 -1.536349 0.267084 0.250555
203 Siskiyou, California 6093 2073 23.1 37948 15.3 2.6 76.6 34 1.021513 2.669983 0.971107 0.938978 -0.254577 0.326512
1021 Knott, Kentucky 21119 3079 13.1 32482 13.2 7.7 73.3 46 5.218588 2.670728 -2.188518 0.316083 1.947160 0.315175
162 Colusa, California 6011 1489 13.5 52165 20.0 0.4 79.2 39 0.629586 2.675873 3.705865 -1.151179 0.087993 0.029347
214 Yuba, California 6115 2402 13.6 46641 16.9 3.0 76.4 38 1.915647 2.702944 1.348718 -0.276588 0.406230 0.760391
416 Fulton, Georgia 13121 979 48.4 57664 9.6 1.5 77.2 29 -2.509756 2.717691 -1.018501 0.711204 -1.253012 -0.262174
199 Santa Clara, California 6085 134 46.0 90747 8.4 0.7 82.8 26 -5.489764 2.721985 -0.281370 -0.748790 0.649295 0.530091
177 Marin, California 6041 26 54.6 90962 6.3 0.5 83.2 22 -6.686936 2.741049 -0.924574 -0.049169 0.304586 0.449857
2894 Fairfax City, Virginia 51600 68 53.2 98563 6.2 0.5 78.8 22 -5.985078 2.807422 -1.734055 -0.861944 -1.060711 1.385768
207 Sutter, California 6101 1385 18.4 50510 17.6 1.2 78.3 33 0.068244 2.827716 2.499369 -0.010336 -0.168513 0.751828
194 San Francisco, California 6075 164 52.0 73802 7.3 1.1 81.6 20 -5.492779 2.859470 -0.880171 1.248345 -0.160972 0.660413
2795 Arlington, Virginia 51013 2 71.2 102459 3.6 0.3 82.0 25 -7.676468 2.914311 -2.585023 -1.006722 -0.442720 -0.593924
232 Douglas, Colorado 8035 13 54.8 101108 6.4 0.2 82.8 21 -7.178944 2.915148 -0.968566 -0.651540 0.088361 0.917882
1056 Owsley, Kentucky 21189 3118 10.9 19624 10.9 11.9 71.4 50 7.861437 2.958594 -4.865725 1.282564 3.528447 0.361832
382 Chattahoochee, Georgia 13053 1529 28.1 48684 18.3 0.6 75.3 38 0.681061 3.073121 1.967593 -0.789482 -1.672015 -0.367797
2921 Williamsburg City, Virginia 51830 516 49.5 50865 13.4 0.6 80.8 34 -2.477910 3.100420 0.852958 0.441046 -0.505254 -1.904013
158 Alpine, California 6003 1138 31.7 59931 13.3 2.8 79.7 33 -1.066325 3.101986 0.244601 0.092170 0.886168 0.215939
65 Wilcox, Alabama 1131 3119 11.1 24212 16.4 8.0 73.6 50 6.405481 3.178160 -1.426958 0.240858 2.230106 -0.394493
1080 Wolfe, Kentucky 21237 3121 11.3 21168 12.8 10.8 72.6 48 7.136569 3.228510 -3.720147 1.272425 3.264457 0.341526
974 Breathitt, Kentucky 21025 3129 10.2 23049 11.9 11.5 71.4 47 7.425312 3.258429 -4.433426 1.346232 3.248916 0.889193
1827 New York, New York 36061 365 58.1 68370 7.7 1.8 81.7 22 -5.102383 3.335175 -1.353639 1.520852 0.014594 -0.125626
289 District of Columbia, District of Columbia 11001 1273 51.2 64267 9.1 2.9 76.4 29 -2.260332 3.341530 -2.192463 0.531909 -0.836304 0.080651
987 Clay, Kentucky 21051 3135 7.4 22296 12.7 11.7 71.4 45 7.535257 3.472849 -4.197241 1.633778 3.334481 1.369707
1038 Magoffin, Kentucky 21153 3127 9.6 25461 16.5 9.6 73.7 45 6.484034 3.933865 -2.084526 1.100257 2.928622 0.690853
2895 Falls Church City, Virginia 51610 18 72.8 122844 6.8 0.1 82.0 19 -8.834204 4.454561 -2.103452 -1.653483 -0.685137 0.783617
169 Imperial, California 6025 1706 13.2 41255 28.3 0.1 80.1 40 1.725613 4.524321 5.965237 -0.892411 -0.075614 -0.631711
81 Yuma, Arizona 4027 1800 14.2 41156 27.5 0.6 79.9 40 1.819515 4.534145 5.462757 -0.768474 0.092388 -0.599200

3112 rows × 15 columns

We're going to stop there. As the skree plot below shows, the first two singular vectors account for most of the variance in this data set. That's not terribly suprising: if we look at the "best" places in PC1 we nearly get the NYT's results. If we look at PC2, we get those places where employment is great but where education is less important.

In [21]:
plot(d ** 2)
title("Skree plot")

Out[21]:
<matplotlib.text.Text at 0xad538d4c>