import pandas as pd
import numpy as np
import statsmodels.api as sm
import patsy
%pylab inline
Populating the interactive namespace from numpy and matplotlib
This is our second look at the well-being dataset from The New York Times. In Part 1 we found that the outlying counties in a regression of income against education were all college towns.
In this note, we'll examine the Principal Components of the data set. Principal Components are the eigenvectors of the variance-covariance matrix $X^TX$. This means they define a basis function, and most importantly, each PC accounts for the maximal possible variance in left in the data.
Put anoher way, let's say we want to produce a low-dimensional representation of our data. (That's basically what the NYT did in their original piece!) PCA provides the "best" representation at each dimensionality, meaning it defines the low-dimensional manifold that most closely matches the data. To make that clear, let's explore the data in more detail.
df = pd.read_table("data/unemployment.tsv")
df_pca = df[['education', 'income', 'unemployment', 'disability', 'life', 'obesity']]
df_pca -= df_pca.mean()
df_pca /= df_pca.var() ** .5
There's a clear linear relationship here, which we can identify with statsmodels:
var = df_pca.values.T.dot(df_pca.values)
U, d, Vt = np.linalg.svd(df_pca)
V = Vt.T
values = df_pca.values.dot(V)
for i in range(6):
df['v' + str(i+1)] = values[:, i]
plot(d ** 2)
[<matplotlib.lines.Line2D at 0xbb746ec>]
The first principal component captures most of the variance in the data set. It is also stored in the first row of the following dataframe.
pd.DataFrame(V.T, columns=df_pca.columns)
education | income | unemployment | disability | life | obesity | |
---|---|---|---|---|---|---|
0 | -0.407710 | -0.425636 | 0.292399 | 0.402665 | -0.454805 | 0.445109 |
1 | 0.466734 | 0.220235 | 0.737400 | 0.383632 | -0.012688 | -0.206306 |
2 | -0.406288 | -0.175295 | 0.594250 | -0.561673 | 0.364932 | -0.049153 |
3 | 0.106573 | -0.758451 | -0.104842 | 0.199034 | 0.152919 | -0.582583 |
4 | -0.298136 | 0.136319 | -0.081146 | 0.554695 | 0.741906 | 0.166840 |
5 | -0.592130 | 0.381800 | -0.005846 | 0.167060 | -0.293269 | -0.624228 |
The first component has low education and income, high unemployment, and poor overall health. As we might expect, the counties that project into the negative side of this vector (i.e. better health, better education and more money) have higher ranks according to The New York Times.
plot(df['rank'], df['v1'], 'o')
xlabel("Rank according to the NYT")
ylabel("First Principal Component Value")
title("First principal component matches rank")
<matplotlib.text.Text at 0xacf91e2c>
What about the other components? It's important to note that since principal components are eigenvectors, they are linearly uncorrelated with each other. That means that subsequent PCs capture variance that is different than the Plotting PC1 against PC2 shows that the outliers on PC1 universally occur on positive PC2 scores.
plot(df.v1, df.v2, 'o')
xlabel("First PC")
ylabel("Second PC")
<matplotlib.text.Text at 0xacd1b4cc>
What does PC2 represent? Unemployment is the most important factor, with education being a second, less important factor. Keep in mind that unemployment is coded as percent of the population that is unemployed: we want that number to be lower. Therefore, a county that scores positive on PC2 probably has high unemployment, especially for their education level.
Sure enough, the most negative counties have very low unemployment: they tend to cluster in the middle of the country where farming or resource extraction dominate. The other extreme includes NYC, some hard-hit Californian areas, as well as well-to-do, highly educated counties where unemployment is still higher than one would expect given the population.
df.sort('v2')
County | id | rank | education | income | unemployment | disability | life | obesity | v1 | v2 | v3 | v4 | v5 | v6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2647 | McMullen, Texas | 48311 | 1537 | 6.6 | 38438 | 2.6 | 0.0 | 77.4 | 39 | -0.168381 | -2.738321 | 0.371478 | 0.113751 | -0.136297 | 0.271804 |
2633 | La Salle, Texas | 48283 | 2295 | 5.4 | 26731 | 4.3 | 1.1 | 77.4 | 47 | 1.706565 | -2.568086 | 0.325654 | -0.088072 | 0.571618 | -1.014638 |
2683 | Reagan, Texas | 48383 | 1244 | 8.1 | 55968 | 2.4 | 0.4 | 77.8 | 42 | -0.525734 | -2.391547 | -0.169892 | -1.292065 | 0.471666 | 0.313923 |
2622 | Kenedy, Texas | 48261 | 1568 | 12.1 | 37206 | 3.3 | 0.0 | 77.1 | 40 | -0.140570 | -2.326024 | 0.223772 | 0.078278 | -0.422819 | -0.245008 |
2606 | Hudspeth, Texas | 48229 | 1749 | 9.6 | 22083 | 5.7 | 0.1 | 79.7 | 44 | 0.678369 | -2.269859 | 1.424954 | 0.595248 | 0.508386 | -1.481947 |
1679 | Logan, Nebraska | 31113 | 669 | 14.8 | 44444 | 2.1 | 0.0 | 78.5 | 36 | -1.358469 | -2.186498 | 0.012100 | 0.325461 | -0.070672 | 0.194691 |
2049 | Holmes, Ohio | 39075 | 1741 | 8.3 | 44091 | 4.7 | 0.6 | 77.6 | 46 | 0.662116 | -2.106735 | 0.314982 | -1.126914 | 0.443911 | -0.607082 |
2544 | Crockett, Texas | 48105 | 1608 | 9.6 | 49850 | 4.0 | 0.2 | 75.8 | 43 | 0.244940 | -2.101557 | -0.050098 | -1.251244 | -0.441892 | 0.107378 |
1680 | Loup, Nebraska | 31115 | 865 | 13.5 | 37708 | 3.2 | 0.0 | 79.6 | 36 | -1.168023 | -2.093503 | 0.595233 | 0.779039 | 0.239214 | -0.086656 |
1982 | Logan, North Dakota | 38047 | 1011 | 12.9 | 47857 | 3.1 | 0.4 | 79.4 | 41 | -0.819579 | -2.061432 | 0.159372 | -0.491122 | 0.702043 | -0.348033 |
1572 | Carter, Montana | 30011 | 1074 | 13.7 | 38750 | 3.2 | 0.0 | 78.0 | 36 | -0.879750 | -2.053991 | 0.301426 | 0.601643 | -0.302432 | 0.149769 |
1579 | Fallon, Montana | 30025 | 363 | 16.5 | 49732 | 1.9 | 0.0 | 79.0 | 36 | -1.754850 | -2.053413 | -0.104699 | 0.048448 | 0.109399 | 0.183923 |
1641 | Colfax, Nebraska | 31037 | 1005 | 12.4 | 48561 | 3.4 | 0.3 | 79.3 | 41 | -0.804895 | -2.028275 | 0.270312 | -0.578826 | 0.634607 | -0.293427 |
1583 | Garfield, Montana | 30033 | 981 | 13.9 | 40125 | 3.4 | 0.0 | 79.0 | 37 | -1.024655 | -2.017318 | 0.471680 | 0.444332 | 0.080959 | -0.098333 |
1621 | Wibaux, Montana | 30109 | 662 | 16.3 | 45625 | 2.8 | 0.0 | 79.0 | 38 | -1.296662 | -1.994643 | 0.137461 | 0.008184 | 0.118899 | -0.224812 |
1658 | Garfield, Nebraska | 31071 | 753 | 13.3 | 39241 | 3.2 | 0.0 | 79.2 | 34 | -1.335090 | -1.978245 | 0.537067 | 0.917737 | 0.050290 | 0.317838 |
1984 | McIntosh, North Dakota | 38051 | 1037 | 18.6 | 36327 | 3.3 | 0.1 | 79.4 | 40 | -0.857690 | -1.974779 | 0.271421 | 0.393920 | 0.181220 | -1.009332 |
1971 | Dunn, North Dakota | 38025 | 280 | 17.4 | 54539 | 1.5 | 0.3 | 79.8 | 37 | -1.970254 | -1.971905 | -0.331912 | -0.259209 | 0.606792 | 0.072817 |
1965 | Burke, North Dakota | 38013 | 572 | 15.5 | 52031 | 1.8 | 0.5 | 79.0 | 37 | -1.519530 | -1.966513 | -0.376498 | -0.153028 | 0.459428 | 0.257452 |
2555 | Dimmit, Texas | 48127 | 2343 | 9.2 | 32960 | 5.1 | 1.2 | 76.7 | 46 | 1.469598 | -1.948605 | 0.072583 | -0.370924 | 0.262570 | -0.818500 |
2547 | Dallam, Texas | 48111 | 1451 | 10.7 | 42781 | 3.9 | 0.5 | 77.4 | 39 | -0.198289 | -1.918457 | 0.146004 | -0.076546 | -0.016097 | 0.206997 |
2609 | Irion, Texas | 48235 | 1232 | 10.2 | 48646 | 4.3 | -0.2 | 76.9 | 38 | -0.594135 | -1.918011 | 0.447119 | -0.501395 | -0.499449 | 0.536733 |
2004 | Steele, North Dakota | 38091 | 490 | 17.7 | 50536 | 2.5 | 0.0 | 79.1 | 38 | -1.592246 | -1.909074 | -0.048638 | -0.272575 | 0.171057 | -0.173237 |
2367 | Jones, South Dakota | 46075 | 638 | 14.1 | 44531 | 3.5 | 0.0 | 79.7 | 37 | -1.329273 | -1.901544 | 0.536091 | 0.208509 | 0.361460 | -0.063817 |
1637 | Chase, Nebraska | 31029 | 724 | 14.1 | 43125 | 2.5 | 0.6 | 79.1 | 35 | -1.247749 | -1.891550 | -0.038537 | 0.669723 | 0.390429 | 0.350631 |
2546 | Culberson, Texas | 48109 | 1396 | 16.4 | 33500 | 3.5 | 1.0 | 79.7 | 42 | -0.165729 | -1.879379 | 0.034677 | 0.454834 | 0.843293 | -1.145783 |
1659 | Gosper, Nebraska | 31073 | 524 | 16.6 | 44241 | 2.8 | 0.0 | 78.6 | 35 | -1.484003 | -1.859797 | 0.110800 | 0.474865 | -0.159214 | 0.194868 |
1975 | Golden Valley, North Dakota | 38033 | 545 | 20.1 | 35742 | 2.9 | 0.0 | 79.8 | 36 | -1.478375 | -1.859668 | 0.286786 | 1.011590 | 0.068714 | -0.623555 |
1671 | Johnson, Nebraska | 31097 | 1364 | 12.3 | 42689 | 3.7 | 0.6 | 78.0 | 39 | -0.380506 | -1.857601 | 0.080521 | 0.016561 | 0.188887 | 0.030001 |
2744 | Zapata, Texas | 48505 | 2474 | 8.5 | 28617 | 6.5 | 1.0 | 76.9 | 48 | 1.898830 | -1.856722 | 0.583528 | -0.442634 | 0.240796 | -1.259001 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1752 | Hunterdon, New Jersey | 34019 | 37 | 48.1 | 105880 | 7.1 | 0.3 | 81.4 | 26 | -6.123408 | 2.639655 | -0.919383 | -1.818132 | 0.114297 | 1.010727 |
209 | Trinity, California | 6105 | 2167 | 19.7 | 36569 | 15.8 | 2.6 | 76.6 | 33 | 1.180558 | 2.644222 | 1.268836 | 1.101342 | -0.207782 | 0.654502 |
67 | Apache, Arizona | 4001 | 2987 | 10.3 | 31615 | 19.6 | 2.7 | 73.0 | 43 | 4.015829 | 2.649024 | 1.831213 | -0.414968 | -0.853883 | 0.187434 |
1760 | Somerset, New Jersey | 34035 | 53 | 50.6 | 98571 | 7.4 | 0.4 | 82.0 | 28 | -5.828805 | 2.652625 | -0.833883 | -1.536349 | 0.267084 | 0.250555 |
203 | Siskiyou, California | 6093 | 2073 | 23.1 | 37948 | 15.3 | 2.6 | 76.6 | 34 | 1.021513 | 2.669983 | 0.971107 | 0.938978 | -0.254577 | 0.326512 |
1021 | Knott, Kentucky | 21119 | 3079 | 13.1 | 32482 | 13.2 | 7.7 | 73.3 | 46 | 5.218588 | 2.670728 | -2.188518 | 0.316083 | 1.947160 | 0.315175 |
162 | Colusa, California | 6011 | 1489 | 13.5 | 52165 | 20.0 | 0.4 | 79.2 | 39 | 0.629586 | 2.675873 | 3.705865 | -1.151179 | 0.087993 | 0.029347 |
214 | Yuba, California | 6115 | 2402 | 13.6 | 46641 | 16.9 | 3.0 | 76.4 | 38 | 1.915647 | 2.702944 | 1.348718 | -0.276588 | 0.406230 | 0.760391 |
416 | Fulton, Georgia | 13121 | 979 | 48.4 | 57664 | 9.6 | 1.5 | 77.2 | 29 | -2.509756 | 2.717691 | -1.018501 | 0.711204 | -1.253012 | -0.262174 |
199 | Santa Clara, California | 6085 | 134 | 46.0 | 90747 | 8.4 | 0.7 | 82.8 | 26 | -5.489764 | 2.721985 | -0.281370 | -0.748790 | 0.649295 | 0.530091 |
177 | Marin, California | 6041 | 26 | 54.6 | 90962 | 6.3 | 0.5 | 83.2 | 22 | -6.686936 | 2.741049 | -0.924574 | -0.049169 | 0.304586 | 0.449857 |
2894 | Fairfax City, Virginia | 51600 | 68 | 53.2 | 98563 | 6.2 | 0.5 | 78.8 | 22 | -5.985078 | 2.807422 | -1.734055 | -0.861944 | -1.060711 | 1.385768 |
207 | Sutter, California | 6101 | 1385 | 18.4 | 50510 | 17.6 | 1.2 | 78.3 | 33 | 0.068244 | 2.827716 | 2.499369 | -0.010336 | -0.168513 | 0.751828 |
194 | San Francisco, California | 6075 | 164 | 52.0 | 73802 | 7.3 | 1.1 | 81.6 | 20 | -5.492779 | 2.859470 | -0.880171 | 1.248345 | -0.160972 | 0.660413 |
2795 | Arlington, Virginia | 51013 | 2 | 71.2 | 102459 | 3.6 | 0.3 | 82.0 | 25 | -7.676468 | 2.914311 | -2.585023 | -1.006722 | -0.442720 | -0.593924 |
232 | Douglas, Colorado | 8035 | 13 | 54.8 | 101108 | 6.4 | 0.2 | 82.8 | 21 | -7.178944 | 2.915148 | -0.968566 | -0.651540 | 0.088361 | 0.917882 |
1056 | Owsley, Kentucky | 21189 | 3118 | 10.9 | 19624 | 10.9 | 11.9 | 71.4 | 50 | 7.861437 | 2.958594 | -4.865725 | 1.282564 | 3.528447 | 0.361832 |
382 | Chattahoochee, Georgia | 13053 | 1529 | 28.1 | 48684 | 18.3 | 0.6 | 75.3 | 38 | 0.681061 | 3.073121 | 1.967593 | -0.789482 | -1.672015 | -0.367797 |
2921 | Williamsburg City, Virginia | 51830 | 516 | 49.5 | 50865 | 13.4 | 0.6 | 80.8 | 34 | -2.477910 | 3.100420 | 0.852958 | 0.441046 | -0.505254 | -1.904013 |
158 | Alpine, California | 6003 | 1138 | 31.7 | 59931 | 13.3 | 2.8 | 79.7 | 33 | -1.066325 | 3.101986 | 0.244601 | 0.092170 | 0.886168 | 0.215939 |
65 | Wilcox, Alabama | 1131 | 3119 | 11.1 | 24212 | 16.4 | 8.0 | 73.6 | 50 | 6.405481 | 3.178160 | -1.426958 | 0.240858 | 2.230106 | -0.394493 |
1080 | Wolfe, Kentucky | 21237 | 3121 | 11.3 | 21168 | 12.8 | 10.8 | 72.6 | 48 | 7.136569 | 3.228510 | -3.720147 | 1.272425 | 3.264457 | 0.341526 |
974 | Breathitt, Kentucky | 21025 | 3129 | 10.2 | 23049 | 11.9 | 11.5 | 71.4 | 47 | 7.425312 | 3.258429 | -4.433426 | 1.346232 | 3.248916 | 0.889193 |
1827 | New York, New York | 36061 | 365 | 58.1 | 68370 | 7.7 | 1.8 | 81.7 | 22 | -5.102383 | 3.335175 | -1.353639 | 1.520852 | 0.014594 | -0.125626 |
289 | District of Columbia, District of Columbia | 11001 | 1273 | 51.2 | 64267 | 9.1 | 2.9 | 76.4 | 29 | -2.260332 | 3.341530 | -2.192463 | 0.531909 | -0.836304 | 0.080651 |
987 | Clay, Kentucky | 21051 | 3135 | 7.4 | 22296 | 12.7 | 11.7 | 71.4 | 45 | 7.535257 | 3.472849 | -4.197241 | 1.633778 | 3.334481 | 1.369707 |
1038 | Magoffin, Kentucky | 21153 | 3127 | 9.6 | 25461 | 16.5 | 9.6 | 73.7 | 45 | 6.484034 | 3.933865 | -2.084526 | 1.100257 | 2.928622 | 0.690853 |
2895 | Falls Church City, Virginia | 51610 | 18 | 72.8 | 122844 | 6.8 | 0.1 | 82.0 | 19 | -8.834204 | 4.454561 | -2.103452 | -1.653483 | -0.685137 | 0.783617 |
169 | Imperial, California | 6025 | 1706 | 13.2 | 41255 | 28.3 | 0.1 | 80.1 | 40 | 1.725613 | 4.524321 | 5.965237 | -0.892411 | -0.075614 | -0.631711 |
81 | Yuma, Arizona | 4027 | 1800 | 14.2 | 41156 | 27.5 | 0.6 | 79.9 | 40 | 1.819515 | 4.534145 | 5.462757 | -0.768474 | 0.092388 | -0.599200 |
3112 rows × 15 columns
We're going to stop there. As the skree plot below shows, the first two singular vectors account for most of the variance in this data set. That's not terribly suprising: if we look at the "best" places in PC1 we nearly get the NYT's results. If we look at PC2, we get those places where employment is great but where education is less important.
plot(d ** 2)
title("Skree plot")
<matplotlib.text.Text at 0xad538d4c>