In [1]:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import patsy
%pylab inline

Populating the interactive namespace from numpy and matplotlib

This is our second look at the well-being dataset from The New York Times. In Part 1 we found that the outlying counties in a regression of income against education were all college towns.

In this note, we'll examine the Principal Components of the data set. Principal Components are the eigenvectors of the variance-covariance matrix $X^TX$. This means they define a basis function, and most importantly, each PC accounts for the maximal possible variance in left in the data.

Put anoher way, let's say we want to produce a low-dimensional representation of our data. (That's basically what the NYT did in their original piece!) PCA provides the "best" representation at each dimensionality, meaning it defines the low-dimensional manifold that most closely matches the data. To make that clear, let's explore the data in more detail.

In [4]:

df = pd.read_table("data/unemployment.tsv")

PCA¶

In [5]:

df_pca = df[['education', 'income', 'unemployment', 'disability', 'life', 'obesity']]
df_pca -= df_pca.mean()
df_pca /= df_pca.var() ** .5

There's a clear linear relationship here, which we can identify with statsmodels:

In [6]:

var = df_pca.values.T.dot(df_pca.values)

In [7]:

U, d, Vt = np.linalg.svd(df_pca)
V = Vt.T

In [8]:

values = df_pca.values.dot(V)
for i in range(6):
    df['v' + str(i+1)] = values[:, i]

In [9]:

plot(d ** 2)

Out[9]:

[<matplotlib.lines.Line2D at 0xbb746ec>]

The first principal component captures most of the variance in the data set. It is also stored in the first row of the following dataframe.

In [10]:

pd.DataFrame(V.T, columns=df_pca.columns)

Out[10]:

	education	income	unemployment	disability	life	obesity
0	-0.407710	-0.425636	0.292399	0.402665	-0.454805	0.445109
1	0.466734	0.220235	0.737400	0.383632	-0.012688	-0.206306
2	-0.406288	-0.175295	0.594250	-0.561673	0.364932	-0.049153
3	0.106573	-0.758451	-0.104842	0.199034	0.152919	-0.582583
4	-0.298136	0.136319	-0.081146	0.554695	0.741906	0.166840
5	-0.592130	0.381800	-0.005846	0.167060	-0.293269	-0.624228

The first component has low education and income, high unemployment, and poor overall health. As we might expect, the counties that project into the negative side of this vector (i.e. better health, better education and more money) have higher ranks according to The New York Times.

In [88]:

plot(df['rank'], df['v1'], 'o')
xlabel("Rank according to the NYT")
ylabel("First Principal Component Value")
title("First principal component matches rank")

Out[88]:

<matplotlib.text.Text at 0xacf91e2c>

What about the other components? It's important to note that since principal components are eigenvectors, they are linearly uncorrelated with each other. That means that subsequent PCs capture variance that is different than the Plotting PC1 against PC2 shows that the outliers on PC1 universally occur on positive PC2 scores.

In [95]:

plot(df.v1, df.v2, 'o')
xlabel("First PC")
ylabel("Second PC")

Out[95]:

<matplotlib.text.Text at 0xacd1b4cc>

What does PC2 represent? Unemployment is the most important factor, with education being a second, less important factor. Keep in mind that unemployment is coded as percent of the population that is unemployed: we want that number to be lower. Therefore, a county that scores positive on PC2 probably has high unemployment, especially for their education level.

Sure enough, the most negative counties have very low unemployment: they tend to cluster in the middle of the country where farming or resource extraction dominate. The other extreme includes NYC, some hard-hit Californian areas, as well as well-to-do, highly educated counties where unemployment is still higher than one would expect given the population.

In [90]:

df.sort('v2')

Out[90]:

	County	id	rank	education	income	unemployment	disability	life	obesity	v1	v2	v3	v4	v5	v6
2647	McMullen, Texas	48311	1537	6.6	38438	2.6	0.0	77.4	39	-0.168381	-2.738321	0.371478	0.113751	-0.136297	0.271804
2633	La Salle, Texas	48283	2295	5.4	26731	4.3	1.1	77.4	47	1.706565	-2.568086	0.325654	-0.088072	0.571618	-1.014638
2683	Reagan, Texas	48383	1244	8.1	55968	2.4	0.4	77.8	42	-0.525734	-2.391547	-0.169892	-1.292065	0.471666	0.313923
2622	Kenedy, Texas	48261	1568	12.1	37206	3.3	0.0	77.1	40	-0.140570	-2.326024	0.223772	0.078278	-0.422819	-0.245008
2606	Hudspeth, Texas	48229	1749	9.6	22083	5.7	0.1	79.7	44	0.678369	-2.269859	1.424954	0.595248	0.508386	-1.481947
1679	Logan, Nebraska	31113	669	14.8	44444	2.1	0.0	78.5	36	-1.358469	-2.186498	0.012100	0.325461	-0.070672	0.194691
2049	Holmes, Ohio	39075	1741	8.3	44091	4.7	0.6	77.6	46	0.662116	-2.106735	0.314982	-1.126914	0.443911	-0.607082
2544	Crockett, Texas	48105	1608	9.6	49850	4.0	0.2	75.8	43	0.244940	-2.101557	-0.050098	-1.251244	-0.441892	0.107378
1680	Loup, Nebraska	31115	865	13.5	37708	3.2	0.0	79.6	36	-1.168023	-2.093503	0.595233	0.779039	0.239214	-0.086656
1982	Logan, North Dakota	38047	1011	12.9	47857	3.1	0.4	79.4	41	-0.819579	-2.061432	0.159372	-0.491122	0.702043	-0.348033
1572	Carter, Montana	30011	1074	13.7	38750	3.2	0.0	78.0	36	-0.879750	-2.053991	0.301426	0.601643	-0.302432	0.149769
1579	Fallon, Montana	30025	363	16.5	49732	1.9	0.0	79.0	36	-1.754850	-2.053413	-0.104699	0.048448	0.109399	0.183923
1641	Colfax, Nebraska	31037	1005	12.4	48561	3.4	0.3	79.3	41	-0.804895	-2.028275	0.270312	-0.578826	0.634607	-0.293427
1583	Garfield, Montana	30033	981	13.9	40125	3.4	0.0	79.0	37	-1.024655	-2.017318	0.471680	0.444332	0.080959	-0.098333
1621	Wibaux, Montana	30109	662	16.3	45625	2.8	0.0	79.0	38	-1.296662	-1.994643	0.137461	0.008184	0.118899	-0.224812
1658	Garfield, Nebraska	31071	753	13.3	39241	3.2	0.0	79.2	34	-1.335090	-1.978245	0.537067	0.917737	0.050290	0.317838
1984	McIntosh, North Dakota	38051	1037	18.6	36327	3.3	0.1	79.4	40	-0.857690	-1.974779	0.271421	0.393920	0.181220	-1.009332
1971	Dunn, North Dakota	38025	280	17.4	54539	1.5	0.3	79.8	37	-1.970254	-1.971905	-0.331912	-0.259209	0.606792	0.072817
1965	Burke, North Dakota	38013	572	15.5	52031	1.8	0.5	79.0	37	-1.519530	-1.966513	-0.376498	-0.153028	0.459428	0.257452
2555	Dimmit, Texas	48127	2343	9.2	32960	5.1	1.2	76.7	46	1.469598	-1.948605	0.072583	-0.370924	0.262570	-0.818500
2547	Dallam, Texas	48111	1451	10.7	42781	3.9	0.5	77.4	39	-0.198289	-1.918457	0.146004	-0.076546	-0.016097	0.206997
2609	Irion, Texas	48235	1232	10.2	48646	4.3	-0.2	76.9	38	-0.594135	-1.918011	0.447119	-0.501395	-0.499449	0.536733
2004	Steele, North Dakota	38091	490	17.7	50536	2.5	0.0	79.1	38	-1.592246	-1.909074	-0.048638	-0.272575	0.171057	-0.173237
2367	Jones, South Dakota	46075	638	14.1	44531	3.5	0.0	79.7	37	-1.329273	-1.901544	0.536091	0.208509	0.361460	-0.063817
1637	Chase, Nebraska	31029	724	14.1	43125	2.5	0.6	79.1	35	-1.247749	-1.891550	-0.038537	0.669723	0.390429	0.350631
2546	Culberson, Texas	48109	1396	16.4	33500	3.5	1.0	79.7	42	-0.165729	-1.879379	0.034677	0.454834	0.843293	-1.145783
1659	Gosper, Nebraska	31073	524	16.6	44241	2.8	0.0	78.6	35	-1.484003	-1.859797	0.110800	0.474865	-0.159214	0.194868
1975	Golden Valley, North Dakota	38033	545	20.1	35742	2.9	0.0	79.8	36	-1.478375	-1.859668	0.286786	1.011590	0.068714	-0.623555
1671	Johnson, Nebraska	31097	1364	12.3	42689	3.7	0.6	78.0	39	-0.380506	-1.857601	0.080521	0.016561	0.188887	0.030001
2744	Zapata, Texas	48505	2474	8.5	28617	6.5	1.0	76.9	48	1.898830	-1.856722	0.583528	-0.442634	0.240796	-1.259001
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1752	Hunterdon, New Jersey	34019	37	48.1	105880	7.1	0.3	81.4	26	-6.123408	2.639655	-0.919383	-1.818132	0.114297	1.010727
209	Trinity, California	6105	2167	19.7	36569	15.8	2.6	76.6	33	1.180558	2.644222	1.268836	1.101342	-0.207782	0.654502
67	Apache, Arizona	4001	2987	10.3	31615	19.6	2.7	73.0	43	4.015829	2.649024	1.831213	-0.414968	-0.853883	0.187434
1760	Somerset, New Jersey	34035	53	50.6	98571	7.4	0.4	82.0	28	-5.828805	2.652625	-0.833883	-1.536349	0.267084	0.250555
203	Siskiyou, California	6093	2073	23.1	37948	15.3	2.6	76.6	34	1.021513	2.669983	0.971107	0.938978	-0.254577	0.326512
1021	Knott, Kentucky	21119	3079	13.1	32482	13.2	7.7	73.3	46	5.218588	2.670728	-2.188518	0.316083	1.947160	0.315175
162	Colusa, California	6011	1489	13.5	52165	20.0	0.4	79.2	39	0.629586	2.675873	3.705865	-1.151179	0.087993	0.029347
214	Yuba, California	6115	2402	13.6	46641	16.9	3.0	76.4	38	1.915647	2.702944	1.348718	-0.276588	0.406230	0.760391
416	Fulton, Georgia	13121	979	48.4	57664	9.6	1.5	77.2	29	-2.509756	2.717691	-1.018501	0.711204	-1.253012	-0.262174
199	Santa Clara, California	6085	134	46.0	90747	8.4	0.7	82.8	26	-5.489764	2.721985	-0.281370	-0.748790	0.649295	0.530091
177	Marin, California	6041	26	54.6	90962	6.3	0.5	83.2	22	-6.686936	2.741049	-0.924574	-0.049169	0.304586	0.449857
2894	Fairfax City, Virginia	51600	68	53.2	98563	6.2	0.5	78.8	22	-5.985078	2.807422	-1.734055	-0.861944	-1.060711	1.385768
207	Sutter, California	6101	1385	18.4	50510	17.6	1.2	78.3	33	0.068244	2.827716	2.499369	-0.010336	-0.168513	0.751828
194	San Francisco, California	6075	164	52.0	73802	7.3	1.1	81.6	20	-5.492779	2.859470	-0.880171	1.248345	-0.160972	0.660413
2795	Arlington, Virginia	51013	2	71.2	102459	3.6	0.3	82.0	25	-7.676468	2.914311	-2.585023	-1.006722	-0.442720	-0.593924
232	Douglas, Colorado	8035	13	54.8	101108	6.4	0.2	82.8	21	-7.178944	2.915148	-0.968566	-0.651540	0.088361	0.917882
1056	Owsley, Kentucky	21189	3118	10.9	19624	10.9	11.9	71.4	50	7.861437	2.958594	-4.865725	1.282564	3.528447	0.361832
382	Chattahoochee, Georgia	13053	1529	28.1	48684	18.3	0.6	75.3	38	0.681061	3.073121	1.967593	-0.789482	-1.672015	-0.367797
2921	Williamsburg City, Virginia	51830	516	49.5	50865	13.4	0.6	80.8	34	-2.477910	3.100420	0.852958	0.441046	-0.505254	-1.904013
158	Alpine, California	6003	1138	31.7	59931	13.3	2.8	79.7	33	-1.066325	3.101986	0.244601	0.092170	0.886168	0.215939
65	Wilcox, Alabama	1131	3119	11.1	24212	16.4	8.0	73.6	50	6.405481	3.178160	-1.426958	0.240858	2.230106	-0.394493
1080	Wolfe, Kentucky	21237	3121	11.3	21168	12.8	10.8	72.6	48	7.136569	3.228510	-3.720147	1.272425	3.264457	0.341526
974	Breathitt, Kentucky	21025	3129	10.2	23049	11.9	11.5	71.4	47	7.425312	3.258429	-4.433426	1.346232	3.248916	0.889193
1827	New York, New York	36061	365	58.1	68370	7.7	1.8	81.7	22	-5.102383	3.335175	-1.353639	1.520852	0.014594	-0.125626
289	District of Columbia, District of Columbia	11001	1273	51.2	64267	9.1	2.9	76.4	29	-2.260332	3.341530	-2.192463	0.531909	-0.836304	0.080651
987	Clay, Kentucky	21051	3135	7.4	22296	12.7	11.7	71.4	45	7.535257	3.472849	-4.197241	1.633778	3.334481	1.369707
1038	Magoffin, Kentucky	21153	3127	9.6	25461	16.5	9.6	73.7	45	6.484034	3.933865	-2.084526	1.100257	2.928622	0.690853
2895	Falls Church City, Virginia	51610	18	72.8	122844	6.8	0.1	82.0	19	-8.834204	4.454561	-2.103452	-1.653483	-0.685137	0.783617
169	Imperial, California	6025	1706	13.2	41255	28.3	0.1	80.1	40	1.725613	4.524321	5.965237	-0.892411	-0.075614	-0.631711
81	Yuma, Arizona	4027	1800	14.2	41156	27.5	0.6	79.9	40	1.819515	4.534145	5.462757	-0.768474	0.092388	-0.599200

3112 rows × 15 columns

We're going to stop there. As the skree plot below shows, the first two singular vectors account for most of the variance in this data set. That's not terribly suprising: if we look at the "best" places in PC1 we nearly get the NYT's results. If we look at PC2, we get those places where employment is great but where education is less important.

In [21]:

plot(d ** 2)
title("Skree plot")

Out[21]:

<matplotlib.text.Text at 0xad538d4c>

In [ ]: