In [52]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import patsy
%pylab inline

Populating the interactive namespace from numpy and matplotlib


The New York Times recently published a piece entitled "The Hardest Places to Live in the US" with a ranking of counties in the US by quality of life. The original piece can be found here.

Their data set includes six factors: educational attainment, household income, jobless rate, disability rate, life expectancy and obesity rate, and is available as a separate download.

I hope to play with this data more, but here's the first interesting observation.

In [18]:
df = pd.read_table("data/unemployment.tsv")


## Income vs Education at the Outliers

In [73]:
scatter(df.education, df.income)
ylabel("Median Income")
xlabel("% population with bachelor's degree")

Out[73]:
<matplotlib.text.Text at 0xc9ec8ec>

There's a clear linear relationship here, which we can identify with statsmodels:

In [74]:
y, X = patsy.dmatrices("income ~ education", df)
income_edu_model = sm.OLS(y, X).fit()
income_edu_model.summary()

Out[74]:
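Dep. Variable:      income            R-squared:           0.482
Model:              OLS               Adj. R-squared:      0.482
Method:             Least Squares     F-statistic:         2899.
Date:               Sun, 29 Jun 2014  Prob (F-statistic):  0.00
Time:               17:38:12          Log-Likelihood:      -32562.
No. Observations:   3112              AIC:                 6.513e+04
Df Residuals:       3110              BIC:                 6.514e+04
Df Model:           1

            coef        std err   t        P>|t|   [95.0% Conf. Int.]
Intercept   2.729e+04   370.307   73.700   0.000   2.66e+04   2.8e+04
education   935.1122    17.367    53.843   0.000   901.059    969.165

Omnibus:         185.316   Durbin-Watson:      1.587
Prob(Omnibus):   0         Jarque-Bera (JB):   513.123
Skew:            0.307     Prob(JB):           3.77e-112
Kurtosis:        4.892     Cond. No.           52.1
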
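To make the fit concrete, here is a small worked example. It reads the two coefficient rows as the intercept (about 2.729e+04) and the `education` slope (about 935.11), which is the order patsy produces for `income ~ education`. The helper name `predicted_income` is made up for illustration, and because the intercept is rounded in the summary, the numbers are approximate.

```python
# Coefficients as read off the summary (rounded there, so results are approximate).
intercept = 2.729e4   # predicted median income at 0% bachelor's degrees
slope = 935.1122      # dollars of median income per percentage point of education


def predicted_income(education_pct):
    """Median income the fitted line predicts for a given education level."""
    return intercept + slope * education_pct


# Whitman, Washington, as listed in the sorted table in this notebook:
# 48.8% with a bachelor's degree, $34,169 median income.
education_pct, actual_income = 48.8, 34169
gap = actual_income - predicted_income(education_pct)
print(f"predicted: {predicted_income(education_pct):,.0f}  gap: {gap:,.0f}")
```

A large negative gap like this one is what puts a county at the top of the sorted residual table.
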
In [71]:
# resid_pearson replaces the older norm_resid(): residual / sqrt(scale)
df['income_edu_resid'] = income_edu_model.resid_pearson

In [72]:
# DataFrame.sort() was removed from pandas; sort_values() is the replacement
df.sort_values('income_edu_resid')

Out[72]:
County id rank education income unemployment disability life obesity income_edu_resid
2960 Whitman, Washington 53075 591 48.8 34169 6.3 0.5 79.9 36 -4.573856
1422 Oktibbeha, Mississippi 28105 1988 43.1 29430 9.2 1.9 76.7 38 -4.504092
2902 Lexington City, Virginia 51678 1541 46.9 36511 11.4 1.1 78.4 37 -4.087782
385 Clarke, Georgia 13059 1410 40.8 33846 7.0 1.5 77.9 36 -3.729109
3089 Albany, Wyoming 56001 150 48.1 42882 4.4 0.4 79.7 29 -3.468331
718 Monroe, Indiana 18105 555 43.3 38675 6.9 0.7 79.4 32 -3.435105
1953 Watauga, North Carolina 37189 758 38.4 34848 8.3 0.6 79.6 31 -3.345997
602 Jackson, Illinois 17077 1656 36.0 32819 7.6 1.5 77.3 38 -3.320592
548 Latah, Idaho 16057 469 42.9 39466 6.4 0.8 80.4 32 -3.297611
2888 Charlottesville City, Virginia 51540 675 48.1 44535 5.9 1.2 77.1 25 -3.273250
2343 Clay, South Dakota 46027 397 41.3 38377 4.0 0.5 80.0 36 -3.249557
1401 Jefferson, Mississippi 28063 2926 20.9 20281 14.4 5.7 72.8 50 -3.133867
937 Riley, Kansas 20161 143 45.4 43364 4.5 0.4 80.3 33 -3.113480
2512 Brazos, Texas 48041 732 38.7 37638 5.5 1.0 79.5 36 -3.049840
240 Gunnison, Colorado 8051 77 51.9 50091 6.6 0.3 83.0 25 -3.036915
2913 Radford City, Virginia 51750 1424 29.5 29757 7.7 0.9 76.7 33 -2.964628
879 Douglas, Kansas 20045 157 48.4 48395 5.3 0.7 79.7 32 -2.850816
2178 Benton, Oregon 41003 159 48.6 48635 6.1 0.7 81.2 32 -2.844564
2159 Payne, Oklahoma 40119 1046 35.9 36762 4.8 1.0 77.4 36 -2.844218
1645 Dawes, Nebraska 31045 447 36.1 36974 3.9 0.4 79.4 36 -2.841271
1461 Boone, Missouri 29019 249 47.3 47786 4.6 1.0 79.1 32 -2.801293
2900 Harrisonburg City, Virginia 51660 991 35.6 36853 6.8 0.7 78.1 35 -2.800372
1926 Orange, North Carolina 37135 93 55.2 55241 6.2 0.7 80.2 24 -2.793314
1851 Tompkins, New York 36109 170 49.9 50539 6.0 0.9 80.5 31 -2.763327
552 Madison, Idaho 16065 584 31.9 33776 5.5 0.3 79.5 35 -2.755181
1112 Lincoln, Louisiana 22061 2130 33.6 35433 8.0 2.0 75.9 40 -2.747238
290 Alachua, Florida 12001 838 41.2 42818 6.6 1.4 78.1 31 -2.714412
2921 Williamsburg City, Virginia 51830 516 49.5 50865 13.4 0.6 80.8 34 -2.680710
618 McDonough, Illinois 17109 1249 32.9 35812 7.5 1.1 78.3 36 -2.625259
842 Story, Iowa 19169 67 47.7 49683 3.9 0.4 80.9 35 -2.621560
... ... ... ... ... ... ... ... ... ... ...
1174 Howard, Maryland 24027 9 59.5 107821 5.0 0.4 81.7 30 2.937432
2911 Poquoson City, Virginia 51735 52 35.1 85033 5.4 0.3 80.5 34 2.940823
2873 Spotsylvania, Virginia 51177 264 29.0 79402 5.0 0.6 78.6 37 2.949460
1163 Anne Arundel, Maryland 24003 180 36.8 86987 6.1 0.7 79.0 35 2.983818
234 Elbert, Colorado 8039 91 30.1 80811 7.2 0.2 79.8 26 2.994351
2858 Powhatan, Virginia 51145 367 25.2 76495 5.4 0.5 78.9 38 3.025749
3106 Sublette, Wyoming 56035 14 28.5 79776 3.7 0.3 80.0 29 3.048777
1167 Carroll, Maryland 24013 221 31.9 83155 6.2 0.5 79.2 36 3.072336
2837 King George, Virginia 51099 427 30.6 82195 7.0 0.6 78.0 36 3.102506
2895 Falls Church City, Virginia 51610 18 72.8 122844 6.8 0.1 82.0 19 3.242623
2817 Fairfax, Virginia 51059 3 58.2 109383 4.2 0.3 83.1 26 3.265239
1719 Elko, Nevada 32007 768 15.8 70411 5.9 0.4 78.1 39 3.345107
1761 Sussex, New Jersey 34037 381 31.6 85507 9.1 0.6 79.0 33 3.383017
3107 Sweetwater, Wyoming 56037 362 17.0 72139 4.6 0.5 78.4 35 3.416609
1178 Queen Annes, Maryland 24035 149 31.2 86013 6.2 0.4 79.7 36 3.486876
1848 Suffolk, New York 36103 208 32.6 87778 7.6 0.7 80.2 33 3.540673
1826 Nassau, New York 36059 63 41.4 97049 7.1 0.5 81.6 30 3.663648
1723 Lander, Nevada 32015 1187 12.8 70341 5.3 0.3 77.2 41 3.667921
1179 St. Marys, Maryland 24037 461 28.4 85032 5.9 0.8 78.3 38 3.680106
2818 Fauquier, Virginia 51061 136 32.0 88687 4.7 0.4 78.6 34 3.714165
1836 Putnam, New York 36079 66 38.8 95259 6.7 0.4 80.6 33 3.739330
2527 Chambers, Texas 48071 1109 16.8 75200 7.7 0.9 77.8 37 3.799928
3091 Campbell, Wyoming 56005 763 18.2 77090 4.3 0.3 76.7 39 3.868476
2578 Glasscock, Texas 48173 418 17.5 76563 4.3 0.0 77.8 37 3.883533
1752 Hunterdon, New Jersey 34019 37 48.1 105880 7.1 0.3 81.4 26 3.966447
2861 Prince William, Virginia 51153 42 37.7 96160 4.9 0.4 80.5 35 3.967057
2874 Stafford, Virginia 51179 99 35.5 96355 4.9 0.4 79.4 36 4.232858
1165 Calvert, Maryland 24009 296 29.5 92395 5.7 0.6 78.8 37 4.427664
2841 Loudoun, Virginia 51107 4 57.9 122068 4.2 0.2 82.6 29 4.795381
1169 Charles, Maryland 24017 704 26.6 93063 6.0 0.8 77.9 40 4.826538

3112 rows × 10 columns

What's noticeable about the negative outliers? These are counties where educational attainment has not translated into higher median income. The ten most extreme are also all rural college towns:

Whitman, Washington: Washington State University
Oktibbeha, Mississippi: Mississippi State University
Lexington City, Virginia: Virginia Military Institute (VMI)
Clarke, Georgia: University of Georgia
Albany, Wyoming: University of Wyoming
Monroe, Indiana: Indiana University
Watauga, North Carolina: Appalachian State University
Jackson, Illinois: Southern Illinois University (main economic engine)
Latah, Idaho: University of Idaho
Charlottesville City, Virginia: University of Virginia
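For readers wondering what the `income_edu_resid` column measures: as far as I understand it, it is each county's raw residual divided by the residual standard error, so a value of -4.5 means the county sits about 4.5 standard errors below the fitted line. The sketch below reproduces that calculation by hand with NumPy. The data are synthetic stand-ins, not the NYT figures, and all variable names are illustrative.

```python
import numpy as np

# Synthetic stand-in for the county data (NOT the NYT figures).
rng = np.random.default_rng(0)
education = rng.uniform(5, 60, size=200)
income = 27000 + 900 * education + rng.normal(0, 8000, size=200)

# Ordinary least squares fit of income ~ education (intercept column first).
X = np.column_stack([np.ones_like(education), education])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)

# Raw residuals: observed income minus the fitted line.
resid = income - X @ beta

# Residual standard error uses n - k degrees of freedom (k = 2 parameters).
dof = len(income) - X.shape[1]
sigma = np.sqrt(resid @ resid / dof)

# Normalized residual: how many standard errors a point sits from the line.
norm_resid = resid / sigma

# Indices of the five most negative and five most positive outliers.
order = np.argsort(norm_resid)
lowest, highest = order[:5], order[-5:]
```

Dividing by a single standard error assumes the noise is roughly the same size at every education level. If that does not hold, studentized residuals would be a more careful choice.
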

There are a few patterns in the positive outliers, where median income is higher than predicted by educational attainment. Charles, Loudoun, Calvert, Stafford, and Prince William counties all surround Washington D.C. Hunterdon County NJ and Putnam County NY are outer suburbs of NYC. Chambers TX is a suburb of Houston, and Campbell WY is dominated by coal, oil, and gas extraction. Glasscock TX has 334 residents: let's not jump to any conclusions on that one!

There are a few more interactions worth exploring. See below for all pairwise scatterplots.

In [106]:
cols = ['education', 'income', 'unemployment', 'disability', 'life', 'obesity']
# scatter_matrix moved from pandas.tools.plotting to pandas.plotting
from pandas.plotting import scatter_matrix
scatter_matrix(df[cols], alpha=0.2, figsize=(6, 6), diagonal='kde')
a = 1
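
A numeric companion to the scatterplot matrix is a table of pairwise Pearson correlations, which makes it easier to decide which interactions to explore first. The sketch below is only an illustration: it uses three synthetic columns (not the NYT data) and plain NumPy so that it runs on its own.

```python
import numpy as np

cols = ['education', 'income', 'unemployment']

# Synthetic columns for illustration only (NOT the NYT data).
rng = np.random.default_rng(1)
education = rng.uniform(5, 60, size=300)
income = 27000 + 900 * education + rng.normal(0, 8000, size=300)
unemployment = 12 - 0.1 * education + rng.normal(0, 2, size=300)

# Each row is one variable, so corrcoef returns a len(cols) x len(cols) matrix.
corr = np.corrcoef([education, income, unemployment])

# Print each unordered pair once with its Pearson correlation.
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        print(f"{cols[i]} vs {cols[j]}: {corr[i, j]:+.2f}")
```

With the real data frame, `df[cols].corr()` should give the same kind of matrix in one call.
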