%pylab --no-import-all inline
Populating the interactive namespace from numpy and matplotlib
The US Census is complex, so it's good, even essential, to have a framing question to guide your explorations so that you don't get distracted or lost.
I started thinking about the census in 2002, when I saw a woman I knew in the following SF Chronicle article:
Claremont-Elmwood / Homogeneity in Berkeley? Well, yeah - SFGate
At that point I thought it should be easy for regular people to do census calculations.
In the summer of 2013, I wrote the following note to Greg Wilson about diversity calculations:
notes for Greg Wilson about an example Data Science Workflow
There's a whole cottage industry in musing on "diversity" in the USA:
The Most Diverse Cities In The US - Business Insider -- using 4 categories: Vallejo.
Most And Least Diverse Cities: Brown University Study Evaluates Diversity In The U.S.
and let's not forget the Racial Dot Map and some background.
# Shows the version of pandas that we are using
!pip show pandas
---
Name: pandas
Version: 0.12.0
Location: /Users/raymondyee/anaconda/envs/myenv/lib/python2.7/site-packages
Requires:
# import useful classes of pandas
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Index
http://www.census.gov/developers/
Dependency: to start with -- let's use the Python module: https://pypi.python.org/pypi/census/
pip install -U census
Things we'd like to be able to do:
Some starting points:
We focus first on the API -- and I hope we can come back to processing the bulk data from the Census FTP site.
"Your request for a new API key has been successfully submitted. Please check your email. In a few minutes you should receive a message with instructions on how to activate your new key."
Then create a settings.py in the same directory as this notebook (or somewhere else in your Python path) to hold settings.CENSUS_KEY
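As a minimal sketch, a settings.py for this notebook could contain just one assignment. The value shown is a hypothetical placeholder, not a working key; replace it with the key you receive by email.

```python
# settings.py -- keep this file out of version control.
# CENSUS_KEY is the name this notebook reads; the value below is a
# hypothetical placeholder and will not authenticate against the API.
CENSUS_KEY = "replace-with-your-census-api-key"
```

Keeping the key in its own module means the notebook itself can be shared without exposing the key.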
import settings
# This cell should run successfully if you have a string set up to represent your census key
try:
    import settings
    assert type(settings.CENSUS_KEY) == str or type(settings.CENSUS_KEY) == unicode
except Exception as e:
    print "error in importing settings to get at settings.CENSUS_KEY", e
# let's figure out a bit about the us module, in particular, us.states
# https://github.com/unitedstates/python-us
from us import states
for (i, state) in enumerate(states.STATES):
    print i, state.name, state.fips
0 Alabama 01
1 Alaska 02
2 Arizona 04
3 Arkansas 05
4 California 06
5 Colorado 08
6 Connecticut 09
7 Delaware 10
8 District of Columbia 11
9 Florida 12
10 Georgia 13
11 Hawaii 15
12 Idaho 16
13 Illinois 17
14 Indiana 18
15 Iowa 19
16 Kansas 20
17 Kentucky 21
18 Louisiana 22
19 Maine 23
20 Maryland 24
21 Massachusetts 25
22 Michigan 26
23 Minnesota 27
24 Mississippi 28
25 Missouri 29
26 Montana 30
27 Nebraska 31
28 Nevada 32
29 New Hampshire 33
30 New Jersey 34
31 New Mexico 35
32 New York 36
33 North Carolina 37
34 North Dakota 38
35 Ohio 39
36 Oklahoma 40
37 Oregon 41
38 Pennsylvania 42
39 Rhode Island 44
40 South Carolina 45
41 South Dakota 46
42 Tennessee 47
43 Texas 48
44 Utah 49
45 Vermont 50
46 Virginia 51
47 Washington 53
48 West Virginia 54
49 Wisconsin 55
50 Wyoming 56
Questions to ponder: How many states are in the list? Is DC included in the states list? How do you access the territories?
It's immensely useful to be able to access the census API directly by creating a URL with the proper parameters, as well as by using the census package.
import requests
# get the total population of all states
url = "http://api.census.gov/data/2010/sf1?key={key}&get=P0010001,NAME&for=state:*".format(key=settings.CENSUS_KEY)
# note the structure of the response
r = requests.get(url)
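The same URL can also be assembled from its parts rather than written as one format string. The sketch below is one way to do that with the standard library; `build_sf1_url` is a hypothetical helper name, `"YOUR_KEY"` is a placeholder, and it assumes the API decodes percent-encoded query values in the usual way (not verified here).

```python
try:
    from urllib.parse import urlencode, parse_qs, urlsplit  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2
    from urlparse import parse_qs, urlsplit

BASE_URL = "http://api.census.gov/data/2010/sf1"


def build_sf1_url(key, variables, for_clause, in_clause=None):
    """Hypothetical helper: build a 2010 SF1 query URL from its parameters."""
    params = [("key", key), ("get", ",".join(variables)), ("for", for_clause)]
    if in_clause is not None:
        params.append(("in", in_clause))
    # urlencode percent-encodes characters such as ',' ':' and '*'
    return BASE_URL + "?" + urlencode(params)


# "YOUR_KEY" is a placeholder, not a real key
url = build_sf1_url("YOUR_KEY", ["P0010001", "NAME"], "state:*")

# Decode the query string again to check what the server would see
query = parse_qs(urlsplit(url).query)
```

Building the query from a list of pairs makes it easier to add an `in` clause later, for example when asking for counties within one state.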
# FILL IN
# drop the header record
from itertools import islice
# total population including PR is 312471327
# FILL IN
# exclude PR: 308745538
# let's now create a DataFrame from r.json()
df = DataFrame(r.json()[1:], columns=r.json()[0])
df.head()
| | P0010001 | NAME | state |
|---|---|---|---|
0 | 4779736 | Alabama | 01 |
1 | 710231 | Alaska | 02 |
2 | 6392017 | Arizona | 04 |
3 | 2915918 | Arkansas | 05 |
4 | 37253956 | California | 06 |
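To make the response structure concrete, here is a small sketch that works on a hand-built sample shaped like `r.json()`: a header row followed by data rows. The five rows are copied from the `df.head()` output above, and the sketch assumes the values arrive as strings; the variable names are illustrative only.

```python
# Sample shaped like r.json(): a header row followed by data rows.
# The five data rows are copied from the df.head() output.
sample = [
    ["P0010001", "NAME", "state"],
    ["4779736", "Alabama", "01"],
    ["710231", "Alaska", "02"],
    ["6392017", "Arizona", "04"],
    ["2915918", "Arkansas", "05"],
    ["37253956", "California", "06"],
]

header, rows = sample[0], sample[1:]

# Pair each row with the header to get one dict per state.
records = [dict(zip(header, row)) for row in rows]

# Population values are strings here, so convert before doing arithmetic.
sample_total = sum(int(rec["P0010001"]) for rec in records)
```

Summing the strings without the `int` conversion would fail or concatenate them, which is why the conversion step matters before any totals are computed.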
# FILL IN
# calculate the total population using df
# FILL IN -- now calculate the total population excluding Puerto Rico
How do we map out the geographical hierarchy and pull out total population figures?
Questions
P0010001 is listed in the 2010 SF1 API Variables [XML] as "total population".
import census
import settings

# create a client for the Census API using the key stored in settings.py
c = census.Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})
[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]
"population of California: {0}".format(
int(c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})[0]['P0010001']))
'population of California: 37253956'
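The geography argument in these calls is a small dict: a `'for'` key naming the level and code, plus an optional `'in'` key that restricts the query to a parent geography. The sketch below builds such dicts and parses one result row; `geo_filter` is a hypothetical helper, not part of the census package, and the result row is copied from the California output above.

```python
def geo_filter(level, code="*", state_fips=None):
    """Hypothetical helper: build the geography dict for a query."""
    geo = {"for": "%s:%s" % (level, code)}
    if state_fips is not None:
        # restrict the query to one state, e.g. counties within California
        geo["in"] = "state:%s" % state_fips
    return geo


california = geo_filter("state", "06")
ca_counties_geo = geo_filter("county", state_fips="06")

# Shape of one result row, copied from the California output.
result = [{"NAME": "California", "P0010001": "37253956", "state": "06"}]

# Values come back as strings, so convert before using them as numbers.
population = int(result[0]["P0010001"])
```

A dict built this way can be passed as the second argument to `c.sf1.get`, in place of the literal dicts written out by hand.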
Let's get the counties of California and their populations.
ca_counties = c.sf1.get(('NAME', 'P0010001'), geo={'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
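# create a DataFrame and convert the 'P0010001' column from strings to integers
df = DataFrame(ca_counties)
df['P0010001'] = df['P0010001'].astype('int')
# show counties by descending population
# (sort_index(by=...) matches the pandas 0.12 used here;
#  pandas 0.17 and later provide df.sort_values('P0010001', ascending=False))
df.sort_index(by='P0010001', ascending=False)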
| | NAME | P0010001 | county | state |
|---|---|---|---|---|
18 | Los Angeles County | 9818605 | 037 | 06 |
36 | San Diego County | 3095313 | 073 | 06 |
29 | Orange County | 3010232 | 059 | 06 |
32 | Riverside County | 2189641 | 065 | 06 |
35 | San Bernardino County | 2035210 | 071 | 06 |
42 | Santa Clara County | 1781642 | 085 | 06 |
0 | Alameda County | 1510271 | 001 | 06 |
33 | Sacramento County | 1418788 | 067 | 06 |
6 | Contra Costa County | 1049025 | 013 | 06 |
9 | Fresno County | 930450 | 019 | 06 |
14 | Kern County | 839631 | 029 | 06 |
55 | Ventura County | 823318 | 111 | 06 |
37 | San Francisco County | 805235 | 075 | 06 |
40 | San Mateo County | 718451 | 081 | 06 |
38 | San Joaquin County | 685306 | 077 | 06 |
49 | Stanislaus County | 514453 | 099 | 06 |
48 | Sonoma County | 483878 | 097 | 06 |
53 | Tulare County | 442179 | 107 | 06 |
41 | Santa Barbara County | 423895 | 083 | 06 |
26 | Monterey County | 415057 | 053 | 06 |
47 | Solano County | 413344 | 095 | 06 |
30 | Placer County | 348432 | 061 | 06 |
39 | San Luis Obispo County | 269637 | 079 | 06 |
43 | Santa Cruz County | 262382 | 087 | 06 |
23 | Merced County | 255793 | 047 | 06 |
20 | Marin County | 252409 | 041 | 06 |
3 | Butte County | 220000 | 007 | 06 |
56 | Yolo County | 200849 | 113 | 06 |
8 | El Dorado County | 181058 | 017 | 06 |
44 | Shasta County | 177223 | 089 | 06 |
12 | Imperial County | 174528 | 025 | 06 |
15 | Kings County | 152982 | 031 | 06 |
19 | Madera County | 150865 | 039 | 06 |
27 | Napa County | 136484 | 055 | 06 |
11 | Humboldt County | 134623 | 023 | 06 |
28 | Nevada County | 98764 | 057 | 06 |
50 | Sutter County | 94737 | 101 | 06 |
22 | Mendocino County | 87841 | 045 | 06 |
57 | Yuba County | 72155 | 115 | 06 |
16 | Lake County | 64665 | 033 | 06 |
51 | Tehama County | 63463 | 103 | 06 |
54 | Tuolumne County | 55365 | 109 | 06 |
34 | San Benito County | 55269 | 069 | 06 |
4 | Calaveras County | 45578 | 009 | 06 |
46 | Siskiyou County | 44900 | 093 | 06 |
2 | Amador County | 38091 | 005 | 06 |
17 | Lassen County | 34895 | 035 | 06 |
7 | Del Norte County | 28610 | 015 | 06 |
10 | Glenn County | 28122 | 021 | 06 |
5 | Colusa County | 21419 | 011 | 06 |
31 | Plumas County | 20007 | 063 | 06 |
13 | Inyo County | 18546 | 027 | 06 |
21 | Mariposa County | 18251 | 043 | 06 |
25 | Mono County | 14202 | 051 | 06 |
52 | Trinity County | 13786 | 105 | 06 |
24 | Modoc County | 9686 | 049 | 06 |
45 | Sierra County | 3240 | 091 | 06 |
1 | Alpine County | 1175 | 003 | 06 |
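# http://stackoverflow.com/a/13130357/7782
# compute bin edges with np.histogram (10 equal-width bins by default),
# then reuse those edges for the plot
count, division = np.histogram(df['P0010001'])
df['P0010001'].hist(bins=division)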
<matplotlib.axes.AxesSubplot at 0x106f41c90>
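To see what the bin edges and counts look like, here is a pure-Python sketch of equal-width binning, which is roughly what `np.histogram` does when called without a `bins` argument. It is an approximation for illustration, not NumPy's implementation; `equal_width_histogram` is a hypothetical name, and the ten populations are a subset of the county table above.

```python
def equal_width_histogram(values, bins=10):
    """Count values in `bins` equal-width bins spanning min..max of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(bins)
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: put everything in the first bin
        else:
            # the top edge belongs to the last bin, so clamp the index
            idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges


# Ten county populations taken from the table (a subset, not all 58).
populations = [9818605, 3095313, 1510271, 805235, 483878,
               252409, 98764, 28610, 9686, 1175]

counts, edges = equal_width_histogram(populations)
```

With one very large county and many small ones, most values fall into the first bin, which is why the histogram is so skewed.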