In [1]:

%pylab --no-import-all inline

Populating the interactive namespace from numpy and matplotlib

Some Context¶

The US Census is complex, so it's good, even essential, to have a framing question to guide your explorations so that you don't get distracted or lost.

I started thinking about the census in 2002 when I saw a woman I knew in the following SF Chronicle article:

Claremont-Elmwood / Homogeneity in Berkeley? Well, yeah - SFGate

I thought at that point that it should be easy for regular people to do census calculations.

In the summer of 2013, I wrote the following note to Greg Wilson about diversity calculations:

notes for Greg Wilson about an example Data Science Workflow

There's a whole cottage industry in musing on "diversity" in the USA:

The Most Diverse Cities In The US - Business Insider -- using 4 categories: Vallejo.
Most And Least Diverse Cities: Brown University Study Evaluates Diversity In The U.S.
The Top 10 Most Diverse Cities in America -- LA?

and let's not forget the Racial Dot Map and some background.

In [2]:

# Shows the version of pandas that we are using
!pip show pandas

---
Name: pandas
Version: 0.12.0
Location: /Users/raymondyee/anaconda/envs/myenv/lib/python2.7/site-packages
Requires:

In [3]:

#  import useful classes of pandas
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Index

http://www.census.gov/developers/

Dependency: to start with, let's use the Python module census: https://pypi.python.org/pypi/census/

pip install -U  census

Things we'd like to be able to do:

Calculate the population of California.
Then calculate the population of every geographic entity, going down to the census block if possible.
For a given geographic unit, can we get the racial/ethnic breakdown?

Figuring out the Census Data is a Big Jigsaw Puzzle¶

Some starting points:

We focus first on the API, and I hope we can come back to processing the bulk data from the Census FTP site.

Prerequisites: Getting and activating key¶

Fill out the form at http://www.census.gov/developers/tos/key_request.html

"Your request for a new API key has been successfully submitted. Please check your email. In a few minutes you should receive a message with instructions on how to activate your new key."

Click on the activation link you receive, which has the form http://api.census.gov/data/KeySignup?validate=%7Bkey%7D

Then create a settings.py in the same directory as this notebook (or somewhere else in your Python path) to hold settings.CENSUS_KEY
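A minimal sketch of what settings.py might contain, assuming the key is stored as a plain string. The value shown is a placeholder, not a real key.

```python
# settings.py (hypothetical contents): replace the placeholder with your own key
CENSUS_KEY = "YOUR_CENSUS_API_KEY"
```

Keeping the key in a separate module lets you share the notebook without sharing the key itself.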

In [4]:

import settings

In [5]:

# This cell should run successfully if you have a string set up to represent your census key

try:
    import settings
    assert isinstance(settings.CENSUS_KEY, basestring)  # str or unicode in Python 2
except Exception as e:
    print "error in importing settings to get at settings.CENSUS_KEY", e

states module¶

In [6]:

# let's figure out a bit about the us module, in particular, us.states
# https://github.com/unitedstates/python-us

from us import states

for (i, state) in enumerate(states.STATES):
    print i, state.name, state.fips

0 Alabama 01
1 Alaska 02
2 Arizona 04
3 Arkansas 05
4 California 06
5 Colorado 08
6 Connecticut 09
7 Delaware 10
8 District of Columbia 11
9 Florida 12
10 Georgia 13
11 Hawaii 15
12 Idaho 16
13 Illinois 17
14 Indiana 18
15 Iowa 19
16 Kansas 20
17 Kentucky 21
18 Louisiana 22
19 Maine 23
20 Maryland 24
21 Massachusetts 25
22 Michigan 26
23 Minnesota 27
24 Mississippi 28
25 Missouri 29
26 Montana 30
27 Nebraska 31
28 Nevada 32
29 New Hampshire 33
30 New Jersey 34
31 New Mexico 35
32 New York 36
33 North Carolina 37
34 North Dakota 38
35 Ohio 39
36 Oklahoma 40
37 Oregon 41
38 Pennsylvania 42
39 Rhode Island 44
40 South Carolina 45
41 South Dakota 46
42 Tennessee 47
43 Texas 48
44 Utah 49
45 Vermont 50
46 Virginia 51
47 Washington 53
48 West Virginia 54
49 Wisconsin 55
50 Wyoming 56

Questions to ponder: How many states are in the list? Is DC included in the states list? How do you access the territories?

Formulating URL requests by hand¶

It's immensely useful to be able to access the census API directly by creating a URL with the proper parameters, as well as by using the census package.
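As a rough sketch of how such a URL is put together, the helper below assembles the query string from its parts. The function name build_sf1_url and its arguments are hypothetical, introduced only for illustration; the endpoint and the key, get, for, and in parameters are the ones used by the requests in this notebook.

```python
def build_sf1_url(key, variables, for_clause, in_clause=None):
    """Assemble a 2010 SF1 request URL (hypothetical helper, not part of any library)."""
    base = "http://api.census.gov/data/2010/sf1"
    parts = [
        "key={0}".format(key),
        "get={0}".format(",".join(variables)),   # comma-separated variable names
        "for={0}".format(for_clause),            # e.g. "state:*" or "county:*"
    ]
    if in_clause is not None:
        parts.append("in={0}".format(in_clause)) # e.g. "state:06" to restrict to California
    return base + "?" + "&".join(parts)

# "KEY" is a placeholder; substitute settings.CENSUS_KEY in the notebook
all_states_url = build_sf1_url("KEY", ["P0010001", "NAME"], "state:*")
ca_counties_url = build_sf1_url("KEY", ["P0010001", "NAME"], "county:*", "state:06")
```

Building the string by hand keeps the commas, colons, and asterisk unescaped, which matches the URL format used in the next cells.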

In [7]:

import requests

In [8]:

# get the total population of all states
url = "http://api.census.gov/data/2010/sf1?key={key}&get=P0010001,NAME&for=state:*".format(key=settings.CENSUS_KEY)

In [9]:

# note the structure of the response
r = requests.get(url)
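To make the structure concrete without calling the API, here is a sketch that uses a small hand-copied sample, assuming the parsed body (r.json()) is a list of lists whose first entry is the header row and whose values are all strings. The sample rows are the Alabama and Alaska figures shown in Out[12].

```python
# Hand-copied sample standing in for r.json(); the real response has one row per state
sample = [
    ["P0010001", "NAME", "state"],   # header record
    ["4779736", "Alabama", "01"],
    ["710231", "Alaska", "02"],
]

header, rows = sample[0], sample[1:]

# Pair each row with the header to get one dict per geographic entity
records = [dict(zip(header, row)) for row in rows]
print(records[0]["NAME"])
```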

Total Population¶

In [10]:

# FILL IN
# drop the header record
from itertools import islice
# total population including PR is 312471327

In [11]:

# FILL IN
# exclude PR:  308745538
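One possible approach to the two exercises above, sketched on a three-row sample rather than the live response. The sample list is an assumption standing in for r.json(); the Puerto Rico figure is the difference between the two totals quoted in the comments above, and 72 is Puerto Rico's state FIPS code.

```python
from itertools import islice

# Stand-in for r.json(): header record followed by data rows
sample = [
    ["P0010001", "NAME", "state"],
    ["4779736", "Alabama", "01"],
    ["37253956", "California", "06"],
    ["3725789", "Puerto Rico", "72"],
]

# islice(sample, 1, None) skips the header record without copying the list
total = sum(int(row[0]) for row in islice(sample, 1, None))

# Same sum, leaving out the row whose state code is Puerto Rico's (72)
total_excl_pr = sum(
    int(row[0]) for row in islice(sample, 1, None) if row[2] != "72"
)
```

Replacing sample with r.json() should give the totals quoted in the cell comments, assuming the response has the same layout.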

In [12]:

# let's now create a DataFrame from r.json()

df = DataFrame(r.json()[1:], columns=r.json()[0])
df.head()

Out[12]:

	P0010001	NAME	state
0	4779736	Alabama	01
1	710231	Alaska	02
2	6392017	Arizona	04
3	2915918	Arkansas	05
4	37253956	California	06

In [13]:

# FILL IN
# calculate the total population using df

In [14]:

# FILL IN -- now calculate the total population excluding Puerto Rico
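A sketch of one way to do the same calculation with pandas, again on a small stand-in sample rather than the real response. The names df_sample and population are hypothetical; the key step is converting the string column to integers before summing.

```python
import pandas as pd

# Stand-in for r.json(); Puerto Rico has state FIPS code 72
sample = [
    ["P0010001", "NAME", "state"],
    ["4779736", "Alabama", "01"],
    ["37253956", "California", "06"],
    ["3725789", "Puerto Rico", "72"],
]

df_sample = pd.DataFrame(sample[1:], columns=sample[0])

# The API values are strings, so convert before doing arithmetic
population = df_sample["P0010001"].astype(int)

total = population.sum()
total_excl_pr = population[df_sample["state"] != "72"].sum()
```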

Focusing on SF1 + 2010 census¶

How can we map out the geographical hierarchy and pull out total population figures?

Nation
Regions
Divisions
State
County
Census Tract
Block Group
Census Block

Questions

What identifiers are used for these various geographic entities?
Can we get an enumeration of each of these entities?
How do we figure out which census tract, block group, and census block a given location is in?
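As a partial hint for the identifier question: from state down to census block, the identifiers are FIPS-style codes that nest, so a full block identifier is 15 digits (2 for state, 3 for county, 6 for tract, 4 for block, with the block group being the first digit of the block). The sketch below splits such an identifier; the function name is hypothetical, and the tract and block digits in the example are illustrative rather than looked up.

```python
def split_block_geoid(geoid):
    """Split a 15-digit census block identifier into its nested components."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError("expected a 15-digit block identifier")
    return {
        "state": geoid[0:2],
        "county": geoid[2:5],
        "tract": geoid[5:11],
        "block_group": geoid[11],   # first digit of the 4-digit block code
        "block": geoid[11:15],
    }

# 06 = California, 001 = Alameda County; tract and block digits are made up
parts = split_block_geoid("060014001001000")
```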

Total Population of California¶

2010 Census Summary File 1

P0010001 is the "total population" variable, listed in 2010 SF1 API Variables [XML].

In [15]:

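from settings import CENSUS_KEY
import census

# create a client for the Census API, then ask for the name and
# total population (P0010001) of California
c = census.Census(CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})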

Out[15]:

[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]

In [16]:

"population of California: {0}".format(
        int(c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})[0]['P0010001']))

Out[16]:

'population of California: 37253956'
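Since the census package returns a list of dicts with string values, a small helper can turn the result into a name-to-population mapping. This is only a sketch: population_by_name is a hypothetical function, and response is a hand-copied version of Out[15] rather than a live API result.

```python
# Hand-copied from Out[15]; a live call would return the same shape
response = [{"NAME": "California", "P0010001": "37253956", "state": "06"}]

def population_by_name(records, variable="P0010001"):
    """Map each entity's NAME to its population as an int (hypothetical helper)."""
    return {rec["NAME"]: int(rec[variable]) for rec in records}

populations = population_by_name(response)
```

The same helper would apply to the county-level results, since they have the same record layout plus a county field.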

Let's try to get the counties of California and their populations.

In [17]:

ca_counties = c.sf1.get(('NAME', 'P0010001'), geo={'for': 'county:*', 'in': 'state:%s' % states.CA.fips})

In [18]:

# create a DataFrame, convert the 'P0010001' column
# show by descending population
df = DataFrame(ca_counties)
df['P0010001'] = df['P0010001'].astype('int')
df.sort_index(by='P0010001', ascending=False)

Out[18]:

	NAME	P0010001	county	state
18	Los Angeles County	9818605	037	06
36	San Diego County	3095313	073	06
29	Orange County	3010232	059	06
32	Riverside County	2189641	065	06
35	San Bernardino County	2035210	071	06
42	Santa Clara County	1781642	085	06
0	Alameda County	1510271	001	06
33	Sacramento County	1418788	067	06
6	Contra Costa County	1049025	013	06
9	Fresno County	930450	019	06
14	Kern County	839631	029	06
55	Ventura County	823318	111	06
37	San Francisco County	805235	075	06
40	San Mateo County	718451	081	06
38	San Joaquin County	685306	077	06
49	Stanislaus County	514453	099	06
48	Sonoma County	483878	097	06
53	Tulare County	442179	107	06
41	Santa Barbara County	423895	083	06
26	Monterey County	415057	053	06
47	Solano County	413344	095	06
30	Placer County	348432	061	06
39	San Luis Obispo County	269637	079	06
43	Santa Cruz County	262382	087	06
23	Merced County	255793	047	06
20	Marin County	252409	041	06
3	Butte County	220000	007	06
56	Yolo County	200849	113	06
8	El Dorado County	181058	017	06
44	Shasta County	177223	089	06
12	Imperial County	174528	025	06
15	Kings County	152982	031	06
19	Madera County	150865	039	06
27	Napa County	136484	055	06
11	Humboldt County	134623	023	06
28	Nevada County	98764	057	06
50	Sutter County	94737	101	06
22	Mendocino County	87841	045	06
57	Yuba County	72155	115	06
16	Lake County	64665	033	06
51	Tehama County	63463	103	06
54	Tuolumne County	55365	109	06
34	San Benito County	55269	069	06
4	Calaveras County	45578	009	06
46	Siskiyou County	44900	093	06
2	Amador County	38091	005	06
17	Lassen County	34895	035	06
7	Del Norte County	28610	015	06
10	Glenn County	28122	021	06
5	Colusa County	21419	011	06
31	Plumas County	20007	063	06
13	Inyo County	18546	027	06
21	Mariposa County	18251	043	06
25	Mono County	14202	051	06
52	Trinity County	13786	105	06
24	Modoc County	9686	049	06
45	Sierra County	3240	091	06
1	Alpine County	1175	003	06

In [19]:

# http://stackoverflow.com/a/13130357/7782
# compute the bin edges with numpy, then reuse them for the pandas histogram
count, division = np.histogram(df['P0010001'])
df['P0010001'].hist(bins=division)

Out[19]:

<matplotlib.axes.AxesSubplot at 0x106f41c90>
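To see what np.histogram hands back before plotting, here is a sketch on five of the county populations from the table above (the three largest and the two smallest). With the default of 10 bins, it returns the counts per bin and the 11 bin edges that are passed to hist.

```python
import numpy as np

# Los Angeles, San Diego, Orange, Sierra, and Alpine counties (from Out[18])
populations = [9818605, 3095313, 3010232, 3240, 1175]

# Default is 10 equal-width bins spanning min..max of the data
count, division = np.histogram(populations)

n_bins = len(count)        # number of bins
n_edges = len(division)    # always one more edge than bins
```

Because the bins are equal-width and Los Angeles County is so much larger than the rest, most counties land in the first bin, which is what the plot of the full data shows as well.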