Downloading and Plotting U.S. Census Bureau Data Using Python¶

David C. Folch | Florida State University | github: @dfolch

Rebecca Davies | University of Colorado Boulder | github: @beckymasond

Motivation¶

Sources of US Census Bureau data¶

There are a number of point-and-click sources:

NHGIS (National Historical Geographic Information System)
Social Explorer
American Factfinder from the US Census Bureau

Accessing the API (using python)¶

Why this approach when there are all these nice websites?

Reproducible (data) science
Efficiency
One platform

Tools

US Census Bureau API
Python
Cenpy package to interface with the Census API
GeoPandas package to hold and plot data

Installation¶

I recommend installing Canopy Python or Anaconda Python. These come with the python programming language and many common libraries preinstalled.

Cenpy¶

Dependencies
- Pandas
- Requests
pip install cenpy

GeoPandas¶

Dependencies
- Pandas
- Matplotlib
- Shapely
- Fiona
- Pyproj
- Descartes
- PySAL (optional - needed for choropleth maps)
pip install geopandas

In [1]:

%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 12,12  # change this line if you want to change the default map size
import pandas as pd
import geopandas as gpd
import cenpy as cen

Work Flow¶

Select database (ex. 2008-2012 ACS)
Select geography(s) (ex. Leon County census tracts)
Select attributes (ex. income and poverty)
Pull down the data
Pull down the geography
Link the data to the geography

1. Database¶

The Census Bureau provides many databases through their API, including the decennial census, economic census, etc.

Explore database options¶

In [2]:

databases = [(k,v) for k,v in cen.explorer.available(verbose=True).items()]
print 'total number of databases:', len(databases)
databases[0:5]

total number of databases: 85

Out[2]:

[(u'NONEMP2012_old', u'2012 Nonemployer Statistics: Nonemployer Statistics'),
 (u'ASMState',
  u'Time Series Annual Survey of Manufactures: Statistics for All Manufacturing by State'),
 (u'IDBSINGLEYEAR',
  u'Time Series International Database: International Populations by Single Year of Age and Sex'),
 (u'2012popproj/deaths',
  u'Vintage 2012 Population Projections - : Projected Deaths'),
 (u'NONEMP2009', u'2009 Nonemployer Statistics: Non Employer Statistics')]

For this example we will use the 2008-2012 American Community Survey.

In [3]:

#api_database = '2010acs5'     # ACS 2006-2010
api_database = 'ACSSF5Y2012'  # ACS 2008-2012
#api_database = 'ACSSF5Y2013'  # ACS 2009-2013

In [4]:

cen.explorer.explain(api_database)

Out[4]:

{u'2008-2012 American Community Survey - Summarized Data: 5-Year Summary File': u"The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years. Questionnaires are mailed to a sample of addresses to obtain information about households -- that is, about each person and the housing unit itself. The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. It produces estimates for small areas, including census tracts and population subgroups. Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns, and estimates of housing units for states and counties.  For 2010 and other decennial census years, the Decennial Census provides the official counts of population and housing units."}

Connect to database¶

In [5]:

api_conn = cen.base.Connection(api_database)

In [6]:

api_conn

Out[6]:

Connection to 2008-2012 American Community Survey - Summarized Data: 5-Year Summary File (ID: http://api.census.gov/data/id/ACSSF5Y2012)

2. Geography¶

Data is provided at a variety of geographic scales, which depend on the database selected. For this example we're working with the ACS. Data for large geographies, like states, can be queried directly. But smaller geographies, like counties or census tracts, can only be selected by first choosing a bounding geography; this is sometimes referred to as geo-in-geo.

In [7]:

api_conn.geographies.keys()

Out[7]:

[u'fips']

In [8]:

api_conn.geographies['fips']

Out[8]:

	geoLevelId	name	optionalWithWCFor	requires
0	310	metropolitan statistical area/micropolitan sta...	NaN	NaN
1	010	us	NaN	NaN
2	020	region	NaN	NaN
3	030	division	NaN	NaN
4	400	urban area	NaN	NaN
5	250	american indian area/alaska native area/hawaii...	NaN	NaN
6	330	combined statistical area	NaN	NaN
7	350	new england city and town area	NaN	NaN
8	040	state	NaN	NaN
9	860	zip code tabulation area	NaN	NaN
10	795	public use microdata area	state	[state]
11	950	school district (elementary)	NaN	[state]
12	960	school district (secondary)	NaN	[state]
13	970	school district (unified)	NaN	[state]
14	320	metropolitan statistical area/micropolitan sta...	NaN	[state]
15	340	combined statistical area	NaN	[state]
16	500	congressional district	state	[state]
17	610	state legislative district (upper chamber)	NaN	[state]
18	620	state legislative district (lower chamber)	NaN	[state]
19	050	county	state	[state]
20	160	place	state	[state]
21	140	tract	county	[state, county]
22	060	county subdivision	county	[state, county]
23	150	block group	tract	[state, county, tract]

The geographic query requires selecting a spatial scale from the name column in the table above, and in some cases an encompassing geography, based on the rules in the requires column. In either case, FIPS codes are used to pick specific geographies.

In [9]:

#### select all states in the country
#g_unit = 'state'
#g_filter = {}
#### select Florida
#g_unit = 'state:12'
#g_filter = {}
#### select all counties in Florida
#g_unit = 'county'
#g_filter = {'state':'12'}
#### select all census tracts in Florida
#g_unit = 'tract'
#g_filter = {'state':'12'}
#### select all tracts in Leon County, Florida
g_unit = 'tract'
g_filter = {'state':'12', 'county':'073'}

3. Attributes¶

Attributes represent the demographic or economic characteristics that you wish to download.

Select attributes¶

In [10]:

print 'Attributes in the ACS:', api_conn.variables.shape[0]
api_conn.variables.head(10)

Attributes in the ACS: 45236

Out[10]:

	concept	label	predicateOnly	predicateType
AIANHH	Selectable Geographies	American Indian Area/Alaska Native Area/Hawaii...	NaN	NaN
AIHHTLI	Selectable Geographies	American Indian Trust Land/Hawaiian Home Land ...	NaN	int
AITS	Selectable Geographies	American Indian Tribal Subdivision (FIPS)	NaN	NaN
AITSCE	Selectable Geographies	American Indian Tribal Subdivision (Census)	NaN	int
ANRC	Selectable Geographies	Alaska Native Regional Corporation (FIPS)	NaN	NaN
B00001_001E	B00001. Unweighted Sample Count of the Popula...	Total	NaN	int
B00001_001M	B00001. Unweighted Sample Count of the Popula...	Margin Of Error For!!Total	NaN	int
B00002_001E	B00002. Unweighted Sample Housing Units	Total	NaN	int
B00002_001M	B00002. Unweighted Sample Housing Units	Margin Of Error For!!Total	NaN	int
B01001A_001E	B01001A. SEX BY AGE (WHITE ALONE)	Total:	NaN	int

Cenpy passes column names the Census API. Therefore, one option is to build a list of column codes. A good resource for these codes is Social Explorer. For example, here is the info for ACS 2008-2012 5 year tables.

In [11]:

#### total number of children in poverty
#cols = ['B17006_002E']
#### total number of children in poverty and number of children in pov in female headed households
#cols = ['B17006_002E', 'B17006_012E']

An alternative approach for building the list of column names is to use regular expressions. This approach is handy when you want all the columns from a particular table.

In [12]:

#### all the columns for table B17006 (poverty status for children by family type)
#cols = api_conn.varslike('B17006_\S+')
#### all the columns for table B17006 and table B19326 (median income by gender and employment)
cols = api_conn.varslike('B17006_\S+')
cols.extend(api_conn.varslike('B19326_\S+'))

In [13]:

len(cols)

Out[13]:

You can get a text description of each variable. Some descriptions are more useful than others.

In [14]:

cols_detail = pd.DataFrame(api_conn.variables.ix[cols].label)
cols_detail.head()

Out[14]:

	label
B17006_001E	Total:
B17006_001M	Margin Of Error For!!Total:
B17006_002E	Income in the past 12 months below poverty level:
B17006_002M	Margin Of Error For!!Income in the past 12 mon...
B17006_003E	Income in the past 12 months below poverty lev...

Note that each estimate in the ACS comes with an accompanying margin of error (MOE). Column headers ending in E contain estimates, and headers ending in M contain MOEs.

Geographic attributes¶

In addition to the socioeconomic attributes, by default you will also get columns representing the geo_unit and geo_filter, which contain FIPS codes. We recommend adding the NAME and GEOID columns to get the geographies' text names and the full FIPS code in one cell.

In [15]:

cols.extend(['NAME', 'GEOID'])

4. Pull the data¶

With all the pieces in hand, we can now pull the data from the API.

In [18]:

data = api_conn.query(cols, geo_unit=g_unit, geo_filter=g_filter)

In [19]:

data.shape

Out[19]:

(68, 77)

It is often useful to make the pandas DataFrame index the full FIPS code. The effect is that any selection from the DataFrame will be accompanied by the Census ID.

In [20]:

data.index = data.GEOID
data.index = data.index.str.replace('14000US','')

We can view all the columns, and then select one.

In [21]:

data.columns

Out[21]:

Index([u'B17006_001E', u'B17006_001M', u'B17006_002E', u'B17006_002M',
       u'B17006_003E', u'B17006_003M', u'B17006_004E', u'B17006_004M',
       u'B17006_005E', u'B17006_005M', u'B17006_006E', u'B17006_006M',
       u'B17006_007E', u'B17006_007M', u'B17006_008E', u'B17006_008M',
       u'B17006_009E', u'B17006_009M', u'B17006_010E', u'B17006_010M',
       u'B17006_011E', u'B17006_011M', u'B17006_012E', u'B17006_012M',
       u'B17006_013E', u'B17006_013M', u'B17006_014E', u'B17006_014M',
       u'B17006_015E', u'B17006_015M', u'B17006_016E', u'B17006_016M',
       u'B17006_017E', u'B17006_017M', u'B17006_018E', u'B17006_018M',
       u'B17006_019E', u'state', u'county', u'tract', u'B17006_019M',
       u'B17006_020E', u'B17006_020M', u'B17006_021E', u'B17006_021M',
       u'B17006_022E', u'B17006_022M', u'B17006_023E', u'B17006_023M',
       u'B17006_024E', u'B17006_024M', u'B17006_025E', u'B17006_025M',
       u'B17006_026E', u'B17006_026M', u'B17006_027E', u'B17006_027M',
       u'B17006_028E', u'B17006_028M', u'B17006_029E', u'B17006_029M',
       u'B19326_001E', u'B19326_001M', u'B19326_002E', u'B19326_002M',
       u'B19326_003E', u'B19326_003M', u'B19326_004E', u'B19326_004M',
       u'B19326_005E', u'B19326_005M', u'B19326_006E', u'B19326_006M',
       u'B19326_007E', u'B19326_007M', u'NAME', u'GEOID'],
      dtype='object')

In [22]:

#### select the count of number of children in poverty, and its MOE
data[['B17006_012E','B17006_012M']].head(10)

Out[22]:

	B17006_012E	B17006_012M
GEOID
12073000200	107	109
12073000301	0	14
12073000302	0	14
12073000303	147	97
12073000400	28	26
12073000500	0	14
12073000600	139	129
12073000700	107	96
12073000800	54	61
12073000901	154	198

As an aside, you might notice that the quality of this particular estimate is not very good.

5. Pull down the geometry¶

The Census Bureau API for extracting geometries is separate from the database API.

We can view all the available geographic data.

In [23]:

cen.tiger.available()

Out[23]:

[{u'name': u'AIANNHA', u'type': u'MapServer'},
 {u'name': u'CBSA', u'type': u'MapServer'},
 {u'name': u'Hydro_LargeScale', u'type': u'MapServer'},
 {u'name': u'Hydro', u'type': u'MapServer'},
 {u'name': u'Labels', u'type': u'MapServer'},
 {u'name': u'Legislative', u'type': u'MapServer'},
 {u'name': u'Places_CouSub_ConCity_SubMCD', u'type': u'MapServer'},
 {u'name': u'PUMA_TAD_TAZ_UGA_ZCTA', u'type': u'MapServer'},
 {u'name': u'Region_Division', u'type': u'MapServer'},
 {u'name': u'School', u'type': u'MapServer'},
 {u'name': u'Special_Land_Use_Areas', u'type': u'MapServer'},
 {u'name': u'State_County', u'type': u'MapServer'},
 {u'name': u'tigerWMS_ACS2013', u'type': u'MapServer'},
 {u'name': u'tigerWMS_ACS2014', u'type': u'MapServer'},
 {u'name': u'tigerWMS_ACS2015', u'type': u'MapServer'},
 {u'name': u'tigerWMS_Census2010', u'type': u'MapServer'},
 {u'name': u'tigerWMS_Current', u'type': u'MapServer'},
 {u'name': u'tigerWMS_Econ2012', u'type': u'MapServer'},
 {u'name': u'tigerWMS_PhysicalFeatures', u'type': u'MapServer'},
 {u'name': u'Tracts_Blocks', u'type': u'MapServer'},
 {u'name': u'Transportation_LargeScale', u'type': u'MapServer'},
 {u'name': u'Transportation', u'type': u'MapServer'},
 {u'name': u'TribalTracts', u'type': u'MapServer'},
 {u'name': u'Urban', u'type': u'MapServer'},
 {u'name': u'USLandmass', u'type': u'MapServer'}]

Like before, we need to make a connection to the API. In this case there is a set_mapservice method that will attached this second connection to the first. Since we are working with ACS data, we will connect to the ACS geometry data.

In [24]:

api_conn.set_mapservice('tigerWMS_ACS2013')
api_conn

Out[24]:

Connection to 2008-2012 American Community Survey - Summarized Data: 5-Year Summary File(ID: http://api.census.gov/data/id/ACSSF5Y2012)
With MapServer: Census ACS 2013 WMS

The ACS produces estimates from many different geographies.

In [25]:

api_conn.mapservice.layers

Out[25]:

{0: (ESRILayer) 2010 Census Public Use Microdata Areas,
 1: (ESRILayer) 2010 Census Public Use Microdata Areas Labels,
 2: (ESRILayer) 2010 Census ZIP Code Tabulation Areas,
 3: (ESRILayer) 2010 Census ZIP Code Tabulation Areas Labels,
 4: (ESRILayer) Tribal Census Tracts,
 5: (ESRILayer) Tribal Census Tracts Labels,
 6: (ESRILayer) Tribal Block Groups,
 7: (ESRILayer) Tribal Block Groups Labels,
 8: (ESRILayer) Census Tracts,
 9: (ESRILayer) Census Tracts Labels,
 10: (ESRILayer) Census Block Groups,
 11: (ESRILayer) Census Block Groups Labels,
 12: (ESRILayer) Unified School Districts,
 13: (ESRILayer) Unified School Districts Labels,
 14: (ESRILayer) Secondary School Districts,
 15: (ESRILayer) Secondary School Districts Labels,
 16: (ESRILayer) Elementary School Districts,
 17: (ESRILayer) Elementary School Districts Labels,
 18: (ESRILayer) Estates,
 19: (ESRILayer) Estates Labels,
 20: (ESRILayer) County Subdivisions,
 21: (ESRILayer) County Subdivisions Labels,
 22: (ESRILayer) Subbarrios,
 23: (ESRILayer) Subbarrios Labels,
 24: (ESRILayer) Consolidated Cities,
 25: (ESRILayer) Consolidated Cities Labels,
 26: (ESRILayer) Incorporated Places,
 27: (ESRILayer) Incorporated Places Labels,
 28: (ESRILayer) Census Designated Places,
 29: (ESRILayer) Census Designated Places Labels,
 30: (ESRILayer) Alaska Native Regional Corporations,
 31: (ESRILayer) Alaska Native Regional Corporations Labels,
 32: (ESRILayer) Tribal Subdivisions,
 33: (ESRILayer) Tribal Subdivisions Labels,
 34: (ESRILayer) Federal American Indian Reservations,
 35: (ESRILayer) Federal American Indian Reservations Labels,
 36: (ESRILayer) Off-Reservation Trust Lands,
 37: (ESRILayer) Off-Reservation Trust Lands Labels,
 38: (ESRILayer) State American Indian Reservations,
 39: (ESRILayer) State American Indian Reservations Labels,
 40: (ESRILayer) Hawaiian Home Lands,
 41: (ESRILayer) Hawaiian Home Lands Labels,
 42: (ESRILayer) Alaska Native Village Statistical Areas,
 43: (ESRILayer) Alaska Native Village Statistical Areas Labels,
 44: (ESRILayer) Oklahoma Tribal Statistical Areas,
 45: (ESRILayer) Oklahoma Tribal Statistical Areas Labels,
 46: (ESRILayer) State Designated Tribal Statistical Areas,
 47: (ESRILayer) State Designated Tribal Statistical Areas Labels,
 48: (ESRILayer) Tribal Designated Statistical Areas,
 49: (ESRILayer) Tribal Designated Statistical Areas Labels,
 50: (ESRILayer) American Indian Joint-Use Areas,
 51: (ESRILayer) American Indian Joint-Use Areas Labels,
 52: (ESRILayer) 113th Congressional Districts,
 53: (ESRILayer) 113th Congressional Districts Labels,
 54: (ESRILayer) 2013 State Legislative Districts - Upper,
 55: (ESRILayer) 2013 State Legislative Districts - Upper Labels,
 56: (ESRILayer) 2013 State Legislative Districts - Lower,
 57: (ESRILayer) 2013 State Legislative Districts - Lower Labels,
 58: (ESRILayer) Census Divisions,
 59: (ESRILayer) Census Divisions Labels,
 60: (ESRILayer) Census Regions,
 61: (ESRILayer) Census Regions Labels,
 62: (ESRILayer) 2010 Census Urbanized Areas,
 63: (ESRILayer) 2010 Census Urbanized Areas Labels,
 64: (ESRILayer) 2010 Census Urban Clusters,
 65: (ESRILayer) 2010 Census Urban Clusters Labels,
 66: (ESRILayer) Combined New England City and Town Areas,
 67: (ESRILayer) Combined New England City and Town Areas Labels,
 68: (ESRILayer) New England City and Town Area Divisions,
 69: (ESRILayer) New England City and Town Area  Divisions Labels,
 70: (ESRILayer) Metropolitan New England City and Town Areas,
 71: (ESRILayer) Metropolitan New England City and Town Areas Labels,
 72: (ESRILayer) Micropolitan New England City and Town Areas,
 73: (ESRILayer) Micropolitan New England City and Town Areas Labels,
 74: (ESRILayer) Combined Statistical Areas,
 75: (ESRILayer) Combined Statistical Areas Labels,
 76: (ESRILayer) Metropolitan Divisions,
 77: (ESRILayer) Metropolitan Divisions Labels,
 78: (ESRILayer) Metropolitan Statistical Areas,
 79: (ESRILayer) Metropolitan Statistical Areas Labels,
 80: (ESRILayer) Micropolitan Statistical Areas,
 81: (ESRILayer) Micropolitan Statistical Areas Labels,
 82: (ESRILayer) States,
 83: (ESRILayer) States Labels,
 84: (ESRILayer) Counties,
 85: (ESRILayer) Counties Labels}

We are working with census tracts.

In [26]:

api_conn.mapservice.layers[8]

Out[26]:

(ESRILayer) Census Tracts

Similar to setting the geographic constraints when pulling attribute data, you need to select the spatial scale and a bounding geography. However, these are selected entirely differently for the geometry data. The scale is set by passing an integer to the layer argument, and the bounding geography is selected by passing SQL to the WHERE argument.

In [27]:

#### select Florida
#geodata = api_conn.mapservice.query(layer=82, where='STATE=12', pkg='geopandas')
#### select all counties in Florida
#geodata = api_conn.mapservice.query(layer=84, where='STATE=12', pkg='geopandas')
#### select all census tracts in Florida
#geodata = api_conn.mapservice.query(layer=8, where='STATE=12', pkg='geopandas')
#### select all tracts in Leon County, Florida
geodata = api_conn.mapservice.query(layer=8, where='STATE=12 and COUNTY=073', pkg='geopandas')

6. Merge attributes and geometries¶

We now have all the pieces to merge the attribute data to the geometries.

In [28]:

newdata = pd.merge(data, geodata, left_index=True, right_on='GEOID')
newdata = gpd.GeoDataFrame(newdata)

We will plot median household income for Leon County, Florida.

In [29]:

newdata.plot(column='B19326_001E', scheme='QUANTILES', k=5, colormap='OrRd')

Out[29]:

<matplotlib.axes._subplots.AxesSubplot at 0x1107e0cd0>