Downloading and Plotting U.S. Census Bureau Data Using Python

David C. Folch | Florida State University | github: @dfolch

Rebecca Davies | University of Colorado Boulder | github: @beckymasond


Motivation

Sources of US Census Bureau data

There are a number of point-and-click sources:

Accessing the API (using python)

Why this approach when there are all these nice websites?

  • Reproducible (data) science
  • Efficiency
  • One platform

Tools


Installation

I recommend installing Canopy Python or Anaconda Python. These come with the python programming language and many common libraries preinstalled.

Cenpy

GeoPandas

In [1]:
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 12,12  # change this line if you want to change the default map size
import pandas as pd
import geopandas as gpd
import cenpy as cen

Work Flow

  1. Select database (ex. 2008-2012 ACS)
  2. Select geography(s) (ex. Leon County census tracts)
  3. Select attributes (ex. income and poverty)
  4. Pull down the data
  5. Pull down the geography
  6. Link the data to the geography

1. Database

The Census Bureau provides many databases through their API, including the decennial census, economic census, etc.

Explore database options

In [2]:
databases = [(k,v) for k,v in cen.explorer.available(verbose=True).items()]
print 'total number of databases:', len(databases)
databases[0:5]
total number of databases: 85
Out[2]:
[(u'NONEMP2012_old', u'2012 Nonemployer Statistics: Nonemployer Statistics'),
 (u'ASMState',
  u'Time Series Annual Survey of Manufactures: Statistics for All Manufacturing by State'),
 (u'IDBSINGLEYEAR',
  u'Time Series International Database: International Populations by Single Year of Age and Sex'),
 (u'2012popproj/deaths',
  u'Vintage 2012 Population Projections - : Projected Deaths'),
 (u'NONEMP2009', u'2009 Nonemployer Statistics: Non Employer Statistics')]

For this example we will use the 2008-2012 American Community Survey.

In [3]:
#api_database = '2010acs5'     # ACS 2006-2010
api_database = 'ACSSF5Y2012'  # ACS 2008-2012
#api_database = 'ACSSF5Y2013'  # ACS 2009-2013
In [4]:
cen.explorer.explain(api_database)
Out[4]:
{u'2008-2012 American Community Survey - Summarized Data: 5-Year Summary File': u"The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years. Questionnaires are mailed to a sample of addresses to obtain information about households -- that is, about each person and the housing unit itself. The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. It produces estimates for small areas, including census tracts and population subgroups. Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns, and estimates of housing units for states and counties.  For 2010 and other decennial census years, the Decennial Census provides the official counts of population and housing units."}

Connect to database

In [5]:
api_conn = cen.base.Connection(api_database)
In [6]:
api_conn
Out[6]:
Connection to 2008-2012 American Community Survey - Summarized Data: 5-Year Summary File (ID: http://api.census.gov/data/id/ACSSF5Y2012)

2. Geography

Data is provided at a variety of geographic scales, which depend on the database selected. For this example we're working with the ACS. Data for large geographies, like states, can be queried directly. But smaller geographies, like counties or census tracts, can only be selected by first choosing a bounding geography; this is sometimes referred to as geo-in-geo.

In [7]:
api_conn.geographies.keys()
Out[7]:
[u'fips']
In [8]:
api_conn.geographies['fips']
Out[8]:
geoLevelId name optionalWithWCFor requires
0 310 metropolitan statistical area/micropolitan sta... NaN NaN
1 010 us NaN NaN
2 020 region NaN NaN
3 030 division NaN NaN
4 400 urban area NaN NaN
5 250 american indian area/alaska native area/hawaii... NaN NaN
6 330 combined statistical area NaN NaN
7 350 new england city and town area NaN NaN
8 040 state NaN NaN
9 860 zip code tabulation area NaN NaN
10 795 public use microdata area state [state]
11 950 school district (elementary) NaN [state]
12 960 school district (secondary) NaN [state]
13 970 school district (unified) NaN [state]
14 320 metropolitan statistical area/micropolitan sta... NaN [state]
15 340 combined statistical area NaN [state]
16 500 congressional district state [state]
17 610 state legislative district (upper chamber) NaN [state]
18 620 state legislative district (lower chamber) NaN [state]
19 050 county state [state]
20 160 place state [state]
21 140 tract county [state, county]
22 060 county subdivision county [state, county]
23 150 block group tract [state, county, tract]

The geographic query requires selecting a spatial scale from the name column in the table above, and in some cases an encompassing geography, based on the rules in the requires column. In either case, FIPS codes are used to pick specific geographies.

In [9]:
#### select all states in the country
#g_unit = 'state'
#g_filter = {}
#### select Florida
#g_unit = 'state:12'
#g_filter = {}
#### select all counties in Florida
#g_unit = 'county'
#g_filter = {'state':'12'}
#### select all census tracts in Florida
#g_unit = 'tract'
#g_filter = {'state':'12'}
#### select all tracts in Leon County, Florida
g_unit = 'tract'
g_filter = {'state':'12', 'county':'073'}

3. Attributes

Attributes represent the demographic or economic characteristics that you wish to download.

Select attributes

In [10]:
print 'Attributes in the ACS:', api_conn.variables.shape[0]
api_conn.variables.head(10)
Attributes in the ACS: 45236
Out[10]:
concept label predicateOnly predicateType
AIANHH Selectable Geographies American Indian Area/Alaska Native Area/Hawaii... NaN NaN
AIHHTLI Selectable Geographies American Indian Trust Land/Hawaiian Home Land ... NaN int
AITS Selectable Geographies American Indian Tribal Subdivision (FIPS) NaN NaN
AITSCE Selectable Geographies American Indian Tribal Subdivision (Census) NaN int
ANRC Selectable Geographies Alaska Native Regional Corporation (FIPS) NaN NaN
B00001_001E B00001. Unweighted Sample Count of the Popula... Total NaN int
B00001_001M B00001. Unweighted Sample Count of the Popula... Margin Of Error For!!Total NaN int
B00002_001E B00002. Unweighted Sample Housing Units Total NaN int
B00002_001M B00002. Unweighted Sample Housing Units Margin Of Error For!!Total NaN int
B01001A_001E B01001A. SEX BY AGE (WHITE ALONE) Total: NaN int

Cenpy passes column names the Census API. Therefore, one option is to build a list of column codes. A good resource for these codes is Social Explorer. For example, here is the info for ACS 2008-2012 5 year tables.

In [11]:
#### total number of children in poverty
#cols = ['B17006_002E']
#### total number of children in poverty and number of children in pov in female headed households
#cols = ['B17006_002E', 'B17006_012E']

An alternative approach for building the list of column names is to use regular expressions. This approach is handy when you want all the columns from a particular table.

In [12]:
#### all the columns for table B17006 (poverty status for children by family type)
#cols = api_conn.varslike('B17006_\S+')
#### all the columns for table B17006 and table B19326 (median income by gender and employment)
cols = api_conn.varslike('B17006_\S+')
cols.extend(api_conn.varslike('B19326_\S+'))
In [13]:
len(cols)
Out[13]:
72

You can get a text description of each variable. Some descriptions are more useful than others.

In [14]:
cols_detail = pd.DataFrame(api_conn.variables.ix[cols].label)
cols_detail.head()
Out[14]:
label
B17006_001E Total:
B17006_001M Margin Of Error For!!Total:
B17006_002E Income in the past 12 months below poverty level:
B17006_002M Margin Of Error For!!Income in the past 12 mon...
B17006_003E Income in the past 12 months below poverty lev...

Note that each estimate in the ACS comes with an accompanying margin of error (MOE). Column headers ending in E contain estimates, and headers ending in M contain MOEs.

Geographic attributes

In addition to the socioeconomic attributes, by default you will also get columns representing the geo_unit and geo_filter, which contain FIPS codes. We recommend adding the NAME and GEOID columns to get the geographies' text names and the full FIPS code in one cell.

In [15]:
cols.extend(['NAME', 'GEOID'])

4. Pull the data

With all the pieces in hand, we can now pull the data from the API.

In [18]:
data = api_conn.query(cols, geo_unit=g_unit, geo_filter=g_filter)
In [19]:
data.shape
Out[19]:
(68, 77)

It is often useful to make the pandas DataFrame index the full FIPS code. The effect is that any selection from the DataFrame will be accompanied by the Census ID.

In [20]:
data.index = data.GEOID
data.index = data.index.str.replace('14000US','')

We can view all the columns, and then select one.

In [21]:
data.columns
Out[21]:
Index([u'B17006_001E', u'B17006_001M', u'B17006_002E', u'B17006_002M',
       u'B17006_003E', u'B17006_003M', u'B17006_004E', u'B17006_004M',
       u'B17006_005E', u'B17006_005M', u'B17006_006E', u'B17006_006M',
       u'B17006_007E', u'B17006_007M', u'B17006_008E', u'B17006_008M',
       u'B17006_009E', u'B17006_009M', u'B17006_010E', u'B17006_010M',
       u'B17006_011E', u'B17006_011M', u'B17006_012E', u'B17006_012M',
       u'B17006_013E', u'B17006_013M', u'B17006_014E', u'B17006_014M',
       u'B17006_015E', u'B17006_015M', u'B17006_016E', u'B17006_016M',
       u'B17006_017E', u'B17006_017M', u'B17006_018E', u'B17006_018M',
       u'B17006_019E', u'state', u'county', u'tract', u'B17006_019M',
       u'B17006_020E', u'B17006_020M', u'B17006_021E', u'B17006_021M',
       u'B17006_022E', u'B17006_022M', u'B17006_023E', u'B17006_023M',
       u'B17006_024E', u'B17006_024M', u'B17006_025E', u'B17006_025M',
       u'B17006_026E', u'B17006_026M', u'B17006_027E', u'B17006_027M',
       u'B17006_028E', u'B17006_028M', u'B17006_029E', u'B17006_029M',
       u'B19326_001E', u'B19326_001M', u'B19326_002E', u'B19326_002M',
       u'B19326_003E', u'B19326_003M', u'B19326_004E', u'B19326_004M',
       u'B19326_005E', u'B19326_005M', u'B19326_006E', u'B19326_006M',
       u'B19326_007E', u'B19326_007M', u'NAME', u'GEOID'],
      dtype='object')
In [22]:
#### select the count of number of children in poverty, and its MOE
data[['B17006_012E','B17006_012M']].head(10)
Out[22]:
B17006_012E B17006_012M
GEOID
12073000200 107 109
12073000301 0 14
12073000302 0 14
12073000303 147 97
12073000400 28 26
12073000500 0 14
12073000600 139 129
12073000700 107 96
12073000800 54 61
12073000901 154 198

As an aside, you might notice that the quality of this particular estimate is not very good.


5. Pull down the geometry

The Census Bureau API for extracting geometries is separate from the database API.

We can view all the available geographic data.

In [23]:
cen.tiger.available()
Out[23]:
[{u'name': u'AIANNHA', u'type': u'MapServer'},
 {u'name': u'CBSA', u'type': u'MapServer'},
 {u'name': u'Hydro_LargeScale', u'type': u'MapServer'},
 {u'name': u'Hydro', u'type': u'MapServer'},
 {u'name': u'Labels', u'type': u'MapServer'},
 {u'name': u'Legislative', u'type': u'MapServer'},
 {u'name': u'Places_CouSub_ConCity_SubMCD', u'type': u'MapServer'},
 {u'name': u'PUMA_TAD_TAZ_UGA_ZCTA', u'type': u'MapServer'},
 {u'name': u'Region_Division', u'type': u'MapServer'},
 {u'name': u'School', u'type': u'MapServer'},
 {u'name': u'Special_Land_Use_Areas', u'type': u'MapServer'},
 {u'name': u'State_County', u'type': u'MapServer'},
 {u'name': u'tigerWMS_ACS2013', u'type': u'MapServer'},
 {u'name': u'tigerWMS_ACS2014', u'type': u'MapServer'},
 {u'name': u'tigerWMS_ACS2015', u'type': u'MapServer'},
 {u'name': u'tigerWMS_Census2010', u'type': u'MapServer'},
 {u'name': u'tigerWMS_Current', u'type': u'MapServer'},
 {u'name': u'tigerWMS_Econ2012', u'type': u'MapServer'},
 {u'name': u'tigerWMS_PhysicalFeatures', u'type': u'MapServer'},
 {u'name': u'Tracts_Blocks', u'type': u'MapServer'},
 {u'name': u'Transportation_LargeScale', u'type': u'MapServer'},
 {u'name': u'Transportation', u'type': u'MapServer'},
 {u'name': u'TribalTracts', u'type': u'MapServer'},
 {u'name': u'Urban', u'type': u'MapServer'},
 {u'name': u'USLandmass', u'type': u'MapServer'}]

Like before, we need to make a connection to the API. In this case there is a set_mapservice method that will attached this second connection to the first. Since we are working with ACS data, we will connect to the ACS geometry data.

In [24]:
api_conn.set_mapservice('tigerWMS_ACS2013')
api_conn
Out[24]:
Connection to 2008-2012 American Community Survey - Summarized Data: 5-Year Summary File(ID: http://api.census.gov/data/id/ACSSF5Y2012)
With MapServer: Census ACS 2013 WMS

The ACS produces estimates from many different geographies.

In [25]:
api_conn.mapservice.layers
Out[25]:
{0: (ESRILayer) 2010 Census Public Use Microdata Areas,
 1: (ESRILayer) 2010 Census Public Use Microdata Areas Labels,
 2: (ESRILayer) 2010 Census ZIP Code Tabulation Areas,
 3: (ESRILayer) 2010 Census ZIP Code Tabulation Areas Labels,
 4: (ESRILayer) Tribal Census Tracts,
 5: (ESRILayer) Tribal Census Tracts Labels,
 6: (ESRILayer) Tribal Block Groups,
 7: (ESRILayer) Tribal Block Groups Labels,
 8: (ESRILayer) Census Tracts,
 9: (ESRILayer) Census Tracts Labels,
 10: (ESRILayer) Census Block Groups,
 11: (ESRILayer) Census Block Groups Labels,
 12: (ESRILayer) Unified School Districts,
 13: (ESRILayer) Unified School Districts Labels,
 14: (ESRILayer) Secondary School Districts,
 15: (ESRILayer) Secondary School Districts Labels,
 16: (ESRILayer) Elementary School Districts,
 17: (ESRILayer) Elementary School Districts Labels,
 18: (ESRILayer) Estates,
 19: (ESRILayer) Estates Labels,
 20: (ESRILayer) County Subdivisions,
 21: (ESRILayer) County Subdivisions Labels,
 22: (ESRILayer) Subbarrios,
 23: (ESRILayer) Subbarrios Labels,
 24: (ESRILayer) Consolidated Cities,
 25: (ESRILayer) Consolidated Cities Labels,
 26: (ESRILayer) Incorporated Places,
 27: (ESRILayer) Incorporated Places Labels,
 28: (ESRILayer) Census Designated Places,
 29: (ESRILayer) Census Designated Places Labels,
 30: (ESRILayer) Alaska Native Regional Corporations,
 31: (ESRILayer) Alaska Native Regional Corporations Labels,
 32: (ESRILayer) Tribal Subdivisions,
 33: (ESRILayer) Tribal Subdivisions Labels,
 34: (ESRILayer) Federal American Indian Reservations,
 35: (ESRILayer) Federal American Indian Reservations Labels,
 36: (ESRILayer) Off-Reservation Trust Lands,
 37: (ESRILayer) Off-Reservation Trust Lands Labels,
 38: (ESRILayer) State American Indian Reservations,
 39: (ESRILayer) State American Indian Reservations Labels,
 40: (ESRILayer) Hawaiian Home Lands,
 41: (ESRILayer) Hawaiian Home Lands Labels,
 42: (ESRILayer) Alaska Native Village Statistical Areas,
 43: (ESRILayer) Alaska Native Village Statistical Areas Labels,
 44: (ESRILayer) Oklahoma Tribal Statistical Areas,
 45: (ESRILayer) Oklahoma Tribal Statistical Areas Labels,
 46: (ESRILayer) State Designated Tribal Statistical Areas,
 47: (ESRILayer) State Designated Tribal Statistical Areas Labels,
 48: (ESRILayer) Tribal Designated Statistical Areas,
 49: (ESRILayer) Tribal Designated Statistical Areas Labels,
 50: (ESRILayer) American Indian Joint-Use Areas,
 51: (ESRILayer) American Indian Joint-Use Areas Labels,
 52: (ESRILayer) 113th Congressional Districts,
 53: (ESRILayer) 113th Congressional Districts Labels,
 54: (ESRILayer) 2013 State Legislative Districts - Upper,
 55: (ESRILayer) 2013 State Legislative Districts - Upper Labels,
 56: (ESRILayer) 2013 State Legislative Districts - Lower,
 57: (ESRILayer) 2013 State Legislative Districts - Lower Labels,
 58: (ESRILayer) Census Divisions,
 59: (ESRILayer) Census Divisions Labels,
 60: (ESRILayer) Census Regions,
 61: (ESRILayer) Census Regions Labels,
 62: (ESRILayer) 2010 Census Urbanized Areas,
 63: (ESRILayer) 2010 Census Urbanized Areas Labels,
 64: (ESRILayer) 2010 Census Urban Clusters,
 65: (ESRILayer) 2010 Census Urban Clusters Labels,
 66: (ESRILayer) Combined New England City and Town Areas,
 67: (ESRILayer) Combined New England City and Town Areas Labels,
 68: (ESRILayer) New England City and Town Area Divisions,
 69: (ESRILayer) New England City and Town Area  Divisions Labels,
 70: (ESRILayer) Metropolitan New England City and Town Areas,
 71: (ESRILayer) Metropolitan New England City and Town Areas Labels,
 72: (ESRILayer) Micropolitan New England City and Town Areas,
 73: (ESRILayer) Micropolitan New England City and Town Areas Labels,
 74: (ESRILayer) Combined Statistical Areas,
 75: (ESRILayer) Combined Statistical Areas Labels,
 76: (ESRILayer) Metropolitan Divisions,
 77: (ESRILayer) Metropolitan Divisions Labels,
 78: (ESRILayer) Metropolitan Statistical Areas,
 79: (ESRILayer) Metropolitan Statistical Areas Labels,
 80: (ESRILayer) Micropolitan Statistical Areas,
 81: (ESRILayer) Micropolitan Statistical Areas Labels,
 82: (ESRILayer) States,
 83: (ESRILayer) States Labels,
 84: (ESRILayer) Counties,
 85: (ESRILayer) Counties Labels}

We are working with census tracts.

In [26]:
api_conn.mapservice.layers[8]
Out[26]:
(ESRILayer) Census Tracts

Similar to setting the geographic constraints when pulling attribute data, you need to select the spatial scale and a bounding geography. However, these are selected entirely differently for the geometry data. The scale is set by passing an integer to the layer argument, and the bounding geography is selected by passing SQL to the WHERE argument.

In [27]:
#### select Florida
#geodata = api_conn.mapservice.query(layer=82, where='STATE=12', pkg='geopandas')
#### select all counties in Florida
#geodata = api_conn.mapservice.query(layer=84, where='STATE=12', pkg='geopandas')
#### select all census tracts in Florida
#geodata = api_conn.mapservice.query(layer=8, where='STATE=12', pkg='geopandas')
#### select all tracts in Leon County, Florida
geodata = api_conn.mapservice.query(layer=8, where='STATE=12 and COUNTY=073', pkg='geopandas')

6. Merge attributes and geometries

We now have all the pieces to merge the attribute data to the geometries.

In [28]:
newdata = pd.merge(data, geodata, left_index=True, right_on='GEOID')
newdata = gpd.GeoDataFrame(newdata)

We will plot median household income for Leon County, Florida.

In [29]:
newdata.plot(column='B19326_001E', scheme='QUANTILES', k=5, colormap='OrRd')
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1107e0cd0>