David C. Folch | Florida State University | github: @dfolch
Rebecca Davies | University of Colorado Boulder | github: @beckymasond
There are a number of point-and-click sources:
Why this approach when there are all these nice websites?
Tools
I recommend installing Canopy Python or Anaconda Python. These come with the python programming language and many common libraries preinstalled.
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 12,12 # change this line if you want to change the default map size
import pandas as pd
import geopandas as gpd
import cenpy as cen
The Census Bureau provides many databases through their API, including the decennial census, economic census, etc.
databases = [(k,v) for k,v in cen.explorer.available(verbose=True).items()]
print 'total number of databases:', len(databases)
databases[0:5]
total number of databases: 85
[(u'NONEMP2012_old', u'2012 Nonemployer Statistics: Nonemployer Statistics'), (u'ASMState', u'Time Series Annual Survey of Manufactures: Statistics for All Manufacturing by State'), (u'IDBSINGLEYEAR', u'Time Series International Database: International Populations by Single Year of Age and Sex'), (u'2012popproj/deaths', u'Vintage 2012 Population Projections - : Projected Deaths'), (u'NONEMP2009', u'2009 Nonemployer Statistics: Non Employer Statistics')]
For this example we will use the 2008-2012 American Community Survey.
#api_database = '2010acs5' # ACS 2006-2010
api_database = 'ACSSF5Y2012' # ACS 2008-2012
#api_database = 'ACSSF5Y2013' # ACS 2009-2013
cen.explorer.explain(api_database)
{u'2008-2012 American Community Survey - Summarized Data: 5-Year Summary File': u"The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years. Questionnaires are mailed to a sample of addresses to obtain information about households -- that is, about each person and the housing unit itself. The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. It produces estimates for small areas, including census tracts and population subgroups. Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns, and estimates of housing units for states and counties. For 2010 and other decennial census years, the Decennial Census provides the official counts of population and housing units."}
api_conn = cen.base.Connection(api_database)
api_conn
Connection to 2008-2012 American Community Survey - Summarized Data: 5-Year Summary File (ID: http://api.census.gov/data/id/ACSSF5Y2012)
Data is provided at a variety of geographic scales, which depend on the database selected. For this example we're working with the ACS. Data for large geographies, like states, can be queried directly. But smaller geographies, like counties or census tracts, can only be selected by first choosing a bounding geography; this is sometimes referred to as geo-in-geo.
api_conn.geographies.keys()
[u'fips']
api_conn.geographies['fips']
geoLevelId | name | optionalWithWCFor | requires | |
---|---|---|---|---|
0 | 310 | metropolitan statistical area/micropolitan sta... | NaN | NaN |
1 | 010 | us | NaN | NaN |
2 | 020 | region | NaN | NaN |
3 | 030 | division | NaN | NaN |
4 | 400 | urban area | NaN | NaN |
5 | 250 | american indian area/alaska native area/hawaii... | NaN | NaN |
6 | 330 | combined statistical area | NaN | NaN |
7 | 350 | new england city and town area | NaN | NaN |
8 | 040 | state | NaN | NaN |
9 | 860 | zip code tabulation area | NaN | NaN |
10 | 795 | public use microdata area | state | [state] |
11 | 950 | school district (elementary) | NaN | [state] |
12 | 960 | school district (secondary) | NaN | [state] |
13 | 970 | school district (unified) | NaN | [state] |
14 | 320 | metropolitan statistical area/micropolitan sta... | NaN | [state] |
15 | 340 | combined statistical area | NaN | [state] |
16 | 500 | congressional district | state | [state] |
17 | 610 | state legislative district (upper chamber) | NaN | [state] |
18 | 620 | state legislative district (lower chamber) | NaN | [state] |
19 | 050 | county | state | [state] |
20 | 160 | place | state | [state] |
21 | 140 | tract | county | [state, county] |
22 | 060 | county subdivision | county | [state, county] |
23 | 150 | block group | tract | [state, county, tract] |
The geographic query requires selecting a spatial scale from the name
column in the table above, and in some cases an encompassing geography, based on the rules in the requires
column. In either case, FIPS codes are used to pick specific geographies.
#### select all states in the country
#g_unit = 'state'
#g_filter = {}
#### select Florida
#g_unit = 'state:12'
#g_filter = {}
#### select all counties in Florida
#g_unit = 'county'
#g_filter = {'state':'12'}
#### select all census tracts in Florida
#g_unit = 'tract'
#g_filter = {'state':'12'}
#### select all tracts in Leon County, Florida
g_unit = 'tract'
g_filter = {'state':'12', 'county':'073'}
Attributes represent the demographic or economic characteristics that you wish to download.
print 'Attributes in the ACS:', api_conn.variables.shape[0]
api_conn.variables.head(10)
Attributes in the ACS: 45236
concept | label | predicateOnly | predicateType | |
---|---|---|---|---|
AIANHH | Selectable Geographies | American Indian Area/Alaska Native Area/Hawaii... | NaN | NaN |
AIHHTLI | Selectable Geographies | American Indian Trust Land/Hawaiian Home Land ... | NaN | int |
AITS | Selectable Geographies | American Indian Tribal Subdivision (FIPS) | NaN | NaN |
AITSCE | Selectable Geographies | American Indian Tribal Subdivision (Census) | NaN | int |
ANRC | Selectable Geographies | Alaska Native Regional Corporation (FIPS) | NaN | NaN |
B00001_001E | B00001. Unweighted Sample Count of the Popula... | Total | NaN | int |
B00001_001M | B00001. Unweighted Sample Count of the Popula... | Margin Of Error For!!Total | NaN | int |
B00002_001E | B00002. Unweighted Sample Housing Units | Total | NaN | int |
B00002_001M | B00002. Unweighted Sample Housing Units | Margin Of Error For!!Total | NaN | int |
B01001A_001E | B01001A. SEX BY AGE (WHITE ALONE) | Total: | NaN | int |
Cenpy passes column names the Census API. Therefore, one option is to build a list of column codes. A good resource for these codes is Social Explorer. For example, here is the info for ACS 2008-2012 5 year tables.
#### total number of children in poverty
#cols = ['B17006_002E']
#### total number of children in poverty and number of children in pov in female headed households
#cols = ['B17006_002E', 'B17006_012E']
An alternative approach for building the list of column names is to use regular expressions. This approach is handy when you want all the columns from a particular table.
#### all the columns for table B17006 (poverty status for children by family type)
#cols = api_conn.varslike('B17006_\S+')
#### all the columns for table B17006 and table B19326 (median income by gender and employment)
cols = api_conn.varslike('B17006_\S+')
cols.extend(api_conn.varslike('B19326_\S+'))
len(cols)
72
You can get a text description of each variable. Some descriptions are more useful than others.
cols_detail = pd.DataFrame(api_conn.variables.ix[cols].label)
cols_detail.head()
label | |
---|---|
B17006_001E | Total: |
B17006_001M | Margin Of Error For!!Total: |
B17006_002E | Income in the past 12 months below poverty level: |
B17006_002M | Margin Of Error For!!Income in the past 12 mon... |
B17006_003E | Income in the past 12 months below poverty lev... |
Note that each estimate in the ACS comes with an accompanying margin of error (MOE). Column headers ending in E
contain estimates, and headers ending in M
contain MOEs.
In addition to the socioeconomic attributes, by default you will also get columns representing the geo_unit
and geo_filter
, which contain FIPS codes. We recommend adding the NAME
and GEOID
columns to get the geographies' text names and the full FIPS code in one cell.
cols.extend(['NAME', 'GEOID'])
With all the pieces in hand, we can now pull the data from the API.
data = api_conn.query(cols, geo_unit=g_unit, geo_filter=g_filter)
data.shape
(68, 77)
It is often useful to make the pandas DataFrame index the full FIPS code. The effect is that any selection from the DataFrame will be accompanied by the Census ID.
data.index = data.GEOID
data.index = data.index.str.replace('14000US','')
We can view all the columns, and then select one.
data.columns
Index([u'B17006_001E', u'B17006_001M', u'B17006_002E', u'B17006_002M', u'B17006_003E', u'B17006_003M', u'B17006_004E', u'B17006_004M', u'B17006_005E', u'B17006_005M', u'B17006_006E', u'B17006_006M', u'B17006_007E', u'B17006_007M', u'B17006_008E', u'B17006_008M', u'B17006_009E', u'B17006_009M', u'B17006_010E', u'B17006_010M', u'B17006_011E', u'B17006_011M', u'B17006_012E', u'B17006_012M', u'B17006_013E', u'B17006_013M', u'B17006_014E', u'B17006_014M', u'B17006_015E', u'B17006_015M', u'B17006_016E', u'B17006_016M', u'B17006_017E', u'B17006_017M', u'B17006_018E', u'B17006_018M', u'B17006_019E', u'state', u'county', u'tract', u'B17006_019M', u'B17006_020E', u'B17006_020M', u'B17006_021E', u'B17006_021M', u'B17006_022E', u'B17006_022M', u'B17006_023E', u'B17006_023M', u'B17006_024E', u'B17006_024M', u'B17006_025E', u'B17006_025M', u'B17006_026E', u'B17006_026M', u'B17006_027E', u'B17006_027M', u'B17006_028E', u'B17006_028M', u'B17006_029E', u'B17006_029M', u'B19326_001E', u'B19326_001M', u'B19326_002E', u'B19326_002M', u'B19326_003E', u'B19326_003M', u'B19326_004E', u'B19326_004M', u'B19326_005E', u'B19326_005M', u'B19326_006E', u'B19326_006M', u'B19326_007E', u'B19326_007M', u'NAME', u'GEOID'], dtype='object')
#### select the count of number of children in poverty, and its MOE
data[['B17006_012E','B17006_012M']].head(10)
B17006_012E | B17006_012M | |
---|---|---|
GEOID | ||
12073000200 | 107 | 109 |
12073000301 | 0 | 14 |
12073000302 | 0 | 14 |
12073000303 | 147 | 97 |
12073000400 | 28 | 26 |
12073000500 | 0 | 14 |
12073000600 | 139 | 129 |
12073000700 | 107 | 96 |
12073000800 | 54 | 61 |
12073000901 | 154 | 198 |
As an aside, you might notice that the quality of this particular estimate is not very good.
The Census Bureau API for extracting geometries is separate from the database API.
We can view all the available geographic data.
cen.tiger.available()
[{u'name': u'AIANNHA', u'type': u'MapServer'}, {u'name': u'CBSA', u'type': u'MapServer'}, {u'name': u'Hydro_LargeScale', u'type': u'MapServer'}, {u'name': u'Hydro', u'type': u'MapServer'}, {u'name': u'Labels', u'type': u'MapServer'}, {u'name': u'Legislative', u'type': u'MapServer'}, {u'name': u'Places_CouSub_ConCity_SubMCD', u'type': u'MapServer'}, {u'name': u'PUMA_TAD_TAZ_UGA_ZCTA', u'type': u'MapServer'}, {u'name': u'Region_Division', u'type': u'MapServer'}, {u'name': u'School', u'type': u'MapServer'}, {u'name': u'Special_Land_Use_Areas', u'type': u'MapServer'}, {u'name': u'State_County', u'type': u'MapServer'}, {u'name': u'tigerWMS_ACS2013', u'type': u'MapServer'}, {u'name': u'tigerWMS_ACS2014', u'type': u'MapServer'}, {u'name': u'tigerWMS_ACS2015', u'type': u'MapServer'}, {u'name': u'tigerWMS_Census2010', u'type': u'MapServer'}, {u'name': u'tigerWMS_Current', u'type': u'MapServer'}, {u'name': u'tigerWMS_Econ2012', u'type': u'MapServer'}, {u'name': u'tigerWMS_PhysicalFeatures', u'type': u'MapServer'}, {u'name': u'Tracts_Blocks', u'type': u'MapServer'}, {u'name': u'Transportation_LargeScale', u'type': u'MapServer'}, {u'name': u'Transportation', u'type': u'MapServer'}, {u'name': u'TribalTracts', u'type': u'MapServer'}, {u'name': u'Urban', u'type': u'MapServer'}, {u'name': u'USLandmass', u'type': u'MapServer'}]
Like before, we need to make a connection to the API. In this case there is a set_mapservice
method that will attached this second connection to the first. Since we are working with ACS data, we will connect to the ACS geometry data.
api_conn.set_mapservice('tigerWMS_ACS2013')
api_conn
Connection to 2008-2012 American Community Survey - Summarized Data: 5-Year Summary File(ID: http://api.census.gov/data/id/ACSSF5Y2012) With MapServer: Census ACS 2013 WMS
The ACS produces estimates from many different geographies.
api_conn.mapservice.layers
{0: (ESRILayer) 2010 Census Public Use Microdata Areas, 1: (ESRILayer) 2010 Census Public Use Microdata Areas Labels, 2: (ESRILayer) 2010 Census ZIP Code Tabulation Areas, 3: (ESRILayer) 2010 Census ZIP Code Tabulation Areas Labels, 4: (ESRILayer) Tribal Census Tracts, 5: (ESRILayer) Tribal Census Tracts Labels, 6: (ESRILayer) Tribal Block Groups, 7: (ESRILayer) Tribal Block Groups Labels, 8: (ESRILayer) Census Tracts, 9: (ESRILayer) Census Tracts Labels, 10: (ESRILayer) Census Block Groups, 11: (ESRILayer) Census Block Groups Labels, 12: (ESRILayer) Unified School Districts, 13: (ESRILayer) Unified School Districts Labels, 14: (ESRILayer) Secondary School Districts, 15: (ESRILayer) Secondary School Districts Labels, 16: (ESRILayer) Elementary School Districts, 17: (ESRILayer) Elementary School Districts Labels, 18: (ESRILayer) Estates, 19: (ESRILayer) Estates Labels, 20: (ESRILayer) County Subdivisions, 21: (ESRILayer) County Subdivisions Labels, 22: (ESRILayer) Subbarrios, 23: (ESRILayer) Subbarrios Labels, 24: (ESRILayer) Consolidated Cities, 25: (ESRILayer) Consolidated Cities Labels, 26: (ESRILayer) Incorporated Places, 27: (ESRILayer) Incorporated Places Labels, 28: (ESRILayer) Census Designated Places, 29: (ESRILayer) Census Designated Places Labels, 30: (ESRILayer) Alaska Native Regional Corporations, 31: (ESRILayer) Alaska Native Regional Corporations Labels, 32: (ESRILayer) Tribal Subdivisions, 33: (ESRILayer) Tribal Subdivisions Labels, 34: (ESRILayer) Federal American Indian Reservations, 35: (ESRILayer) Federal American Indian Reservations Labels, 36: (ESRILayer) Off-Reservation Trust Lands, 37: (ESRILayer) Off-Reservation Trust Lands Labels, 38: (ESRILayer) State American Indian Reservations, 39: (ESRILayer) State American Indian Reservations Labels, 40: (ESRILayer) Hawaiian Home Lands, 41: (ESRILayer) Hawaiian Home Lands Labels, 42: (ESRILayer) Alaska Native Village Statistical Areas, 43: (ESRILayer) Alaska Native Village Statistical Areas Labels, 44: (ESRILayer) Oklahoma Tribal Statistical Areas, 45: (ESRILayer) Oklahoma Tribal Statistical Areas Labels, 46: (ESRILayer) State Designated Tribal Statistical Areas, 47: (ESRILayer) State Designated Tribal Statistical Areas Labels, 48: (ESRILayer) Tribal Designated Statistical Areas, 49: (ESRILayer) Tribal Designated Statistical Areas Labels, 50: (ESRILayer) American Indian Joint-Use Areas, 51: (ESRILayer) American Indian Joint-Use Areas Labels, 52: (ESRILayer) 113th Congressional Districts, 53: (ESRILayer) 113th Congressional Districts Labels, 54: (ESRILayer) 2013 State Legislative Districts - Upper, 55: (ESRILayer) 2013 State Legislative Districts - Upper Labels, 56: (ESRILayer) 2013 State Legislative Districts - Lower, 57: (ESRILayer) 2013 State Legislative Districts - Lower Labels, 58: (ESRILayer) Census Divisions, 59: (ESRILayer) Census Divisions Labels, 60: (ESRILayer) Census Regions, 61: (ESRILayer) Census Regions Labels, 62: (ESRILayer) 2010 Census Urbanized Areas, 63: (ESRILayer) 2010 Census Urbanized Areas Labels, 64: (ESRILayer) 2010 Census Urban Clusters, 65: (ESRILayer) 2010 Census Urban Clusters Labels, 66: (ESRILayer) Combined New England City and Town Areas, 67: (ESRILayer) Combined New England City and Town Areas Labels, 68: (ESRILayer) New England City and Town Area Divisions, 69: (ESRILayer) New England City and Town Area Divisions Labels, 70: (ESRILayer) Metropolitan New England City and Town Areas, 71: (ESRILayer) Metropolitan New England City and Town Areas Labels, 72: (ESRILayer) Micropolitan New England City and Town Areas, 73: (ESRILayer) Micropolitan New England City and Town Areas Labels, 74: (ESRILayer) Combined Statistical Areas, 75: (ESRILayer) Combined Statistical Areas Labels, 76: (ESRILayer) Metropolitan Divisions, 77: (ESRILayer) Metropolitan Divisions Labels, 78: (ESRILayer) Metropolitan Statistical Areas, 79: (ESRILayer) Metropolitan Statistical Areas Labels, 80: (ESRILayer) Micropolitan Statistical Areas, 81: (ESRILayer) Micropolitan Statistical Areas Labels, 82: (ESRILayer) States, 83: (ESRILayer) States Labels, 84: (ESRILayer) Counties, 85: (ESRILayer) Counties Labels}
We are working with census tracts.
api_conn.mapservice.layers[8]
(ESRILayer) Census Tracts
Similar to setting the geographic constraints when pulling attribute data, you need to select the spatial scale and a bounding geography. However, these are selected entirely differently for the geometry data. The scale is set by passing an integer to the layer
argument, and the bounding geography is selected by passing SQL to the WHERE argument.
#### select Florida
#geodata = api_conn.mapservice.query(layer=82, where='STATE=12', pkg='geopandas')
#### select all counties in Florida
#geodata = api_conn.mapservice.query(layer=84, where='STATE=12', pkg='geopandas')
#### select all census tracts in Florida
#geodata = api_conn.mapservice.query(layer=8, where='STATE=12', pkg='geopandas')
#### select all tracts in Leon County, Florida
geodata = api_conn.mapservice.query(layer=8, where='STATE=12 and COUNTY=073', pkg='geopandas')
We now have all the pieces to merge the attribute data to the geometries.
newdata = pd.merge(data, geodata, left_index=True, right_on='GEOID')
newdata = gpd.GeoDataFrame(newdata)
We will plot median household income for Leon County, Florida.
newdata.plot(column='B19326_001E', scheme='QUANTILES', k=5, colormap='OrRd')
<matplotlib.axes._subplots.AxesSubplot at 0x1107e0cd0>