#!/usr/bin/env python
# coding: utf-8

# # Downloading and Plotting U.S. Census Bureau Data Using Python
# David C. Folch | Florida State University | github: @dfolch
# 
# Rebecca Davies | University of Colorado Boulder | github: @beckymasond

# ---

# ## Motivation

# ### Sources of US Census Bureau data

# There are a number of point-and-click sources:
# * [NHGIS](https://www.nhgis.org/) (National Historical Geographic Information System)
# * [Social Explorer](http://www.socialexplorer.com/)
# * [American FactFinder](http://factfinder.census.gov) from the US Census Bureau

# ### Accessing the API (using Python)

# Why this approach when there are all these nice websites?
# * Reproducible (data) science
# * Efficiency
# * One platform
# 
# Tools
# * [US Census Bureau API](http://www.census.gov/developers/)
# * [Python](https://www.python.org/) 
# * [Cenpy](https://github.com/ljwolf/cenpy) package to interface with the Census API
# * [GeoPandas](http://geopandas.org/) package to hold and plot data

# ---

# ## Installation
# 
# I recommend installing [Canopy Python](https://www.enthought.com/products/canopy/) or [Anaconda Python](https://www.continuum.io/downloads). These come with the Python programming language and many common libraries preinstalled.
# 
# #### Cenpy
# * Dependencies
#     * [Pandas](https://pandas.pydata.org/)
#     * [Requests](https://docs.python-requests.org)
# * `pip install cenpy`
# 
# #### GeoPandas
# * Dependencies
#     * [Pandas](https://pandas.pydata.org/)
#     * [Matplotlib](http://matplotlib.org/)
#     * [Shapely](http://toblerity.org/shapely/)
#     * [Fiona](http://toblerity.org/fiona/)
#     * [Pyproj](https://github.com/jswhit/pyproj)
#     * [Descartes](https://pypi.python.org/pypi/descartes)
#     * [PySAL](http://pysal.org) (optional - needed for choropleth maps)
# * `pip install geopandas`
# 

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib import rcParams
rcParams['figure.figsize'] = 12,12  # change this line if you want to change the default map size
import pandas as pd
import geopandas as gpd
import cenpy as cen


# ---

# ## Work Flow

# 1. Select database  (ex. 2008-2012 ACS)
# 2. Select geography(s)  (ex. Leon County census tracts)
# 3. Select attributes  (ex. income and poverty)
# 4. Pull down the data
# 5. Pull down the geography
# 6. Link the data to the geography

# ---

# ### 1. Database
# 
# The Census Bureau provides [many databases](http://www.census.gov/data/developers/data-sets.html) through their API, including the decennial census, economic census, etc.

# #### Explore database options

# In[2]:


# list every database as a (key, description) pair
databases = list(cen.explorer.available(verbose=True).items())
print('total number of databases:', len(databases))
databases[0:5]


# For this example we will use the 2008-2012 [American Community Survey](https://www.census.gov/programs-surveys/acs/).

# In[3]:


#api_database = '2010acs5'     # ACS 2006-2010
api_database = 'ACSSF5Y2012'  # ACS 2008-2012
#api_database = 'ACSSF5Y2013'  # ACS 2009-2013


# In[4]:


cen.explorer.explain(api_database)


# 
# #### Connect to database

# In[5]:


api_conn = cen.base.Connection(api_database)


# In[6]:


api_conn


# ---

# ### 2. Geography

# Data is provided at a variety of geographic scales, which depend on the database selected. For this example we're working with the ACS. Data for large geographies, like states, can be queried directly. But smaller geographies, like counties or census tracts, can only be selected by first choosing a bounding geography; this is sometimes referred to as *geo-in-geo*.  

# In[7]:


api_conn.geographies.keys()


# In[8]:


api_conn.geographies['fips']


# The geographic query requires selecting a spatial scale from the `name` column in the table above, and in some cases an encompassing geography, based on the rules in the `requires` column. In either case, [FIPS codes](http://mcdc.missouri.edu/webrepts/commoncodes11/) are used to pick specific geographies.

# In[9]:


#### select all states in the country
#g_unit = 'state'
#g_filter = {}
#### select Florida
#g_unit = 'state:12'
#g_filter = {}
#### select all counties in Florida
#g_unit = 'county'
#g_filter = {'state':'12'}
#### select all census tracts in Florida
#g_unit = 'tract'
#g_filter = {'state':'12'}
#### select all tracts in Leon County, Florida
g_unit = 'tract'
g_filter = {'state':'12', 'county':'073'}
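

# As a rough illustration of what the unit/filter pair means, the Census API itself expresses the same idea with a `for` predicate (the target geography) and an `in` predicate (the bounding geographies). The helper below is a hypothetical sketch, not Cenpy's internals; the name `geo_predicates` and the `*` wildcard default are assumptions made for illustration.

# In[ ]:


def geo_predicates(unit, bounds):
    """Sketch: translate a unit/filter pair into Census-API-style predicates."""
    # 'tract' means all tracts, so add the wildcard; 'state:12' is already specific
    target = unit if ':' in unit else unit + ':*'
    params = {'for': target}
    if bounds:
        # list the bounding geographies in the form "state:12 county:073"
        params['in'] = ' '.join('{}:{}'.format(k, v) for k, v in bounds.items())
    return params


geo_predicates('tract', {'state': '12', 'county': '073'})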


# ---

# ### 3. Attributes

# Attributes represent the demographic or economic characteristics that you wish to download.

# #### Select attributes

# In[10]:


print('Attributes in the ACS:', api_conn.variables.shape[0])
api_conn.variables.head(10)


# Cenpy passes column names to the Census API. Therefore, one option is to build a list of column codes. A good resource for these codes is [Social Explorer](http://www.socialexplorer.com/). For example, here is the info for [ACS 2008-2012 5 year tables](http://www.socialexplorer.com/data/ACS2012_5yr/metadata/?ds=ACS12_5yr).

# In[11]:


#### total number of children in poverty
#cols = ['B17006_002E']
#### total number of children in poverty and number of children in poverty in female-headed households
#cols = ['B17006_002E', 'B17006_012E']


# An alternative approach for building the list of column names is to use [regular expressions](https://en.wikipedia.org/wiki/Regular_expression). This approach is handy when you want all the columns from a particular table.

# In[12]:


#### all the columns for table B17006 (poverty status for children by family type)
#cols = api_conn.varslike(r'B17006_\S+')
#### all the columns for table B17006 and table B19326 (median income by gender and employment)
# raw strings keep the backslash in \S from being treated as a string escape
cols = api_conn.varslike(r'B17006_\S+')
cols.extend(api_conn.varslike(r'B19326_\S+'))


# In[13]:


len(cols)
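

# To see what such a pattern selects, here is a small standalone sketch using Python's `re` module on a short illustrative list of column names. It assumes the pattern is matched from the start of each name; how `varslike` applies the pattern internally is not shown here.

# In[ ]:


import re

# small illustrative sample of column names, not the full variable table
sample_names = ['B17006_001E', 'B17006_001M', 'B17006_002E', 'B19326_001E', 'NAME']
# table prefix, an underscore, then one or more non-whitespace characters
pattern = re.compile(r'B17006_\S+')
matched = [name for name in sample_names if pattern.match(name)]
matched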


# You can get a text description of each variable. Some descriptions are more useful than others.

# In[14]:


# .loc selects rows by label (the older .ix indexer has been removed from pandas)
cols_detail = pd.DataFrame(api_conn.variables.loc[cols].label)
cols_detail.head()


# Note that each estimate in the ACS comes with an accompanying margin of error (MOE). Column headers ending in `E` contain estimates, and headers ending in `M` contain MOEs.

# #### Geographic attributes

# In addition to the socioeconomic attributes, by default you will also get columns representing the `geo_unit` and `geo_filter`, which contain FIPS codes. We recommend adding the `NAME` and `GEOID` columns to get the geographies' text names and the full FIPS code in one cell.

# In[15]:


cols.extend(['NAME', 'GEOID'])


# ---

# ### 4. Pull the data

# With all the pieces in hand, we can now pull the data from the API.

# In[18]:


data = api_conn.query(cols, geo_unit=g_unit, geo_filter=g_filter)


# In[19]:


data.shape


# It is often useful to make the pandas DataFrame index the full FIPS code. The effect is that any selection from the DataFrame will be accompanied by the Census ID.

# In[20]:


data.index = data.GEOID
data.index = data.index.str.replace('14000US','')
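

# The full tract code is a concatenation of fixed-width FIPS pieces: 2 digits for the state, 3 for the county and 6 for the tract. A minimal sketch of splitting one apart follows; `split_tract_geoid` is a hypothetical helper, and the tract code `000200` is a made-up example value.

# In[ ]:


def split_tract_geoid(geoid):
    """Split an 11-digit tract FIPS code into state, county and tract parts."""
    geoid = geoid.replace('14000US', '')  # drop the `14000US` prefix if present
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError('expected an 11-digit tract code, got %r' % geoid)
    return {'state': geoid[:2], 'county': geoid[2:5], 'tract': geoid[5:]}


split_tract_geoid('14000US12073000200')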


# We can view all the columns, and then select one.

# In[21]:


data.columns


# In[22]:


#### select the number of children in poverty in female-headed households, and its MOE
data[['B17006_012E','B17006_012M']].head(10)


# As an aside, comparing each estimate with its MOE suggests that the quality of this particular estimate is not very good.
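

# One common way to put a number on estimate quality is the coefficient of variation (CV). ACS margins of error are published at the 90 percent confidence level, so the standard error is MOE / 1.645, and the CV is the standard error divided by the estimate. The sketch below uses made-up values rather than numbers from the table above; `coefficient_of_variation` is a hypothetical helper name.

# In[ ]:


def coefficient_of_variation(estimate, moe, z=1.645):
    """Return the CV (in percent) of an ACS estimate given its 90% MOE."""
    if estimate == 0:
        return float('nan')  # the CV is undefined for a zero estimate
    standard_error = moe / z
    return 100.0 * standard_error / estimate


# made-up example: an estimate of 50 children with an MOE of 40
round(coefficient_of_variation(50, 40), 1)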

# ---

# ### 5. Pull down the geometry

# The Census Bureau API for extracting geometries is separate from the database API.

# We can view all the available geographic data.

# In[23]:


cen.tiger.available()


# Like before, we need to make a connection to the API. In this case there is a `set_mapservice` method that will attach this second connection to the first. Since we are working with ACS data, we will connect to the ACS geometry data.

# In[24]:


api_conn.set_mapservice('tigerWMS_ACS2013')
api_conn


# The ACS produces estimates for many different geographies.

# In[25]:


api_conn.mapservice.layers


# We are working with census tracts.

# In[26]:


api_conn.mapservice.layers[8]


# Similar to setting the geographic constraints when pulling attribute data, you need to select the spatial scale and a bounding geography. However, these are selected entirely differently for the geometry data. The scale is set by passing an integer to the `layer` argument, and the bounding geography is selected by passing an SQL condition to the `where` argument.

# In[27]:


#### select Florida
#geodata = api_conn.mapservice.query(layer=82, where='STATE=12', pkg='geopandas')
#### select all counties in Florida
#geodata = api_conn.mapservice.query(layer=84, where='STATE=12', pkg='geopandas')
#### select all census tracts in Florida
#geodata = api_conn.mapservice.query(layer=8, where='STATE=12', pkg='geopandas')
#### select all tracts in Leon County, Florida
geodata = api_conn.mapservice.query(layer=8, where='STATE=12 and COUNTY=073', pkg='geopandas')
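

# Since the attribute query and the geometry query describe the same place in two different syntaxes, it can help to derive one from the other. This hypothetical helper (not part of Cenpy) builds the `where` string from a filter dictionary like the one used in step 2, assuming the layer's field names are simply the upper-cased keys, as in the examples above.

# In[ ]:


def where_from_filter(bounds):
    """Build a clause such as 'STATE=12 and COUNTY=073' from a filter dict."""
    return ' and '.join('{}={}'.format(key.upper(), value) for key, value in bounds.items())


where_from_filter({'state': '12', 'county': '073'})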


# ---

# ### 6. Merge attributes and geometries

# We now have all the pieces to merge the attribute data to the geometries.

# In[28]:


newdata = pd.merge(data, geodata, left_index=True, right_on='GEOID')
newdata = gpd.GeoDataFrame(newdata)
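

# `pd.merge` performs an inner join by default, which silently drops rows whose keys do not match, so it can be worth comparing the two sets of keys. A minimal sketch with made-up identifiers follows; `compare_keys` is a hypothetical helper, and in practice the inputs would be `data.index` and `geodata.GEOID`.

# In[ ]:


def compare_keys(attribute_ids, geometry_ids):
    """Report identifiers that appear on only one side of the merge."""
    left, right = set(attribute_ids), set(geometry_ids)
    return {'only_in_attributes': sorted(left - right),
            'only_in_geometries': sorted(right - left),
            'matched': len(left & right)}


# made-up tract identifiers, for illustration only
compare_keys(['12073000200', '12073000300'], ['12073000300', '12073000400'])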


# We will plot median income (column `B19326_001E`) for Leon County, Florida.

# In[29]:


# choropleth with five quantile classes; newer GeoPandas releases name the colormap argument `cmap`
newdata.plot(column='B19326_001E', scheme='QUANTILES', k=5, cmap='OrRd')