Written by James A. Bednar and Philipp Rudiger
Created 2016
Last updated: August 3, 2021
We are now awash with data from different sources, but pulling it all together to gain insights can be difficult for many reasons. In this notebook we show how to combine data of very different types to show previously hidden relationships:
Few if any tools can handle all of these data sources alone, but here we'll show how freely available Python packages can easily be combined to explore even large, complex datasets interactively in a web browser. The resulting plots make it simple to explore how the racial distribution of the USA population corresponds to the geographic features of each region and how both of these are reflected in the shape of US Congressional districts. For instance, here's an example of using this notebook to zoom in to Houston, revealing a very precisely gerrymandered Hispanic district:
Here the US population is colored by racial category using the key shown, with more intense colors indicating a higher population density in that pixel, and the geographic background being dimly visible where population density is low. Racially integrated neighborhoods show up as intermediate or locally mixed colors, but most neighborhoods are quite segregated, and in this case the congressional district boundary shown clearly follows the borders of this segregation.
If you run this notebook and zoom in on any urban region of interest, you can click on an area with a concentration of one racial or ethnic group to see for yourself if that district follows geographic features, state boundaries, the racial distribution, or some combination thereof.
Numerous Python packages are required for this type of analysis to work, all coordinated using conda:
Each package is maintained independently and focuses on doing one job really well, but they all combine seamlessly and with very little code to solve complex problems.
import holoviews as hv
from holoviews import opts
import geoviews as gv
import datashader as ds
import dask.dataframe as dd
from cartopy import crs
from holoviews.operation.datashader import datashade
hv.extension('bokeh', width=95)
opts.defaults(
opts.Points(apply_ranges=False),
opts.RGB(width=1200, height=682, xaxis=None, yaxis=None, show_grid=False),
opts.Shape(fill_alpha=0, line_width=1.5, apply_ranges=False, tools=['tap']),
opts.WMTS(alpha=0.5)
)
In this notebook, we'll load data from different sources and show it all overlaid together. First, let's define a color key for racial/ethnic categories:
color_key = {'w':'blue', 'b':'green', 'a':'red', 'h':'orange', 'o':'saddlebrown'}
races = {'w':'White', 'b':'Black', 'a':'Asian', 'h':'Hispanic', 'o':'Other'}
color_points = hv.NdOverlay(
{races[k]: gv.Points([0,0], crs=crs.PlateCarree()).opts(color=v) for k, v in color_key.items()})
Next, we'll load data from the 2010 US Census, with the location and race or ethnicity of every US resident as of that year (300 million data points), and define a plot using datashader to show this data with the given color key:
df = dd.read_parquet('./data/census.snappy.parq')
df = df.persist()
census_points = hv.Points(df, kdims=['easting', 'northing'], vdims=['race'])
Now we can datashade and render these points, coloring them by race:
x_range, y_range = ((-13884029.0, -7453303.5), (2818291.5, 6335972.0)) # Continental USA
shade_defaults = dict(x_range=x_range, y_range=y_range, x_sampling=10, y_sampling=10,
                      width=1200, height=682, color_key=color_key,
                      aggregator=ds.count_cat('race'))
shaded = datashade(census_points, **shade_defaults)
shaded
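To make the role of `ds.count_cat('race')` more concrete, here is a minimal pure-Python sketch of categorical aggregation: every point is assigned to a pixel, and each pixel keeps a separate count per category. This is a simplified illustration of the idea, not datashader's actual implementation; the function name `count_by_category` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def count_by_category(points, x_range, y_range, width, height):
    """Bin (x, y, category) points into a width x height grid of per-category counts."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = defaultdict(Counter)  # (col, row) -> Counter of categories in that pixel
    for x, y, cat in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # skip points outside the requested viewport
        col = int((x - x0) / (x1 - x0) * width)
        row = int((y - y0) / (y1 - y0) * height)
        grid[(col, row)][cat] += 1
    return grid

# Hypothetical toy data: (easting, northing, race code)
toy_points = [
    (0.5, 0.5, 'w'), (0.6, 0.4, 'w'), (0.7, 0.2, 'h'),
    (1.5, 0.5, 'b'), (1.5, 1.5, 'a'),
    (5.0, 5.0, 'o'),  # falls outside the viewport and is ignored
]

pixels = count_by_category(toy_points, (0, 2), (0, 2), width=2, height=2)
for pixel in sorted(pixels):
    print(pixel, dict(pixels[pixel]))
```

A pixel holding several categories is what shows up in the plot as an intermediate or mixed color, while the total count in the pixel drives the color intensity.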
Next, we'll load congressional districts from a publicly available shapefile and project them into Web Mercator format using GeoViews (which in turn calls Cartopy):
shape_path = './data/congressional_districts'
districts = gv.Shape.from_shapefile(shape_path, crs=crs.PlateCarree())
districts = gv.project(districts)
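For readers unfamiliar with Web Mercator, the sketch below shows the standard spherical Web Mercator formulas that relate longitude/latitude in degrees to easting/northing in meters. It is only meant to illustrate the coordinate system; it is not the code path that `gv.project` or Cartopy runs, and the helper names are hypothetical. As a sanity check, it converts the corners of the continental-USA ranges used in this notebook back to degrees.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (WGS84 semi-major axis)

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator easting/northing in meters."""
    easting = EARTH_RADIUS_M * math.radians(lon_deg)
    northing = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return easting, northing

def web_mercator_to_lonlat(easting, northing):
    """Inverse of lonlat_to_web_mercator."""
    lon = math.degrees(easting / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(northing / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat

# Lower-left and upper-right corners of the x_range/y_range defined in this notebook
for easting, northing in [(-13884029.0, 2818291.5), (-7453303.5, 6335972.0)]:
    lon, lat = web_mercator_to_lonlat(easting, northing)
    print(f"easting {easting:.1f}, northing {northing:.1f} -> lon {lon:.2f}, lat {lat:.2f}")
```

Because the census points, the tiles, and the projected districts all share these coordinates, they can be overlaid directly without further conversion.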
Finally, we'll define some image tiles to use as a background, using any publicly available Web Mercator tile set:
tiles = gv.WMTS('https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg')
Each of these data sources can be visualized on its own (just type its name in a separate cell), but they can also easily be combined into a single overlaid plot to see the relationships:
opts.defaults(
opts.Polygons(fill_alpha=0))
shaded = datashade(census_points, **shade_defaults)
tiles * shaded * color_points * districts
You should now be able to interactively explore these three linked datasets, to see how they all relate to each other. In a live notebook, this plot will support a variety of interactive features:
Most of these interactive features are also available in the static HTML copy visible at examples.pyviz.org, with the restriction that because there is no Python process running, the racial/population data will be limited to the resolution at which it was initially rendered, rather than being dynamically re-rendered to fit the current zoom level. Thus in a static copy, the data will look pixelated, whereas in the live server you can zoom all the way down to individual datapoints (people) in each region.
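The difference between the static copy and the live server can be illustrated with a one-dimensional toy example. In this hedged sketch (the `bin_counts` helper and the data are hypothetical, and real rendering is two-dimensional), a static view can only enlarge the pixels it already has, while a live view recomputes the same number of pixels over the zoomed-in extent.

```python
def bin_counts(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width bins over [lo, hi)."""
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / (hi - lo) * n_bins)] += 1
    return counts

# Hypothetical 1D positions; four of them cluster between 0.0 and 1.0
positions = [0.1, 0.2, 0.6, 0.9, 3.5, 6.0, 7.2]

# Initial render: 4 pixels covering the full extent [0, 8)
full_view = bin_counts(positions, 0.0, 8.0, 4)

# Static copy: zooming into [0, 2) can only enlarge the first pixel of full_view
static_zoom = full_view[:1]

# Live server: the same 4 pixels are recomputed over the zoomed extent [0, 2)
live_zoom = bin_counts(positions, 0.0, 2.0, 4)

print("full view:  ", full_view)
print("static zoom:", static_zoom)
print("live zoom:  ", live_zoom)
```

The static zoom shows one large pixel with a single count, whereas the live zoom reveals how the same points are distributed within that region.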