This tutorial will introduce you to some basic methods for processing geospatial data, particularly focusing on Geographic Information Systems (GIS) data. Geospatial data forms a key aspect of data science. In many settings we want to visualize or analyze data with a spatial component: where (geographically) are certain features located, how do certain properties correlate with location, and can we identify spatial patterns in data? Sometimes this also involves substantial analysis or machine learning, but sometimes just visuallizing data spatially tells us a great deal about some question we are interested in.

A classical example is the following map, created by John Snow in 1854: (click for full-size version). This is map of London in the 19th century, as it was experiencing a Cholera outbreak. John Snow took a map of the streets of London, and plotted the number of Cholera death spatially. What emerged is that the deaths were clearly centered around a pump on Broad Street. This visual evidence was enough to persuade the government council to remove the handle to the pump, and provided substantial evidence that Cholera was being spread by contaminated water (the well was later found to have been contaminated by sewage).

Tutorial content

In this tutorial, we will show how to do some basic spatial analysis in Python, specifically using GeoPandas, Shapeley, and GeoPy.

We'll be using data collected from the Pittsburgh GIS data repository from the Pittsburgh city planning department: While there aren't always public repositories for GIS data, a surprisingly large number of cities or counties maintain an up-to-date set of GIS data available on their web pages, and the analysis we do here could be repeated for many other cities as well.

We will cover the following topics in this tutorial:

Installing the libraries

Before getting started, you'll need to install the various libraries that we will use. You can install GeoPandas, Shapely (automatically installed as a dependency of GeoPandas), and GeoPy using pip:

$ pip install --upgrade geopandas geopy

On my machine, GeoPandas throws an exception in its spatial join command unless the Rtree library is also installed. There are instructions for installing Rtree on the web page above, but I was able to use the following command, which installed the relevant libraries:

$ conda install -c ioos rtree

Note that this installs the rtree module from a private repository (called ioos), not an official conda package, so there is a chance it won't work for you, and you'll need to consult the documentation. After you run all the installs, make sure the following commands work for you:

In [1]:
import geopandas as gpd
import shapely
from geopy.geocoders import GoogleV3
import rtree
import numpy as np

Loading data and plotting

Now that we've installed and loaded the libraries, let's load our first geographic data. We're going to load data provided in the "shapefile" format, which stores vector data that is geographically tagged (i.e., the points are expressed in terms of longitude/latitude or some other geographic coordinate system).

Download the file from the Pittsburgh city planning website: Then unzip the file to create a Neighborhood folder with a number of files inside, including a Neighborhood.shp file. This is the Shapefile we want, but in an odd quirk of the Shapefile format, you actually need all the files in this directory in order to properly load the file (GIS systems were developed before people knew how to properly create database files, apparently). So you'll want to copy the entire Neighborhood directory into the same folder as this notebook, and you can then load the data using the following command.

In [2]:
df_neigh = gpd.read_file("Neighborhood/Neighborhood.shp");
0 180.883 268195.0 7843108.0 0.0 1 003 3 0 62 S ... 0.281 42 7.843108e+06 7.842997e+06 11526.863222 11525.904546 7988.379391 040500 113 POLYGON ((1355377.123158768 411531.0846695155,...
1 320.679 75472.0 13904629.0 0.0 1 003 3 1 40 S ... 0.499 42 1.390463e+07 1.390469e+07 20941.386025 20945.562570 3813.620989 040400 25 POLYGON ((1355110.970147774 417655.421174109, ...
2 138.372 282679.0 5999801.5 0.0 2 003 3 2 61 S ... 0.215 42 5.999801e+06 5.998649e+06 18271.426385 18280.484515 8260.915502 040200 21 POLYGON ((1352807.998047695 411926.938864693, ...
3 166.101 284548.0 7202139.0 0.0 2 003 3 3 56 S ... 0.258 42 7.202139e+06 7.203631e+06 15696.759230 15694.771443 8670.428122 030500 19 POLYGON ((1347150.050702184 413333.3376266807,...
4 390.864 1593129.0 16947852.0 329216.0 1 003 1 5 27 S ... 0.608 42 1.694785e+07 1.694875e+07 23903.077852 24019.532672 24019.532865 020300 23 POLYGON ((1349414.006476104 416876.8987810165,...

5 rows × 32 columns

The df_neigh object is of type GeoDataFrame, a subclass of a Pandas DataFrame, so you can use all the same calls you would use with a Pandas DataFrame. If you select a column from the object (other than the special "geometry" data type referenced below), it will simply be a normal Pandas Series. For example, if we wanted to get the total square miles over all neighborhoods, we could do so with the command:

In [3]:

The one special element of a GeoDataFrame is that it has one additional column with the name "geometry". This one column is a GeoSeries type, and contains a series of Shapely geometry objects. We won't worry about using Shapely itself too much except to generate basic geometry types, for the case of the df_neigh object, each entroy of "geometry" contains a Shapely Polygon type, which describes the geometry of the neighborhood as a list of points.

In [4]:
g = df_neigh.loc[0,"geometry"]
print(str(g)[:100] + "...")
POLYGON ((1355377.123158768 411531.0846695155, 1355382.612321019 411522.5715631843, 1355390.25764694...

These points describe the outline of this speicfic neighborhood in Pittsburgh (the name is in the "HOOD" column), and this geometric information lets us plot the DataFrame visually. Specifically, we can use the .plot() command to visualize the geometry of the entire DataFrame. In this case, we can plot the data to view the neighborhoods of Pittsburgh visually:

In [7]:
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline
In [8]:
df_neigh.plot(figsize=(8,8), cmap="jet");

Geometric operations

While the ability to simply plot the data is nice, is nice, one of the powerful aspects of these geometric tools is the ability to perform spatial operations on the data. For example, lets consider an alternative breakdown of the Pittsburgh area by census block groups. These breakdowns are contained in file folder as above. Let's load the shapefile as before, and then plot it to visualize these breakdowns.

In [9]:
df_block_group = gpd.read_file("Block_Group_2010/Block_Group_2010.shp");
df_block_group.plot(figsize=(8,8), cmap="jet");

This looks ok, but unlike the neighborhood shapes, these block groups cover the river areas in Pittsburgh, so they don't look quite as nice as what we had previously. To address this, we're going to compute the intersection between each polygon in the df_block_group data frame and the union of all the neighborhoods (a an object we can get using the .unary_union property of a GeoDataFrame). We can do this using the .intersection() command:

In [10]:
df_block_group["geometry"] = df_block_group.intersection(df_neigh.unary_union)
df_block_group.plot(figsize=(8,8), cmap="jet");

This looks good, but as a minor point, it looks like there are a few holes in this map. What's happening here is that some of the intersections are resulting in GeometryCollection types instead of Polygon or MultiPolygon types (you can check this by looking at the type of each geometry). This is likely a minor bug in the intersection code, and you can fix it by simply only keeping the polygons from the intersections, as follows.

In [11]:
is_collection = lambda x : isinstance(x, shapely.geometry.collection.GeometryCollection)
to_multipolygon = lambda x : shapely.geometry.MultiPolygon([p for p in x if isinstance(p, shapely.geometry.Polygon)])

idx = df_block_group["geometry"].apply(is_collection)
df_block_group.loc[idx,"geometry"] = df_block_group.loc[idx,"geometry"].apply(to_multipolygon)
df_block_group.plot(figsize=(8,8), cmap="jet");

Much nicer. The .intersection() is just one possible geometric operation. We can also create unions between separate geometries (we already effectively did this with the .unary_union property), differences, symmetric differences, etc.

Spatial joins

Similar to geometric operations, but also integrating the more relational components of GIS data is the notation of a "spatial join". Just like traditional join operators merge tables based upon the value of certain columns in these two tables, a spatial join merges together two geospatial tables based upon properties of their geometry entries. In a spatial join we merge together two rows that satisfy some geometric property between the rows. The most obvious case is that we can, for instance, merge together rows where one geometry intersects with the other, or where one geometry is contained within the other.

This idea is best demonstrated through an example. Let's suppose that we would like to build a map of the fair market value of residential homes within Pittsburgh, organized by census block group. To get housing data, we'll download the file ( from the Pittsburgh GIS data, and specifically look at the Parcels shapefiles.

In [12]:
# Load the Parecels shapefile (note: this is a big file, and takes about a minute)
df_parcels = gpd.read_file("Parcels/Parcels.shp")

Now let's perform a spatial join between the parcels shapefile and our (modified) census block groups, to assign each parcel to one of the census block groups.

In [13]:
# Spatial join of df_block_group features, appended to all parcels
# located within that block group (takes about 30 seconds)
df_parcels = gpd.sjoin(df_parcels, df_block_group, how="inner", op="within")