alt text

A Deep Dive into Geospatial Analysis¶

Many of the datasets that data scientists handle have some kind of geospatial component to them, and that information is oftentimes useful-to-critical for understanding the problem at hand. As such, an understanding of spatial data and how to work with it is a valuable skill for any data scientist to have. Even better, Python provides a rich toolset for working in this domain, and recent advances have greatly simplified and consolidated these.

In this tutorial we will take a deep dive into geospatial analysis in Python, using tools like geopandas, shapely, and pysal to analyze a dataset, provided by Kaggle (and originally from Inside AirBnB), of sample AirBnB locations in Boston, Massachusetts.

This tutorial is targeted at folks who know a thing or two about data but haven't used Python's geospatial data tools just yet. As such, it assumes a high level of familiarity with pandas. Some familiarity with scikit-learn, statsmodels, matplotlib, and seaborn is also helpful.

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option("max_columns", None)

Table of Contents¶

Plotting Points
Plotting Geometries
Upconverting DataFrame to GeoDataFrame Objects
Plotting Geometries and Points
Spatial Weights
Spatial Lag
Spatial Clustering
Spatial Regression
Conclusion
Extra Credit

Plotting Points¶

First, let's boot up and examine our data. Since our data comes in a simple CSV file, we load it into a pandas DataFrame.

In [353]:

listings = pd.read_csv("../input/listings.csv")

In [4]:

listings.head()

Out[4]:

	id	listing_url	scrape_id	last_scraped	name	summary	space	description	experiences_offered	neighborhood_overview	notes	transit	access	interaction	house_rules	thumbnail_url	medium_url	picture_url	xl_picture_url	host_id	host_url	host_name	host_since	host_location	host_about	host_response_time	host_response_rate	host_acceptance_rate	host_is_superhost	host_thumbnail_url	host_picture_url	host_neighbourhood	host_listings_count	host_total_listings_count	host_verifications	host_has_profile_pic	host_identity_verified	street	neighbourhood	neighbourhood_cleansed	neighbourhood_group_cleansed	city	state	zipcode	market	smart_location	country_code	country	latitude	longitude	is_location_exact	property_type	room_type	accommodates	bathrooms	bedrooms	beds	bed_type	amenities	square_feet	price	weekly_price	monthly_price	security_deposit	cleaning_fee	guests_included	extra_people	minimum_nights	maximum_nights	calendar_updated	has_availability	availability_30	availability_60	availability_90	availability_365	calendar_last_scraped	number_of_reviews	first_review	last_review	review_scores_rating	review_scores_accuracy	review_scores_cleanliness	review_scores_checkin	review_scores_communication	review_scores_location	review_scores_value	requires_license	license	jurisdiction_names	instant_bookable	cancellation_policy	require_guest_profile_picture	require_guest_phone_verification	calculated_host_listings_count	reviews_per_month
0	12147973	https://www.airbnb.com/rooms/12147973	20160906204935	2016-09-07	Sunny Bungalow in the City	Cozy, sunny, family home. Master bedroom high...	The house has an open and cozy feel at the sam...	Cozy, sunny, family home. Master bedroom high...	none	Roslindale is quiet, convenient and friendly. ...	NaN	The bus stop is 2 blocks away, and frequent. B...	You will have access to 2 bedrooms, a living r...	NaN	Clean up and treat the home the way you'd like...	https://a2.muscache.com/im/pictures/c0842db1-e...	https://a2.muscache.com/im/pictures/c0842db1-e...	https://a2.muscache.com/im/pictures/c0842db1-e...	https://a2.muscache.com/im/pictures/c0842db1-e...	31303940	https://www.airbnb.com/users/show/31303940	Virginia	2015-04-15	Boston, Massachusetts, United States	We are country and city connecting in our deck...	NaN	NaN	NaN	f	https://a2.muscache.com/im/pictures/5936fef0-b...	https://a2.muscache.com/im/pictures/5936fef0-b...	Roslindale	1	1	['email', 'phone', 'facebook', 'reviews']	t	f	Birch Street, Boston, MA 02131, United States	Roslindale	Roslindale	NaN	Boston	MA	02131	Boston	Boston, MA	US	United States	42.282619	-71.133068	t	House	Entire home/apt	4	1.5	2.0	3.0	Real Bed	{TV,"Wireless Internet",Kitchen,"Free Parking ...	NaN	$250.00	NaN	NaN	NaN	$35.00	1	$0.00	2	1125	2 weeks ago	NaN	0	0	0	0	2016-09-06	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	f	NaN	NaN	f	moderate	f	f	1	NaN
1	3075044	https://www.airbnb.com/rooms/3075044	20160906204935	2016-09-07	Charming room in pet friendly apt	Charming and quiet room in a second floor 1910...	Small but cozy and quite room with a full size...	Charming and quiet room in a second floor 1910...	none	The room is in Roslindale, a diverse and prima...	If you don't have a US cell phone, you can tex...	Plenty of safe street parking. Bus stops a few...	Apt has one more bedroom (which I use) and lar...	If I am at home, I am likely working in my hom...	Pet friendly but please confirm with me if the...	https://a1.muscache.com/im/pictures/39327812/d...	https://a1.muscache.com/im/pictures/39327812/d...	https://a1.muscache.com/im/pictures/39327812/d...	https://a1.muscache.com/im/pictures/39327812/d...	2572247	https://www.airbnb.com/users/show/2572247	Andrea	2012-06-07	Boston, Massachusetts, United States	I live in Boston and I like to travel and have...	within an hour	100%	100%	f	https://a2.muscache.com/im/users/2572247/profi...	https://a2.muscache.com/im/users/2572247/profi...	Roslindale	1	1	['email', 'phone', 'facebook', 'linkedin', 'am...	t	t	Pinehurst Street, Boston, MA 02131, United States	Roslindale	Roslindale	NaN	Boston	MA	02131	Boston	Boston, MA	US	United States	42.286241	-71.134374	t	Apartment	Private room	2	1.0	1.0	1.0	Real Bed	{TV,Internet,"Wireless Internet","Air Conditio...	NaN	$65.00	$400.00	NaN	$95.00	$10.00	0	$0.00	2	15	a week ago	NaN	26	54	84	359	2016-09-06	36	2014-06-01	2016-08-13	94.0	10.0	9.0	10.0	10.0	9.0	9.0	f	NaN	NaN	t	moderate	f	f	1	1.30
2	6976	https://www.airbnb.com/rooms/6976	20160906204935	2016-09-07	Mexican Folk Art Haven in Boston	Come stay with a friendly, middle-aged guy in ...	Come stay with a friendly, middle-aged guy in ...	Come stay with a friendly, middle-aged guy in ...	none	The LOCATION: Roslindale is a safe and diverse...	I am in a scenic part of Boston with a couple ...	PUBLIC TRANSPORTATION: From the house, quick p...	I am living in the apartment during your stay,...	ABOUT ME: I'm a laid-back, friendly, unmarried...	I encourage you to use my kitchen, cooking and...	https://a2.muscache.com/im/pictures/6ae8335d-9...	https://a2.muscache.com/im/pictures/6ae8335d-9...	https://a2.muscache.com/im/pictures/6ae8335d-9...	https://a2.muscache.com/im/pictures/6ae8335d-9...	16701	https://www.airbnb.com/users/show/16701	Phil	2009-05-11	Boston, Massachusetts, United States	I am a middle-aged, single male with a wide ra...	within a few hours	100%	88%	t	https://a2.muscache.com/im/users/16701/profile...	https://a2.muscache.com/im/users/16701/profile...	Roslindale	1	1	['email', 'phone', 'reviews', 'jumio']	t	t	Ardale St., Boston, MA 02131, United States	Roslindale	Roslindale	NaN	Boston	MA	02131	Boston	Boston, MA	US	United States	42.292438	-71.135765	t	Apartment	Private room	2	1.0	1.0	1.0	Real Bed	{TV,"Cable TV","Wireless Internet","Air Condit...	NaN	$65.00	$395.00	$1,350.00	NaN	NaN	1	$20.00	3	45	5 days ago	NaN	19	46	61	319	2016-09-06	41	2009-07-19	2016-08-05	98.0	10.0	9.0	10.0	10.0	9.0	10.0	f	NaN	NaN	f	moderate	t	f	1	0.47
3	1436513	https://www.airbnb.com/rooms/1436513	20160906204935	2016-09-07	Spacious Sunny Bedroom Suite in Historic Home	Come experience the comforts of home away from...	Most places you find in Boston are small howev...	Come experience the comforts of home away from...	none	Roslindale is a lovely little neighborhood loc...	Please be mindful of the property as it is old...	There are buses that stop right in front of th...	The basement has a washer dryer and gym area. ...	We do live in the house therefore might be som...	- The bathroom and house are shared so please ...	https://a2.muscache.com/im/pictures/39764190-1...	https://a2.muscache.com/im/pictures/39764190-1...	https://a2.muscache.com/im/pictures/39764190-1...	https://a2.muscache.com/im/pictures/39764190-1...	6031442	https://www.airbnb.com/users/show/6031442	Meghna	2013-04-21	Boston, Massachusetts, United States	My husband and I live on the property. He’s a...	within a few hours	100%	50%	f	https://a2.muscache.com/im/pictures/5d430cde-7...	https://a2.muscache.com/im/pictures/5d430cde-7...	NaN	1	1	['email', 'phone', 'reviews']	t	f	Boston, MA, United States	NaN	Roslindale	NaN	Boston	MA	NaN	Boston	Boston, MA	US	United States	42.281106	-71.121021	f	House	Private room	4	1.0	1.0	2.0	Real Bed	{TV,Internet,"Wireless Internet","Air Conditio...	NaN	$75.00	NaN	NaN	$100.00	$50.00	2	$25.00	1	1125	a week ago	NaN	6	16	26	98	2016-09-06	1	2016-08-28	2016-08-28	100.0	10.0	10.0	10.0	10.0	10.0	10.0	f	NaN	NaN	f	moderate	f	f	1	1.00
4	7651065	https://www.airbnb.com/rooms/7651065	20160906204935	2016-09-07	Come Home to Boston	My comfy, clean and relaxing home is one block...	Clean, attractive, private room, one block fro...	My comfy, clean and relaxing home is one block...	none	I love the proximity to downtown, the neighbor...	I have one roommate who lives on the lower lev...	From Logan Airport and South Station you have...	You will have access to the front and side por...	I love my city and really enjoy sharing it wit...	Please no smoking in the house, porch or on th...	https://a1.muscache.com/im/pictures/97154760/8...	https://a1.muscache.com/im/pictures/97154760/8...	https://a1.muscache.com/im/pictures/97154760/8...	https://a1.muscache.com/im/pictures/97154760/8...	15396970	https://www.airbnb.com/users/show/15396970	Linda	2014-05-11	Boston, Massachusetts, United States	I work full time for a public school district....	within an hour	100%	100%	t	https://a0.muscache.com/im/users/15396970/prof...	https://a0.muscache.com/im/users/15396970/prof...	Roslindale	1	1	['email', 'phone', 'reviews', 'kba']	t	t	Durnell Avenue, Boston, MA 02131, United States	Roslindale	Roslindale	NaN	Boston	MA	02131	Boston	Boston, MA	US	United States	42.284512	-71.136258	t	House	Private room	2	1.5	1.0	2.0	Real Bed	{Internet,"Wireless Internet","Air Conditionin...	NaN	$79.00	NaN	NaN	NaN	$15.00	1	$0.00	2	31	2 weeks ago	NaN	13	34	59	334	2016-09-06	29	2015-08-18	2016-09-01	99.0	10.0	10.0	10.0	10.0	9.0	10.0	f	NaN	NaN	f	flexible	f	f	1	2.25

In [6]:

len(listings)

Out[6]:

At the moment our listings have several geospatial variables, the most important of which are longitude and latitude, which give the exact coordinates of the BnB in question. This means that it's easy for us to, say, plot every BnB location on a map:

In [525]:

plt.scatter(listings['longitude'], listings['latitude'])

Out[525]:

<matplotlib.collections.PathCollection at 0x34240470>

Chances are you've already done this before, and it's a perfectly adequate way to get started working with locations. In this plot we see...not much, really. If you're very intimately familiar with the layout of the city of Boston, you will probably be able to make sense of some of these clusters which are, to me, not being from the city, totally mysterious.

In other words, this plot is missing something important: geospatial context.

Additionally, this display is unprojected—it's displayed in terms of raw coordinates. The amount of distance contained in a coordinate degree varies greatly depending on where you are, so this naive plot potentially pretty badly distorts distances.

We'll come back to the projection issue later; for now, there's an easy fix for both these problems.

Enter mplleaflet. mplleaflet is a tool that automatically takes a coordinate matplotlib plot of any kind and places it on top of a leaflet slippy map. The best part is that it's just one additional line of code. Just throw mplleaflet.display() after generating your plot to drop it inline in your Jupyter notebook:

In [526]:

import mplleaflet

sample = listings.sample(1000)
plt.scatter(sample['longitude'], sample['latitude'])

mplleaflet.display()

Out[526]:

Note, however, that rather than plotting every single point on our map we've taken a sample of 1000 of them and plotted those. This is known as undersampling.

This is oftentime necessary, as here, for performance reasons. Because mplleaflet embeds point objects directly on an SVG canvas, it's as much as two orders of magnitude slower and more power-hungry than generating a really basic matplotlib blob.

There are few ways to get around this limitation. The bokeh library is classically mentioned as an option; the new, language-of-graphics-oriented (but still very incomplete) geoviews library does an even better job. And if you really, really need to see those points, you can use datashader, another not-even-a-year-old visualization library that comfortably works with millions of points (further reading).

However, in our case we'll do...none of these things. We're not really interested in individaul BnBs, per se; we're interested in grouping them and finding similarities and trends amongst them, not what exact street they're on. To do this, we'll wrap these points in polygons of some kind and try to look not at homes, but at neighborhoods.

Plotting Geometries¶

The world around us is split up into lots of different kinds of geometries. The big ones are continents; from there we go down to countries, states, districts, counties, and so on.

On the city level, where we are in this case, there are usually a bunch of federal options (PUMAs or electoral districts for example) as well as a variety of usually more targetted geographical aggregations provided by the city government. Usually no matter where you go, however, the lowest-level geometries in the United States are census tracts. Census tracts are created every ten years for the census, and are built to try to contain either 4000 or 0 people (in the case of parks, beaches, etc.).

Boston GIS released data on the census tracts in their city not long after the 2010 census went out, and that's what we'll use.

However, the data comes in the form of a complex and convoluted GIS-standard data format known as a shapefile (further reading).

pandas has no facilities for reading shapefiles.

When Wes McKinney was first introducing pandas to a wider audience, he was asked more than once why he was doing it when everyone was getting along so well passing raw numpy arrays back and forth all the time (remember, hindsight is 20/20). Reading shapefiles into Python used to be a similarly terrible experience, until geopandas came along and changed everything for the better.

Now you don't need to know anything at all about shapefiles to work with them, you just do this:

In [43]:

import geopandas as gpd

In [8]:

boston = gpd.read_file("../input/Census2010_Tracts.shp")

And you're done!

In [12]:

boston.head()

Out[12]:

	ALAND10	COUNTYFP10	FUNCSTAT10	GEOID10	INTPTLAT10	INTPTLON10	MTFCC10	NAME10	NAMELSAD10	STATEFP10	Shape_Area	Shape_Leng	TRACTCE10	geometry
0	363702.0	025	S	25025010405	+42.3398654	-071.0896052	G5020	104.05	Census Tract 104.05	25	3.914568e+06	14629.550361	010405	POLYGON ((-71.090087 42.34665699999989, -71.09...
1	136829.0	025	S	25025010404	+42.3419667	-071.0886375	G5020	104.04	Census Tract 104.04	25	1.472714e+06	5277.643216	010404	POLYGON ((-71.09065999999994 42.3397659999999,...
2	127905.0	025	S	25025010801	+42.3541193	-071.0770216	G5020	108.01	Census Tract 108.01	25	1.376667e+06	6166.497167	010801	POLYGON ((-71.08159499999989 42.35370399999986...
3	299981.0	025	S	25025010702	+42.3518354	-071.0755159	G5020	107.02	Census Tract 107.02	25	3.228780e+06	7818.852369	010702	POLYGON ((-71.070655 42.35185399999991, -71.07...
4	254706.0	025	S	25025010204	+42.3462887	-071.1033879	G5020	102.04	Census Tract 102.04	25	2.741497e+06	7621.654206	010204	POLYGON ((-71.10682499999994 42.34875299999985...

geopandas extends pandas DataFrame and Series expressions into GeoDataFrame and GeoSeries ones.

The main thing that these add is a geometry column containing all of our shapes:

In [9]:

boston['geometry'].head()

Out[9]:

0    POLYGON ((766978.240309909 2951616.923866287, ...
1    POLYGON ((766835.3942200691 2949104.969886944,...
2    POLYGON ((769261.2628526837 2954196.132479325,...
3    POLYGON ((772221.7065478563 2953536.652079985,...
4    POLYGON ((762449.9770828187 2952359.446074486,...
Name: geometry, dtype: object

However, these polygons seem more than a little bit weird. The coordinates don't make any sense!

It turns out that the shapefile we're working with is encoded in some kind of alternative coordinate reference system (CRS), one with the origin point set to (200000, 750000). We can see this for ourselves by asking our GeoDataFrame its crs property:

In [10]:

boston.crs

Out[10]:

{'datum': 'NAD83',
 'lat_0': 41,
 'lat_1': 41.71666666666667,
 'lat_2': 42.68333333333333,
 'lon_0': -71.5,
 'no_defs': True,
 'proj': 'lcc',
 'units': 'us-ft',
 'x_0': 199999.9999999999,
 'y_0': 750000}

The master directory for all well-known coordinate systems is the EPSG. The simple latitude-and-longitude reference system is epsg:4326, and geopandas makes it really, really easy to "fix" our coordinates by just calling the to_crs method:

In [11]:

boston = boston.to_crs({'init': 'epsg:4326'})

In [527]:

boston['geometry'].head()

Out[527]:

0    POLYGON ((-71.090087 42.34665699999989, -71.09...
1    POLYGON ((-71.09065999999994 42.3397659999999,...
2    POLYGON ((-71.08159499999989 42.35370399999986...
3    POLYGON ((-71.070655 42.35185399999991, -71.07...
4    POLYGON ((-71.10682499999994 42.34875299999985...
Name: geometry, dtype: object

Much better.

geopandas objects have a plot method which plops your geometry into a matplotlib plot right away:

In [528]:

boston.plot()

Out[528]:

<matplotlib.axes._subplots.AxesSubplot at 0x263d6c18>

Again, to get geographic context, let's throw this result at mplleaflet!

In [13]:

import mplleaflet

f = plt.figure(figsize=(15, 8))
ax = f.gca()
boston.plot(ax=ax)
mplleaflet.display(fig=f)

Out[13]:

There it is—all the census tracts of the city of Boston.

Notice that some of these tracts have really "weird" shapes. This is because these are census tracts, remember, which try to isolate parts of the city which have people in them from those (like airports, beaches, islands, or the waterway) which do not.

Upconverting `DataFrame` to `GeoDataFrame` Objects¶

Under the hood, geopandas represents geometries using the shapely library. Just like how when using pandas you still occassionally have to reach across to grab something from numpy, when working with geometries it's important to understand what shapely nets you.

In [14]:

import shapely

Luckily shapely has some of the best documentation of any library I've ever used, and it's really simple to use.

Recall that our BnB listings was a DataFrame containing two geospatial fields, latitude and longitude. As a quick demonstration of shapely, let's upconvert this representation to true geometries. To do this we wrap our DataFrame in a GeoDataFrame, mapping our geometry column to a shapely Point for each coordinate pair.

In [15]:

listings = gpd.GeoDataFrame(listings, geometry=listings.apply(
        lambda srs: shapely.geometry.Point(srs['longitude'], srs['latitude']), axis='columns'
    ))

Plotting Geometries and Points¶

We've plotted our points and our polygons seperately, now how do we bring them together?

One obvious answer is to use a seaborn kdeplot, which many of you should be familiar with.

In [88]:

import seaborn as sns

C:\Users\Alex\Anaconda3\envs\geoplot\lib\site-packages\IPython\html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)

In [120]:

f = plt.figure(figsize=(15, 8))
ax = f.gca()

boston.plot(ax=ax, alpha=0.1, linewidth=0.25, color='white')
sns.kdeplot(data=listings.apply(lambda srs: pd.Series({'x': srs.geometry.x, 'y': srs.geometry.y}), axis='columns'), ax=ax,
            alpha=1)
ax.set_axis_off()

This does give us a good sense of where our AirBNBs are concentrated. Unfortunately, however, KDEs are almost always too "blobish"—they're not granular enough to show the details inherent in geographic distributions.

With a little bit more work we can get to the closest thing to a gold standard for such things in cartography, a choropleth map.

First let's create a column in our polygonal dataset, BNBs, containing a count of the number of AirBnB locations in the area.

In [16]:

import numpy as np

def assign_census_tract(bnb):
    bools = [geom.contains(bnb['geometry']) for geom in boston['geometry']]
    if True in bools:
        return boston.iloc[bools.index(True)]['NAMELSAD10']
    else:
        return np.nan

In [17]:

listings['census_tract'] = listings.apply(assign_census_tract, axis='columns')

In [18]:

listings['census_tract'].value_counts().head()

Out[18]:

Census Tract 701.01    113
Census Tract 102.03    111
Census Tract 8.02       95
Census Tract 201.01     86
Census Tract 606        84
Name: census_tract, dtype: int64

In [19]:

boston['BNBs'] = boston['NAMELSAD10'].map(listings['census_tract'].value_counts())

And now we use that data to throw up a map.

In [154]:

f = plt.figure(figsize=(15, 8))
ax = f.gca()
kw = dict(column='BNBs', k=6, cmap='YlGn', alpha=1, legend=True, edgecolor='gray', linewidth=0.5)
boston.plot(scheme='QUANTILES', ax=ax, **kw)
ax.set_axis_off()

C:\Users\Alex\Anaconda3\envs\geoplot\lib\site-packages\numpy\lib\function_base.py:3569: RuntimeWarning: Invalid value encountered in median
  RuntimeWarning)

This type of map is, as was mentioned earlier, known as a choropleth. It's a ubiquitious classic of the genre. The darker the shading, the more AirBNBs in that census tract.

Aside from a variety of visual parameters, we specified three important ones. column=BNBs told geopandas which columns to plot data from. scheme=QUANTILES is which "bucketing" system you want to plot with: in this case we specified quantiles, which splits the data into bins equal in number, albiet not in size. Finally, k tells geopandas how many bins we want.

The trouble with this map, however, is that it doesn't take into account the size of the census tracts. If our census tract is bigger than average, it'll naturally contain more homes, and, all else being equal, more AirBnBs.

This is a classical mapmaking fallacy. Here's another example of this error, straight from the responsible article on Wikipedia:

alt text

Notice how the plot on the left is completely scrambled and doesn't actually tell us anything, while the plot on the right is smooth, structured, and informative.

In our case we'll fix it by figuring out AirBnB density per square kilometer, not just the raw number, and plot that.

Remember, however, that areas in the latitude-longitude coordinate system are not equal. To get an accurate measurement of area we need to first reproject to what's called an "equal-area" projection (epsg:3395 will do nicely), and then take the areas of those polygons instead. Note also that shapely returns areas in square meters; we divide by a constant to get square kilometers. All this is done by the workhourse one-liner below.

In [182]:

boston['BNBDensity'] = (boston['BNBs'] / boston['geometry']\
                            .to_crs({'init': 'epsg:3395''})\
                            .map(lambda p: p.area / 10**6))\
                            .fillna(0)

In [184]:

f = plt.figure(figsize=(15, 8))
ax = f.gca()
kw = dict(column='BNBDensity', k=6, cmap='YlGn', alpha=1, legend=True, edgecolor='gray', linewidth=0.5)
boston.plot(scheme='QUANTILES', ax=ax, **kw)
ax.set_axis_off()

This more contiguous plot is the correct one!

Here we see that in the inner city AirBnB density reaches from 60 to almost 300 AirBnB locations per square kilometer. Since there are approximately 150 city blocks per kilometer squared, a back-of-the-envelope calculation says that this makes for anywhere from an AirBnB every three blocks to two AirBnBs per block.

That's a lot of market penetration.

Spatial Weights¶

We will now introduce the next major tool in the geographic toolbox, pysal. pysal is a by-now venerable library designed for performing spatial analytics, and it'll be our focus for the remainder of this tutorial.

In [44]:

import pysal as ps

Spatial data analytics requires understanding spaces, and understanding spaces requires tying them together somehow. Thus the concept of spatial weights.

Spatial weights tie together areas which are neighbors with one another in some sense. There are many ways of defining this concept of neighborhood-ness or closeness. We could, for example, pick all of the other shapes which are within a certain distance of our chosen shape. Or we could choose all shapes which directly border ours, or at least touch ours. The former of these is known as rook continuity and the latter as queen continuity, after the way those pieces move on a checkerboard, and they're two of the most common methods for defining spatial weights.

If you're interested in more details, you definitely need to check out the concept's entry in the pysal documentation.

For ease of use, we'll pick queen continuity. This means that we will consider our census tracts to be neighbors so long as they touch at least at an edge. We can read such facts directly out of the shapefile using queen_from_shapefile:

In [45]:

qW = ps.queen_from_shapefile("../input/Census2010_Tracts.shp")

This tells us that, for instance, our fourth census tract has four "touching" neighbors: tracts 6, 10, 50, and 80:

In [47]:

qW[4]

Out[47]:

{6: 1.0, 10: 1.0, 50: 1.0, 80: 1.0}

To get a better sense of what we've just created, let's plot our neighborhood vectors on a map:

In [70]:

f = plt.figure(figsize=(15, 8))
ax = f.gca()

for i, (k, neighbors) in enumerate(qW.neighbors.items()):
    origin = boston.geometry.iloc[i].centroid
    for nabe_i in neighbors:
        nabe_centroid = boston.geometry.iloc[nabe_i].centroid
        plt.plot([origin.x, nabe_centroid.x], [origin.y, nabe_centroid.y], '-')

boston.plot(ax=ax, linewidth=0.5, facecolor='white')

ax.set_axis_off()

You can see it better when you zoom in a bit:

In [76]:

f = plt.figure(figsize=(15, 8))
ax = f.gca()

for i, (k, neighbors) in enumerate(qW.neighbors.items()):
    origin = boston.geometry.iloc[i].centroid
    for nabe_i in neighbors:
        nabe_centroid = boston.geometry.iloc[nabe_i].centroid
        plt.plot([origin.x, nabe_centroid.x], [origin.y, nabe_centroid.y], '-')

boston.plot(ax=ax, linewidth=1, facecolor='white')

ax.set_xlim([-71.2, -71.1])
ax.set_ylim([42.325, 42.375])

Out[76]:

(42.325, 42.375)

...including the weirdness around the elongated census tract for that strip of beach.

Spatial Lag¶

Spacial weights are necessary in order to get a sense of the spatial lag inherent in our data. Spatial lag is a geographic version of autocorrelation, a perhaps familar term from time-series analysis: it's a measure of how strongly the data in one polygon is predicted by its neighbors.

If our data is totally randomly distributed on the map—if, in effect, it has no spatial component—then we expect that the BnB densities in our neighboring census tracts will be very poor predictors of the BnB density in our current tract, because we'll just be predicting noise with noise. If, on the other hand, the densities we predict are close to the real value, then we can predict density based on geography, and our data does indeed have a strong geographic component.

Of course, examining our plots from earlier, this may seem like a moot point—the choropleths make it pretty obvious that there's geography involved. But what we'd like to do now is assign a probability to that geography not being just random chance. Examining spatial lag allows us, essentially, to understand the statistical significance of our result.

Let's generate lags, split them into five quantiles, and see where they end up.

In [199]:

bnb_spatial_lags = ps.lag_spatial(qW, boston['BNBDensity'])
spatial_lag_classes = ps.Quantiles(bnb_spatial_lags, k=5)

In [188]:

spatial_lag_classes

Out[188]:

                 Quantiles                  
 
 Lower              Upper              Count
============================================
           x[i] <=   14.018               19
  14.018 < x[i] <=   24.331               18
  24.331 < x[i] <=   43.946               18
  43.946 < x[i] <=   59.383               18
  59.383 < x[i] <=   85.885               18
  85.885 < x[i] <=  136.440               18
 136.440 < x[i] <=  185.530               17
 185.530 < x[i] <=  280.634               19
 280.634 < x[i] <=  613.336               18
 613.336 < x[i] <= 1128.536               18

In [201]:

spatial_lag_classes.yb

Out[201]:

array([4, 4, 4, 4, 3, 4, 4, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 3, 3, 3, 3, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 3,
       3, 2, 4, 4, 4, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 3, 3, 2, 2,
       2, 2, 2, 2, 2, 1, 1, 1, 1, 0, 3, 4, 1, 0, 2, 0, 0, 1, 1, 1, 2, 2, 3,
       3, 2, 1, 0, 4, 3, 1, 3, 2, 2, 3, 4, 4, 3, 1, 3, 4, 0, 0, 0, 2, 0, 0,
       1, 1, 4, 4, 4, 4, 4, 3, 4, 4, 2, 3, 2, 3, 2, 0, 1, 0, 3, 2, 2, 3, 1,
       3, 2, 2, 3, 2, 2, 2, 4, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 1, 2,
       3, 2, 3, 1, 4, 2, 2, 2, 2, 1, 1, 3, 4, 4, 4, 4, 4, 3, 4, 2])

In [200]:

f = plt.figure(figsize=(15, 8))
ax = f.gca()

kw = dict(column='spatial_class', k=5, cmap='YlGn', alpha=1, legend=True, edgecolor='gray', linewidth=0.5, 
          categorical=True)
boston.assign(spatial_class=spatial_lag_classes.yb).plot(ax=ax, **kw)
ax.set_axis_off()

This geographic pattern is distinctly not random, strong evidence again that our data is geographically dependent.

How dependent? The tool of choice for this is the a Moran scatterplot. If geography matters our data should show a strong liner structure of some kind: high lag values correspond with high actual values, and low lags with low ones.

A quick plot shows that this is, indeed, the case:

In [218]:

f, ax = plt.subplots(1, figsize=(9, 9))
plt.plot(boston['BNBDensity'], bnb_spatial_lags, '.', color='firebrick')

# Calculate and plot a line of best fit.
b,a = np.polyfit(boston['BNBDensity'], bnb_spatial_lags, 1)
plt.plot(boston['BNBDensity'], a + b*boston['BNBDensity'], 'r')

plt.title('Moran Scatterplot')
plt.ylabel('Boston BNB Density Spatial Lag')
plt.xlabel('Boston BNB Density Actual')
plt.show()

We can get a p-value:

In [214]:

moran = ps.Moran(boston['BNBDensity'].values, qW)
moran.I, moran.p_sim

Out[214]:

(0.52791249910953175, 0.001)

Our p-value of 0.001 tells us that there is a less than 0.1% chance that our data distribution is this heavily skewed by randomness alone.

Spatial Clustering¶

We've learned how to display geography and how to measure its importance for a dataset. Let's now move on to learning how to apply it.

Suppose that we're interested in segmenting the Boston AirBnB markets based on the types of properties available. This is, after all, a part of our dataset:

In [79]:

listings['property_type'].value_counts()

Out[79]:

Apartment          2612
House               562
Condominium         231
Townhouse            54
Bed & Breakfast      41
Loft                 39
Other                17
Boat                 12
Villa                 6
Entire Floor          4
Dorm                  2
Guesthouse            1
Camper/RV             1
Name: property_type, dtype: int64

Leaving aside the "exotic" options, we've really got two major types of BnBs: apartments and houses.

In [223]:

apartments = listings.query('property_type == "Apartment" or property_type == "Condominium"').groupby('census_tract').count()['id']
houses = listings.query('property_type == "House" or property_type == "Townhouse"').groupby('census_tract').count()['id']

Let's create numerical variables counting how many of each we have per census tract.

In [259]:

boston['BNBDensity_Houses'] = boston['NAMELSAD10'].map(houses).fillna(0)
boston['BNBDensity_Apartments'] = boston['NAMELSAD10'].map(apartments).fillna(0)

We have three predictor variables that we'd like to use: the number of apartments, the numbers of houses, and their density combined.

In [260]:

bnb_market = boston[['BNBDensity', 'BNBDensity_Houses', 'BNBDensity_Apartments']]

If you're familiar with scikit-learn this is, at this point, a classic clustering problem, one we can easily solve using k-means clustering, for example:

In [279]:

import sklearn.cluster
import sklearn.preprocessing

clf = sklearn.cluster.KMeans(n_clusters=3)
X = sklearn.preprocessing.scale(bnb_market.values)
classes = clf.fit(X)

But remember that we're dealing with geometries, not just points in space! Our clustering algorithm doesn't know about this; it just naively classifies even very distant tracts into bins based on characteristic similarity alone. We can see this when we plot the result:

In [280]:

f = plt.figure(figsize=(15, 8))
ax = f.gca()

kw = dict(column='cluster', k=3, cmap='YlGn', alpha=1, legend=True, edgecolor='gray', linewidth=0.5, categorical=True)
boston.assign(cluster=classes.labels_).plot(ax=ax, **kw)
ax.set_axis_off()

Our classifier does seem to have learned, in effect, the difference between low-density outlying tracts, medium-density tracts, and high-density inner city tracts:

In [281]:

bnb_market.assign(cluster=classes.labels_).groupby('cluster').mean()

Out[281]:

	BNBDensity	BNBDensity_Houses	BNBDensity_Apartments
cluster
0	12.188742	2.062992	6.740157
1	146.465581	1.724138	46.862069
2	28.905771	11.840000	23.640000

Indeed, this classifier does exactly what we set out to do. Examine the following plot:

In [282]:

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(bnb_market['BNBDensity'], bnb_market['BNBDensity_Apartments'],
           zs=bnb_market['BNBDensity_Houses'], c=classes.labels_, cmap='YlGn')

Out[282]:

<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x28f554a8>

The pale census tracts are those which have very little AirBnB activity at all. They tend to be furthest from the city center. The second class of tracts, the dark green ones, have a significant BnB density, but consist primarily of houses, not apartments; these correspond with middle-of-the-pack residential areas. Finally, the densest areas, tracts containing plenty of apartments for rent but no homes and located in the heart of the city, are pale green.

This is perfect! Right?

Well, it is, except that the areas we've binned are discontinuous. Because our classifier disregards geography it doesn't put in any effort keeping our class members next one another, and so we don't arrive at any understanding of neighborhood topography.

pysal contains classifiers that allow us to do just that. Cheif amongst them is the Maxp clustering algorithm, which chooses clusters which match each other characteristically and are also all next to one another (according to whatever definition of neighbors you choose—we'll keep going with queen contiguity).

Here's what we get when we run it. Notice the place of our spatial weights in the input, as well as a minimum number of observations per class—150.

In [328]:

clf = ps.region.Maxp(qW, bnb_market.values, 150, boston['BNBs'].values[:,None])

In [331]:

# n = 200
f = plt.figure(figsize=(15, 8))
ax = f.gca()

kw = dict(column='cluster', k=10, cmap='YlGn', alpha=1, legend=True, edgecolor='gray', linewidth=0.5, categorical=True)
boston.assign(cluster=clf.area2region.values()).plot(ax=ax, **kw)
ax.set_axis_off()

Because of the additional constraint the algorithm imposes, it tends to result in significantly more clusters than comparable non-spatial clustering algorithms. But, as you can see here, the algorithm did arrive at essentially the same conclusion as our kNN did: apartment-dense inner cities, house-dense middle burbs, and outlying areas. It just takes ten contiguous districts to describe what three discontiguous ones.

Here again are our mean characteristics:

In [329]:

bnb_market.assign(cluster=clf.area2region.values()).groupby('cluster').mean()

Out[329]:

	BNBDensity	BNBDensity_Houses	BNBDensity_Apartments
cluster
0	78.451566	1.764706	36.647059
1	16.821212	5.833333	11.166667
2	7.442751	2.882353	4.423529
3	21.453374	2.428571	10.142857
4	53.346294	3.375000	15.250000
5	29.089423	3.333333	29.333333
6	39.003481	10.000000	24.777778
7	83.277575	10.500000	39.750000
8	205.586501	1.285714	43.857143
9	166.723335	0.666667	49.500000

Spatial Regression¶

Suppose that we're interested in modeling the price of a BnB. Let's use the following basic features to do it:

In [1]:

explanatory_variables = ['bathrooms', 'bedrooms', 'beds', 'guests_included']

There's a lot of boilerplate here (note the need to merge together the price and cleaning fee into an actual_price, BnB hosts like to be sneaky and sock away as much as 50$ or more into that "fee"!). But if you've run a basic linear regression before, this should be pretty rote. We're going to use statsmodels to do it.

In [461]:

price_variables = ['price', 'cleaning_fee']
geom = ['latitude', 'longitude']  # we'll need this later
valid_listings = listings[explanatory_variables + price_variables + geom].dropna()
valid_listings['actual_price'] = valid_listings['price'].map(convert_price) +\
                                 valid_listings['cleaning_fee'].map(convert_price)
valid_listings = valid_listings.dropna()

def convert_price(price):
    try:
        return float(price.replace("$", "").replace(".", "")[:-2])
    except (ValueError, AttributeError):
        return np.nan

import statsmodels.api as sm
clf = sm.OLS(valid_listings['actual_price'], valid_listings[explanatory_variables])
basic_ols_res = clf.fit()

In [462]:

basic_ols_res.summary()

Out[462]:

OLS Regression Results
Dep. Variable:	actual_price	R-squared:	0.821
Model:	OLS	Adj. R-squared:	0.820
Method:	Least Squares	F-statistic:	2808.
Date:	Sun, 20 Nov 2016	Prob (F-statistic):	0.00
Time:	19:55:02	Log-Likelihood:	-15293.
No. Observations:	2460	AIC:	3.059e+04
Df Residuals:	2456	BIC:	3.062e+04
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
bathrooms	86.6214	4.185	20.699	0.000	78.415 94.828
bedrooms	38.2459	4.722	8.099	0.000	28.986 47.505
beds	36.4452	3.642	10.006	0.000	29.303 43.588
guests_included	12.0422	2.316	5.199	0.000	7.500 16.585

Omnibus:	172.427	Durbin-Watson:	1.296
Prob(Omnibus):	0.000	Jarque-Bera (JB):	241.583
Skew:	0.592	Prob(JB):	3.48e-53
Kurtosis:	3.977	Cond. No.	7.49

Now how do we take this analysis and make it spatial?

A common theme found in the field of spatial econometrics is the idea of mutual price-setting. That is, when someone sets the price of a spatial good, they take into account the price that good set by their neighbors, and vice versa.

If price-setting has a spatial element of this kind (a beating-the-Joneses effect of sorts), then we should be able to improve the results of our model by mixing in observations about nearby BnBs.

To do that we'll first need to grab ahold of what we mean by "neighbors". Since we're dealing with points, not shapes, contiguities don't apply here anymore; we in our usual k-nearest neighbor space. It's possible to grab our kNN from scipy, but it's much faster coding-wise to take a pysal built-in instead.

In [513]:

w = ps.knnW_from_array(valid_listings[['longitude', 'latitude']].dropna().values, k=2)
# w.transform = 'R'
w

Out[513]:

<pysal.weights.Distance.KNN at 0x29652710>

Then to run our spatial regression we simply pass in the same parameters as before, but now with the inclusion of our weights (w). name_x and name_y are optional parameters which simply add a little bit more information to the summary display.

In [514]:

geospatial_ols = ps.spreg.GM_Lag(valid_listings['actual_price'].values[:, None],
                                 valid_listings[explanatory_variables].values,
                                 w=w,
                                 name_x=explanatory_variables,
                                 name_y='actual_price')

In [515]:

print(geospatial_ols.summary)

REGRESSION
----------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :actual_price                Number of Observations:        2460
Mean dependent var  :    246.6793                Number of Variables   :           6
S.D. dependent var  :    145.2765                Degrees of Freedom    :        2454
Pseudo R-squared    :      0.2996
Spatial Pseudo R-squared:  0.3265

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT      74.5257846      11.4683038       6.4984139       0.0000000
           bathrooms      48.5721848       6.0617495       8.0128988       0.0000000
            bedrooms      39.4847512       4.7342175       8.3402908       0.0000000
                beds      35.5915425       3.6648137       9.7116921       0.0000000
     guests_included       5.9026820       2.4268024       2.4322878       0.0150038
      W_actual_price      -0.0212334       0.0196799      -1.0789394       0.2806148
------------------------------------------------------------------------------------
Instrumented: W_actual_price
Instruments: W_bathrooms, W_bedrooms, W_beds, W_guests_included
================================ END OF REPORT =====================================

Unfortunately we will end our crash course on a sad note. it doesn't look like neighbor price fixing is a significant variable at all, as our coefficient is on the level of statistical noise.

Furthermore, it turns out that our spatial algorithm is significantly worse; the mean squared error of our spatial result, a common metric for OLS regression, is much worse than for our no-frills one:

In [524]:

geospatial_ols_mse = ((valid_listings['actual_price'] - geospatial_ols.predy.flatten()) ** 2).mean()
print("OLS".ljust(5), ":", basic_ols_res.mse_total)
print("OLS+W".ljust(5), ":", geospatial_ols_mse)

OLS   : 81947.352439
OLS+W : 14788.3873299

So in this case we've shown pretty conclusively just the opposite: that the price of a BnB has little to do with the prices set by its neighbors.

Conclusion¶

That concludes our crash course on geospatial data science. There's lots of useful components in the Python toolbox that we didn't cover: working with images using rasterio, for example. But I hope that this quick crash course has at least made you aware of some of the many interesting geographic things you can do with the tools that Python has! Keep them in mind—they're sure to come in handy later.

If you want to learn more about Python geospatial tools, here's a only loosely-ordered list of resources to look at:

Extra Credit¶

If you're up for a challenge, the dataset also includes information on the amenities present at each location:

In [373]:

amenities = listings['amenities'].map(lambda d: [amenity.replace('"', "")\
                                                        .replace("{", "")\
                                                        .replace("}", "") for amenity in d.split(",")])
listings['amenities'] = amenities
possible_amenities = set([item for sublist in amenities for item in sublist])
possible_amenities

Out[373]:

{'',
 '24-Hour Check-in',
 'Air Conditioning',
 'Breakfast',
 'Buzzer/Wireless Intercom',
 'Cable TV',
 'Carbon Monoxide Detector',
 'Cat(s)',
 'Dog(s)',
 'Doorman',
 'Dryer',
 'Elevator in Building',
 'Essentials',
 'Family/Kid Friendly',
 'Fire Extinguisher',
 'First Aid Kit',
 'Free Parking on Premises',
 'Free Parking on Street',
 'Gym',
 'Hair Dryer',
 'Hangers',
 'Heating',
 'Hot Tub',
 'Indoor Fireplace',
 'Internet',
 'Iron',
 'Kitchen',
 'Laptop Friendly Workspace',
 'Lock on Bedroom Door',
 'Other pet(s)',
 'Paid Parking Off Premises',
 'Pets Allowed',
 'Pets live on this property',
 'Pool',
 'Safety Card',
 'Shampoo',
 'Smoke Detector',
 'Smoking Allowed',
 'Suitable for Events',
 'TV',
 'Washer',
 'Washer / Dryer',
 'Wheelchair Accessible',
 'Wireless Internet',
 'translation missing: en.hosting_amenity_49',
 'translation missing: en.hosting_amenity_50'}

Analysis by the pysal folks showed that, at least in Austin, the presence of a pool amenity somewhere nearby had a small but significant effect on price.

Does this result extend to Boston? And do any of the other amenities have an effect like this as well?