Many of the datasets that data scientists handle have some kind of geospatial component to them, and that information is oftentimes useful-to-critical for understanding the problem at hand. As such, an understanding of spatial data and how to work with it is a valuable skill for any data scientist to have. Even better, Python provides a rich toolset for working in this domain, and recent advances have greatly simplified and consolidated these.
In this tutorial we will take a deep dive into geospatial analysis in Python, using tools like geopandas
, shapely
, and pysal
to analyze a dataset, provided by Kaggle (and originally from Inside AirBnB), of sample AirBnB locations in Boston, Massachusetts.
This tutorial is targeted at folks who know a thing or two about data but haven't used Python's geospatial data tools just yet. As such, it assumes a high level of familiarity with pandas
. Some familiarity with scikit-learn
, statsmodels
, matplotlib
, and seaborn
is also helpful.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option("max_columns", None)
First, let's boot up and examine our data. Since our data comes in a simple CSV file, we load it into a pandas
DataFrame
.
listings = pd.read_csv("../input/listings.csv")
listings.head()
id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | notes | transit | access | interaction | house_rules | thumbnail_url | medium_url | picture_url | xl_picture_url | host_id | host_url | host_name | host_since | host_location | host_about | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_thumbnail_url | host_picture_url | host_neighbourhood | host_listings_count | host_total_listings_count | host_verifications | host_has_profile_pic | host_identity_verified | street | neighbourhood | neighbourhood_cleansed | neighbourhood_group_cleansed | city | state | zipcode | market | smart_location | country_code | country | latitude | longitude | is_location_exact | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | amenities | square_feet | price | weekly_price | monthly_price | security_deposit | cleaning_fee | guests_included | extra_people | minimum_nights | maximum_nights | calendar_updated | has_availability | availability_30 | availability_60 | availability_90 | availability_365 | calendar_last_scraped | number_of_reviews | first_review | last_review | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | requires_license | license | jurisdiction_names | instant_bookable | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | reviews_per_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 12147973 | https://www.airbnb.com/rooms/12147973 | 20160906204935 | 2016-09-07 | Sunny Bungalow in the City | Cozy, sunny, family home. Master bedroom high... | The house has an open and cozy feel at the sam... | Cozy, sunny, family home. Master bedroom high... | none | Roslindale is quiet, convenient and friendly. ... | NaN | The bus stop is 2 blocks away, and frequent. B... | You will have access to 2 bedrooms, a living r... | NaN | Clean up and treat the home the way you'd like... | https://a2.muscache.com/im/pictures/c0842db1-e... | https://a2.muscache.com/im/pictures/c0842db1-e... | https://a2.muscache.com/im/pictures/c0842db1-e... | https://a2.muscache.com/im/pictures/c0842db1-e... | 31303940 | https://www.airbnb.com/users/show/31303940 | Virginia | 2015-04-15 | Boston, Massachusetts, United States | We are country and city connecting in our deck... | NaN | NaN | NaN | f | https://a2.muscache.com/im/pictures/5936fef0-b... | https://a2.muscache.com/im/pictures/5936fef0-b... | Roslindale | 1 | 1 | ['email', 'phone', 'facebook', 'reviews'] | t | f | Birch Street, Boston, MA 02131, United States | Roslindale | Roslindale | NaN | Boston | MA | 02131 | Boston | Boston, MA | US | United States | 42.282619 | -71.133068 | t | House | Entire home/apt | 4 | 1.5 | 2.0 | 3.0 | Real Bed | {TV,"Wireless Internet",Kitchen,"Free Parking ... | NaN | $250.00 | NaN | NaN | NaN | $35.00 | 1 | $0.00 | 2 | 1125 | 2 weeks ago | NaN | 0 | 0 | 0 | 0 | 2016-09-06 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | f | NaN | NaN | f | moderate | f | f | 1 | NaN |
1 | 3075044 | https://www.airbnb.com/rooms/3075044 | 20160906204935 | 2016-09-07 | Charming room in pet friendly apt | Charming and quiet room in a second floor 1910... | Small but cozy and quite room with a full size... | Charming and quiet room in a second floor 1910... | none | The room is in Roslindale, a diverse and prima... | If you don't have a US cell phone, you can tex... | Plenty of safe street parking. Bus stops a few... | Apt has one more bedroom (which I use) and lar... | If I am at home, I am likely working in my hom... | Pet friendly but please confirm with me if the... | https://a1.muscache.com/im/pictures/39327812/d... | https://a1.muscache.com/im/pictures/39327812/d... | https://a1.muscache.com/im/pictures/39327812/d... | https://a1.muscache.com/im/pictures/39327812/d... | 2572247 | https://www.airbnb.com/users/show/2572247 | Andrea | 2012-06-07 | Boston, Massachusetts, United States | I live in Boston and I like to travel and have... | within an hour | 100% | 100% | f | https://a2.muscache.com/im/users/2572247/profi... | https://a2.muscache.com/im/users/2572247/profi... | Roslindale | 1 | 1 | ['email', 'phone', 'facebook', 'linkedin', 'am... | t | t | Pinehurst Street, Boston, MA 02131, United States | Roslindale | Roslindale | NaN | Boston | MA | 02131 | Boston | Boston, MA | US | United States | 42.286241 | -71.134374 | t | Apartment | Private room | 2 | 1.0 | 1.0 | 1.0 | Real Bed | {TV,Internet,"Wireless Internet","Air Conditio... | NaN | $65.00 | $400.00 | NaN | $95.00 | $10.00 | 0 | $0.00 | 2 | 15 | a week ago | NaN | 26 | 54 | 84 | 359 | 2016-09-06 | 36 | 2014-06-01 | 2016-08-13 | 94.0 | 10.0 | 9.0 | 10.0 | 10.0 | 9.0 | 9.0 | f | NaN | NaN | t | moderate | f | f | 1 | 1.30 |
2 | 6976 | https://www.airbnb.com/rooms/6976 | 20160906204935 | 2016-09-07 | Mexican Folk Art Haven in Boston | Come stay with a friendly, middle-aged guy in ... | Come stay with a friendly, middle-aged guy in ... | Come stay with a friendly, middle-aged guy in ... | none | The LOCATION: Roslindale is a safe and diverse... | I am in a scenic part of Boston with a couple ... | PUBLIC TRANSPORTATION: From the house, quick p... | I am living in the apartment during your stay,... | ABOUT ME: I'm a laid-back, friendly, unmarried... | I encourage you to use my kitchen, cooking and... | https://a2.muscache.com/im/pictures/6ae8335d-9... | https://a2.muscache.com/im/pictures/6ae8335d-9... | https://a2.muscache.com/im/pictures/6ae8335d-9... | https://a2.muscache.com/im/pictures/6ae8335d-9... | 16701 | https://www.airbnb.com/users/show/16701 | Phil | 2009-05-11 | Boston, Massachusetts, United States | I am a middle-aged, single male with a wide ra... | within a few hours | 100% | 88% | t | https://a2.muscache.com/im/users/16701/profile... | https://a2.muscache.com/im/users/16701/profile... | Roslindale | 1 | 1 | ['email', 'phone', 'reviews', 'jumio'] | t | t | Ardale St., Boston, MA 02131, United States | Roslindale | Roslindale | NaN | Boston | MA | 02131 | Boston | Boston, MA | US | United States | 42.292438 | -71.135765 | t | Apartment | Private room | 2 | 1.0 | 1.0 | 1.0 | Real Bed | {TV,"Cable TV","Wireless Internet","Air Condit... | NaN | $65.00 | $395.00 | $1,350.00 | NaN | NaN | 1 | $20.00 | 3 | 45 | 5 days ago | NaN | 19 | 46 | 61 | 319 | 2016-09-06 | 41 | 2009-07-19 | 2016-08-05 | 98.0 | 10.0 | 9.0 | 10.0 | 10.0 | 9.0 | 10.0 | f | NaN | NaN | f | moderate | t | f | 1 | 0.47 |
3 | 1436513 | https://www.airbnb.com/rooms/1436513 | 20160906204935 | 2016-09-07 | Spacious Sunny Bedroom Suite in Historic Home | Come experience the comforts of home away from... | Most places you find in Boston are small howev... | Come experience the comforts of home away from... | none | Roslindale is a lovely little neighborhood loc... | Please be mindful of the property as it is old... | There are buses that stop right in front of th... | The basement has a washer dryer and gym area. ... | We do live in the house therefore might be som... | - The bathroom and house are shared so please ... | https://a2.muscache.com/im/pictures/39764190-1... | https://a2.muscache.com/im/pictures/39764190-1... | https://a2.muscache.com/im/pictures/39764190-1... | https://a2.muscache.com/im/pictures/39764190-1... | 6031442 | https://www.airbnb.com/users/show/6031442 | Meghna | 2013-04-21 | Boston, Massachusetts, United States | My husband and I live on the property. He’s a... | within a few hours | 100% | 50% | f | https://a2.muscache.com/im/pictures/5d430cde-7... | https://a2.muscache.com/im/pictures/5d430cde-7... | NaN | 1 | 1 | ['email', 'phone', 'reviews'] | t | f | Boston, MA, United States | NaN | Roslindale | NaN | Boston | MA | NaN | Boston | Boston, MA | US | United States | 42.281106 | -71.121021 | f | House | Private room | 4 | 1.0 | 1.0 | 2.0 | Real Bed | {TV,Internet,"Wireless Internet","Air Conditio... | NaN | $75.00 | NaN | NaN | $100.00 | $50.00 | 2 | $25.00 | 1 | 1125 | a week ago | NaN | 6 | 16 | 26 | 98 | 2016-09-06 | 1 | 2016-08-28 | 2016-08-28 | 100.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | f | NaN | NaN | f | moderate | f | f | 1 | 1.00 |
4 | 7651065 | https://www.airbnb.com/rooms/7651065 | 20160906204935 | 2016-09-07 | Come Home to Boston | My comfy, clean and relaxing home is one block... | Clean, attractive, private room, one block fro... | My comfy, clean and relaxing home is one block... | none | I love the proximity to downtown, the neighbor... | I have one roommate who lives on the lower lev... | From Logan Airport and South Station you have... | You will have access to the front and side por... | I love my city and really enjoy sharing it wit... | Please no smoking in the house, porch or on th... | https://a1.muscache.com/im/pictures/97154760/8... | https://a1.muscache.com/im/pictures/97154760/8... | https://a1.muscache.com/im/pictures/97154760/8... | https://a1.muscache.com/im/pictures/97154760/8... | 15396970 | https://www.airbnb.com/users/show/15396970 | Linda | 2014-05-11 | Boston, Massachusetts, United States | I work full time for a public school district.... | within an hour | 100% | 100% | t | https://a0.muscache.com/im/users/15396970/prof... | https://a0.muscache.com/im/users/15396970/prof... | Roslindale | 1 | 1 | ['email', 'phone', 'reviews', 'kba'] | t | t | Durnell Avenue, Boston, MA 02131, United States | Roslindale | Roslindale | NaN | Boston | MA | 02131 | Boston | Boston, MA | US | United States | 42.284512 | -71.136258 | t | House | Private room | 2 | 1.5 | 1.0 | 2.0 | Real Bed | {Internet,"Wireless Internet","Air Conditionin... | NaN | $79.00 | NaN | NaN | NaN | $15.00 | 1 | $0.00 | 2 | 31 | 2 weeks ago | NaN | 13 | 34 | 59 | 334 | 2016-09-06 | 29 | 2015-08-18 | 2016-09-01 | 99.0 | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 | 10.0 | f | NaN | NaN | f | flexible | f | f | 1 | 2.25 |
len(listings)
3585
At the moment our listings have several geospatial variables, the most important of which are longitude
and latitude
, which give the exact coordinates of the BnB in question. This means that it's easy for us to, say, plot every BnB location on a map:
plt.scatter(listings['longitude'], listings['latitude'])
<matplotlib.collections.PathCollection at 0x34240470>
Chances are you've already done this before, and it's a perfectly adequate way to get started working with locations. In this plot we see...not much, really. If you're very intimately familiar with the layout of the city of Boston, you will probably be able to make sense of some of these clusters which are, to me, not being from the city, totally mysterious.
In other words, this plot is missing something important: geospatial context.
Additionally, this display is unprojected—it's displayed in terms of raw coordinates. The amount of distance contained in a coordinate degree varies greatly depending on where you are, so this naive plot potentially pretty badly distorts distances.
We'll come back to the projection issue later; for now, there's an easy fix for both these problems.
Enter mplleaflet
. mplleaflet
is a tool that automatically takes a coordinate matplotlib
plot of any kind and places it on top of a leaflet
slippy map. The best part is that it's just one additional line of code. Just throw mplleaflet.display()
after generating your plot to drop it inline in your Jupyter notebook:
import mplleaflet
sample = listings.sample(1000)
plt.scatter(sample['longitude'], sample['latitude'])
mplleaflet.display()
Note, however, that rather than plotting every single point on our map we've taken a sample of 1000 of them and plotted those. This is known as undersampling.
This is oftentime necessary, as here, for performance reasons. Because mplleaflet
embeds point objects directly on an SVG canvas, it's as much as two orders of magnitude slower and more power-hungry than generating a really basic matplotlib
blob.
There are few ways to get around this limitation. The bokeh
library is classically mentioned as an option; the new, language-of-graphics-oriented (but still very incomplete) geoviews
library does an even better job. And if you really, really need to see those points, you can use datashader
, another not-even-a-year-old visualization library that comfortably works with millions of points (further reading).
However, in our case we'll do...none of these things. We're not really interested in individaul BnBs, per se; we're interested in grouping them and finding similarities and trends amongst them, not what exact street they're on. To do this, we'll wrap these points in polygons of some kind and try to look not at homes, but at neighborhoods.
The world around us is split up into lots of different kinds of geometries. The big ones are continents; from there we go down to countries, states, districts, counties, and so on.
On the city level, where we are in this case, there are usually a bunch of federal options (PUMAs or electoral districts for example) as well as a variety of usually more targetted geographical aggregations provided by the city government. Usually no matter where you go, however, the lowest-level geometries in the United States are census tracts. Census tracts are created every ten years for the census, and are built to try to contain either 4000 or 0 people (in the case of parks, beaches, etc.).
Boston GIS released data on the census tracts in their city not long after the 2010 census went out, and that's what we'll use.
However, the data comes in the form of a complex and convoluted GIS-standard data format known as a shapefile
(further reading).
pandas
has no facilities for reading shapefiles.
When Wes McKinney was first introducing pandas
to a wider audience, he was asked more than once why he was doing it when everyone was getting along so well passing raw numpy
arrays back and forth all the time (remember, hindsight is 20/20). Reading shapefiles into Python used to be a similarly terrible experience, until geopandas
came along and changed everything for the better.
Now you don't need to know anything at all about shapefiles to work with them, you just do this:
import geopandas as gpd
boston = gpd.read_file("../input/Census2010_Tracts.shp")
And you're done!
boston.head()
ALAND10 | AWATER10 | COUNTYFP10 | FUNCSTAT10 | GEOID10 | INTPTLAT10 | INTPTLON10 | MTFCC10 | NAME10 | NAMELSAD10 | STATEFP10 | Shape_Area | Shape_Leng | TRACTCE10 | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 363702.0 | 0.0 | 025 | S | 25025010405 | +42.3398654 | -071.0896052 | G5020 | 104.05 | Census Tract 104.05 | 25 | 3.914568e+06 | 14629.550361 | 010405 | POLYGON ((-71.090087 42.34665699999989, -71.09... |
1 | 136829.0 | 0.0 | 025 | S | 25025010404 | +42.3419667 | -071.0886375 | G5020 | 104.04 | Census Tract 104.04 | 25 | 1.472714e+06 | 5277.643216 | 010404 | POLYGON ((-71.09065999999994 42.3397659999999,... |
2 | 127905.0 | 0.0 | 025 | S | 25025010801 | +42.3541193 | -071.0770216 | G5020 | 108.01 | Census Tract 108.01 | 25 | 1.376667e+06 | 6166.497167 | 010801 | POLYGON ((-71.08159499999989 42.35370399999986... |
3 | 299981.0 | 0.0 | 025 | S | 25025010702 | +42.3518354 | -071.0755159 | G5020 | 107.02 | Census Tract 107.02 | 25 | 3.228780e+06 | 7818.852369 | 010702 | POLYGON ((-71.070655 42.35185399999991, -71.07... |
4 | 254706.0 | 0.0 | 025 | S | 25025010204 | +42.3462887 | -071.1033879 | G5020 | 102.04 | Census Tract 102.04 | 25 | 2.741497e+06 | 7621.654206 | 010204 | POLYGON ((-71.10682499999994 42.34875299999985... |
geopandas
extends pandas
DataFrame
and Series
expressions into GeoDataFrame
and GeoSeries
ones.
The main thing that these add is a geometry
column containing all of our shapes:
boston['geometry'].head()
0 POLYGON ((766978.240309909 2951616.923866287, ... 1 POLYGON ((766835.3942200691 2949104.969886944,... 2 POLYGON ((769261.2628526837 2954196.132479325,... 3 POLYGON ((772221.7065478563 2953536.652079985,... 4 POLYGON ((762449.9770828187 2952359.446074486,... Name: geometry, dtype: object
However, these polygons seem more than a little bit weird. The coordinates don't make any sense!
It turns out that the shapefile we're working with is encoded in some kind of alternative coordinate reference system (CRS), one with the origin point set to (200000, 750000). We can see this for ourselves by asking our GeoDataFrame
its crs
property:
boston.crs
{'datum': 'NAD83', 'lat_0': 41, 'lat_1': 41.71666666666667, 'lat_2': 42.68333333333333, 'lon_0': -71.5, 'no_defs': True, 'proj': 'lcc', 'units': 'us-ft', 'x_0': 199999.9999999999, 'y_0': 750000}
The master directory for all well-known coordinate systems is the EPSG. The simple latitude-and-longitude reference system is epsg:4326
, and geopandas
makes it really, really easy to "fix" our coordinates by just calling the to_crs
method:
boston = boston.to_crs({'init': 'epsg:4326'})
boston['geometry'].head()
0 POLYGON ((-71.090087 42.34665699999989, -71.09... 1 POLYGON ((-71.09065999999994 42.3397659999999,... 2 POLYGON ((-71.08159499999989 42.35370399999986... 3 POLYGON ((-71.070655 42.35185399999991, -71.07... 4 POLYGON ((-71.10682499999994 42.34875299999985... Name: geometry, dtype: object
Much better.
geopandas
objects have a plot
method which plops your geometry into a matplotlib
plot right away:
boston.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x263d6c18>
Again, to get geographic context, let's throw this result at mplleaflet
!
import mplleaflet
f = plt.figure(figsize=(15, 8))
ax = f.gca()
boston.plot(ax=ax)
mplleaflet.display(fig=f)
There it is—all the census tracts of the city of Boston.
Notice that some of these tracts have really "weird" shapes. This is because these are census tracts, remember, which try to isolate parts of the city which have people in them from those (like airports, beaches, islands, or the waterway) which do not.
DataFrame
to GeoDataFrame
Objects¶Under the hood, geopandas
represents geometries using the shapely
library. Just like how when using pandas
you still occassionally have to reach across to grab something from numpy
, when working with geometries it's important to understand what shapely
nets you.
import shapely
Luckily shapely
has some of the best documentation of any library I've ever used, and it's really simple to use.
Recall that our BnB listings was a DataFrame
containing two geospatial fields, latitude and longitude. As a quick demonstration of shapely
, let's upconvert this representation to true geometries. To do this we wrap our DataFrame
in a GeoDataFrame
, mapping our geometry
column to a shapely
Point
for each coordinate pair.
listings = gpd.GeoDataFrame(listings, geometry=listings.apply(
lambda srs: shapely.geometry.Point(srs['longitude'], srs['latitude']), axis='columns'
))
We've plotted our points and our polygons seperately, now how do we bring them together?
One obvious answer is to use a seaborn
kdeplot
, which many of you should be familiar with.
import seaborn as sns
C:\Users\Alex\Anaconda3\envs\geoplot\lib\site-packages\IPython\html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`. "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)
f = plt.figure(figsize=(15, 8))
ax = f.gca()
boston.plot(ax=ax, alpha=0.1, linewidth=0.25, color='white')
sns.kdeplot(data=listings.apply(lambda srs: pd.Series({'x': srs.geometry.x, 'y': srs.geometry.y}), axis='columns'), ax=ax,
alpha=1)
ax.set_axis_off()
This does give us a good sense of where our AirBNBs are concentrated. Unfortunately, however, KDEs are almost always too "blobish"—they're not granular enough to show the details inherent in geographic distributions.
With a little bit more work we can get to the closest thing to a gold standard for such things in cartography, a choropleth map.
First let's create a column in our polygonal dataset, BNBs
, containing a count of the number of AirBnB locations in the area.
import numpy as np
def assign_census_tract(bnb):
bools = [geom.contains(bnb['geometry']) for geom in boston['geometry']]
if True in bools:
return boston.iloc[bools.index(True)]['NAMELSAD10']
else:
return np.nan
listings['census_tract'] = listings.apply(assign_census_tract, axis='columns')
listings['census_tract'].value_counts().head()
Census Tract 701.01 113 Census Tract 102.03 111 Census Tract 8.02 95 Census Tract 201.01 86 Census Tract 606 84 Name: census_tract, dtype: int64
boston['BNBs'] = boston['NAMELSAD10'].map(listings['census_tract'].value_counts())
And now we use that data to throw up a map.
f = plt.figure(figsize=(15, 8))
ax = f.gca()
kw = dict(column='BNBs', k=6, cmap='YlGn', alpha=1, legend=True, edgecolor='gray', linewidth=0.5)
boston.plot(scheme='QUANTILES', ax=ax, **kw)
ax.set_axis_off()
C:\Users\Alex\Anaconda3\envs\geoplot\lib\site-packages\numpy\lib\function_base.py:3569: RuntimeWarning: Invalid value encountered in median RuntimeWarning)
This type of map is, as was mentioned earlier, known as a choropleth. It's a ubiquitious classic of the genre. The darker the shading, the more AirBNBs in that census tract.
Aside from a variety of visual parameters, we specified three important ones. column=BNBs
told geopandas
which columns to plot data from. scheme=QUANTILES
is which "bucketing" system you want to plot with: in this case we specified quantiles, which splits the data into bins equal in number, albiet not in size. Finally, k
tells geopandas
how many bins we want.
The trouble with this map, however, is that it doesn't take into account the size of the census tracts. If our census tract is bigger than average, it'll naturally contain more homes, and, all else being equal, more AirBnBs.
This is a classical mapmaking fallacy. Here's another example of this error, straight from the responsible article on Wikipedia:
Notice how the plot on the left is completely scrambled and doesn't actually tell us anything, while the plot on the right is smooth, structured, and informative.
In our case we'll fix it by figuring out AirBnB density per square kilometer, not just the raw number, and plot that.
Remember, however, that areas in the latitude-longitude coordinate system are not equal. To get an accurate measurement of area we need to first reproject to what's called an "equal-area" projection (epsg:3395
will do nicely), and then take the areas of those polygons instead. Note also that shapely
returns areas in square meters; we divide by a constant to get square kilometers. All this is done by the workhourse one-liner below.
boston['BNBDensity'] = (boston['BNBs'] / boston['geometry']\
.to_crs({'init': 'epsg:3395''})\
.map(lambda p: p.area / 10**6))\
.fillna(0)
f = plt.figure(figsize=(15, 8))
ax = f.gca()
kw = dict(column='BNBDensity', k=6, cmap='YlGn', alpha=1, legend=True, edgecolor='gray', linewidth=0.5)
boston.plot(scheme='QUANTILES', ax=ax, **kw)
ax.set_axis_off()
This more contiguous plot is the correct one!
Here we see that in the inner city AirBnB density reaches from 60 to almost 300 AirBnB locations per square kilometer. Since there are approximately 150 city blocks per kilometer squared, a back-of-the-envelope calculation says that this makes for anywhere from an AirBnB every three blocks to two AirBnBs per block.
That's a lot of market penetration.
We will now introduce the next major tool in the geographic toolbox, pysal
. pysal
is a by-now venerable library designed for performing spatial analytics, and it'll be our focus for the remainder of this tutorial.
import pysal as ps
Spatial data analytics requires understanding spaces, and understanding spaces requires tying them together somehow. Thus the concept of spatial weights.
Spatial weights tie together areas which are neighbors with one another in some sense. There are many ways of defining this concept of neighborhood-ness or closeness. We could, for example, pick all of the other shapes which are within a certain distance of our chosen shape. Or we could choose all shapes which directly border ours, or at least touch ours. The former of these is known as rook continuity and the latter as queen continuity, after the way those pieces move on a checkerboard, and they're two of the most common methods for defining spatial weights.
If you're interested in more details, you definitely need to check out the concept's entry in the pysal
documentation.
For ease of use, we'll pick queen continuity. This means that we will consider our census tracts to be neighbors so long as they touch at least at an edge. We can read such facts directly out of the shapefile using queen_from_shapefile
:
qW = ps.queen_from_shapefile("../input/Census2010_Tracts.shp")
This tells us that, for instance, our fourth census tract has four "touching" neighbors: tracts 6, 10, 50, and 80:
qW[4]
{6: 1.0, 10: 1.0, 50: 1.0, 80: 1.0}
To get a better sense of what we've just created, let's plot our neighborhood vectors on a map:
f = plt.figure(figsize=(15, 8))
ax = f.gca()
for i, (k, neighbors) in enumerate(qW.neighbors.items()):
origin = boston.geometry.iloc[i].centroid
for nabe_i in neighbors:
nabe_centroid = boston.geometry.iloc[nabe_i].centroid
plt.plot([origin.x, nabe_centroid.x], [origin.y, nabe_centroid.y], '-')
boston.plot(ax=ax, linewidth=0.5, facecolor='white')
ax.set_axis_off()
You can see it better when you zoom in a bit:
f = plt.figure(figsize=(15, 8))
ax = f.gca()
for i, (k, neighbors) in enumerate(qW.neighbors.items()):
origin = boston.geometry.iloc[i].centroid
for nabe_i in neighbors:
nabe_centroid = boston.geometry.iloc[nabe_i].centroid
plt.plot([origin.x, nabe_centroid.x], [origin.y, nabe_centroid.y], '-')
boston.plot(ax=ax, linewidth=1, facecolor='white')
ax.set_xlim([-71.2, -71.1])
ax.set_ylim([42.325, 42.375])
(42.325, 42.375)
...including the weirdness around the elongated census tract for that strip of beach.
Spacial weights are necessary in order to get a sense of the spatial lag inherent in our data. Spatial lag is a geographic version of autocorrelation, a perhaps familar term from time-series analysis: it's a measure of how strongly the data in one polygon is predicted by its neighbors.
If our data is totally randomly distributed on the map—if, in effect, it has no spatial component—then we expect that the BnB densities in our neighboring census tracts will be very poor predictors of the BnB density in our current tract, because we'll just be predicting noise with noise. If, on the other hand, the densities we predict are close to the real value, then we can predict density based on geography, and our data does indeed have a strong geographic component.
Of course, examining our plots from earlier, this may seem like a moot point—the choropleths make it pretty obvious that there's geography involved. But what we'd like to do now is assign a probability to that geography not being just random chance. Examining spatial lag allows us, essentially, to understand the statistical significance of our result.
Let's generate lags, split them into five quantiles, and see where they end up.
bnb_spatial_lags = ps.lag_spatial(qW, boston['BNBDensity'])
spatial_lag_classes = ps.Quantiles(bnb_spatial_lags, k=5)
spatial_lag_classes
Quantiles Lower Upper Count ============================================ x[i] <= 14.018 19 14.018 < x[i] <= 24.331 18 24.331 < x[i] <= 43.946 18 43.946 < x[i] <= 59.383 18 59.383 < x[i] <= 85.885 18 85.885 < x[i] <= 136.440 18 136.440 < x[i] <= 185.530 17 185.530 < x[i] <= 280.634 19 280.634 < x[i] <= 613.336 18 613.336 < x[i] <= 1128.536 18
spatial_lag_classes.yb
array([4, 4, 4, 4, 3, 4, 4, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 3, 3, 2, 4, 4, 4, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 0, 3, 4, 1, 0, 2, 0, 0, 1, 1, 1, 2, 2, 3, 3, 2, 1, 0, 4, 3, 1, 3, 2, 2, 3, 4, 4, 3, 1, 3, 4, 0, 0, 0, 2, 0, 0, 1, 1, 4, 4, 4, 4, 4, 3, 4, 4, 2, 3, 2, 3, 2, 0, 1, 0, 3, 2, 2, 3, 1, 3, 2, 2, 3, 2, 2, 2, 4, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 1, 2, 3, 2, 3, 1, 4, 2, 2, 2, 2, 1, 1, 3, 4, 4, 4, 4, 4, 3, 4, 2])
f = plt.figure(figsize=(15, 8))
ax = f.gca()
kw = dict(column='spatial_class', k=5, cmap='YlGn', alpha=1, legend=True, edgecolor='gray', linewidth=0.5,
categorical=True)
boston.assign(spatial_class=spatial_lag_classes.yb).plot(ax=ax, **kw)
ax.set_axis_off()
This geographic pattern is distinctly not random, strong evidence again that our data is geographically dependent.
How dependent? The tool of choice for this is the a Moran scatterplot. If geography matters our data should show a strong liner structure of some kind: high lag values correspond with high actual values, and low lags with low ones.
A quick plot shows that this is, indeed, the case:
f, ax = plt.subplots(1, figsize=(9, 9))
plt.plot(boston['BNBDensity'], bnb_spatial_lags, '.', color='firebrick')
# Calculate and plot a line of best fit.
b,a = np.polyfit(boston['BNBDensity'], bnb_spatial_lags, 1)
plt.plot(boston['BNBDensity'], a + b*boston['BNBDensity'], 'r')
plt.title('Moran Scatterplot')
plt.ylabel('Boston BNB Density Spatial Lag')
plt.xlabel('Boston BNB Density Actual')
plt.show()
We can get a p-value:
moran = ps.Moran(boston['BNBDensity'].values, qW)
moran.I, moran.p_sim
(0.52791249910953175, 0.001)
Our p-value of 0.001 tells us that there is a less than 0.1% chance that our data distribution is this heavily skewed by randomness alone.
We've learned how to display geography and how to measure its importance for a dataset. Let's now move on to learning how to apply it.
Suppose that we're interested in segmenting the Boston AirBnB markets based on the types of properties available. This is, after all, a part of our dataset:
listings['property_type'].value_counts()
Apartment 2612 House 562 Condominium 231 Townhouse 54 Bed & Breakfast 41 Loft 39 Other 17 Boat 12 Villa 6 Entire Floor 4 Dorm 2 Guesthouse 1 Camper/RV 1 Name: property_type, dtype: int64
Leaving aside the "exotic" options, we've really got two major types of BnBs: apartments and houses.
apartments = listings.query('property_type == "Apartment" or property_type == "Condominium"').groupby('census_tract').count()['id']
houses = listings.query('property_type == "House" or property_type == "Townhouse"').groupby('census_tract').count()['id']
Let's create numerical variables counting how many of each we have per census tract.
boston['BNBDensity_Houses'] = boston['NAMELSAD10'].map(houses).fillna(0)
boston['BNBDensity_Apartments'] = boston['NAMELSAD10'].map(apartments).fillna(0)
We have three predictor variables that we'd like to use: the number of apartments, the numbers of houses, and their density combined.
bnb_market = boston[['BNBDensity', 'BNBDensity_Houses', 'BNBDensity_Apartments']]
If you're familiar with scikit-learn
this is, at this point, a classic clustering problem, one we can easily solve using k-means clustering, for example:
import sklearn.cluster
import sklearn.preprocessing
clf = sklearn.cluster.KMeans(n_clusters=3)
X = sklearn.preprocessing.scale(bnb_market.values)
classes = clf.fit(X)
But remember that we're dealing with geometries, not just points in space! Our clustering algorithm doesn't know about this; it just naively classifies even very distant tracts into bins based on characteristic similarity alone. We can see this when we plot the result:
f = plt.figure(figsize=(15, 8))
ax = f.gca()
kw = dict(column='cluster', k=3, cmap='YlGn', alpha=1, legend=True, edgecolor='gray', linewidth=0.5, categorical=True)
boston.assign(cluster=classes.labels_).plot(ax=ax, **kw)
ax.set_axis_off()
Our classifier does seem to have learned, in effect, the difference between low-density outlying tracts, medium-density tracts, and high-density inner city tracts:
bnb_market.assign(cluster=classes.labels_).groupby('cluster').mean()
BNBDensity | BNBDensity_Houses | BNBDensity_Apartments | |
---|---|---|---|
cluster | |||
0 | 12.188742 | 2.062992 | 6.740157 |
1 | 146.465581 | 1.724138 | 46.862069 |
2 | 28.905771 | 11.840000 | 23.640000 |
Indeed, this classifier does exactly what we set out to do. Examine the following plot:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(bnb_market['BNBDensity'], bnb_market['BNBDensity_Apartments'],
zs=bnb_market['BNBDensity_Houses'], c=classes.labels_, cmap='YlGn')
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x28f554a8>
The pale census tracts are those which have very little AirBnB activity at all. They tend to be furthest from the city center. The second class of tracts, the dark green ones, have a significant BnB density, but consist primarily of houses, not apartments; these correspond with middle-of-the-pack residential areas. Finally, the densest areas, tracts containing plenty of apartments for rent but no homes and located in the heart of the city, are pale green.
This is perfect! Right?
Well, it is, except that the areas we've binned are discontinuous. Because our classifier disregards geography it doesn't put in any effort keeping our class members next one another, and so we don't arrive at any understanding of neighborhood topography.
pysal
contains classifiers that allow us to do just that. Cheif amongst them is the Maxp
clustering algorithm, which chooses clusters which match each other characteristically and are also all next to one another (according to whatever definition of neighbors you choose—we'll keep going with queen contiguity).
Here's what we get when we run it. Notice the place of our spatial weights in the input, as well as a minimum number of observations per class—150.
clf = ps.region.Maxp(qW, bnb_market.values, 150, boston['BNBs'].values[:,None])
# n = 200
f = plt.figure(figsize=(15, 8))
ax = f.gca()
kw = dict(column='cluster', k=10, cmap='YlGn', alpha=1, legend=True, edgecolor='gray', linewidth=0.5, categorical=True)
boston.assign(cluster=clf.area2region.values()).plot(ax=ax, **kw)
ax.set_axis_off()
Because of the additional constraint the algorithm imposes, it tends to result in significantly more clusters than comparable non-spatial clustering algorithms. But, as you can see here, the algorithm did arrive at essentially the same conclusion as our kNN did: apartment-dense inner cities, house-dense middle burbs, and outlying areas. It just takes ten contiguous districts to describe what three discontiguous ones.
Here again are our mean characteristics:
bnb_market.assign(cluster=clf.area2region.values()).groupby('cluster').mean()
BNBDensity | BNBDensity_Houses | BNBDensity_Apartments | |
---|---|---|---|
cluster | |||
0 | 78.451566 | 1.764706 | 36.647059 |
1 | 16.821212 | 5.833333 | 11.166667 |
2 | 7.442751 | 2.882353 | 4.423529 |
3 | 21.453374 | 2.428571 | 10.142857 |
4 | 53.346294 | 3.375000 | 15.250000 |
5 | 29.089423 | 3.333333 | 29.333333 |
6 | 39.003481 | 10.000000 | 24.777778 |
7 | 83.277575 | 10.500000 | 39.750000 |
8 | 205.586501 | 1.285714 | 43.857143 |
9 | 166.723335 | 0.666667 | 49.500000 |
Suppose that we're interested in modeling the price of a BnB. Let's use the following basic features to do it:
explanatory_variables = ['bathrooms', 'bedrooms', 'beds', 'guests_included']
There's a lot of boilerplate here (note the need to merge together the price
and cleaning fee
into an actual_price
, BnB hosts like to be sneaky and sock away as much as 50$ or more into that "fee"!). But if you've run a basic linear regression before, this should be pretty rote. We're going to use statsmodels
to do it.
price_variables = ['price', 'cleaning_fee']
geom = ['latitude', 'longitude'] # we'll need this later
valid_listings = listings[explanatory_variables + price_variables + geom].dropna()
valid_listings['actual_price'] = valid_listings['price'].map(convert_price) +\
valid_listings['cleaning_fee'].map(convert_price)
valid_listings = valid_listings.dropna()
def convert_price(price):
try:
return float(price.replace("$", "").replace(".", "")[:-2])
except (ValueError, AttributeError):
return np.nan
import statsmodels.api as sm
clf = sm.OLS(valid_listings['actual_price'], valid_listings[explanatory_variables])
basic_ols_res = clf.fit()
basic_ols_res.summary()
Dep. Variable: | actual_price | R-squared: | 0.821 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.820 |
Method: | Least Squares | F-statistic: | 2808. |
Date: | Sun, 20 Nov 2016 | Prob (F-statistic): | 0.00 |
Time: | 19:55:02 | Log-Likelihood: | -15293. |
No. Observations: | 2460 | AIC: | 3.059e+04 |
Df Residuals: | 2456 | BIC: | 3.062e+04 |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
bathrooms | 86.6214 | 4.185 | 20.699 | 0.000 | 78.415 94.828 |
bedrooms | 38.2459 | 4.722 | 8.099 | 0.000 | 28.986 47.505 |
beds | 36.4452 | 3.642 | 10.006 | 0.000 | 29.303 43.588 |
guests_included | 12.0422 | 2.316 | 5.199 | 0.000 | 7.500 16.585 |
Omnibus: | 172.427 | Durbin-Watson: | 1.296 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 241.583 |
Skew: | 0.592 | Prob(JB): | 3.48e-53 |
Kurtosis: | 3.977 | Cond. No. | 7.49 |
Now how do we take this analysis and make it spatial?
A common theme found in the field of spatial econometrics is the idea of mutual price-setting. That is, when someone sets the price of a spatial good, they take into account the price that good set by their neighbors, and vice versa.
If price-setting has a spatial element of this kind (a beating-the-Joneses effect of sorts), then we should be able to improve the results of our model by mixing in observations about nearby BnBs.
To do that we'll first need to grab ahold of what we mean by "neighbors". Since we're dealing with points, not shapes, contiguities don't apply here anymore; we in our usual k-nearest neighbor space. It's possible to grab our kNN from scipy
, but it's much faster coding-wise to take a pysal
built-in instead.
w = ps.knnW_from_array(valid_listings[['longitude', 'latitude']].dropna().values, k=2)
# w.transform = 'R'
w
<pysal.weights.Distance.KNN at 0x29652710>
Then to run our spatial regression we simply pass in the same parameters as before, but now with the inclusion of our weights (w
). name_x
and name_y
are optional parameters which simply add a little bit more information to the summary display.
geospatial_ols = ps.spreg.GM_Lag(valid_listings['actual_price'].values[:, None],
valid_listings[explanatory_variables].values,
w=w,
name_x=explanatory_variables,
name_y='actual_price')
print(geospatial_ols.summary)
REGRESSION ---------- SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES -------------------------------------------------- Data set : unknown Weights matrix : unknown Dependent Variable :actual_price Number of Observations: 2460 Mean dependent var : 246.6793 Number of Variables : 6 S.D. dependent var : 145.2765 Degrees of Freedom : 2454 Pseudo R-squared : 0.2996 Spatial Pseudo R-squared: 0.3265 ------------------------------------------------------------------------------------ Variable Coefficient Std.Error z-Statistic Probability ------------------------------------------------------------------------------------ CONSTANT 74.5257846 11.4683038 6.4984139 0.0000000 bathrooms 48.5721848 6.0617495 8.0128988 0.0000000 bedrooms 39.4847512 4.7342175 8.3402908 0.0000000 beds 35.5915425 3.6648137 9.7116921 0.0000000 guests_included 5.9026820 2.4268024 2.4322878 0.0150038 W_actual_price -0.0212334 0.0196799 -1.0789394 0.2806148 ------------------------------------------------------------------------------------ Instrumented: W_actual_price Instruments: W_bathrooms, W_bedrooms, W_beds, W_guests_included ================================ END OF REPORT =====================================
Unfortunately we will end our crash course on a sad note. it doesn't look like neighbor price fixing is a significant variable at all, as our coefficient is on the level of statistical noise.
Furthermore, it turns out that our spatial algorithm is significantly worse; the mean squared error of our spatial result, a common metric for OLS regression, is much worse than for our no-frills one:
geospatial_ols_mse = ((valid_listings['actual_price'] - geospatial_ols.predy.flatten()) ** 2).mean()
print("OLS".ljust(5), ":", basic_ols_res.mse_total)
print("OLS+W".ljust(5), ":", geospatial_ols_mse)
OLS : 81947.352439 OLS+W : 14788.3873299
So in this case we've shown pretty conclusively just the opposite: that the price of a BnB has little to do with the prices set by its neighbors.
That concludes our crash course on geospatial data science. There's lots of useful components in the Python toolbox that we didn't cover: working with images using rasterio
, for example. But I hope that this quick crash course has at least made you aware of some of the many interesting geographic things you can do with the tools that Python has! Keep them in mind—they're sure to come in handy later.
If you want to learn more about Python geospatial tools, here's a only loosely-ordered list of resources to look at:
If you're up for a challenge, the dataset also includes information on the amenities present at each location:
amenities = listings['amenities'].map(lambda d: [amenity.replace('"', "")\
.replace("{", "")\
.replace("}", "") for amenity in d.split(",")])
listings['amenities'] = amenities
possible_amenities = set([item for sublist in amenities for item in sublist])
possible_amenities
{'', '24-Hour Check-in', 'Air Conditioning', 'Breakfast', 'Buzzer/Wireless Intercom', 'Cable TV', 'Carbon Monoxide Detector', 'Cat(s)', 'Dog(s)', 'Doorman', 'Dryer', 'Elevator in Building', 'Essentials', 'Family/Kid Friendly', 'Fire Extinguisher', 'First Aid Kit', 'Free Parking on Premises', 'Free Parking on Street', 'Gym', 'Hair Dryer', 'Hangers', 'Heating', 'Hot Tub', 'Indoor Fireplace', 'Internet', 'Iron', 'Kitchen', 'Laptop Friendly Workspace', 'Lock on Bedroom Door', 'Other pet(s)', 'Paid Parking Off Premises', 'Pets Allowed', 'Pets live on this property', 'Pool', 'Safety Card', 'Shampoo', 'Smoke Detector', 'Smoking Allowed', 'Suitable for Events', 'TV', 'Washer', 'Washer / Dryer', 'Wheelchair Accessible', 'Wireless Internet', 'translation missing: en.hosting_amenity_49', 'translation missing: en.hosting_amenity_50'}
Analysis by the pysal
folks showed that, at least in Austin, the presence of a pool amenity somewhere nearby had a small but significant effect on price.
Does this result extend to Boston? And do any of the other amenities have an effect like this as well?