#!/usr/bin/env python
# coding: utf-8
# # Exploratory Analysis of Spatial Data: Spatial Autocorrelation #
#
#
# In this notebook we introduce methods of _exploratory spatial data analysis_
# that are intended to complement geovisualization through formal univariate and
# multivariate statistical tests for spatial clustering.
#
#
# ## Imports
# In[1]:
import sys
import os
sys.path.append(os.path.abspath('..'))
import esda
# In[2]:
import pandas as pd
import geopandas as gpd
import libpysal as lps
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
# Our dataset comes from the Berlin Airbnb scrape taken in April 2018. This dataframe was constructed as part of the [GeoPython 2018 workshop](https://github.com/ljwolf/geopython) by [Levi Wolf](https://ljwolf.org) and [Serge Rey](https://sergerey.org). As part of the workshop, a GeoDataFrame was constructed with one of the columns reporting the median listing price of units in each neighborhood in Berlin:
# In[3]:
df = gpd.read_file('data/neighborhoods.shp')
# was created in previous notebook with df.to_file('data/neighborhoods.shp')
# In[4]:
df.head()
# We first have a missing value (`nan`) to deal with:
# In[5]:
pd.isnull(df['median_pri']).sum()
# In[6]:
# fill the missing value with the mean of the neighborhood median prices
df['median_pri'] = df['median_pri'].fillna(df['median_pri'].mean())
# In[7]:
df.plot(column='median_pri')
# In[8]:
fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
df.plot(column='median_pri', scheme='Quantiles', k=5, cmap='GnBu', legend=True, ax=ax)
#ax.set_xlim(150000, 160000)
#ax.set_ylim(208000, 215000)
# ## Spatial Autocorrelation ##
#
# Visual inspection of the map pattern for the prices allows us to search for
# spatial structure. If the spatial distribution of the prices were random, then we
# should not see any clustering of similar values on the map. However, our visual
# system is drawn to the darker clusters in the south west as well as the center,
# and a concentration of the lighter hues (lower prices) in the north central and
# south east.
#
# Our brains are very powerful pattern recognition machines. However, sometimes
# they can be too powerful, leading us to detect false positives: patterns
# where no statistically significant patterns exist. This is a particular concern
# when dealing with the visualization of irregular polygons of differing sizes and shapes.
#
# The concept of *spatial
# autocorrelation* relates to the combination of two types of similarity: spatial
# similarity and attribute similarity. Although there are many different measures
# of spatial autocorrelation, they all combine these two types of similarity into
# a summary measure.
#
# Let's use PySAL to generate these two types of similarity
# measures.
#
# ### Spatial Similarity ###
#
# We have already encountered spatial weights
# in a previous notebook. In spatial autocorrelation analysis, the spatial weights
# are used to formalize the notion of spatial similarity. As we have seen there
# are many ways to define spatial weights, here we will use queen contiguity:
# In[9]:
wq = lps.weights.Queen.from_dataframe(df)
wq.transform = 'r'
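# Row standardization (`transform = 'r'`) rescales each unit's weights so they
# sum to one. As a quick check (index 0 is just an example, assuming the default
# integer ids):
print(wq.neighbors[0])     # ids of the queen neighbors of unit 0
print(wq.weights[0])       # equal weights across those neighbors
print(sum(wq.weights[0]))  # approximately 1.0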
# ### Attribute Similarity ###
#
# So the spatial weight between neighborhoods $i$ and $j$ indicates if the two
# are neighbors (i.e., geographically similar). What we also need is a measure of
# attribute similarity to pair up with this concept of spatial similarity. The
# **spatial lag** is a derived variable that accomplishes this for us. For neighborhood
# $i$ the spatial lag is defined as: $$ylag_i = \sum_j w_{i,j} y_j$$
# In[10]:
y = df['median_pri']
ylag = lps.weights.lag_spatial(wq, y)
# In[11]:
ylag
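# The spatial lag is simply the product of the row-standardized weights matrix
# and $y$, which we can verify directly:
np.allclose(ylag, wq.sparse @ np.asarray(y))  # True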
# In[12]:
import mapclassify as mc
ylagq5 = mc.Quantiles(ylag, k=5)
# In[13]:
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=ylagq5.yb).plot(column='cl', categorical=True, k=5, cmap='GnBu', linewidth=0.1, ax=ax, edgecolor='white', legend=True)
ax.set_axis_off()
plt.title("Spatial Lag Median Price (Quintiles)")
plt.show()
# The quintile map for the spatial lag tends to enhance the impression of value
# similarity in space. It is, in effect, a local smoother.
# In[14]:
df['lag_median_pri'] = ylag
f,ax = plt.subplots(1,2,figsize=(2.16*4,4))
df.plot(column='median_pri', ax=ax[0], edgecolor='k',
scheme="quantiles", k=5, cmap='GnBu')
ax[0].axis(df.total_bounds[np.asarray([0,2,1,3])])
ax[0].set_title("Price")
df.plot(column='lag_median_pri', ax=ax[1], edgecolor='k',
scheme='quantiles', cmap='GnBu', k=5)
ax[1].axis(df.total_bounds[np.asarray([0,2,1,3])])
ax[1].set_title("Spatial Lag Price")
ax[0].axis('off')
ax[1].axis('off')
plt.show()
# However, we still have
# the challenge of visually associating the value of the prices in a neighborhood
# with the value of the spatial lag of values for the focal unit. The latter is a
# weighted average of the prices in the units neighboring the focal neighborhood.
#
# To complement the geovisualization of these associations we can turn to formal
# statistical measures of spatial autocorrelation.
#
#
# ## Global Spatial Autocorrelation
#
# We begin with a simple case where the variable under consideration is binary.
# This is useful to unpack the logic of spatial autocorrelation tests. So even though
# our attribute is a continuously valued one, we will convert it to a binary case
# to illustrate the key concepts:
#
# ### Binary Case
# In[15]:
y.median()
# In[16]:
yb = y > y.median()
sum(yb)
# We have 68 neighborhoods with list prices above the median and 70 below the
# median (recall the issue with ties).
# In[17]:
yb = y > y.median()
labels = ["0 Low", "1 High"]
yb = [labels[i] for i in 1*yb]
df['yb'] = yb
# The spatial distribution of the binary variable immediately raises questions
# about the juxtaposition of the "black" and "white" areas.
# In[18]:
fig, ax = plt.subplots(figsize=(12,10), subplot_kw={'aspect':'equal'})
df.plot(column='yb', cmap='binary', edgecolor='grey', legend=True, ax=ax)
# ### Join counts ###
#
# One way to formalize a test for spatial autocorrelation in a binary attribute is
# to consider the so-called _joins_. A join exists for each neighbor pair of
# observations, and the joins are reflected in our binary spatial weights object
# `wq`.
#
# Each unit can take on one of two values "Black" or "White", and so for a given
# pair of neighboring locations there are three different types of joins that can
# arise:
#
# - Black Black (BB)
# - White White (WW)
# - Black White (or White Black) (BW)
#
# Given that we have 68 Black polygons on our map, what is the number of Black
# Black (BB) joins we could expect if the process were such that the Black
# polygons were randomly assigned on the map? This is the logic of join count statistics.
#
# We can use the `esda` package from PySAL to carry out join count analysis:
# In[19]:
import esda
yb = 1 * (y > y.median()) # convert back to binary
wq = lps.weights.Queen.from_dataframe(df)
wq.transform = 'b'
np.random.seed(12345)
jc = esda.join_counts.Join_Counts(yb, wq)
# The resulting object stores the observed counts for the different types of joins:
# In[20]:
jc.bb
# In[21]:
jc.ww
# In[22]:
jc.bw
# Note that the three cases exhaust all possibilities:
# In[23]:
jc.bb + jc.ww + jc.bw
# and
# In[24]:
wq.s0 / 2
# which is the number of unique joins in the spatial weights object.
#
# Our object tells us we have observed 121 BB joins:
# In[25]:
jc.bb
# The critical question for us is whether this is a departure from what we would
# expect if the process generating the spatial distribution of the Black polygons
# were completely random. To answer this, PySAL uses random spatial
# permutations of the observed attribute values to generate realizations under
# the null of _complete spatial randomness_ (CSR). This is repeated a large number
# of times (999 permutations by default) to construct a reference distribution
# against which the statistical significance of our observed counts is evaluated.
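# To make this concrete, here is a minimal sketch of a single permutation step
# (illustrative only; `esda` does this internally and stores the simulated counts
# in `jc.sim_bb`):
yb_perm = np.random.permutation(np.asarray(yb))  # one realization under CSR
float(yb_perm @ (wq.sparse @ yb_perm)) / 2       # its BB count (each join counted twice)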
#
# The average number of BB joins from the synthetic realizations is:
# In[26]:
jc.mean_bb
# which is less than our observed count. The question is whether our observed
# value is so different from the expectation that we would reject the null of CSR?
# In[27]:
import seaborn as sbn
sbn.kdeplot(jc.sim_bb, fill=True)
plt.vlines(jc.bb, 0, 0.075, color='r')
plt.vlines(jc.mean_bb, 0,0.075)
plt.xlabel('BB Counts')
# The density portrays the distribution of the BB counts, with the black vertical
# line indicating the mean BB count from the synthetic realizations and the red
# line the observed BB count for our prices. Clearly our observed value is
# extremely high. A pseudo p-value summarizes this:
# In[28]:
jc.p_sim_bb
# Since this is below conventional significance levels, we would reject the null
# of complete spatial randomness in favor of spatial autocorrelation in market prices.
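# The pseudo p-value itself is straightforward to reconstruct (a sketch of the
# usual formula: the share of realizations at least as large as the observed
# count, with the observed value included in both numerator and denominator):
(1 + (jc.sim_bb >= jc.bb).sum()) / (1 + len(jc.sim_bb))  # matches jc.p_sim_bb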
#
#
# ### Continuous Case
#
# The join count analysis is based on a binary attribute, which can cover many
# interesting empirical applications where one is interested in presence and
# absence type phenomena. In our case, however, we artificially created the binary
# variable, and in the process threw away a lot of the information in our original
# continuous attribute. Turning back to the original variable, we can explore
# other tests for spatial autocorrelation for the continuous case.
#
# First, we transform our weights to be row-standardized, from the current binary state:
# In[29]:
wq.transform = 'r'
# In[30]:
y = df['median_pri']
# Moran's I is a test for global autocorrelation for a continuous attribute:
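#
# $$I = \frac{n}{\sum_i \sum_j w_{ij}} \frac{\sum_i \sum_j w_{ij} z_i z_j}{\sum_i z_i^2}$$
#
# where $z_i = y_i - \bar{y}$ is the deviation of observation $i$ from the mean
# and $w_{ij}$ are the spatial weights. With row-standardized weights,
# $\sum_i \sum_j w_{ij} = n$, so the leading factor equals one.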
# In[31]:
np.random.seed(12345)
mi = esda.moran.Moran(y, wq)
mi.I
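# As a quick check (a minimal sketch, using the row-standardized form of the
# statistic given above), we can reproduce the value of $I$ directly:
z = np.asarray(y - y.mean())
float(z @ (wq.sparse @ z)) / float(z @ z)  # matches mi.I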
# Again, our value for the statistic needs to be interpreted against a reference
# distribution under the null of CSR. PySAL uses a similar approach as we saw in
# the join count analysis: random spatial permutations.
# In[32]:
import seaborn as sbn
sbn.kdeplot(mi.sim, fill=True)
plt.vlines(mi.I, 0, 1, color='r')
plt.vlines(mi.EI, 0,1)
plt.xlabel("Moran's I")
# Here our observed value is again in the upper tail, although visually it does
# not look as extreme as in the binary case. Yet, it is still statistically significant:
# In[33]:
mi.p_sim
# ## Local Autocorrelation: Hot Spots, Cold Spots, and Spatial Outliers ##
#
# In addition to the global autocorrelation statistics, PySAL has many local
# autocorrelation statistics. Let's compute a local Moran statistic for the same
# dataset:
# In[34]:
np.random.seed(12345)
import esda
# In[35]:
wq.transform = 'r'
lag_price = lps.weights.lag_spatial(wq, df['median_pri'])
# In[36]:
price = df['median_pri']
b, a = np.polyfit(price, lag_price, 1)
f, ax = plt.subplots(1, figsize=(9, 9))
plt.plot(price, lag_price, '.', color='firebrick')
# dashed vert at mean of the price
plt.vlines(price.mean(), lag_price.min(), lag_price.max(), linestyle='--')
# dashed horizontal at mean of lagged price
plt.hlines(lag_price.mean(), price.min(), price.max(), linestyle='--')
# red line of best fit; with row-standardized weights its slope approximates global Moran's I
plt.plot(price, a + b*price, 'r')
plt.title('Moran Scatterplot')
plt.ylabel('Spatial Lag of Price')
plt.xlabel('Price')
plt.show()
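# With row-standardized weights, the slope of the fitted line above is
# (approximately) the global Moran's I computed earlier:
print(b, mi.I)  # the two values should be close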
# Now, instead of a single $I$ statistic, we have an *array* of local $I_i$
# statistics, stored in the `.Is` attribute, and p-values from the simulation are
# in `p_sim`.
# In[37]:
li = esda.moran.Moran_Local(y, wq)
# In[38]:
li.q
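# For example, a quick peek at the attributes mentioned above:
print(li.Is[:5])     # local Moran statistics for the first five neighborhoods
print(li.p_sim[:5])  # their pseudo p-values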
# We can again test for local clustering using permutations, but here we use
# conditional random permutations: the value at each focal location is held
# fixed while the remaining values are shuffled, yielding a different reference
# distribution for each location.
# In[39]:
(li.p_sim < 0.05).sum()
# We can distinguish the specific type of local spatial association reflected in
# the four quadrants of the Moran Scatterplot above:
# In[40]:
sig = li.p_sim < 0.05
hotspot = sig * (li.q == 1)   # quadrant 1: high value, high-value neighbors (HH)
coldspot = sig * (li.q == 3)  # quadrant 3: low value, low-value neighbors (LL)
doughnut = sig * (li.q == 2)  # quadrant 2: low value, high-value neighbors (LH)
diamond = sig * (li.q == 4)   # quadrant 4: high value, low-value neighbors (HL)
# In[41]:
spots = ['n.sig.', 'hot spot']
labels = [spots[i] for i in hotspot*1]
# In[42]:
from matplotlib import colors
hmap = colors.ListedColormap(['red', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, k=2, cmap=hmap, linewidth=0.1, ax=ax, edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()
# In[43]:
spots = ['n.sig.', 'cold spot']
labels = [spots[i] for i in coldspot*1]
# In[44]:
from matplotlib import colors
hmap = colors.ListedColormap(['blue', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, k=2, cmap=hmap, linewidth=0.1, ax=ax, edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()
# In[45]:
spots = ['n.sig.', 'doughnut']
labels = [spots[i] for i in doughnut*1]
# In[46]:
from matplotlib import colors
hmap = colors.ListedColormap(['lightblue', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, k=2, cmap=hmap, linewidth=0.1, ax=ax, edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()
# In[47]:
spots = ['n.sig.', 'diamond']
labels = [spots[i] for i in diamond*1]
# In[48]:
from matplotlib import colors
hmap = colors.ListedColormap(['pink', 'lightgrey'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, k=2, cmap=hmap, linewidth=0.1, ax=ax, edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()
# In[49]:
sig = 1 * (li.p_sim < 0.05)
hotspot = 1 * (sig * (li.q == 1))
coldspot = 3 * (sig * (li.q == 3))
doughnut = 2 * (sig * (li.q == 2))
diamond = 4 * (sig * (li.q == 4))
spots = hotspot + coldspot + doughnut + diamond
spots
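# A quick tally of how many neighborhoods fall into each class
# (0 = not significant, 1 = hot spot, 2 = doughnut, 3 = cold spot, 4 = diamond):
np.bincount(spots, minlength=5)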
# In[50]:
spot_labels = [ '0 ns', '1 hot spot', '2 doughnut', '3 cold spot', '4 diamond']
labels = [spot_labels[i] for i in spots]
# In[51]:
from matplotlib import colors
hmap = colors.ListedColormap([ 'lightgrey', 'red', 'lightblue', 'blue', 'pink'])
f, ax = plt.subplots(1, figsize=(9, 9))
df.assign(cl=labels).plot(column='cl', categorical=True, cmap=hmap, linewidth=0.1, ax=ax, edgecolor='white', legend=True)
ax.set_axis_off()
plt.show()