Spatial relationships and joins
DS Python for GIS and Geoscience
October, 2020© 2020, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
%matplotlib inline
import pandas as pd
import geopandas
countries = geopandas.read_file("zip://./data/ne_110m_admin_0_countries.zip")
cities = geopandas.read_file("zip://./data/ne_110m_populated_places.zip")
rivers = geopandas.read_file("zip://./data/ne_50m_rivers_lake_centerlines.zip")
An important aspect of geospatial data is that we can look at spatial relationships: how two spatial objects relate to each other (whether they overlap, intersect, contain, .. one another).
The topological, set-theoretic relationships in GIS are typically based on the DE-9IM model. See https://en.wikipedia.org/wiki/Spatial_relation for more information.
(Image by Krauss, CC BY-SA 3.0)
Let's first create some small toy spatial objects:
A polygon (note: we use .item()
here to to extract the scalar geometry object from the GeoSeries of length 1):
belgium = countries.loc[countries['name'] == 'Belgium', 'geometry'].item()
Two points:
paris = cities.loc[cities['name'] == 'Paris', 'geometry'].item()
brussels = cities.loc[cities['name'] == 'Brussels', 'geometry'].item()
And a linestring:
from shapely.geometry import LineString
line = LineString([paris, brussels])
Let's visualize those 4 geometry objects together (I only put them in a GeoSeries to easily display them together with the geopandas .plot()
method):
geopandas.GeoSeries([belgium, paris, brussels, line]).plot(cmap='tab10')
You can recognize the abstract shape of Belgium.
Brussels, the capital of Belgium, is thus located within Belgium. This is a spatial relationship, and we can test this using the individual shapely geometry objects as follow:
brussels.within(belgium)
And using the reverse, Belgium contains Brussels:
belgium.contains(brussels)
On the other hand, Paris is not located in Belgium:
belgium.contains(paris)
paris.within(belgium)
The straight line we draw from Paris to Brussels is not fully located within Belgium, but it does intersect with it:
belgium.contains(line)
line.intersects(belgium)
The same methods that are available on individual shapely
geometries as we have seen above, are also available as methods on GeoSeries
/ GeoDataFrame
objects.
For example, if we call the contains
method on the world dataset with the paris
point, it will do this spatial check for each country in the world
dataframe:
countries.contains(paris)
Because the above gives us a boolean result, we can use that to filter the dataframe:
countries[countries.contains(paris)]
And indeed, France is the only country in the world in which Paris is located.
Another example, extracting the linestring of the Amazon river in South America, we can query through which countries the river flows:
amazon = rivers[rivers['name'] == 'Amazonas'].geometry.item()
countries[countries.crosses(amazon)] # or .intersects
REFERENCE:
Overview of the different functions to check spatial relationships (spatial predicate functions):
equals
contains
crosses
disjoint
intersects
overlaps
touches
within
covers
See https://shapely.readthedocs.io/en/stable/manual.html#predicates-and-relationships for an overview of those methods.
See https://en.wikipedia.org/wiki/DE-9IM for all details on the semantics of those operations.
We will again use the Paris datasets to exercise. Let's start importing them again, and directly converting both to the local projected CRS:
districts = geopandas.read_file("data/paris_districts.geojson").to_crs(epsg=2154)
stations = geopandas.read_file("data/paris_bike_stations.geojson").to_crs(epsg=2154)
EXERCISE: The Eiffel Tower
The Eiffel Tower is an iron lattice tower built in the 19th century, and is probably the most iconic view of Paris.
The location of the Eiffel Tower is: x of 648237.3 and y of 6862271.9
eiffel_tower
. Print the result.Point
class is available in the shapely.geometry
submodulePoint()
constructor.within()
method checks if the object is located within the passed geometry (used as geometry1.within(geometry2)
).contains()
method checks if the object contains the passed geometry (used as geometry1.contains(geometry2)
).distance()
method of one of the geometries can be used.# Import the Point geometry
from shapely.geometry import Point
# %load _solutions/04-spatial-relationships-joins1.py
# %load _solutions/04-spatial-relationships-joins2.py
# Accessing the Montparnasse geometry (Polygon)
district_montparnasse = districts.loc[52, 'geometry']
bike_station = stations.loc[293, 'geometry']
# %load _solutions/04-spatial-relationships-joins3.py
# %load _solutions/04-spatial-relationships-joins4.py
# %load _solutions/04-spatial-relationships-joins5.py
EXERCISE: In which district in the Eiffel Tower located?
In previous exercise, we constructed a Point
geometry for its location, and we checked that it was not located in the Montparnasse district. Let's now determine in which of the districts of Paris it is located.
mask
.districts
dataframe with the boolean mask and print the result.contains()
method of the districts
GeoDataFrame.df[..]
.# Construct a point object for the Eiffel Tower
eiffel_tower = Point(648237.3, 6862271.9)
# %load _solutions/04-spatial-relationships-joins6.py
# %load _solutions/04-spatial-relationships-joins7.py
EXERCISE: How far is the closest bike station?
Now, we might be interested in the bike stations nearby the Eiffel Tower. To explore them, let's visualize the Eiffel Tower itself as well as the bikes stations within 1km.
To do this, we can calculate the distance to the Eiffel Tower for each of the stations. Based on this result, we can then create a mask that takes True
if the station is within 1km, and False
otherwise, and use it to filter the stations GeoDataFrame. Finally, we make a visualization of this subset.
dist_eiffel
.dist_eiffel
).stations
GeoDataFrame where the distance to the Eiffel Tower is less than 1 km (note that the distance is in meters). Call the result stations_eiffel
..distance()
method of a GeoDataFrame works element-wise: it calculates the distance between each geometry in the GeoDataFrame and the geometry passed to the method..min()
method to calculate the minimum value.s < 100
.# %load _solutions/04-spatial-relationships-joins8.py
# %load _solutions/04-spatial-relationships-joins9.py
# %load _solutions/04-spatial-relationships-joins10.py
# Make a plot of the close-by restaurants
ax = stations_eiffel.to_crs(epsg=3857).plot()
geopandas.GeoSeries([eiffel_tower], crs='EPSG:2154').to_crs(epsg=3857).plot(ax=ax, color='red')
import contextily
contextily.add_basemap(ax)
ax.set_axis_off()
In the previous section of this notebook, we could use the spatial relationship methods to check in which country a certain city was located. But what if we wanted to perform this same operation for every city and country? For example, we might want to know for each city in which country it is located.
In tabular jargon, this would imply adding a column to our cities dataframe with the name of the country in which it is located. Since country name is contained in the countries dataset, we need to combine - or "join" - information from both datasets. Joining on location (rather than on a shared column) is called a "spatial join".
So here we will do:
countries
and cities
dataframes, determine for each city the country in which it is located.Pandas provides functionality to join or merge dataframes in different ways, see https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/ for an overview and https://pandas.pydata.org/pandas-docs/stable/merging.html for the full documentation.
To illustrate the concept of joining the information of two dataframes with pandas, let's take a small subset of our cities
and countries
datasets:
cities2 = cities[cities['name'].isin(['Bern', 'Brussels', 'London', 'Paris'])].copy()
cities2['iso_a3'] = ['CHE', 'BEL', 'GBR', 'FRA']
cities2
countries2 = countries[['iso_a3', 'name', 'continent']]
countries2.head()
We added a 'iso_a3' column to the cities
dataset, indicating a code of the country of the city. This country code is also present in the countries
dataset, which allows us to merge those two dataframes based on the common column.
Joining the cities
dataframe with countries
will transfer extra information about the countries (the full name, the continent) to the cities
dataframe, based on a common key:
cities2.merge(countries2, on='iso_a3')
But for this illustrative example we added the common column manually, it is not present in the original dataset. However, we can still know how to join those two datasets based on their spatial coordinates.
In the previous section, we have seen the notion of spatial relationships between geometry objects: within, contains, intersects, ...
In this case, we know that each of the cities is located within one of the countries, or the other way around that each country can contain multiple cities.
We can test such relationships using the methods we have seen in the previous notebook:
france = countries.loc[countries['name'] == 'France', 'geometry'].squeeze()
cities.within(france)
The above gives us a boolean series, indicating for each point in our cities
dataframe whether it is located within the area of France or not.
Because this is a boolean series as result, we can use it to filter the original dataframe to only show those cities that are actually within France:
cities[cities.within(france)]
We could now repeat the above analysis for each of the countries, and add a column to the cities
dataframe indicating this country. However, that would be tedious to do manually, and is also exactly what the spatial join operation provides us.
(note: the above result is incorrect, but this is just because of the coarse-ness of the countries dataset)
SPATIAL JOIN = transferring attributes from one layer to another based on their spatial relationship
Different parts of this operations:
In this case, we want to join the cities
dataframe with the information of the countries
dataframe, based on the spatial relationship between both datasets.
We use the geopandas.sjoin
function:
joined = geopandas.sjoin(cities, countries, op='within', how='left')
joined
joined['continent'].value_counts()
We will again use the Paris datasets to do some exercises. Let's start importing them:
districts = geopandas.read_file("data/paris_districts.geojson").to_crs(epsg=2154)
stations = geopandas.read_file("data/paris_bike_stations.geojson").to_crs(epsg=2154)
EXERCISE:
joined
.geopandas.sjoin()
function takes as first argument the dataframe to which we want to add information, and as second argument the dataframe that contains this additional information.# %load _solutions/04-spatial-relationships-joins11.py
# %load _solutions/04-spatial-relationships-joins12.py
EXERCISE: Map of tree density by district (I)
Using a dataset of all trees in public spaces in Paris, the goal is to make a map of the tree density by district. We first need to find out how many trees each district contains, which we will do in this exercise. In the following exercise, we will use this result to calculate the density and create a map.
To obtain the tree count by district, we first need to know in which district each tree is located, which we can do with a spatial join. Then, using the result of the spatial join, we will calculate the number of trees located in each district using the pandas 'group-by' functionality.
"paris_trees.gpkg"
and call the result trees
. Also read the districts dataset we have seen previously ("paris_districts.geojson"
), and call this districts
. Convert the districts dataset to the same CRS as the trees dataset.'district_name'
to the trees dataset using a spatial join. Call the result joined
.geopandas.sjoin()
function.geopandas.sjoin()
takes as first argument the dataframe to which we want to add information, and as second argument the dataframe that contains this additional information.op
argument is used to specify which spatial relationship between both dataframes we want to use for joining (options are 'intersects'
, 'contains'
, 'within'
).# %load _solutions/04-spatial-relationships-joins13.py
# %load _solutions/04-spatial-relationships-joins14.py
# %load _solutions/04-spatial-relationships-joins15.py
EXERCISE: Map of tree density by district (II)
Calculate the number of trees located in each district: group the joined
DataFrame by the 'district_name'
column, and calculate the size of each group. Call the resulting Series trees_by_district
.
We then convert trees_by_district
to a DataFrame for the next exercise.
df.groupby('key').aggregation_method()
, substituting 'key' and 'aggregation_method' with the appropriate column name and method..size()
method.# %load _solutions/04-spatial-relationships-joins16.py
# %load _solutions/04-spatial-relationships-joins17.py
# %load _solutions/04-spatial-relationships-joins18.py
EXERCISE: Map of tree density by district (III)
Now we have obtained the number of trees by district, we can make the map of the districts colored by the tree density.
For this, we first need to merge the number of trees in each district we calculated in the previous step (trees_by_district
) back to the districts dataset. We will use the pd.merge()
function to join two dataframes based on a common column.
Since not all districts have the same size, we should compare the tree density for the visualisation: the number of trees relative to the area.
pd.merge()
function to merge districts
and trees_by_district
dataframes on the 'district_name'
column. Call the result districts_trees
.'n_trees_per_area'
to the districts_trees
dataframe, based on the 'n_trees'
column divided by the area.districts_trees
dataframe, using the 'n_trees_per_area'
column to determine the color of the polygons.pd.merge()
function takes the two dataframes you want to merge as the first two arguments.on
keyword.df['col']
, while adding a column to a DataFrame can be done with df['new_col'] = values
where values
can be the result of a computation.area
attribute. So considering a GeoDataFrame gdf
, then gdf.geometry.area
will return a Series with the area of each geometry..plot()
method of a GeoDataFrame to make a visualization of the geometries.column=
keyword.# %load _solutions/04-spatial-relationships-joins19.py
# %load _solutions/04-spatial-relationships-joins20.py
# %load _solutions/04-spatial-relationships-joins21.py