Introduction to geospatial vector data in Python
DS Python for GIS and Geoscience
October, 2020© 2020, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
%matplotlib inline
import pandas as pd
import geopandas
Geospatial data is often available from specific GIS file formats or data stores, like ESRI shapefiles, GeoJSON files, geopackage files, PostGIS (PostgreSQL) database, ...
We can use the GeoPandas library to read many of those GIS file formats (relying on the fiona
library under the hood, which is an interface to GDAL/OGR), using the geopandas.read_file
function.
For example, let's start by reading a shapefile with all the countries of the world (adapted from http://www.naturalearthdata.com/downloads/110m-cultural-vectors/110m-admin-0-countries/, zip file is available in the /data
directory), and inspect the data:
countries = geopandas.read_file("zip://./data/ne_110m_admin_0_countries.zip")
# or if the archive is unpacked:
# countries = geopandas.read_file("data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp")
countries.head()
iso_a3 | name | continent | pop_est | gdp_md_est | geometry | |
---|---|---|---|---|---|---|
0 | AFG | Afghanistan | Asia | 34124811.0 | 64080.0 | POLYGON ((61.21082 35.65007, 62.23065 35.27066... |
1 | AGO | Angola | Africa | 29310273.0 | 189000.0 | MULTIPOLYGON (((23.90415 -11.72228, 24.07991 -... |
2 | ALB | Albania | Europe | 3047987.0 | 33900.0 | POLYGON ((21.02004 40.84273, 20.99999 40.58000... |
3 | ARE | United Arab Emirates | Asia | 6072475.0 | 667200.0 | POLYGON ((51.57952 24.24550, 51.75744 24.29407... |
4 | ARG | Argentina | South America | 44293293.0 | 879400.0 | MULTIPOLYGON (((-66.95992 -54.89681, -67.56244... |
countries.plot()
<AxesSubplot:>
What do we observe:
.head()
we can see the first rows of the dataset, just like we can do with Pandas.geometry
column and the different countries are represented as polygons.plot()
method to quickly get a basic visualization of the dataWe used the GeoPandas library to read in the geospatial data, and this returned us a GeoDataFrame
:
type(countries)
geopandas.geodataframe.GeoDataFrame
A GeoDataFrame contains a tabular, geospatial dataset:
Such a GeoDataFrame
is just like a pandas DataFrame
, but with some additional functionality for working with geospatial data:
.geometry
attribute that always returns the column with the geometry information (returning a GeoSeries). The column name itself does not necessarily need to be 'geometry', but it will always be accessible as the .geometry
attribute.countries.geometry
0 POLYGON ((61.21082 35.65007, 62.23065 35.27066... 1 MULTIPOLYGON (((23.90415 -11.72228, 24.07991 -... 2 POLYGON ((21.02004 40.84273, 20.99999 40.58000... 3 POLYGON ((51.57952 24.24550, 51.75744 24.29407... 4 MULTIPOLYGON (((-66.95992 -54.89681, -67.56244... ... 172 MULTIPOLYGON (((167.84488 -16.46633, 167.51518... 173 POLYGON ((52.00001 19.00000, 52.78218 17.34974... 174 POLYGON ((19.89577 -24.76779, 20.16573 -24.917... 175 POLYGON ((23.21505 -17.52312, 22.56248 -16.898... 176 POLYGON ((29.43219 -22.09131, 28.79466 -21.639... Name: geometry, Length: 177, dtype: geometry
type(countries.geometry)
geopandas.geoseries.GeoSeries
countries.geometry.area
<ipython-input-8-68baff36b7de>:1: UserWarning: Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation. countries.geometry.area
0 63.593500 1 103.599439 2 3.185163 3 7.095047 4 278.923392 ... 172 0.631326 173 38.475618 174 112.718524 175 62.789498 176 32.280371 Length: 177, dtype: float64
It's still a DataFrame, so we have all the Pandas functionality available to use on the geospatial dataset, and to do data manipulations with the attributes and geometry information together.
For example, we can calculate average population number over all countries (by accessing the 'pop_est' column, and calling the mean
method on it):
countries['pop_est'].mean()
41712369.84180791
Or, we can use boolean filtering to select a subset of the dataframe based on a condition:
africa = countries[countries['continent'] == 'Africa']
africa.plot();
REMEMBER:
GeoDataFrame
allows to perform typical tabular data analysis together with spatial operationsGeoDataFrame
(or Feature Collection) consists of:Spatial vector data can consist of different types, and the 3 fundamental types are:
And each of them can also be combined in multi-part geometries (See https://shapely.readthedocs.io/en/stable/manual.html#geometric-objects for extensive overview).
For the example we have seen up to now, the individual geometry objects are Polygons:
print(countries.geometry[2])
POLYGON ((21.0200403174764 40.84272695572588, 20.99998986174722 40.58000397395401, 20.67499677906363 40.43499990494303, 20.61500044117275 40.11000682225935, 20.15001590341052 39.62499766698397, 19.98000044117015 39.69499339452341, 19.96000166187321 39.91500580500605, 19.40608198413673 40.25077342382247, 19.31905887215714 40.72723012955356, 19.40354983895429 41.40956574153546, 19.54002729663711 41.71998607031276, 19.37176883309496 41.87754751237065, 19.37176816334725 41.8775506797835, 19.30448611825079 42.19574514420782, 19.73805138517963 42.68824738216557, 19.80161339689869 42.50009349219084, 20.07070000000004 42.58863000000008, 20.28375451018189 42.32025950781508, 20.52295000000004 42.21787000000006, 20.59024654668023 41.85540891928363, 20.59024743010491 41.85540416113361, 20.4631750830992 41.51508901627534, 20.60518191903736 41.08622630468523, 21.0200403174764 40.84272695572588))
Let's import some other datasets with different types of geometry objects.
A dateset about cities in the world (adapted from http://www.naturalearthdata.com/downloads/110m-cultural-vectors/110m-populated-places/, zip file is available in the /data
directory), consisting of Point data:
cities = geopandas.read_file("zip://./data/ne_110m_populated_places.zip")
print(cities.geometry[0])
POINT (12.45338654497177 41.90328217996012)
And a dataset of rivers in the world (from http://www.naturalearthdata.com/downloads/50m-physical-vectors/50m-rivers-lake-centerlines/, zip file is available in the /data
directory) where each river is a (multi-)line:
rivers = geopandas.read_file("zip://./data/ne_50m_rivers_lake_centerlines.zip")
print(rivers.geometry[0])
LINESTRING (51.9371337598152 55.70106609892139, 51.88086646731369 55.68625891701544, 51.82031249962222 55.69745514553858, 51.7476018274624 55.69366250841807, 51.6628417966117 55.60817291874525, 51.57871093775964 55.59943268477065, 51.51342773400279 55.58312409100404, 51.50854492161091 55.52948639548083, 51.48583984403365 55.49640534033426, 51.36914062543957 55.46796295772435, 51.21306254869774 55.50264985760492, 51.13452148447897 55.48273346527725, 51.07934570274205 55.46759674659262, 50.98022460947817 55.46637604371949, 50.83445217522774 55.45630956063775, 50.6883789060617 55.42011139502489, 50.4118652342932 55.40119049644431, 50.07802734358711 55.38112213757665, 49.82216796867687 55.33466217681809, 49.53222656260584 55.260614325191, 49.38232421848795 55.17182037990665, 49.24808475131027 55.11301870345045)
type(countries.geometry[0])
shapely.geometry.polygon.Polygon
To construct one ourselves:
from shapely.geometry import Point, Polygon, LineString
p = Point(0, 0)
print(p)
POINT (0 0)
polygon = Polygon([(1, 1), (2,2), (2, 1)])
polygon.area
0.5
polygon.distance(p)
1.4142135623730951
REMEMBER:
Single geometries are represented by shapely
objects:
single_shapely_object.distance(other_point)
-> distance between two pointsgeodataframe.distance(other_point)
-> distance for each point in the geodataframe to the other pointax = countries.plot(edgecolor='k', facecolor='none', figsize=(15, 10))
rivers.plot(ax=ax)
cities.plot(ax=ax, color='red')
ax.set(xlim=(-20, 60), ylim=(-40, 40))
[(-20.0, 60.0), (-40.0, 40.0)]
See the visualization-02-geopandas.ipynb notebook for more details on visualizing geospatial datasets.
Throughout the exercises in this course, we will work with several datasets about the city of Paris.
Here, we start with the following datasets:
paris_districts_utm.geojson
data/paris_bike_stations_mercator.gpkg
Both datasets are provided as spatial datasets using a GIS file format.
Let's explore further those datasets, now using the spatial aspect as well.
EXERCISE:
We will start with exploring the bicycle station dataset (available as a GeoPackage file: data/paris_bike_stations_mercator.gpkg
)
stations
.type(..)
to check any Python object type.shape
attribute to get the number of featuresstations = geopandas.read_file("data/paris_bike_stations_mercator.gpkg")
type(stations)
geopandas.geodataframe.GeoDataFrame
stations.head()
name | bike_stands | available_bikes | geometry | |
---|---|---|---|---|
0 | 14002 - RASPAIL QUINET | 44 | 4 | POINT (259324.887 6247620.771) |
1 | 20503 - COURS DE VINCENNES PYRÉNÉES | 21 | 3 | POINT (267824.377 6249062.894) |
2 | 20011 - PYRÉNÉES-DAGORNO | 21 | 0 | POINT (267742.135 6250378.469) |
3 | 31008 - VINCENNES (MONTREUIL) | 56 | 0 | POINT (271326.638 6250750.824) |
4 | 43006 - MINIMES (VINCENNES) | 28 | 27 | POINT (270594.689 6248007.705) |
stations.shape
(1226, 4)
EXERCISE:
stations
dataset.plot
method accepts a figsize
keyword).stations.plot(figsize=(12,6))
<AxesSubplot:>
A plot with just some points can be hard to interpret without any spatial context. Therefore, in the next exercise we will learn how to add a background map.
We are going to make use of the contextily package. The add_basemap()
function of this package makes it easy to add a background web map to our plot. We begin by plotting our data first, and then pass the matplotlib axes object (returned by dataframe's plot()
method) to the add_basemap()
function. contextily
will then download the web tiles needed for the geographical extent of your plot.
EXERCISE:
contextily
.stations
, but assign the result to an ax
variable.markersize
keyword of the plot()
method for this).add_basemap()
function of contextily
to add a background map: the first argument is the matplotlib axes object ax
.import contextily
ax = stations.plot(figsize=(12,6), markersize=5)
contextily.add_basemap(ax)
EXERCISE:
df['col_name']
hist()
method to plot a histogram of its values.stations['bike_stands'].hist()
<AxesSubplot:>
EXERCISE:
Let's now visualize where the available bikes are actually stationed:
stations
dataset (also with a (12, 6) figsize).'available_bikes'
columns to determine the color of the points. For this, use the column=
keyword.legend=True
keyword to show a color bar.stations.plot(figsize=(12, 6), column='available_bikes', legend=True)
<AxesSubplot:>
EXERCISE:
Next, we will explore the dataset on the administrative districts of Paris (available as a GeoJSON file: "data/paris_districts_utm.geojson")
districts
..shape
attribute)districts
dataset (set the figure size to (12, 6)).districts = geopandas.read_file("data/paris_districts_utm.geojson")
districts.head()
id | district_name | population | geometry | |
---|---|---|---|---|
0 | 1 | St-Germain-l'Auxerrois | 1672 | POLYGON ((451922.133 5411438.484, 451922.080 5... |
1 | 2 | Halles | 8984 | POLYGON ((452278.419 5412160.893, 452192.407 5... |
2 | 3 | Palais-Royal | 3195 | POLYGON ((451553.806 5412340.522, 451528.058 5... |
3 | 4 | Place-Vendôme | 3044 | POLYGON ((451004.908 5412654.095, 450960.640 5... |
4 | 5 | Gaillon | 1345 | POLYGON ((451328.752 5412991.278, 451294.721 5... |
districts.shape
(80, 4)
districts.plot(figsize=(12, 6))
<AxesSubplot:>
EXERCISE:
What are the largest districts (biggest area)?
districts
dataframe.df['new_col'] = values
sort_values()
method, specifying the colum to sort on with the by='col_name'
keyword. Check the help of this method to see how to sort ascending or descending.districts.geometry.area
0 8.685379e+05 1 4.122371e+05 2 2.735494e+05 3 2.693111e+05 4 1.879097e+05 ... 75 1.294254e+06 76 8.061191e+05 77 1.486139e+06 78 1.598127e+06 79 2.089783e+06 Length: 80, dtype: float64
# dividing by 10^6 for showing km²
districts['area'] = districts.geometry.area / 1e6
districts.sort_values(by='area', ascending=False)
id | district_name | population | geometry | area | |
---|---|---|---|---|---|
45 | 46 | Picpus | 62947 | POLYGON ((456790.759 5408686.978, 456841.941 5... | 7.201383 |
60 | 61 | Auteuil | 67967 | POLYGON ((444930.499 5411923.067, 444957.444 5... | 6.380679 |
44 | 45 | Bel-Air | 33976 | POLYGON ((456987.121 5409120.599, 456996.502 5... | 5.967841 |
61 | 62 | Muette | 45214 | POLYGON ((444686.860 5413985.234, 445358.893 5... | 5.475037 |
62 | 63 | Porte-Dauphine | 27423 | POLYGON ((446548.869 5414236.010, 447025.036 5... | 3.085061 |
... | ... | ... | ... | ... | ... |
9 | 10 | Enfants-Rouges | 8562 | POLYGON ((453580.220 5412266.849, 453591.609 5... | 0.271603 |
3 | 4 | Place-Vendôme | 3044 | POLYGON ((451004.908 5412654.095, 450960.640 5... | 0.269311 |
5 | 6 | Vivienne | 2917 | POLYGON ((451686.936 5412747.032, 451682.879 5... | 0.243418 |
11 | 12 | Sainte-Avoie | 7501 | POLYGON ((452928.277 5412227.550, 452830.786 5... | 0.213201 |
4 | 5 | Gaillon | 1345 | POLYGON ((451328.752 5412991.278, 451294.721 5... | 0.187910 |
80 rows × 5 columns
EXERCISE:
'population_density'
representing the number of inhabitants per squared kilometer (Note: The area is given in squared meter, so you will need to multiply the result with 10**6
).'population_density'
to color the polygons. For this, use the column=
keyword.legend=True
keyword to show a color bar.# Add a population density column
districts['population_density'] = districts['population'] / districts.geometry.area * 10**6
# Make a plot of the districts colored by the population density
districts.plot(column='population_density', figsize=(12, 6), legend=True)
<AxesSubplot:>
# As comparison, the misleading plot when not turning the population number into a density
districts.plot(column='population', figsize=(12, 6), legend=True)
<AxesSubplot:>
fiona
¶Under the hood, GeoPandas uses the Fiona library (pythonic interface to GDAL/OGR) to read and write data. GeoPandas provides a more user-friendly wrapper, which is sufficient for most use cases. But sometimes you want more control, and in that case, to read a file with fiona you can do the following:
import fiona
from shapely.geometry import shape
with fiona.Env():
with fiona.open("zip://./data/ne_110m_admin_0_countries.zip") as collection:
for feature in collection:
# ... do something with geometry
geom = shape(feature['geometry'])
# ... do something with properties
print(feature['properties']['name'])
Afghanistan Angola Albania United Arab Emirates Argentina Armenia Antarctica Fr. S. Antarctic Lands Australia Austria Azerbaijan Burundi Belgium Benin Burkina Faso Bangladesh Bulgaria Bahamas Bosnia and Herz. Belarus Belize Bolivia Brazil Brunei Bhutan Botswana Central African Rep. Canada Switzerland Chile China Côte d'Ivoire Cameroon Dem. Rep. Congo Congo Colombia Costa Rica Cuba N. Cyprus Cyprus Czechia Germany Djibouti Denmark Dominican Rep. Algeria Ecuador Egypt Eritrea Spain Estonia Ethiopia Finland Fiji Falkland Is. France Gabon United Kingdom Georgia Ghana Guinea Gambia Guinea-Bissau Eq. Guinea Greece Greenland Guatemala Guyana Honduras Croatia Haiti Hungary Indonesia India Ireland Iran Iraq Iceland Israel Italy Jamaica Jordan Japan Kazakhstan Kenya Kyrgyzstan Cambodia South Korea Kosovo Kuwait Laos Lebanon Liberia Libya Sri Lanka Lesotho Lithuania Luxembourg Latvia Morocco Moldova Madagascar Mexico Macedonia Mali Myanmar Montenegro Mongolia Mozambique Mauritania Malawi Malaysia Namibia New Caledonia Niger Nigeria Nicaragua Netherlands Norway Nepal New Zealand Oman Pakistan Panama Peru Philippines Papua New Guinea Poland Puerto Rico North Korea Portugal Paraguay Palestine Qatar Romania Russia Rwanda W. Sahara Saudi Arabia Sudan S. Sudan Senegal Solomon Is. Sierra Leone El Salvador Somaliland Somalia Serbia Suriname Slovakia Slovenia Sweden Swaziland Syria Chad Togo Thailand Tajikistan Turkmenistan Timor-Leste Trinidad and Tobago Tunisia Turkey Taiwan Tanzania Uganda Ukraine Uruguay United States of America Uzbekistan Venezuela Vietnam Vanuatu Yemen South Africa Zambia Zimbabwe
geopandas.GeoDataFrame({
'geometry': [Point(1, 1), Point(2, 2)],
'attribute1': [1, 2],
'attribute2': [0.1, 0.2]})
geometry | attribute1 | attribute2 | |
---|---|---|---|
0 | POINT (1.00000 1.00000) | 1 | 0.1 |
1 | POINT (2.00000 2.00000) | 2 | 0.2 |
For example, if you have lat/lon coordinates in two columns:
df = pd.DataFrame(
{'City': ['Buenos Aires', 'Brasilia', 'Santiago', 'Bogota', 'Caracas'],
'Country': ['Argentina', 'Brazil', 'Chile', 'Colombia', 'Venezuela'],
'Latitude': [-34.58, -15.78, -33.45, 4.60, 10.48],
'Longitude': [-58.66, -47.91, -70.66, -74.08, -66.86]})
gdf = geopandas.GeoDataFrame(
df, geometry=geopandas.points_from_xy(df.Longitude, df.Latitude))
gdf
City | Country | Latitude | Longitude | geometry | |
---|---|---|---|---|---|
0 | Buenos Aires | Argentina | -34.58 | -58.66 | POINT (-58.66000 -34.58000) |
1 | Brasilia | Brazil | -15.78 | -47.91 | POINT (-47.91000 -15.78000) |
2 | Santiago | Chile | -33.45 | -70.66 | POINT (-70.66000 -33.45000) |
3 | Bogota | Colombia | 4.60 | -74.08 | POINT (-74.08000 4.60000) |
4 | Caracas | Venezuela | 10.48 | -66.86 | POINT (-66.86000 10.48000) |
See http://geopandas.readthedocs.io/en/latest/gallery/create_geopandas_from_pandas.html for full example