In a previous post, we looked at GeoViews
as a convenient and powerful Python library for visualizing geodata. We saw that it can plot tens of thousands of points on a map while remaining fully interactive. In this post, we will explore another library that is part of the HoloViz initiative: Datashader. It can visualize truly large datasets thanks to an optimized rendering pipeline, and HoloViews' support for Datashader makes plotting millions of data points easy while maintaining interactivity. In the following, we will again use our Cologne bike rental data to demonstrate Datashader's abilities. Get the data here and follow along.
Our dataset contains all KVB bike locations for the whole year 2019. That's a lot of data to display at once, but this is where Datashader's strength comes into play. Other interactive libraries, e.g. Bokeh or Plotly, embed the data as JSON in an .html
file and let the browser do all the work. In contrast, Datashader renders an image containing the processed data at the appropriate resolution. Consequently, all the hard work (the computation) is done beforehand.
In the first step, we load the libraries we'll be using:
# coding: utf-8
import pandas as pd
import numpy as np
from bokeh.io import output_notebook
output_notebook()
pd.set_option('display.max_columns', 100)
from holoviews.operation.datashader import datashade
import geoviews as gv
import cartopy.crs as ccrs
from colorcet import fire
In addition to the standard stack, we make use of a few additional libraries to deal with geospatial data and the visualization process. Cartopy
lets us deal with Coordinate Reference Systems (CRS). Colorcet
is also part of HoloViz and offers perceptually uniform colormaps. Finally, GeoViews and HoloViews handle the geospatial data and its plotting.
As before, we can now start by reading in a shapefile containing the borders of the districts of Cologne. You can get it here. For that, we use the geopandas
library:
import geopandas as gpd
districts = gpd.read_file('./data/Stadtteil_WGS84.shp',
encoding='utf8')
districts['area_km2'] = districts['SHAPE_AREA'] / 1000 / 1000
districts.rename(columns={'STT_NAME': 'district'}, inplace=True)
districts.head()
| | STT_NR | district | SHAPE_AREA | SHAPE_LEN | geometry | area_km2 |
|---|---|---|---|---|---|---|
| 0 | 909 | Flittard | 7.735915e+06 | 14368.509990 | POLYGON ((7.01364 51.02217, 7.01349 51.02200, ... | 7.735915 |
| 1 | 905 | Dellbrück | 9.946585e+06 | 16722.887925 | POLYGON ((7.06747 50.99363, 7.06758 50.99357, ... | 9.946585 |
| 2 | 807 | Brück | 7.499469e+06 | 12759.921710 | POLYGON ((7.07104 50.95452, 7.07138 50.95388, ... | 7.499469 |
| 3 | 707 | Urbach | 2.291919e+06 | 7008.335906 | POLYGON ((7.09195 50.88922, 7.09249 50.88921, ... | 2.291919 |
| 4 | 501 | Nippes | 2.995158e+06 | 8434.060948 | POLYGON ((6.95471 50.97357, 6.95475 50.97332, ... | 2.995158 |
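The `area_km2` column simply rescales `SHAPE_AREA` from square metres to square kilometres (dividing by 1000 twice, i.e. by 10⁶). As a quick sanity check, we can reproduce the conversion with plain pandas on the values shown in the table above:

```python
import pandas as pd

# District names and SHAPE_AREA values copied from the output above
districts = pd.DataFrame({
    'district': ['Flittard', 'Dellbrück', 'Brück', 'Urbach', 'Nippes'],
    'SHAPE_AREA': [7.735915e+06, 9.946585e+06, 7.499469e+06,
                   2.291919e+06, 2.995158e+06],
})
districts['area_km2'] = districts['SHAPE_AREA'] / 1e6  # m² → km²
print(districts.sort_values('area_km2', ascending=False).head(2))
```

Dellbrück comes out as the largest of the five districts shown, at just under 10 km².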
Next, we process our bike location data. After reading in the CSV, we drop all points too far from the median location in order to exclude GPS outliers:
!wget -O bikes_2019.zip https://query.data.world/s/jqqvvxevfxwyialwsgb6v2acqmv6zg
!unzip bikes_2019.zip
# Read in the bike location data
df = pd.read_csv("./2019_bikes.csv")
df.drop(columns=df.columns[0], inplace=True)
print(df.shape)
(1509491, 6)
# Exclude location outliers
df = df[abs(df['lat'] - df['lat'].median()) < 0.08]
df = df[abs(df['lon'] - df['lon'].median()) < 0.08]
df.head(2)
| | bike_id | scrape_weekday | u_id | lat | lon | scrape_datetime |
|---|---|---|---|---|---|---|
| 0 | 21664 | Tue | 34183 | 50.936540 | 6.948328 | 2019-01-01 00:05:02 |
| 1 | 21619 | Tue | 672578 | 50.953946 | 6.912720 | 2019-01-01 00:05:02 |
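The 0.08° window used above is a crude but effective distance cutoff. At Cologne's latitude it corresponds to roughly 9 km north–south and about 5.6 km east–west, which a quick back-of-the-envelope calculation confirms (the median latitude here is an approximation):

```python
import math

lat_median = 50.94   # approximate median latitude of the bike locations
deg = 0.08

km_per_deg_lat = 111.32                                   # near-constant
km_per_deg_lon = 111.32 * math.cos(math.radians(lat_median))

print(f"{deg}° latitude  ≈ {deg * km_per_deg_lat:.1f} km")
print(f"{deg}° longitude ≈ {deg * km_per_deg_lon:.1f} km")
```

Anything outside this box around the median is almost certainly a mis-reported GPS position rather than a bike parked in Cologne.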
So we have about 1.5 million locations to deal with. How hard is it to plot all of them on a map of the city?
# Overlay the district polygons and datashaded bike locations on a dark tile map
polys = gv.Polygons(districts, vdims=['district', 'area_km2'], crs=ccrs.PlateCarree())
points = gv.Points(df, ['lon', 'lat'], crs=ccrs.PlateCarree())
plot = gv.tile_sources.CartoDark()\
* datashade(points, expand=False, height=2000, width=2000,
cmap=fire, normalization='eq_hist')\
* polys.opts(alpha=0.1, color='white', tools=['hover'])
plot.opts(width=700, height=600, bgcolor='black')
#gv.save(plot, "./img/datashader/map_cologne1.html")