Wherefore art thou CitiBike?

Citibike is NYC's new(ish) and very popular bike share program. Citibike customers can take a bike from any station and leave it at any other station. It's natural to expect that after some time, certain areas of the city will end up with a high density of bikes, while other areas will have few. Moreover, this relative density is likely to fluctuate in both space and time.

The following is an attempt to visualize the approximate density of available bikes throughout the day, using Citibike's own system data available at: http://citibikenyc.com/system-data

Import Statements

In [1]:
# data handling and formatting
import pandas as pd
import numpy as np
import datetime as dt
from copy import copy, deepcopy

# plotting
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import matplotlib.figure as fig
import matplotlib.image as mpimg

# image creation / modification 
import scipy.ndimage as ndi
from scipy import misc
from matplotlib.colors import LinearSegmentedColormap

# mapping / shapefile stuff
from mpl_toolkits.basemap import Basemap
from shapelib import ShapeFile
import dbflib

Parameters

Set the path to the Citibike data file. This is the first available file on http://citibikenyc.com/system-data and spans all of July 2013, so not all of the current Citibike stations will be active in this file.

In [2]:
data_path = '2013-07-CitiBikeTripData.csv'

The parameter time_increment is the amount of time (in minutes) by which we step forward in each frame of the animation, time_window_size is the size (in minutes) of the time window we're looking at in each frame, and number_of_days is the number of days' worth of data that should be used in the visualization.

In [3]:
time_increment = 5 
time_window_size = 10
number_of_days = 3
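As a quick sanity check on the windowing arithmetic (using a hypothetical start time, not part of the notebook), the i-th frame covers the span from start + i*time_increment to start + i*time_increment + time_window_size, so consecutive frames overlap whenever time_window_size exceeds time_increment:

```python
import datetime as dt

time_increment = 5      # minutes stepped forward per frame
time_window_size = 10   # minutes of data shown in each frame
start = dt.datetime(2013, 7, 1, 0, 0)  # hypothetical start timestamp

def frame_window(i):
    """Return the (start, end) timestamps covered by frame i."""
    w0 = start + dt.timedelta(minutes=time_increment * i)
    w1 = w0 + dt.timedelta(minutes=time_window_size)
    return w0, w1

print(frame_window(0))  # first frame: midnight through 00:10
print(frame_window(3))  # frame 3: 00:15 through 00:25
```

With these values each 10-minute window shares 5 minutes with the previous one, which smooths the animation.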

The boundary lat/lon values of the map are chosen to give a decent bounding box around lower Manhattan and much of northwestern Brooklyn. This is the area we'll be looking at.

In [4]:
lon0 = -74.0417
lon1 = -73.8982
lat0 = 40.6851
lat1 = 40.7721

The following parameters are for use with a grid-density gaussian filter algorithm, where image_width and image_height refer to the width and height (in pixels) of a grid (2D array) that we will use to represent the geographic area we're looking at.

The gaussian_filter_radius parameter is the radius we'll be feeding into the scipy.ndimage.gaussian_filter() method.

In [5]:
image_width = 200
image_height = 200
gaussian_filter_radius = 3
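To see what the filter will do to the grid, here is a minimal sketch (a toy 9×9 array, not the real 200×200 grid): a single "event" pixel gets smeared into a smooth bump, with the spread controlled by the radius.

```python
import numpy as np
import scipy.ndimage as ndi

grid = np.zeros((9, 9))
grid[4, 4] = 1.0  # one event at the center of the grid

# smear the spike out with a gaussian of sigma=1 in each direction
blurred = ndi.gaussian_filter(grid, sigma=1)

print(blurred.sum())       # total "mass" is preserved
print(blurred[4, 4])       # the peak is now lower than the original spike
print(blurred[4, 5] > 0)   # neighboring pixels picked up some density
```

A larger radius spreads each event over more pixels, producing a smoother but more diffuse heatmap.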

Pandas

The data comes in the form of a large CSV file, so use Pandas' read_csv method to read the data.

In [6]:
# read in the undocking data
undocking_dataframe = pd.read_csv(data_path, usecols=[1,5,6], index_col=0, parse_dates=True)
# read in the docking data
docking_dataframe_unsorted = pd.read_csv(data_path, usecols=[2,9,10], index_col=0, parse_dates=True)
docking_dataframe = docking_dataframe_unsorted.sort_index()

We now have two dataframes, undocking_dataframe and docking_dataframe, each indexed by timestamp (sorted earliest to latest) and containing a latitude and a longitude. See below for an example.

In [7]:
undocking_dataframe.head()
Out[7]:
                     start station latitude  start station longitude
starttime
2013-07-01 00:00:00               40.753231               -73.970325
2013-07-01 00:00:02               40.749718               -74.002950
2013-07-01 00:01:04               40.730287               -73.990765
2013-07-01 00:01:06               40.718939               -73.992663
2013-07-01 00:01:10               40.734927               -73.992005

5 rows × 2 columns
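The animation loop later relies on two operations on this kind of time-indexed dataframe: DatetimeIndex.searchsorted to locate a window boundary, and label-based slicing to pull the rows inside that window. A minimal illustration (toy data, not the Citibike file):

```python
import pandas as pd

# six rows, one minute apart, indexed by timestamp like the Citibike dataframes
idx = pd.date_range('2013-07-01', periods=6, freq='1min')
df = pd.DataFrame({'lat': range(6)}, index=idx)

# position of the first timestamp at or after 00:02
pos = df.index.searchsorted(pd.Timestamp('2013-07-01 00:02'))
print(pos)  # 2

# label-based slicing on a sorted DatetimeIndex includes both endpoints
window = df[df.index[pos]:df.index[pos + 2]]
print(len(window))  # 3
```

Because searchsorted works on a sorted index, sorting docking_dataframe by its index above was a prerequisite, not just cosmetic.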

mpl_toolkits.basemap

Now we need a map on which to plot our geographic data. We use mpl_toolkits.basemap.Basemap to instantiate a map, using our latitude and longitude bounds.

In [8]:
m = Basemap(llcrnrlon=lon0, llcrnrlat=lat0, urcrnrlon=lon1, urcrnrlat=lat1, resolution='h', lat_0=lat0, lon_0=lon0)

To create the effect of a more detailed map, we add a background image to the basemap.

In [9]:
osm_image = misc.imread('map.png')
#plot it onto the map
m.imshow(osm_image, extent=[lon0, lon1, lat0, lat1], origin='upper', aspect='auto')
Out[9]:
<matplotlib.image.AxesImage at 0x10dbaa450>

Custom Colormap

Create a custom colormap for the m.imshow() method to use. This will display high values as red, low values as blue, and mid-range values as transparent.

In [10]:
bwr_cmap = plt.get_cmap('bwr')
cdict_new = bwr_cmap._segmentdata.copy()
cdict_new['alpha'] = ((0.0, 1.0, 1.0),
                      (0.6, 0.4, 0.0),
                      (1.0, 1.0, 1.0))
blue_red1 = LinearSegmentedColormap('BlueRed1', cdict_new)
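A quick way to verify the alpha channel behaves as intended is to sample the colormap directly. Sampling a colormap at a point in [0, 1] returns an RGBA tuple, so the fourth component shows the transparency (this sketch rebuilds the same colormap so it stands alone):

```python
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# rebuild the custom colormap: 'bwr' colors with a dip in alpha mid-range
bwr_cmap = plt.get_cmap('bwr')
cdict = bwr_cmap._segmentdata.copy()
cdict['alpha'] = ((0.0, 1.0, 1.0),
                  (0.6, 0.4, 0.0),
                  (1.0, 1.0, 1.0))
cmap = LinearSegmentedColormap('BlueRed1', cdict)

print(cmap(0.0)[3])  # fully opaque at the low (blue) end
print(cmap(1.0)[3])  # fully opaque at the high (red) end
print(cmap(0.6)[3])  # nearly transparent around the middle
```

Opaque extremes and a transparent middle mean only strongly positive (docking-heavy) and strongly negative (undocking-heavy) areas show up over the background map.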

Grid Density Gaussian Filter Methods

The method here is adapted from this StackOverflow question: http://stackoverflow.com/questions/6652671/efficient-method-of-calculating-density-of-irregularly-spaced-points

As a slight modification, the algorithm is broken into three methods:

  • grid_density_gaussian_filter_img applies a gaussian filter to the input image (which is a 2D array) using scipy.ndimage.gaussian_filter and returns the result.
  • iterate_image takes a 2D array (previous_img) and the docking/undocking data (positive_data and negative_data). The 'grid density' algorithm simply increments a pixel in previous_img for each corresponding point in positive_data, and decrements a pixel in previous_img for each corresponding point in negative_data.
  • decay_img is an optionally-used method which takes the current image array (current_img) and reduces the absolute value of every nonzero pixel in the array by 1. The resulting image is returned. This prevents areas of really high density from swamping out all the other areas.
In [11]:
# return the gaussian filtered data for a given current_img
def grid_density_gaussian_filter_img(current_img, radius):
    # apply a gaussian filter to the image and return the result
    return ndi.gaussian_filter(current_img, (radius, radius))

# go through and bring all image data closer to zero
def decay_img(current_img):
    for index, value in np.ndenumerate(current_img):
        if value > 0:
            current_img[index] -= 1
        elif value < 0:
            current_img[index] += 1
    return current_img

# add 'positive_data' to and subtract 'negative_data' from previous_img,
# returning a new image with those values
def iterate_image(previous_img, x0, y0, x1, y1, w, h, positive_data, negative_data):
    imgw = previous_img.shape[0]
    imgh = previous_img.shape[1]
    new_img = copy(previous_img)
    # scale factors from map coordinates to pixel indices
    kx = (imgw - 1) / (x1 - x0)
    ky = (imgh - 1) / (y1 - y0)
    for x, y in positive_data:
        ix = int((x - x0) * kx)
        iy = int((y - y0) * ky)
        if 0 <= ix < imgw and 0 <= iy < imgh:
            new_img[iy][ix] += 1
    for x, y in negative_data:
        ix = int((x - x0) * kx)
        iy = int((y - y0) * ky)
        if 0 <= ix < imgw and 0 <= iy < imgh:
            new_img[iy][ix] -= 1
    return new_img

Create An Array Of matplotlib.image.AxesImage Objects

To generate an animated heatmap of all the docking or undocking events in the CitiBike system, one approach is to generate an array of objects of the type matplotlib.image.AxesImage and then call Matplotlib's animation.ArtistAnimation method on that array to generate an animation. Below is the code for the hard part -- building the array of AxesImages from our data.

In [12]:
# find the earliest undocking timestamp in the database -- this will be the start time of the data used for the animation
initial_start_date_undocking = undocking_dataframe.index[0]

# ims will hold the series of images that make up the animation
ims = []

# determine the number of frames in the animation
num_frames = (number_of_days * 24 * 60) // time_increment
In [13]:
for i in range(num_frames):
    # find the indices of the next time window for undocking events
    next_start_date_undocking_index = undocking_dataframe.index.searchsorted(initial_start_date_undocking + dt.timedelta(minutes=time_increment*i))
    next_end_date_undocking_index = undocking_dataframe.index.searchsorted(initial_start_date_undocking + dt.timedelta(minutes=(time_window_size + time_increment*i)))
    # grab the next time window for undocking events
    next_start_date_undocking = undocking_dataframe.index[next_start_date_undocking_index]
    next_end_date_undocking = undocking_dataframe.index[next_end_date_undocking_index]

    # find the indices of the next time window for re-docking events
    next_start_date_docking_index = docking_dataframe.index.searchsorted(next_start_date_undocking)
    next_end_date_docking_index = docking_dataframe.index.searchsorted(next_end_date_undocking)
    # grab the next time window for docking events
    next_start_date_docking = docking_dataframe.index[next_start_date_docking_index]
    next_end_date_docking = docking_dataframe.index[next_end_date_docking_index]

    # get the next few undocking events
    next_data_undocking = undocking_dataframe[next_start_date_undocking:next_end_date_undocking]
    x_udata = next_data_undocking['start station longitude']
    y_udata = next_data_undocking['start station latitude']
    x_undocking, y_undocking = m(x_udata, y_udata)

    # get the next few docking events
    next_data_docking = docking_dataframe[next_start_date_docking:next_end_date_docking]
    x_ddata = next_data_docking['end station longitude']
    y_ddata = next_data_docking['end station latitude']
    x_docking, y_docking = m(x_ddata, y_ddata)

    # get the new heatmap image
    if i == 0:
        new_heatmap_image = np.zeros((image_width, image_height))
    else:
        new_heatmap_image = iterate_image(heatmap_image, lon0, lat0, lon1, lat1, image_width, image_height, zip(x_docking, y_docking), zip(x_undocking, y_undocking))

    # periodically decay the image so high-density areas don't saturate
    if i % 100 == 0:
        new_heatmap_image = decay_img(new_heatmap_image)

    # generate the z dimension with the newly iterated image
    z_dimension = grid_density_gaussian_filter_img(new_heatmap_image, gaussian_filter_radius)

    # reset the heatmap_image variable
    heatmap_image = new_heatmap_image

    # produce the im from this data
    heatmap_im = m.imshow(z_dimension, origin='lower', extent=[lon0, lon1, lat0, lat1], alpha=.4, vmin=-1, vmax=1, cmap=blue_red1, aspect='auto')
    # build the text string for displaying time/date
    current_date = initial_start_date_undocking + dt.timedelta(minutes=time_increment*i)
    current_date_string = current_date.strftime('%I:%M%p on %a %b %d')
    current_date_plotted = plt.text(-73.9509, 40.6909, current_date_string, fontsize=14)
    # append all Artist instances to the array of drawn images
    ims.append([current_date_plotted, heatmap_im])

Animate the Images

In [14]:
# set the figure size
plt.gcf().set_size_inches(15.0, 12.0)
plt.gcf().subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=None, hspace=None)

# run the animation	
anim = animation.ArtistAnimation(plt.gcf(), ims, interval=50, blit=True, repeat=True)
In [15]:
# display the animation
plt.show()