The Metropolitan Transportation Authority (MTA) operates the New York City Subway. On its website, the MTA publishes data from the turnstiles in its subway stations. For each turnstile, passenger entries into and exits out of the subway station are logged cumulatively over four-hour periods: each turnstile has an entry and an exit counter, and the data essentially provide the counter values every four hours.
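To make this concrete, here is a minimal sketch (with made-up counter values) of how cumulative counter readings translate into per-interval entry counts:
import numpy as np
# Hypothetical cumulative entry-counter readings, four hours apart
counter_values = np.array([1000, 1210, 1580, 1615])
# The difference between consecutive readings gives the entries per interval
entries_per_interval = np.diff(counter_values)
print(entries_per_interval)  # [210 370  35]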
In this notebook, we will first explore and prepare the turnstile data. Then we will determine the busiest stations and characterize stations as commuter origins or commuter destinations. Thereafter, we will explore the evolution of ridership over the course of the day and the year. Finally, we will build a linear regression model that reproduces the daily ridership.
First, some imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from mpl_toolkits.basemap import Basemap
The data comes in individual files, each containing one week of turnstile data. For now, we will explore one exemplary file from the week preceding 3/24/18.
data = pd.read_csv('turnstile_180324.txt')
# Fixes read-in problem of last column name:
data = data.rename(columns = {data.columns[-1]:'EXITS'})
# Remove irrelevant columns
data = data.drop(['DIVISION','DESC','LINENAME'], axis=1)
data.shape[0]
There are 197,149 samples in this data set.
data.head()
The columns C/A, UNIT, and SCP denote identifiers for a specific turnstile. STATION is the name of the subway station. ENTRIES and EXITS are the accumulated counter values of the turnstile (not the number of entries in the given time interval).
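As a quick sanity check (a sketch, assuming the data set loaded above), we can count the distinct combinations of the three identifier columns, each of which should correspond to one physical turnstile:
# Count distinct (C/A, UNIT, SCP) combinations, i.e. individual turnstiles
data.groupby(['C/A', 'UNIT', 'SCP']).ngroups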
We will create additional columns to make data evaluation easier:
# Create timestamp column
data['timestamp'] = pd.to_datetime(data['DATE'] + ' ' + data['TIME'])
# Create column with the length of the interval since the previous row
data['interval'] = data['timestamp'] - data['timestamp'].shift(1)
# Calculate the number of entries/exits in the preceding interval
# (for the first row of each turnstile these differences pair with the
# previous turnstile's last row; such artifacts are removed below)
data['ENTRY_DIFF'] = data['ENTRIES'] - data['ENTRIES'].shift(1)
data['EXIT_DIFF'] = data['EXITS'] - data['EXITS'].shift(1)
data.head()
In order to better understand the structure of the data set, we can look at histograms for the time at which turnstile counter data are reported and the length of the reporting intervals.
fig, ax = plt.subplots(1,2, figsize=(12,4), gridspec_kw={'wspace': 0.3})
# Plot histogram of reporting times
ax[0].hist(data['TIME'].apply(pd.Timedelta) / pd.Timedelta('1 hour'),
bins=48)
ax[0].set_title('Reporting Times')
ax[0].set_xlabel('Hour of Day')
ax[0].set_xticks(range(24)[::2])
ax[0].set_xlim(0,24)
ax[0].set_ylabel('Samples in Data Set')
# Plot histogram of interval lengths
ax[1].hist(data['interval'].dropna() / pd.Timedelta('1 hour'), bins=100)
ax[1].set_title('Reporting Intervals')
ax[1].set_xlabel('Interval Length (Hours)')
ax[1].set_ylabel('Samples in Data Set')
plt.show()
Most counters report in four-hour intervals, but on two separate schedules: a little more than half of the reporting times fall at 0, 4, 8, 12, 16, and 20 hours (in 24-hour format), and most of the rest at 1, 5, 9, 13, 17, and 21 hours.
There are a few samples with other reporting times, but we will not take those into account. The small number of reporting intervals of -168 hours is an artifact of the transition between turnstiles: the first sample of one turnstile is paired with the last sample of the previous one, roughly one week (168 hours) earlier. We will also eliminate those samples.
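One way to confirm the two reporting schedules is to tabulate the most common reporting times (a sketch; the exact counts vary from file to file):
# The twelve most common reporting times should cover both four-hour schedules
data['TIME'].value_counts().head(12)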
In the following, we remove these anomalous samples:
# Flag the first entry of each turnstile (rows where the SCP changes;
# turnstile boundaries that share an SCP are caught by the interval filter)
data['NEW_SCP'] = data['SCP'] != data['SCP'].shift(1)
data = data[ (data['NEW_SCP']==False)
             # Remove negative numbers of entries/exits
             & (data['ENTRY_DIFF']>=0) & (data['EXIT_DIFF']>=0)
             # Remove entry/exit rates that are implausibly high
             & (data['ENTRY_DIFF']<4000) & (data['EXIT_DIFF']<4000)
             # Remove intervals that are not around 4 hours long
             & (data['interval']/pd.Timedelta('1 hour')>3.9)
             & (data['interval']/pd.Timedelta('1 hour')<4.5)
           ].drop('NEW_SCP', axis=1)
data.shape[0]
After eliminating these anomalous samples, we are left with 189,301 samples, which is about 96% of the original data set.
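As a quick arithmetic check of that percentage:
# Fraction of samples retained after filtering
189301 / 197149  # approximately 0.96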
For the remaining, realistic entry/exit data, we can plot a histogram of the number of entries and exits per time interval:
plt.hist(data['ENTRY_DIFF'], bins=100, alpha=0.5, label='Entries')
plt.hist(data['EXIT_DIFF'], bins=100, alpha=0.5, label='Exits')
plt.yscale('log', nonposy='clip')
plt.xlabel('Entries/Exits in a Time Interval')
plt.ylabel('Samples in Data Set')
plt.legend()
plt.show()
We now want to find out which stations are the busiest. For this purpose, we will look at both entries and exits at each station.
# Group data by station
total_riders = data.groupby(['STATION'], as_index=False) \
[['ENTRY_DIFF','EXIT_DIFF']].sum()
# Columns for sum and difference of entries and exits
total_riders['TOTAL_DIFF'] = \
total_riders['ENTRY_DIFF'] + total_riders['EXIT_DIFF']
total_riders['ENTRY_EXIT_DEFICIT'] = \
total_riders['ENTRY_DIFF'] - total_riders['EXIT_DIFF']
The total_riders DataFrame contains the summed-up entries and exits for each station, as well as their sum and difference. By sorting the DataFrame we can immediately identify the busiest stations: 34 St./Penn Station is the busiest, with 1.7 million turnstile crossings in this week in March 2018.
total_riders.sort_values(by=['TOTAL_DIFF'], ascending=False).head()
We now want to visualize the weekly ridership for each station on a map. The file stations_conversion.csv contains the geographic coordinates of each station. We will load these data and merge them with the total_riders DataFrame.
stations = pd.read_csv('stations_conversion.csv')
total_riders = pd.merge(total_riders, stations, on='STATION', how='inner')
total_riders.head()
Now we can plot the stations as markers on a Basemap map, with the size of each marker corresponding to the total ridership of the station. The data are displayed on a satellite image retrieved using arcgisimage.
plt.figure(figsize=(8, 8))
# Create map with basemap
m = Basemap(projection='cyl', resolution='i',
llcrnrlat = total_riders['GTFS Latitude'].min(),
llcrnrlon = total_riders['GTFS Longitude'].min(),
urcrnrlat = total_riders['GTFS Latitude'].max(),
urcrnrlon = total_riders['GTFS Longitude'].max())
# Load satellite image
m.arcgisimage(service='ESRI_Imagery_World_2D',
xpixels=1500, verbose=True)
# Draw stations with marker size according to ridership
for _, row in total_riders.iterrows():
    x, y = m(row['GTFS Longitude'], row['GTFS Latitude'])
    size = row['TOTAL_DIFF'] / 100000
    plt.plot(x, y, 'o', markersize=size, color='red', alpha=0.5,
             markeredgewidth=1, markeredgecolor='white')
plt.show()
From this map visualization we can tell that the busiest stations are located in Manhattan, specifically in Midtown.
Using the MTA ridership data, we can learn about commuting in New York City by identifying stations that people commute to (commuter destinations) and stations that people commute from (commuter origins).
We create two subsets of the data, one for the morning (data_am) and one for the evening (data_pm), and group them by station. We then merge them and create a new column, am_pm_difference, that reflects the difference between morning and evening ridership at a given station. More precisely, a large am_pm_difference arises from more entries in the morning or more exits in the evening, as well as fewer entries in the evening and fewer exits in the morning.
# Morning data grouped by station
data_am = data[(data['TIME']=='08:00:00') | \
(data['TIME']=='09:00:00')]\
.groupby('STATION', as_index=False).sum()
# Evening data grouped by station
data_pm = data[(data['TIME']=='20:00:00') | \
(data['TIME']=='21:00:00')]\
.groupby('STATION', as_index=False).sum()
# Merge morning and evening data
commute = pd.merge(data_am, data_pm, on='STATION', suffixes=['am','pm'])
# Calculate difference
commute['am_pm_difference'] \
= commute['ENTRY_DIFFam'] + commute['EXIT_DIFFpm'] \
- commute['ENTRY_DIFFpm'] - commute['EXIT_DIFFam']
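As a hypothetical worked example of the sign convention: a station with 10,000 morning entries, 2,000 morning exits, 3,000 evening entries, and 9,000 evening exits gets am_pm_difference = 10,000 + 9,000 - 3,000 - 2,000 = 14,000, marking it as a strong commuter origin; a large negative value would instead indicate a commuter destination.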
This difference (a measure of how many riders commute from a given station) is plotted in the following histogram. It is immediately clear that more stations are commuter origins than commuter destinations.
plt.hist(commute[abs(commute['am_pm_difference'])<100000]['am_pm_difference'], bins=100)
plt.xlabel('AM-PM Difference')
plt.ylabel('Number of Stations')
plt.show()
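To see which stations sit at the two extremes of this distribution, we can sort the DataFrame (a sketch; the resulting station names depend on the data file):
# Strongest commuter origins (largest positive AM-PM difference)
commute.sort_values(by='am_pm_difference', ascending=False).head()
# Strongest commuter destinations (run separately; largest negative difference)
commute.sort_values(by='am_pm_difference').head()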
We now want to plot these data for each station. First, we exclude anomalous values. Then we merge the commute DataFrame with the station locations and plot them using Basemap.
commute = commute[abs(commute['am_pm_difference'])<50000]
commute = pd.merge(commute, stations, on='STATION', how='inner')
fig = plt.figure(figsize=(8, 8))
# Create map with basemap
m = Basemap(projection='cyl', resolution='i',
llcrnrlat = commute['GTFS Latitude'].min(),
llcrnrlon = commute['GTFS Longitude'].min(),
urcrnrlat = commute['GTFS Latitude'].max(),
urcrnrlon = commute['GTFS Longitude'].max())
# Load satellite image
m.arcgisimage(service='ESRI_Imagery_World_2D',
              xpixels=1500, verbose=True)
# Draw stations with marker sizes according to AM-PM difference
for _, row in commute.iterrows():
    x, y = m(row['GTFS Longitude'], row['GTFS Latitude'])
    difference = row['am_pm_difference']
    scale = 2000  # divisor that maps the AM-PM difference to a marker size
    if difference > 0:  # Commuter origins
        size = difference / scale
        color = 'blue'
    else:               # Commuter destinations
        size = -difference / scale
        color = 'red'
    plt.plot(x, y, 'o', markersize=size, color=color, alpha=0.5,
             markeredgewidth=1, markeredgecolor='white')
plt.show()