Exploritory Data Analysis


We examine the NSW train Exit and Entry traffic for the Kings Cross, Newtown, Parramatta, Circular Quay, Museum, Central and Town Hall stations for evidence of changes due to the introduction of the NSW Lockout Law on 24-Feb-2014.

Using Baysian Change Point detection we found:

  1. No evidence of changes to Kings Cross or Parramatta Exit traffic from the introduction of the lockout law.
  2. Evidence of strong growth in the Parramatta Friday night Exit traffic that is unrelated of the lockout laws and which has increased traffic by 200% since Jan-2013
  3. Evidence of changes in the Newtown Friday night Exit traffic as a result of the lockout laws and which has increased traffic by 300% since the law came into effect.

The Data

We were provided with train turnstile "validation" data by Transport for NSW. This data is a summary of Exit and Entry traffic by Station, Hour and Date.

The data covers the period 2013-02-01 to 2016-07-31, for Friday, Saturday and Sunday nights from 5PM to 2AM. Since this spans the transition from Magnetic tickets to OPAL cards the data is further divided by Source.

In the notebook 1_Data_Transformation we transform the data from its raw form to one more suitable for our analysis. Specifically we need observations between midnight and the last train (2pm) to have the Date as the previous day. This means all observation in the period 5pm-2am will have the same Date and be considered part of the same Night.

In [86]:
import numpy as np
import pandas as pd
from datetime import timedelta, datetime

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 2)

import seaborn as sns

import outlier
import utilities as util
In [2]:
df_raw = pd.read_pickle("TrainValidationData/df_raw.pkl") 
stations= ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
           'Circular Quay Station', 'Museum Station',
           'Central Station', 'Town Hall Station']

In the notebook 2_Cleaning we examine the data in more detail. Unsurprisingly we found some data is missing, however we conclude the impact is minimal and we can work with this data.

Counting the number of whole days of missing data we find:

In [3]:
daily = (df_raw.query('Station in @stations').
            pivot_table(index = 'Date', columns = ['Station'], values='Exit', aggfunc=sum).dropna(axis='columns', how='all'))
print("Number of days", len(daily))
for station in daily.columns:
    print("{:>22}: {:}".format(station, daily[station].isnull().sum()))
Number of days 549
       Central Station: 2
 Circular Quay Station: 13
   Kings Cross Station: 14
        Museum Station: 16
       Newtown Station: 58
    Parramatta Station: 6
     Town Hall Station: 0

We observe

  • Town Hall has no missing data and Central just two days.
  • Newtown is missing 10% (40 days), which needs further investigation.

We examine the missing data more closely in 2_Cleaning and discover the missing data for:

  1. Kings Cross and Parramatta is a few whole days on Saturday which is probably due to the station being closed for station or trackwork.
  2. Newtown is largely the first 5 months of Saturday data from Feb to June of 2013. This is probably a result of the station upgrade works started in 2013.

Counting the missing data by Hour of the day we get:

In [4]:
(df_raw.query('Station in @stations').
     pivot_table(index = 'Date', columns = ['Station', 'Hour'], values='Exit', aggfunc=sum).
     dropna(axis='columns', how='all').
     pivot('Station', 'Hour', 0))
Hour 5PM 6PM 7PM 8PM 9PM 10PM 11PM 12AM 1AM 2AM
Central Station 2 2 2 2 2 5 6 6 6 33
Circular Quay Station 27 27 27 30 30 33 32 20 214 528
Kings Cross Station 27 26 28 29 29 30 29 23 134 499
Museum Station 32 36 36 35 36 119 118 127 290 536
Newtown Station 93 92 96 112 111 178 197 184 319 470
Parramatta Station 15 12 15 13 18 19 20 15 76 403
Town Hall Station 0 0 0 0 0 0 0 0 36 435

We for observe for most stations the missing data does not change by Hour, where as Town Hall and Museum the count increases later at night possibly indicating that the station closes early or the gates are left open and people do not swipe off.

Analysis of Hourly Traffic

In this section we look at the traffic patterns at the hourly level. This data is created from the raw data by adding the Exit/Entry traffic for Magnetic Tickets and OPAL for each Station/Date/Hour.

In [5]:
df_hourly = pd.read_pickle("TrainValidationData/df_hourly.pkl")

Hourly averages

The trellis plot below list Nights from left to right and Stations from top to bottom using a common and y-axis sale. Each plot displays the average Exit traffic by hour from 5PM - 2AM

In [6]:
df_hourly['Year']    = df_hourly.index.year

g = sns.factorplot(data=df_hourly, x='Hour', y='Exit', kind="bar", 
               row="Station", row_order=stations[:5], col = 'Night',
               size=1.75, aspect=2, ci = None, margin_titles = True, sharey=True)

We plot Central and Townhall separately due to the disparity of the y-axis scales.

In [7]:
df_hourly['Year']    = df_hourly.index.year

g = sns.factorplot(data=df_hourly, x='Hour', y='Exit', kind="bar", 
               row="Station", row_order=stations[5:], col = 'Night',
               size=1.75, aspect=2, ci = None, margin_titles = True, sharey=True)

We observe the:

  • Exit traffic declines by Hour as expected.
  • Pattern of decline is different for each night.
  • the 5pm and 6pm Friday night traffic is much higher than Saturday or Sunday and (with the exception of Newtown) is disproportionate to the rest of the Friday night. We suspect this to be commuters heading home from work.

For this reason subsequent analysis only considers data from 7pm onwards.

Hourly time series

The following trellis plot list Stations from left to right and Hours from top to bottom starting at 7PM. The X and Y scales are common across all the plots to make comparison easy.

Each each plot of the time series shows:

  • The Exit traffic (blue line).
  • A trend line (dotted green), calculated using linear regression (ordinary least squares). Note this is indicative only and does not indicate linear growth.
  • Outliers (red star), found using the Interquartile Range (IQR) method.
In [8]:
In [9]: