1. Impact of Major Events on Traffic Flow

Authors

  • Enrique Sanchez
  • Parker Addison

2. Question and Importance

How does the context in which an event is held affect the traffic conditions surrounding the event?

The aim of our project was to explore the effect of events on nearby traffic conditions. We focused specifically on an event's time, location, and expected number of attendees. Although this leaves out other important attributes such as attendee demographics, we were eager to explore the predictive power of these three features. They are generally easy to obtain, so being able to predict traffic congestion from them alone would be tremendously useful.

Answering the question posed above would be valuable for communities where events are held frequently. Large events (e.g. Comic-Con) can be a huge burden on locals because of the traffic they bring. Being able to predict traffic conditions and plan ahead for them may save people time and keep busy cities moving. Let's also be honest with ourselves: who wants to deal with pesky traffic?!

From a business perspective, event planners would have a field day with access to this tool. A planner for an event with an established venue (fixed location) could use it to choose a schedule (day of week and start/end times) that lets attendees sit in minimal traffic, while a planner for a brand new event, or one seeking a new location, could use traffic predictions to get a sense of where people are coming from and choose a venue that minimizes traffic while also satisfying other conditions. For example, a certain location may route increased traffic through a high-crime part of the surrounding city; since planners presumably do not want attendees forming a poor impression of the city, they could choose a different venue that routes traffic another way.

Of course, this is also handy for the city itself. Being aware of the impact that major events can have helps the city develop regulations that minimize the effect of events on traffic conditions. Some major events may nonetheless be in the city's best interest because of the economic gain they bring; in such cases, this tool lets the city prepare effectively and efficiently for the increased traffic that will be making its way through town. That preparation can come in the form of better traffic management, greater access to public transportation in the area, or the creation of more roads.

It is also easy to imagine people being deterred from moving to a city with a poor traffic reputation. A solution that minimizes or better manages unexpected traffic congestion may help the city build a better reputation and motivate more people to move there, with a corresponding positive economic impact. Of course, solving the event-traffic problem does not solve the overarching traffic problem, but it may be a critical step toward slowly solving a problem that affects millions (or billions) of people.

3. Background & Literature

  1. “Special Event Management.” Texas A&M Transportation Institute. https://mobility.tamu.edu/mip/strategies-pdfs/traffic-management/technical-summary/Special-Event-Management-4-Pg.pdf

This article confirms many of the effects that special events can have on traffic congestion, as mentioned above, and describes what can be done to minimize them. Rather than offering a method of predicting traffic congestion, it offers solutions for alleviating it. We would hope these solutions could be used in conjunction with our congestion predictions, helping organizers determine the scale at which to implement them.

  2. Kwoczek, Simon, et al. “Predicting Traffic Congestion in Presence of Planned Special Events.” Journal of Visual Languages & Computing, vol. 25, no. 6, 2014, pp. 357–364. https://ksiresearchorg.ipage.com/seke/dms14paper/paper17.pdf

This article aligns more closely with our implementation goals. However, instead of predicting incoming traffic to an event, it aims to predict outgoing traffic (after the event has ended), which the authors refer to as second-wave traffic. They trained several algorithms, such as KNN, on historical traffic data from past events. Implementation details were not explicitly specified, so we are uncertain how one would improve their methodology. Their results are claimed to be up to 35% better than state-of-the-art solutions. Ultimately, we intend to do something similar, though we are of course dealing with incoming traffic. A rough sketch of that general approach (not their implementation) is shown below.
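Since the paper does not publish its implementation, the following is only a minimal, hypothetical sketch of the general idea described above: train a KNN regressor on historical per-event traffic records and score it on held-out events. The file name, feature columns, and congestion measure are placeholders of our own, not anything taken from the paper.

# Hypothetical sketch of KNN trained on historical event/traffic records.
# 'historical_event_traffic.csv' and every column name below are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

history = pd.read_csv('historical_event_traffic.csv')

# Simple per-event features and an observed congestion measure near the venue
X = history[['attendance', 'start_hour', 'day_of_week', 'dist_to_highway_km']]
y = history['congestion_index']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
print('R^2 on held-out events:', knn.score(X_test, y_test))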

  3. Humphreys, Brad R., and Hyunwoong Pyun. “Professional Sporting Events and Traffic: Evidence from US Cities.” SSRN Electronic Journal, 2017, doi:10.2139/ssrn.2940762. http://busecon.wvu.edu/phd_economics/pdf/17-05.pdf

The article reflects on the scarcity of research on the relationship between events and local traffic conditions. In an effort to uncover some useful information, the authors explore the relationship between local traffic and Major League Baseball games. They found that each additional 1,000 fans attending an MLB game was associated with a 1.749 increase in average daily miles traveled. This amounts to a 6.9% increase in total annual vehicle miles driven in a typical city hosting MLB events, which would constitute roughly a 2% increase in traffic congestion as a result of MLB sporting events.

  4. Zagidullin, Ramil. “Model of Road Traffic Management in the City during Major Sporting Events.” Transportation Research Procedia, vol. 20, 2017, pp. 709–716, doi:10.1016/j.trpro.2017.01.115. https://www.sciencedirect.com/science/article/pii/S2352146517301151

The article explores methods of road traffic management for sporting events. In doing so, it builds a mathematical model that reveals the root causes of increased travel time around sporting events: background road traffic, public transport, and transport for the major sporting events themselves.


Overall, it was quite difficult to find references for the topic we worked on; even the references we did find acknowledged the lack of information on it. Unfortunately, we were unable to get hold of historical traffic data, so we could not follow some of the methodologies highlighted above (reference 2). We did, however, get a better sense of the factors that cause traffic congestion around events, which helped us narrow down the features we used to predict congestion: the number of vehicles that come into the area, the proximity to highways, and the navigability of the area under normal conditions. As will be seen, we gathered this information from various data sources and with geospatial tools such as geoenrichment and service areas.

4. Libraries Used

In our proposal we did not know exactly which packages we would use, though we did hypothesize that we would need arcgis.network; that is still the case, even though parts of our project changed considerably.

arcgis.geocoding
Used to add geometric locations to events in unseen locations which were missing latitude/longitude.

arcgis.features.manage_data
Used to dissolve the highways into a single feature to allow for distances to be computed between events and highways, and used to clip the layer to our study extent in order to speed up calculations.

arcgis.geometry
Used to calculate the Euclidean distance between an event’s location and the nearest highway, and to create geometry objects from service area polygons.

arcgis.geoenrichment
Used to calculate the number of automobiles owned in a location’s baseline service area.

arcgis.network
Used to calculate service areas around an event’s location (both baseline and historical).

pandas, datetime, etc.
General packages used primarily to clean our data, query our data, or add further logic to certain operations (such as creating service areas for a specific date and time). A short, hedged sketch of how a few of the ArcGIS calls above fit together appears just after this list.
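To make these roles concrete, here is a minimal sketch (separate from the actual pipeline below) that geocodes one event address and measures its straight-line distance to a stand-in highway geometry. The highway polyline here is a made-up placeholder; in the project, the geometry comes from the dissolved California State Highway layer.

# Minimal sketch: geocode one event address, then measure its planar distance
# to a placeholder highway geometry (the real one is queried from the dissolved
# Caltrans layer).
from arcgis.gis import GIS
from arcgis.geocoding import geocode
from arcgis.geometry import Geometry, distance

gis = GIS()  # sign in instead if your organization requires it for the geometry service

# Geocode an intersection (as we do for events missing coordinates), in EPSG:3857
hit = geocode('5th Avenue & Market Street, San Diego', out_sr=3857)[0]
event_geom = Geometry({'x': hit['location']['x'],
                       'y': hit['location']['y'],
                       'spatialReference': {'wkid': 3857}})

# Placeholder polyline standing in for the dissolved highway feature
highway_geom = Geometry({'paths': [[[-13052000, 3862000], [-13040000, 3868000]]],
                         'spatialReference': {'wkid': 3857}})

# Euclidean distance in the units of the spatial reference (metres for 3857)
result = distance(3857, event_geom, highway_geom, gis=gis)
print(result['distance'])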

5. Data Sources


Title: Special Events

URL: https://data.sandiego.gov/datasets/special-events/

Number of records: 2840

Description: Dataset provided by DataSD containing details on all San Diego events since May 2016 that required a Special Events Permit. Includes: Event name, Event Type, Event Url, Location (both description of address and lat/lon), Start date and time, End date and time, and Expected number of attendees and participants.


Title: California State Highways

URLs: https://ucsdonline.maps.arcgis.com/home/item.html?id=22cd676ed1f74a7290f64dd1dc9b8363 and https://services1.arcgis.com/8CpMUd3fdw6aXef7/arcgis/rest/services/California_State_Highway/FeatureServer/0

Number of records: 1370

Description: An official feature layer representing all highway routes in California. Provided by Caltrans for planning purposes, and validated with the Postmile Validation Wizard, last updated October 2017.


Title: HERE Historical Traffic Data (unobtained)

URL: https://www.here.com/products/traffic-solutions/road-traffic-analytics

Description: Access to HERE’s historical traffic data would have allowed us to look at traffic flow (summarized by a single statistic such as average speed or throughput rate) for each street segment. We were unable to obtain this dataset after reaching out to HERE, so we modified our project to utilize service areas as opposed to manually examining street segments.
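To illustrate that substitution, below is a hedged sketch of generating drive-time service areas around a single event location. The coordinates, break values, and time of day are illustrative only, and the parameters reflect our understanding of arcgis.network.analysis.generate_service_areas rather than code from this notebook.

# Hedged sketch: drive-time service areas around one event location.
from datetime import datetime
from arcgis.gis import GIS
from arcgis.features import Feature, FeatureSet
from arcgis.network.analysis import generate_service_areas

gis = GIS(username="ens004_UCSDOnline5")  # network analysis requires an authenticated GIS

# One facility at an (illustrative) event location, lon/lat in WGS84
event = Feature(geometry={'x': -117.1611, 'y': 32.7157,
                          'spatialReference': {'wkid': 4326}},
                attributes={'Name': 'example event'})

result = generate_service_areas(
    facilities=FeatureSet([event]),
    break_values='5 10 15',                    # drive-time rings, in minutes
    break_units='Minutes',
    travel_direction='Towards Facility',       # incoming traffic is what we care about
    time_of_day=datetime(2019, 7, 18, 17, 0),  # historical traffic at the event start
    gis=gis,
)

service_area_polygons = result.service_areas  # FeatureSet of drive-time polygons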

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime
import time

from IPython.display import HTML
from IPython import display
import warnings
warnings.filterwarnings('ignore')

from arcgis.gis import GIS
import arcgis
import arcgis.network as network
from arcgis.features import Feature, FeatureSet
from arcgis.geoenrichment import *
from arcgis.features.manage_data import dissolve_boundaries
from arcgis.geometry import distance
from arcgis.geocoding import Geocoder, get_geocoders, geocode

import matplotlib
%matplotlib inline

gis = GIS(username="ens004_UCSDOnline5")
Enter password: ········
In [2]:
# Reading in the raw events data
events = pd.read_csv('special_events_list_datasd.csv')
events.head(3)
Out[2]:
event_title event_id event_subtitle event_type event_desc event_loc event_start event_end exp_attendance exp_participants event_host event_url event_address latitude longitude
0 Pacific Beach Tuesday Farmers' Market 51870 NaN FARMERS This farmer's market offers locally grown vege... Bayard Street between Grand and Garnet Avenues 2019-12-31 14:00:00 2019-12-31 19:00:00 800 60 Discover PB www.pacificbeachmarket.com Bayard Street & Grand Avenue 32.799998 -117.254587
1 Sunday Artisan Market 52120 NaN FARMERS The Sunday Artisan Market provides space for l... 5th Avenue between Market Street and J Street 2019-12-29 10:00:00 2019-12-29 15:00:00 NaN 30 Gaslamp Quarter Association www.gaslamp.org 5th Avenue & Market Street NaN NaN
2 Old Town Artisan's Market 51714 NaN FARMERS A weekend open air market offering an array of... Harney Street between San Diego Avenue and Con... 2019-12-29 09:00:00 2019-12-29 16:30:00 500 100 Old Town San Diego Chamber of Commerce NaN Harney Street & San Diego Avenue 32.752779 -117.194902

6. Data Cleaning

Special Events Dataset:

Because the events data set was provided by DataSD, we anticipated that it would require minimal cleaning. To our surprise, the data was actually very messy: there were problems with event start/end times, attendance for multi-day events, missing values, missing locations, “San Diego” events that are not within San Diego, and several other small errors. We believe this is a consequence of bad data entry. We were also unable to access metadata for the data set, so some assumptions had to be made while cleaning. Overall, data cleaning required a lot of manipulation with pandas and some ArcGIS functions such as geocoding.

A detailed description of the data cleaning for the events data set can be found in the cell below.

California State Highways:

Note: The dissolve_boundaries and extract_data steps were performed inside the Map Viewer on ArcGIS Online. As such, there is no code in this notebook, but here are the steps we took (a hedged sketch of the equivalent Python calls follows the list):

  1. Load California State Highways.

  2. Make sure the highways are dissolved. This layer already is, but layers that we used previously were not. Under the Analysis tab, run Dissolve Boundaries with default settings.

  3. Load in the dissolved layer.

  4. Under the Analysis tab, run Extract Data with Study Area set to a drawn box with proper min/max lat/long values, and check Clip Features.
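Since we ran these tools in the Map Viewer, there is no notebook code for them. The sketch below is only a rough Python-API equivalent, under the assumption that a study-area polygon layer is available; STUDY_AREA_ITEM_ID is a hypothetical item of our own invention.

# Rough Python-API equivalent of the Map Viewer steps above (a sketch only).
# 'STUDY_AREA_ITEM_ID' is a hypothetical item holding one study-area polygon.
from arcgis.gis import GIS
from arcgis.features.manage_data import dissolve_boundaries, extract_data

gis = GIS(username="ens004_UCSDOnline5")

highways = gis.content.get("22cd676ed1f74a7290f64dd1dc9b8363").layers[0]  # CA State Highways
study_area = gis.content.get("STUDY_AREA_ITEM_ID").layers[0]              # hypothetical

# Step 2: merge all highway segments into a single feature
dissolved_item = dissolve_boundaries(highways, output_name="CA_Highways_Dissolved")

# Step 4: extract the dissolved highways, clipped to the study area
extracted_item = extract_data([dissolved_item.layers[0]], extent=study_area,
                              clip=True, output_name="CA_Highways_SD_Extent")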

In [3]:
#! This does not need to be run multiple times.  The cleaned events feature layer is
#  published on ArcGIS Online -- just load it in that way
run = False
if run:

    # ---------------------------- PART 1: Basic Cleaning/Setup ----------------------------

    # Converting start and end times to datetime and creating year column
    events['event_start'] = pd.to_datetime(events['event_start'])
    events['event_end'] = pd.to_datetime(events['event_end'])
    events['year'] = events['event_start'].dt.year

    # Filling in null attendances (helpful for future cleaning)
    events['exp_attendance'] = events['exp_attendance'].fillna('')
    events['exp_participants'] = events['exp_participants'].fillna('')

    # We also have columns where text is present. We want to avoid any potential entry
    # errors (unwanted spaces, capitalization, etc) so we will clean these columns.
    def clean_text(text):
        if pd.isnull(text):
            return np.nan
        no_marks = text.replace("'", "").replace(",", "")
        lower_whitespace = no_marks.lower().strip()

        return lower_whitespace

    text_cols = ['event_title', 'event_type', 'event_loc', 'event_host', 'event_address']
    for col in text_cols:
        events[col] = events[col].apply(clean_text)


    # ----------------------------- PART 2: Start/End Time -----------------------------

    # Interestingly, some events ended before they started giving us negative time deltas...
    # This seems to arise from false representations of afternoon/morning times. For example, 
    # Old Town's Artisan Market begins at 9am and ends at 4:30pm yet we get some instances of 
    # ending times of 16:30:00 (correct) and 04:30:00. 12 hours off! They are not always 12
    # hours off however. Those that end at midnight or 1am are represented as 00:00:00 or 
    # 01:00:00 in the same day that the event started! In such cases we must add 24 hours.

    # Filter data set to those that have a start and end in the same day
    same_day = events[events['event_start'].dt.day == events['event_end'].dt.day]

    # Filter further to those events that end before they start
    error = same_day[same_day['event_end'] < same_day['event_start']]

    # Filtering to rows that will have time errors
    day_error = error[error['event_end'].dt.hour.isin([0, 1])]
    twelve_error = error[~error['event_end'].dt.hour.isin([0, 1])]

    # Adding appropriate amount of time
    day_error['event_end'] = day_error['event_end'] + pd.Timedelta(days=1)
    twelve_error['event_end'] = twelve_error['event_end'] + pd.Timedelta(hours=12)

    # Fixing time errors
    events.loc[day_error.index, 'event_end'] = day_error['event_end']
    events.loc[twelve_error.index, 'event_end'] = twelve_error['event_end']

    # We also have an event that is 20 days long without a break between days. After
    # doing some research online, this is in reality a single day event. Let's fix
    # this single event.
    event_start = pd.to_datetime('2019-04-27 18:00:00')
    event_end = pd.to_datetime('2019-04-27 23:00:00')

    botany_bash = events[events['event_title'] == 'san diego natural history museum botany bash']
    events.loc[botany_bash.index, 'event_end'] = event_end
    events.loc[botany_bash.index, 'event_start'] = event_start


    # ------------------------------- PART 3: Attendance -------------------------------

    # Since attendance has a huge effect on the traffic impact of an event we must ensure
    # that the attendance is as accurate as possible. We have two problems however:
    #
    # 1. Every observation that has '(xx-day event)' (a multi-day event) in the event title
    #    has an expected attendance equal to the expected attendance of the entire event! Note
    #    that this appears to affect the 'festival', 'athletic', and 'concert' events.
    #    There also exist multi-day events that do not specify the length of the event in the
    #    title and simply have 'X,XXX/day' in the attendance columns.
    # 2. Nearly 10% of the data is missing attendance.

    # Let's begin by fixing the multi day event problem.

    # Extracting affected rows
    affected_events = ['festival', 'athletic', 'concert']
    cols = ['event_title', 'exp_attendance', 'exp_participants', 'event_start', 'event_end', 'year']
    attendance = events[events['event_type'].isin(affected_events)][cols]

    # Only events that happen more than once in a year can be affected - 655 observations
    event_counts = attendance.groupby(['event_title', 'year']).size()
    dup_events = event_counts[event_counts > 1].reset_index().drop(0, axis=1)

    # Merge while keeping the original events index (we need it later to drop rows from `events`);
    # pandas does not allow combining `on` with `right_index=True`, so we carry the index explicitly.
    dup_events_data = (events.reset_index()
                             .merge(dup_events, on=['event_title', 'year'], how='inner')
                             .set_index('index'))

    # The longest multi-day event spans 41 days, so we can exclude events with larger extents
    multi_cut = pd.Timedelta(days=41)
    ind_events = dup_events_data.groupby(['event_title', 'year'])

    event_duration = ind_events.apply(lambda x: x['event_end'].max() - x['event_start'].min())
    multiday_events = event_duration[event_duration < multi_cut].reset_index().drop(0, axis=1)

    # Again merge on title/year while preserving the original events index
    multiday_events_data = (dup_events_data.reset_index()
                                           .merge(multiday_events, on=['event_title', 'year'], how='inner')
                                           .set_index('index'))

    # We have now determined the potential multi-day events (415 observations) and can assess
    # which events have errors in their attendance... Unfortunately, after deeply analyzing the
    # data, there seems to be no pattern that accurately identifies the events with errors. The
    # only events we are sure have accurate attendances are those with 'X,XXX/day' representations.
    # To prevent inaccuracies in our future predictions, we will drop the other multi-day events.

    # We want to remove events with no 'X,XXX/day' representation
    potential_errors = multiday_events_data[~multiday_events_data['exp_attendance'].str.contains('/day')].index
    events = events.drop(potential_errors, axis=0)

    # Now we need to deal with events with missing attendances. There is very little we can do
    # about missing attendance. If we look at events with missing attendance, nearly half come
    # from 'daily food trucks' which we don't really consider a true event. We have chosen to
    # not risk bad imputations and simply drop these events. GIGO!
    events = events[(events['exp_attendance'] != '') & (events['exp_participants'] != '')]

    # Let's also clean the attendance columns so that they are actual numbers!
    def clean_attendance(event):
        if event == '':
            return np.nan

        return int(event.replace('/day', '').replace(',', ''))

    events['exp_attendance'] = events['exp_attendance'].apply(clean_attendance)
    events['exp_participants'] = events['exp_participants'].apply(clean_attendance)


    # ----------------------------- PART 4: Event Locations -----------------------------

    # Since we are focused on determining the traffic impact of an event on surrounding areas,
    # it is critical that we know where these events are located. It may be tempting to
    # go ahead and geocode all missing locations, but some events share the same location
    # (as represented by the 'event_address' variable), so we can replace missing locations
    # with locations of events that occurred at the same location. We see some very slight
    # variations in coordinates for the same location at times, but they still seem
    # like fair estimates.

    # Let's begin by creating a column that holds both the lat and lon for each event in a tuple.
    events['location'] = events.apply(lambda x: (x['latitude'], x['longitude']), axis=1)

    # Now we can create a dictionary for event addresses and their corresponding coordinates
    with_location = events.dropna(subset=['latitude', 'longitude'])
    location_dict = with_location.groupby('event_address')['location'].unique().apply(lambda x: x[0]).to_dict()

    # Let's now replace events with missing locations with this dictionary
    missing_locations = events[(events['latitude'].isnull()) | (events['longitude'].isnull())]
    missing_locations['location'] = missing_locations['event_address'].apply(lambda x: location_dict.get(x))

    # Now that we saved some locations, let's put it back into the dataframe
    missing_locations['latitude'] = missing_locations['location'].apply(lambda x: np.nan if pd.isnull(x) else x[0])
    missing_locations['longitude'] = missing_locations['location'].apply(lambda x: np.nan if pd.isnull(x) else x[1])

    # Assign via .loc on the frame itself to avoid fragile chained assignment
    events.loc[missing_locations.index, 'latitude'] = missing_locations['latitude']
    events.loc[missing_locations.index, 'longitude'] = missing_locations['longitude']

    # We still have some missing locations. We will fill these in using geocoding!
    # There are two columns that indicate location aside from lat and lon: event_loc & event_address.
    # event_loc includes a brief description of the location which may be tricky to geocode. 
    # event_address on the other hand, gives us the intersection at which an event happens 
    # (e.g. 5th Avenue & Market Street). Thankfully, intersections can be geocoded!

    # Finding locations that are still missing
    further_missing = events[(events['latitude'].isnull()) | (events['longitude'].isnull())]

    # Getting the unique event addresses so we don't geocode the same address (only 23 locations!)
    unique_addresses = further_missing['event_address'].unique()

    # Geocode!
    for address in unique_addresses:
        if pd.isnull(address):
            continue

        # Extracting coordinates
        geocoded = geocode(address + ', San Diego')
        longitude = geocoded[0]['attributes']['X']
        latitude = geocoded[0]['attributes']['Y']

        # Imputing
        further_missing.loc[further_missing['event_address'] == address, 'latitude'] = latitude
        further_missing.loc[further_missing['event_address'] == address, 'longitude'] = longitude

    # Now that we geocoded, we only have 20 locations without an address! We will simply drop these.
    # We can also update our events data set
    events.loc[further_missing.index, 'latitude'] = further_missing['latitude']
    events.loc[further_missing.index, 'longitude'] = further_missing['longitude']
    events = events.dropna(subset=['latitude', 'longitude'])


    # ----------------------------- PART 5: Finishing Touches -----------------------------

    # Now that the most important features are clean, we can drop unneeded columns and
    # set up our data frame for the analysis ahead!

    # We should remove any events that are outside of San Diego county.  There aren't too many
    # cases of this, and there aren't any 'close-calls', so we can use rudimentary extents to
    # figure out what to drop.
    sd_extent = {"lonmin": -117.6, "lonmax": -116, "latmin": 32.5, "latmax": 33.5}
    events = events[
          (events.latitude >= sd_extent["latmin"])
        & (events.latitude <= sd_extent["latmax"])
        & (events.longitude >= sd_extent["lonmin"])
        & (events.longitude <= sd_extent["lonmax"])
    ]

    # Creating a variable for the total expected attendance
    events['total_attendance'] = events['exp_attendance'] + events['exp_participants']

    # Creating clearer date/time columns
    events['event_date'] = events['event_start'].dt.date
    events['event_start'] = events['event_start'].dt.time
    events['event_end'] = events['event_end'].dt.time

    # Keeping only necessary columns
    cols = ['event_title', 'event_id', 'event_type', 'event_date', 'event_start',
            'event_end', 'total_attendance', 'latitude', 'longitude']

    events = events[cols].reset_index(drop=True)

    # Renaming columns
    events.columns = pd.Series(events.columns).apply(lambda x: x.replace('event_', ''))

    # Let's now see the clean data!
    display(events.info())
    events.head()
    
# Converting data to sdf and creating feature layer from it for future use
# Note creating copy since you can't create a feature layer from an sdf with datetime columns

# events_sdf = pd.DataFrame.spatial.from_xy(events, x_column = 'longitude', y_column='latitude')
# events_sdf = events_sdf.astype({'date':'str','start':'str','end':'str'})

# events_fl = events_sdf.spatial.to_featurelayer(title='San Diego Event Locations', tags='events').layers[0]
#
# NOTE: This was already run, we can just read it in from arcgis.
#       This feature layer is a cleaned version of our events dataset.
events_fl = gis.content.get("eda42c7fb00f4996a00b769ed74843c6").layers[0]

# We will use Web Mercator (EPSG:3857) for this project
events_sdf = events_fl.query(out_sr='3857').sdf
events_sdf.head()
Out[3]:
FID title id type date start end_ total_atte latitude longitude SHAPE
0 1 pacific beach tuesday farmers market 51870 farmers 2019-12-31 14:00:00 19:00:00 860 32.799998 -117.254587 {'x': -13052720.918015596, 'y': 3868787.072960...
1 2 old town artisans market 51714 farmers 2019-12-29 09:00:00 16:30:00 600 32.752779 -117.194902 {'x': -13046076.791943701, 'y': 3862535.324132...
2 3 2019 hillcrest farmers market (sundays) 52070 farmers 2019-12-29 09:00:00 14:00:00 4400 32.748542 -117.149901 {'x': -13041067.32580241, 'y': 3861974.5260695...
3 4 old town artisans market 51713 farmers 2019-12-28 09:00:00 16:30:00 600 32.752779 -117.194902 {'x': -13046076.791943701, 'y': 3862535.324132...
4 5 city heights farmers market (every saturday) 51818 farmers 2019-12-28 09:00:00 13:00:00 600 32.747753 -117.099983 {'x': -13035510.50172489, 'y': 3861870.0833714...
In [4]:
# Plotting the events
map1 = gis.map('San Diego')
map1.add_layer(events_fl)
display.Image('images/map1.png')
Out[4]: