#!/usr/bin/env python
# coding: utf-8

# # Finding Heavy Traffic Indicators on I-94
#
# ***
# ## Introduction
#
# In this project, we're going to analyze a dataset on westbound traffic on [Interstate Highway I-94](https://en.wikipedia.org/wiki/Interstate_94). **John Hogue** made the dataset available, and you can download it from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume).
#
# The following are descriptions for each column in the dataset:
#
# * `holiday` – US National holidays plus regional holiday, Minnesota State Fair,
# * `temp` – average temp in kelvin,
# * `rain_1h` – the amount in mm of rain per hour,
# * `snow_1h` – the amount in mm of snow per hour,
# * `clouds_all` – % of cloud cover,
# * `weather_main` – a short textual description of the current weather,
# * `weather_description` – a longer textual description of the current weather,
# * `date_time` – the time of the data collected in local CST time,
# * `traffic_volume` – hourly I-94 ATR 301 reported westbound traffic volume.
#
# ### Goal
#
# * The goal of our analysis is to determine some indicators of heavy traffic on I-94. These indicators can be weather type, time of day, time of week, etc.
#
# ### Summary of Results
#
# The indicators with the highest traffic volume are shown below:
#
# * Time:
#
# | Time Indicators | Description                        | AVG Traffic Volume (approximate), cars/hr |
# | --------------- | ---------------------------------- | ----------------------------------------- |
# | Daytime hours   | Business days (7 to 19)            | 4,000 to 6,000                            |
# | Early morning   | Business days (at 6)               | 5,500                                     |
# | Peak hours      | Business days (7 and 16)           | 6,000                                     |
# | Day of week     | Friday (daytime)                   | 5,500                                     |
# | Warm months     | March to October (except July)     | 4,900                                     |
# | Peak year       | 2017 - Due to road works (daytime) | 5,000                                     |
#
#
# * Weather:
#
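# | Weather Description          | AVG Traffic Volume, cars/hr |
# | ---------------------------- | --------------------------- |
# | Snow with showers            | > 5,000 (Daytime)           |
# | Light rain and snow          | > 5,000 (Daytime)           |
# | Proximity storm with drizzle | > 5,000 (Daytime)           |
# | Proximity shower             | > 2,500 (Nighttime)         |
# | Light intensity shower       | > 2,500 (Nighttime)         |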
#
#
# * Holidays:
#
# | Holiday | AVG Traffic Volume, cars/hr |
# | ------- | --------------------------- |
# | Columbus Day (Indigenous Peoples' Day) | About 3,500 |
#
# *We conclude that temperature does not influence the intensity of traffic volume.*
#
# ***
# ## Initial exploration of the data set
#
# Let's start by importing the libraries we'll need and reading the dataset into pandas.

# In[1]:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

get_ipython().run_line_magic('matplotlib', 'inline')


# In[2]:


i_94 = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
i_94.head()


# In[3]:


i_94.tail()


# In[4]:


i_94.info()


# The dataset has 48,204 rows and 9 columns. The only column that shows null values is `holiday`. Each row describes the traffic and weather data for a specific time.
#
# **To Consider**
#
# The [dataset documentation](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume) mentions that a station located approximately halfway between Minneapolis and Saint Paul records the traffic data. For this station, the direction of the route is west (that is, cars are moving from east to west). This means that the results of our analysis will refer to westbound traffic in the vicinity of the station.
#
# > *We must avoid generalizing our results to the entire I-94 freeway.*
#
# ***
# ## Implementation of Support Functions
#
# Before continuing, we'll implement some helper functions that give us a clearer view of the values and save lines of code. This will make it easier for us to explore and analyze the data.
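# One of the helpers in this section, `bins_number`, picks the number of histogram bins as `1 + 3.32 * log10(n)`, which is Sturges' rule. The snippet below is a minimal standalone sketch of that formula using only the standard library; the name `sturges_bins` is hypothetical and is not part of the notebook.

```python
import math


def sturges_bins(n_rows: int) -> int:
    """Number of histogram bins suggested by Sturges' rule for n_rows observations."""
    if n_rows < 1:
        raise ValueError("n_rows must be at least 1")
    # 3.32 approximates 1 / log10(2), so this is roughly 1 + log2(n_rows).
    return int(round(1 + 3.32 * math.log10(n_rows)))


# The full dataset has 48,204 rows.
print(sturges_bins(48_204))  # prints 17
```

# Because the bin count grows with the logarithm of the row count, the day and night subsets (each roughly half of the data) end up with nearly the same number of bins as the full dataset.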

# In[5]:


def float_format(value, n=2):
    """Round a number to n decimals and format it with thousands separators."""
    value = round(value, n)
    return '{:,}'.format(value)


def bins_number(df):
    """Histogram bin count from Sturges' rule: 1 + 3.32 * log10(n)."""
    n = df.shape[0]
    k = 1 + 3.32 * np.log10(n)
    return int(round(k))


def gen_hist_plot(df, col_name, bins=10, extra_title='', extra_xlabel='', color=None):
    x_label = col_name.replace('_', ' ').title()
    title = x_label + extra_title
    x_label += extra_xlabel
    plt.hist(df[col_name], bins=bins, color=color, edgecolor='gray')
    plt.title(title, fontsize=18)
    plt.xlabel(x_label, fontsize=14)
    plt.ylabel('Frequency', fontsize=14)
    sns.despine()


def get_comm_describes(lst_dfs, comm_col, lst_names):
    """Return describe() of one column for several dataframes, side by side."""
    lst_desc = []
    indexes = len(lst_dfs)
    for i in range(indexes):
        lst_desc.append(lst_dfs[i][comm_col].describe())
    df_desc = pd.concat(lst_desc, axis=1)
    col_title = comm_col.replace('_', ' ').upper()
    df_desc.index.name = col_title
    df_desc.columns = lst_names
    for i in range(indexes):
        df_desc[lst_names[i]] = df_desc[lst_names[i]].apply(float_format)
    return df_desc


def gen_line_plot(df, col_name, title='', label=None, xlabel=None, ylabel=None,
                  xticks=None, labels=None, despine=True, xmin=None, xmax=None,
                  ymin=None, ymax=None, linestyle=None, color=None, marker=None,
                  rotation=None):
    plt.plot(df[col_name], label=label, color=color, linestyle=linestyle, marker=marker)
    plt.title(title, fontsize=18)
    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel(ylabel, fontsize=14)
    # xticks is an inclusive (first, last) pair of integer tick positions.
    if xticks:
        ticks = list(range(xticks[0], xticks[1] + 1))
    else:
        ticks = None
    plt.xticks(ticks=ticks, labels=labels, fontsize=12, rotation=rotation)
    plt.yticks(fontsize=12)
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)
    if despine:
        sns.despine()


def group_by_new_col_dt(df, new_group_col, dt_col, dt=0, col_sep_dweek=None):
    """Add a date-part column to df and return the mean values grouped by it.

    dt=0 -> Group by year.
    dt=1 -> Group by month.
    dt=2 -> Group by day of the week (Monday == 0).
    dt=3 -> Group by hour.

    If col_sep_dweek names a day-of-week column, return two dataframes:
    (business days, weekend).
    """
    if dt == 0:
        df[new_group_col] = df[dt_col].dt.year
    elif dt == 1:
        df[new_group_col] = df[dt_col].dt.month
    elif dt == 2:
        df[new_group_col] = df[dt_col].dt.dayofweek
    elif dt == 3:
        df[new_group_col] = df[dt_col].dt.hour
    # numeric_only=True skips the text columns, which recent pandas versions
    # refuse to average.
    if col_sep_dweek:
        # 4 == Friday
        business_days = df[df[col_sep_dweek] <= 4]
        # 5 == Saturday
        weekend = df[df[col_sep_dweek] >= 5]
        hour_business = business_days.groupby(new_group_col).mean(numeric_only=True)
        hour_weekend = weekend.groupby(new_group_col).mean(numeric_only=True)
        return hour_business, hour_weekend
    return df.groupby(new_group_col).mean(numeric_only=True)


def gen_scatter_plot(df, col_1, col_2, title, xlabel, ylabel, xmin=None, xmax=None, color=None):
    plt.scatter(df[col_1], df[col_2], color=color, alpha=0.1)
    plt.title(title, fontsize=18)
    plt.xlim(xmin, xmax)
    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel(ylabel, fontsize=14)
    sns.despine()


def gen_barh_plot(df, col_x, col_y, title, xlabel=None, color=None, xv_line=None):
    plt.barh(df[col_x], df[col_y], height=0.4, color=color)
    plt.title(title, fontsize=18)
    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel('')
    plt.tick_params(left=False)
    if xv_line:
        plt.axvline(x=xv_line, color='dimgrey', linewidth=0.2)
    sns.despine(left=True)


# ***
# ## Analyzing Traffic Volume
#
# We're going to start our analysis by examining the distribution of the `traffic_volume` column.

# In[6]:


col_bins = bins_number(i_94)
gen_hist_plot(i_94, 'traffic_volume', bins=col_bins, extra_title=' Histogram',
              extra_xlabel=', cars/hr', color='lightsteelblue')

print('TRAFFIC VOLUME STATS\n', i_94['traffic_volume'].describe().apply(float_format), sep='\n')


# **Observations**
#
# * Overall, the hourly traffic volume ranged from 0 to 7,280 cars, with an average of 3,260 cars per hour.
# * Approximately 25% of the time, 1,193 cars or fewer passed through the station each hour; this probably occurs at night, or when a road is under construction. However, 25% of the time, the traffic volume was about four times higher (4,933 cars or more).
# * Also, we can observe that the most frequent traffic volume values range between 0 and 1,000, between 2,500 and 3,500, and between 4,500 and 6,000 cars/hr.
#
# These observations give our analysis an interesting direction: comparing daytime data with nighttime data.
#
# ***
# ## Traffic Volume: Day vs. Night
#
# We'll start by dividing the dataset into two parts:
#
# * Daytime data: hours from 7 AM to 7 PM (12 hours).
# * Nighttime data: hours from 7 PM to 7 AM (12 hours).
#
# While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.

# In[7]:


i_94['date_time'] = pd.to_datetime(i_94['date_time'])

# Daytime covers hours 7 to 18 inclusive; nighttime covers 19 to 23 and 0 to 6.
hour = i_94['date_time'].dt.hour
day = i_94[(hour >= 7) & (hour < 19)].copy()
night = i_94[(hour >= 19) | (hour < 7)].copy()

print('Total rows - Day: {:,}'.format(day.shape[0]))
print('Total rows - Night: {:,}'.format(night.shape[0]))


# This significant difference in row numbers between day and night is due to a few hours of missing data. For instance, if we look at rows 176 and 177, we'll notice there's no data for two hours (4 and 5).

# In[8]:


i_94.iloc[176:178]


# Now that we've isolated day and night, we're going to look at the histograms of traffic volume side by side using a grid chart.

# In[9]:


bins_day = bins_number(day)
bins_night = bins_number(night)

dfs = [day, night]
num_bins = [bins_day, bins_night]
extra_titles = [': Daytime', ': Nighttime']
colors = ['deepskyblue', 'slateblue']

plt.figure(figsize=(12, 4))

for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_hist_plot(dfs[i], 'traffic_volume', bins=num_bins[i], extra_title=extra_titles[i],
                  extra_xlabel=', cars/hr', color=colors[i])

plt.tight_layout(pad=2)

print(get_comm_describes([day, night], 'traffic_volume', ['DAYTIME', 'NIGHTTIME']))


# **Daytime:**
# We observe that the traffic volume distribution is skewed to the left. This means that most of the traffic volume values are high:
#
# - The most frequent traffic volume values are between 4,000 and 6,000 cars/hour, with an average of 4,762 cars/hour.
# - In approximately 25% of the daytime hours, 4,252 cars or fewer passed through the station. On the other hand, in about 25% of the daytime hours there were 5,559 cars or more.
#
# **Nighttime:**
# We note that the distribution of traffic volume is skewed to the right. This means that most of the traffic volume values are low:
#
# - Most traffic volume values range between 0 and 1,000, followed by values between 2,500 and 3,500, and finally the range between 5,000 and 6,000, with an average of 1,785 cars/hour.
# - In approximately 25% of the nighttime hours, 530 cars or fewer passed through the station. On the other hand, in about 25% of the nighttime hours there were 2,819 cars or more.
#
# > *The nighttime traffic volume shows a distribution with 3 very noticeable peaks, which we saw earlier in the general histogram; in particular, the peaks located further to the left.*
#
# Although there are still measurements of more than 5,000 cars per hour, nighttime traffic is generally light.
#
# ***
# ## Time Indicators
#
# Now, let's examine the indicators of heavy traffic as a function of time. There may be more traffic in a given month, on a given day, or at a given time of day.
#
# We will make some line graphs showing how the traffic volume changes with each of these time units.
#
# To get an overview, let's start by plotting the traffic intensity by year.

# In[10]:


by_year_day = group_by_new_col_dt(day, 'year', 'date_time')
by_year_night = group_by_new_col_dt(night, 'year', 'date_time')

dfs = [by_year_day, by_year_night]
titles = ['Average Traffic by Year - Daytime', 'Average Traffic by Year - Nighttime']
# colors = ['deepskyblue', 'slateblue']

plt.figure(figsize=(12, 4))

for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_line_plot(dfs[i], 'traffic_volume', title=titles[i], xlabel='Year',
                  ylabel='Traffic volume, cars/hr', color=colors[i])

plt.tight_layout(pad=2)


# In the daytime graph, we immediately notice that the lowest traffic volume corresponds to 2016, followed by the maximum peak in 2017 (this peak is also observed in the nighttime traffic).
#
# The 2016 low may have been a consequence of general highway works, which caused a temporary reduction in traffic volume. This, in turn, may have led to an increase in traffic volume the following year.
#
# As for the 2017 peak in nighttime traffic volume, it was most likely caused by the adjustment of the [speed limit on I-94 west of Minneapolis](https://www.mprnews.org/story/2017/06/06/i94-construction-zone-speed-limit-reduced) due to construction work on the highway at that time.

# In[11]:


by_month_day = group_by_new_col_dt(day, 'month', 'date_time', dt=1)
by_month_night = group_by_new_col_dt(night, 'month', 'date_time', dt=1)

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
dfs = [by_month_day, by_month_night]
titles = ['Average Traffic by Month - Daytime', 'Average Traffic by Month - Nighttime']
# colors = ['deepskyblue', 'slateblue']

plt.figure(figsize=(12, 4))

for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_line_plot(dfs[i], 'traffic_volume', title=titles[i], xticks=[1, 12], labels=months,
                  xmin=1, xmax=12, ylabel='Traffic volume, cars/hr', color=colors[i], rotation=30)

plt.tight_layout(pad=2)


# Traffic (both daytime and nighttime) seems less intense in the cold months (November-February) and more intense in the warm months (March-October), with the exception of July.
#
# > *The June peak we observe in nighttime traffic is most likely due to the adjustment of the speed limit mentioned above.*
#
# Next, let's see how the traffic volume has changed each year in July.

# In[12]:


# numeric_only=True skips the text columns, which recent pandas versions refuse to average.
july_day = day[day['month'] == 7].groupby('year').mean(numeric_only=True)
july_night = night[night['month'] == 7].groupby('year').mean(numeric_only=True)

dfs = [july_day, july_night]
titles = ['Traffic volume in July - Daytime', 'Traffic volume in July - Nighttime']
# colors = ['deepskyblue', 'slateblue']

plt.figure(figsize=(12, 4))

for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_line_plot(dfs[i], 'traffic_volume', title=titles[i], xlabel='Year',
                  ylabel='Traffic volume, cars/hr', color=colors[i])

plt.tight_layout(pad=2)


# Generally, traffic is quite heavy in July both during the day and at night (within the usual range for each scenario), similar to the other warm months. The only exception we see is 2016, which had a sharp drop in traffic volume. One possible reason for this is highway construction - [this article from 2016](https://www.crainsdetroit.com/article/20160728/NEWS/160729841/weekend-construction-i-96-us-23-bridge-work-i-94-lane-closures-i-696) supports this hypothesis.
#
# As a tentative conclusion, we can say that warm months generally show higher traffic compared to cold months. In a warm month, one can expect a traffic volume close to 5,000 cars for each daytime hour and close to 2,000 cars for each nighttime hour.
#
# **Let's now look at a more granular indicator: Day of the Week**

# In[13]:


by_dw_day = group_by_new_col_dt(day, 'day_of_week', 'date_time', dt=2)
by_dw_night = group_by_new_col_dt(night, 'day_of_week', 'date_time', dt=2)

dweeks = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dfs = [by_dw_day, by_dw_night]
titles = ['Traffic by Day of Week - Daytime', 'Traffic by Day of Week - Nighttime']
# colors = ['deepskyblue', 'slateblue']

plt.figure(figsize=(12, 4))

for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_line_plot(dfs[i], 'traffic_volume', title=titles[i], xticks=[0, 6], labels=dweeks,
                  xmin=0, xmax=6, ylabel='Traffic volume, cars/hr', color=colors[i], rotation=45)

plt.tight_layout(pad=2)


# Traffic volume is significantly higher on weekdays (Monday to Friday).
#
# **Daytime:**
# Except for Monday, values above 5,000 are only observed on weekdays. Traffic is lighter on weekends, with values below 4,000 cars.
#
# **Nighttime:**
# Except for Monday, only values above 1,800 are observed during weekdays. Traffic is lighter on weekends, with values around 1,600 cars or less.
#
# ### Traffic Volume by Hour
#
# To analyze traffic volume by hour, we should take into account that the weekends would drag down the average values. Hence, it makes sense to look at the averages separately for business days and weekends.
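# The business-day/weekend split described above can be sketched on a few made-up rows. This is only an illustration under assumptions: the readings are hypothetical (Thursday 2018-09-27 through Sunday 2018-09-30), and the names `toy`, `business_by_hour` and `weekend_by_hour` do not appear in the notebook.

```python
import pandas as pd

# Hypothetical hourly readings for a Thursday, Friday, Saturday and Sunday.
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2018-09-27 07:00", "2018-09-27 16:00",
        "2018-09-28 07:00", "2018-09-28 16:00",
        "2018-09-29 07:00", "2018-09-29 16:00",
        "2018-09-30 07:00", "2018-09-30 16:00",
    ]),
    "traffic_volume": [6100, 6200, 5900, 6400, 1500, 4300, 1300, 4100],
})

toy["hour"] = toy["date_time"].dt.hour
# dayofweek: Monday == 0 ... Sunday == 6, so values 0-4 are business days.
is_business_day = toy["date_time"].dt.dayofweek <= 4

# Average each hour separately for the two kinds of day.
business_by_hour = toy[is_business_day].groupby("hour")["traffic_volume"].mean()
weekend_by_hour = toy[~is_business_day].groupby("hour")["traffic_volume"].mean()

print(business_by_hour)
print(weekend_by_hour)
```

# Averaging all eight rows together would give 3,725 cars at 7:00, a value that describes neither the business days (6,000) nor the weekend (1,400) in this toy sample, which is why the two groups are kept apart.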

# In[14]:


day_hour_business, day_hour_weekend = group_by_new_col_dt(
    day, 'hour', 'date_time', dt=3, col_sep_dweek='day_of_week')
night_hour_business, night_hour_weekend = group_by_new_col_dt(
    night, 'hour', 'date_time', dt=3, col_sep_dweek='day_of_week')

# Tick labels for hours 0 to 23; the daytime hours (7 to 18) are left blank.
night_hours = ['0', '1', '2', '3', '4', '5', '6'] + [''] * 12 + ['19', '20', '21', '22', '23']

plt.figure(figsize=(14, 4))

# Plot traffic volume by hour - daytime
plt.subplot(1, 2, 1)
gen_line_plot(day_hour_business, 'traffic_volume', label='Business Days',
              xticks=[7, 18], xmin=7, xmax=18,
              xlabel='Time', ylabel='Traffic volume, cars/hr', color='blue')
gen_line_plot(day_hour_weekend, 'traffic_volume', title='Traffic volume by hour: Daytime',
              label='Weekends', xticks=[7, 18], despine=True,
              xmin=7, xmax=18, ymin=1500, ymax=6250,
              xlabel='Time', ylabel='Traffic volume, cars/hr', color='red')
plt.legend()

# Plot traffic volume by hour - nighttime
plt.subplot(1, 2, 2)
gen_line_plot(night_hour_business, 'traffic_volume', label='Business Days',
              xticks=[0, 23], labels=night_hours, xmin=-0.5, xmax=23.5,
              xlabel='Time', ylabel='Traffic volume, cars/hr',
              linestyle='', color='blue', marker='.')
gen_line_plot(night_hour_weekend, 'traffic_volume', title='Traffic volume by hour: Nighttime',
              label='Weekends', xticks=[0, 23], labels=night_hours, despine=True,
              xmin=-0.5, xmax=23.5, ymin=0, ymax=6250,
              xlabel='Time', ylabel='Traffic volume, cars/hr',
              linestyle='', color='red', marker='.')
plt.legend()

plt.tight_layout(pad=2)


# During both daytime and nighttime hours, traffic volumes are generally higher on weekdays than on weekends.
#
# * **Daytime hours:**
# As somewhat expected on weekdays, peak hours are around 7 a.m. and 4 p.m., when most people commute from home to work and back. We see volumes of more than 6,000 cars during peak hours.
# On weekends, by contrast, there are no peaks on the graph. Traffic increases from 7 a.m. to noon, then reaches a plateau, and from 4 p.m. onwards it starts to decrease.
#
#
# * **Nighttime hours:**
# In the weekday plot, we see that traffic gradually decreases from 7:00 p.m. to 2:00 a.m., then increases rapidly from 3:00 a.m. to 6:00 a.m. In the weekend plot, traffic reaches its minimum at 3:00 a.m. and then increases again from 4:00 a.m. to 6:00 a.m., but only slightly.
#
# ***
# ## Weather Indicators
#
# Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: `temp`, `rain_1h`, `snow_1h`, `clouds_all`, `weather_main`, `weather_description`.
#
# A few of these columns are numerical, so let's start by looking up their correlation values with the `traffic_volume` column.

# In[15]:


print('Correlation: Column "traffic_volume" with the meteorological columns (only numerical columns):')
weather_cols = ['temp', 'rain_1h', 'snow_1h', 'clouds_all']
# Select the numeric columns before calling corr(): recent pandas versions raise on text columns.
day[['traffic_volume'] + weather_cols].corr()['traffic_volume'][weather_cols].apply(float_format, args=(3,))


# Temperature shows the strongest correlation, with a value of only +0.13. The other relevant columns (`rain_1h`, `snow_1h`, `clouds_all`) do not show any strong correlation with **traffic volume**.
#
# Next, we'll plot traffic volume against temperature, but first, let's explore the temperature values in the `temp` column.

# In[16]:


print(get_comm_describes([day, night], 'temp', ['DAYTIME', 'NIGHTTIME']))

print('\nTEMPERATURES (K) - DAYTIME', day['temp'].value_counts(dropna=False).sort_index(), sep='\n')
print('\nTEMPERATURES (K) - NIGHTTIME', night['temp'].value_counts(dropna=False).sort_index(), sep='\n')


# Temperatures are on the Kelvin (K) scale. We can see that almost all values are in normal ranges, but we also see some values equal to zero (absolute zero). Since zero Kelvin values do not occur in nature, we'll only consider those records that are in the 240 to 315 K range.
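# To put the 240 to 315 K window in more familiar units, here is a small standalone sketch. The helper names `kelvin_to_celsius` and `is_plausible_temp` and the sample readings are hypothetical; the plots in this analysis apply the same window through their x-axis limits rather than through a helper like this.

```python
def kelvin_to_celsius(temp_k: float) -> float:
    """Convert a temperature from kelvin to degrees Celsius."""
    return temp_k - 273.15


def is_plausible_temp(temp_k: float, low: float = 240.0, high: float = 315.0) -> bool:
    """True when a reading falls inside the window kept for the temperature plots."""
    return low <= temp_k <= high


# Hypothetical readings; 0.0 stands in for the zero-kelvin records noted above.
readings_k = [0.0, 255.4, 288.15, 310.0]
kept = [t for t in readings_k if is_plausible_temp(t)]

print(kept)
print(round(kelvin_to_celsius(240.0), 2), round(kelvin_to_celsius(315.0), 2))
```

# The window corresponds to roughly -33 °C to 42 °C, so it discards the zero readings without cutting into ordinary winter or summer temperatures.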

# In[17]:


dfs = [day, night]
titles = ['Traffic Volume at Different Temperatures\nDaytime',
          'Traffic Volume at Different Temperatures\nNighttime']
colors = ['deepskyblue', 'slateblue']

plt.figure(figsize=(12, 6))

for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_scatter_plot(dfs[i], 'temp', 'traffic_volume', titles[i], 'Temperature (K)',
                     'Traffic volume, cars/hr', xmin=240, xmax=315, color=colors[i])

plt.tight_layout(pad=2)


# We can conclude that in both daytime and nighttime traffic the temperature is not a strong indicator of heavy traffic.
#
# Let's now look at the other weather-related columns: `weather_main` and `weather_description`.
#
# ***
# ## Weather Types
#
# To start, we're going to group the data by `weather_main` and look at the `traffic_volume` averages.

# In[18]:


# numeric_only=True skips the text columns, which recent pandas versions refuse to average.
weather_main_day = (day.groupby('weather_main').mean(numeric_only=True)
                    .sort_values('traffic_volume').reset_index())
weather_main_night = (night.groupby('weather_main').mean(numeric_only=True)
                      .sort_values('traffic_volume').reset_index())

dfs = [weather_main_day, weather_main_night]
titles = ['Traffic Volume by Weather Main\nDaytime', 'Traffic Volume by Weather Main\nNighttime']
xv_lines = [5000, 2000]
# colors = ['deepskyblue', 'slateblue']

plt.figure(figsize=(14, 6))

for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_barh_plot(dfs[i], 'weather_main', 'traffic_volume', titles[i],
                  xlabel='Traffic volume, cars/hr', color=colors[i], xv_line=xv_lines[i])

plt.tight_layout(pad=2)


# In the graph with daytime data, no weather type has an average traffic volume above 5,000 cars. Similarly, in the graph with nighttime data, no weather type exceeds an average of 2,000 cars. This makes it more difficult to find a heavy traffic indicator.
#
# Next, we'll group by `weather_description`, which has a more granular weather classification.
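# With a more granular label, some groups may contain only a handful of rows, so it can help to look at the row count next to each average. The snippet below is a hedged sketch on made-up numbers: the two labels echo categories mentioned in this analysis, but the values and the names `toy` and `summary` are hypothetical and this check is not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows; the traffic values are made up for illustration.
toy = pd.DataFrame({
    "weather_description": [
        "light rain and snow", "light rain and snow", "light rain and snow",
        "shower snow",
    ],
    "traffic_volume": [4800, 4600, 5000, 5700],
})

# Mean and number of rows per weather description, sorted by the mean.
summary = (
    toy.groupby("weather_description")["traffic_volume"]
    .agg(["mean", "count"])
    .sort_values("mean")
)
print(summary)
```

# In this toy sample the highest average rests on a single row, which is the kind of case where a high mean says little about typical traffic.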

# In[19]:


# numeric_only=True skips the text columns, which recent pandas versions refuse to average.
weather_description_day = (day.groupby('weather_description').mean(numeric_only=True)
                           .sort_values('traffic_volume').reset_index())
# colors = ['deepskyblue', 'slateblue']
# xv_lines = [5000, 2000]

plt.figure(figsize=(7, 12))
gen_barh_plot(weather_description_day, 'weather_description', 'traffic_volume',
              'Traffic Volume by Weather Description\nDaytime',
              xlabel='Traffic volume, cars/hr', color=colors[0], xv_line=xv_lines[0])


# In the daytime data, we see that there are three weather types where traffic volume exceeds 5,000 cars: **Shower snow, Light rain and snow, and Proximity thunderstorm with drizzle**.
#
# It's not clear why these types of weather have the highest average traffic values; the weather is bad, but not severe. However, we must bear in mind that really bad weather conditions are generally forecast in advance, so perhaps people plan ahead and avoid traveling on those days.

# In[20]:


weather_description_night = (night.groupby('weather_description').mean(numeric_only=True)
                             .sort_values('traffic_volume').reset_index())
# colors = ['deepskyblue', 'slateblue']
# xv_lines = [5000, 2000]
xv_lines[1] = 2500

plt.figure(figsize=(7, 12))
gen_barh_plot(weather_description_night, 'weather_description', 'traffic_volume',
              'Traffic Volume by Weather Description\nNighttime',
              xlabel='Traffic volume, cars/hr', color=colors[1], xv_line=xv_lines[1])


# In the nighttime data, we see that there are two weather types where the traffic volume exceeds 2,500 cars: **Proximity shower rain and Light intensity shower rain**.
#
# Similar to the daytime scenario, the weather is bad, but not severe; however, it may be enough for people to slow down to avoid accidents and thus cause an increase in traffic intensity.
#
# ***
# ## Traffic Volume on Holidays
#
# Let's start by exploring the frequency table for this field.

# In[21]:


print('HOLIDAYS (FREQUENCY):\n', i_94['holiday'].value_counts(), sep='\n')


# We can see that there are few records for holidays and many records with values equal to **'None'**. Perhaps some holidays were overlooked and recorded as ordinary days.
#
# Next, let's find the earliest and latest dates in the dataset.

# In[22]:


earliest_date = i_94['date_time'].min()
latest_date = i_94['date_time'].max()

print('Earliest date: {}\nLatest date: {}'.format(earliest_date, latest_date))


# Now we'll collect all the holidays that fall within this date range (using [this website](https://www.calendarpedia.com/holidays/federal-holidays-2012.html) as a guide). We'll then update the dataset with the collected holidays, and finally we'll redisplay the frequency table for the `holiday` column.

# In[23]:


i_94['date'] = i_94['date_time'].dt.strftime('%Y-%m-%d')

# Holidays whose (observed) date changes from year to year:
var_holidays = {
    'Labor Day': ('2013-09-02', '2014-09-01', '2015-09-07',
                  '2016-09-05', '2017-09-04', '2018-09-03'),
    'Thanksgiving Day': ('2012-11-22', '2013-11-28', '2014-11-27',
                         '2015-11-26', '2016-11-24', '2017-11-23'),
    'Martin Luther King Jr Day': ('2013-01-21', '2014-01-20', '2015-01-19',
                                  '2016-01-18', '2017-01-16', '2018-01-15'),
    'Columbus Day': ('2012-10-08', '2013-10-14', '2014-10-13',
                     '2015-10-12', '2016-10-10', '2017-10-09'),
    'Veterans Day': ('2012-11-11', '2012-11-12', '2013-11-11', '2014-11-11',
                     '2015-11-11', '2016-11-11', '2017-11-10', '2017-11-11'),
    'Washingtons Birthday': ('2013-02-18', '2014-02-17', '2015-02-16',
                             '2016-02-15', '2017-02-20', '2018-02-19'),
    'Memorial Day': ('2013-05-27', '2014-05-26', '2015-05-25',
                     '2016-05-30', '2017-05-29', '2018-05-28'),
}

# Label every hourly row that falls on one of the listed dates.
for holi, dates in var_holidays.items():
    for date in dates:
        i_94.loc[i_94['date'] == date, 'holiday'] = holi

# Fixed holidays (matched on the month-day part of the date):
i_94.loc[i_94['date'].str[5:] == '07-04', 'holiday'] = 'Independence Day'
i_94.loc[i_94['date'].str[5:] == '12-25', 'holiday'] = 'Christmas Day'
i_94.loc[i_94['date'].str[5:] == '01-01', 'holiday'] = 'New Years Day'

print('UPDATED HOLIDAYS (FREQUENCY):\n', i_94['holiday'].value_counts(), sep='\n')


# We can see that there were indeed unrecorded holidays.
# Now that we have the updated data, we can visualize the volume of traffic by holiday.

# In[24]:


filter_holidays = (i_94['holiday'] != 'None') & (i_94['holiday'] != 'State Fair')
# numeric_only=True skips the text columns, which recent pandas versions refuse to average.
holiday_bh_plot = (i_94[filter_holidays].groupby('holiday').mean(numeric_only=True)
                   .sort_values('traffic_volume').reset_index())

gen_barh_plot(holiday_bh_plot, 'holiday', 'traffic_volume', 'Traffic volume by holiday',
              xlabel='Traffic volume, cars/hr', color='lightsteelblue', xv_line=3000)


# The holiday with the highest average traffic volume, around **3,500 cars/hour**, is **Columbus Day**. We should keep in mind that the data we are analyzing in this project come from **Minnesota**, where this holiday has been known as **Indigenous Peoples' Day** [since 2014](https://www.npr.org/sections/thetwo-way/2014/04/27/307445328/minneapolis-renames-columbus-day-as-indigenous-peoples-day). One possible reading of this result is that the holiday is widely celebrated there, which would explain why it shows the highest traffic volume.
#
# Meanwhile, the holidays with the least traffic are **Christmas Day** and **New Year's Day**. This is probably because people prefer to spend these holidays at home with their families or friends.
#
# ***
# ## Conclusion
#
# In this project, we tried to find some indicators of heavy westbound traffic on Interstate Highway I-94. The data were captured by a station halfway between Minneapolis and Saint Paul.
# Below, we show three types of indicators associated with heavy traffic.
#
# **Time indicators:**
#
# Traffic tends to be heaviest during the day (7:00 am to 7:00 pm), particularly in the warm months (March to October) compared to the cold months (November to February).
# Traffic volume is higher on weekdays than on weekends. Peak hours are around 7 am and 4 pm, with about 6,000 cars/hour.
#
# The lowest average traffic volume was recorded in 2016, possibly due to works on the highway in that year. In 2017, however, the maximum peak was recorded, presumably due to a speed limit adjustment while works on the highway continued.
#
#
# **Weather indicators:**
#
# Daytime (> 5,000 cars/hour):
# - Snow with showers.
# - Light rain and snow.
# - Proximity storm with drizzle.
#
# Nighttime (> 2,500 cars/hour):
# - Proximity shower.
# - Light intensity shower.
#
#
# **Holiday indicators:**
#
# The holiday with the highest traffic intensity is ***Columbus Day*** (also known as ***Indigenous Peoples' Day*** in Minnesota), while the holidays with the lowest intensity are 'Christmas Day' and 'New Year's Day', presumably because people prefer to stay at home to celebrate with family or friends.
#
# > *Temperature doesn't influence traffic intensity.*