In this project, we're going to analyze a dataset on westbound traffic on Interstate Highway I-94. John Hogue made the dataset available, and you can download it from the UCI Machine Learning Repository.
The following are descriptions for each column in the dataset:
- holiday – US national holidays plus a regional holiday, the Minnesota State Fair
- temp – average temperature in kelvin
- rain_1h – the amount of rain per hour, in mm
- snow_1h – the amount of snow per hour, in mm
- clouds_all – % of cloud cover
- weather_main – a short textual description of the current weather
- weather_description – a longer textual description of the current weather
- date_time – the time of the data collected, in local CST time
- traffic_volume – hourly I-94 ATR 301 reported westbound traffic volume

The indicators with the highest traffic volume are shown below:

Time Indicators | Description | AVG Traffic Volume (approximate), cars/hr |
---|---|---|
Daytime hours | Business days (7 to 19) | 4,000 to 6,000 |
Early morning | Business days (at 6) | 5,500 |
Peak hours | Business days (7 and 16) | 6,000 |
Day of week | Friday (daytime) | 5,500 |
Warm months | March to October (except July) | 4,900 |
Peak year | 2017 - Due to road works (daytime) | 5,000 |

Weather Description | AVG Traffic Volume, cars/hr |
---|---|
Shower snow | > 5,000 (Daytime) |
Light rain and snow | > 5,000 (Daytime) |
Proximity thunderstorm with drizzle | > 5,000 (Daytime) |
Proximity shower rain | > 2,500 (Nighttime) |
Light intensity shower rain | > 2,500 (Nighttime) |

Holiday | AVG Traffic Volume, cars/hr |
---|---|
Columbus Day (Indigenous Peoples' Day) | About 3,500 |

We conclude that temperature is not a reliable indicator of heavy traffic: its correlation with traffic volume is weak (about +0.13).
Let's start by importing all the libraries we'll need and begin by reading the dataset into pandas.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
i_94 = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
i_94.head()
index | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
i_94.tail()
index | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
i_94.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
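The dataset has 48,204 rows and 9 columns. According to `info()`, no column contains null values; in the `holiday` column, however, most rows hold the string 'None', which marks hours with no recorded holiday rather than a truly missing value. Each row describes the traffic and weather data for a specific hour.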
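To see how "no holiday" is encoded, it helps to compare a null check with a plain string comparison. The snippet below is a minimal sketch on a hypothetical three-row Series (not the real column) that mimics the values `head()` displays for `holiday`.

```python
import pandas as pd

# Hypothetical three-row column mimicking what head() displays for `holiday`.
holiday = pd.Series(['None', 'Labor Day', 'None'], name='holiday')

# Real missing values (NaN or None objects) are what isna() detects.
n_missing = int(holiday.isna().sum())
# Rows that hold the text 'None' are ordinary strings, so isna() ignores them.
n_placeholder = int((holiday == 'None').sum())

print('Missing values:', n_missing)
print("Rows equal to the string 'None':", n_placeholder)
```

On the real data, the same two expressions applied to `i_94['holiday']` would show which encoding the loaded file uses.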
To Consider
The dataset documentation mentions that a station located approximately halfway between Minneapolis and Saint Paul records the traffic data. For this station, the direction of the route is west (that is, cars are moving from east to west). This means that the results of our analysis will refer to westbound traffic in the vicinity of the station.
We must avoid generalizing our results to the entire I-94 freeway.
Before continuing, we'll implement some functions that'll allow us to have a better view of the values and save lines of code. This will make it easier for us to explore and analyze the data.
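def float_format(value, n=2):
    """Round to n decimals and format with thousands separators."""
    value = round(value, n)
    return '{:,}'.format(value)

def bins_number(df):
    """Number of histogram bins from the row count: k = 1 + 3.32 * log10(n)."""
    n = df.shape[0]
    k = 1 + 3.32 * np.log10(n)
    return int(round(k))
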
def gen_hist_plot(df, col_name, bins=10, extra_title='', extra_xlabel='', color=None):
    """Draw a histogram of df[col_name] on the current axes."""
    x_label = col_name.replace('_', ' ').title()
    title = x_label + extra_title
    x_label += extra_xlabel
    plt.hist(df[col_name], bins=bins, color=color, edgecolor='gray')
    plt.title(title, fontsize=18)
    plt.xlabel(x_label, fontsize=14)
    plt.ylabel('Frequency', fontsize=14)
    sns.despine()

def get_comm_describes(lst_dfs, comm_col, lst_names):
    """Return the describe() stats of one common column, side by side for several DataFrames."""
    lst_desc = []
    indexes = len(lst_dfs)
    for i in range(indexes):
        lst_desc.append(lst_dfs[i][comm_col].describe())
    df_desc = pd.concat(lst_desc, axis=1)
    col_title = comm_col.replace('_', ' ').upper()
    df_desc.index.name = col_title
    df_desc.columns = lst_names
    for i in range(indexes):
        df_desc[lst_names[i]] = df_desc[lst_names[i]].apply(float_format)
    return df_desc

def gen_line_plot(df, col_name, title='', label=None, xlabel=None, ylabel=None, xticks=None, labels=None, despine=True,
                  xmin=None, xmax=None, ymin=None, ymax=None, linestyle=None, color=None, marker=None, rotation=None):
    """Draw df[col_name] against the index of df as a line plot on the current axes."""
    plt.plot(df[col_name], label=label, color=color, linestyle=linestyle, marker=marker)
    plt.title(title, fontsize=18)
    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel(ylabel, fontsize=14)
    if xticks:
        # xticks = [first, last] -> one tick per integer in that range
        ticks = list(range(xticks[0], xticks[1] + 1))
    else:
        ticks = None
    plt.xticks(ticks=ticks, labels=labels, fontsize=12, rotation=rotation)
    plt.yticks(fontsize=12)
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)
    if despine:
        sns.despine()

def group_by_new_col_dt(df, new_group_col, dt_col, dt=0, col_sep_dweek=None):
    """Add a datetime-part column to df (in place) and return the group means.

    dt=0 -> Group by year.
    dt=1 -> Group by month.
    dt=2 -> Group by day of the week (0 = Monday, 6 = Sunday).
    dt=3 -> Group by hour.
    If col_sep_dweek is given, return (business_days_means, weekend_means)."""
    if dt == 0:
        df[new_group_col] = df[dt_col].dt.year
    elif dt == 1:
        df[new_group_col] = df[dt_col].dt.month
    elif dt == 2:
        df[new_group_col] = df[dt_col].dt.dayofweek
    elif dt == 3:
        df[new_group_col] = df[dt_col].dt.hour
    # numeric_only=True keeps text columns (e.g. weather_main) out of the mean
    if col_sep_dweek:
        # 0 to 4 == Monday to Friday
        business_days = df[df[col_sep_dweek] <= 4]
        # 5 and 6 == Saturday and Sunday
        weekend = df[df[col_sep_dweek] >= 5]
        hour_business = business_days.groupby(new_group_col).mean(numeric_only=True)
        hour_weekend = weekend.groupby(new_group_col).mean(numeric_only=True)
        return hour_business, hour_weekend
    return df.groupby(new_group_col).mean(numeric_only=True)

def gen_scatter_plot(df, col_1, col_2, title, xlabel, ylabel, xmin=None, xmax=None, color=None):
    """Draw a semi-transparent scatter plot of df[col_2] against df[col_1]."""
    plt.scatter(df[col_1], df[col_2], color=color, alpha=0.1)
    plt.title(title, fontsize=18)
    plt.xlim(xmin, xmax)
    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel(ylabel, fontsize=14)
    sns.despine()

def gen_barh_plot(df, col_x, col_y, title, xlabel=None, color=None, xv_line=None):
    """Draw horizontal bars (categories in col_x, values in col_y), with an optional vertical reference line."""
    plt.barh(df[col_x], df[col_y], height=0.4, color=color)
    plt.title(title, fontsize=18)
    plt.xlabel(xlabel, fontsize=14)
    plt.ylabel('')
    plt.tick_params(left=False)
    if xv_line:
        plt.axvline(x=xv_line, color='dimgrey', linewidth=0.2)
    sns.despine(left=True)
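The formula in `bins_number` appears to follow Sturges' rule, which suggests roughly 1 + log2(n) bins for n observations (3.32 is approximately 1 / log10(2)). The standalone sketch below repeats the same arithmetic for the row count reported by `info()`; the name `sturges_bins` is only illustrative.

```python
import numpy as np

def sturges_bins(n_rows):
    """Number of histogram bins for n_rows observations (same formula as bins_number)."""
    k = 1 + 3.32 * np.log10(n_rows)
    return int(round(k))

# 48,204 is the row count reported by i_94.info().
n_bins = sturges_bins(48204)
print(n_bins)  # 1 + 3.32 * 4.683 is about 16.55, which rounds to 17
```

Because the bin count grows with the logarithm of the row count, the daytime and nighttime subsets, each about half the size, end up with only one bin fewer.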
We're going to start our analysis by examining the distribution of the `traffic_volume` column.
col_bins = bins_number(i_94)
gen_hist_plot(i_94, 'traffic_volume', bins=col_bins, extra_title=' Histogram', extra_xlabel=', cars/hr', color='lightsteelblue')
print('TRAFFIC VOLUME STATS\n', i_94['traffic_volume'].describe().apply(float_format), sep='\n')
TRAFFIC VOLUME STATS

count    48,204.0
mean     3,259.82
std      1,986.86
min           0.0
25%       1,193.0
50%       3,380.0
75%       4,933.0
max       7,280.0
Name: traffic_volume, dtype: object
Observations
These observations give our analysis an interesting direction: comparing daytime data with nighttime data.
We'll start by dividing the dataset into two parts:
- Daytime data: hours from 7:00 to 19:00 (12 hours)
- Nighttime data: hours from 19:00 to 7:00 (12 hours)

While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.
i_94['date_time'] = pd.to_datetime(i_94['date_time'])
day = i_94.copy()[(i_94['date_time'].dt.hour >= 7) & (i_94['date_time'].dt.hour < 19)]
night = i_94.copy()[(i_94['date_time'].dt.hour >= 19) | (i_94['date_time'].dt.hour < 7)]
print('Total rows - Day: {:,}'.format(day.shape[0]))
print('Total rows - Night: {:,}'.format(night.shape[0]))
Total rows - Day: 23,877
Total rows - Night: 24,327
The difference in row counts between day and night (450 rows) is due to a few hours of missing data. For instance, rows 176 and 177 show that there's no data for two hours (4 and 5).
i_94.iloc[176:178]
index | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
176 | None | 281.17 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-10 03:00:00 | 361 |
177 | None | 281.25 | 0.0 | 0.0 | 92 | Clear | sky is clear | 2012-10-10 06:00:00 | 5875 |
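Gaps like the one between rows 176 and 177 can also be located programmatically. The following is a minimal sketch on four hypothetical timestamps; it assumes the rows are sorted by time and uses `Series.diff()` to measure the step between consecutive readings.

```python
import pandas as pd

# Hypothetical hourly timestamps with the 04:00 and 05:00 readings missing,
# mirroring rows 176 and 177 of the dataset.
stamps = pd.Series(pd.to_datetime([
    '2012-10-10 02:00:00',
    '2012-10-10 03:00:00',
    '2012-10-10 06:00:00',
    '2012-10-10 07:00:00',
]))

# Time elapsed since the previous row; anything above one hour is a gap.
step = stamps.diff()
gaps = stamps[step > pd.Timedelta(hours=1)]

# Number of hourly readings missing before each flagged row.
missing_hours = (step[gaps.index] / pd.Timedelta(hours=1) - 1).astype(int)
print(pd.DataFrame({'resumes_at': gaps, 'missing_hours': missing_hours}))
```

Applied to `i_94['date_time']` after the conversion with `pd.to_datetime`, the same idea would list every place where the hourly series skips ahead.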
Now that we've isolated day and night, we're going to look at the histograms of traffic volume side-by-side by using a grid chart.
bins_day = bins_number(day)
bins_night = bins_number(night)
dfs = [day, night]
num_bins = [bins_day, bins_night]
extra_titles = [': Daytime', ': Nighttime']
colors = ['deepskyblue', 'slateblue']
plt.figure(figsize=(12, 4))
for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_hist_plot(dfs[i], 'traffic_volume', bins=num_bins[i], extra_title=extra_titles[i],
                  extra_xlabel=', cars/hr', color=colors[i])
plt.tight_layout(pad=2)
print(get_comm_describes([day, night], 'traffic_volume', ['DAYTIME', 'NIGHTTIME']))
                 DAYTIME NIGHTTIME
TRAFFIC VOLUME
count           23,877.0  24,327.0
mean            4,762.05  1,785.38
std             1,174.55  1,441.95
min                  0.0       0.0
25%              4,252.0     530.0
50%              4,820.0   1,287.0
75%              5,559.0   2,819.0
max              7,280.0   6,386.0
Daytime: The traffic volume distribution is skewed to the left. This means that most of the traffic volume values are high: 75% of the hourly values are 4,252 cars or more (the 25th percentile).
Nighttime: The traffic volume distribution is skewed to the right. This means that most of the traffic volume values are low: 75% of the hourly values are 2,819 cars or fewer (the 75th percentile).
The results of the nighttime traffic volume show a distribution with 3 very noticeable peaks, which we saw earlier in the general histogram; in particular, those peaks located further to the left.
Although there are still measurements of more than 5,000 cars per hour, nighttime traffic is generally light.
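The direction of skew read from the histograms can be cross-checked numerically with `Series.skew()`, which is negative for left-skewed data and positive for right-skewed data. The sketch below uses two hypothetical toy samples, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy samples, not taken from the dataset.
# Most values are high, with a long tail toward low values (left skew).
mostly_high = pd.Series([1, 5, 5, 6, 6, 6, 7, 7, 7, 7])
# Mirror image: most values are low, with a long tail toward high values (right skew).
mostly_low = 8 - mostly_high

skew_high = mostly_high.skew()
skew_low = mostly_low.skew()
print('mostly_high skew: {:.2f}'.format(skew_high))
print('mostly_low skew: {:.2f}'.format(skew_low))
```

In the notebook, `day['traffic_volume'].skew()` and `night['traffic_volume'].skew()` would give the corresponding figures for the two subsets.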
Now, let's examine the indicators of heavy traffic as a function of time. There may be more traffic in a given month, on a given day, or at a given time of day.
We will make some line graphs showing how the traffic volume changes as mentioned above.
To get an overview, let's start by plotting the traffic intensity by years.
by_year_day = group_by_new_col_dt(day, 'year', 'date_time')
by_year_night = group_by_new_col_dt(night, 'year', 'date_time')
dfs = [by_year_day, by_year_night]
titles = ['Average Traffic by Year - Daytime', 'Average Traffic by Year - Nighttime']
# colors = ['deepskyblue', 'slateblue']
plt.figure(figsize=(12, 4))
for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_line_plot(dfs[i], 'traffic_volume', title=titles[i], xlabel='Year',
                  ylabel='Traffic volume, cars/hr', color=colors[i])
plt.tight_layout(pad=2)
In the daytime graph, the lowest average traffic volume corresponds to 2016, and the highest peak follows in 2017 (this peak also appears in the nighttime traffic).
The 2016 dip may have been a consequence of general highway works, which temporarily reduced traffic volume; traffic may then have rebounded the following year.
As for the 2017 peak in nighttime traffic volume, it was most likely caused by the adjustment of the speed limit on I-94 west of Minneapolis due to construction work on the highway at that time.
by_month_day = group_by_new_col_dt(day, 'month', 'date_time', dt=1)
by_month_night = group_by_new_col_dt(night, 'month', 'date_time', dt=1)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
dfs = [by_month_day, by_month_night]
titles = ['Average Traffic by Month - Daytime', 'Average Traffic by Month - Nighttime']
# colors = ['deepskyblue', 'slateblue']
plt.figure(figsize=(12, 4))
for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_line_plot(dfs[i], 'traffic_volume', title=titles[i], xticks=[1, 12], labels=months,
                  xmin=1, xmax=12, ylabel='Traffic volume, cars/hr', color=colors[i], rotation=30)
plt.tight_layout(pad=2)
Traffic (both daytime and nighttime) seems less intense in the cold months (November-February) and more in the warm months (March-October), with the exception of July.
The peak we observed in June in nighttime traffic is most likely due to the adjustment of the speed limit as mentioned above.
Next, let's see how the traffic volume has changed each year in July.
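# numeric_only=True keeps text columns (e.g. weather_main) out of the mean
july_day = day[day['month'] == 7].groupby('year').mean(numeric_only=True)
july_night = night[night['month'] == 7].groupby('year').mean(numeric_only=True)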
dfs = [july_day, july_night]
titles = ['Traffic volume in July - Daytime', 'Traffic volume in July - Nighttime']
# colors = ['deepskyblue', 'slateblue']
plt.figure(figsize=(12, 4))
for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_line_plot(dfs[i], 'traffic_volume', title=titles[i], xlabel='Year',
                  ylabel='Traffic volume, cars/hr', color=colors[i])
plt.tight_layout(pad=2)
Generally, traffic in July is quite heavy both during the day and at night (within the usual range for each scenario), similar to the other warm months. The only exception is 2016, which shows a sharp drop in traffic volume. One possible reason is highway construction; this article from 2016 supports this hypothesis.
As a tentative conclusion, we can say that warm months generally show higher traffic compared to cold months. In a warm month, one can expect for each hour of the day a traffic volume close to 5,000 cars and close to 2,000 cars for each hour of the night.
Let's now look at a more granular indicator: Day of the Week
by_dw_day = group_by_new_col_dt(day, 'day_of_week', 'date_time', dt=2)
by_dw_night = group_by_new_col_dt(night, 'day_of_week', 'date_time', dt=2)
dweeks = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dfs = [by_dw_day, by_dw_night]
titles = ['Traffic by Day of Week - Daytime', 'Traffic by Day of Week - Nighttime']
# colors = ['deepskyblue', 'slateblue']
plt.figure(figsize=(12, 4))
for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_line_plot(dfs[i], 'traffic_volume', title=titles[i], xticks=[0, 6], labels=dweeks,
                  xmin=0, xmax=6, ylabel='Traffic volume, cars/hr', color=colors[i], rotation=45)
plt.tight_layout(pad=2)
Traffic volume is noticeably higher on business days (Monday to Friday) than on weekends.
Daytime: On business days other than Monday, average values exceed 5,000 cars/hr. Traffic is lighter on weekends, with values below 4,000 cars/hr.
Nighttime: On business days other than Monday, average values exceed 1,800 cars/hr. Traffic is lighter on weekends, with values around 1,600 cars/hr or less.
To analyze traffic volume by hour, we should take into account that the weekends would drag down the average values. Hence, it makes sense to look at the averages separately for business days and weekends.
day_hour_business, day_hour_weekend = group_by_new_col_dt(day, 'hour', 'date_time', dt=3, col_sep_dweek='day_of_week')
night_hour_business, night_hour_weekend = group_by_new_col_dt(night, 'hour', 'date_time', dt=3, col_sep_dweek='day_of_week')
night_hours = ['0','1', '2', '3', '4', '5', '6', '', '', '', '', '', '', '', '', '', '', '', '', '19', '20','21','22','23']
plt.figure(figsize=(14, 4))
# Plot traffic volume by hour - daytime
plt.subplot(1, 2, 1)
gen_line_plot(day_hour_business, 'traffic_volume', label='Business Days', xticks=[7, 18], xmin=7, xmax=18,
xlabel='Time', ylabel='Traffic volume, cars/hr', color='blue')
gen_line_plot(day_hour_weekend, 'traffic_volume', title='Traffic volume by hour: Daytime', label='Weekends', xticks=[7, 18],
despine=True, xmin=7, xmax=18, ymin=1500, ymax=6250, xlabel='Time', ylabel='Traffic volume, cars/hr', color='red')
plt.legend()
# Plot traffic volume by hour - nighttime
plt.subplot(1, 2, 2)
gen_line_plot(night_hour_business, 'traffic_volume', label='Business Days', xticks=[0, 23], labels=night_hours,
xmin=-0.5, xmax=23.5, xlabel='Time', ylabel='Traffic volume, cars/hr', linestyle='', color='blue', marker='.')
gen_line_plot(night_hour_weekend, 'traffic_volume', title='Traffic volume by hour: Nighttime', label='Weekends',
xticks=[0, 23], labels=night_hours, despine=True, xmin=-0.5, xmax=23.5, ymin=0, ymax=6250,
xlabel='Time', ylabel='Traffic volume, cars/hr', linestyle='', color='red', marker='.')
plt.legend()
plt.tight_layout(pad=2)
During both daytime and nighttime hours, traffic volumes are generally higher during weekdays compared to weekends.
As somewhat expected on weekdays, peak hours are around 7am and 4pm, when most people commute from home to work and back. We see volumes of more than 6,000 cars during peak hours. On the other hand, on weekends we observe that there are no peaks on the graph. Traffic increases from 7 a.m. to noon, then reaches a plateau and from 4 p.m. onwards it starts to decrease.
In the nighttime plot for business days, traffic gradually decreases from 7:00 p.m. to 2:00 a.m., then increases rapidly from 3:00 a.m. to 6:00 a.m. In the weekend plot, traffic reaches its minimum at 3:00 a.m. and then increases only slightly from 4:00 a.m. to 6:00 a.m.
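Peak hours can also be read off the grouped averages directly instead of from the plot. The sketch below uses hypothetical, made-up hourly averages (not the real results) to illustrate `idxmax()` and `nlargest()`.

```python
import pandas as pd

# Hypothetical hourly averages for business days (made-up numbers, cars/hr),
# indexed by hour of the day from 7 to 18.
hourly_avg = pd.Series(
    [6000, 5500, 4900, 4400, 4600, 4700, 4900, 5300, 5900, 6200, 5800, 4400],
    index=pd.RangeIndex(7, 19, name='hour'),
    name='traffic_volume',
)

peak_hour = hourly_avg.idxmax()          # hour with the highest average
top_two = hourly_avg.nlargest(2).index   # the two busiest hours

print('Busiest hour:', peak_hour)
print('Two busiest hours:', sorted(top_two))
```

With the real data, the same calls on `day_hour_business['traffic_volume']` would return the hours discussed above.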
Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: `temp`, `rain_1h`, `snow_1h`, `clouds_all`, `weather_main`, `weather_description`.
A few of these columns are numerical, so let's start by looking up their correlation values with the `traffic_volume` column.
print('Correlation: Column "traffic_volume" with the meteorological columns (only numerical columns):')
# numeric_only=True restricts the correlation matrix to the numeric columns
day.corr(numeric_only=True)['traffic_volume'][['temp', 'rain_1h', 'snow_1h', 'clouds_all']].apply(float_format, args=(3,))
Correlation: Column "traffic_volume" with the meteorological columns (only numerical columns):
temp           0.128
rain_1h        0.004
snow_1h        0.001
clouds_all    -0.033
Name: traffic_volume, dtype: object
Temperature shows the strongest correlation, with a value of only +0.13. The other relevant columns (`rain_1h`, `snow_1h`, `clouds_all`) do not show any strong correlation with traffic volume.
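To put a correlation of about +0.13 in perspective, it can help to square it: for a simple straight-line fit, the squared Pearson coefficient is the share of variance explained. The short calculation below is only an illustration using the value printed above.

```python
# Correlation between daytime traffic volume and temperature, as printed above.
r_temp = 0.128

# Squaring Pearson's r gives the share of variance a straight-line fit would explain.
r_squared = r_temp ** 2
print('r^2 = {:.4f} ({:.1%} of the variance)'.format(r_squared, r_squared))
```

Under that reading, a linear relationship with temperature would account for less than 2% of the variation in daytime traffic volume.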
Next, we'll plot traffic volume against temperature, but first let's explore the values in the `temp` column.
print(get_comm_describes([day, night], 'temp', ['DAYTIME', 'NIGHTTIME']))
print('\nTEMPERATURES (K) - DAYTIME', day['temp'].value_counts(dropna=False).sort_index(), sep='\n')
print('\nTEMPERATURES (K) - NIGHTTIME', night['temp'].value_counts(dropna=False).sort_index(), sep='\n')
        DAYTIME NIGHTTIME
TEMP
count  23,877.0  24,327.0
mean     282.26    280.17
std        13.3      13.3
min         0.0       0.0
25%      272.68     271.7
50%      283.78    281.38
75%      293.44     290.7
max      310.07    307.68

TEMPERATURES (K) - DAYTIME
0.00      2
243.39    1
243.62    1
245.70    2
246.15    2
         ..
308.87    1
308.95    1
309.08    1
309.29    1
310.07    1
Name: temp, Length: 5111, dtype: int64

TEMPERATURES (K) - NIGHTTIME
0.00      8
244.22    1
244.82    3
244.89    1
245.62    1
         ..
305.49    1
305.52    1
306.02    2
306.29    1
307.68    1
Name: temp, Length: 4972, dtype: int64
Temperatures are on the Kelvin (K) scale. Almost all values are in normal ranges, but a few readings are equal to zero (absolute zero). Since temperatures of zero kelvin do not occur in nature, these readings are almost certainly recording errors, so we'll limit the plots to the 240 to 315 K range.
dfs = [day, night]
titles = ['Traffic Volume at Different Temperatures\nDaytime', 'Traffic Volume at Different Temperatures\nNighttime']
colors = ['deepskyblue', 'slateblue']
plt.figure(figsize=(12, 6))
for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_scatter_plot(dfs[i], 'temp', 'traffic_volume', titles[i], 'Temperature (K)', 'Traffic volume, cars/hr',
                     xmin=240, xmax=315, color=colors[i])
plt.tight_layout(pad=2)
We can conclude that in both daytime and nighttime traffic the temperature is not a strong indicator of heavy traffic.
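The scatter plots hide the 0 K readings by limiting the x-axis, but those rows remain in the data. If they should be dropped before further calculations, one option is `Series.between()`. This is a minimal sketch on a hypothetical four-row frame, not the notebook's own step.

```python
import pandas as pd

# Hypothetical rows; 0.0 K stands in for the faulty readings in the dataset.
sample = pd.DataFrame({
    'temp': [0.0, 243.39, 288.28, 310.07],
    'traffic_volume': [1200, 800, 5545, 4300],
})

# Keep only temperatures inside the plausible range (both bounds inclusive).
valid = sample[sample['temp'].between(240, 315)]
print(valid)
```

The same mask applied to `day` and `night` would remove the 2 and 8 zero readings listed in the frequency tables.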
Let's now look at the other weather-related columns: `weather_main` and `weather_description`.
To start, we're going to group the data by `weather_main` and look at the `traffic_volume` averages.
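# numeric_only=True keeps the remaining text columns out of the mean
weather_main_day = day.groupby('weather_main').mean(numeric_only=True).sort_values('traffic_volume').reset_index()
weather_main_night = night.groupby('weather_main').mean(numeric_only=True).sort_values('traffic_volume').reset_index()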
dfs = [weather_main_day, weather_main_night]
titles = ['Traffic Volume by Weather Main\nDaytime', 'Traffic Volume by Weather Main\nNighttime']
xv_lines = [5000, 2000]
# colors = ['deepskyblue', 'slateblue']
plt.figure(figsize=(14, 6))
for i in range(2):
    plt.subplot(1, 2, i+1)
    gen_barh_plot(dfs[i], 'weather_main', 'traffic_volume', titles[i], xlabel='Traffic volume, cars/hr',
                  color=colors[i], xv_line=xv_lines[i])
plt.tight_layout(pad=2)
In the graph with daytime data, it appears that there are no weather types where traffic volume exceeds 5,000 cars. Similarly, in the graph with nighttime data, the traffic volume did not exceed 2,000 cars. This makes it more difficult to find a heavy traffic indicator.
Next, we'll group by `weather_description`, which has a more granular weather classification.
weather_description_day = day.groupby('weather_description').mean(numeric_only=True).sort_values('traffic_volume').reset_index()
# colors = ['deepskyblue', 'slateblue']
# xv_lines = [5000, 2000]
plt.figure(figsize=(7, 12))
gen_barh_plot(weather_description_day, 'weather_description', 'traffic_volume',
'Traffic Volume by Weather Description\nDaytime', xlabel='Traffic volume, cars/hr',
color=colors[0], xv_line=xv_lines[0])
In the daytime data, we see three weather types where average traffic volume exceeds 5,000 cars/hr: Shower snow, Light rain and snow, and Proximity thunderstorm with drizzle.
It's not clear why these weather types have the highest average traffic values; the weather is bad, but not severe. However, we must bear in mind that really bad weather conditions are generally forecast in advance, so perhaps people plan ahead and avoid traveling on those days.
weather_description_night = night.groupby('weather_description').mean(numeric_only=True).sort_values('traffic_volume').reset_index()
# colors = ['deepskyblue', 'slateblue']
# xv_lines = [5000, 2000]
xv_lines[1] = 2500
plt.figure(figsize=(7, 12))
gen_barh_plot(weather_description_night, 'weather_description', 'traffic_volume',
'Traffic Volume by Weather Description\nNighttime', xlabel='Traffic volume, cars/hr',
color=colors[1], xv_line=xv_lines[1])
In the nighttime data, we see two weather types where average traffic volume exceeds 2,500 cars/hr: Proximity shower rain and Light intensity shower rain.
As in the daytime scenario, the weather is bad but not severe. One possible explanation is that drivers slow down to avoid accidents, which could raise traffic intensity, although the data alone cannot confirm this.
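When group averages look surprising, it can be worth checking how many rows stand behind each one, since a weather type observed only a handful of times can produce an extreme mean. The sketch below uses hypothetical observations to show `agg(['mean', 'count'])`; it makes no claim about the actual group sizes in the dataset.

```python
import pandas as pd

# Hypothetical observations: one weather type appears only once.
obs = pd.DataFrame({
    'weather_description': ['sky is clear'] * 4 + ['shower snow'],
    'traffic_volume': [4000, 5000, 4500, 4900, 5600],
})

# Average traffic volume and number of rows per weather type.
summary = (obs.groupby('weather_description')['traffic_volume']
              .agg(['mean', 'count'])
              .sort_values('mean', ascending=False))
print(summary)
```

Running the same aggregation on `day` and `night` would show whether the top-ranked weather types rest on many hours or only a few.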
The last indicator we'll examine is the `holiday` column. Let's start by exploring its frequency table.
print('HOLIDAYS (FREQUENCY):\n', i_94['holiday'].value_counts(), sep='\n')
HOLIDAYS (FREQUENCY):

None                         48143
Labor Day                        7
Thanksgiving Day                 6
Christmas Day                    6
New Years Day                    6
Martin Luther King Jr Day        6
Columbus Day                     5
Veterans Day                     5
Washingtons Birthday             5
Memorial Day                     5
Independence Day                 5
State Fair                       5
Name: holiday, dtype: int64
We can see that there are few records for holidays and many records with the value 'None'. Perhaps some holiday hours were overlooked and recorded as ordinary days.
Next, let's find the earliest and latest dates in the dataset.
oldest_date = min(i_94['date_time'])
youngest_date = max(i_94['date_time'])
print('Oldest date: {}\nYoungest date: {}'.format(oldest_date, youngest_date))
Oldest date: 2012-10-02 09:00:00
Youngest date: 2018-09-30 23:00:00
Now we'll collect all the holidays that fall within this date range (using this website as a guide). We'll then update the dataset with the collected holidays and, finally, redisplay the frequency table for the `holiday` column.
i_94['date'] = i_94['date_time'].dt.strftime('%Y-%m-%d')
# Non-fixed holidays:
var_holidays = {'Labor Day': ('2013-09-02', '2014-09-01', '2015-09-07', '2016-09-05', '2017-09-04', '2018-09-03'),
'Thanksgiving Day': ('2012-11-22', '2013-11-28', '2014-11-27', '2015-11-26', '2016-11-24', '2017-11-23'),
'Martin Luther King Jr Day': ('2013-01-21', '2014-01-20', '2015-01-19', '2016-01-18', '2017-01-16', '2018-01-15'),
'Columbus Day': ('2012-10-08', '2013-10-14', '2014-10-13', '2015-10-12', '2016-10-10', '2017-10-09'),
'Veterans Day': ('2012-11-11', '2012-11-12', '2013-11-11', '2014-11-11', '2015-11-11', '2016-11-11', '2017-11-10', '2017-11-11'),
'Washingtons Birthday': ('2013-02-18', '2014-02-17', '2015-02-16', '2016-02-15', '2017-02-20', '2018-02-19'),
'Memorial Day': ('2013-05-27', '2014-05-26', '2015-05-25', '2016-05-30', '2017-05-29', '2018-05-28')
}
for holi, dates in var_holidays.items():
    for date in dates:
        i_94.loc[i_94['date'] == date, 'holiday'] = holi
# Fixed holidays:
i_94.loc[i_94['date'].str[5:] == '07-04', 'holiday'] = 'Independence Day'
i_94.loc[i_94['date'].str[5:] == '12-25', 'holiday'] = 'Christmas Day'
i_94.loc[i_94['date'].str[5:] == '01-01', 'holiday'] = 'New Years Day'
print('UPDATED HOLIDAYS (FREQUENCY):\n', i_94['holiday'].value_counts(), sep='\n')
UPDATED HOLIDAYS (FREQUENCY):

None                         46742
Veterans Day                   198
Christmas Day                  167
Independence Day               162
Labor Day                      157
Martin Luther King Jr Day      142
Washingtons Birthday           136
Thanksgiving Day               135
Memorial Day                   134
New Years Day                  114
Columbus Day                   112
State Fair                       5
Name: holiday, dtype: int64
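The nested loop over `var_holidays` could also be written as a single lookup with `Series.map()`. The sketch below is an alternative under stated assumptions: it uses a hypothetical three-row frame, and `date_to_holiday` is an illustrative flat dictionary with one entry, not the notebook's structure.

```python
import pandas as pd

# Hypothetical mini-frame: only the first hour of the holiday carries its name.
df = pd.DataFrame({
    'date_time': pd.to_datetime(['2013-09-02 00:00:00', '2013-09-02 01:00:00',
                                 '2013-09-03 00:00:00']),
    'holiday': ['Labor Day', 'None', 'None'],
})

# Lookup table from calendar date to holiday name (one entry for brevity).
date_to_holiday = {'2013-09-02': 'Labor Day'}

dates = df['date_time'].dt.strftime('%Y-%m-%d')
# map() returns NaN for dates missing from the lookup; keep the old label there.
df['holiday'] = dates.map(date_to_holiday).fillna(df['holiday'])
print(df)
```

A flat dictionary like this could be built from `var_holidays` by inverting it, so every hour of a holiday date receives the label in one vectorized step.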
We noticed that there were indeed unrecorded holidays. Now that we have the updated data, we can visualize the volume of traffic by holiday.
filter_holidays = (i_94['holiday'] != 'None') & (i_94['holiday'] != 'State Fair')
# numeric_only=True keeps text columns (e.g. date, weather_main) out of the mean
holiday_bh_plot = i_94[filter_holidays].groupby('holiday').mean(numeric_only=True).sort_values('traffic_volume').reset_index()
gen_barh_plot(holiday_bh_plot, 'holiday', 'traffic_volume',
'Traffic volume by holiday', xlabel='Traffic volume, cars/hr',
color='lightsteelblue', xv_line=3000)
The holiday with the highest average traffic volume, around 3,500 cars/hour, is Columbus Day. We should keep in mind that the data come from the state of Minnesota, where this holiday was renamed Indigenous Peoples' Day in 2014. The results suggest that people travel more on this holiday than on the others, though the data alone do not tell us why.
Meanwhile, the holidays with the least traffic are Christmas Day and New Year's Day. This is probably because people prefer to spend these holidays at home with their families or friends.
In this project, we tried to find some indicators of heavy westbound traffic on Interstate Highway I-94. The data were captured by a station halfway between Minneapolis and Saint Paul. Below, we show three types of relevant indicators conducive to heavy traffic.
Time indicators:
Traffic tends to be heaviest during the day (7:00 am to 7:00 pm), particularly in the warm months (March to October) compared to the cold months (November to February). Traffic volume is higher on business days than on weekends. Peak hours are around 7 am and 4 pm, with about 6,000 cars/hour.
The lowest average traffic volume was recorded in 2016, possibly due to works on the highway that year. The highest average was recorded in 2017, presumably related to a speed limit adjustment while the highway works continued.
Weather indicators:
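Daytime (> 5,000 cars/hour): Shower snow, Light rain and snow, Proximity thunderstorm with drizzle.
Nighttime (> 2,500 cars/hour): Proximity shower rain, Light intensity shower rain.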
Holiday indicators:
The holiday with the highest traffic intensity is *Columbus Day* (also known as *Indigenous Peoples' Day* in Minnesota), while the holidays with the lowest intensity are *Christmas Day* and *New Year's Day*, presumably because people prefer to stay at home to celebrate with family or friends.
Temperature is not a strong indicator of traffic intensity: its correlation with daytime traffic volume is only about +0.13.