Finding Heavy Traffic Indicators on I-94¶

We have analysed a dataset about Westbound traffic on I-94 Interstate Highway for the period 2 Oct 2012 at 09.00 through 30 Sept 2018 at 23.00.

The goal of the analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of day, time of week and others.

We start off by reading the data and producing first 5 and last 5 rows of data as well as basic info on the dataset, including number of rows and columns.

In [1]:

import pandas as pd
import numpy as np

i94 = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
i94.head()

Out[1]:

	holiday	temp	clouds_all	weather_main	weather_description	date_time	traffic_volume
0	None	288.28	40	Clouds	scattered clouds	2012-10-02 09:00:00	5545
1	None	289.36	75	Clouds	broken clouds	2012-10-02 10:00:00	4516
2	None	289.58	90	Clouds	overcast clouds	2012-10-02 11:00:00	4767
3	None	290.13	90	Clouds	overcast clouds	2012-10-02 12:00:00	5026
4	None	291.14	75	Clouds	broken clouds	2012-10-02 13:00:00	4918

In [2]:

i94.tail()

Out[2]:

	holiday	temp	clouds_all	weather_main	weather_description	date_time	traffic_volume
48199	None	283.45	75	Clouds	broken clouds	2018-09-30 19:00:00	3543
48200	None	282.76	90	Clouds	overcast clouds	2018-09-30 20:00:00	2781
48201	None	282.73	90	Thunderstorm	proximity thunderstorm	2018-09-30 21:00:00	2159
48202	None	282.09	90	Clouds	overcast clouds	2018-09-30 22:00:00	1450
48203	None	282.12	90	Clouds	overcast clouds	2018-09-30 23:00:00	954

In [3]:

i94.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              48204 non-null  object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB

In [4]:

import matplotlib.pyplot as plt
%matplotlib inline

i94['traffic_volume'].plot.hist()

Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f898ddb4130>

In [5]:

i94['traffic_volume'].describe()

Out[5]:

count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64

The dataset shows the Westbound route of I-94 has over the dataset period has:

A maximum traffic volume of 7280 vehicles and an average of 3260 vehicles.
25% standard deviation shows a quarter of the time a max of 1193 vehicles are recorded and on the other extreme, a quarter of the time 4933 vehicles are recorded. These differences could refer to night time or day time, respectively, or could indicate when the I-94 is under construction.

In [6]:

i94['date_time'] = pd.to_datetime(i94['date_time'])
i94.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   holiday              48204 non-null  object        
 1   temp                 48204 non-null  float64       
 2   rain_1h              48204 non-null  float64       
 3   snow_1h              48204 non-null  float64       
 4   clouds_all           48204 non-null  int64         
 5   weather_main         48204 non-null  object        
 6   weather_description  48204 non-null  object        
 7   date_time            48204 non-null  datetime64[ns]
 8   traffic_volume       48204 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(2), object(3)
memory usage: 3.3+ MB

In [7]:

daytime = i94.copy()[(i94['date_time'].dt.hour >= 7) & (i94['date_time'].dt.hour < 19)]
print(daytime.shape)
nighttime = i94.copy()[(i94['date_time'].dt.hour >= 19) | (i94['date_time'].dt.hour < 7)]
print(nighttime.shape)

(23877, 9)
(24327, 9)

We extracted the hour from date_time column and calculated how many rows reference each of daytime (12 hours) and nighttime (12 hours). According to the calculation, there are more rows of nighttime than daytime. The difference between the two is 450. We will investigate this further.

In [8]:

# hours = i94['date_time'].dt.hour
# i94['hours'] = hours
# i94['increment'] = i94['hours'] - i94['hours'].shift(+1)
# i94[i94['increment'] != 1.0]

We have investigated the difference between total rows of nighttime and daytime. We added an hours column to the dataset and then calculated the difference between each row and its immediately next following row on the basis that each row is a recording for that hour and the next for the next hour. There are many more than 450 rows with a difference of 1. We noticed that there are some consecutive records of the same hour, i.e., a difference of zero and others with multiple hours of difference. This aspect may be needed to be taken into account in later analysis.

In [9]:

i94.copy().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   holiday              48204 non-null  object        
 1   temp                 48204 non-null  float64       
 2   rain_1h              48204 non-null  float64       
 3   snow_1h              48204 non-null  float64       
 4   clouds_all           48204 non-null  int64         
 5   weather_main         48204 non-null  object        
 6   weather_description  48204 non-null  object        
 7   date_time            48204 non-null  datetime64[ns]
 8   traffic_volume       48204 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(2), object(3)
memory usage: 3.3+ MB

In [10]:

plt.figure(figsize=(10, 8))
plt.subplot(1, 2, 1)
plt.hist(daytime['traffic_volume'])
plt.ylabel('Frequency')
plt.xlabel('Traffic Volume')
plt.ylim(0, 8000)
plt.xlim(0, 8000)
plt.title('Daytime')

plt.subplot(1, 2, 2)
plt.hist(nighttime['traffic_volume'])
plt.title('Nighttime')
plt.ylabel('Frequency')
plt.xlabel('Traffic Volume')
plt.ylim(0, 8000)
plt.xlim(0, 8000)
plt.show()

We have produced a grid of 2 histograms to show frequency of traffic_volume in each of daytime and nighttime. We have also ensured both x and y axes have similar datapoints, respectively. Daytime histogram is as expected with a normal curve around the median traffic volume. Nighttime histogram shows as expected highest frequencies for lowest traffic volumes. There are some outliers around traffic_volume of approximately 3000 and also stickiness of frequency for traffic_volume of approximately 1500 through 3500.

In [11]:

daytime['traffic_volume'].describe()

Out[11]:

count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64

In [12]:

nighttime['traffic_volume'].describe()

Out[12]:

count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64

We have provided basic data for each of daytime and nighttime traffic_volume series. This shows some significant differences between traffic volume during daytime and nighttime. Mean values for each are less than half at night, though max values do not show such a difference, presumably because max and values that are close to max are much less frequent for nighttime than in case of daytime.

Considering the findings on nighttime histogram and basic data, we could ignore nighttime to determine indicators for heavy traffic.

In [13]:

daytime['month'] = daytime['date_time'].dt.month
by_month = daytime.groupby('month').mean()
by_month['traffic_volume']

Out[13]:

month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64

We have created a column in the daytime dataset for monthly traffic_volume and populated this column with mean values of traffic_volume for each respective month.

In [14]:

plt.plot(by_month['traffic_volume'])

Out[14]:

[<matplotlib.lines.Line2D at 0x7f898d8e8d00>]

A graph of daytime traffic_volume for each month shows as expected vastly reduced data for beginning January (post-Xmas), December (Xmas) and over the summer holiday season (part of each of July and August).

In [15]:

daytime['dayofweek'] = daytime['date_time'].dt.dayofweek
by_dayofweek = daytime.groupby('dayofweek').mean()
by_dayofweek['traffic_volume']  # 0 is Monday, 6 is Sunday

Out[15]:

dayofweek
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64

In [16]:

plt.plot(by_dayofweek['traffic_volume'] )

Out[16]:

[<matplotlib.lines.Line2D at 0x7f898eaa31c0>]

A graph of traffic_volume for each day of a calendar week shows that from Friday through Sunday is a significant reduction on otherwise relatively stable levels from Monday through Thursday.

In [17]:

daytime['hour'] = daytime['date_time'].dt.hour
bussiness_days = daytime.copy()[daytime['dayofweek'] <= 4] # 4 == Friday
weekend = daytime.copy()[daytime['dayofweek'] >= 5] # 5 == Saturday
by_hour_business = bussiness_days.groupby('hour').mean()
by_hour_weekend = weekend.groupby('hour').mean()

In [ ]:

In [18]:

plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
plt.plot(by_hour_business['traffic_volume'])
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Day')
plt.ylim(0, 6250)
plt.xlim(6, 19)
plt.title('Business')

plt.subplot(1, 2, 2)
plt.plot(by_hour_weekend['traffic_volume'])
plt.title('Weekend')
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Day')
plt.ylim(0, 6250)
plt.xlim(6, 19)
plt.show()

A graph of hours of the day between 7 through 18 during the business week and also on weekends shows that rush hours during the week are at 7 and 16. traffic_volume starts to increase slowly from 9 onwards and accelerates from 13 through 16.

On weekends traffic_volume increases from 7 through 11 and then plateaus through 16.

In [19]:

i94.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   holiday              48204 non-null  object        
 1   temp                 48204 non-null  float64       
 2   rain_1h              48204 non-null  float64       
 3   snow_1h              48204 non-null  float64       
 4   clouds_all           48204 non-null  int64         
 5   weather_main         48204 non-null  object        
 6   weather_description  48204 non-null  object        
 7   date_time            48204 non-null  datetime64[ns]
 8   traffic_volume       48204 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(2), object(3)
memory usage: 3.3+ MB

In [20]:

daytime.copy().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23877 entries, 0 to 48198
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   holiday              23877 non-null  object        
 1   temp                 23877 non-null  float64       
 2   rain_1h              23877 non-null  float64       
 3   snow_1h              23877 non-null  float64       
 4   clouds_all           23877 non-null  int64         
 5   weather_main         23877 non-null  object        
 6   weather_description  23877 non-null  object        
 7   date_time            23877 non-null  datetime64[ns]
 8   traffic_volume       23877 non-null  int64         
 9   month                23877 non-null  int64         
 10  dayofweek            23877 non-null  int64         
 11  hour                 23877 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(5), object(3)
memory usage: 2.4+ MB

We have concluded so far:

The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
The traffic is usually heavier on business days compared to weekends.
On business days, the rush hours are around 7 and 16.

Another possible indicator of heavy traffic is weather. The dataset has a number of weather columns: temp, rain_1h, snow_1h, clouds_all, weather_main, weather_description. All excspt the last 2 are numerical.

We will check the correlation of traffic_volume with the various weather columns.

In [21]:

daytime.copy().corr()[['traffic_volume', 'temp', 'rain_1h', 'snow_1h', 'clouds_all']]

Out[21]:

	traffic_volume	temp	rain_1h	snow_1h	clouds_all
temp	0.128317	1.000000	0.010815	-0.019286	-0.135519
rain_1h	0.003697	0.010815	1.000000	-0.000091	0.004993
snow_1h	0.001265	-0.019286	-0.000091	1.000000	0.027721
clouds_all	-0.032932	-0.135519	0.004993	0.027721	1.000000
traffic_volume	1.000000	0.128317	0.003697	0.001265	-0.032932
month	-0.022337	0.222072	0.001176	0.026768	0.000595
dayofweek	-0.416453	0.000188	-0.009683	-0.008814	-0.041980
hour	0.172704	0.162691	0.008279	0.003923	0.023685

There is not a great deal of correlation between traffic_volume and numerical weather column data. We will take a closer look at the correlation between the column with the greatest correlation that is temp. We will produce a scatter graph for this purpose.

In [22]:

plt.scatter(daytime['traffic_volume'], daytime['temp'])
plt.xlabel('Temperature')
plt.ylabel('Traffic Volume')
plt.ylim(220, 350)

Out[22]:

(220.0, 350.0)

The scatter graph between traffic_volume and temp does not show any particular correlation outcome and is not reliable as an indicator of what affects heavy traffic. We have also excluded from the graph some outlier data by limiting the data points on the y-axis.

In [23]:

by_weather_main = daytime.groupby('weather_main').mean()
by_weather_description = daytime.groupby('weather_description').mean()
by_weather_main

Out[23]:

	temp	rain_1h	snow_1h	clouds_all	traffic_volume	month	dayofweek	hour
weather_main
Clear	283.812078	0.000000	0.000000	1.670265	4778.416260	6.490599	3.138928	12.404248
Clouds	282.929274	0.000000	0.000000	62.667548	4865.415996	6.393243	3.005631	12.911974
Drizzle	284.456433	0.170804	0.000000	84.704417	4837.212911	7.105323	2.934315	12.308041
Fog	277.579641	0.163840	0.001409	65.477901	4372.491713	6.646409	2.798343	10.325967
Haze	275.319353	0.040036	0.000000	64.000000	4609.893285	5.832134	2.754197	12.467626
Mist	279.420825	0.249992	0.000825	74.961435	4623.976475	6.734285	2.895102	11.078288
Rain	287.089601	3.972943	0.000292	75.870116	4815.568462	6.774023	2.914467	12.642379
Smoke	292.405833	0.878333	0.000000	53.333333	4564.583333	6.833333	2.416667	13.166667
Snow	267.984505	0.014017	0.001768	80.501376	4396.321183	6.374828	2.750344	12.153370
Squall	296.730000	1.020000	0.000000	75.000000	4211.000000	7.000000	2.000000	14.000000
Thunderstorm	293.364678	1.146475	0.000000	75.184035	4648.212860	7.108647	2.955654	12.694013

In [24]:

by_weather_main['traffic_volume'].plot.barh()

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f898eb3af70>

In [25]:

by_weather_description['traffic_volume'].plot.barh(figsize=(5, 10))

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f898dc562b0>

There are some weather patterns that result in highest traffic_volume including shower snow, light rain and snow, proximity thunderstorm with drizzle and thunderstorm with light drizzle. These may be a result of more people responding to inclement weather and using their cars.

In [26]:

nighttime['dayofweek'] = nighttime['date_time'].dt.dayofweek
by_dayofweek_night = nighttime.groupby('dayofweek').mean()
by_dayofweek_night['traffic_volume']  # 0 is Monday, 6 is Sunday

nighttime['hour'] = nighttime['date_time'].dt.hour
business_days_night = nighttime.copy()[nighttime['dayofweek'] <= 4] # 4 == Friday
weekend_night = nighttime.copy()[nighttime['dayofweek'] >= 5] # 5 == Saturday
by_hour_business_night = business_days_night.groupby('hour').mean()
by_hour_weekend_night = weekend_night.groupby('hour').mean()
by_hour_business_night

Out[26]:

	temp	rain_1h	snow_1h	clouds_all	traffic_volume	dayofweek
hour
0	279.983283	0.161295	0.000341	44.580777	651.528971	1.980913
1	279.626900	0.106480	0.000421	44.255349	396.913043	1.987578
2	279.453874	0.123588	0.000000	44.566323	301.982818	1.997251
3	279.046059	0.182486	0.000110	43.596841	362.289835	2.000687
4	278.580683	0.142667	0.000067	46.018717	832.661096	2.006016
5	278.603900	0.124663	0.000069	44.959478	2701.296703	1.982143
6	278.432872	0.164433	0.000067	45.685695	5365.983210	1.997314
19	283.439235	0.156652	0.000000	53.014184	3298.340426	1.989362
20	282.617152	0.119718	0.000000	51.443583	2842.433004	1.988717
21	281.830703	0.133628	0.000702	49.395088	2673.042807	1.972632
22	280.982875	0.119544	0.000617	46.618080	2125.913104	1.973371
23	280.385178	0.105409	0.000831	44.722752	1379.549728	1.978883

In [27]:

plt.figure(figsize=(12, 12))
plt.subplot(2, 2, 1)
plt.plot(by_hour_business_night['traffic_volume'])
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Night - predawn')
plt.ylim(0, 6250)
plt.xlim(-1, 7)
plt.title('Business')

plt.subplot(2, 2, 2)
plt.plot(by_hour_weekend_night['traffic_volume'])
plt.title('Weekend')
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Night - predawn')
plt.ylim(0, 6250)
plt.xlim(-1, 7)

plt.subplot(2, 2, 3)
plt.plot(by_hour_business_night['traffic_volume'])
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Night - to midnight')
plt.ylim(0, 6250)
plt.xlim(19, 24)
plt.title('Business')

plt.subplot(2, 2, 4)
plt.plot(by_hour_weekend_night['traffic_volume'])
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Night - to midnight')
plt.ylim(0, 6250)
plt.xlim(19, 24)
plt.title('Weekend')


plt.show()

We have plotted the nighttime dataset by subsets business days and weekend days. The graph includes a grouping of each hour of each subset and a mean value of traffic_volume. We have set out the graphs in a grid of two rows and two columns to take into account the two separate sets of hours that make up nighttime, i.e., predawn from 24 through 6 and at night up to midnight from 19 through 24 (these are also two different days if they are run consecutively).

The business days' predawn period shows a significant increase in traffic_volume between hours of 4-6 a.m. that reaches a peak around 6 a.m. The weekend predawn period is significantly different traffic_volume, i.e. overall much less traffic with also two peaks (both much lower than business days) just after midnight and around 6 a.m.

The traffic_volume at night up to midnight is mostly similar for both weekend and business days.

Conclusion:

In this project, we tried to find a few indicators of heavy traffic on the I-94 Interstate highway. We managed to find two types of indicators:

Time indicators
- The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
- The traffic is usually heavier for daytime and end of predawn nighttime on business days compared to the weekends.
- On business days, the rush hours are around 7 and 16.
Weather indicators
- Shower snow
- Light rain and snow
- Proximity thunderstorm with drizzle