
Finding Heavy Traffic Indicators on I-94

Introduction

We're going to analyze a dataset about the westbound traffic on the I-94 Interstate highway.

John Hogue made the dataset available, and it can be downloaded from the UCI Machine Learning Repository. The data record the hourly westbound traffic volume on Interstate 94 at MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN. Hourly weather features and holidays are included so that their impact on traffic volume can be studied.

The goal of our analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of the day, time of the week, etc. For instance, we may find out that the traffic is usually heavier in the summer or when it snows.

First glance at the data

We'll start by reading the csv file and looking at the first and the last five rows, together with some information about the dataset. The different columns refer to:

holiday (categorical): US national holidays plus a regional holiday, the Minnesota State Fair
temp (numeric): average temperature in kelvin
rain_1h (numeric): amount of rain that fell in the hour, in mm
snow_1h (numeric): amount of snow that fell in the hour, in mm
clouds_all (numeric): percentage of cloud cover
weather_main (categorical): short textual description of the current weather
weather_description (categorical): longer textual description of the current weather
date_time (datetime): hour of the data collected, in local CST time
traffic_volume (numeric): hourly I-94 ATR 301 reported westbound traffic volume
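
Because temp is stored in kelvin, its values are easier to read after conversion to degrees Celsius. A minimal sketch follows; the helper name kelvin_to_celsius is our own and is not part of the notebook, and 288.28 is the temp value of the first row of the dataset.

```python
def kelvin_to_celsius(kelvin):
    """Convert a temperature from kelvin to degrees Celsius."""
    return kelvin - 273.15


# temp of the first row in the dataset (2012-10-02 09:00:00)
first_temp_c = kelvin_to_celsius(288.28)
print(round(first_temp_c, 2))
```

The same subtraction could be applied to the whole column, for example with metro['temp'] - 273.15, if a Celsius axis is preferred in later plots.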

In [1]:

import pandas as pd

metro = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')

print('First 5 rows:\n', metro.head())
print('Last 5 rows:\n', metro.tail())
print('Additional info:')
metro.info()  #info() prints its summary itself and returns None, so it is not wrapped in print()

First 5 rows:
   holiday    temp  rain_1h  snow_1h  clouds_all weather_main  \
0    None  288.28      0.0      0.0          40       Clouds   
1    None  289.36      0.0      0.0          75       Clouds   
2    None  289.58      0.0      0.0          90       Clouds   
3    None  290.13      0.0      0.0          90       Clouds   
4    None  291.14      0.0      0.0          75       Clouds   

  weather_description            date_time  traffic_volume  
0    scattered clouds  2012-10-02 09:00:00            5545  
1       broken clouds  2012-10-02 10:00:00            4516  
2     overcast clouds  2012-10-02 11:00:00            4767  
3     overcast clouds  2012-10-02 12:00:00            5026  
4       broken clouds  2012-10-02 13:00:00            4918  
Last 5 rows:
       holiday    temp  rain_1h  snow_1h  clouds_all  weather_main  \
48199    None  283.45      0.0      0.0          75        Clouds   
48200    None  282.76      0.0      0.0          90        Clouds   
48201    None  282.73      0.0      0.0          90  Thunderstorm   
48202    None  282.09      0.0      0.0          90        Clouds   
48203    None  282.12      0.0      0.0          90        Clouds   

          weather_description            date_time  traffic_volume  
48199           broken clouds  2018-09-30 19:00:00            3543  
48200         overcast clouds  2018-09-30 20:00:00            2781  
48201  proximity thunderstorm  2018-09-30 21:00:00            2159  
48202         overcast clouds  2018-09-30 22:00:00            1450  
48203         overcast clouds  2018-09-30 23:00:00             954  
Additional info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              48204 non-null  object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB

The date_time column ranges from 2012-10-02 09:00:00 to 2018-09-30 23:00:00, so the dataset covers about six years of traffic and weather records at hourly resolution.
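
As a rough check on coverage, the number of hourly slots between the first and last timestamps can be compared with the number of rows. The sketch below uses only the two timestamps and the row count printed above; it works with naive datetimes and so ignores daylight-saving shifts, which is an assumption on our part.

```python
from datetime import datetime

# First and last timestamps as printed by head() and tail().
start = datetime(2012, 10, 2, 9)
end = datetime(2018, 9, 30, 23)

# Number of hourly slots in the closed interval [start, end].
expected_hours = int((end - start).total_seconds() // 3600) + 1

rows_in_dataset = 48204  # from the RangeIndex line printed by metro.info()
shortfall = expected_hours - rows_in_dataset

print(expected_hours, shortfall)
```

The row count is lower than the number of hourly slots, which suggests that some hours were not recorded. If the dataset also contains repeated timestamps, the number of missing hours would be larger than this difference.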

No column contains null values. The columns hold strings (object dtype), floats and integers; note that date_time is still stored as a string at this point.

Analysis

Traffic volume

Next, we're going to plot a histogram to visualize the distribution of the traffic_volume column, and see some statistics on this variable.

In [2]:

import matplotlib.pyplot as plt
%matplotlib inline

#Histogram of the 'traffic_volume' column
ax = metro['traffic_volume'].plot.hist(title='Traffic volume frequency', bins=10)
ax.set_xlabel('Traffic volume')  #labels are set on the returned Axes object
ax.set_ylabel('Frequency')
plt.show()

#Additional info
print('Additional statistics:\n', metro['traffic_volume'].describe())

Additional statistics:
 count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64

The histogram and statistics show two peaks: the most frequent traffic volumes are either below 1,000 cars per hour or around 5,000 cars per hour. The distribution is therefore not normal, and the mean (about 3,260 cars per hour) falls between the two peaks, so it describes typical traffic poorly.

Time of day is a likely explanation: at night fewer than 1,000 cars per hour may use this road, while during the day it gets much busier.

Traffic volume x daytime/nighttime

This possibility that nighttime and daytime might influence traffic volume gives our analysis an interesting direction: comparing daytime with nighttime data.

We'll start by dividing the dataset into two parts:

Daytime data: hours from 7 a.m. to 7 p.m. (12 hours)
Nighttime data: hours from 7 p.m. to 7 a.m. (12 hours)

While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.

In [3]:

#Transform the 'date_time' column to datetime type
metro['date_time'] = pd.to_datetime(metro['date_time'])

#Isolate the daytime and nighttime data in two different datasets
daytime_data = metro[(metro['date_time'].dt.hour >= 7) & (metro['date_time'].dt.hour < 19)]
nighttime_data = metro[(metro['date_time'].dt.hour < 7) | (metro['date_time'].dt.hour >= 19)]

Now we're going to compare the traffic volume at night and during the day.

In [4]:

plt.figure(figsize=(8,3), constrained_layout=True)

#Subplot for daytime data
plt.subplot(1, 2, 1)
plt.hist(daytime_data['traffic_volume'])
plt.xlim(0,8000)
plt.ylim(0,8000)
plt.title('Daytime traffic volume distribution')
plt.xlabel('Cars per hour')
plt.ylabel('Frequency')

#Subplot for nighttime data
plt.subplot(1, 2, 2)
plt.hist(nighttime_data['traffic_volume'])
plt.xlim(0,8000)
plt.ylim(0,8000)
plt.title('Nighttime traffic volume distribution')
plt.xlabel('Cars per hour')
plt.ylabel('Frequency')

plt.show()

#Additional statistics
print('Daytime data:\n', daytime_data['traffic_volume'].describe(), '\n')
print('Nighttime data:\n', nighttime_data['traffic_volume'].describe())

Daytime data:
 count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64 

Nighttime data:
 count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64

From the plots and statistics above, we can conclude that traffic volume differs markedly between daytime and nighttime.

Daytime traffic volume is roughly bell-shaped around 4,800 cars per hour, with a slight left skew (the mean, 4,762, is below the median, 4,820). Nighttime traffic volume is strongly right-skewed and resembles a decaying exponential: low values are the most frequent, and the frequency drops as the volume increases.

Since traffic at night is light, we'll continue the analysis with the daytime data only.
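
The direction of the skew can be estimated from the summary statistics alone. The sketch below uses Pearson's second skewness coefficient, 3 * (mean - median) / std, which is our choice of measure and not something the notebook computes; the numbers are copied from the describe() outputs above.

```python
def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std


# Values copied from the describe() outputs for daytime and nighttime data.
day_skew = pearson_median_skewness(mean=4762.047452, median=4820.0, std=1174.546482)
night_skew = pearson_median_skewness(mean=1785.377441, median=1287.0, std=1441.951197)

print(round(day_skew, 2), round(night_skew, 2))
```

A small negative value for daytime and a value near 1 for nighttime are consistent with a mild left skew during the day and a pronounced right skew at night. With the full data available, daytime_data['traffic_volume'].skew() would give the moment-based estimate directly.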

Traffic volume x time

Previously, we determined that the traffic at night is generally light. Our goal is to find indicators of heavy traffic, so we decided to only focus on the daytime data moving forward.

One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain time of the day.

We're going to look at a few line plots showing how the traffic volume changed according to the following parameters:

Month
Day of the week
Time of day

Traffic volume x month

In [5]:

#Create a new column containing the month
daytime_data['month'] = daytime_data['date_time'].dt.month

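#Aggregate the data and average it by month
by_month = daytime_data.groupby('month').mean(numeric_only=True)  #numeric_only skips the text columns
print(by_month['traffic_volume'])

#Note: the column assignment above triggers a SettingWithCopyWarning because
#daytime_data was created as a slice of metro; this still needs to be fixed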

#Line plot on 'traffic_volume' x month data
plt.plot(by_month['traffic_volume'])
plt.xlabel('Month')
plt.title('Traffic volume x month')
plt.show()

month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64

C:\Users\Alvaro\AppData\Local\Temp/ipykernel_4076/1848645698.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  daytime_data['month'] = daytime_data['date_time'].dt.month

The monthly averages and the line plot above show that traffic volume is high and fairly stable from March to June and from August to October (roughly 4,870 to 4,930 cars per hour), while the lowest values appear in December, January and July, possibly because of vacation periods.
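
The SettingWithCopyWarning printed with this cell arises because a column is added to a DataFrame that was obtained by filtering another one. One common way to avoid it is to take an explicit copy when the subset is created. The sketch below illustrates this on a tiny made-up frame; metro_demo and daytime_demo are hypothetical names, and the notebook's own cells are left as they are.

```python
import pandas as pd

# Tiny stand-in for the metro DataFrame (made-up rows, for illustration only).
metro_demo = pd.DataFrame({
    'date_time': pd.to_datetime(['2012-10-02 09:00:00',
                                 '2012-10-02 21:00:00',
                                 '2012-11-03 10:00:00']),
    'traffic_volume': [5545, 2159, 4516],
})

# between(7, 18) is inclusive on both ends, i.e. hours 7 through 18.
is_daytime = metro_demo['date_time'].dt.hour.between(7, 18)

# .copy() makes the subset an independent DataFrame,
# so adding a column to it is unambiguous.
daytime_demo = metro_demo[is_daytime].copy()
daytime_demo['month'] = daytime_demo['date_time'].dt.month

print(daytime_demo)
```

The original frame is left untouched: the new month column exists only in the copy.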

Traffic volume x day of the week

We'll now continue with building line plots for another time unit: day of the week.

In [6]:

#Create a new column containing the day of the week
daytime_data['dayofweek'] = daytime_data['date_time'].dt.dayofweek

#Aggregate the data and average it by day of the week
by_dayofweek = daytime_data.groupby('dayofweek').mean(numeric_only=True)
print(by_dayofweek['traffic_volume']) # 0 is Monday, 6 is Sunday

#Line plot on 'traffic_volume' x day of the week data
plt.plot(by_dayofweek['traffic_volume'])
plt.xlabel('Day of week')
plt.title('Traffic volume x day of week')
plt.show()

C:\Users\Alvaro\AppData\Local\Temp/ipykernel_4076/1456384355.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  daytime_data['dayofweek'] = daytime_data['date_time'].dt.dayofweek

dayofweek
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64

The daily averages and the line plot above show that traffic volume is high and fairly even from Monday to Friday (slightly lower on Mondays) and clearly lower on the weekend, probably because weekday traffic is dominated by commuting to work by car.
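
The integer codes returned by dt.dayofweek are compact but not self-explanatory. As a small sketch, dt.day_name() gives readable labels for the same timestamps; the two timestamps used here are the first and last ones in the dataset.

```python
import pandas as pd

# First and last timestamps of the dataset.
stamps = pd.Series(pd.to_datetime(['2012-10-02 09:00:00', '2018-09-30 23:00:00']))

day_numbers = stamps.dt.dayofweek.tolist()  # 0 is Monday, 6 is Sunday
day_names = stamps.dt.day_name().tolist()   # English names by default

print(day_numbers, day_names)
```

Grouping by day name instead of day number would sort the result alphabetically, so the numeric column remains the more convenient key for line plots.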

Traffic volume x time of the day

We'll now generate a line plot for the time of day. The weekends, however, will drag down the average values, so we're going to look at the averages separately. To do that, we'll start by splitting the data based on the day type: business day or weekend.

In [7]:

#Create a new column containing the hour of the day
daytime_data['hour'] = daytime_data['date_time'].dt.hour

#Split the data into business-day and weekend datasets
business_days = daytime_data[daytime_data['dayofweek'] <= 4].copy() # 4 == Friday
weekend = daytime_data[daytime_data['dayofweek'] >= 5].copy() # 5 == Saturday

#Aggregate the data and average it by hour of the day
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)

print(by_hour_business['traffic_volume'])
print(by_hour_weekend['traffic_volume'])

#Grid with line plots on 'traffic_volume' x hour of the day data
plt.figure(figsize=(8,3), constrained_layout=True)

#Line plot for business days
plt.subplot(1, 2, 1)
plt.ylim(0,6500)
plt.plot(by_hour_business['traffic_volume'])
plt.xlabel('Hour of the day')
plt.title('Traffic volume x hour (business days)')

#Line plot for weekends
plt.subplot(1, 2, 2)
plt.ylim(0,6500)
plt.plot(by_hour_weekend['traffic_volume'])
plt.xlabel('Hour of the day')
plt.title('Traffic volume x hour (weekend)')

plt.show()

C:\Users\Alvaro\AppData\Local\Temp/ipykernel_4076/2115677114.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  daytime_data['hour'] = daytime_data['date_time'].dt.hour

hour
7     6030.413559
8     5503.497970
9     4895.269257
10    4378.419118
11    4633.419470
12    4855.382143
13    4859.180473
14    5152.995778
15    5592.897768
16    6189.473647
17    5784.827133
18    4434.209431
Name: traffic_volume, dtype: float64
hour
7     1589.365894
8     2338.578073
9     3111.623917
10    3686.632302
11    4044.154955
12    4372.482883
13    4362.296564
14    4358.543796
15    4342.456881
16    4339.693805
17    4151.919929
18    3811.792279
Name: traffic_volume, dtype: float64

The hourly averages and line plots above show that on business days traffic peaks at 7h and 16h (above 6,000 cars per hour), with a dip around 10h. On weekends, volume rises steadily through the morning, levels off at about 4,300 to 4,400 cars per hour between 12h and 16h, and declines slightly afterwards. This difference probably reflects how the car is used: commuting to work on business days, leisure on weekends.
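
The business-day and weekend averages can also be obtained from a single grouping on two keys. The sketch below shows the idea on made-up rows; the is_weekend column and the frame name demo are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Made-up hourly observations, for illustration only.
demo = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2012-10-02 07:00:00',  # Tuesday
        '2012-10-03 07:00:00',  # Wednesday
        '2012-10-06 07:00:00',  # Saturday
        '2012-10-07 07:00:00',  # Sunday
        '2012-10-02 12:00:00',  # Tuesday
        '2012-10-06 12:00:00',  # Saturday
    ]),
    'traffic_volume': [6000, 6200, 1500, 1700, 4800, 4400],
})

demo['hour'] = demo['date_time'].dt.hour
demo['is_weekend'] = demo['date_time'].dt.dayofweek >= 5

# Mean volume per (hour, day type); unstack puts the day types side by side.
by_hour = (demo.groupby(['hour', 'is_weekend'])['traffic_volume']
               .mean()
               .unstack('is_weekend'))
by_hour.columns = ['business_day', 'weekend']  # columns are ordered False, True

print(by_hour)
```

The resulting table has one row per hour and one column per day type, so by_hour.plot() would draw both curves on the same axes.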

Correlations of traffic volume with weather

So far, we've focused on finding time indicators for heavy traffic, and we reached the following conclusions:

The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
The traffic is usually heavier on business days compared to weekends.
On business days, the rush hours are around 7 and 16.

Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: temp, rain_1h, snow_1h, clouds_all, weather_main, weather_description.

A few of these columns are numerical so let's start by looking up their correlation values with traffic_volume.

In [8]:

daytime_data.info()  #info() prints its summary itself and returns None

corr_traffic_temp = daytime_data['traffic_volume'].corr(daytime_data['temp'])
corr_traffic_rain = daytime_data['traffic_volume'].corr(daytime_data['rain_1h'])
corr_traffic_snow = daytime_data['traffic_volume'].corr(daytime_data['snow_1h'])
corr_traffic_clouds = daytime_data['traffic_volume'].corr(daytime_data['clouds_all'])

print('The correlation of traffic volume with temperature is:', round(corr_traffic_temp,2))
print('The correlation of traffic volume with rain is:', round(corr_traffic_rain,2))
print('The correlation of traffic volume with snow is:', round(corr_traffic_snow,2))
print('The correlation of traffic volume with clouds is:', round(corr_traffic_clouds,2))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23877 entries, 0 to 48198
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   holiday              23877 non-null  object        
 1   temp                 23877 non-null  float64       
 2   rain_1h              23877 non-null  float64       
 3   snow_1h              23877 non-null  float64       
 4   clouds_all           23877 non-null  int64         
 5   weather_main         23877 non-null  object        
 6   weather_description  23877 non-null  object        
 7   date_time            23877 non-null  datetime64[ns]
 8   traffic_volume       23877 non-null  int64         
 9   month                23877 non-null  int64         
 10  dayofweek            23877 non-null  int64         
 11  hour                 23877 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(5), object(3)
memory usage: 2.4+ MB
The correlation of traffic volume with temperature is: 0.13
The correlation of traffic volume with rain is: 0.0
The correlation of traffic volume with snow is: 0.0
The correlation of traffic volume with clouds is: -0.03

Correlations that round to exactly 0, as traffic volume does with rain and with snow, are unusual. We'll therefore inspect these variables in scatterplots to see what's happening:

In [9]:

#Scatter plots for snow and rain variables
plt.scatter(x=daytime_data['snow_1h'], y=daytime_data['traffic_volume'])
plt.xlabel('Snow in the hour (mm)')
plt.ylabel('Traffic volume')
plt.show()

plt.scatter(x=daytime_data['rain_1h'], y=daytime_data['traffic_volume'])
plt.xlabel('Rain in the hour (mm)')
plt.ylabel('Traffic volume')
plt.show()

The graphs above show that there's a clear outlier in the rain data. Therefore, we'll remove that row and calculate that particular correlation again:

In [10]:

#Correcting data regarding outliers in rain data
daytime_data = daytime_data[daytime_data['rain_1h'] < 50]

#Scatter plot for rain x traffic
plt.scatter(x=daytime_data['rain_1h'], y=daytime_data['traffic_volume'])
plt.xlabel('Rain in the hour (mm)')
plt.ylabel('Traffic volume')
plt.show()

corr_traffic_rain = daytime_data['traffic_volume'].corr(daytime_data['rain_1h'])

print('The correlation of traffic volume with temperature is:', round(corr_traffic_temp,2))
print('The correlation of traffic volume with rain is:', round(corr_traffic_rain,2))
print('The correlation of traffic volume with snow is:', round(corr_traffic_snow,2))
print('The correlation of traffic volume with clouds is:', round(corr_traffic_clouds,2))

The correlation of traffic volume with temperature is: 0.13
The correlation of traffic volume with rain is: -0.04
The correlation of traffic volume with snow is: 0.0
The correlation of traffic volume with clouds is: -0.03
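
The change in the rain correlation after dropping the outlier illustrates how strongly a single extreme value can affect Pearson's r. The sketch below reproduces the effect on made-up numbers with a hand-written helper; pearson_r is a hypothetical name, and the data are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data with a perfect negative linear relation.
rain = [0, 1, 2, 3, 4, 5]
volume = [10, 9, 8, 7, 6, 5]
r_clean = pearson_r(rain, volume)

# Same data plus one extreme rain value paired with an average volume.
r_with_outlier = pearson_r(rain + [1000], volume + [7.5])

print(round(r_clean, 3), round(r_with_outlier, 3))
```

One extreme point inflates the variance of the rain variable so much that a perfect linear relation among the other points is almost invisible in r, which is why the scatterplots are worth checking before trusting a correlation close to 0.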

It seems that the weather variable with the highest correlation with traffic volume is temperature: the higher the temperature, the higher the traffic, although the relation is very weak (r = 0.13). Next we'll check the scatterplot of these two variables:

In [11]:

#Scatter plot for traffic x temperature
plt.scatter(x=daytime_data['temp'], y=daytime_data['traffic_volume'])
plt.xlabel('Average temperature (K)')
plt.ylabel('Traffic volume')
plt.show()

The graph above shows two outliers with temp = 0 K. A temperature of absolute zero is physically impossible, so these values are recording errors. We'll remove them and redo the graph and the correlation:

In [12]:

#Correcting data regarding outliers in temperature
daytime_data = daytime_data[daytime_data['temp'] > 200]

#Scatter plot for traffic x temperature
plt.scatter(x=daytime_data['temp'], y=daytime_data['traffic_volume'])
plt.xlabel('Average temperature (K)')
plt.ylabel('Traffic volume')
plt.show()

#Corrected correlation of traffic x temperature
corr_traffic_temp = daytime_data['traffic_volume'].corr(daytime_data['temp'])

print('The correlation of traffic volume with temperature is:', round(corr_traffic_temp,2))
print('The correlation of traffic volume with rain is:', round(corr_traffic_rain,2))
print('The correlation of traffic volume with snow is:', round(corr_traffic_snow,2))
print('The correlation of traffic volume with clouds is:', round(corr_traffic_clouds,2))

The correlation of traffic volume with temperature is: 0.13
The correlation of traffic volume with rain is: -0.04
The correlation of traffic volume with snow is: 0.0
The correlation of traffic volume with clouds is: -0.03

Removing the outliers leaves the result unchanged: temperature is still the numerical weather variable most correlated with traffic volume, but with r = 0.13 the relation is too weak to make it a reliable indicator of heavy traffic.
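
After rows have been filtered out, recomputing all the correlations in one call keeps every coefficient based on the same rows. A minimal sketch using DataFrame.corrwith on made-up values follows; the frame name demo and its numbers are assumptions for illustration, while the column names follow the dataset.

```python
import pandas as pd

# Made-up values, for illustration only; column names follow the dataset.
demo = pd.DataFrame({
    'temp': [270.0, 280.0, 285.0, 290.0, 295.0],
    'rain_1h': [0.0, 0.3, 0.0, 1.2, 0.0],
    'snow_1h': [0.1, 0.0, 0.0, 0.0, 0.0],
    'clouds_all': [90, 75, 40, 90, 1],
    'traffic_volume': [4300, 4700, 5100, 4600, 5400],
})

weather_cols = ['temp', 'rain_1h', 'snow_1h', 'clouds_all']

# Pearson correlation of each weather column with traffic_volume, in one call.
weather_corr = demo[weather_cols].corrwith(demo['traffic_volume']).round(2)

print(weather_corr)
```

Applied to the notebook's data, the same call on daytime_data would replace the four separate corr() lines and the repeated print statements.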

To see if we can find more useful data, we'll look next at the categorical weather-related columns: weather_main and weather_description.

In [32]:

#Grouping and averaging for categorical weather variables
by_weather_main = daytime_data.groupby('weather_main').mean(numeric_only=True)
by_weather_description = daytime_data.groupby('weather_description').mean(numeric_only=True)

#Horizontal bar plot for weather_main: Categorical Short textual description of the current weather
by_weather_main['traffic_volume'].plot(kind='barh')
plt.show()

#Horizontal bar plot for weather_description: Categorical Longer textual description of the current weather
by_weather_description['traffic_volume'].plot(kind='barh', figsize=(6, 12))
plt.show()

The second graph shows that 'shower snow' and 'light rain and snow' are the weather descriptions with the highest average traffic volume, both exceeding 5,000 cars per hour.
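
An average per category is easier to interpret next to the number of rows it is based on, since a category with very few observations can show a high mean by chance. Whether that applies to the categories above is not checked in this notebook; the sketch below only shows, on made-up rows, how the mean and the count can be tabulated together.

```python
import pandas as pd

# Made-up rows, for illustration only; column names follow the dataset.
demo = pd.DataFrame({
    'weather_description': ['sky is clear', 'sky is clear', 'sky is clear',
                            'light rain', 'light rain', 'shower snow'],
    'traffic_volume': [4800, 5000, 4600, 4700, 4500, 5700],
})

# Average volume and number of observations per weather description.
summary = (demo.groupby('weather_description')['traffic_volume']
               .agg(['mean', 'count'])
               .sort_values('mean', ascending=False))

print(summary)
```

In this toy table the highest mean belongs to a category with a single row, which is the situation the count column is meant to reveal.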

Conclusion

From the analysis we have conducted, we can conclude that traffic volume on this road is affected by:

Month of the year: The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
Day of week: The traffic is usually heavier on business days compared to weekends.
Hour of the day: On business days, the rush hours are around 7 and 16.
Weather: Shower snow and light rain with snow are associated with the highest average traffic volumes, above 5,000 cars per hour.
