Finding Heavy Traffic Indicators on I-94¶

Data Set Information:¶

Hourly traffic volume on Interstate 94 westbound for MN DoT ATR station 301, located roughly halfway between Minneapolis and St Paul, MN. Hourly weather updates and holidays are factored in to determine the impact on traffic volume.

Project Goal¶

The goal for this analysis is to identify a few indicators of heavy traffic on I-94. These indicators can include weather, time of day, weekday, and so on. For example, we may discover that traffic is typically heavier in the summer or when it snows.

The dataset was made available by John Hogue and can be downloaded from the UCI Machine Learning Repository.

Understanding the dataset¶

First, we will import the required libraries. Then, we'll use Python and these (imported) libraries to understand the dataset and the information it contains.

In [1]:

# Importing libraries
import pandas as pd

# Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Let Jupyter Notebook render the plots inline
%matplotlib inline

In [2]:

# Read the dataset
# Print the first and last few rows
traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic

Out[2]:

	holiday	temp	rain_1h	snow_1h	clouds_all	weather_main	weather_description	date_time	traffic_volume
0	None	288.28	0.0	0.0	40	Clouds	scattered clouds	2012-10-02 09:00:00	5545
1	None	289.36	0.0	0.0	75	Clouds	broken clouds	2012-10-02 10:00:00	4516
2	None	289.58	0.0	0.0	90	Clouds	overcast clouds	2012-10-02 11:00:00	4767
3	None	290.13	0.0	0.0	90	Clouds	overcast clouds	2012-10-02 12:00:00	5026
4	None	291.14	0.0	0.0	75	Clouds	broken clouds	2012-10-02 13:00:00	4918
...	...	...	...	...	...	...	...	...	...
48199	None	283.45	0.0	0.0	75	Clouds	broken clouds	2018-09-30 19:00:00	3543
48200	None	282.76	0.0	0.0	90	Clouds	overcast clouds	2018-09-30 20:00:00	2781
48201	None	282.73	0.0	0.0	90	Thunderstorm	proximity thunderstorm	2018-09-30 21:00:00	2159
48202	None	282.09	0.0	0.0	90	Clouds	overcast clouds	2018-09-30 22:00:00	1450
48203	None	282.12	0.0	0.0	90	Clouds	overcast clouds	2018-09-30 23:00:00	954

48204 rows × 9 columns

There are 48204 rows and 9 columns in the traffic dataset. Next, we will print summary information about the dataset.

In [3]:

# Dataset information 
traffic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              48204 non-null  object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB

Observation¶

There are no NaN values.
The date_time column is stored as strings (object dtype). It can be converted to a datetime dtype.

Before converting the date_time column to a datetime dtype, let's look at the summary statistics of the dataset.

In [4]:

# Summary statistics of the dataset
traffic.describe(include='all')

Out[4]:

	holiday	temp	rain_1h	snow_1h	clouds_all	weather_main	weather_description	date_time	traffic_volume
count	48204	48204.000000	48204.000000	48204.000000	48204.000000	48204	48204	48204	48204.000000
unique	12	NaN	NaN	NaN	NaN	11	38	40575	NaN
top	None	NaN	NaN	NaN	NaN	Clouds	sky is clear	2013-05-19 10:00:00	NaN
freq	48143	NaN	NaN	NaN	NaN	15164	11665	6	NaN
mean	NaN	281.205870	0.334264	0.000222	49.362231	NaN	NaN	NaN	3259.818355
std	NaN	13.338232	44.789133	0.008168	39.015750	NaN	NaN	NaN	1986.860670
min	NaN	0.000000	0.000000	0.000000	0.000000	NaN	NaN	NaN	0.000000
25%	NaN	272.160000	0.000000	0.000000	1.000000	NaN	NaN	NaN	1193.000000
50%	NaN	282.450000	0.000000	0.000000	64.000000	NaN	NaN	NaN	3380.000000
75%	NaN	291.806000	0.000000	0.000000	90.000000	NaN	NaN	NaN	4933.000000
max	NaN	310.070000	9831.300000	0.510000	100.000000	NaN	NaN	NaN	7280.000000

Quick observations¶

There are no NaN values
The holiday column has 12 unique string values
The weather_main column has 11 unique entries
The weather_description column has 38 unique values
There are 3 dtypes: float, int, and object
The date_time column is stored in the format "year-month-day hour:minutes:second".

We will use this information about the date_time column to convert it to a datetime dtype, which makes it easier to work with dates and times.

In [5]:

# convert the date_time column to datetime dtype
traffic['date_time'] = pd.to_datetime(traffic['date_time'], format = "%Y-%m-%d %H:%M:%S")

In [6]:

# Check whether the date_time conversion was successful
traffic.date_time.dtype

Out[6]:

dtype('<M8[ns]')

The dtype('<M8[ns]') is a datetime dtype, so the conversion succeeded.
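As a minimal sketch of what the conversion enables, the snippet below parses two timestamps copied from the first and last rows shown earlier and reads their components through the `.dt` accessor. The variable names (`raw`, `parsed`) are illustrative only and do not appear in the notebook.

```python
import pandas as pd

# Two timestamps in the same "year-month-day hour:minute:second" layout
raw = pd.Series(["2012-10-02 09:00:00", "2018-09-30 23:00:00"])

# With an explicit format, a string in a different layout raises an error
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

# Once the column has a datetime dtype, .dt exposes the individual components
hours = parsed.dt.hour.tolist()
months = parsed.dt.month.tolist()
years = parsed.dt.year.tolist()
print(parsed.dtype, hours, months, years)
```

The same accessor is what the later cells rely on to derive the hour, month, year, and day of the week.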

Analyzing Traffic Volume¶

We are interested in the traffic volume and in indicators that suggest a rise or fall in traffic, so we're going to visualise the traffic_volume column using a histogram. This will help us understand the distribution of its values. But before we plot the histogram, let's look at the statistics.

In [7]:

# Summary statistics of the traffic_volume column
traffic['traffic_volume'].describe()

Out[7]:

count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64

About 25% of the time, there were 1,193 cars or fewer passing the station each hour. This probably occurs during the night, or when a road is under construction.

About 75% of the time, 4,933 cars or fewer passed the station each hour. Put differently, the volume reached 4,933 cars or more only about 25% of the time.
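The way these percentiles are read can be illustrated with a small sketch. The hourly counts below are made up for illustration and are not taken from the dataset; only `Series.quantile()` is the real pandas call.

```python
import pandas as pd

# Hypothetical hourly counts, not taken from the dataset
volume = pd.Series([300, 800, 1200, 2500, 3400, 4100, 4900, 5600, 6300])

q25 = volume.quantile(0.25)  # value with roughly a quarter of the hours at or below it
q75 = volume.quantile(0.75)  # value with roughly a quarter of the hours at or above it

share_at_or_below_q25 = (volume <= q25).mean()
share_at_or_above_q75 = (volume >= q75).mean()
print(q25, q75, share_at_or_below_q25, share_at_or_above_q75)
```

With so few values the shares are only approximately 25%, but the reading is the same as for the 25% and 75% rows of `describe()`.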

In [8]:

# Histogram showing traffic volume distribution
traffic['traffic_volume'].hist()
plt.title('Traffic volume')
plt.xlabel('Traffic volume')
plt.show()

The values in the traffic_volume column range from 0 to 7,280.
The distribution of values in the traffic_volume column is asymmetric.
Many of the values fall between 3,000 and 6,000.

There is a possibility that nighttime and daytime might influence traffic volume. We will compare daytime with nighttime data.

To do this, we will divide the dataset into two parts:

Daytime data: hours from 7 a.m. to 7 p.m. (12 hours)
Nighttime data: hours from 7 p.m. to 7 a.m. (12 hours)

In [9]:

# Separating the day and night traffic
# Getting daytime traffic: hours 7 to 18 inclusive
day = traffic[(traffic['date_time'].dt.hour >= 7) & (traffic['date_time'].dt.hour < 19)].copy()
print(day.shape)

# Getting nighttime traffic: hours 19 to 23 and 0 to 6
night = traffic[(traffic['date_time'].dt.hour >= 19) | (traffic['date_time'].dt.hour < 7)].copy()
print(night.shape)

(23877, 9)
(24327, 9)
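The two shapes add up to the full dataset (23877 + 24327 = 48204), which is what we expect if the two conditions split the hours cleanly. A minimal sketch of that check on a hypothetical one-day frame (the `toy` frame and the mask names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with one row for each hour of a single day
toy = pd.DataFrame({
    "date_time": pd.to_datetime([f"2012-10-02 {h:02d}:00:00" for h in range(24)])
})

hour = toy["date_time"].dt.hour
is_day = (hour >= 7) & (hour < 19)   # same condition as the daytime filter
is_night = ~is_day                    # everything that is not daytime

day_rows = toy[is_day]
night_rows = toy[is_night]

# The negated daytime mask matches the explicit nighttime condition
same_as_explicit = (is_night == ((hour >= 19) | (hour < 7))).all()
print(len(day_rows), len(night_rows), same_as_explicit)
```

Every hour falls into exactly one of the two groups, so no row is dropped or counted twice.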

Traffic Volume: Day vs. Night¶

We're going to compare the traffic volume at night and during the day.

In [10]:

# Grid chart for the daytime and nighttime traffic

# Create the grid
plt.figure(figsize = (12,6))

# Day time histogram
plt.subplot(1,2,1)
plt.hist(day['traffic_volume'])
plt.title('Daytime traffic volume')
plt.xlabel('Traffic volume')
plt.ylabel('Frequency')
plt.xlim(0,7500)
plt.ylim(0,8000)

# Night time histogram
plt.subplot(1,2,2)
plt.hist(night['traffic_volume'])
plt.title('Nighttime traffic volume')
plt.xlabel('Traffic volume')
plt.ylabel('Frequency')
plt.xlim(0,7500)
plt.ylim(0,8000)

plt.show()

In [11]:

# Look up some statistics of the day traffic_volume column
day['traffic_volume'].describe()

Out[11]:

count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64

In [12]:

# Look up some statistics of the night traffic_volume column
night['traffic_volume'].describe()

Out[12]:

count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64

The histogram that shows the distribution of traffic volume during the day is left skewed. This means that most of the traffic volume values are high — there are 4,252 or more cars passing the station each hour 75% of the time (because 25% of values are less than 4,252).

The histogram displaying the nighttime data is right skewed. This means that most of the traffic volume values are low — 75% of the time, the number of cars that passed the station each hour was less than 2,819.

Although there are still measurements of over 5,000 cars per hour, the traffic at night is generally light. Our goal is to find indicators of heavy traffic, so we'll only focus on the daytime data moving forward.
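The sign convention behind "left skewed" and "right skewed" can also be checked numerically with `Series.skew()`. The sketch below uses two made-up samples, not the notebook's data: one with a long tail toward low values and one with a long tail toward high values.

```python
import pandas as pd

# Hypothetical samples, chosen only to show the sign of the skewness
mostly_high = pd.Series([0, 4000, 4500, 4800, 5000, 5200, 5500])   # tail toward low values
mostly_low = pd.Series([500, 600, 800, 1200, 1300, 2800, 6000])    # tail toward high values

left_skew = mostly_high.skew()   # negative for a left-skewed sample
right_skew = mostly_low.skew()   # positive for a right-skewed sample
print(left_skew, right_skew)
```

Under this convention, running `day['traffic_volume'].skew()` and `night['traffic_volume'].skew()` would be one way to confirm what the histograms suggest.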

Time Indicators¶

Previously, we determined that the traffic at night is generally light. Our goal is to find indicators of heavy traffic, so we decided to only focus on the daytime data moving forward.

One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain time of the day.

We're going to look at a few line plots showing how the traffic volume changed according to the following parameters:

Month
Day of the week
Time of day

We will use the DataFrame.groupby() method to get the average traffic volume for each month.
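A minimal sketch of the pattern on a hypothetical four-row frame (the values are invented for illustration). Selecting the numeric column before calling `.mean()` sidesteps a pitfall: to the best of our knowledge, pandas 2.0 and later raise an error when averaging a grouped frame that still contains text columns unless `numeric_only=True` is passed.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset
toy = pd.DataFrame({
    "month": [1, 1, 7, 7],
    "weather_main": ["Snow", "Clouds", "Clear", "Rain"],  # text column, not averaged
    "traffic_volume": [4000, 4400, 4600, 5000],
})

# Average traffic volume for each month
by_month_toy = toy.groupby("month")["traffic_volume"].mean()
print(by_month_toy)
```

The result is a Series indexed by month, which can be passed directly to a plotting function.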

In [13]:

# Creating a new column for the month of each observation
day['month'] = day['date_time'].dt.month

# Get the average traffic volume for each month
# (numeric_only=True skips the text columns, which cannot be averaged)
by_month = day.groupby('month').mean(numeric_only=True)

# Visualize the result with a line chart
plt.plot(by_month['traffic_volume'])
plt.title('Traffic Volume By Month')
plt.xlabel('Month')
plt.ylabel('Average traffic volume')
plt.show()

Traffic looks less heavy during cold months (November–February) and more intense during warm months (March–October), with one exception: July. Let us see whether this is the case for July in every year.

In [14]:

# Creating a new column for the year
day['year'] = day['date_time'].dt.year

# Keep only the July observations
only_july = day[day['month'] == 7]

# Get the average July traffic volume for each year
only_july_avg = only_july.groupby('year').mean(numeric_only=True)

# Plot the July averages by year
only_july_avg['traffic_volume'].plot.line()

plt.show()

The line graph above shows that July traffic follows the same pattern as the other warm months in every year except 2016. Something unusual, for example road construction, may have happened during that time, but confirming it would require further investigation.

We'll now continue with line plots for another time unit: day of the week. We start by getting the average traffic volume for each day of the week.

In [15]:

# Creating a new column for the day of the week (0 is Monday, 6 is Sunday)
day['dayofweek'] = day['date_time'].dt.dayofweek

# Get the average traffic volume for each day of the week
# (numeric_only=True skips the text columns, which cannot be averaged)
by_dayofweek = day.groupby('dayofweek').mean(numeric_only=True)

# Visualize the result with a line chart
plt.plot(by_dayofweek['traffic_volume'])
plt.title('Traffic Volume By Weekday')
plt.xlabel('Day: 0 is Monday and 6 is Sunday')
plt.ylabel('Average traffic volume')
plt.show()

Traffic volume is significantly higher on business days (Monday to Friday) and lower on weekends. This could be because workers use the road to commute to work, while people generally stay home and rest on weekends.

Next, we will look at the traffic volume by hour of the day, separately for business days and weekends.
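The day numbering can be checked with a short sketch. It uses the first and last timestamps shown in the dataset preview; the variable names are illustrative only.

```python
import pandas as pd

# First and last timestamps from the dataset preview
stamps = pd.to_datetime(pd.Series(["2012-10-02 09:00:00", "2018-09-30 23:00:00"]))

day_numbers = stamps.dt.dayofweek.tolist()            # 0 is Monday, 6 is Sunday
day_names = stamps.dt.day_name().tolist()             # English day names by default
is_business_day = (stamps.dt.dayofweek <= 4).tolist()  # Monday to Friday
print(day_numbers, day_names, is_business_day)
```

A threshold of 4 therefore separates Monday to Friday from Saturday and Sunday.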

In [16]:

# Create an hour column
day['hour'] = day['date_time'].dt.hour

# Split the day data into business days (0-4) and weekends (5-6)
workday = day[day['dayofweek'] <= 4].copy()
weekend = day[day['dayofweek'] >= 5].copy()

# Get the hourly averages for weekends and business days
# (numeric_only=True skips the text columns, which cannot be averaged)
weekend_avg = weekend.groupby('hour').mean(numeric_only=True)
workday_avg = workday.groupby('hour').mean(numeric_only=True)

# Create a grid for the line charts
plt.figure(figsize = (12,5))

# Create the line chart for weekends
plt.subplot(1,2,1)
plt.plot(weekend_avg['traffic_volume'])
plt.title('Traffic volume by hour: weekend')
plt.xlabel('Hour')
plt.ylabel('Average traffic volume')
plt.xlim(6,20)
plt.ylim(1500,6500)

# Create the line chart for business days
plt.subplot(1,2,2)
plt.plot(workday_avg['traffic_volume'])
plt.title('Traffic volume by hour: business day')
plt.xlabel('Hour')
plt.ylabel('Average traffic volume')
plt.xlim(6,20)
plt.ylim(1500,6500)

plt.show()

The graphs show the mean traffic volume for each hour of the day. At each hour, the averages are generally higher on business days than on weekends. On business days, traffic is heaviest around 7 and 16, the rush hours.

Weather Indicators¶

So far, we've focused on finding time indicators for heavy traffic, and we reached the following conclusions:

The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
The traffic is usually heavier on business days compared to weekends.
On business days, the rush hours are around 7 and 16.

Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: temp, rain_1h, snow_1h, clouds_all, weather_main, and weather_description.

We will look up their correlation values with the traffic volume to see if there is any sort of relationship.

In [17]:

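# Correlation of every numeric column with traffic_volume
# (numeric_only=True skips the text and datetime columns)
day.corr(numeric_only=True)['traffic_volume']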

Out[17]:

temp              0.128317
rain_1h           0.003697
snow_1h           0.001265
clouds_all       -0.032932
traffic_volume    1.000000
month            -0.022337
year             -0.003557
dayofweek        -0.416453
hour              0.172704
Name: traffic_volume, dtype: float64

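Among the numeric weather columns (temp, rain_1h, snow_1h, and clouds_all), temp has the strongest correlation with traffic_volume, at about +0.13, which is still weak. The categorical columns weather_main and weather_description are not numeric, so they do not appear in this table.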
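For reference, the values in the table are Pearson correlation coefficients, which is the default method in pandas. The sketch below computes the coefficient by hand on made-up numbers and compares it with `Series.corr()`. The `pearson` helper and the sample values are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures and hourly counts, not taken from the dataset
temp = pd.Series([270.0, 275.0, 280.0, 285.0, 290.0, 295.0])
volume = pd.Series([4500, 5200, 4300, 5100, 4700, 5000])

def pearson(a, b):
    """Covariance of a and b divided by the product of their spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

r_manual = pearson(temp, volume)
r_pandas = temp.corr(volume)  # Pearson is the default method
print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1. Values close to 0 indicate little or no linear relationship.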

Let's visualise this relationship.

In [18]:

# Scatter plot showing the relation between temperature and traffic volume
plt.scatter(day['traffic_volume'],day['temp'])
plt.title('Traffic by temperature')
plt.xlabel('Traffic volume')
plt.ylabel('Temperature')
plt.ylim(200,350)
plt.show()

We can conclude from the scatter plot above that temperature has no significant impact on traffic volume.

Weather Types¶

Next we look into the categorical weather-related columns: weather_main and weather_description, to see if we can find more useful information.

To do this we are going to calculate the mean traffic volume associated with each unique value in these two columns.

In [19]:

# Averages for the unique values in the weather_main column
# (numeric_only=True skips the text columns, which cannot be averaged)
by_weather_main = day.groupby('weather_main').mean(numeric_only=True)

# Averages for the unique values in the weather_description column
by_weather_dsecription = day.groupby('weather_description').mean(numeric_only=True)

# Let's visualise our results (horizontal bars: volume on x, weather type on y)
by_weather_main['traffic_volume'].plot.barh()
plt.title('Traffic by Weather type')
plt.xlabel('Traffic volume')
plt.ylabel('Weather type')
plt.show()

The bar chart shows that no weather type has an average traffic volume above 5,000 cars. Average traffic is lower for the Fog weather type and higher during thunderstorms, but there is no clear indication that weather type affects traffic volume. Let us look at the weather descriptions instead.

In [20]:

# Horizontal bars: volume on x, weather description on y
by_weather_dsecription['traffic_volume'].plot.barh(figsize=(10,10))
plt.title('Traffic volume by weather description')
plt.xlabel('Traffic volume')
plt.ylabel('Weather description')
plt.show()

Three weather descriptions have an average traffic volume above 5,000 cars:

Shower snow
Proximity thunderstorm with drizzle
Light rain and snow

It's not clear why these weather types have the highest average traffic values. This is bad weather, but not that bad. Perhaps more people take their cars out of the garage when the weather is bad instead of riding a bike or walking.
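One thing that may be worth checking before reading much into these averages is how many hourly observations each description has, because a mean over very few rows can be unusually high by chance. The sketch below shows the pattern on invented rows. The counts and volumes are hypothetical and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset
toy = pd.DataFrame({
    "weather_description": ["shower snow", "sky is clear", "sky is clear",
                            "sky is clear", "light rain", "light rain"],
    "traffic_volume": [5600, 4800, 5000, 4600, 4500, 4700],
})

# Mean traffic volume and number of observations for each description
summary = (
    toy.groupby("weather_description")["traffic_volume"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Applied to the `day` frame, the same call would show whether the top three descriptions are backed by many observations or only a few.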

Conclusion¶

In this project, we tried to find a few indicators of heavy traffic on the I-94 Interstate highway. We managed to find two types of indicators:

Time indicators
- The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
- The traffic is usually heavier on business days compared to the weekends.
- On business days, the rush hours are around 7 and 16.
Weather indicators
- Shower snow
- Light rain and snow
- Proximity thunderstorm with drizzle