Hourly traffic volume on Interstate 94 westbound for MN DoT ATR station 301, located roughly halfway between Minneapolis and St Paul, MN. Hourly weather updates and holidays are factored in to determine the impact on traffic volume.
The goal for this analysis is to identify a few indicators of heavy traffic on I-94. These indicators can include weather, time of day, weekday, and so on. For example, we may discover that traffic is typically heavier in the summer or when it snows.
The dataset was made available by John Hogue, and it can be downloaded from the UCI Machine Learning Repository.
First, we will import the required libraries. Then we'll use them to explore the dataset and the information it contains.
# Importing libraries
import pandas as pd
# Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Enable Jupyter Notebook to render the plots inline
%matplotlib inline
# Read the dataset
traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
# Print the first and last few rows
traffic
index | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
48204 rows × 9 columns
There are 48,204 rows and 9 columns in the traffic dataset. Next, we will print summary information about the dataset.
# Dataset information
traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
Before converting the date_time column to a datetime dtype, let's look at the summary statistics of the dataset.
# Summary statistics of the dataset
traffic.describe(include='all')
statistic | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
count | 48204 | 48204.000000 | 48204.000000 | 48204.000000 | 48204.000000 | 48204 | 48204 | 48204 | 48204.000000 |
unique | 12 | NaN | NaN | NaN | NaN | 11 | 38 | 40575 | NaN |
top | None | NaN | NaN | NaN | NaN | Clouds | sky is clear | 2013-05-19 10:00:00 | NaN |
freq | 48143 | NaN | NaN | NaN | NaN | 15164 | 11665 | 6 | NaN |
mean | NaN | 281.205870 | 0.334264 | 0.000222 | 49.362231 | NaN | NaN | NaN | 3259.818355 |
std | NaN | 13.338232 | 44.789133 | 0.008168 | 39.015750 | NaN | NaN | NaN | 1986.860670 |
min | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN | 0.000000 |
25% | NaN | 272.160000 | 0.000000 | 0.000000 | 1.000000 | NaN | NaN | NaN | 1193.000000 |
50% | NaN | 282.450000 | 0.000000 | 0.000000 | 64.000000 | NaN | NaN | NaN | 3380.000000 |
75% | NaN | 291.806000 | 0.000000 | 0.000000 | 90.000000 | NaN | NaN | NaN | 4933.000000 |
max | NaN | 310.070000 | 9831.300000 | 0.510000 | 100.000000 | NaN | NaN | NaN | 7280.000000 |
The date_time values follow the pattern YYYY-MM-DD HH:MM:SS, so we can convert the column to a datetime dtype using a matching format string. With this dtype it is much easier to extract components such as the hour, month, or day of the week.
# convert the date_time column to datetime dtype
traffic['date_time'] = pd.to_datetime(traffic['date_time'], format = "%Y-%m-%d %H:%M:%S")
# Check that the date_time conversion was successful
traffic.date_time.dtype
dtype('<M8[ns]')
The dtype('<M8[ns]') is a datetime dtype, so the conversion succeeded.
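To illustrate what the datetime dtype makes possible, here is a minimal sketch on three rows copied from the table above. The names `sample` and `parts` are illustrative and not part of the notebook.

```python
import pandas as pd

# Three rows taken from the printed table above
sample = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 13:00:00", "2018-09-30 23:00:00"],
    "traffic_volume": [5545, 4918, 954],
})
sample["date_time"] = pd.to_datetime(sample["date_time"], format="%Y-%m-%d %H:%M:%S")

# Once the column is datetime, the .dt accessor exposes its components
parts = pd.DataFrame({
    "hour": sample["date_time"].dt.hour,
    "month": sample["date_time"].dt.month,
    "dayofweek": sample["date_time"].dt.dayofweek,  # 0 = Monday, 6 = Sunday
})
print(parts)
```

These components are the building blocks for the day/night split and the time-based groupings used later in the analysis.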
We are interested in the traffic volume and in indicators that suggest a rise or fall in traffic, so we're going to visualise the traffic_volume column using a histogram. This will help us understand the distribution of its values. But before we plot the histogram, let's look at the summary statistics.
# Summary statistics of the traffic_volume column
traffic['traffic_volume'].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
About 25% of the time, there were 1,193 cars or fewer passing the station each hour — this probably occurs during the night, or when a road is under construction.
About 25% of the time, the traffic volume was 4,933 cars or more per hour, roughly four times as much.
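The way the quartiles are read can be checked on a small example. The sketch below uses synthetic, evenly spaced counts (not the real data) and confirms that about a quarter of the values lie at or below the 25% mark and about a quarter at or above the 75% mark. The variable names are illustrative.

```python
import pandas as pd

# Synthetic hourly counts: 1001 evenly spaced values from 0 to 7000
volumes = pd.Series(range(0, 7007, 7))

q25 = volumes.quantile(0.25)
q75 = volumes.quantile(0.75)

# Fraction of observations on each side of the quartiles
share_at_or_below_q25 = (volumes <= q25).mean()
share_at_or_above_q75 = (volumes >= q75).mean()

print(q25, q75)
print(share_at_or_below_q25, share_at_or_above_q75)
```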
# Histogram showing traffic volume distribution
traffic['traffic_volume'].hist()
plt.title('Traffic volume')
plt.xlabel('Traffic volume')
plt.show()
Traffic volume may depend on whether it is daytime or nighttime, so we will compare the two.
To do this, we will divide the dataset into two parts: daytime data (hours from 7:00 up to 19:00) and nighttime data (hours from 19:00 up to 7:00).
# Separating the day and night traffic
# Getting daytime traffic (7:00 up to, but not including, 19:00)
day = traffic[(traffic['date_time'].dt.hour >= 7) & (traffic['date_time'].dt.hour < 19)].copy()
print(day.shape)
# Getting nighttime traffic (19:00 up to, but not including, 7:00)
night = traffic[(traffic['date_time'].dt.hour >= 19) | (traffic['date_time'].dt.hour < 7)].copy()
print(night.shape)
(23877, 9)
(24327, 9)
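The same split can also be written with a single boolean mask. The sketch below is one possible variant, shown on a hypothetical frame of 24 hourly timestamps; `stamps`, `day_part`, and `night_part` are illustrative names.

```python
import pandas as pd

# Hypothetical data: one timestamp per hour for a single day
stamps = pd.DataFrame({
    "date_time": pd.to_datetime([f"2018-09-30 {h:02d}:00:00" for h in range(24)])
})

hours = stamps["date_time"].dt.hour
# between() is inclusive on both ends, so 7..18 matches (hour >= 7) & (hour < 19)
is_day = hours.between(7, 18)

day_part = stamps[is_day].copy()
night_part = stamps[~is_day].copy()  # everything that is not daytime

print(day_part.shape, night_part.shape)
```

Negating one mask guarantees that every row lands in exactly one of the two parts.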
We're going to compare the traffic volume at night with the traffic volume during the day.
# Grid chart for the daytime and nighttime traffic
# Create the grid
plt.figure(figsize = (12,6))
# Day time histogram
plt.subplot(1,2,1)
plt.hist(day['traffic_volume'])
plt.title('Daytime traffic volume')
plt.xlabel('Traffic volume')
plt.ylabel('Frequency')
plt.xlim(0,7500)
plt.ylim(0,8000)
# Night time histogram
plt.subplot(1,2,2)
plt.hist(night['traffic_volume'])
plt.title('Nighttime traffic volume')
plt.xlabel('Traffic volume')
plt.ylabel('Frequency')
plt.xlim(0,7500)
plt.ylim(0,8000)
plt.show()
# Look up some statistics of the day traffic_volume column
day['traffic_volume'].describe()
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64
# Look up some statistics of the night traffic_volume column
night['traffic_volume'].describe()
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64
The histogram that shows the distribution of traffic volume during the day is left skewed. This means that most of the traffic volume values are high — there are 4,252 or more cars passing the station each hour 75% of the time (because 25% of values are less than 4,252).
The histogram displaying the nighttime data is right skewed. This means that most of the traffic volume values are low — 75% of the time, the number of cars that passed the station each hour was less than 2,819.
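The direction of the skew can also be checked numerically with `Series.skew()`, which is negative for left-skewed data and positive for right-skewed data. The sketch below uses two small hypothetical samples rather than the real columns.

```python
import pandas as pd

# Hypothetical samples: one piled up at high values, one piled up at low values
mostly_high = pd.Series([1000, 4000, 4500, 4800, 5000, 5200, 5500])
mostly_low = pd.Series([300, 400, 500, 700, 1000, 2500, 6000])

print(mostly_high.skew())  # negative: long tail towards low values (left skew)
print(mostly_low.skew())   # positive: long tail towards high values (right skew)
```

Assuming the `day` and `night` frames defined earlier, the same call would be `day['traffic_volume'].skew()` and `night['traffic_volume'].skew()`.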
Although there are still measurements of over 5,000 cars per hour, the traffic at night is generally light. Our goal is to find indicators of heavy traffic, so we'll only focus on the daytime data moving forward.
Previously, we determined that the traffic at night is generally light. Our goal is to find indicators of heavy traffic, so we decided to only focus on the daytime data moving forward.
One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain time of the day.
We're going to look at a few line plots showing how the traffic volume changes according to the following parameters: month, day of the week, and time of day.
We will use the DataFrame.groupby() method to get the average traffic volume for each month.
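As a minimal sketch of how the grouping works, consider the hypothetical four-row frame below. Passing `numeric_only=True` is an assumption about the pandas version in use: recent releases raise an error when asked to average text columns such as weather_main, while older ones silently dropped them.

```python
import pandas as pd

# Hypothetical rows; the text column stands in for columns such as weather_main
toy = pd.DataFrame({
    "month": [1, 1, 7, 7],
    "weather_main": ["Snow", "Clouds", "Clear", "Rain"],
    "traffic_volume": [4000, 4400, 4500, 4700],
})

# One row per month; numeric_only=True skips columns that cannot be averaged
by_month_toy = toy.groupby("month").mean(numeric_only=True)
print(by_month_toy["traffic_volume"])
```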
# Creating a new column for the month of each observation
day['month'] = day['date_time'].dt.month
# Get the average monthly traffic volume (numeric columns only)
by_month = day.groupby('month').mean(numeric_only=True)
by_month['traffic_volume']
# Visualize the result with a line chart
plt.plot(by_month['traffic_volume'])
plt.title('Traffic Volume By Month')
plt.xlabel('Month')
plt.ylabel('Average traffic volume')
plt.show()
Traffic looks less heavy during cold months (November to February) and more intense during warm months (March to October), with one exception: July. Let us see whether traffic is lighter in July every year.
# Creating a new column for the year
day['year'] = day['date_time'].dt.year
# Keep only the July observations
only_july = day[day['month'] == 7]
# Get the average July traffic volume for each year (numeric columns only)
only_july_avg = only_july.groupby('year').mean(numeric_only=True)
# Plot the average July traffic volume by year
only_july_avg['traffic_volume'].plot.line()
plt.show()
The line graph above shows that July traffic is usually as heavy as in the other warm months. The only exception is 2016, when the average drops sharply. Something unusual may have happened on the road during that month, and further investigation would be needed to confirm the cause.
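One way to investigate further is to place a month next to its neighbours for each year, so that an unusual value stands out. The sketch below does this with `pivot_table` on hypothetical numbers; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical observations for two years and two months
toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016],
    "month": [6, 6, 7, 7, 6, 6, 7, 7],
    "traffic_volume": [4900, 5100, 4800, 5000, 5000, 5200, 3800, 4000],
})

# Rows are years, columns are months, cells hold the average traffic volume
year_by_month = toy.pivot_table(
    values="traffic_volume", index="year", columns="month", aggfunc="mean"
)
print(year_by_month)
```

Applied to the `day` frame, the same call would show at a glance which year-month combinations deviate from the rest.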
We'll now continue with line plots for another time unit: day of the week. As before, we group the data to get the average traffic volume for each day of the week.
# Creating a new column for the day of the week
day['dayofweek'] = day['date_time'].dt.dayofweek
# Get the average traffic volume for each day of the week (numeric columns only)
by_dayofweek = day.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume']
# Visualize the result with a line chart
plt.plot(by_dayofweek['traffic_volume'])
plt.title('Traffic Volume By Weekday')
plt.xlabel('Day: 0 is Monday and 6 is Sunday')
plt.ylabel('Average traffic volume')
plt.show()
Traffic volume is significantly higher on workdays (Monday to Friday) and lower on weekends. This is probably because commuters use the road to get to work, while people tend to stay home at weekends.
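The numeric day codes can be made more readable, and the workday/weekend comparison reduced to two numbers. A minimal sketch follows, assuming the usual pandas convention that 0 is Monday and 6 is Sunday. The traffic values and the names `frame` and `day_type` are hypothetical.

```python
import pandas as pd

# Hypothetical volumes for a Monday, a Friday, a Saturday and a Sunday
frame = pd.DataFrame({
    "date_time": pd.to_datetime(["2018-09-24", "2018-09-28", "2018-09-29", "2018-09-30"]),
    "traffic_volume": [5000, 5200, 3900, 3400],
})

frame["dayofweek"] = frame["date_time"].dt.dayofweek
frame["day_name"] = frame["date_time"].dt.day_name()
# Codes 5 and 6 are Saturday and Sunday
frame["day_type"] = frame["dayofweek"].map(lambda d: "weekend" if d >= 5 else "workday")

avg_by_type = frame.groupby("day_type")["traffic_volume"].mean()
print(frame[["day_name", "day_type"]])
print(avg_by_type)
```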
Next, we will look at how the traffic volume changes by hour of the day, separately for workdays and weekends.
# Create an hour column
day['hour'] = day['date_time'].dt.hour
# Split the day data into weekend and workdays
workday = day[day['dayofweek'] <= 4].copy()
weekend = day[day['dayofweek'] >= 5].copy()
# Get the hourly averages for the weekend and workdays (numeric columns only)
weekend_avg = weekend.groupby('hour').mean(numeric_only=True)
workday_avg = workday.groupby('hour').mean(numeric_only=True)
# Create a grid for the line charts
plt.figure(figsize = (12,5))
# Create the line chart for weekend
plt.subplot(1,2,1)
plt.plot(weekend_avg['traffic_volume'])
plt.title('Traffic volume by hour: weekend')
plt.xlabel('Hour')
plt.ylabel('Average traffic volume')
plt.xlim(6,20)
plt.ylim(1500,6500)
# Create line chart for workday
plt.subplot(1,2,2)
plt.plot(workday_avg['traffic_volume'])
plt.title('Traffic volume by hour: workday')
plt.xlabel('Hour')
plt.ylabel('Average traffic volume')
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.show()
The plots show how the average traffic volume changes by hour. At each hour, the average is generally higher on workdays than on weekends. On workdays, traffic is heaviest from the start of the daytime window until about 17:00.
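The busiest hours can also be read off programmatically instead of from the plot. The sketch below uses hypothetical hourly averages. With the notebook's data, the same calls could be applied to `workday_avg['traffic_volume']`.

```python
import pandas as pd

# Hypothetical hourly averages, indexed by hour of the day
hourly_avg = pd.Series(
    [6000, 5500, 4900, 4700, 5600, 6200, 5800],
    index=[7, 8, 10, 12, 15, 16, 17],
)

peak_hour = hourly_avg.idxmax()     # index label (hour) of the largest average
top_three = hourly_avg.nlargest(3)  # three busiest hours, largest first

print(peak_hour)
print(top_three)
```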
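So far, we've focused on finding time indicators for heavy traffic, and we reached the following conclusions: traffic is usually heavier during warm months (March to October) than during cold months (November to February); it is heavier on workdays than on weekends; and on workdays it stays heavy until about 17:00.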
Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: temp, rain_1h, snow_1h, clouds_all, weather_main, and weather_description.
We will look up the correlation of the numeric columns with the traffic volume to see if there is any sort of relationship.
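# Correlate only the numeric columns (the numeric_only argument needs pandas 1.5 or later)
day.corr(numeric_only=True)['traffic_volume']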
temp              0.128317
rain_1h           0.003697
snow_1h           0.001265
clouds_all       -0.032932
traffic_volume    1.000000
month            -0.022337
year             -0.003557
dayofweek        -0.416453
hour              0.172704
Name: traffic_volume, dtype: float64
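Among the numeric weather columns (temp, rain_1h, snow_1h, and clouds_all), temp shows the strongest correlation with traffic_volume, at about +0.13, which is still weak. The other three columns show essentially no correlation. The text columns weather_main and weather_description are not part of this calculation.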
Let's visualise this relationship.
# Scatter plot showing the relation between temperature and traffic volume
plt.scatter(day['traffic_volume'],day['temp'])
plt.title('Traffic by temperature')
plt.xlabel('Traffic volume')
plt.ylabel('Temperature')
plt.ylim(200,350)
plt.show()
We can conclude from the scatter plot above that temperature is not a reliable indicator of heavy traffic.
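To put a coefficient of about 0.13 in perspective, the sketch below computes Pearson correlations for two small hypothetical series and then squares the reported temp coefficient. In a simple linear fit, the squared value is the share of variance explained. Only the value 0.128317 comes from the output above; everything else is illustrative.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
strong = pd.Series([2, 4, 6, 8, 10])  # exactly linear in x
noisy = pd.Series([3, 1, 4, 1, 5])    # hypothetical values with a loose upward trend

r_strong = x.corr(strong)  # Pearson correlation by default
r_noisy = x.corr(noisy)

# Correlation between temp and traffic_volume reported above
r_temp = 0.128317
r_squared = r_temp ** 2  # share of variance explained by a straight-line fit

print(r_strong, r_noisy)
print(round(r_squared, 4))
```

A squared coefficient under 2% is consistent with the shapeless cloud of points in the scatter plot.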
Next, we look into the categorical weather-related columns, weather_main and weather_description, to see if we can find more useful information.
To do this we are going to calculate the mean traffic volume associated with each unique value in these two columns.
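As a small illustration of this step, the sketch below averages a hypothetical traffic column per weather category, sorts the result, and keeps only the categories whose average exceeds 5,000 cars. The data and the names `toy`, `avg_by_weather`, and `heavy` are invented for the example.

```python
import pandas as pd

# Hypothetical observations for three weather categories
toy = pd.DataFrame({
    "weather_main": ["Clear", "Clear", "Fog", "Fog", "Rain", "Rain"],
    "traffic_volume": [4800, 4700, 4300, 4400, 5100, 5000],
})

# Mean traffic volume per category, sorted from lightest to heaviest
avg_by_weather = toy.groupby("weather_main")["traffic_volume"].mean().sort_values()

# Categories whose average exceeds the heavy-traffic threshold
heavy = avg_by_weather[avg_by_weather > 5000]

print(avg_by_weather)
print(heavy)
```

Sorting before plotting also makes a horizontal bar chart easier to read, since the bars then appear in order of size.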
# Averages for the unique values in weather_main column (numeric columns only)
by_weather_main = day.groupby('weather_main').mean(numeric_only=True)
# Averages for the unique values in weather_description column (numeric columns only)
by_weather_dsecription = day.groupby('weather_description').mean(numeric_only=True)
# Let's visualise our results; on a horizontal bar chart the categories are on the y-axis
by_weather_main['traffic_volume'].plot.barh()
plt.title('Traffic by Weather type')
plt.xlabel('Average traffic volume')
plt.ylabel('Weather type')
plt.show()
The bar chart shows that no weather type has an average traffic volume above 5,000 cars. Traffic is somewhat lighter in fog and somewhat heavier during thunderstorms, but there is no clear indication that the weather type affects traffic volume. Let us look at the more granular weather_description column.
by_weather_dsecription['traffic_volume'].plot.barh(figsize=(10,10))
plt.title('Traffic volume by weather description')
plt.xlabel('Average traffic volume')
plt.ylabel('Weather description')
plt.show()
Three weather descriptions exceeded 5000 cars: clear sky,
It's not clear why these weather types have the highest average traffic values: this is bad weather, but not that bad. Perhaps more people take their cars out of the garage when the weather is bad instead of riding a bike or walking.
In this project, we tried to find a few indicators of heavy traffic on the I-94 Interstate highway. We managed to find two types of indicators: time indicators and weather indicators.