Interstate 94 (I-94) is an east–west Interstate Highway connecting the Great Lakes and northern Great Plains regions of the United States. This project analyzes traffic conditions on I-94 to identify the trends and factors that influence traffic volume.
The goal of our analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of the day, time of the week, etc. For instance, we may find out that the traffic is usually heavier in the summer or when it snows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
traffic=pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic.head(5)
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
traffic.tail()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
traffic['traffic_volume'].plot.hist(bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x7f182015dc40>
traffic['traffic_volume'].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
Between 2012-10-02 09:00:00 and 2018-09-30 23:00:00, the hourly traffic volume varied from 0 to 7,280 cars, with an average of 3,260 cars.
About 25% of the time, there were only 1,193 cars or fewer passing the station each hour — this probably occurs during the night, or when a road is under construction. However, about 25% of the time, the traffic volume was four times as much (4,933 cars or more).
We'll start by dividing the dataset into two parts: daytime data (hours from 7 a.m. to 7 p.m.) and nighttime data (hours from 7 p.m. to 7 a.m.).
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
# Filter first, then copy, so that later column assignments act on an independent DataFrame
day = traffic[(traffic['date_time'].dt.hour >= 7) & (traffic['date_time'].dt.hour < 19)].copy()
print(day.shape)
night = traffic[(traffic['date_time'].dt.hour >= 19) | (traffic['date_time'].dt.hour < 7)].copy()
print(night.shape)
(23877, 9)
(24327, 9)
fig = plt.figure(figsize=(12,4))
plt.subplot(1, 2, 1)
d=traffic[(traffic['date_time'].dt.hour >= 7) & (traffic['date_time'].dt.hour < 19)]
plt.hist(d['date_time'].dt.hour,bins=12)
plt.xticks(range(7,19))
plt.ylim(ymin=1900, ymax = 2100)
(1900.0, 2100.0)
The histogram counts the records available for each daytime hour. Hours with shorter bars have fewer records, which shows where data is missing.
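The gaps can also be listed directly instead of being read off the bar heights. The sketch below assumes the readings are meant to be hourly; `missing_hours` is a hypothetical helper of our own, and the three timestamps are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

def missing_hours(timestamps):
    """Return the hourly timestamps that are absent from `timestamps`."""
    ts = pd.to_datetime(pd.Series(timestamps)).dt.floor("h")
    # Every hour between the first and the last observation
    expected = pd.date_range(ts.min(), ts.max(), freq="h")
    # Duplicate timestamps, if any, are collapsed before comparing
    return expected.difference(pd.DatetimeIndex(ts.unique()))

# Made-up example: the 11:00 reading is absent
demo = ["2012-10-02 09:00:00", "2012-10-02 10:00:00", "2012-10-02 12:00:00"]
gaps = missing_hours(demo)
print(gaps)
```

On the real data, the call would be `missing_hours(traffic['date_time'])`, after the column has been converted with `pd.to_datetime`.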
fig = plt.figure(figsize=(11,3.5))
ax1 = plt.subplot(1,2,1)
ax2=plt.subplot(1,2,2)
ax1.hist(day['traffic_volume'])
ax1.set_ylabel('Frequency')
ax1.set_xlabel('Traffic Volume')
ax1.set_title('Day Volume')
ax2.hist(night['traffic_volume'])
ax2.set_title('Night volume')
ax2.set_ylabel('Frequency')
ax2.set_xlabel('Traffic Volume')
plt.show()
day['traffic_volume'].describe()
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64
night['traffic_volume'].describe()
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64
The histogram that shows the distribution of traffic volume during the day is left skewed. This means that most of the traffic volume values are high: 4,252 or more cars pass the station each hour 75% of the time (because 25% of values are less than 4,252).
The histogram displaying the night-time data is right skewed. This means that most of the traffic volume values are low — 75% of the time, the number of cars that passed the station each hour was less than 2,819.
Although there are still measurements of over 5,000 cars per hour, the traffic at night is generally light. Our goal is to find indicators of heavy traffic, so we'll only focus on the daytime data moving forward.
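The skew direction read from the histograms can be cross-checked numerically with the sample skewness: a negative value suggests a left skew and a positive value a right skew. This is only a sketch; `skew_direction` is a hypothetical helper and the two samples are synthetic, not the traffic data.

```python
import pandas as pd

def skew_direction(values):
    """Label a sample as left or right skewed from the sign of its sample skewness."""
    skew = pd.Series(values).skew()
    if skew < 0:
        return "left", skew
    if skew > 0:
        return "right", skew
    return "symmetric", skew

# Synthetic samples for illustration only
mostly_high = [1, 5, 6, 6, 7, 7, 7]  # long tail towards low values
mostly_low = [1, 1, 1, 2, 2, 3, 7]   # long tail towards high values

left_label, _ = skew_direction(mostly_high)
right_label, _ = skew_direction(mostly_low)
print(left_label, right_label)
```

On the real data, `day['traffic_volume'].skew()` and `night['traffic_volume'].skew()` would give the two values to compare.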
One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain time of day.
We're going to look at a few line plots showing how the traffic volume changes according to the month, the day of the week, and the time of day.
day['month'] = day['date_time'].dt.month
by_month = day.groupby('month').mean(numeric_only=True) # average the numeric columns only
by_month['traffic_volume']
month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64
by_month['traffic_volume'].plot.line()
<matplotlib.axes._subplots.AxesSubplot at 0x7f1820079d90>
The traffic looks less heavy during cold months (November–February) and more intense during warm months (March–October), with one interesting exception: July. Is there anything special about July? Is traffic significantly less heavy in July each year?
To answer the last question, let's see how the traffic volume changed each year in July.
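One way to look at July on its own is to keep only the July rows and average them per year. The following is a minimal sketch of that idea; `july_mean_by_year` is a hypothetical helper, and the small DataFrame is synthetic so that the block runs without the CSV file.

```python
import pandas as pd

def july_mean_by_year(df):
    """Average traffic_volume over the July rows of each year."""
    july = df[df["date_time"].dt.month == 7]
    return july.groupby(july["date_time"].dt.year)["traffic_volume"].mean()

# Synthetic rows for illustration only (the June row must be ignored)
demo = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2016-07-01 08:00", "2016-07-02 08:00",
        "2017-07-01 08:00", "2017-06-01 08:00",
    ]),
    "traffic_volume": [1000, 3000, 5000, 9999],
})
july_means = july_mean_by_year(demo)
print(july_means)
```

On the real data, `july_mean_by_year(day).plot.line()` would show whether a single unusual year pulls the July average down or whether July is low every year.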
day['dayofweek'] = day['date_time'].dt.dayofweek
by_dayofweek = day.groupby('dayofweek').mean(numeric_only=True) # average the numeric columns only
by_dayofweek['traffic_volume'] # 0 is Monday, 6 is Sunday
dayofweek
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64
day['year'] = day['date_time'].dt.year
by_years = day.groupby('year').mean(numeric_only=True) # average the numeric columns only
by_years['traffic_volume']
year
2012    4675.346861
2013    4834.084298
2014    4765.309296
2015    4748.448485
2016    4637.518293
2017    4865.961752
2018    4726.280534
Name: traffic_volume, dtype: float64
by_dayofweek['traffic_volume'].plot.line()
<matplotlib.axes._subplots.AxesSubplot at 0x7f181d8a5040>
Traffic volume is significantly heavier on business days (Monday – Friday). Except for Monday, we only see values over 5,000 during business days. Traffic is lighter on weekends, with values below 4,000 cars.
Generating a line plot for the time of day will cause a problem as the weekends will drag down the average values, so we're going to look at the averages separately. To do that, we'll start by splitting the data based on the day type: business day or weekend.
day['hour'] = day['date_time'].dt.hour
business_days = day[day['dayofweek'] <= 4].copy() # 4 == Friday
weekend = day[day['dayofweek'] >= 5].copy() # 5 == Saturday
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)
by_hour_business['traffic_volume'].plot.line(label = 'Business Day')
by_hour_weekend['traffic_volume'].plot.line(label = 'Weekend')
plt.legend(loc='lower right')
plt.title('Traffic by hour for day')
plt.xlabel('Hour')
plt.ylabel('Traffic Volume')
Text(0, 0.5, 'Traffic Volume')
At each hour of the day, the traffic volume is generally higher on business days than on weekends. As expected, the rush hours are around 7 and 16, when most people travel from home to work and back. We see volumes of over 6,000 cars at rush hours.
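To summarize, we found a few time-related indicators of heavy traffic: the warm months (March–October), business days, and the rush hours around 7 and 16 on business days.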
Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: temp, rain_1h, snow_1h, clouds_all, weather_main, weather_description.
A few of these columns are numerical, so let's start by looking up their correlation values with traffic_volume.
day.corr(numeric_only=True)['traffic_volume'] # correlate the numeric columns only
temp              0.128317
rain_1h           0.003697
snow_1h           0.001265
clouds_all       -0.032932
traffic_volume    1.000000
month            -0.022337
dayofweek        -0.416453
year             -0.003557
hour              0.172704
Name: traffic_volume, dtype: float64
Among the weather columns, temperature shows the strongest correlation, with a value of just +0.13. The other weather columns (rain_1h, snow_1h, clouds_all) don't show any strong correlation with traffic_volume.
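To compare only the weather columns, the correlations can be computed for that subset and ranked by absolute value. This is a sketch rather than the notebook's own method: `weather_cols` and `weather_correlations` are names introduced here, and the five rows are synthetic.

```python
import pandas as pd

# Numeric weather columns named in the text
weather_cols = ["temp", "rain_1h", "snow_1h", "clouds_all"]

def weather_correlations(df):
    """Pearson correlation of each weather column with traffic_volume, strongest first."""
    corr = df[weather_cols].corrwith(df["traffic_volume"])
    return corr.sort_values(key=abs, ascending=False)

# Synthetic rows for illustration only
demo = pd.DataFrame({
    "temp": [10, 20, 30, 40, 50],
    "rain_1h": [0, 1, 0, 1, 0],
    "snow_1h": [0, 0, 0, 0, 1],
    "clouds_all": [90, 75, 40, 20, 1],
    "traffic_volume": [1, 2, 3, 4, 5],
})
ranked = weather_correlations(demo)
print(ranked)
```

On the real data, `weather_correlations(day)` would return the same four weather values as the full correlation output, without the time columns mixed in.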
Let's generate a scatter plot to visualize the correlation between temp and traffic_volume.
scatter=sns.relplot(data=day,x='traffic_volume',y='temp',hue='weather_main')
scatter.set(ylim=(240, 310))
<seaborn.axisgrid.FacetGrid at 0x7f181d8a53a0>
We can conclude that temperature doesn't look like a solid indicator of heavy traffic.
Let's now look at the weather_main distribution separately.
by_weather_main = day.groupby('weather_main').mean(numeric_only=True) # average the numeric columns only
by_weather_main['traffic_volume'].plot.barh()
plt.show()
It looks like there's no weather type where traffic volume exceeds 5,000 cars. This makes finding a heavy traffic indicator more difficult. Let's also group by weather_description, which has a more granular weather classification.
To start, we're going to group the data by weather_description and look at the traffic_volume averages.
by_weather_description = day.groupby('weather_description').mean(numeric_only=True) # average the numeric columns only
by_weather_description['traffic_volume'].plot.barh(figsize=(5,10))
plt.show()
It looks like there are three weather types where traffic volume exceeds 5,000:
- Shower snow
- Light rain and snow
- Proximity thunderstorm with drizzle

It's not clear why these weather types have the highest average traffic values, since this is bad weather, but not that bad. Perhaps more people take their cars out of the garage when the weather is bad instead of riding a bike or walking.
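Before reading much into these averages, it may be worth checking how many observations stand behind each weather type, because a mean over a handful of rows can be high by chance. Whether that applies here is an assumption we have not verified. The sketch below uses a hypothetical helper, `mean_with_counts`, and synthetic rows.

```python
import pandas as pd

def mean_with_counts(df, by, value="traffic_volume"):
    """Mean of `value` per group, next to the number of rows in each group."""
    summary = df.groupby(by)[value].agg(["mean", "count"])
    return summary.sort_values("mean", ascending=False)

# Synthetic rows for illustration only
demo = pd.DataFrame({
    "weather_description": ["shower snow", "sky is clear", "sky is clear", "sky is clear"],
    "traffic_volume": [5600, 4000, 5000, 4500],
})
summary = mean_with_counts(demo, "weather_description")
print(summary)
```

On the real data, `mean_with_counts(day, 'weather_description')` would show whether the three weather types above rest on many hours of data or on only a few.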
In this project, we tried to find a few indicators of heavy traffic on the I-94 Interstate highway. We managed to find two types of indicators:
Time indicators
- The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
- The traffic is usually heavier on business days compared to the weekends.
- On business days, the rush hours are around 7 and 16.
Weather indicators
- Shower snow
- Light rain and snow
- Proximity thunderstorm with drizzle
Let's now run the same analysis on the nighttime data, starting with the grouping by month.
night['month'] = night['date_time'].dt.month
n_by_month = night.groupby('month').mean(numeric_only=True) # average the numeric columns only
n_by_month['traffic_volume']
month
1     1616.610448
2     1716.961841
3     1817.272029
4     1786.116598
5     1829.852518
6     1932.272727
7     1838.349193
8     1897.564079
9     1818.959858
10    1852.168591
11    1680.311799
12    1622.508393
Name: traffic_volume, dtype: float64
n_by_month['traffic_volume'].plot.line()
<matplotlib.axes._subplots.AxesSubplot at 0x7f1820084520>
At night, the traffic also looks less heavy during cold months (November–February) and more intense during warm months (March–October).
night['dayofweek'] = night['date_time'].dt.dayofweek
n_by_dayofweek = night.groupby('dayofweek').mean(numeric_only=True) # average the numeric columns only
n_by_dayofweek['traffic_volume'].plot.line()
<matplotlib.axes._subplots.AxesSubplot at 0x7f181c5cd1c0>
The traffic keeps increasing during the week and drops on the weekend, the same trend that we saw in the daytime data.
night['year'] = night['date_time'].dt.year
nby_years = night.groupby('year').mean(numeric_only=True) # average the numeric columns only
nby_years['traffic_volume']
year
2012    1678.183559
2013    1776.179115
2014    1762.013115
2015    1793.439856
2016    1739.358719
2017    1857.150670
2018    1796.524133
Name: traffic_volume, dtype: float64
# generating the line plot for business and weekend days
night['hour'] = night['date_time'].dt.hour
nbusiness_days = night[night['dayofweek'] <= 4].copy() # 4 == Friday
nweekend = night[night['dayofweek'] >= 5].copy() # 5 == Saturday
nby_hour_business = nbusiness_days.groupby('hour').mean(numeric_only=True)
nby_hour_weekend = nweekend.groupby('hour').mean(numeric_only=True)
nby_hour_business['traffic_volume'].plot.line(label = 'Business Day')
nby_hour_weekend['traffic_volume'].plot.line(label = 'Weekend')
plt.legend(loc='lower right')
plt.title('Traffic by hour for night')
plt.xlabel('Hour')
plt.xticks(range(0,24))
plt.ylabel('Traffic Volume')
Text(0, 0.5, 'Traffic Volume')
The straight line between 7 a.m. and 7 p.m. covers the hours that are not part of the nighttime data; the plot simply connects the last morning point to the first evening point.
Between 6 and 7 a.m., traffic rises sharply on both business days and weekends, with business days reaching more than 5,000 cars. After 7 p.m., traffic decreases gradually on all days.
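The connecting line across the daytime hours can be avoided by reindexing the hourly averages to all 24 hours, so that the hours without data become NaN. This is a sketch of that idea; `hourly_mean_with_gaps` is a hypothetical helper and the four rows are synthetic.

```python
import pandas as pd

def hourly_mean_with_gaps(df):
    """Average traffic_volume per hour, with NaN for hours that have no rows."""
    by_hour = df.groupby(df["date_time"].dt.hour)["traffic_volume"].mean()
    # Hours absent from the data are added as NaN
    return by_hour.reindex(range(24))

# Synthetic rows for illustration only
demo = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2018-09-30 05:00", "2018-09-30 06:00",
        "2018-09-30 19:00", "2018-09-30 20:00",
    ]),
    "traffic_volume": [800, 2500, 3500, 2800],
})
night_by_hour = hourly_mean_with_gaps(demo)
print(night_by_hour)
```

Calling `night_by_hour.plot.line()` on such a series leaves the daytime hours blank, because Matplotlib does not draw line segments through NaN values.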
As with the daytime data, let's check how the numerical columns correlate with traffic_volume at night.
night.corr(numeric_only=True)['traffic_volume'] # correlate the numeric columns only
temp              0.094004
rain_1h          -0.012972
snow_1h          -0.007453
clouds_all        0.012832
traffic_volume    1.000000
month             0.001342
dayofweek        -0.073636
year              0.018544
hour              0.454586
Name: traffic_volume, dtype: float64
At night, traffic volume correlates most strongly with the hour (+0.45), which matches what we saw in the hourly line plot above.
Let's check temp and weather_main together.
scatter = sns.relplot(data=night, x='traffic_volume', y='temp', hue='weather_main') # nighttime data for this section
scatter.set(ylim=(240, 310))