We have analysed a dataset about Westbound traffic on I-94 Interstate Highway for the period 2 Oct 2012 at 09.00 through 30 Sept 2018 at 23.00.
The goal of the analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of day, time of week and others.
We start off by reading the data and producing first 5 and last 5 rows of data as well as basic info on the dataset, including number of rows and columns.
import pandas as pd
import numpy as np
i94 = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
i94.head()
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
i94.tail()
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
i94.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48204 entries, 0 to 48203 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 holiday 48204 non-null object 1 temp 48204 non-null float64 2 rain_1h 48204 non-null float64 3 snow_1h 48204 non-null float64 4 clouds_all 48204 non-null int64 5 weather_main 48204 non-null object 6 weather_description 48204 non-null object 7 date_time 48204 non-null object 8 traffic_volume 48204 non-null int64 dtypes: float64(3), int64(2), object(4) memory usage: 3.3+ MB
import matplotlib.pyplot as plt
%matplotlib inline
i94['traffic_volume'].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f898ddb4130>
i94['traffic_volume'].describe()
count 48204.000000 mean 3259.818355 std 1986.860670 min 0.000000 25% 1193.000000 50% 3380.000000 75% 4933.000000 max 7280.000000 Name: traffic_volume, dtype: float64
The dataset shows the Westbound route of I-94 has over the dataset period has:
i94['date_time'] = pd.to_datetime(i94['date_time'])
i94.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48204 entries, 0 to 48203 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 holiday 48204 non-null object 1 temp 48204 non-null float64 2 rain_1h 48204 non-null float64 3 snow_1h 48204 non-null float64 4 clouds_all 48204 non-null int64 5 weather_main 48204 non-null object 6 weather_description 48204 non-null object 7 date_time 48204 non-null datetime64[ns] 8 traffic_volume 48204 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(2), object(3) memory usage: 3.3+ MB
daytime = i94.copy()[(i94['date_time'].dt.hour >= 7) & (i94['date_time'].dt.hour < 19)]
print(daytime.shape)
nighttime = i94.copy()[(i94['date_time'].dt.hour >= 19) | (i94['date_time'].dt.hour < 7)]
print(nighttime.shape)
(23877, 9) (24327, 9)
We extracted the hour from date_time column and calculated how many rows reference each of daytime (12 hours) and nighttime (12 hours). According to the calculation, there are more rows of nighttime than daytime. The difference between the two is 450. We will investigate this further.
# hours = i94['date_time'].dt.hour
# i94['hours'] = hours
# i94['increment'] = i94['hours'] - i94['hours'].shift(+1)
# i94[i94['increment'] != 1.0]
We have investigated the difference between total rows of nighttime and daytime. We added an hours
column to the dataset and then calculated the difference between each row and its immediately next following row on the basis that each row is a recording for that hour and the next for the next hour. There are many more than 450 rows with a difference of 1. We noticed that there are some consecutive records of the same hour, i.e., a difference of zero and others with multiple hours of difference. This aspect may be needed to be taken into account in later analysis.
i94.copy().info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48204 entries, 0 to 48203 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 holiday 48204 non-null object 1 temp 48204 non-null float64 2 rain_1h 48204 non-null float64 3 snow_1h 48204 non-null float64 4 clouds_all 48204 non-null int64 5 weather_main 48204 non-null object 6 weather_description 48204 non-null object 7 date_time 48204 non-null datetime64[ns] 8 traffic_volume 48204 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(2), object(3) memory usage: 3.3+ MB
plt.figure(figsize=(10, 8))
plt.subplot(1, 2, 1)
plt.hist(daytime['traffic_volume'])
plt.ylabel('Frequency')
plt.xlabel('Traffic Volume')
plt.ylim(0, 8000)
plt.xlim(0, 8000)
plt.title('Daytime')
plt.subplot(1, 2, 2)
plt.hist(nighttime['traffic_volume'])
plt.title('Nighttime')
plt.ylabel('Frequency')
plt.xlabel('Traffic Volume')
plt.ylim(0, 8000)
plt.xlim(0, 8000)
plt.show()
We have produced a grid of 2 histograms to show frequency of traffic_volume
in each of daytime and nighttime. We have also ensured both x and y axes have similar datapoints, respectively. Daytime histogram is as expected with a normal curve around the median traffic volume. Nighttime histogram shows as expected highest frequencies for lowest traffic volumes. There are some outliers around traffic_volume
of approximately 3000 and also stickiness of frequency for traffic_volume
of approximately 1500 through 3500.
daytime['traffic_volume'].describe()
count 23877.000000 mean 4762.047452 std 1174.546482 min 0.000000 25% 4252.000000 50% 4820.000000 75% 5559.000000 max 7280.000000 Name: traffic_volume, dtype: float64
nighttime['traffic_volume'].describe()
count 24327.000000 mean 1785.377441 std 1441.951197 min 0.000000 25% 530.000000 50% 1287.000000 75% 2819.000000 max 6386.000000 Name: traffic_volume, dtype: float64
We have provided basic data for each of daytime
and nighttime
traffic_volume
series. This shows some significant differences between traffic volume during daytime and nighttime. Mean values for each are less than half at night, though max values do not show such a difference, presumably because max and values that are close to max are much less frequent for nighttime than in case of daytime.
Considering the findings on nighttime
histogram and basic data, we could ignore nighttime
to determine indicators for heavy traffic.
daytime['month'] = daytime['date_time'].dt.month
by_month = daytime.groupby('month').mean()
by_month['traffic_volume']
month 1 4495.613727 2 4711.198394 3 4889.409560 4 4906.894305 5 4911.121609 6 4898.019566 7 4595.035744 8 4928.302035 9 4870.783145 10 4921.234922 11 4704.094319 12 4374.834566 Name: traffic_volume, dtype: float64
We have created a column in the daytime
dataset for monthly traffic_volume
and populated this column with mean values of traffic_volume
for each respective month.
plt.plot(by_month['traffic_volume'])
[<matplotlib.lines.Line2D at 0x7f898d8e8d00>]
A graph of daytime
traffic_volume
for each month shows as expected vastly reduced data for beginning January (post-Xmas), December (Xmas) and over the summer holiday season (part of each of July and August).
daytime['dayofweek'] = daytime['date_time'].dt.dayofweek
by_dayofweek = daytime.groupby('dayofweek').mean()
by_dayofweek['traffic_volume'] # 0 is Monday, 6 is Sunday
dayofweek 0 4893.551286 1 5189.004782 2 5284.454282 3 5311.303730 4 5291.600829 5 3927.249558 6 3436.541789 Name: traffic_volume, dtype: float64
plt.plot(by_dayofweek['traffic_volume'] )
[<matplotlib.lines.Line2D at 0x7f898eaa31c0>]
A graph of traffic_volume
for each day of a calendar week shows that from Friday through Sunday is a significant reduction on otherwise relatively stable levels from Monday through Thursday.
daytime['hour'] = daytime['date_time'].dt.hour
bussiness_days = daytime.copy()[daytime['dayofweek'] <= 4] # 4 == Friday
weekend = daytime.copy()[daytime['dayofweek'] >= 5] # 5 == Saturday
by_hour_business = bussiness_days.groupby('hour').mean()
by_hour_weekend = weekend.groupby('hour').mean()
plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
plt.plot(by_hour_business['traffic_volume'])
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Day')
plt.ylim(0, 6250)
plt.xlim(6, 19)
plt.title('Business')
plt.subplot(1, 2, 2)
plt.plot(by_hour_weekend['traffic_volume'])
plt.title('Weekend')
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Day')
plt.ylim(0, 6250)
plt.xlim(6, 19)
plt.show()
A graph of hours of the day between 7 through 18 during the business week and also on weekends shows that rush hours during the week are at 7 and 16. traffic_volume
starts to increase slowly from 9 onwards and accelerates from 13 through 16.
On weekends traffic_volume
increases from 7 through 11 and then plateaus through 16.
i94.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48204 entries, 0 to 48203 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 holiday 48204 non-null object 1 temp 48204 non-null float64 2 rain_1h 48204 non-null float64 3 snow_1h 48204 non-null float64 4 clouds_all 48204 non-null int64 5 weather_main 48204 non-null object 6 weather_description 48204 non-null object 7 date_time 48204 non-null datetime64[ns] 8 traffic_volume 48204 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(2), object(3) memory usage: 3.3+ MB
daytime.copy().info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 23877 entries, 0 to 48198 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 holiday 23877 non-null object 1 temp 23877 non-null float64 2 rain_1h 23877 non-null float64 3 snow_1h 23877 non-null float64 4 clouds_all 23877 non-null int64 5 weather_main 23877 non-null object 6 weather_description 23877 non-null object 7 date_time 23877 non-null datetime64[ns] 8 traffic_volume 23877 non-null int64 9 month 23877 non-null int64 10 dayofweek 23877 non-null int64 11 hour 23877 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(5), object(3) memory usage: 2.4+ MB
We have concluded so far:
Another possible indicator of heavy traffic is weather. The dataset has a number of weather columns: temp
, rain_1h
, snow_1h
, clouds_all
, weather_main
, weather_description
. All excspt the last 2 are numerical.
We will check the correlation of traffic_volume
with the various weather columns.
daytime.copy().corr()[['traffic_volume', 'temp', 'rain_1h', 'snow_1h', 'clouds_all']]
traffic_volume | temp | rain_1h | snow_1h | clouds_all | |
---|---|---|---|---|---|
temp | 0.128317 | 1.000000 | 0.010815 | -0.019286 | -0.135519 |
rain_1h | 0.003697 | 0.010815 | 1.000000 | -0.000091 | 0.004993 |
snow_1h | 0.001265 | -0.019286 | -0.000091 | 1.000000 | 0.027721 |
clouds_all | -0.032932 | -0.135519 | 0.004993 | 0.027721 | 1.000000 |
traffic_volume | 1.000000 | 0.128317 | 0.003697 | 0.001265 | -0.032932 |
month | -0.022337 | 0.222072 | 0.001176 | 0.026768 | 0.000595 |
dayofweek | -0.416453 | 0.000188 | -0.009683 | -0.008814 | -0.041980 |
hour | 0.172704 | 0.162691 | 0.008279 | 0.003923 | 0.023685 |
There is not a great deal of correlation between traffic_volume
and numerical weather column data. We will take a closer look at the correlation between the column with the greatest correlation that is temp
. We will produce a scatter graph for this purpose.
plt.scatter(daytime['traffic_volume'], daytime['temp'])
plt.xlabel('Temperature')
plt.ylabel('Traffic Volume')
plt.ylim(220, 350)
(220.0, 350.0)
The scatter graph between traffic_volume
and temp
does not show any particular correlation outcome and is not reliable as an indicator of what affects heavy traffic. We have also excluded from the graph some outlier data by limiting the data points on the y-axis.
by_weather_main = daytime.groupby('weather_main').mean()
by_weather_description = daytime.groupby('weather_description').mean()
by_weather_main
temp | rain_1h | snow_1h | clouds_all | traffic_volume | month | dayofweek | hour | |
---|---|---|---|---|---|---|---|---|
weather_main | ||||||||
Clear | 283.812078 | 0.000000 | 0.000000 | 1.670265 | 4778.416260 | 6.490599 | 3.138928 | 12.404248 |
Clouds | 282.929274 | 0.000000 | 0.000000 | 62.667548 | 4865.415996 | 6.393243 | 3.005631 | 12.911974 |
Drizzle | 284.456433 | 0.170804 | 0.000000 | 84.704417 | 4837.212911 | 7.105323 | 2.934315 | 12.308041 |
Fog | 277.579641 | 0.163840 | 0.001409 | 65.477901 | 4372.491713 | 6.646409 | 2.798343 | 10.325967 |
Haze | 275.319353 | 0.040036 | 0.000000 | 64.000000 | 4609.893285 | 5.832134 | 2.754197 | 12.467626 |
Mist | 279.420825 | 0.249992 | 0.000825 | 74.961435 | 4623.976475 | 6.734285 | 2.895102 | 11.078288 |
Rain | 287.089601 | 3.972943 | 0.000292 | 75.870116 | 4815.568462 | 6.774023 | 2.914467 | 12.642379 |
Smoke | 292.405833 | 0.878333 | 0.000000 | 53.333333 | 4564.583333 | 6.833333 | 2.416667 | 13.166667 |
Snow | 267.984505 | 0.014017 | 0.001768 | 80.501376 | 4396.321183 | 6.374828 | 2.750344 | 12.153370 |
Squall | 296.730000 | 1.020000 | 0.000000 | 75.000000 | 4211.000000 | 7.000000 | 2.000000 | 14.000000 |
Thunderstorm | 293.364678 | 1.146475 | 0.000000 | 75.184035 | 4648.212860 | 7.108647 | 2.955654 | 12.694013 |
by_weather_main['traffic_volume'].plot.barh()
<matplotlib.axes._subplots.AxesSubplot at 0x7f898eb3af70>
by_weather_description['traffic_volume'].plot.barh(figsize=(5, 10))
<matplotlib.axes._subplots.AxesSubplot at 0x7f898dc562b0>
There are some weather patterns that result in highest traffic_volume
including shower snow, light rain and snow, proximity thunderstorm with drizzle and thunderstorm with light drizzle. These may be a result of more people responding to inclement weather and using their cars.
nighttime['dayofweek'] = nighttime['date_time'].dt.dayofweek
by_dayofweek_night = nighttime.groupby('dayofweek').mean()
by_dayofweek_night['traffic_volume'] # 0 is Monday, 6 is Sunday
nighttime['hour'] = nighttime['date_time'].dt.hour
business_days_night = nighttime.copy()[nighttime['dayofweek'] <= 4] # 4 == Friday
weekend_night = nighttime.copy()[nighttime['dayofweek'] >= 5] # 5 == Saturday
by_hour_business_night = business_days_night.groupby('hour').mean()
by_hour_weekend_night = weekend_night.groupby('hour').mean()
by_hour_business_night
temp | rain_1h | snow_1h | clouds_all | traffic_volume | dayofweek | |
---|---|---|---|---|---|---|
hour | ||||||
0 | 279.983283 | 0.161295 | 0.000341 | 44.580777 | 651.528971 | 1.980913 |
1 | 279.626900 | 0.106480 | 0.000421 | 44.255349 | 396.913043 | 1.987578 |
2 | 279.453874 | 0.123588 | 0.000000 | 44.566323 | 301.982818 | 1.997251 |
3 | 279.046059 | 0.182486 | 0.000110 | 43.596841 | 362.289835 | 2.000687 |
4 | 278.580683 | 0.142667 | 0.000067 | 46.018717 | 832.661096 | 2.006016 |
5 | 278.603900 | 0.124663 | 0.000069 | 44.959478 | 2701.296703 | 1.982143 |
6 | 278.432872 | 0.164433 | 0.000067 | 45.685695 | 5365.983210 | 1.997314 |
19 | 283.439235 | 0.156652 | 0.000000 | 53.014184 | 3298.340426 | 1.989362 |
20 | 282.617152 | 0.119718 | 0.000000 | 51.443583 | 2842.433004 | 1.988717 |
21 | 281.830703 | 0.133628 | 0.000702 | 49.395088 | 2673.042807 | 1.972632 |
22 | 280.982875 | 0.119544 | 0.000617 | 46.618080 | 2125.913104 | 1.973371 |
23 | 280.385178 | 0.105409 | 0.000831 | 44.722752 | 1379.549728 | 1.978883 |
plt.figure(figsize=(12, 12))
plt.subplot(2, 2, 1)
plt.plot(by_hour_business_night['traffic_volume'])
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Night - predawn')
plt.ylim(0, 6250)
plt.xlim(-1, 7)
plt.title('Business')
plt.subplot(2, 2, 2)
plt.plot(by_hour_weekend_night['traffic_volume'])
plt.title('Weekend')
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Night - predawn')
plt.ylim(0, 6250)
plt.xlim(-1, 7)
plt.subplot(2, 2, 3)
plt.plot(by_hour_business_night['traffic_volume'])
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Night - to midnight')
plt.ylim(0, 6250)
plt.xlim(19, 24)
plt.title('Business')
plt.subplot(2, 2, 4)
plt.plot(by_hour_weekend_night['traffic_volume'])
plt.ylabel('Traffic Volume')
plt.xlabel('Hours of Night - to midnight')
plt.ylim(0, 6250)
plt.xlim(19, 24)
plt.title('Weekend')
plt.show()
We have plotted the nighttime
dataset by subsets business days and weekend days. The graph includes a grouping of each hour of each subset and a mean value of traffic_volume
. We have set out the graphs in a grid of two rows and two columns to take into account the two separate sets of hours that make up nighttime
, i.e., predawn from 24 through 6 and at night up to midnight from 19 through 24 (these are also two different days if they are run consecutively).
The business days' predawn period shows a significant increase in traffic_volume
between hours of 4-6 a.m. that reaches a peak around 6 a.m. The weekend predawn period is significantly different traffic_volume
, i.e. overall much less traffic with also two peaks (both much lower than business days) just after midnight and around 6 a.m.
The traffic_volume
at night up to midnight is mostly similar for both weekend and business days.
Conclusion:
In this project, we tried to find a few indicators of heavy traffic on the I-94 Interstate highway. We managed to find two types of indicators:
Time indicators
daytime
and end of predawn nighttime
on business days compared to the weekends.Weather indicators