The goal of this project is to find indicators of heavy traffic on I-94, such as weather and time of day. The data are for westbound traffic on I-94, at a point approximately halfway between Saint Paul and Minneapolis, MN. Westbound is traveling away from Saint Paul, towards Minneapolis. The dataset was created by John Hogue. It is available here.
#read csv file
import pandas as pd
traffic = pd.read_csv(
"Metro_Interstate_Traffic_Volume.csv")
#examine dataset
traffic.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
| 1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
| 2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
| 3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
| 4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
traffic.tail()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
| 48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
| 48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
| 48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
| 48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
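All nine columns have 48,204 non-null values, so the data looks complete and ready to use. The `date_time` column is stored as text (object) and will need converting before any time-based analysis.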
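A quick way to back up the completeness claim is to count nulls and duplicated rows explicitly. The sketch below is only illustrative: `completeness_report` is a hypothetical helper, and the small `sample` frame is built from a few of the rows printed above so that the block runs without the CSV file.

```python
import pandas as pd

# A few rows copied from the head() output above; in the notebook
# the full `traffic` DataFrame would be passed instead.
sample = pd.DataFrame({
    "holiday": ["None", "None", "None"],
    "temp": [288.28, 289.36, 289.58],
    "date_time": [
        "2012-10-02 09:00:00",
        "2012-10-02 10:00:00",
        "2012-10-02 11:00:00",
    ],
    "traffic_volume": [5545, 4516, 4767],
})

def completeness_report(df):
    """Hypothetical helper: null counts per column and number of duplicated rows."""
    return {
        "nulls_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

report = completeness_report(sample)
print(report)
```

Calling `completeness_report(traffic)` on the full dataset would show whether any hourly rows are repeated, which `info()` alone does not reveal.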
import matplotlib.pyplot as plt
%matplotlib inline
#plot frequency histogram
traffic["traffic_volume"].plot.hist()
plt.xlabel("Traffic Volume")
plt.title("Frequency of Traffic Volume")
plt.show()
#get more info about traffic volume
traffic["traffic_volume"].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
The most common traffic volumes fall in the 0-500 and 4500-5000 ranges, each occurring about 8,000 times. Volumes near the maximum, around 7,000, are the least frequent. The mean (3259.82) and median (3380.00) are similar, suggesting that the distribution is roughly symmetrical overall. The median is slightly larger than the mean, which indicates a mild left skew. In other words, more than half of the hourly observations are above the mean volume.
I hypothesize that the frequent low-volume observations occur at night and the high-volume observations occur during the daytime.
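The skew can be quantified from the summary statistics alone. One option, chosen here for illustration and not used in the original analysis, is Pearson's second skewness coefficient, 3 × (mean − median) / std. A negative value indicates a left skew.

```python
# Summary statistics copied from the describe() output above.
mean = 3259.818355
median = 3380.000000
std = 1986.860670

def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std

skew = pearson_median_skewness(mean, median, std)
print(round(skew, 2))  # -0.18
```

A value this close to zero suggests the left skew is mild.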
#change `date_time` to datetime type
traffic["date_time"]=pd.to_datetime(
traffic["date_time"])
#break into daytime and nighttime groups
daytime = traffic[(traffic["date_time"].dt.hour>=7) &
(traffic["date_time"].dt.hour<19)].copy()
print(daytime.shape)
nighttime = traffic[(traffic["date_time"].dt.hour>=19)|
(traffic["date_time"].dt.hour<7)].copy()
print(nighttime.shape)
#check that totals from two groups equals 48204, the total num of entries
daytime.shape[0]+nighttime.shape[0]
(23877, 9)
(24327, 9)
48204
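The same split can also be written with a single boolean mask and its complement, which guarantees that every row lands in exactly one group. This is a sketch only: `split_day_night` and its `start_hour`/`end_hour` parameters are hypothetical names, and the sample rows reuse timestamps and volumes from the head and tail outputs above.

```python
import pandas as pd

def split_day_night(df, start_hour=7, end_hour=19):
    """Split rows into daytime [start_hour, end_hour) and everything else."""
    hours = df["date_time"].dt.hour
    # between() is inclusive on both ends, hence end_hour - 1
    is_day = hours.between(start_hour, end_hour - 1)
    return df[is_day].copy(), df[~is_day].copy()

# Rows taken from the head() and tail() outputs shown earlier.
sample = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2012-10-02 09:00:00",
        "2012-10-02 13:00:00",
        "2018-09-30 19:00:00",
        "2018-09-30 23:00:00",
    ]),
    "traffic_volume": [5545, 4918, 3543, 954],
})

day, night = split_day_night(sample)
print(len(day), len(night))  # 2 2
```

Because the night group is defined as the negation of the day mask, the two groups cannot overlap or miss rows, so the manual total check becomes a formality.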
#plot day and night histograms
plt.figure(figsize = (10,3))
plt.subplot(1,2,1)
daytime["traffic_volume"].plot.hist()
plt.xlabel("Traffic Volume")
plt.title("Daytime Traffic Volume")
plt.ylim(0, 8000)
plt.xlim(0, 7500)
plt.subplot(1,2,2)
nighttime["traffic_volume"].plot.hist()
plt.xlabel("Traffic Volume")
plt.title("Nighttime Traffic Volume")
plt.ylim(0,8000)
plt.xlim(0, 7500)
plt.show()
daytime.describe()
| | temp | rain_1h | snow_1h | clouds_all | traffic_volume |
|---|---|---|---|---|---|
| count | 23877.000000 | 23877.00000 | 23877.000000 | 23877.000000 | 23877.000000 |
| mean | 282.257596 | 0.53306 | 0.000253 | 53.122000 | 4762.047452 |
| std | 13.298885 | 63.62932 | 0.008853 | 37.564588 | 1174.546482 |
| min | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 272.680000 | 0.00000 | 0.000000 | 5.000000 | 4252.000000 |
| 50% | 283.780000 | 0.00000 | 0.000000 | 75.000000 | 4820.000000 |
| 75% | 293.440000 | 0.00000 | 0.000000 | 90.000000 | 5559.000000 |
| max | 310.070000 | 9831.30000 | 0.510000 | 100.000000 | 7280.000000 |
nighttime.describe()
| | temp | rain_1h | snow_1h | clouds_all | traffic_volume |
|---|---|---|---|---|---|
| count | 24327.000000 | 24327.000000 | 24327.000000 | 24327.000000 | 24327.000000 |
| mean | 280.173600 | 0.139145 | 0.000192 | 45.672011 | 1785.377441 |
| std | 13.296357 | 1.110872 | 0.007434 | 40.048382 | 1441.951197 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 271.700000 | 0.000000 | 0.000000 | 1.000000 | 530.000000 |
| 50% | 281.379000 | 0.000000 | 0.000000 | 40.000000 | 1287.000000 |
| 75% | 290.700000 | 0.000000 | 0.000000 | 90.000000 | 2819.000000 |
| max | 307.680000 | 55.630000 | 0.510000 | 100.000000 | 6386.000000 |
As hypothesized, heavy traffic is more likely to occur during the daytime than at night. On average, daytime traffic is about 2.7 times heavier (mean of 4762 versus 1785). The daytime histogram is left skewed and the nighttime histogram is right skewed, and the nighttime data is the more heavily skewed of the two. This is reinforced by the summary statistics: at night the median is well below the mean (mean=1785, median=1287), while in the daytime the two are much closer (mean=4762, median=4820). So while daytime traffic is more likely to be heavy, it is also more symmetrically distributed.
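The comparison can be checked with simple arithmetic on the numbers in the two `describe()` tables. The variable names below are illustrative.

```python
# Means and medians copied from the daytime and nighttime describe() tables.
day_mean, day_median = 4762.047452, 4820.0
night_mean, night_median = 1785.377441, 1287.0

ratio = day_mean / night_mean            # how much heavier daytime traffic is
day_gap = day_mean - day_median          # negative: mean below median (left skew)
night_gap = night_mean - night_median    # positive: mean above median (right skew)

print(round(ratio, 2))      # 2.67
print(round(day_gap, 1))    # -58.0
print(round(night_gap, 1))  # 498.4
```

The nighttime gap is roughly nine times the size of the daytime gap, which matches the stronger skew visible in the nighttime histogram.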
Since the goal is to find indicators of heavy traffic, I will focus on the daytime data going forward.
#Traffic volume by month
daytime["month"] = daytime["date_time"].dt.month
#numeric_only=True skips the text columns, which newer pandas versions refuse to average
by_month = daytime.groupby("month").mean(numeric_only=True)
by_month["traffic_volume"].plot()
plt.ylabel("Traffic Volume")
plt.xticks(range(1,13))
plt.title("Traffic Volume by Month")
plt.show()
Month of the year does appear to affect traffic volume. The volume is lower in January, July and December. This may be partly because people often take vacations during these months and are less likely to be commuting to work. The months with the highest volume are March through June and August through October. One possible explanation is that poor weather and short daylight hours deter travel in the winter; for example, trucking companies may prioritize runs from spring through fall.
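If only the traffic volume is of interest, the column can be selected before aggregating, which avoids averaging unrelated columns altogether. The sketch below uses a small made-up frame (the values are invented for illustration and are not taken from the dataset) so that it runs on its own.

```python
import pandas as pd

# Made-up readings for illustration only; not rows from the dataset.
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2016-01-04 08:00:00",
        "2016-01-05 08:00:00",
        "2016-03-07 08:00:00",
        "2016-03-08 08:00:00",
    ]),
    "weather_main": ["Snow", "Clouds", "Clear", "Rain"],
    "traffic_volume": [4000, 4400, 5000, 5200],
})

toy["month"] = toy["date_time"].dt.month
# Select the column first, then average within each month.
monthly_mean = toy.groupby("month")["traffic_volume"].mean()
print(monthly_mean)
```

The result is a Series indexed by month, so `monthly_mean.plot()` would draw the same kind of line chart as above.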
#Traffic volume by day of the week(0=Monday, 6=Sunday)
daytime["dayofweek"] = daytime["date_time"].dt.dayofweek
by_dayofweek = daytime.groupby("dayofweek").mean(numeric_only=True)
by_dayofweek["traffic_volume"].plot()
plt.xlabel("Day of the Week")
plt.ylabel("Traffic Volume")
plt.title("Traffic Volume by Day of Week")
plt.show()
Traffic volume is highest during the week and lowest on Saturday and Sunday. The lowest volume day is Sunday. I will now break the data into weekday vs weekend to see how the traffic volume changes by hour through the day.
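Numeric day codes are easy to misread, so it can help to attach day names and a weekend flag. This is a sketch; the `labels` frame and the `is_weekend` column are illustrative names, and the two timestamps are the first and last dates shown in the dataset preview.

```python
import pandas as pd

# First and last dates from the head() and tail() outputs above.
stamps = pd.Series(pd.to_datetime([
    "2012-10-02 09:00:00",
    "2018-09-30 19:00:00",
]))

labels = pd.DataFrame({
    "dayofweek": stamps.dt.dayofweek,   # 0=Monday, 6=Sunday
    "day_name": stamps.dt.day_name(),
})
# Saturday (5) and Sunday (6) count as the weekend.
labels["is_weekend"] = labels["dayofweek"] > 4
print(labels)
```

The `dayofweek > 4` rule is the same threshold used for the weekday and weekend split that follows.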
#Break daytime data into weekend and weekday groups
daytime["hour"] = daytime["date_time"].dt.hour
weekday = daytime[daytime["dayofweek"]<=4].copy()
weekend = daytime[daytime["dayofweek"]>4].copy()
by_hour_weekday = weekday.groupby("hour").mean(numeric_only=True)
by_hour_weekend = weekend.groupby("hour").mean(numeric_only=True)
#Graph weekday and weekend data
plt.figure(figsize=(10, 3))
plt.subplot(1,2,1)
plt.plot(by_hour_weekday["traffic_volume"])
plt.ylim(1500, 6250)
plt.xlabel("Hour of the Day")
plt.ylabel("Traffic Volume")
plt.title("Traffic Volume on Weekdays")
plt.subplot(1,2,2)
plt.plot(by_hour_weekend["traffic_volume"])
plt.ylim(1500, 6250)
plt.xlabel("Hour of the Day")
plt.ylabel("Traffic Volume")
plt.title("Traffic Volume on Weekend")
plt.show()
Traffic volume is heavier on weekdays than the weekend. The lowest points of traffic on a weekday are approximately the same as the highest points on the weekend.
Rush hour, the time when most workers are commuting to and from work, occurs on weekdays before 8am and around 4pm. The lowest traffic volumes on weekdays are at 10am and after 6pm.
Traffic volume rises steadily through the morning on weekends, peaking at noon. It remains steady until about 4pm, then declines into the evening.
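The peak hours can also be read off programmatically instead of from the plots. The sketch below uses made-up hourly values that only mimic the shape described above (they are not the real averages), and the `day_type` column is a hypothetical label.

```python
import pandas as pd

# Made-up values for illustration only; not the dataset's hourly averages.
toy = pd.DataFrame({
    "day_type": ["weekday"] * 3 + ["weekend"] * 3,
    "hour": [7, 10, 16, 7, 12, 16],
    "traffic_volume": [6000, 4400, 6200, 1600, 4400, 4300],
})

# One row per hour, one column per day type.
by_hour = (
    toy.groupby(["day_type", "hour"])["traffic_volume"]
    .mean()
    .unstack("day_type")
)
# idxmax returns, for each column, the hour with the highest mean volume.
peak_hours = by_hour.idxmax()
print(peak_hours)
```

With the real data, the same two steps applied to `daytime` would report the busiest hour for weekdays and for weekends.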
#check correlation value of numeric weather columns
daytime.select_dtypes("number").corr()[["traffic_volume"]]
| | traffic_volume |
|---|---|
| temp | 0.128317 |
| rain_1h | 0.003697 |
| snow_1h | 0.001265 |
| clouds_all | -0.032932 |
| traffic_volume | 1.000000 |
| month | -0.022337 |
| dayofweek | -0.416453 |
| hour | 0.172704 |
#among the weather columns, temp has the highest correlation value.
#plot scatter plot to see what this looks like.
plt.scatter(daytime["temp"], daytime["traffic_volume"])
plt.xlabel("Temperature")
plt.ylabel("Traffic Volume")
plt.title("Traffic Volume and Temperature")
plt.show()
None of the weather columns has a strong correlation with traffic volume. The highest is `temp`, with a positive correlation of 0.128, but the scatter plot above shows no clear relationship, so temperature is not a reliable indicator of heavy traffic.
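The correlation table also includes the derived time columns, and `dayofweek` (-0.416) is actually larger in magnitude than any weather column. To look at the weather columns on their own, the correlation can be restricted to them with `DataFrame.corrwith`. The toy values below are made up for illustration and chosen so the result is easy to verify.

```python
import pandas as pd

# Made-up values for illustration only; temp is perfectly linear in volume here.
toy = pd.DataFrame({
    "temp": [270.0, 280.0, 290.0, 300.0],
    "rain_1h": [0.0, 0.0, 0.25, 0.0],
    "traffic_volume": [4000, 4500, 5000, 5500],
})

weather_cols = ["temp", "rain_1h"]
# Pearson correlation of each weather column with traffic volume.
weather_corr = toy[weather_cols].corrwith(toy["traffic_volume"])
print(weather_corr)
```

On the real `daytime` frame the list would be `["temp", "rain_1h", "snow_1h", "clouds_all"]`.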
Before disregarding weather as an indicator altogether, let's take a deeper look at the two non-numeric columns: `weather_main` and `weather_description`.
#calculate average of each weather event
by_weather_main = daytime.groupby("weather_main").mean(numeric_only=True)
#graph by weather event
by_weather_main["traffic_volume"].plot.barh()
plt.title("Traffic Volume by Weather Type")
plt.xlabel("Traffic Volume")
plt.ylabel("Weather Type")
plt.show()
The weather types with the highest traffic volume are Clouds, Drizzle, and Rain. Interestingly, however, there is not much difference in traffic volume between the weather types. Volume remains high regardless of weather, suggesting that time is more important than what is happening outside.
#calculate average by weather description
by_weather_description = daytime.groupby("weather_description").mean(numeric_only=True)
#graph weather_description
by_weather_description["traffic_volume"].plot.barh(figsize = (10, 20))
plt.xlabel("Traffic Volume")
plt.ylabel("Weather Description")
plt.title("Traffic Volume by Weather Description")
plt.show()
Considering the more detailed weather descriptions here, we can see that the two highest-volume weather events are similar: `light rain and snow` and `shower snow`. To amend the previous observation, it does look like a combination of snow and rain may indicate higher traffic volume.
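The analysis above does not report how many observations each weather description has, so it is worth checking whether the top categories are well sampled before relying on them. The sketch below shows one way to do that; `mean_volume_with_counts` and `min_count` are hypothetical names, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "weather_description": [
        "shower snow",
        "light rain", "light rain", "light rain",
        "mist", "mist",
    ],
    "traffic_volume": [5600, 4800, 4900, 5000, 4500, 4700],
})

def mean_volume_with_counts(df, min_count=1):
    """Mean volume and row count per description, keeping groups with enough rows."""
    summary = df.groupby("weather_description")["traffic_volume"].agg(["mean", "count"])
    kept = summary[summary["count"] >= min_count]
    return kept.sort_values("mean", ascending=False)

all_types = mean_volume_with_counts(toy)
well_sampled = mean_volume_with_counts(toy, min_count=2)
print(all_types)
print(well_sampled)
```

In this toy example the top category rests on a single row and disappears once a minimum count is required. Whether that also happens with the real data would need to be checked.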
The strongest indicator of high traffic volume was time. Traffic volume is highest during the daytime, on weekdays, around the rush hours (before 8am and around 4pm), and in the months of March through June and August through October.
There is also some indication that weather events involving both rain and snow may increase traffic volume. Beyond that, weather does not seem to have as large an effect on traffic volume as time of day.
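As a rough illustration, the time-based findings can be turned into a simple rule of thumb. The function below is hypothetical: treating hours 7 and 16 as rush hours is an assumption based on the weekday plot, not a threshold derived in the analysis.

```python
from datetime import datetime

# Assumed rush hours, read off the weekday plot (before 8am, around 4pm).
RUSH_HOURS = {7, 16}

def likely_heavy_traffic(ts):
    """Hypothetical rule of thumb: a weekday during an assumed rush hour."""
    is_weekday = ts.weekday() <= 4   # Monday=0 ... Sunday=6
    return is_weekday and ts.hour in RUSH_HOURS

print(likely_heavy_traffic(datetime(2012, 10, 2, 7)))    # True  (Tuesday, 7am)
print(likely_heavy_traffic(datetime(2012, 10, 2, 10)))   # False (Tuesday, 10am)
print(likely_heavy_traffic(datetime(2018, 9, 30, 16)))   # False (Sunday, 4pm)
```

A rule like this ignores month and weather, which the analysis suggests matter less than the day of the week and the hour.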