The goal of this project is to analyze and determine indicators of heavy traffic on I-94. These indicators can be weather type, time of day, day of the week, etc.
The dataset used for the analysis was made by John Hogue and is stored here: https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume# It contains hourly Interstate 94 westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN. It also includes weather and holiday features from 2012-2018. Westbound traffic means that the dataset has information about cars moving from east to west only, so the results can't be generalized to the entire I-94 highway.
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
i_94_wb_traffic_volume = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
i_94_wb_traffic_volume.head(5)
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
i_94_wb_traffic_volume.tail(5)
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
i_94_wb_traffic_volume.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
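The results of observation are the following: the dataset has 48,204 rows and 9 columns, no column contains null values, and the records run from 2012-10-02 09:00:00 to 2018-09-30 23:00:00. The date_time column is stored as text (object dtype), not as a datetime type.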
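The first-look checks can also be run directly instead of being read off the `info()` output. The sketch below is only an illustration on a three-row stand-in for the dataset; the `sample` frame and the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row stand-in for the traffic dataset.
sample = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 10:00:00", "2012-10-02 11:00:00"],
    "temp": [288.28, 289.36, 289.58],
    "traffic_volume": [5545, 4516, 4767],
})

# Count missing values per column; all zeros means there are no nulls.
null_counts = sample.isnull().sum()

# Parse the timestamps so the covered period can be read off directly.
timestamps = pd.to_datetime(sample["date_time"])
first_record, last_record = timestamps.min(), timestamps.max()

print(null_counts)
print(first_record, "to", last_record)
```

On the full dataset the same calls would be applied to `i_94_wb_traffic_volume`.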
import matplotlib.pyplot as plt
%matplotlib inline
i_94_wb_traffic_volume['traffic_volume'].plot.hist()
plt.xlabel("I94 Westbound traffic volume. Vehicle per hour")
plt.title("Distribution of I94 Westbound traffic volume.")
plt.show()
i_94_wb_traffic_volume['traffic_volume'].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
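The histogram and the summary statistics show the following: hourly traffic volume ranged from 0 to 7,280 vehicles, with a mean of about 3,260. About 25% of the time there were 1,193 vehicles or fewer passing the station each hour, and about 25% of the time the volume was 4,933 vehicles or more.
Which specific factors influence the traffic volume? It can be the time of day, for example daytime versus nighttime, and it can also be the weather conditions, such as heavy rain or snow.
Let's check whether daytime and nighttime traffic volumes differ. The dataset will be divided into two parts: daytime data (hours 7 through 19) and nighttime data (hours 20 through 6).
But first, the values in the date_time column need to be converted to the datetime type.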
i_94_wb_traffic_volume['date_time'] = pd.to_datetime(i_94_wb_traffic_volume['date_time'])
# .copy() gives an independent DataFrame, so columns added later do not touch the original
daytime_traffic = i_94_wb_traffic_volume.loc[(i_94_wb_traffic_volume["date_time"].dt.hour>=7) & (i_94_wb_traffic_volume["date_time"].dt.hour<=19)].copy()
daytime_traffic
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.00 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.00 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.00 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.00 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.00 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48195 | None | 283.84 | 0.00 | 0.0 | 75 | Drizzle | light intensity drizzle | 2018-09-30 15:00:00 | 4302 |
48196 | None | 284.38 | 0.00 | 0.0 | 75 | Rain | light rain | 2018-09-30 16:00:00 | 4283 |
48197 | None | 284.79 | 0.00 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 17:00:00 | 4132 |
48198 | None | 284.20 | 0.25 | 0.0 | 75 | Rain | light rain | 2018-09-30 18:00:00 | 3947 |
48199 | None | 283.45 | 0.00 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
25838 rows × 9 columns
# .copy() gives an independent DataFrame, so columns added later do not touch the original
nighttime_traffic = i_94_wb_traffic_volume.loc[(i_94_wb_traffic_volume['date_time'].dt.hour > 19) | (i_94_wb_traffic_volume['date_time'].dt.hour < 7)].copy()
nighttime_traffic
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
11 | None | 289.38 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 20:00:00 | 2784 |
12 | None | 288.61 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 21:00:00 | 2361 |
13 | None | 287.16 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 22:00:00 | 1529 |
14 | None | 285.45 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 23:00:00 | 963 |
15 | None | 284.63 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 00:00:00 | 506 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48184 | None | 280.17 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 06:00:00 | 802 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
22366 rows × 9 columns
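The day/night split can also be wrapped in a small helper so that both parts always come from the same condition. This is a minimal sketch, assuming a hypothetical function name `split_day_night` and a made-up one-day sample; it is not part of the original notebook.

```python
import pandas as pd

def split_day_night(df, day_start=7, day_end=19):
    """Return (daytime, nighttime) copies of df, split by the hour of 'date_time'.

    Both bounds are inclusive, matching the 7 to 19 window used above.
    """
    is_day = df["date_time"].dt.hour.between(day_start, day_end)
    return df[is_day].copy(), df[~is_day].copy()

# Hypothetical sample: one full day of hourly records.
sample = pd.DataFrame({
    "date_time": pd.to_datetime([f"2012-10-02 {h:02d}:00:00" for h in range(24)]),
    "traffic_volume": list(range(24)),
})

day, night = split_day_night(sample)
print(len(day), len(night))
```

Because the nighttime part is defined as the negation of the daytime mask, no hour can fall into both parts or be left out.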
Let's compare the traffic volume at night and during the day via histogram.
plt.figure(figsize = (24,12))
plt.subplot(3,2,1)
daytime_traffic['traffic_volume'].plot.hist()
plt.title('I94 Westbound Daytime traffic')
plt.xlabel("Traffic volume. Vehicle per hour")
plt.subplot(3,2,2)
nighttime_traffic['traffic_volume'].plot.hist()
plt.title('I94 Westbound Nighttime traffic')
plt.xlabel("Traffic volume. Vehicle per hour")
plt.show()
daytime_traffic['traffic_volume'].describe()
count    25838.000000
mean      4649.292360
std       1202.321987
min          0.000000
25%       4021.000000
50%       4736.000000
75%       5458.000000
max       7280.000000
Name: traffic_volume, dtype: float64
nighttime_traffic['traffic_volume'].describe()
count    22366.000000
mean      1654.648484
std       1425.175292
min          0.000000
25%        486.000000
50%       1056.500000
75%       2630.750000
max       6386.000000
Name: traffic_volume, dtype: float64
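The histograms and the summary statistics show the following: daytime traffic is much heavier, with a mean of about 4,649 vehicles per hour and 75% of the hourly values at or above 4,021. Nighttime traffic is light, with a mean of about 1,655 vehicles per hour and 75% of the hourly values at or below 2,631.
The goal of this project is to analyze the factors of heavy traffic, so from now on we focus on daytime traffic first and analyze nighttime traffic volume after that.
Time is one of the possible indicators of heavy traffic. There might be more traffic in a certain month, on a certain day, or at a certain time of the day.
Let's check how the traffic volume changes by month, by day of the week, and by time of day.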
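# To get the traffic volume averages for each month
daytime_traffic['month'] = daytime_traffic["date_time"].dt.month
# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_month = daytime_traffic.groupby('month').mean(numeric_only=True)
by_month['traffic_volume']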
month
1     4385.217310
2     4593.187798
3     4761.529676
4     4771.232816
5     4788.966639
6     4791.087488
7     4502.628360
8     4818.434690
9     4755.709916
10    4809.481678
11    4588.910486
12    4276.567081
Name: traffic_volume, dtype: float64
by_month['traffic_volume'].plot.line()
plt.xticks(rotation=45, ticks=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], labels=["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
plt.title("Average day traffic volume by month")
plt.ylabel("Average day traffic volume")
plt.show()
The line plot above shows that average daytime traffic is lowest in December and January, dips in July, and stays between about 4,750 and 4,820 vehicles per hour from March through June and from August through October.
Holidays, such as Christmas in December, could explain the low traffic volume in winter. The dip in July might come from summer holidays or vacations, or from a one-sided road closure for major repairs.
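The monthly averages can also be computed by selecting the `traffic_volume` column before aggregating, which keeps text columns such as `weather_main` out of the mean entirely. The sketch below uses a made-up four-row sample; the values and the `monthly_mean` name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample with records from two months (values are invented).
sample = pd.DataFrame({
    "date_time": pd.to_datetime(["2013-01-05 09:00", "2013-01-06 10:00",
                                 "2013-07-05 09:00", "2013-07-06 10:00"]),
    "weather_main": ["Snow", "Clouds", "Clear", "Rain"],
    "traffic_volume": [4000, 4400, 4500, 4700],
})

# Group by the month extracted from the timestamp and average one column only.
monthly_mean = sample.groupby(sample["date_time"].dt.month)["traffic_volume"].mean()
monthly_mean.index.name = "month"

print(monthly_mean)
```

Grouping by the derived month Series also avoids adding a helper column to the DataFrame.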
# To get the traffic volume averages for each day of the week.
daytime_traffic['dayofweek'] = daytime_traffic['date_time'].dt.dayofweek
# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_dayofweek = daytime_traffic.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume'] # 0 is Monday, 6 is Sunday
dayofweek
0    4746.208029
1    5036.062431
2    5141.231163
3    5163.688063
4    5161.533588
5    3884.065668
6    3410.368091
Name: traffic_volume, dtype: float64
by_dayofweek['traffic_volume'].plot.line()
plt.xticks(rotation=45, ticks=[0, 1, 2, 3, 4, 5, 6], labels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
plt.title("Average traffic volume by day of the week")
plt.ylabel("Average day traffic volume")
plt.show()
As the plot shows, traffic volume is significantly heavier from Monday to Friday, which are business days, and goes down on weekends.
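The business-day versus weekend comparison can be made explicit by labelling each record before averaging. This is a sketch on three invented records; `day_kind` and the volumes are hypothetical, while `dt.day_name()` and `dt.dayofweek` are standard pandas accessors.

```python
import pandas as pd

# Hypothetical sample: one record per day (volumes are invented).
sample = pd.DataFrame({
    "date_time": pd.to_datetime(["2018-09-28 12:00",    # Friday
                                 "2018-09-29 12:00",    # Saturday
                                 "2018-09-30 12:00"]),  # Sunday
    "traffic_volume": [5000, 4000, 3000],
})

# Readable day names, useful as tick labels instead of the numbers 0 to 6.
day_names = sample["date_time"].dt.day_name()

# dayofweek counts from Monday = 0, so 5 and 6 are Saturday and Sunday.
day_kind = (sample["date_time"].dt.dayofweek >= 5).map(
    {False: "business day", True: "weekend"}
)
mean_by_kind = sample.groupby(day_kind)["traffic_volume"].mean()

print(list(day_names))
print(mean_by_kind)
```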
# To get the traffic volume averages for each hour of the day.
daytime_traffic['hour'] = daytime_traffic['date_time'].dt.hour
business_days = daytime_traffic[daytime_traffic['dayofweek'] <= 4].copy() # 4 == Friday
weekend = daytime_traffic[daytime_traffic['dayofweek'] >= 5].copy() # 5 == Saturday
# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)
print(by_hour_business['traffic_volume'])
print(by_hour_weekend['traffic_volume'])
hour
7     6030.413559
8     5503.497970
9     4895.269257
10    4378.419118
11    4633.419470
12    4855.382143
13    4859.180473
14    5152.995778
15    5592.897768
16    6189.473647
17    5784.827133
18    4434.209431
19    3298.340426
Name: traffic_volume, dtype: float64
hour
7     1589.365894
8     2338.578073
9     3111.623917
10    3686.632302
11    4044.154955
12    4372.482883
13    4362.296564
14    4358.543796
15    4342.456881
16    4339.693805
17    4151.919929
18    3811.792279
19    3220.234120
Name: traffic_volume, dtype: float64
plt.figure(figsize = (10,12))
plt.subplot(3,2,1)
by_hour_business['traffic_volume'].plot.line()
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.title('I94 Westbound Daytime traffic. Mon-Fri.')
plt.xlabel("Hour.")
plt.ylabel("Traffic volume. Vehicle per hour")
plt.subplot(3,2,2)
by_hour_weekend['traffic_volume'].plot.line()
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.title('I94 Westbound Daytime traffic. Weekend.')
plt.xlabel("Hour.")
plt.ylabel("Traffic volume. Vehicle per hour")
plt.show()
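The line plots above show the following: on business days, traffic peaks at 7:00 (about 6,030 vehicles per hour) and at 16:00 (about 6,190), probably the hours when most people travel to and from work. On weekends, traffic starts low (about 1,590 at 7:00), rises until around noon, and stays near 4,350 until 16:00, well below the business-day rush-hour level.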
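The hourly comparison can also be produced in one step by grouping on the hour and the day type together, which yields one table with a column per day type. The sketch below assumes a made-up four-row sample and hypothetical names (`day_type`, `by_hour`); it is an alternative arrangement, not the notebook's own code.

```python
import pandas as pd

# Hypothetical sample: two hours on a Friday and on a Saturday (values invented).
sample = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2018-09-28 07:00", "2018-09-28 16:00",   # Friday
        "2018-09-29 07:00", "2018-09-29 16:00",   # Saturday
    ]),
    "traffic_volume": [6000, 6200, 1600, 4300],
})

hour = sample["date_time"].dt.hour.rename("hour")
day_type = (sample["date_time"].dt.dayofweek >= 5).map(
    {False: "business", True: "weekend"}
).rename("day_type")

# Average per (hour, day_type) pair, then pivot day_type into columns.
by_hour = (
    sample.groupby([hour, day_type])["traffic_volume"].mean().unstack("day_type")
)

print(by_hour)
```

Calling `by_hour.plot.line()` on such a table would draw both curves on the same axes, which makes the rush-hour peaks easier to compare.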
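Summarizing all three plots, there are a few time-related indicators of heavy traffic: traffic is usually heavier during the warm months (March through October, with a dip in July) than during the cold months (November through February); it is usually heavier on business days than on weekends; and on business days the rush hours are around 7:00 and 16:00.
Another possible indicator of heavy traffic is weather. The dataset has a few useful columns about weather: temp, rain_1h, snow_1h, clouds_all, weather_main, and weather_description.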
A few of these columns are numerical so let's start by looking up their correlation values with traffic_volume.
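# numeric_only=True (pandas 1.5 or newer) restricts the correlation to numeric columns
i_94_wb_traffic_volume.corr(numeric_only=True)['traffic_volume'][['temp','rain_1h','snow_1h','clouds_all']]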
temp          0.130299
rain_1h       0.004714
snow_1h       0.000733
clouds_all    0.067054
Name: traffic_volume, dtype: float64
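The same Pearson correlations can be obtained by selecting the numeric weather columns first and correlating them with the target column through `DataFrame.corrwith`. The sample below is invented purely to make the sketch self-contained; `weather_columns` is a hypothetical name.

```python
import pandas as pd

# Hypothetical sample (values are invented and chosen to be easy to follow).
sample = pd.DataFrame({
    "temp": [280.0, 285.0, 290.0, 295.0],
    "rain_1h": [0.0, 0.25, 0.0, 0.0],
    "weather_main": ["Clouds", "Rain", "Clear", "Clear"],
    "traffic_volume": [3000, 3500, 4000, 4500],
})

# Only numeric weather columns are passed on, so text columns never reach corr.
weather_columns = ["temp", "rain_1h"]
correlations = sample[weather_columns].corrwith(sample["traffic_volume"])

print(correlations)
```

In the real dataset all four coefficients are close to zero, with temp the largest at about 0.13, which is still a weak linear relationship.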
plt.scatter(daytime_traffic['traffic_volume'],daytime_traffic['temp'])
plt.xlabel('Traffic volume, car per hour')
plt.ylabel('Temp,K')
plt.show()
Looking at the plot, we can conclude that temperature doesn't look like a solid indicator of heavy traffic.
# Calculate the average traffic volume associated
# with each unique value in the 'weather_main' and 'weather_description' columns.
# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_weather_main = daytime_traffic.groupby('weather_main').mean(numeric_only=True)
by_weather_description = daytime_traffic.groupby('weather_description').mean(numeric_only=True)
by_weather_main['traffic_volume'].plot.barh()
plt.xlabel('Traffic volume,car per hour')
plt.title(label = "Main Weather Conditions")
plt.show()
It looks like there's no weather type where traffic volume exceeds 5,000 cars. This makes finding a heavy traffic indicator more difficult. Let's also group by weather_description, which has a more granular weather classification.
plt.figure(figsize = (10,12))
by_weather_description['traffic_volume'].plot.barh()
plt.xlabel('Traffic volume,car per hour')
plt.title(label = "Detailed Weather Conditions")
plt.show()
Three weather categories are potential indicators of heavy traffic in the daytime period, with average traffic volume reaching or exceeding 5,000:
It's not clear why these weather types have the highest average traffic values, because there are other, more extreme weather types where people didn't seem to behave the same way. One suggestion is that in bad weather with rain and snow more people need to use a car instead of biking or walking.
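Instead of reading the categories off the bar plot, the averages can be filtered against the 5,000 threshold, and the number of records per category can be checked at the same time, since an average built from only a few hours is less reliable. This is a sketch on invented numbers; the labels are taken from the dataset, but the volumes and the names `mean_by_description`, `heavy`, and `counts` are assumptions.

```python
import pandas as pd

# Hypothetical sample: the traffic volumes are invented for illustration.
sample = pd.DataFrame({
    "weather_description": ["light rain", "light rain",
                            "sky is clear", "overcast clouds"],
    "traffic_volume": [4500, 4700, 4800, 5200],
})

mean_by_description = sample.groupby("weather_description")["traffic_volume"].mean()

# Keep only the categories whose average reaches the threshold, largest first.
heavy = mean_by_description[mean_by_description >= 5000].sort_values(ascending=False)

# Number of hourly records behind each average.
counts = sample["weather_description"].value_counts()

print(heavy)
print(counts)
```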
Let's check the nighttime and weather data to look for heavy traffic indicators.
# To get the night traffic volume averages for each month
nighttime_traffic['month'] = nighttime_traffic["date_time"].dt.month
# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_month_night = nighttime_traffic.groupby('month').mean(numeric_only=True)
by_month_night['traffic_volume']
month
1     1507.162089
2     1595.244973
3     1678.887584
4     1653.788423
5     1692.370461
6     1792.696987
7     1710.968582
8     1766.493392
9     1670.819290
10    1710.422642
11    1562.349509
12    1499.395094
Name: traffic_volume, dtype: float64
by_month_night['traffic_volume'].plot.line()
plt.xticks(rotation=45, ticks=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], labels=["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
plt.title("Average night traffic volume by month")
plt.ylabel("Average night traffic volume")
plt.show()
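The line plot above shows that average nighttime traffic is lowest in December and January (about 1,500 vehicles per hour) and highest in June (about 1,790) and August (about 1,770).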
# To get the nighttime traffic volume averages for each day of the week.
nighttime_traffic['dayofweek'] = nighttime_traffic['date_time'].dt.dayofweek
# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_dayofweek_night = nighttime_traffic.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek_night['traffic_volume'] # 0 is Monday, 6 is Sunday
dayofweek
0    1606.159456
1    1752.048047
2    1781.810205
3    1867.965779
4    1919.764650
5    1484.403986
6    1172.446702
Name: traffic_volume, dtype: float64
by_dayofweek_night['traffic_volume'].plot.line()
plt.xticks(rotation=45, ticks=[0, 1, 2, 3, 4, 5, 6], labels=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
plt.title("Average nighttime traffic volume by day of the week")
plt.ylabel("Average night traffic volume")
plt.show()
As the plot shows, nighttime traffic volume is heaviest on Friday and goes down on weekends. A possible explanation is that people prefer leisure activities at the end of the working week and choose this night to go to a restaurant, bar, or disco after work.
# To get the nighttime traffic volume averages for each hour.
nighttime_traffic['hour'] = nighttime_traffic['date_time'].dt.hour
business_days_night = nighttime_traffic[nighttime_traffic['dayofweek'] <= 4].copy() # 4 == Friday
weekend_night = nighttime_traffic[nighttime_traffic['dayofweek'] >= 5].copy() # 5 == Saturday
# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_hour_business_night = business_days_night.groupby('hour').mean(numeric_only=True)
by_hour_weekend_night = weekend_night.groupby('hour').mean(numeric_only=True)
print(by_hour_business_night['traffic_volume'])
print(by_hour_weekend_night['traffic_volume'])
hour
0      651.528971
1      396.913043
2      301.982818
3      362.289835
4      832.661096
5     2701.296703
6     5365.983210
20    2842.433004
21    2673.042807
22    2125.913104
23    1379.549728
Name: traffic_volume, dtype: float64
hour
0     1306.414035
1      805.128333
2      611.171986
3      393.611599
4      375.420168
5      639.237232
6     1089.100334
20    2815.039216
21    2658.445242
22    2384.368607
23    1699.050699
Name: traffic_volume, dtype: float64
# Helper function for the plots below: it shows what happens
# from 19h to 23h and then from 0h to 7h,
# on business days vs weekend days.
# "plot_traffic_volume" function details:
# line_data = series that we want to plot
# start_hour = beginning hour of the plot **in 24hr format**
# end_hour = end hour of the plot **in 24hr format**
# day_of_the_week = string to define the kind of day in week: business or weekend
def plot_traffic_volume(line_data, start_hour, end_hour, day_of_the_week):
    line_data.plot.line()
    plt.title(f'Average traffic volume by hour of {day_of_the_week} {start_hour} to {end_hour}')
    plt.ylabel('Average traffic volume')
    plt.ylim(0, 6300)
    plt.xlim(start_hour, end_hour)
# Create the four line plots in one grid
# to visualize the average traffic volume by hour of nighttime
plt.figure(figsize=(12, 10))
plt.subplot(2,2,1)
plot_traffic_volume(by_hour_business_night["traffic_volume"],19,23,"business day")
plt.subplot(2,2,2)
plot_traffic_volume(by_hour_business_night["traffic_volume"],0,7,"business day")
plt.subplot(2,2,3)
plot_traffic_volume(by_hour_weekend_night["traffic_volume"],19,23,"weekend day")
plt.subplot(2,2,4)
plot_traffic_volume(by_hour_weekend_night["traffic_volume"],0,7,"weekend day")
plt.show()
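The line plots above show the following: on business days, average traffic falls from about 2,840 vehicles per hour at 20:00 to about 300 at 02:00, then climbs steeply to about 5,370 at 06:00 as the morning commute begins. On weekends, it falls from about 2,820 at 20:00 to a minimum of about 375 at 04:00 and reaches only about 1,090 at 06:00.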
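Because the night spans midnight, the hours 20 to 23 come after the hours 0 to 6 when sorted numerically, which is why the plots are split into two windows. One possible alternative, sketched here with the business-day averages rounded from the printed output above, is to reorder the series so the night reads as one continuous period. The names `night_order` and `ordered` are hypothetical.

```python
import pandas as pd

# Business-day nighttime averages, rounded from the printed output above.
by_hour = pd.Series(
    [652, 397, 302, 362, 833, 2701, 5366, 2842, 2673, 2126, 1380],
    index=[0, 1, 2, 3, 4, 5, 6, 20, 21, 22, 23],
)

# Evening hours first, then the early-morning hours.
night_order = [20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6]
ordered = by_hour.reindex(night_order)

# Text labels stop a line plot from sorting the hours numerically again.
ordered.index = [f"{h:02d}:00" for h in night_order]

print(ordered)
```

Plotting `ordered` with `plot.line()` would then show a single curve from 20:00 through 06:00.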
plt.scatter(nighttime_traffic['traffic_volume'],nighttime_traffic['temp'])
plt.xlabel('Traffic volume, car per hour')
plt.ylabel('Temp,K')
plt.show()
Similar to what we saw with the daytime dataset, temperature doesn't look like a solid indicator of heavy traffic at night either.
# Calculate the average traffic volume associated
# with each unique value in the 'weather_main' and 'weather_description' columns.
# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_weather_main_night = nighttime_traffic.groupby('weather_main').mean(numeric_only=True)
by_weather_description_night = nighttime_traffic.groupby('weather_description').mean(numeric_only=True)
by_weather_main_night['traffic_volume'].plot.barh()
plt.xlabel('Traffic volume,car per hour')
plt.title(label = "Nighttime Main Weather Conditions.")
plt.show()
plt.figure(figsize = (10,12))
by_weather_description_night['traffic_volume'].plot.barh()
plt.xlabel('Traffic volume,car per hour')
plt.title(label = "Nighttime Detailed Weather Conditions")
plt.show()
Two categories can be identified as potential indicators of heavy traffic in the night period, with average traffic volume reaching or exceeding 2,500:
Both categories refer to bad weather with rain, and in this weather more people probably need to use a car.
To summarize, the analysis leads to the following conclusions about the indicators of heavy traffic on the westbound I-94 interstate highway: