This project analyses traffic on the I-94 interstate highway. The data was collected by a station located approximately midway between Minneapolis and Saint Paul. It records hourly westbound traffic volume, together with weather and holiday features, from 2012 to 2018. The data was made available by John Hogue and can be downloaded from the UCI Machine Learning Repository.
import pandas as pd
import numpy as np
traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic.head(5)
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
traffic.tail()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
import matplotlib.pyplot as plt
%matplotlib inline
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
plt.hist(traffic['traffic_volume'])
plt.xlabel('Traffic Volume')
plt.show()
traffic['traffic_volume'].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
The histogram has two main peaks, one near 0 and one near 5,000. The peak near 0 likely corresponds to very late at night, when there is little or no traffic. The peak near 5,000 likely corresponds to the morning and evening rush hours. About 25% of the time there are 1,193 cars or fewer passing the station each hour, which probably corresponds to night time.
Since the data above suggests that traffic volume varies significantly, we will investigate whether this can be attributed to the time of day.
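The link between the quartiles reported by `describe()` and the share of low-traffic hours can be checked directly. The following is a minimal sketch on a made-up sample: the `volumes` values and the `threshold` are hypothetical and are not taken from the dataset. It computes the fraction of readings at or below a threshold and the 25th percentile.

```python
import pandas as pd

# Hypothetical toy sample of hourly counts (not the real dataset).
volumes = pd.Series([0, 300, 900, 1200, 3400, 4800, 5000, 5100])

# Share of hours at or below a chosen threshold (also hypothetical).
threshold = 1200
share_low = (volumes <= threshold).mean()

# The first quartile: roughly 25% of readings fall at or below this value.
q1 = volumes.quantile(0.25)

print(f"share of hours <= {threshold}: {share_low:.2f}")
print(f"25th percentile: {q1}")
```

On the real data, the same two lines applied to `traffic['traffic_volume']` should reproduce the 25% figure shown by `describe()`.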
traffic['date_time'].dt.hour
0         9
1        10
2        11
3        12
4        13
         ..
48199    19
48200    20
48201    21
48202    22
48203    23
Name: date_time, Length: 48204, dtype: int64
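# Daytime: 07:00 up to (but not including) 19:00
day = traffic[(traffic['date_time'].dt.hour >= 7) & (traffic['date_time'].dt.hour < 19)]
# Nighttime: 19:00 up to (but not including) 07:00
night = traffic[(traffic['date_time'].dt.hour < 7) | (traffic['date_time'].dt.hour >= 19)]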
day.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
day.tail()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
48194 | None | 283.84 | 0.00 | 0.0 | 75 | Rain | proximity shower rain | 2018-09-30 15:00:00 | 4302 |
48195 | None | 283.84 | 0.00 | 0.0 | 75 | Drizzle | light intensity drizzle | 2018-09-30 15:00:00 | 4302 |
48196 | None | 284.38 | 0.00 | 0.0 | 75 | Rain | light rain | 2018-09-30 16:00:00 | 4283 |
48197 | None | 284.79 | 0.00 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 17:00:00 | 4132 |
48198 | None | 284.20 | 0.25 | 0.0 | 75 | Rain | light rain | 2018-09-30 18:00:00 | 3947 |
night.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
10 | None | 290.97 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-02 19:00:00 | 3539 |
11 | None | 289.38 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 20:00:00 | 2784 |
12 | None | 288.61 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 21:00:00 | 2361 |
13 | None | 287.16 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 22:00:00 | 1529 |
14 | None | 285.45 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 23:00:00 | 963 |
night.tail()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
plt.figure(figsize=(10,12))
plt.subplot(3, 2, 1)
plt.hist(day['traffic_volume'])
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.title('Daytime')
plt.ylim([0, 8000])
plt.xlim([0, 8000])
plt.subplot(3, 2, 2)
plt.hist(night['traffic_volume'])
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.title('Nighttime')
plt.ylim([0, 8000])
plt.xlim([0, 8000])
plt.show()
day['traffic_volume'].describe()
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64
night['traffic_volume'].describe()
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64
The daytime histogram is skewed to the left, which suggests that traffic volumes are generally higher during this period: the median is 4,820 cars per hour. The nighttime histogram is skewed to the right, which suggests much lower traffic volumes: the median is 1,287 cars per hour. Since our goal is to determine indicators of heavy traffic, the nighttime data can be ignored.
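The day/night split can also be wrapped in a small helper. This is only a sketch: the function name `split_day_night`, its parameters and the four-row `sample` frame are hypothetical and not part of the original notebook. It returns copies rather than slices, which is one way to avoid pandas' `SettingWithCopyWarning` when columns are added later.

```python
import pandas as pd

def split_day_night(df, start_hour=7, end_hour=19, column="date_time"):
    """Return (day, night) copies of df; day covers start_hour <= hour < end_hour."""
    hours = df[column].dt.hour
    is_day = (hours >= start_hour) & (hours < end_hour)
    # .copy() gives independent DataFrames instead of slices of df.
    return df[is_day].copy(), df[~is_day].copy()

# Hypothetical sample covering both boundaries of the daytime window.
sample = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2012-10-02 06:00:00", "2012-10-02 07:00:00",
        "2012-10-02 18:00:00", "2012-10-02 19:00:00",
    ]),
    "traffic_volume": [800, 6000, 4400, 3500],
})

day_part, night_part = split_day_night(sample)
print(day_part["traffic_volume"].tolist())
print(night_part["traffic_volume"].tolist())
```

The boundary rows show the convention used here: 07:00 counts as day and 19:00 counts as night.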
# Monthly traffic volume averages (daytime data only)
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
day = day.assign(month=day['date_time'].dt.month)
# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_month = day.groupby('month').mean(numeric_only=True)
by_month['traffic_volume']
month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64
plt.plot(by_month['traffic_volume'])
plt.title('Traffic per month')
plt.xlabel('Month')
plt.xticks(ticks=[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
labels=['Jan', 'Feb', 'Mar', 'Apr',
'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
rotation=30)
plt.show()
Traffic volume is at its lowest in December and January. It declines gradually from October to December and increases again from January to March. This is likely due to the Christmas holidays from both work and school reducing the need for travel. Traffic is almost constantly high from March to October, but there is an unusual drop in traffic volume in July.
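One way to look into a dip in a single month is to break that month down by year, which shows whether the drop is recurring or driven by one unusual year. The sketch below is an assumption about how this could be done, not part of the original analysis: the helper `yearly_mean_for_month` and the `sample` values are hypothetical.

```python
import pandas as pd

def yearly_mean_for_month(df, month, time_col="date_time", value_col="traffic_volume"):
    """Average value_col per calendar year, restricted to one month (1-12)."""
    in_month = df[df[time_col].dt.month == month]
    return in_month.groupby(in_month[time_col].dt.year)[value_col].mean()

# Hypothetical readings: two in July 2015, one in July 2016, one in August 2016.
sample = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2015-07-01 10:00", "2015-07-02 10:00",
        "2016-07-01 10:00", "2016-08-01 10:00",
    ]),
    "traffic_volume": [5000, 5200, 3000, 4900],
})

july_by_year = yearly_mean_for_month(sample, 7)
print(july_by_year)
```

Applied to the daytime data with `month=7`, the result could be plotted as a line chart to see how July compares across years.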
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
day = day.assign(dayofweek=day['date_time'].dt.dayofweek)
by_dayofweek = day.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume']  # 0 is Monday, 6 is Sunday
dayofweek
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64
plt.plot(by_dayofweek['traffic_volume'])
plt.title('Traffic per Day')
plt.xlabel('Day')
plt.xticks(ticks=[0, 1, 2, 3, 4, 5, 6], labels=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday'], rotation=30)
plt.show()
As expected, traffic volume rises from Monday to Tuesday and stays consistently high until Friday, when the working week ends. Traffic volume is significantly lower at the weekend than on business days.
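The size of the gap between business days and the weekend can be estimated from the daily averages printed above. This is a rough sketch: the values are copied from that output, and the names `business_mean` and `weekend_mean` are introduced here for illustration. Note that averaging the daily means weights each day equally, so the result may differ slightly from a mean taken over all hourly rows.

```python
import pandas as pd

# Daytime averages per day of week, copied from the output above (0 = Monday).
by_dayofweek = pd.Series(
    [4893.551286, 5189.004782, 5284.454282, 5311.303730,
     5291.600829, 3927.249558, 3436.541789],
    index=range(7),
)

# .loc slicing on labels is inclusive at both ends.
business_mean = by_dayofweek.loc[0:4].mean()
weekend_mean = by_dayofweek.loc[5:6].mean()

print(f"business days: {business_mean:.0f}")
print(f"weekend:       {weekend_mean:.0f}")
print(f"weekend / business ratio: {weekend_mean / business_mean:.2f}")
```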
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
day = day.assign(hour=day['date_time'].dt.hour)
business_days = day[day['dayofweek'] <= 4].copy()  # 4 == Friday
weekend = day[day['dayofweek'] >= 5].copy()  # 5 == Saturday
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)
print(by_hour_business['traffic_volume'])
print(by_hour_weekend['traffic_volume'])
hour
7     6030.413559
8     5503.497970
9     4895.269257
10    4378.419118
11    4633.419470
12    4855.382143
13    4859.180473
14    5152.995778
15    5592.897768
16    6189.473647
17    5784.827133
18    4434.209431
Name: traffic_volume, dtype: float64
hour
7     1589.365894
8     2338.578073
9     3111.623917
10    3686.632302
11    4044.154955
12    4372.482883
13    4362.296564
14    4358.543796
15    4342.456881
16    4339.693805
17    4151.919929
18    3811.792279
Name: traffic_volume, dtype: float64
plt.figure(figsize=(10,6))
plt.subplot(1, 2, 1)
plt.plot(by_hour_business['traffic_volume'])
plt.title('Traffic per hour (business)')
plt.xlabel('Time of day')
plt.ylabel('Traffic Volume')
plt.ylim([1000, 6500])
plt.subplot(1, 2, 2)
plt.plot(by_hour_weekend['traffic_volume'])
plt.title('Traffic per hour (weekend)')
plt.xlabel('Time of day')
plt.ylabel('Traffic Volume')
plt.ylim([1000, 6500])
plt.show()
From the graphs it is clear that traffic volume is generally higher throughout the day on business days. Traffic on business days peaks at around 7am and 4pm, which coincides with people travelling to and from work or school. Traffic at the weekend is quieter in the morning, increases steadily and reaches a peak around midday.
In general, the most common time indicators for heavy traffic are the months from March to October (with the exception of July), business days rather than weekends, and the rush hours around 7am and 4pm on business days.
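The peak hours can be read off the hourly averages programmatically rather than by eye. The sketch below rebuilds the two series from the values printed above and uses `idxmax()` to find the hour with the highest average. The variable names `morning_peak`, `evening_peak` and `weekend_peak` are introduced here for illustration, and splitting the business day at noon is an assumption.

```python
import pandas as pd

hours = list(range(7, 19))

# Hourly daytime averages copied from the printed output above.
by_hour_business = pd.Series(
    [6030.413559, 5503.497970, 4895.269257, 4378.419118, 4633.419470,
     4855.382143, 4859.180473, 5152.995778, 5592.897768, 6189.473647,
     5784.827133, 4434.209431],
    index=hours,
)
by_hour_weekend = pd.Series(
    [1589.365894, 2338.578073, 3111.623917, 3686.632302, 4044.154955,
     4372.482883, 4362.296564, 4358.543796, 4342.456881, 4339.693805,
     4151.919929, 3811.792279],
    index=hours,
)

# Label-based slices are inclusive: hours 7-11 and hours 12-18.
morning_peak = by_hour_business.loc[:11].idxmax()
evening_peak = by_hour_business.loc[12:].idxmax()
weekend_peak = by_hour_weekend.idxmax()

print("business-day morning peak hour:", morning_peak)
print("business-day evening peak hour:", evening_peak)
print("weekend peak hour:", weekend_peak)
```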
# numeric_only=True (pandas 1.5+) restricts the correlation to numeric columns
day.corr(numeric_only=True)['traffic_volume']
temp              0.128317
rain_1h           0.003697
snow_1h           0.001265
clouds_all       -0.032932
traffic_volume    1.000000
month            -0.022337
dayofweek        -0.416453
hour              0.172704
Name: traffic_volume, dtype: float64
From the above we can see that, among the weather columns, the strongest correlation is the positive relationship between temp and traffic volume (about 0.13), which is still weak.
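Because correlations can be negative, it helps to rank them by absolute value before comparing them. The following sketch does this with the coefficients copied from the output above. The grouping of columns into `weather_cols` is an assumption made here for illustration.

```python
import pandas as pd

# Correlations with traffic_volume, copied from the output above.
corr = pd.Series({
    "temp": 0.128317,
    "rain_1h": 0.003697,
    "snow_1h": 0.001265,
    "clouds_all": -0.032932,
    "traffic_volume": 1.000000,
    "month": -0.022337,
    "dayofweek": -0.416453,
    "hour": 0.172704,
})

# Drop the trivial self-correlation, then rank by strength regardless of sign.
ranked = corr.drop("traffic_volume").abs().sort_values(ascending=False)
print(ranked)

# Restrict the comparison to the numeric weather columns.
weather_cols = ["temp", "rain_1h", "snow_1h", "clouds_all"]
strongest_weather = corr[weather_cols].abs().idxmax()
print("strongest weather column:", strongest_weather)
```

Ranked this way, the time-derived columns and the weather columns can be compared on the same footing.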
plt.figure(figsize=(12,10))
plt.subplot(2, 2, 1)
plt.scatter(day['traffic_volume'], day['temp'])
plt.xlabel('Traffic volume')
plt.ylabel('Temperature (K)')  # temp values (around 288) are in kelvins, not °C
plt.title('Temperature')
plt.subplot(2, 2, 2)
plt.scatter(day['traffic_volume'], day['rain_1h'])
plt.xlabel('Traffic volume')
plt.ylabel('Rain')
plt.title('Rainfall')
plt.subplot(2, 2, 3)
plt.scatter(day['traffic_volume'], day['snow_1h'])
plt.xlabel('Traffic volume')
plt.ylabel('Snow')
plt.title('Snowfall')
plt.subplot(2, 2, 4)
plt.scatter(day['traffic_volume'], day['clouds_all'])
plt.xlabel('Traffic volume')
plt.ylabel('Clouds')
plt.title('Cloud cover')
plt.show()
While the temp column did have a positive correlation with traffic volume, it is clear from the scatter plot that it is not a useful indicator of heavy traffic. Similarly, the other columns display no significant or useful correlation with traffic volume.
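The temp values in the previews above (around 288) look like kelvins rather than degrees Celsius. Assuming that is the unit, a conversion makes the temperature axis easier to interpret. The sketch below uses the first five temp values from the preview; the name `temp_celsius` is introduced here for illustration.

```python
import pandas as pd

# First five temp values from the preview above, assumed to be in kelvins.
temp_kelvin = pd.Series([288.28, 289.36, 289.58, 290.13, 291.14])

# Celsius = kelvin - 273.15
temp_celsius = temp_kelvin - 273.15

print(temp_celsius.round(2).tolist())
```

On the full data this would be `day['temp'] - 273.15`, after which an axis label in °C would be accurate.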
by_weather_main = day.groupby('weather_main').mean()
by_weather_description = day.groupby('weather_description').mean()
by_weather_main['traffic_volume'].plot.barh()
plt.show()
weather_main is a short description of the weather conditions and contains 11 unique entries. From the bar chart we can see that no single weather condition is associated with a traffic volume exceeding 5,000, and overall there is little variation in traffic volume between weather conditions.
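The number of unique entries and the categories above a given volume can also be listed without reading them off a chart. This is a minimal sketch on made-up data: the `sample` frame and the cut-off of 4,750 are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical sample (not the real dataset).
sample = pd.DataFrame({
    "weather_main": ["Clouds", "Clouds", "Rain", "Clear", "Rain"],
    "traffic_volume": [5000, 4600, 4300, 4700, 4500],
})

# Number of distinct weather categories.
n_types = sample["weather_main"].nunique()

# Mean volume per category, highest first.
mean_by_weather = (
    sample.groupby("weather_main")["traffic_volume"]
    .mean()
    .sort_values(ascending=False)
)

# Categories whose mean exceeds a chosen cut-off (hypothetical value).
above_cutoff = mean_by_weather[mean_by_weather > 4750]

print(n_types)
print(mean_by_weather)
print(above_cutoff.index.tolist())
```

The same pattern applies to the weather_description column by swapping the column name.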
We will look at the weather_description column as it has a much greater number of unique entries.
plt.figure(figsize=(9,15))
by_weather_description['traffic_volume'].plot.barh()
plt.show()
There are 3 weather types which have a traffic volume greater than 5000:
No one weather type stands out as a clear indicator of traffic volume.
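The best time-based indicators of heavy traffic are the month (March to October), the day of the week (business days) and the time of day (around 7am and 4pm on business days).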
There are 3 weather types which can be associated with heavier traffic:
The type of weather appears to be a fairly poor indicator of heavy traffic, as nearly all types are associated with volumes exceeding 4,000 cars. Therefore the time-based indicators, especially the day of the week and the time of day, are the most useful indicators of heavy traffic.