In this project I select and apply several ways of graphically representing information derived from I-94 westbound traffic data.
This effort to identify indicators of traffic slowdowns is an example of exploratory analysis.
The nighttime (7pm-7am) traffic volume histogram is right-skewed, meaning low traffic volumes dominate. Nighttime data was excluded from the rest of the analysis because it offers little insight into indicators of high traffic volume.
The Winter months (Nov-Feb) have lower traffic volume, with the lowest in Dec/Jan. This could be a combination of the holidays and the colder weather.
July was an exception to the higher traffic volumes observed in the Spring-Fall (Mar-Oct) with a noticeable dip. This dip is curious but does not provide insight on high traffic volume indicators.
Business Days (Mon-Fri) showed much higher traffic volume than Weekends (Sat-Sun), and the line graphs of the hourly changes in traffic volume for each displayed two very distinct trends.
Business-day traffic volume peaked at 7:00 (a mean of about 6,000 vehicles) and again at 16:00 (about 6,200).
Weekend Traffic Volume started low in the morning and increased steadily to maintain a steady middle level throughout the afternoon.
The numeric weather measures (temperature, rain, snow, cloud cover) were only weakly correlated with traffic volume; the strongest, temperature, had a correlation of about 0.13.
However, on closer inspection, weather events that combined rain and snow had higher traffic volumes than events that involved just rain or just snow. One possible explanation is icy road conditions.
import pandas as pd
dataset = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
print(dataset.head())
print(dataset.tail())
print(dataset.info())
  holiday    temp  rain_1h  snow_1h  clouds_all weather_main  \
0    None  288.28      0.0      0.0          40       Clouds
1    None  289.36      0.0      0.0          75       Clouds
2    None  289.58      0.0      0.0          90       Clouds
3    None  290.13      0.0      0.0          90       Clouds
4    None  291.14      0.0      0.0          75       Clouds

  weather_description            date_time  traffic_volume
0    scattered clouds  2012-10-02 09:00:00            5545
1       broken clouds  2012-10-02 10:00:00            4516
2     overcast clouds  2012-10-02 11:00:00            4767
3     overcast clouds  2012-10-02 12:00:00            5026
4       broken clouds  2012-10-02 13:00:00            4918

      holiday    temp  rain_1h  snow_1h  clouds_all  weather_main  \
48199    None  283.45      0.0      0.0          75        Clouds
48200    None  282.76      0.0      0.0          90        Clouds
48201    None  282.73      0.0      0.0          90  Thunderstorm
48202    None  282.09      0.0      0.0          90        Clouds
48203    None  282.12      0.0      0.0          90        Clouds

          weather_description            date_time  traffic_volume
48199           broken clouds  2018-09-30 19:00:00            3543
48200         overcast clouds  2018-09-30 20:00:00            2781
48201  proximity thunderstorm  2018-09-30 21:00:00            2159
48202         overcast clouds  2018-09-30 22:00:00            1450
48203         overcast clouds  2018-09-30 23:00:00             954

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
None
import matplotlib.pyplot as plt
%matplotlib inline
dataset["traffic_volume"].plot.hist(bins=10)
plt.xlabel("Traffic Volume")
plt.show()
dataset["traffic_volume"].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
Based on the series description and histogram generated above, the most common traffic volumes are either below 1000 cars or close to 5000 cars.
Time of day certainly influences the number of cars on the road. Given the shape of the histogram, the data may be indicating a nighttime average around 1000 cars and a daytime average around 4500.
dataset["date_time"] = pd.to_datetime(dataset["date_time"])
dataset["hour"] = dataset["date_time"].dt.hour
print(dataset.head())
  holiday    temp  rain_1h  snow_1h  clouds_all weather_main  \
0    None  288.28      0.0      0.0          40       Clouds
1    None  289.36      0.0      0.0          75       Clouds
2    None  289.58      0.0      0.0          90       Clouds
3    None  290.13      0.0      0.0          90       Clouds
4    None  291.14      0.0      0.0          75       Clouds

  weather_description           date_time  traffic_volume  hour
0    scattered clouds 2012-10-02 09:00:00            5545     9
1       broken clouds 2012-10-02 10:00:00            4516    10
2     overcast clouds 2012-10-02 11:00:00            4767    11
3     overcast clouds 2012-10-02 12:00:00            5026    12
4       broken clouds 2012-10-02 13:00:00            4918    13
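As a quick sanity check of the hour extraction above, the sketch below applies the same `pd.to_datetime` and `.dt.hour` steps to three hand-written timestamps (illustrative values, not rows from the dataset) and builds a 7:00 to 19:00 daytime mask. The names `sample` and `is_day` are assumptions made for this example.

```python
import pandas as pd

# Hand-written timestamps chosen to sit on the day/night boundaries (illustrative only).
sample = pd.DataFrame({"date_time": ["2012-10-02 06:00:00",
                                     "2012-10-02 07:00:00",
                                     "2012-10-02 19:00:00"]})
sample["date_time"] = pd.to_datetime(sample["date_time"])
sample["hour"] = sample["date_time"].dt.hour

# Daytime is the half-open interval [7, 19): 7:00 counts as day, 19:00 as night.
is_day = (sample["hour"] >= 7) & (sample["hour"] < 19)
print(sample.assign(is_day=is_day))
```

The half-open interval means every hour lands in exactly one of the two groups, so the day and night subsets together cover the full dataset without overlap.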
bool_day = (dataset["hour"] >= 7) & (dataset["hour"] < 19)
bool_night = ~bool_day
dayset = dataset.loc[bool_day].copy()
nightset = dataset.loc[bool_night].copy()
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.hist(dayset["traffic_volume"], bins=10)
plt.title("Daytime Traffic Volumes")
plt.xlabel("Traffic Volume")
plt.ylabel("Frequency")
plt.xlim([0,8000])
plt.ylim([0,8000])
plt.subplot(1,2,2)
plt.hist(nightset["traffic_volume"], bins=10)
plt.title("Nighttime Traffic Volumes")
plt.xlabel("Traffic Volume")
plt.ylabel("Frequency")
plt.xlim([0,8000])
plt.ylim([0,8000])
plt.show()
print("Daytime Traffic Volume Statistics:")
print(dayset["traffic_volume"].describe())
print()
print("Nighttime Traffic Volume Statistics:")
print(nightset["traffic_volume"].describe())
Daytime Traffic Volume Statistics:
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64

Nighttime Traffic Volume Statistics:
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64
The dayset histogram is left-skewed, with higher traffic volumes occurring more often during this period.
The nightset histogram is right-skewed, with lower traffic volumes occurring more often.
Since nighttime data reflects lower traffic volumes, it is less useful for finding indicators of heavy traffic, which is significantly more common during the day.
Moving forward, we'll focus on the dayset data.
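One rough way to back up the skew reading without the plots is to compare mean and median: a mean below the median suggests a left skew, and a mean above it suggests a right skew. The sketch below applies that rule of thumb to the `describe()` figures printed above. The helper `skew_direction` is a hypothetical name introduced for this example, and the comparison is a heuristic rather than a formal test.

```python
def skew_direction(mean: float, median: float) -> str:
    """Rule-of-thumb skew label from mean vs. median (not a formal test)."""
    if mean < median:
        return "left-skewed"
    if mean > median:
        return "right-skewed"
    return "roughly symmetric"

# Mean and median (the 50% row) copied from the describe() output above.
day_skew = skew_direction(mean=4762.047452, median=4820.0)
night_skew = skew_direction(mean=1785.377441, median=1287.0)

print("Daytime:", day_skew)
print("Nighttime:", night_skew)
```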
dayset["month"] = dayset["date_time"].dt.month
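# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_month = dayset.groupby("month").mean(numeric_only=True)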
by_month['traffic_volume']
month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64
Plotting the monthly mean values, we can see that traffic volume holds fairly steady at around 4900 for most months but dips around the times when people often take vacation: Nov-Feb (Thanksgiving/Christmas) and July.
plt.plot(by_month["traffic_volume"])
plt.title("Traffic Volume by Month")
plt.xlabel("Month")
plt.ylabel("Mean Traffic Volume")
plt.show()
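To make the monthly dips concrete, the sketch below re-enters the monthly means printed above and flags months that fall more than 2% below the median month. The 2% threshold and the names `monthly_mean`, `threshold`, and `low_months` are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

# Monthly daytime means copied from the by_month output above.
monthly_mean = pd.Series(
    [4495.613727, 4711.198394, 4889.409560, 4906.894305, 4911.121609, 4898.019566,
     4595.035744, 4928.302035, 4870.783145, 4921.234922, 4704.094319, 4374.834566],
    index=range(1, 13), name="traffic_volume")

# Flag months more than 2% below the median month (arbitrary illustrative cutoff).
threshold = 0.98 * monthly_mean.median()
low_months = monthly_mean[monthly_mean < threshold].index.tolist()
print(low_months)
```

With these figures the flagged months are January, February, July, November, and December, which matches the winter and July dips described above.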
dayset['dayofweek'] = dayset['date_time'].dt.dayofweek
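# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_dayofweek = dayset.groupby('dayofweek').mean(numeric_only=True)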
by_dayofweek['traffic_volume'] # 0 is Monday, 6 is Sunday
dayofweek
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64
plt.plot(by_dayofweek["traffic_volume"])
plt.title("Traffic Volume by Day of Week")
plt.xlabel("Day of Week")
plt.ylabel("Mean Traffic Volume")
plt.xticks(ticks=[0,1,2,3,4,5,6], labels=["Mon", "Tues", "Weds", "Thurs", "Fri", "Sat", "Sun"])
plt.show()
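The day-of-week numbering can be confirmed on the first and last timestamps shown in the `head()` and `tail()` output, and the gap between business days and weekends can be summarized from the means printed above. This is a minimal sketch: `weekday_avg` and `weekend_avg` are hypothetical names, and they are plain averages of the daily means, not weighted by the number of rows per day.

```python
import pandas as pd

# First and last timestamps in the dataset, taken from the head()/tail() output.
first_day = pd.Timestamp("2012-10-02 09:00:00")
last_day = pd.Timestamp("2018-09-30 23:00:00")
print(first_day.dayofweek, first_day.day_name())  # Monday is 0
print(last_day.dayofweek, last_day.day_name())    # Sunday is 6

# Day-of-week means copied from the by_dayofweek output above.
dow_mean = pd.Series([4893.551286, 5189.004782, 5284.454282, 5311.303730,
                      5291.600829, 3927.249558, 3436.541789], index=range(7))

# .loc slices by label and includes both endpoints.
weekday_avg = dow_mean.loc[0:4].mean()
weekend_avg = dow_mean.loc[5:6].mean()
print(round(weekday_avg), round(weekend_avg))
```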
Split the dataset again, this time by day-of-week value, into one set for business days and one set for weekends.
Calculate the mean traffic volume per hour for each of the two sets.
At first glance we can already see that 7am traffic volume is almost four times higher on weekdays (~6000) than on weekends (~1600).
business = dayset[dayset["dayofweek"] <=4].copy()
weekend = dayset[dayset["dayofweek"] > 4].copy()
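# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_hour_business = business.groupby("hour").mean(numeric_only=True)
by_hour_weekend = weekend.groupby("hour").mean(numeric_only=True)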
print("Traffic Volume per Hour on Business Days (Mon-Fri)")
print(by_hour_business["traffic_volume"])
print()
print("Traffic Volume per Hour on Weekends (Sat-Sun)")
print(by_hour_weekend["traffic_volume"])
Traffic Volume per Hour on Business Days (Mon-Fri)
hour
7     6030.413559
8     5503.497970
9     4895.269257
10    4378.419118
11    4633.419470
12    4855.382143
13    4859.180473
14    5152.995778
15    5592.897768
16    6189.473647
17    5784.827133
18    4434.209431
Name: traffic_volume, dtype: float64

Traffic Volume per Hour on Weekends (Sat-Sun)
hour
7     1589.365894
8     2338.578073
9     3111.623917
10    3686.632302
11    4044.154955
12    4372.482883
13    4362.296564
14    4358.543796
15    4342.456881
16    4339.693805
17    4151.919929
18    3811.792279
Name: traffic_volume, dtype: float64
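The hourly means printed above can be compared directly. The sketch below re-enters them and computes the business-day to weekend ratio per hour, together with the busiest hours in each set. The variable names are assumptions introduced for this example.

```python
import pandas as pd

hours = range(7, 19)

# Hourly means copied from the output above.
business_mean = pd.Series(
    [6030.413559, 5503.497970, 4895.269257, 4378.419118, 4633.419470, 4855.382143,
     4859.180473, 5152.995778, 5592.897768, 6189.473647, 5784.827133, 4434.209431],
    index=hours)
weekend_mean = pd.Series(
    [1589.365894, 2338.578073, 3111.623917, 3686.632302, 4044.154955, 4372.482883,
     4362.296564, 4358.543796, 4342.456881, 4339.693805, 4151.919929, 3811.792279],
    index=hours)

# How many times busier a business day is than a weekend day, hour by hour.
ratio = business_mean / weekend_mean
print(ratio.round(2))

# Two busiest business-day hours and the single busiest weekend hour.
peak_business = business_mean.nlargest(2).index.tolist()
peak_weekend = weekend_mean.idxmax()
print(peak_business, peak_weekend)
```

The ratio is largest at 7:00 and shrinks toward midday, which is consistent with the commuting explanation: the two sets differ most during the morning rush.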
The plot grid below shows very different hourly traffic patterns for business days and weekends, visible in the different shapes of their line graphs.
On business days the traffic volume starts high early in the morning, falls off mid-morning, and then slowly builds to a second peak in the late afternoon. These peaks coincide with the times of day people are most likely to be commuting.
On weekends, however, traffic volume is low early in the morning, climbs steadily until noon, and then holds at a medium level for the rest of the afternoon before tapering off slowly into the evening.
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(by_hour_business["traffic_volume"])
plt.title("Business Day (M-F) Traffic Volume by Hour")
plt.xlabel("Hour")
plt.xticks(ticks=[8, 10, 12, 14, 16, 18], labels=["8:00", "10:00", "12:00", "14:00", "16:00", "18:00"], rotation=30)
plt.ylabel("Mean Traffic Volume")
plt.ylim([0,7000])
plt.subplot(1,2,2)
plt.plot(by_hour_weekend["traffic_volume"])
plt.title("Weekend (Sat-Sun) Traffic Volume by Hour")
plt.xlabel("Hour")
plt.xticks(ticks=[8, 10, 12, 14, 16, 18], labels=["8:00", "10:00", "12:00", "14:00", "16:00", "18:00"], rotation=30)
plt.ylabel("Mean Traffic Volume")
plt.ylim([0,7000])  # same y-range as the business-day panel so the two plots compare directly
plt.show()
print("Correlation with Traffic Volume:")
print(
dayset[["temp","rain_1h", "snow_1h", "clouds_all"]].corrwith(dayset["traffic_volume"])
)
Correlation with Traffic Volume:
temp          0.128317
rain_1h       0.003697
snow_1h       0.001265
clouds_all   -0.032932
dtype: float64
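By default `corrwith` reports a Pearson correlation coefficient for each column. The sketch below, on made-up numbers rather than dataset rows, computes that coefficient by hand next to `Series.corr` to show what the values above measure. The names `temp`, `volume`, `manual_r`, and `pandas_r` are assumptions for this example.

```python
import pandas as pd

# Made-up illustrative values, not taken from the dataset.
temp = pd.Series([270.0, 275.0, 280.0, 285.0, 290.0, 295.0])
volume = pd.Series([4600.0, 5100.0, 4500.0, 5000.0, 4700.0, 5200.0])

# Pearson r = covariance / (std_x * std_y), using sample (ddof=1) statistics throughout.
manual_r = temp.cov(volume) / (temp.std() * volume.std())
pandas_r = temp.corr(volume)

print(round(manual_r, 6), round(pandas_r, 6))
```

Because Pearson r only captures linear association, values near zero, such as those for `rain_1h` and `snow_1h`, rule out a strong linear relationship but not every kind of dependence.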
Next, compute mean traffic volume for each categorical weather label (the main weather type and the more detailed weather description).
Looking at the main weather events, no category stands out as above average.
However, looking in more detail, weather events that involve both rain and snow have higher traffic volume. Specifically, the weather descriptions "shower snow" and "light rain and snow" have higher traffic volume. The explanation could involve icy conditions.
The only other event combining both types of precipitation is "light shower and snow". Assuming rain severity runs from least to most severe as [light shower, shower, light rain], a "light shower" with snow may not be enough to produce icy conditions.
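# numeric_only=True skips the text columns, which recent pandas versions refuse to average
by_weather_main = dayset.groupby('weather_main').mean(numeric_only=True)
by_weather_description = dayset.groupby('weather_description').mean(numeric_only=True)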
by_weather_main["traffic_volume"].plot.barh()
plt.title("Traffic Volume for Main Weather Events")
plt.xlabel("Traffic Volume")
plt.ylabel("Main Weather Events")
plt.show()
by_weather_description["traffic_volume"].plot.barh(figsize=(5,10))
plt.show()
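When reading per-category means like the ones plotted above, it can help to see how many rows stand behind each bar, since a mean built from only a few hourly observations can look high by chance. The sketch below shows one way to report both figures, using a made-up toy table rather than the real dataset; `toy` and `summary` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration; the descriptions only mimic the dataset's style.
toy = pd.DataFrame({
    "weather_description": ["light rain", "light rain", "light rain",
                            "shower snow", "overcast clouds", "overcast clouds"],
    "traffic_volume": [4700, 4900, 4800, 5600, 4600, 4800],
})

# Mean traffic volume and number of observations per weather description.
summary = (toy.groupby("weather_description")["traffic_volume"]
              .agg(["mean", "count"])
              .sort_values("mean", ascending=False))
print(summary)
```

In this toy table the top category rests on a single row, which is the kind of case where a high mean deserves caution before it is treated as an indicator.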