This is an exploratory analysis of traffic data from the I-94 Interstate highway, with the goal of finding the main indicators of heavy traffic. Possible indicators include weather conditions, the time of day or week, and specific months.
The dataset documentation mentions that a station located approximately midway between Minneapolis and Saint Paul recorded the traffic data. Also, the station only records westbound traffic (cars moving from east to west).
The data was made available by John Hogue at the UCI Machine Learning Repository.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
Reading the data set:
traffic = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")
Let's take a quick look at the dataset by printing the first and last five rows:
print("Top five rows:")
traffic.head()
Top five rows:
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
| 1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
| 2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
| 3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
| 4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
print("Last five rows:")
traffic.tail()
Last five rows:
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
| 48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
| 48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
| 48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
| 48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
Looking into the content of the dataset:
traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):

| Column | Non-null count | Dtype |
|---|---|---|
| holiday | 48204 | object |
| temp | 48204 | float64 |
| rain_1h | 48204 | float64 |
| snow_1h | 48204 | float64 |
| clouds_all | 48204 | int64 |
| weather_main | 48204 | object |
| weather_description | 48204 | object |
| date_time | 48204 | object |
| traffic_volume | 48204 | int64 |

dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
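From the output above we conclude the following: the dataset has 48,204 rows and 9 columns, and no column contains null values. Four columns (holiday, weather_main, weather_description and date_time) are stored as object, so date_time has to be converted before we can work with it as a date. The first and last rows show records running from 2012-10-02 09:00:00 to 2018-09-30 23:00:00.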
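The `info()` output can be cross-checked column by column. The snippet below is a minimal sketch: `summarize_columns` is a hypothetical helper name, and the two-row `sample` frame (values copied from the first rows shown earlier) stands in for the full `traffic` data.

```python
import pandas as pd

def summarize_columns(df):
    """Return the dtype and the number of null values for every column of df."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_count": df.isnull().sum(),
    })

# Tiny stand-in for the real dataset (first two rows of the table above)
sample = pd.DataFrame({
    "temp": [288.28, 289.36],
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 10:00:00"],
    "traffic_volume": [5545, 4516],
})

summary = summarize_columns(sample)
print(summary)
```

Calling `summarize_columns(traffic)` on the full dataset should report a null count of 0 for every column if the `info()` output above is accurate.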
We will look into the traffic_volume column:
#Plotting the histogram
traffic["traffic_volume"].plot.hist()
plt.title("Traffic Volume on I-94")
plt.xlabel("Traffic volume")
plt.show()
#looking into the column stats for traffic_volume
traffic["traffic_volume"].describe()
| Statistic | traffic_volume |
|---|---|
| count | 48204.000000 |
| mean | 3259.818355 |
| std | 1986.860670 |
| min | 0.000000 |
| 25% | 1193.000000 |
| 50% | 3380.000000 |
| 75% | 4933.000000 |
| max | 7280.000000 |

Name: traffic_volume, dtype: float64
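Conclusions from the above table and plot: traffic_volume ranges from 0 to 7,280 cars, with a mean of about 3,260. A quarter of the records show 1,193 cars or fewer, and a quarter show 4,933 cars or more. The traffic_volume values appear to cluster in the ranges 0-1500 and 4500-5500, so these ranges are worth a closer look. A likely explanation is the time of day, so we will explore this further by comparing daytime and nighttime traffic.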
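The share of records falling into the ranges 0-1500 and 4500-5500 can be quantified rather than read off the histogram. This is a minimal sketch: `share_in_range` is a hypothetical helper, and `sample` holds only the ten traffic_volume values visible in the first and last rows above, so its shares are not representative of the full dataset.

```python
import pandas as pd

def share_in_range(values, low, high):
    """Fraction of values with low <= value < high."""
    values = pd.Series(values)
    return ((values >= low) & (values < high)).mean()

# The ten traffic_volume values from the head() and tail() tables above
sample = pd.Series([5545, 4516, 4767, 5026, 4918, 3543, 2781, 2159, 1450, 954])

low_share = share_in_range(sample, 0, 1500)
high_share = share_in_range(sample, 4500, 5500)
print(f"0-1500: {low_share:.0%}, 4500-5500: {high_share:.0%}")
```

On the real data, the same call would be `share_in_range(traffic["traffic_volume"], 0, 1500)`.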
#Transforming the date_time column to datetime
traffic["date_time"] = pd.to_datetime(traffic["date_time"])
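#Isolating the daytime (hours 7 to 18) and nighttime (hours 19 to 6) data
#.copy() makes each subset an independent DataFrame, so new columns can be added to it safely later
day_traffic = traffic[(traffic["date_time"].dt.hour >= 7) & (traffic["date_time"].dt.hour < 19)].copy()
night_traffic = traffic[(traffic["date_time"].dt.hour < 7) | (traffic["date_time"].dt.hour >= 19)].copy()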
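The day/night boundary (7:00 to 19:00) is a choice, and it can be wrapped in a small function so that other boundaries are easy to try. The sketch below is one possible way to do this: `split_day_night` and its parameter names are hypothetical, and the four sample rows are taken from the first and last rows shown earlier.

```python
import pandas as pd

def split_day_night(df, day_start=7, day_end=19, time_col="date_time"):
    """Split df into (day, night) copies; day covers day_start <= hour < day_end."""
    hours = pd.to_datetime(df[time_col]).dt.hour
    is_day = (hours >= day_start) & (hours < day_end)
    # .copy() returns independent frames, so later column assignments
    # do not act on a view of the original data
    return df[is_day].copy(), df[~is_day].copy()

# Four rows copied from the head() and tail() tables above
sample = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 13:00:00",
                  "2018-09-30 19:00:00", "2018-09-30 23:00:00"],
    "traffic_volume": [5545, 4918, 3543, 954],
})

day, night = split_day_night(sample)
print(day["traffic_volume"].tolist(), night["traffic_volume"].tolist())
```

Because the night mask is the negation of the day mask, every row ends up in exactly one of the two subsets.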
Looking into the traffic statistics for daytime and nighttime:
day_traffic["traffic_volume"].describe()
| Statistic | traffic_volume |
|---|---|
| count | 23877.000000 |
| mean | 4762.047452 |
| std | 1174.546482 |
| min | 0.000000 |
| 25% | 4252.000000 |
| 50% | 4820.000000 |
| 75% | 5559.000000 |
| max | 7280.000000 |

Name: traffic_volume, dtype: float64
night_traffic["traffic_volume"].describe()
| Statistic | traffic_volume |
|---|---|
| count | 24327.000000 |
| mean | 1785.377441 |
| std | 1441.951197 |
| min | 0.000000 |
| 25% | 530.000000 |
| 50% | 1287.000000 |
| 75% | 2819.000000 |
| max | 6386.000000 |

Name: traffic_volume, dtype: float64
#Plotting the histograms for day time and night time traffic
plt.figure(figsize = (10 , 5))
plt.subplot(1 , 2 , 1)
plt.hist(day_traffic["traffic_volume"])
plt.title("Traffic Volume : Day")
plt.xlabel("Traffic Volume")
plt.ylabel("Count")
plt.ylim(0 , 8000)
plt.xlim(0 , 7500)
plt.subplot(1 , 2 , 2)
plt.hist(night_traffic["traffic_volume"])
plt.title("Traffic Volume : Night")
plt.xlabel("Traffic Volume")
plt.ylabel("Count")
plt.ylim(0 , 8000)
plt.xlim(0 , 7500)
plt.show()
Since our goal is to identify the indicators of heavy traffic, from here on we can focus solely on the daytime data. Nighttime traffic is already low: its mean is about 1,785 cars, compared with about 4,762 during the day.
Traffic may also be heavier on certain days, at certain times of the day, or in certain months.
We will now look at traffic by month, adding a new column "month" to the day_traffic data.
day_traffic["month"] = day_traffic["date_time"].dt.month
#Grouping the data by month and calculating the monthly average
#(the column is selected before .mean() so that non-numeric columns are not averaged)
avg_by_month = day_traffic.groupby("month")["traffic_volume"].mean()
avg_by_month
| month | traffic_volume |
|---|---|
| 1 | 4495.613727 |
| 2 | 4711.198394 |
| 3 | 4889.409560 |
| 4 | 4906.894305 |
| 5 | 4911.121609 |
| 6 | 4898.019566 |
| 7 | 4595.035744 |
| 8 | 4928.302035 |
| 9 | 4870.783145 |
| 10 | 4921.234922 |
| 11 | 4704.094319 |
| 12 | 4374.834566 |

Name: traffic_volume, dtype: float64
#Plotting the line plot for above result
plt.plot(avg_by_month)
plt.xlabel("Month")
plt.ylabel("Average Traffic Volume")
plt.xticks(range(1, 13))
plt.title("Monthly Average Traffic Volume")
plt.show()
#Adding a new column "dayofweek" to the day time traffic dataset:
day_traffic["dayofweek"] = day_traffic["date_time"].dt.dayofweek
#Calculating average traffic by day of week
#(the column is selected before .mean() so that non-numeric columns are not averaged)
avg_traffic_bydayofweek = day_traffic.groupby("dayofweek")["traffic_volume"].mean()
avg_traffic_bydayofweek
# 0 = Monday and 6 = Sunday
| dayofweek | traffic_volume |
|---|---|
| 0 | 4893.551286 |
| 1 | 5189.004782 |
| 2 | 5284.454282 |
| 3 | 5311.303730 |
| 4 | 5291.600829 |
| 5 | 3927.249558 |
| 6 | 3436.541789 |

Name: traffic_volume, dtype: float64
#Plotting the line plot for above result
plt.plot(avg_traffic_bydayofweek)
plt.xlabel("Day of week (0 = Monday, 6 = Sunday)")
plt.ylabel("Average Traffic Volume")
plt.title("Average Traffic Volume : by day of week")
plt.show()
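Numeric day codes are easy to misread, so it can help to label them and to put a number on the weekday/weekend gap. The following is a minimal sketch that reuses the averages printed above; `DAY_NAMES` and the other variable names are illustrative, and the two summary figures are plain means of the daily averages, not weighted by the number of rows per day.

```python
import pandas as pd

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Day-of-week averages copied from the output above (0 = Monday ... 6 = Sunday)
avg_by_day = pd.Series(
    [4893.551286, 5189.004782, 5284.454282, 5311.303730,
     5291.600829, 3927.249558, 3436.541789],
    index=range(7), name="traffic_volume",
)

# Replace the numeric index 0..6 with readable labels
labelled = avg_by_day.rename(index=dict(enumerate(DAY_NAMES)))

weekday_mean = labelled[DAY_NAMES[:5]].mean()   # Monday to Friday
weekend_mean = labelled[DAY_NAMES[5:]].mean()   # Saturday and Sunday

print(labelled.round(0))
print(f"weekday mean: {weekday_mean:.0f}, weekend mean: {weekend_mean:.0f}")
```

On the full data, `day_traffic["date_time"].dt.day_name()` would produce full day names directly from the timestamps.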
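Conclusion from the above plot: average daytime traffic stays between roughly 4,900 and 5,300 cars from Monday to Friday, then drops to about 3,900 on Saturday and 3,400 on Sunday.
Because weekends have much lower average traffic than weekdays, we will calculate the hourly averages separately for business days and weekends.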
#Adding a new column "hour" to the day time traffic dataset:
day_traffic["hour"] = day_traffic["date_time"].dt.hour
#Separating the business day and weekend data:
business_days_traffic = day_traffic.loc[day_traffic["dayofweek"] <= 4]
weekends_traffic = day_traffic.loc[day_traffic["dayofweek"] > 4]
#Calculating average traffic by hour for business days:
hourly_avg_traffic_business_days = business_days_traffic.groupby("hour")["traffic_volume"].mean()
hourly_avg_traffic_business_days
| hour | traffic_volume |
|---|---|
| 7 | 6030.413559 |
| 8 | 5503.497970 |
| 9 | 4895.269257 |
| 10 | 4378.419118 |
| 11 | 4633.419470 |
| 12 | 4855.382143 |
| 13 | 4859.180473 |
| 14 | 5152.995778 |
| 15 | 5592.897768 |
| 16 | 6189.473647 |
| 17 | 5784.827133 |
| 18 | 4434.209431 |

Name: traffic_volume, dtype: float64
#Calculating average traffic by hour for weekends:
hourly_avg_traffic_weekends = weekends_traffic.groupby("hour")["traffic_volume"].mean()
hourly_avg_traffic_weekends
| hour | traffic_volume |
|---|---|
| 7 | 1589.365894 |
| 8 | 2338.578073 |
| 9 | 3111.623917 |
| 10 | 3686.632302 |
| 11 | 4044.154955 |
| 12 | 4372.482883 |
| 13 | 4362.296564 |
| 14 | 4358.543796 |
| 15 | 4342.456881 |
| 16 | 4339.693805 |
| 17 | 4151.919929 |
| 18 | 3811.792279 |

Name: traffic_volume, dtype: float64
#Plotting the line plots
plt.figure(figsize = (10 , 5))
plt.subplot(1 , 2 , 1)
plt.plot(hourly_avg_traffic_business_days)
plt.title("Hourly average traffic : Business days")
plt.ylim(0 , 6500)
plt.subplot(1 , 2 , 2)
plt.plot(hourly_avg_traffic_weekends)
plt.title("Hourly average traffic : weekends")
plt.ylim(0 , 6500)
plt.show()
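The busiest business-day hours can also be picked out programmatically instead of being read off the plot. This is a minimal sketch using the business-day averages printed above; the 5,000-car `THRESHOLD` is an arbitrary illustrative cut-off for "heavy" traffic, not a value taken from the dataset.

```python
import pandas as pd

# Business-day hourly averages copied from the output above (hours 7 to 18)
hourly_avg = pd.Series(
    [6030.413559, 5503.497970, 4895.269257, 4378.419118, 4633.419470,
     4855.382143, 4859.180473, 5152.995778, 5592.897768, 6189.473647,
     5784.827133, 4434.209431],
    index=range(7, 19), name="traffic_volume",
)

THRESHOLD = 5000  # assumed cut-off for "heavy" traffic, chosen for illustration

busy_hours = hourly_avg[hourly_avg > THRESHOLD].index.tolist()
peak_hour = hourly_avg.idxmax()

print("hours above threshold:", busy_hours)
print("peak hour:", peak_hour)
```

With this threshold the busy hours fall into a morning block and an afternoon block, which matches the two peaks in the business-day plot.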
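So far, we have looked at time indicators of traffic volume, and the outputs above support the following conclusions. Average daytime traffic is lower in the colder months (November through February, roughly 4,375 to 4,711 cars) than from March through October (about 4,870 to 4,930), with July (about 4,595) as an exception. Traffic is heavier on business days than on weekends. On business days the busiest hours are 7:00 and 16:00, each averaging over 6,000 cars.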
Now we will look at weather indicators of heavy traffic. We start with the correlation between traffic_volume and the numeric weather-related columns.
day_traffic.select_dtypes("number").corr()["traffic_volume"]
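| Column | Correlation with traffic_volume |
|---|---|
| temp | 0.128317 |
| rain_1h | 0.003697 |
| snow_1h | 0.001265 |
| clouds_all | -0.032932 |
| traffic_volume | 1.000000 |
| month | -0.022337 |
| dayofweek | -0.416453 |
| hour | 0.172704 |

Name: traffic_volume, dtype: float64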
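To compare the weather columns only, the correlations can be ranked by absolute value, since the sign shows direction and not strength. The snippet below is a small sketch that reuses the coefficients printed above; the `weather_cols` list is an assumption about which columns count as weather-related.

```python
import pandas as pd

# Correlations with traffic_volume, copied from the output above
corr = pd.Series({
    "temp": 0.128317,
    "rain_1h": 0.003697,
    "snow_1h": 0.001265,
    "clouds_all": -0.032932,
    "month": -0.022337,
    "dayofweek": -0.416453,
    "hour": 0.172704,
})

# Assumed set of numeric weather columns (the rest are time-related)
weather_cols = ["temp", "rain_1h", "snow_1h", "clouds_all"]

# Rank by strength of the linear relationship, ignoring its direction
ranked = corr[weather_cols].abs().sort_values(ascending=False)
print(ranked)
```

Even the strongest weather column, temp, has a coefficient of only about 0.13, so none of the numeric weather columns looks like a reliable linear indicator of heavy traffic on its own.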
#Plotting the scatter plot between temp and traffic_volume
plt.scatter(day_traffic["traffic_volume"] , day_traffic["temp"])
plt.xlabel("Traffic Volume")
plt.ylabel("Temperature (K)")
plt.show()
Now we will look at the non-numeric (categorical) weather columns for any relation with traffic volume:
#Calculating average traffic_volume for each category in the column 'weather_main'
day_traffic.groupby('weather_main')["traffic_volume"].mean()
| weather_main | traffic_volume |
|---|---|
| Clear | 4778.416260 |
| Clouds | 4865.415996 |
| Drizzle | 4837.212911 |
| Fog | 4372.491713 |
| Haze | 4609.893285 |
| Mist | 4623.976475 |
| Rain | 4815.568462 |
| Smoke | 4564.583333 |
| Snow | 4396.321183 |
| Squall | 4211.000000 |
| Thunderstorm | 4648.212860 |

Name: traffic_volume, dtype: float64
#Plotting the horizontal bar plot (the averages are computed once and reused)
avg_by_weather_main = day_traffic.groupby('weather_main')["traffic_volume"].mean()
plt.barh(avg_by_weather_main.index, avg_by_weather_main.values)
plt.xlabel("Average Traffic Volume")
plt.title("Traffic Volume by weather condition")
plt.show()
#Calculating average traffic_volume for each category in the column 'weather_description'
day_traffic.groupby('weather_description')['traffic_volume'].mean()
| weather_description | traffic_volume |
|---|---|
| SQUALLS | 4211.000000 |
| Sky is Clear | 4919.009390 |
| broken clouds | 4824.130326 |
| drizzle | 4737.330935 |
| few clouds | 4839.818023 |
| fog | 4372.491713 |
| freezing rain | 4314.000000 |
| haze | 4609.893285 |
| heavy intensity drizzle | 4738.586207 |
| heavy intensity rain | 4610.356164 |
| heavy snow | 4411.681250 |
| light intensity drizzle | 4890.164049 |
| light intensity shower rain | 4558.100000 |
| light rain | 4859.650849 |
| light rain and snow | 5579.750000 |
| light shower snow | 4618.636364 |
| light snow | 4430.858896 |
| mist | 4623.976475 |
| moderate rain | 4769.643312 |
| overcast clouds | 4861.124952 |
| proximity shower rain | 4901.756757 |
| proximity thunderstorm | 4684.356436 |
| proximity thunderstorm with drizzle | 5121.833333 |
| proximity thunderstorm with rain | 4501.611111 |
| scattered clouds | 4936.787712 |
| shower drizzle | 4932.666667 |
| shower snow | 5664.000000 |
| sky is clear | 4753.930294 |
| sleet | 4312.666667 |
| smoke | 4564.583333 |
| snow | 4054.065693 |
| thunderstorm | 4724.708333 |
| thunderstorm with drizzle | 2297.000000 |
| thunderstorm with heavy rain | 4555.760000 |
| thunderstorm with light drizzle | 4960.000000 |
| thunderstorm with light rain | 4336.130435 |
| thunderstorm with rain | 4522.950000 |
| very heavy rain | 4780.571429 |

Name: traffic_volume, dtype: float64
#Plotting the horizontal bar plot
plt.figure(figsize = (8,12))
avg_by_weather_description = day_traffic.groupby('weather_description')["traffic_volume"].mean()
plt.barh(avg_by_weather_description.index, avg_by_weather_description.values)
plt.xlabel("Average Traffic Volume")
plt.title("Traffic Volume by weather description")
plt.show()
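Several weather_description averages above are whole numbers (for example 5664.000000 for "shower snow" and 4211.000000 for "SQUALLS"), which suggests, though the outputs do not confirm it, that those categories contain very few rows. A group mean is only as reliable as its sample size, so it can help to report the count next to the mean. The sketch below shows one way to do that: `mean_with_count` and `min_count` are hypothetical names, and the sample values are made up for illustration.

```python
import pandas as pd

def mean_with_count(df, by, value="traffic_volume", min_count=1):
    """Mean and row count of `value` per group, keeping groups with at least min_count rows."""
    stats = df.groupby(by)[value].agg(["mean", "count"])
    return stats[stats["count"] >= min_count].sort_values("mean", ascending=False)

# Made-up sample: one category with a single row, one with three rows
sample = pd.DataFrame({
    "weather_description": ["shower snow", "light rain", "light rain", "light rain"],
    "traffic_volume": [5664, 4800, 4900, 5000],
})

all_groups = mean_with_count(sample, "weather_description")
reliable = mean_with_count(sample, "weather_description", min_count=3)

print(all_groups)
print(reliable)
```

On the real data, `mean_with_count(day_traffic, "weather_description", min_count=...)` with a suitable minimum would show whether the apparent high-traffic categories rest on more than a handful of observations.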