#!/usr/bin/env python
# coding: utf-8

# ## Finding Heavy Traffic Indicators on I-94

# ### Introduction:
# This is an exploratory analysis of traffic data from the I-94 interstate highway, aimed at finding the main indicators of heavy traffic. These indicators can be weather conditions, time of the week, specific months, etc.
#
# The dataset documentation mentions that a station located approximately midway between Minneapolis and Saint Paul recorded the traffic data. Also, the station only records westbound traffic (cars moving from east to west).
#
# The data was made available by John Hogue at the UCI Machine Learning Repository.

# ### 1.0 Loading the Data:

# In[1]:


import pandas as pd
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')


# Reading the dataset:

# In[2]:


traffic = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")


# Let's take a quick look at the dataset by printing the first and last five rows:

# In[3]:


print("Top five rows:")
traffic.head()


# In[4]:


print("Last five rows:")
traffic.tail()


# Looking into the content of the dataset:

# In[5]:


traffic.info()


# * From the output above we conclude the following:
#
# 1. There are 9 columns in total in the dataset.
# 2. There are no null values in the dataset.
#

# ### 2.0 Exploratory Data Analysis:

# We will look into the **traffic_volume** column:

# In[6]:


# Plotting the histogram
traffic["traffic_volume"].plot.hist()
plt.title("Traffic Volume on I-94")
plt.xlabel("Traffic volume")
plt.show()


# In[7]:


# Looking into the column stats for traffic_volume
traffic["traffic_volume"].describe()


# - Conclusions from the table and plot above:
#
# 1. Values of `traffic_volume` in the ranges 0-1500 and 4500-5500 occur most frequently. We can look into these ranges.
# 2. The distribution is not normal.
# 3. The average traffic volume is 3259.
# 4. 25% of the time the traffic volume is below 1193, and 25% of the time the volume is 4933 or higher. This variation might reflect the difference between daytime and nighttime traffic.
#
# We will explore the fourth point further.

# #### 2.1 Traffic volume : Day vs Night

# In[8]:


# Transforming the date_time column to datetime
traffic["date_time"] = pd.to_datetime(traffic["date_time"])


# In[9]:


# Isolating the daytime (7:00 to 18:59) and nighttime (19:00 to 6:59) data.
# .copy() gives independent frames, so adding columns later does not trigger SettingWithCopyWarning.
day_traffic = traffic[(traffic["date_time"].dt.hour >= 7) & (traffic["date_time"].dt.hour < 19)].copy()
night_traffic = traffic[(traffic["date_time"].dt.hour < 7) | (traffic["date_time"].dt.hour >= 19)].copy()


# Looking into the stats for the daytime and nighttime traffic data:

# In[10]:


day_traffic["traffic_volume"].describe()


# In[11]:


night_traffic["traffic_volume"].describe()


# In[12]:


# Plotting the histograms for daytime and nighttime traffic
plt.figure(figsize = (10 , 5))

plt.subplot(1 , 2 , 1)
plt.hist(day_traffic["traffic_volume"])
plt.title("Traffic Volume : Day")
plt.xlabel("Traffic Volume")
plt.ylabel("Count")
plt.ylim(0 , 8000)
plt.xlim(0 , 7500)

plt.subplot(1 , 2 , 2)
plt.hist(night_traffic["traffic_volume"])
plt.title("Traffic Volume : Night")
plt.xlabel("Traffic Volume")
plt.ylabel("Count")
plt.ylim(0 , 8000)
plt.xlim(0 , 7500)

plt.show()


# - Conclusion:
# 1. There is a clear difference between daytime and nighttime traffic volume.
# 2. The average traffic volume is 4762 during the day and 1785 during the night.
# 3. The histograms confirm the above two points.
# 4. The daytime distribution roughly resembles a normal distribution, while the nighttime distribution is right-skewed, with most values at the low end. Neither resemblance is perfect.

# Since our goal is to identify the causes/indicators of heavy traffic, from here on we can focus solely on the daytime traffic data. The traffic volume during the night is already low.
#
# Also, it is possible that there is higher traffic on the road on certain days, at certain times of the day or in certain months.
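# The day/night split in section 2.1 uses a half-open hour window: hours 7 through 18 count as day and everything else as night. The following is a minimal sketch of that rule on a small synthetic frame, so the boundary behaviour can be checked by eye. The `sample` frame and the `is_daytime` helper are hypothetical names used only for illustration; they are not part of the original analysis.

# In[ ]:


import pandas as pd

# Synthetic timestamps placed on the window boundaries (illustrative data only)
sample = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2016-01-01 06:00:00",
        "2016-01-01 07:00:00",
        "2016-01-01 18:00:00",
        "2016-01-01 19:00:00",
    ]),
    "traffic_volume": [900, 5200, 4600, 2800],
})


def is_daytime(timestamps, start_hour=7, end_hour=19):
    """Return a boolean Series that is True for hours in [start_hour, end_hour)."""
    hours = timestamps.dt.hour
    return (hours >= start_hour) & (hours < end_hour)


day_mask = is_daytime(sample["date_time"])
sample_day = sample[day_mask].copy()      # hours 7 and 18 land here
sample_night = sample[~day_mask].copy()   # hours 6 and 19 land here


# With this rule, 7:00 is the first daytime hour and 19:00 is the first nighttime hour, which matches the masks used to build `day_traffic` and `night_traffic`.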
#
# - Now, we will explore the daytime traffic volume by:
#     - Month
#     - Day of the week
#     - Time of the day

# #### 2.2 Traffic volume by month:

# We will now look into the traffic for each month by adding a new column "month" to the day_traffic data.

# In[13]:


day_traffic["month"] = day_traffic["date_time"].dt.month


# In[14]:


# Grouping the data by month and calculating the monthly average.
# Selecting the column before calling mean() avoids averaging the non-numeric columns.
avg_by_month = day_traffic.groupby("month")["traffic_volume"].mean()
avg_by_month


# In[15]:


# Plotting the line plot for the above result
plt.plot(avg_by_month)
plt.xlabel("Month")
plt.ylabel("Average Traffic Volume")
plt.xticks(range(1, 13))
plt.title("Monthly Average Traffic Volume")
plt.show()


# - Conclusion from the line graph above:
# 1. The average traffic volume ranges from nearly 4300 to 4900.
# 2. December, January and July have the lowest average traffic.
# 3. March through June and August through October have high average traffic.

# #### 2.3 Traffic volume by day of the week:

# In[16]:


# Adding a new column "dayofweek" to the daytime traffic dataset:
day_traffic["dayofweek"] = day_traffic["date_time"].dt.dayofweek


# In[17]:


# Calculating the average traffic by day of the week, using the 'dayofweek' column
avg_traffic_bydayofweek = day_traffic.groupby("dayofweek")["traffic_volume"].mean()
avg_traffic_bydayofweek # 0 = Monday and 6 = Sunday


# In[18]:


# Plotting the line plot for the above result
plt.plot(avg_traffic_bydayofweek)
plt.xlabel("Day of week (0 = Monday, 6 = Sunday)")
plt.ylabel("Average Traffic Volume")
plt.title("Average Traffic Volume : by day of week")
plt.show()


# - Conclusion from the plot above:
#
#     - The average traffic volume on weekdays is much higher than on weekends; the average ranges from nearly 3400 on Sundays to 5300 on Thursdays.

# #### 2.4 Traffic volume by time of the day:

# As seen in the last plot, weekends have much lower average traffic than weekdays. Therefore, we will calculate the hourly averages separately for weekdays and weekends.

# In[19]:


# Adding a new column "hour" to the daytime traffic dataset:
day_traffic["hour"] = day_traffic["date_time"].dt.hour


# In[20]:


# Separating the business day (Monday to Friday) and weekend data:
business_days_traffic = day_traffic.loc[day_traffic["dayofweek"] <= 4]
weekends_traffic = day_traffic.loc[day_traffic["dayofweek"] > 4]


# In[21]:


# Calculating average traffic by hour for business days:
hourly_avg_traffic_business_days = business_days_traffic.groupby("hour")["traffic_volume"].mean()
hourly_avg_traffic_business_days


# In[22]:


# Calculating average traffic by hour for weekends:
hourly_avg_traffic_weekends = weekends_traffic.groupby("hour")["traffic_volume"].mean()
hourly_avg_traffic_weekends


# In[23]:


# Plotting the line plots
plt.figure(figsize = (10 , 5))

plt.subplot(1 , 2 , 1)
plt.plot(hourly_avg_traffic_business_days)
plt.title("Hourly average traffic : Business days")
plt.ylim(0 , 6500)

plt.subplot(1 , 2 , 2)
plt.plot(hourly_avg_traffic_weekends)
plt.title("Hourly average traffic : weekends")
plt.ylim(0 , 6500)

plt.show()


# - Conclusion from the plots above:
#     - Clearly, traffic is much higher on business days than on weekends.
#     - The rush hours on business days are in the morning before 10 am and then again later in the day from 3 pm till after 5 pm.
#     - On weekends, traffic rises till around 12 pm, then remains nearly constant till 5 pm and starts decreasing after that.

# - So far, we have looked into the time indicators of traffic volume and have reached the following conclusions:
#
#     - The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
#     - The traffic is usually heavier on business days compared to weekends.
#     - On business days, the rush hours are around 7 and 16.

# #### 2.5 Traffic Volume by weather conditions:

# Now, we will look into the weather indicators of heavy traffic. We will start by looking into the correlation between traffic_volume and the numeric weather-related columns.

# In[24]:


# Restricting to numeric columns first, since text and datetime columns cannot be correlated
day_traffic.select_dtypes("number").corr()["traffic_volume"]


# In[25]:


# Plotting the scatter plot between temp and traffic_volume
plt.scatter(day_traffic["traffic_volume"] , day_traffic["temp"])
plt.xlabel("Traffic Volume")
plt.ylabel("Temp")
plt.show()


# - From the table above we can see that there is not much correlation between traffic and the weather conditions (numeric columns only). The strongest correlation is with the temp column, and even that is weak.

# Now, we will look into the non-numeric (categorical) weather indicators for any relation with the traffic volume:

# In[26]:


# Calculating the average traffic_volume for each category in the 'weather_main' column
avg_by_weather_main = day_traffic.groupby('weather_main')["traffic_volume"].mean()
avg_by_weather_main


# In[27]:


# Plotting the horizontal bar plot
plt.barh(avg_by_weather_main.index , avg_by_weather_main.values)
plt.title("Traffic Volume by weather condition")
plt.show()


# In[28]:


# Calculating the average traffic_volume for each category in the 'weather_description' column
avg_by_weather_description = day_traffic.groupby('weather_description')["traffic_volume"].mean()
avg_by_weather_description


# In[29]:


# Plotting the horizontal bar plot
plt.figure(figsize = (8,12))
plt.barh(avg_by_weather_description.index , avg_by_weather_description.values)
plt.title("Traffic Volume by weather description")
plt.show()


# ### 3. Final Conclusions:

# 1. Traffic volume during the day is much higher than at night.
# 2. Moreover, daytime traffic varies hugely between business days and weekends.
# 3. The rush hours on business days are in the morning before 10 am and then in the evening from 4 pm till after 5 pm.
# 4. Overall, the traffic volume does not vary much with the weather conditions, but days with thunderstorms or snow have slightly higher traffic.
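
# The weather averages in section 2.5 are group means, and a mean computed from only a handful of rows can be misleading. As a minimal sketch, assuming a small synthetic frame in place of `day_traffic`, the group size can be reported next to each mean before reading too much into a category. The `toy` frame and the `min_rows` threshold are hypothetical and chosen only for illustration.

# In[ ]:


import pandas as pd

# Synthetic stand-in for the daytime data (illustrative values only)
toy = pd.DataFrame({
    "weather_main": ["Clear", "Clear", "Clear", "Rain", "Rain", "Squall"],
    "traffic_volume": [4800, 5000, 4600, 4700, 4900, 5100],
})

# Mean and number of rows per weather category, side by side
summary = toy.groupby("weather_main")["traffic_volume"].agg(["mean", "count"])

# Keep only categories backed by at least min_rows observations
min_rows = 2
reliable = summary[summary["count"] >= min_rows]
reliable


# Applied to the real data, the same `agg(["mean", "count"])` call on `weather_main` or `weather_description` would show how many hourly records stand behind each bar in the plots above.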