#!/usr/bin/env python
# coding: utf-8

# # Determining Heavy Traffic Indicators on I-94 Interstate Highway
#
# The goal of this data analysis project is to determine the indicators of heavy traffic on the [I-94 Interstate highway](https://en.wikipedia.org/wiki/Interstate_94). The indicators can be weather type, day of the week, time of the year, etc.
#
# We will analyze a dataset of westbound traffic (cars moving from east to west) on the I-94 Interstate highway, recorded hourly by the station. You can download the dataset using this [link](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume).

# ## The I-94 Dataset
#
# Let's read and examine our dataset to find out more information.
#
# - The first cell imports all the libraries needed for this project.

# In[1]:


import pandas as pd
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')


# In[2]:


# reading the dataset
traffic_data = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic_data.info()


# - The dataset contains 9 columns and 48204 rows with zero null values.
# - The `date_time` column is a string object instead of a datetime object.
#
# Let's examine the first and last five rows.

# In[3]:


# exploring the first five rows
traffic_data.head()


# In[4]:


# exploring the last five rows
traffic_data.tail()


# ## Analyzing Traffic Volume
#
# To understand the traffic volume on the highway, we will visualize the `traffic_volume` column using a histogram. Then we will check the summary statistics of the same column.

# In[5]:


traffic_data['traffic_volume'].plot.hist()


# In[6]:


traffic_data['traffic_volume'].describe()


# - In 25% of the hourly records, 1,193 cars or fewer passed the station. This could be during the night, but our analysis will tell us more.
# - In 75% of the hourly records, 4,933 cars or fewer passed the station. That threshold is about four times the one above. It seems that the time of day influences the traffic volume. Let's find out!
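
# As a hedged illustration of how the quartiles reported by `describe()` can be read, the sketch below computes them directly on a small made-up Series. The names `quartile_summary` and `demo_volumes` are hypothetical, and the numbers are not taken from the I-94 dataset.

# In[ ]:


import pandas as pd


def quartile_summary(volumes):
    """Return the 25th, 50th and 75th percentiles of a Series of hourly counts."""
    return volumes.quantile([0.25, 0.5, 0.75])


# made-up hourly counts, for illustration only
demo_volumes = pd.Series([100, 200, 300, 400, 500])
demo_quartiles = quartile_summary(demo_volumes)
print(demo_quartiles)
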
#
# ### Traffic Volume: Day vs. Night
#
# Let's find out whether the time of day has an effect on the traffic volume.
# - Remember that `date_time` is a string, so we will first convert it to a datetime object.
# - Then we will divide the dataset into two parts:
#     - Daytime data: hours from 7 a.m. to 7 p.m. (12 hours)
#     - Nighttime data: hours from 7 p.m. to 7 a.m. (12 hours)

# In[7]:


# converting the date_time column from string to a datetime object
traffic_data['date_time'] = pd.to_datetime(traffic_data['date_time'])


# In[8]:


# isolating daytime and nighttime dataframes from traffic_data
hour_of_day = traffic_data['date_time'].dt.hour
daytime = traffic_data[(hour_of_day >= 7) & (hour_of_day < 19)].copy()
nighttime = traffic_data[(hour_of_day >= 19) | (hour_of_day < 7)].copy()


# Now let's compare the traffic volume of day and night by plotting a histogram of each.

# In[9]:


plt.figure(figsize=(15,5)) # setting the grid chart size

plt.subplot(1,2,1)
daytime['traffic_volume'].plot.hist()
plt.title('Daytime Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylim([0,8000]) # making sure both charts use the same axis limits
plt.xlim([0,7500])

plt.subplot(1,2,2)
nighttime['traffic_volume'].plot.hist()
plt.title('Nighttime Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylim([0,8000]) # making sure both charts use the same axis limits
plt.xlim([0,7500])

plt.show()


# - During the day, traffic volume is high most of the time (4,500 - 5,500 cars passing the station).
# - At night, the traffic is rather light, with only about 1,000 cars passing the station.
#
# Let's check the summary statistics of both daytime and nighttime traffic volume to see if we can find the same pattern.
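
# The day/night split and the side-by-side summary can also be sketched as a small helper. This is only an illustration on made-up data: `split_day_night`, `demo_day_frame` and `demo_summary` are hypothetical names, and the volumes (5000 by day, 1000 by night) are invented to mimic the pattern in the histograms.

# In[ ]:


import pandas as pd


def split_day_night(frame, start_hour=7, end_hour=19):
    """Split rows into daytime [start_hour, end_hour) and nighttime (the rest)."""
    hours = frame['date_time'].dt.hour
    is_day = (hours >= start_hour) & (hours < end_hour)
    return frame[is_day].copy(), frame[~is_day].copy()


# one made-up day of hourly records: 5000 cars in daytime hours, 1000 otherwise
demo_times = pd.date_range('2016-01-01 00:00', '2016-01-01 23:00', periods=24)
demo_day_frame = pd.DataFrame({
    'date_time': demo_times,
    'traffic_volume': [5000 if 7 <= t.hour < 19 else 1000 for t in demo_times],
})

demo_day, demo_night = split_day_night(demo_day_frame)

# placing both summaries in one table makes the comparison easier to read
demo_summary = pd.DataFrame({
    'day': demo_day['traffic_volume'].describe(),
    'night': demo_night['traffic_volume'].describe(),
})
print(demo_summary)
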
# In[10]:


# finding summary statistics of daytime
daytime['traffic_volume'].describe()


# In[11]:


# finding summary statistics of nighttime
nighttime['traffic_volume'].describe()


# - Our summary statistics show the same pattern as the histograms for both day and night.
#
# Remember that our goal is to find heavy traffic indicators, and we just found that there is not much traffic at night. So going forward, we will drop the nighttime data and focus on daytime data.

# ### Time Indicators I
#
# Another indicator of traffic volume is time. There may be more people and cars on the road in certain months, on certain days of the week or at certain times of day, since some events recur at the same times each year.
#
# Let's find out the traffic volume by month, day of the week and time of the day.
#
# We will start with traffic volume by month. We will find the average traffic volume in each month.

# In[12]:


daytime['month'] = daytime['date_time'].dt.month # adding month column to our dataset
by_month = daytime.groupby('month').mean(numeric_only=True) # grouping by month and averaging the numeric columns
by_month['traffic_volume']


# In[13]:


by_month['traffic_volume'].plot.line()
plt.title('Average Traffic per Month')
plt.show()


# - The traffic volume is low from November to February and high from March to October, with one exception: July, which has low traffic volume.
#
# Let's check whether July has low traffic volume in every year.

# In[14]:


daytime['year'] = daytime['date_time'].dt.year # adding year column to our dataset
july_only = daytime[daytime['month'] == 7].copy()
july_only.groupby('year')['traffic_volume'].mean().plot.line() # grouping by year and averaging the traffic volume


# - July has high traffic volume in every year except 2016.
# This might be the result of road construction, according to this [article](https://www.crainsdetroit.com/article/20160728/NEWS/160729841/weekend-construction-i-96-us-23-bridge-work-i-94-lane-closures-i-696). A screenshot of the headline is included for readers without a subscription.
# ![Screenshot%20%28308%29.png](attachment:Screenshot%20%28308%29.png)

# ### Time Indicators II
#
# Let's find out traffic volume by day of the week.

# In[15]:


daytime['day_of_week'] = daytime['date_time'].dt.dayofweek # adding day of the week column to our dataset
by_day = daytime.groupby('day_of_week').mean(numeric_only=True) # grouping by day of the week and averaging the numeric columns
by_day['traffic_volume'] # 0 is Monday, 6 is Sunday


# In[16]:


by_day['traffic_volume'].plot.line()
plt.title('Average Traffic Volume by Day')
plt.show()


# - The traffic volume is highest on working days and lowest on weekends, with a sharp decrease from Friday to Saturday.

# ### Time Indicators III
#
# Next, we will find traffic volume by hour of the day. As the previous graph shows, the weekend values would skew our results, so we will compute the averages separately for working days and weekends.
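
# One way to keep weekend values from skewing the hourly averages is to group by hour and by day type at the same time. The sketch below shows the idea on made-up data. It is an illustration under stated assumptions rather than the project's own method: `hourly_profile_by_day_type`, `demo_hourly` and `demo_profile` are hypothetical names, and the volumes are invented.

# In[ ]:


import pandas as pd


def hourly_profile_by_day_type(frame):
    """Average traffic volume per hour, with one column per day type."""
    hour = frame['date_time'].dt.hour.rename('hour')
    # dayofweek: 0 is Monday, 6 is Sunday, so 5 and 6 are the weekend
    day_type = frame['date_time'].dt.dayofweek.ge(5).map(
        {False: 'working_day', True: 'weekend'}
    ).rename('day_type')
    means = frame.groupby([hour, day_type])['traffic_volume'].mean()
    return means.unstack('day_type')


# made-up records, all at 8 a.m.
demo_hourly = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2016-01-04 08:00',  # Monday
        '2016-01-11 08:00',  # Monday
        '2016-01-09 08:00',  # Saturday
        '2016-01-10 08:00',  # Sunday
    ]),
    'traffic_volume': [6000, 5000, 2000, 3000],
})
demo_profile = hourly_profile_by_day_type(demo_hourly)
print(demo_profile)
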
# In[17]:


daytime['hour'] = daytime['date_time'].dt.hour
working_days = daytime[daytime['day_of_week'] <= 4].copy() # 4 is Friday
by_work_hour = working_days.groupby('hour').mean(numeric_only=True)
weekend = daytime[daytime['day_of_week'] >= 5].copy() # 5 is Saturday
by_weekend_hour = weekend.groupby('hour').mean(numeric_only=True)

print(by_work_hour['traffic_volume'])
print(by_weekend_hour['traffic_volume'])


# In[18]:


plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
by_work_hour['traffic_volume'].plot.line()
plt.title('Working Day Traffic Volume by Hour')
plt.ylabel('Traffic Volume')
plt.xlabel('Hour')
plt.ylim([0,7000]) # making sure both charts use the same axis limits

plt.subplot(1,2,2)
by_weekend_hour['traffic_volume'].plot.line()
plt.title('Weekend Traffic Volume by Hour')
plt.ylabel('Traffic Volume')
plt.xlabel('Hour')
plt.ylim([0,7000]) # making sure both charts use the same axis limits

plt.show()


# - At each hour of the day, the traffic volume is higher on working days than on weekends.
# - The rush hours on working days are 8 and 16, with over 6,000 cars each. This is reasonable considering that people travel to work and return home during those hours respectively.

# In summarizing traffic indicators, we found the following:
# - Traffic volume is higher during March-October (with July the only exception) and lower during November-February.
# - Traffic volume is highest on working days and lowest on weekends.
# - The rush hours on working days are 8 and 16, perhaps because people are going to work and returning home during those hours respectively.

# ### Weather Indicators
# Let's continue exploring the dataset to find out whether weather also affects the traffic volume. We will find the correlation between `traffic_volume` and the numerical weather columns `temp`, `rain_1h`, `snow_1h` and `clouds_all`.
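
# The correlations described above can also be computed in a single call with `DataFrame.corrwith`. The sketch below is a hedged illustration on made-up data: `weather_correlations` and `demo_weather` are hypothetical names, and the values are invented so that the expected correlations are easy to check by eye.

# In[ ]:


import pandas as pd


def weather_correlations(frame, weather_columns, target='traffic_volume'):
    """Pearson correlation of each weather column with the target column."""
    return frame[weather_columns].corrwith(frame[target]).round(3)


# made-up values: temp rises with traffic, clouds_all falls with traffic
demo_weather = pd.DataFrame({
    'temp': [270.0, 280.0, 290.0, 300.0],
    'clouds_all': [90, 60, 30, 0],
    'traffic_volume': [1000, 2000, 3000, 4000],
})
demo_corr = weather_correlations(demo_weather, ['temp', 'clouds_all'])
print(demo_corr)
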
# In[19]:


# correlating within the daytime data only
temp_corr = round(daytime['traffic_volume'].corr(daytime['temp']),3)
rain_corr = round(daytime['traffic_volume'].corr(daytime['rain_1h']),3)
snow_corr = round(daytime['traffic_volume'].corr(daytime['snow_1h']),3)
cloud_corr = round(daytime['traffic_volume'].corr(daytime['clouds_all']),3)

print('Temperature and Traffic Volume Correlation:', temp_corr)
print('Rain and Traffic Volume Correlation:', rain_corr)
print('Snow and Traffic Volume Correlation:', snow_corr)
print('Cloud and Traffic Volume Correlation:', cloud_corr)


# - The only weather column with a noticeable correlation with traffic volume is `temp`. But with a correlation of only about 0.12, it is difficult to say that any weather condition is strongly correlated with traffic volume.
#
# Let's draw a scatter plot to see the correlation between temperature and traffic volume.

# In[20]:


plt.scatter(x = daytime['temp'], y = daytime['traffic_volume'])
plt.title('Temperature vs. Traffic Volume')
plt.xlabel('Temperature')
plt.ylabel('Traffic Volume')
plt.show()


# - The scatter plot also shows that there is no strong correlation between temperature and traffic volume.
#
# Let's check the non-numerical weather columns, `weather_main` and `weather_description`, to see if we can find any patterns.

# In[21]:


by_weather_main = daytime.groupby('weather_main').mean(numeric_only=True)
by_weather_main['traffic_volume'].sort_values().plot.barh()
plt.show()


# - There is no weather type where the average traffic volume is 5,000 or more. No weather type seems to indicate heavy traffic.
#
# Let's check the more granular `weather_description` column.

# In[22]:


by_weather_description = daytime.groupby('weather_description').mean(numeric_only=True)
by_weather_description['traffic_volume'].sort_values().plot.barh(figsize=(5,10))
plt.show()


# - The following weather descriptions have an average traffic volume of 5,000 or more:
#     - Shower snow
#     - Light rain and snow
#     - Proximity thunderstorm with drizzle
# - It seems that as the weather gets a little rough, people bring out their cars instead of using bikes.
#
# ## Conclusion
#
# In this project, we analyzed the I-94 Interstate Highway dataset to identify heavy traffic indicators on the highway.
# We made the following findings:
# - The traffic volume is higher during the day than at night.
# - The traffic volume is high during the warm months (March-October) and low during the cold months (November-February).
# - The traffic volume is higher on working days (Monday-Friday) than on weekends. The rush hours on working days are 8 and 16, when people are going to work and returning home respectively.
# - Shower snow, light rain and snow, and proximity thunderstorm with drizzle are the weather descriptions with the highest traffic volume. Perhaps as the weather worsens, people resort to using their cars instead of bikes.