#!/usr/bin/env python
# coding: utf-8

# # Finding Heavy Traffic Indicators
# 
# In this project we will be examining a dataset about the westbound traffic on the [I-94 Interstate Highway](https://en.wikipedia.org/wiki/Interstate_94). The dataset comes from the UCI Machine Learning Repository. You can download the dataset [here](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume).
# 
# The goal of this analysis is to determine a few indicators of heavy traffic on the I-94. These indicators can be weather type, time of day, time of week, etc. For instance, we may find out that the traffic is usually heavier in the summer or when it snows.

# In[1]:


#import our libraries
import pandas as pd
import matplotlib.pyplot as plt
from jupyterthemes import jtplot
jtplot.style(fscale=1.4,)
get_ipython().run_line_magic('matplotlib', 'inline')

#read in our data and assign to DataFrame 'traffic'
traffic = pd.read_csv(r'my_datasets\i94_traffic.csv')


# In[2]:


#examine the first five rows of data
traffic.head()


# In[3]:


#examine the last five rows of data
traffic.tail()


# ## Analyzing the data

# In[4]:


#find more information about our dataset
traffic.info()


# As you can see, we have 48,000+ rows of data. Each row contains 9 columns of type int, float, or object (string). According to the dataset documentation, a station located approximately midway between Minneapolis and Saint Paul recorded the traffic data. Also, the station only records westbound traffic (cars moving from east to west). This means that the results of our analysis will be about the westbound traffic in the proximity of that station. In other words, we should avoid generalizing our results to the entire I-94 highway.

# In[5]:


traffic['traffic_volume'].plot.hist()
plt.show()


# In[6]:


traffic['traffic_volume'].describe()


# Looking at the histogram above, as well as the summary statistics of the traffic_volume column, we can make some simple observations. First, the histogram has two big spikes, then slowly falls off. This could be the big rush of traffic in the morning and afternoon, when people are travelling to and from work.
# 
# Next, we can see that the I-94 had, on average, about 3,260 cars per hour, with a maximum of 7,280, probably during rush hour. About 25% of the time there were 1,193 cars or fewer on the road, while another 25% of the time there were 4,933 or more, over four times that amount. This gives us an interesting point to think about: daytime data vs. nighttime data.

# ## Day vs. Night
# 
# We will proceed by dividing our dataset into two parts:
#  * Daytime data: hours from 7 a.m. to 7 p.m. (12 hours)
#  * Nighttime data: hours from 7 p.m. to 7 a.m. (12 hours)
#  
# While this isn't a perfect criterion, it gives us a good starting point.
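
# As a minimal sketch of how this split behaves at the boundaries, the cell below applies the same 7-to-19 rule to a hypothetical one-day toy table (`toy`, `toy_day`, and `toy_night` are illustrative names, not part of the dataset). Hours 7 through 18 should land in the daytime half and everything else in the nighttime half.

# In[ ]:


import pandas as pd

#hypothetical toy data: one timestamp per hour of a single day
toy_hours = pd.to_datetime(['2016-01-01 {:02d}:00'.format(h) for h in range(24)])
toy = pd.DataFrame({'date_time': toy_hours})

#between() is inclusive on both ends, so this keeps hours 7, 8, ..., 18
is_day = toy['date_time'].dt.hour.between(7, 18)
toy_day = toy[is_day]
toy_night = toy[~is_day]

print('Toy daytime hours:', toy_day['date_time'].dt.hour.tolist())
print('Toy nighttime hours:', toy_night['date_time'].dt.hour.tolist())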

# In[7]:


#converts our date-time data to the proper data type
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
traffic['date_time'].head()


# In[8]:


#split up our data into daytime (hours 7-18)
#copy after filtering so new columns can be added without a SettingWithCopyWarning
daytime = traffic[(traffic['date_time'].dt.hour >= 7) &
                  (traffic['date_time'].dt.hour < 19)].copy()
#and nighttime (hours 19-23 and 0-6)
nighttime = traffic[(traffic['date_time'].dt.hour >= 19) |
                    (traffic['date_time'].dt.hour < 7)].copy()

#examine the shape
print('Daytime:',daytime.shape)
print('Nighttime:',nighttime.shape)


# As you can see, the daytime and nighttime subsets do not contain the same number of rows, even though each covers 12 hours. This could be caused by missing records for some hours.
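
# One hedged way to investigate an imbalance like this is to count how many rows exist for each hour of the day. The cell below is only a sketch on hypothetical toy timestamps (`toy_log` is not the real dataset); it shows how a duplicated hour and a missing hour would show up in such a count.

# In[ ]:


import pandas as pd

#hypothetical toy log: hour 8 appears twice, hour 9 is missing
toy_times = pd.to_datetime(['2016-01-01 07:00', '2016-01-01 08:00',
                            '2016-01-01 08:00', '2016-01-01 10:00'])
toy_log = pd.DataFrame({'date_time': toy_times})

#number of rows recorded for each hour of the day
rows_per_hour = toy_log['date_time'].dt.hour.value_counts().sort_index()
print(rows_per_hour)

#hours in the expected range that have no rows at all
missing_hours = [h for h in range(7, 11) if h not in rows_per_hour.index]
print('Hours with no rows:', missing_hours)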

# ### Traffic Volume: Day vs. Night

# In[9]:


#overall size of grid chart
plt.figure(figsize=(12, 5))

#daytime plot
plt.subplot(1, 2, 1)
plt.hist(daytime['traffic_volume'])
plt.xlim(-100,7500) #horizontal limit
plt.ylim(0,8000)    #vertical limit
plt.title('Traffic Volume: Day')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')

#nighttime plot
plt.subplot(1, 2, 2)
plt.hist(nighttime['traffic_volume'])
plt.xlim(-100,7500) #horizontal limit
plt.ylim(0,8000)    #vertical limit
plt.title('Traffic Volume: Night')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')

plt.show()


# In[10]:


daytime['traffic_volume'].describe()


# In[11]:


nighttime['traffic_volume'].describe()


# We can see from our histograms above, as well as the results of describe(), that on average the daytime has significantly more cars heading westbound than the nighttime. The average number of cars per hour during the day is 4,762, while at night it's 1,785. This makes sense, as more people probably work during the day than at night. During the day, traffic exceeds 5,559 cars 25% of the time, while at night the same upper quartile is only 2,819.
# 
# The daytime histogram is left skewed, as most of the traffic volume values are rather high. The nighttime histogram is right skewed, as most of the traffic volume values are rather low. For example, during the day, 75% of the time there are more than 4,252 cars, while at night, 75% of the time there are fewer than 2,819 cars.
# 
# Since our goal is to help identify heavy traffic indicators, we should only focus on times that have heavy traffic. For this purpose, we will only be looking at the daytime data.
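
# To make the skew terminology above concrete, here is a small sketch using `Series.skew()` on two made-up samples (the numbers are invented for illustration and are not taken from the traffic data). A left-skewed sample has a long tail of low values and a negative skewness; a right-skewed sample has a long tail of high values and a positive skewness.

# In[ ]:


import pandas as pd

#mostly high values with one low outlier: long tail on the left
left_skewed = pd.Series([1, 5, 6, 6, 7, 7, 7])
#mirror image: mostly low values with one high outlier
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 7])

print('Left-skewed sample skewness:', left_skewed.skew())
print('Right-skewed sample skewness:', right_skewed.skew())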

# ### Time Indicators
# 
# One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain time of the day.
# 
# We're going to look at a few line plots showing how the traffic volume changed according to the following parameters:
# 
# * Month
# * Day of the week
# * Time of day

# In[12]:


#create a new column for the month when the measurement was taken
daytime['month'] = daytime['date_time'].dt.month

#group the dataset by the new month column, and calculate the mean
by_month = daytime.groupby('month').mean(numeric_only=True) #skip text columns

#display the average traffic volume per month
by_month['traffic_volume']


# In[13]:


#create plot line graph of above data
by_month['traffic_volume'].plot.line()
plt.show()


# We see an interesting graph here. Traffic decreases quite a bit in the winter months, namely November through February. July shows a very similar dip. What is so special about July? Let's examine it further.

# In[14]:


#create a new column for the year when the measurement was taken
daytime['year'] = daytime['date_time'].dt.year

#only take the month of july
july = daytime[daytime['month'] == 7]

#take all the julys and find the mean traffic volume
july_mean = july.groupby('year').mean(numeric_only=True)

#plot
july_mean['traffic_volume'].plot.line()
plt.show()


# We can see from our graph that the traffic in July is actually quite normal, except for 2016. This might be due to a natural disaster or a planned construction phase that limited or closed off part of the I-94. A quick Google search shows that in July 2016 there was also an incident involving a group of protesters that caused the highway to be shut down. However, since traffic was down throughout the entire month of July, the cause is more likely something longer term, like construction.
# 
# ### Time Indicators (Part 2)
# We'll now continue with building line plots for another time unit: day of the week.

# In[15]:


#create a new column for the day of the week when the measurement was taken
daytime['dayofweek'] = daytime['date_time'].dt.dayofweek

#group the dataset by the new dayofweek column, and calculate the mean
by_dayofweek = daytime.groupby('dayofweek').mean(numeric_only=True)

#display the average traffic volume for each day of the week
#0 - Monday, 6 - Sunday
by_dayofweek['traffic_volume']


# In[16]:


#create plot line graph of above data
by_dayofweek['traffic_volume'].plot.line()
plt.show()


# Here we can see a nice pattern that lines up with what we'd expect. The I-94 had relatively high daytime traffic Monday-Friday, and then significantly less traffic on the weekends. People like to rest and stay in on weekends, which could be one reason for the significant drop.
# 
# ### Time Indicators (Part 3)
# Now we're going to examine our data even further by generating a line plot for the time of day. The weekends, however, will drag down the average values, so we're going to look at the averages separately. To do that, we'll start by splitting the data based on the day type: business day or weekend.

# In[17]:


#create a new column for the hour of the day when the measurement was taken
daytime['hour'] = daytime['date_time'].dt.hour

#split into business days
business_days = daytime[daytime['dayofweek'] <= 4].copy() # 4 == Friday
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
print(by_hour_business['traffic_volume'])

#split into weekends
weekend = daytime[daytime['dayofweek'] >= 5].copy() # 5 == Saturday
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)
print(by_hour_weekend['traffic_volume'])


# In[18]:


#overall size of grid chart
plt.figure(figsize=(12, 5))

#business day plot
plt.subplot(1, 2, 1)
plt.plot(by_hour_business['traffic_volume'])
plt.xlim(7,18) #horizontal limit
plt.ylim(1500,6500)    #vertical limit
plt.title('Traffic by Hour: Business Days')
plt.xlabel('Hour')
plt.ylabel('Volume')

#weekend plot
plt.subplot(1, 2, 2)
plt.plot(by_hour_weekend['traffic_volume'])
plt.xlim(7,18) #horizontal limit
plt.ylim(1500,6500)    #vertical limit
plt.title('Traffic by Hour: Weekends')
plt.xlabel('Hour')
plt.ylabel('Volume')

plt.show()


# From the graphs above, we can see that on business days there are two big spikes, around 7am and 4pm (the rush hours). These are the hours when most people are travelling to and from work. Even outside these hours, business-day traffic is higher than weekend traffic at every single hour. It's also interesting that weekend traffic increases very gradually during the early hours of the day, notably from 7 to 9am, most likely because people are still waking up and relaxing.
# 
# To summarize, we found a few time-related indicators of heavy traffic:
# 
#  * The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
#  * The traffic is usually heavier on business days compared to weekends.
#  * On business days, the rush hours are around 7am and 4pm.
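
# The rush hours above were read off the line plots. As a sketch of how such peaks could also be pulled out programmatically, the cell below uses `Series.nlargest()` on a hypothetical hourly-average series (`toy_by_hour` and its values are invented for illustration). On real data, neighbouring hours of the same peak can crowd out a second peak, so the plot remains the more reliable guide.

# In[ ]:


import pandas as pd

#hypothetical average traffic volume indexed by hour of day
toy_by_hour = pd.Series({7: 6000, 8: 5500, 12: 4800, 16: 6200, 18: 4500})

#the two hours with the highest average volume, in chronological order
busiest_hours = sorted(toy_by_hour.nlargest(2).index.tolist())
print('Two busiest hours in the toy series:', busiest_hours)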

# ### Traffic vs. Weather
# 
# Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: temp, rain_1h, snow_1h, clouds_all, weather_main, weather_description.
# 
# A few of these columns are numerical so let's start by looking up their correlation values with traffic_volume.
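
# For reference, the Pearson correlation coefficient is the covariance of two variables divided by the product of their standard deviations, and it ranges from -1 to +1. The cell below is a minimal sketch on made-up numbers (`x_toy` and `y_toy` are hypothetical) that computes it by hand and compares the result with `Series.corr()`, which uses the Pearson method by default.

# In[ ]:


import pandas as pd

x_toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

#deviations of each value from its column mean
dx = x_toy - x_toy.mean()
dy = y_toy - y_toy.mean()

#Pearson r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5
r_pandas = x_toy.corr(y_toy)

print('manual:', r_manual)
print('pandas:', r_pandas)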

# In[19]:


#find the Pearson r between traffic_volume and every numeric column
corr = daytime.corr(numeric_only=True)['traffic_volume']
corr


# We can see above that the weather-related column with the strongest correlation to our traffic_volume column is temp, with a positive correlation of about 0.13. If we make a scatter plot between traffic_volume and temp, we get the following:

# In[20]:


plt.scatter(daytime['traffic_volume'],daytime['temp'])
plt.ylim(230, 320) # two wrong 0K temperatures mess up the y-axis
plt.xlabel('Traffic Volume')
plt.ylabel('Temperature (K)')
plt.show()


# It's quite clear from our scatter plot that there is no discernible connection between temperature and traffic volume. Let's take a look at the other two weather columns: weather_main and weather_description.

# In[21]:


by_weather_main = daytime.groupby('weather_main').mean(numeric_only=True)
by_weather_description = daytime.groupby('weather_description').mean(numeric_only=True)

by_weather_main['traffic_volume'].plot.barh() # Horizontal bar plot
plt.show()


# It doesn't seem like there's any connection between traffic volume and weather type. The averages all fall in the 4,000-5,000 car range. Let's look at the more granular weather description.

# In[22]:


by_weather_description['traffic_volume'].plot.barh(figsize=(5,11)) # Horizontal bar plot
plt.show()


# It looks like there are three weather descriptions for which the average traffic volume exceeds 5,000:
# * shower snow
# * proximity thunderstorm with drizzle
# * light rain and snow
# 
# It's not clear why these specific weather types coincide with more traffic. Even though they exceed 5,000, they aren't that far ahead of the other weather types. It's bad weather, but not terrible weather.
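
# One possible explanation worth checking (an assumption, not something verified here) is sample size: a weather description that appears in only a handful of rows can have an unusually high or low average by chance. The sketch below uses a hypothetical toy table (`toy_weather`, with invented values) to show how `agg` can report the mean alongside the row count for each description.

# In[ ]:


import pandas as pd

#hypothetical toy data: one common description and one rare one
toy_weather = pd.DataFrame({
    'weather_description': ['sky is clear', 'sky is clear',
                            'sky is clear', 'shower snow'],
    'traffic_volume': [4500, 4700, 4600, 5600],
})

#mean traffic volume and number of rows behind each mean
toy_summary = (toy_weather
               .groupby('weather_description')['traffic_volume']
               .agg(['mean', 'count']))
print(toy_summary)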

# ## Nighttime Analysis
# 
# Now that we've analyzed the daytime, let's look at the nighttime data just to see if there are any interesting points. To help simplify, we will create all the same columns we did for the daytime data. 
# 
# ### Nighttime Time Indicators (Part 1)
# We'll start by adding the month column, and looking at the traffic volume by month.

# In[23]:


#creates a new column in our nighttime data for month
nighttime['month'] = nighttime['date_time'].dt.month

#group the dataset by the new month column, and calculate the mean
by_month_n = nighttime.groupby('month').mean(numeric_only=True)

#display the average traffic volume per month
by_month_n['traffic_volume']


# In[24]:


#create plot line graph of above data
by_month_n['traffic_volume'].plot.line()
plt.show()


# We can see a fairly similar graph to the daytime data: traffic drops off significantly during the winter months of November through February. It peaks in the early summer months at around 1,932 cars. Let's examine the average per year as well.

# In[25]:


#creates a new column in our nighttime data for year
nighttime['year'] = nighttime['date_time'].dt.year

#group the dataset by the new year column, and calculate the mean
by_year_n = nighttime.groupby('year').mean(numeric_only=True)

#display the average traffic volume per year
by_year_n['traffic_volume']


# In[26]:


#create plot line graph of above data
by_year_n['traffic_volume'].plot.line()
plt.show()


# We see something interesting here: there was a decent spike in traffic from 2016 to 2017. It was a bit hard to find anything specific on Google, but it does seem like there were a lot of construction projects occurring in 2017. In 2018 the traffic drops back to a level similar to 2016.

# ### Nighttime Time Indicators (Part 2)
# Let's look at day of the week now.

# In[27]:


#create a new column for the day of the week when the measurement was taken
nighttime['dayofweek'] = nighttime['date_time'].dt.dayofweek

#group the dataset by the new dayofweek column, and calculate the mean
by_dayofweek_n = nighttime.groupby('dayofweek').mean(numeric_only=True)

#display the average traffic volume for each day of the week
#0 - Monday, 6 - Sunday
by_dayofweek_n['traffic_volume']


# In[28]:


#create plot line graph of above data
by_dayofweek_n['traffic_volume'].plot.line()
plt.show()


# This is also a bit interesting: nighttime traffic increases throughout the work week and then drops off sharply on the weekend. Perhaps people work slightly longer as the week goes on, which would push back the time they head home from work?

# ### Nighttime Time Indicators (Part 3)
# Now we're going to look at the traffic per hour, and similar to last time, we'll split the data into business days and weekends.

# In[29]:


#create a new column for the hour of the day when the measurement was taken
nighttime['hour'] = nighttime['date_time'].dt.hour

#splits into two DataFrames to make it easier to plot by hour
nighttime1 = nighttime[nighttime['hour'] < 7].copy()
nighttime2 = nighttime[nighttime['hour'] >= 19].copy()

#next two sections are for the hours of 0-6
#split into business days
business_days_n1 = nighttime1[nighttime1['dayofweek'] <= 4].copy() # 4 == Friday
by_hour_business_n1 = business_days_n1.groupby('hour').mean(numeric_only=True)
print(by_hour_business_n1['traffic_volume'])

#split into weekends
weekend_n1 = nighttime1[nighttime1['dayofweek'] >= 5].copy() # 5 == Saturday
by_hour_weekend_n1 = weekend_n1.groupby('hour').mean(numeric_only=True)
print(by_hour_weekend_n1['traffic_volume'])

#next two sections are for the hours of 19-23
#split into business days
business_days_n2 = nighttime2[nighttime2['dayofweek'] <= 4].copy() # 4 == Friday
by_hour_business_n2 = business_days_n2.groupby('hour').mean(numeric_only=True)
print(by_hour_business_n2['traffic_volume'])

#split into weekends
weekend_n2 = nighttime2[nighttime2['dayofweek'] >= 5].copy() # 5 == Saturday
by_hour_weekend_n2 = weekend_n2.groupby('hour').mean(numeric_only=True)
print(by_hour_weekend_n2['traffic_volume'])


# In[47]:


#create 1x2 grid chart; each graph compares business days with weekends
#overall size of grid chart
plt.figure(figsize=(12, 6))

#TO RECAP:
#by_hour_business_n1, by_hour_weekend_n1 = 0:00 - 7:00
#by_hour_business_n2, by_hour_weekend_n2 = 19:00 - 23:00

#0:00 - 7:00 (both business days and weekends)
plt.subplot(1, 2, 1)
plt.plot(by_hour_business_n1['traffic_volume'],label='Business Day')
plt.plot(by_hour_weekend_n1['traffic_volume'],label='Weekend')
plt.xlim(0,6)       #horizontal limit
plt.ylim(0,5500)    #vertical limit
plt.title('Traffic Volume: 0:00 - 7:00')
plt.xlabel('Hour')
plt.ylabel('Volume')
plt.legend() #shows legend for first graph

#19:00 - 23:00 (both business days and weekends)
plt.subplot(1, 2, 2)
plt.plot(by_hour_business_n2['traffic_volume'],label='Business Day')
plt.plot(by_hour_weekend_n2['traffic_volume'],label='Weekend')
plt.xlim(19,23)     #horizontal limit
plt.ylim(1300,3300) #vertical limit
plt.title('Traffic Volume: 19:00 - 23:00')
plt.xlabel('Hour')
plt.ylabel('Volume')
plt.legend() #shows legend for second graph

plt.show() #displays our graphs


# There are a couple of things we can point out from the above graphs. First, the nighttime traffic on weekends from 0:00 to 7:00 is significantly lower than during the week, presumably because people aren't waking up that early for work. The hours of 4:00 to 6:00 see a big increase during the week, presumably because people are all heading to work around the same time.
# 
# From 19:00 to 23:00, there is only a minor increase in traffic on weekends from about 21:00 to 23:00, presumably because people are still enjoying their weekend activities. We also see a steady decrease during these hours because fewer people are driving after a day of work or activities.

# ### Nighttime Traffic vs. Weather
# We've examined the daytime traffic patterns against the weather. Let's now try the same thing with the nighttime data. Let's first quickly check to see if there are any strong correlations in our dataset.

# In[49]:


corr_n = nighttime.corr(numeric_only=True)['traffic_volume']
corr_n


# We can see that, among the weather-related columns, temperature has the strongest correlation, at +0.09. Let's make a scatter plot and see if there's any discernible connection between the temperature and the traffic volume.

# In[51]:


plt.scatter(nighttime['traffic_volume'],nighttime['temp'])
plt.ylim(230, 320) # two wrong 0K temperatures mess up the y-axis
plt.xlabel('Traffic Volume')
plt.ylabel('Temperature (K)')
plt.show()


# Similar to our daytime data, there isn't really any connection between the two. Now let's examine the other two weather-related columns: weather_main, and weather_description.

# In[52]:


by_weather_main_n = nighttime.groupby('weather_main').mean(numeric_only=True)
by_weather_description_n = nighttime.groupby('weather_description').mean(numeric_only=True)

by_weather_main_n['traffic_volume'].plot.barh() # Horizontal bar plot
plt.show()


# Again, similar to our daytime data, there is nothing too obvious that is a clear cause of heavy traffic. Let's examine the weather descriptions next.

# In[53]:


by_weather_description_n['traffic_volume'].plot.barh(figsize=(5,11)) # Horizontal bar plot
plt.show()


# Here we can see that there are two weather descriptions for which the average traffic volume exceeds 2,500:
# 
# * proximity shower with rain
# * light intensity shower rain
# 
# Compared to the other weather descriptions, these two show the highest traffic volume, though it's not entirely clear why, because neither of those weather conditions is particularly severe. The description with the lowest traffic volume, however, is light rain and snow, even though light rain and light snow on their own each show more than double that volume.

# # Conclusion
# 
# In this project, we tried to find a few indicators of heavy traffic on the I-94 Interstate highway. We managed to find the following types of indicators:
# 
# * Time indicators (daytime)
#     * The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
#     * The traffic is usually heavier on business days compared to the weekends.
#     * On business days, the rush hours are around 7am and 4pm.
# * Weather indicators (daytime)
#     * Shower snow
#     * Light rain and snow
#     * Proximity thunderstorm with drizzle
#    
# * Time indicators (nighttime)
#     * Similar to the daytime data, the traffic is usually heavier in warmer months (March-October) compared to colder months (November-February).
#     * The traffic at night slowly increases throughout the working week, and then drops off on the weekends.
#     * The biggest difference is from 0:00-7:00am on weekends compared to business days. No more than 1,300 cars are on the road during these hours.
#     
# * Weather indicators (nighttime)
#     * Proximity shower with rain
#     * Light intensity shower rain