The goal of this data analysis project is to determine the indicators of heavy traffic on the I-94 Interstate highway. The indicators can be weather type, day of the week, time of the year, etc.
We will be analysing a dataset of westbound traffic (cars moving from east to west) on the I-94 Interstate highway, recorded hourly by a traffic station. You can download the dataset using this link.
Let's read and examine our dataset to find out more information.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# reading the dataset
traffic_data = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
The dataset has 48,204 rows and 9 columns, and no column has missing values. Note that the date_time column is a string object instead of a datetime object. Let's examine the first and last five rows.
# exploring the first five rows
traffic_data.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
# exploring the last five rows
traffic_data.tail()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
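Since date_time is stored as text, one option is to parse it while reading the file. The sketch below uses a tiny in-memory CSV (two rows copied from the table above) rather than the real file, so it runs on its own; `csv_text` and `toy` are illustrative names, not part of the project.

```python
import io

import pandas as pd

# Two rows in the same layout as the real file, kept in memory for the example
csv_text = (
    "date_time,traffic_volume\n"
    "2012-10-02 09:00:00,5545\n"
    "2012-10-02 10:00:00,4516\n"
)

# parse_dates asks read_csv to convert the listed columns to datetime64
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=['date_time'])

print(toy.dtypes)
print(toy['date_time'].dt.hour.tolist())
```

With a parsed column, the `.dt` accessor (hour, month, dayofweek) is available straight away.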
To understand the traffic volume on the highway, we will visualize the traffic_volume column using a histogram. Then we will check the summary statistics of the same column.
traffic_data['traffic_volume'].plot.hist()
<AxesSubplot:ylabel='Frequency'>
traffic_data['traffic_volume'].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
Let's find out whether the time of day has an effect on the traffic volume. The date_time column is a string, so we will first convert it to a datetime object.
# converting the date_time column from string to a datetime object
traffic_data['date_time'] = pd.to_datetime(traffic_data['date_time'])
# isolating daytime and nighttime dataframes from traffic_data
# (filter first, then copy, so columns can be added later without a SettingWithCopyWarning)
hour_of_day = traffic_data['date_time'].dt.hour
daytime = traffic_data[(hour_of_day >= 7) & (hour_of_day < 19)].copy()     # 07:00 to 18:59
nighttime = traffic_data[(hour_of_day >= 19) | (hour_of_day < 7)].copy()   # 19:00 to 06:59
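To see how the hour-based split behaves, here is a minimal sketch on made-up data. The four rows and the names `toy`, `is_day`, `toy_day` and `toy_night` are hypothetical; `Series.between` is used as an alternative way to express "7 <= hour < 19" for whole-number hours.

```python
import pandas as pd

# Toy data with invented volumes, one reading every six hours
toy = pd.DataFrame({
    'date_time': ['2012-10-02 01:00:00', '2012-10-02 07:00:00',
                  '2012-10-02 13:00:00', '2012-10-02 19:00:00'],
    'traffic_volume': [500, 6000, 4900, 3000],
})
toy['date_time'] = pd.to_datetime(toy['date_time'])

# between() is inclusive on both ends by default, so 7..18 means 07:00 to 18:59
is_day = toy['date_time'].dt.hour.between(7, 18)

toy_day = toy[is_day].copy()      # rows at 07:00 and 13:00
toy_night = toy[~is_day].copy()   # rows at 01:00 and 19:00

print(toy_day)
print(toy_night)
```

Negating a single mask guarantees that every row lands in exactly one of the two frames.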
Now let's compare the daytime and nighttime traffic volume by plotting a histogram of each.
plt.figure(figsize=(15,5)) # setting the grid chart size
plt.subplot(1,2,1)
daytime['traffic_volume'].plot.hist()
plt.title('Day time Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylim([0,8000]) # making sure both charts use same dimensions
plt.xlim([0,7500])
plt.subplot(1,2,2)
nighttime['traffic_volume'].plot.hist()
plt.title('Night time Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylim([0,8000])
plt.xlim([0,7500]) # making sure both charts use same dimensions
plt.show()
Let's check the summary statistics of both daytime and nighttime traffic volume to see if we can find the same pattern.
# finding summary statistics of daytime
daytime['traffic_volume'].describe()
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64
# finding summary statistics of nighttime
nighttime['traffic_volume'].describe()
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64
Remember that our goal is to find heavy traffic indicators, and we just found that traffic is generally light at night: the nighttime mean is about 1,785 cars per hour, against about 4,762 during the day. So going forward, we will drop the nighttime data and focus on daytime data.
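The same day versus night comparison can also be produced in one step by labelling each row and grouping on the label. This is only a sketch on invented numbers; the `period` column and the `toy` frame are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy data with invented volumes: two night readings and two day readings
toy = pd.DataFrame({
    'date_time': pd.to_datetime(['2012-10-02 02:00', '2012-10-02 08:00',
                                 '2012-10-02 16:00', '2012-10-02 22:00']),
    'traffic_volume': [400, 5600, 6200, 1400],
})

# Label each row 'day' (07:00 to 18:59) or 'night' (everything else)
hour = toy['date_time'].dt.hour
toy['period'] = np.where((hour >= 7) & (hour < 19), 'day', 'night')

# One groupby gives the statistics for both periods side by side
period_stats = toy.groupby('period')['traffic_volume'].agg(['mean', 'median', 'max'])
print(period_stats)
```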
Another possible indicator of traffic volume is time. There may be more people and cars on the road in certain months, on certain days of the week or at certain times of the day, since some events recur at particular times of the year.
Let's find out the traffic volume by month, day of the week and time of the day.
We will start with traffic volume by month. We will find the average traffic volume in each month.
daytime['month'] = daytime['date_time'].dt.month # adding month column to our dataset
by_month = daytime.groupby('month').mean(numeric_only=True) # grouping by month and averaging the numeric columns only
by_month['traffic_volume']
month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64
by_month['traffic_volume'].plot.line() # the month index is used as the x-axis
plt.title('Average Traffic per Month')
plt.show()
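A smaller version of the monthly average, on invented data, may make the grouping step easier to follow. Selecting traffic_volume before calling `mean()` keeps text columns such as weather_main out of the aggregation, which newer pandas releases may otherwise reject. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Toy data with invented volumes: two January readings and two July readings
toy = pd.DataFrame({
    'date_time': pd.to_datetime(['2013-01-05 09:00', '2013-01-20 09:00',
                                 '2013-07-05 09:00', '2013-07-20 09:00']),
    'weather_main': ['Snow', 'Clouds', 'Clear', 'Rain'],
    'traffic_volume': [4000, 5000, 4500, 4700],
})

toy['month'] = toy['date_time'].dt.month

# Pick the numeric column first, then average it within each month
monthly_avg = toy.groupby('month')['traffic_volume'].mean()
print(monthly_avg)
```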
The monthly averages dip in July (about 4,595, against roughly 4,900 in June and August). Let's check whether July has low traffic volume every year.
daytime['year'] = daytime['date_time'].dt.year # adding year column to our dataset
july_only = daytime.copy()[daytime['month'] == 7]
july_only.groupby('year')['traffic_volume'].mean().plot.line() # grouping by year and averaging the traffic volume
<AxesSubplot:xlabel='year'>
Let's find out traffic volume by day of the week.
daytime['day_of_week'] = daytime['date_time'].dt.dayofweek # adding day of the week column in our dataset
by_day = daytime.groupby('day_of_week').mean(numeric_only=True) # grouping by day of the week and averaging the numeric columns only
by_day['traffic_volume'] # 0 is Monday, 6 is Sunday
day_of_week
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64
by_day['traffic_volume'].plot.line()
plt.title('Average Traffic Volume by Day')
plt.show()
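Because the numeric codes (0 for Monday through 6 for Sunday) are easy to misread, it can help to print the day names alongside them. The sketch below uses three invented readings; 1 October 2012 was a Monday, and the 6th and 7th were a Saturday and a Sunday. The `day_name` column is added only for illustration.

```python
import pandas as pd

# Toy data with invented volumes: a Monday, a Saturday and a Sunday
toy = pd.DataFrame({
    'date_time': pd.to_datetime(['2012-10-01 09:00', '2012-10-06 09:00',
                                 '2012-10-07 09:00']),
    'traffic_volume': [4900, 3900, 3400],
})

toy['day_of_week'] = toy['date_time'].dt.dayofweek   # 0 is Monday, 6 is Sunday
toy['day_name'] = toy['date_time'].dt.day_name()     # human-readable label

print(toy[['day_of_week', 'day_name', 'traffic_volume']])
```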
Next, we will find the traffic volume by hour of the day. As the previous graph shows, the much lower weekend values would skew the hourly averages, so we will compute them separately for business days and weekends.
daytime['hour'] = daytime['date_time'].dt.hour
working_days = daytime.copy()[daytime['day_of_week'] <= 4] # 4 is Friday
by_work_hour = working_days.groupby('hour').mean(numeric_only=True) # average the numeric columns only
weekend = daytime.copy()[daytime['day_of_week'] >= 5] # 5 is Saturday
by_weekend_hour = weekend.groupby('hour').mean(numeric_only=True) # average the numeric columns only
print(by_work_hour['traffic_volume'])
print(by_weekend_hour['traffic_volume'])
hour
7     6030.413559
8     5503.497970
9     4895.269257
10    4378.419118
11    4633.419470
12    4855.382143
13    4859.180473
14    5152.995778
15    5592.897768
16    6189.473647
17    5784.827133
18    4434.209431
Name: traffic_volume, dtype: float64
hour
7     1589.365894
8     2338.578073
9     3111.623917
10    3686.632302
11    4044.154955
12    4372.482883
13    4362.296564
14    4358.543796
15    4342.456881
16    4339.693805
17    4151.919929
18    3811.792279
Name: traffic_volume, dtype: float64
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
by_work_hour['traffic_volume'].plot.line()
plt.title('Working Hour Traffic Volume')
plt.ylabel('Traffic Volume')
plt.xlabel('Hour')
plt.ylim([0,7000]) # making sure both charts use same dimensions
plt.subplot(1,2,2)
by_weekend_hour['traffic_volume'].plot.line()
plt.title('Weekend Hour Traffic Volume')
plt.ylabel('Traffic Volume')
plt.xlabel('Hour')
plt.ylim([0,7000]) # making sure both charts use same dimensions
plt.show()
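As an alternative to building two separate dataframes, the business-day and weekend averages can be computed in a single grouped table. This is a sketch on invented data; the `is_weekend` column and the column labels `business_day` and `weekend` are assumptions made for the example.

```python
import pandas as pd

# Toy data with invented volumes: Mon 1 Oct and Tue 2 Oct 2012 are business days,
# Sat 6 Oct and Sun 7 Oct 2012 are weekend days
toy = pd.DataFrame({
    'date_time': pd.to_datetime(['2012-10-01 07:00', '2012-10-02 07:00',
                                 '2012-10-06 07:00', '2012-10-01 16:00',
                                 '2012-10-06 16:00', '2012-10-07 16:00']),
    'traffic_volume': [6000, 6200, 1500, 6100, 4300, 4500],
})

toy['hour'] = toy['date_time'].dt.hour
toy['is_weekend'] = toy['date_time'].dt.dayofweek >= 5   # 5 is Saturday, 6 is Sunday

# Group on both keys, then move is_weekend into the columns (False first, then True)
hourly = toy.groupby(['hour', 'is_weekend'])['traffic_volume'].mean().unstack()
hourly.columns = ['business_day', 'weekend']
print(hourly)
```

Calling `hourly.plot.line()` on such a table would draw both curves on the same axes, which makes the scales directly comparable.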
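To summarize the time indicators found so far: average daytime traffic is lower in the colder months (November to February) than from March to October, apart from a dip in July; it is higher on business days than on weekends; and on business days it peaks around 7:00 and 16:00.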
Let's continue exploring the dataset to find out whether weather also affects the traffic volume. We will find the correlation between traffic_volume and the numerical weather columns: temp, rain_1h, snow_1h and clouds_all.
# correlating daytime traffic with the daytime weather columns
# (pandas aligns on the index, so the values match those from traffic_data, but this is more explicit)
temp_corr = round(daytime['traffic_volume'].corr(daytime['temp']),3)
rain_corr = round(daytime['traffic_volume'].corr(daytime['rain_1h']),3)
snow_corr = round(daytime['traffic_volume'].corr(daytime['snow_1h']),3)
cloud_corr = round(daytime['traffic_volume'].corr(daytime['clouds_all']),3)
print('Temperature and Traffic Volume Correlation:', temp_corr)
print('Rain and Traffic Volume Correlation:', rain_corr)
print('Snow and Traffic Volume Correlation:', snow_corr)
print('Cloud and Traffic Volume Correlation:', cloud_corr)
Temperature and Traffic Volume Correlation: 0.128
Rain and Traffic Volume Correlation: 0.004
Snow and Traffic Volume Correlation: 0.001
Cloud and Traffic Volume Correlation: -0.033
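The four coefficients can also be read off one correlation matrix instead of four separate calls. The sketch below uses a small invented frame in which temp and traffic_volume rise together exactly, so their Pearson coefficient is 1; the values are hypothetical and only the column names come from the dataset. For scale, a coefficient of 0.128 corresponds to an r squared of about 0.016, meaning a linear fit on temperature would account for under 2% of the variance in traffic volume.

```python
import pandas as pd

# Toy data with invented values: traffic rises linearly with temp,
# while clouds_all has no simple relationship to it
toy = pd.DataFrame({
    'temp': [270.0, 280.0, 290.0, 300.0],
    'clouds_all': [90, 40, 75, 1],
    'traffic_volume': [4000, 4500, 5000, 5500],
})

# DataFrame.corr() computes pairwise Pearson correlations;
# keep the traffic_volume column and drop its correlation with itself
corr_with_traffic = toy.corr()['traffic_volume'].drop('traffic_volume').round(3)
print(corr_with_traffic)
```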
The strongest correlation is with the temp column (temp_corr). But with a correlation coefficient of only about 0.13, it is difficult to say that any numerical weather column is strongly correlated with traffic volume. Let's draw a scatter plot to look at the relationship between temperature and traffic volume.
plt.scatter(x = daytime['temp'], y = daytime['traffic_volume'])
plt.title('Temperature vs. Traffic Volume')
plt.xlabel('Temperature')
plt.ylabel('Traffic Volume')
plt.show()
Let's check the non-numerical weather columns, weather_main and weather_description, to see if we can find any patterns.
by_weather_main = daytime.groupby('weather_main').mean(numeric_only=True) # average the numeric columns only
by_weather_main['traffic_volume'].sort_values().plot.barh()
plt.show()
Let's check the more granular weather_description column.
by_weather_description = daytime.groupby('weather_description').mean(numeric_only=True) # average the numeric columns only
by_weather_description['traffic_volume'].sort_values().plot.barh(figsize=(5,10))
plt.show()
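When comparing averages across many weather categories, it may be worth checking how many rows stand behind each bar, since a rare category can show a high mean by chance. The sketch below uses invented rows, and the minimum of 3 observations is an arbitrary threshold chosen for the example, not a value from this project.

```python
import pandas as pd

# Toy data with invented volumes: one common and one rare weather description
toy = pd.DataFrame({
    'weather_description': ['light rain', 'light rain', 'light rain', 'shower snow'],
    'traffic_volume': [4800, 5000, 4900, 5600],
})

# Compute the mean and the number of observations for each category
summary = toy.groupby('weather_description')['traffic_volume'].agg(['mean', 'count'])

# Keep only categories with enough rows (hypothetical threshold), sorted by mean
reliable = summary[summary['count'] >= 3].sort_values('mean')

print(summary)
print(reliable)
```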
In this project, we analyzed the I-94 Interstate Highway dataset to identify heavy traffic indicators on the highway. We made the following findings: