We are going to analyze a dataset about the westbound traffic on the I-94 Interstate highway. John Hogue made the dataset available, and it can be downloaded from the UCI Machine Learning Repository. The goal of our analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of the day, time of the week, etc. For instance, we may find out that the traffic is usually heavier in the summer or when it snows.
import pandas as pd
#Load in the data
traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
#Display first 5 rows
traffic.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
| 1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
| 2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
| 3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
| 4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
#Display last 5 rows
traffic.tail()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
| 48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
| 48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
| 48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
| 48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
The dataset has 9 columns and 48,204 rows. No column contains null values, and the data types are a mix of float, integer, and object. The date_time column shows that the records run from 2012-10-02 09:00:00 to 2018-09-30 23:00:00.
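The two checks described above (no missing values, and the first and last timestamps) can be sketched as follows. This is a minimal illustration that assumes a tiny stand-in DataFrame named `sample`, built from three rows of the `head()` output, rather than the full CSV.

```python
import pandas as pd

# Stand-in for the traffic DataFrame: three rows copied from the head() output.
# `sample` is a hypothetical name used only for this illustration.
sample = pd.DataFrame({
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 10:00:00", "2012-10-02 11:00:00"],
    "temp": [288.28, 289.36, 289.58],
    "traffic_volume": [5545, 4516, 4767],
})

# Count missing values per column; all zeros means there are no nulls.
null_counts = sample.isna().sum()

# Parse the timestamps and read off the earliest and latest record.
timestamps = pd.to_datetime(sample["date_time"])
first_record, last_record = timestamps.min(), timestamps.max()

print(null_counts)
print(first_record, "to", last_record)
```

On the real data, the same calls would be applied to `traffic` instead of `sample`.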
The dataset documentation mentions that a station located approximately midway between Minneapolis and Saint Paul recorded the traffic data. Also, the station only records westbound traffic (cars moving from east to west). This means that the results of our analysis will be about the westbound traffic in the proximity of that station. In other words, we should avoid generalizing our results for the entire I-94 highway.
#Plotting a histogram to examine the distribution of the traffic_volume column. Using Pandas method.
import matplotlib.pyplot as plt
%matplotlib inline
traffic['traffic_volume'].plot.hist()
plt.title('Frequency of Traffic Volume')
plt.xlabel('Traffic Volume')
plt.show()
#creating a summary of statistics of traffic_volume column
traffic['traffic_volume'].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
From the summary statistics of the traffic_volume column, we see that the hourly traffic volume varied from 0 to 7,280 cars, with an average of about 3,260 cars. About 25% of the time, 1,193 cars or fewer passed the station each hour; this probably occurs at night or when the road is under construction. At the other end, about 25% of the time the traffic volume was roughly four times as much (4,933 cars or more).
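To make the quartile reading concrete: roughly a quarter of the observations lie at or below the 25% value, and roughly a quarter lie at or above the 75% value. The sketch below checks this on made-up data (the integers 1 to 100 stand in for hourly volumes), using the standard library rather than pandas.

```python
import statistics

# Hypothetical hourly volumes: the integers 1..100 stand in for real data.
volumes = list(range(1, 101))

# Quartiles computed with linear interpolation between data points.
q1, median, q3 = statistics.quantiles(volumes, n=4, method="inclusive")

# Share of observations in the bottom and top quarter.
share_at_or_below_q1 = sum(v <= q1 for v in volumes) / len(volumes)
share_at_or_above_q3 = sum(v >= q3 for v in volumes) / len(volumes)

print(q1, q3)
print(share_at_or_below_q1, share_at_or_above_q3)
```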
The possibility that the time of day (night versus day) influences traffic volume steers our analysis in an interesting direction: comparing daytime and nighttime data.
We'll start by dividing the dataset into two parts:
-Daytime data: hours from 7 a.m. to 7 p.m. (12 hours).
-Nighttime data: hours from 7 p.m. to 7 a.m. (12 hours).
While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.
#transforming the column to a datetime datatype
traffic['date_time'] = pd.to_datetime(traffic['date_time'])
traffic['date_time']
0       2012-10-02 09:00:00
1       2012-10-02 10:00:00
2       2012-10-02 11:00:00
3       2012-10-02 12:00:00
4       2012-10-02 13:00:00
                ...
48199   2018-09-30 19:00:00
48200   2018-09-30 20:00:00
48201   2018-09-30 21:00:00
48202   2018-09-30 22:00:00
48203   2018-09-30 23:00:00
Name: date_time, Length: 48204, dtype: datetime64[ns]
#isolate the daytime rows (7 a.m. up to, but not including, 7 p.m.); copying the filtered result lets us add columns to it safely later
day_time = traffic[(traffic['date_time'].dt.hour >= 7) & (traffic['date_time'].dt.hour < 19)].copy()
day_time.shape
(23877, 9)
#isolate the nighttime rows (7 p.m. up to, but not including, 7 a.m.)
night_time = traffic[(traffic['date_time'].dt.hour >= 19) | (traffic['date_time'].dt.hour < 7)].copy()
night_time.shape
(24327, 9)
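As a quick sanity check, the two subsets should partition the dataset: every row is either daytime or nighttime, and the row counts reported above should add up to 48,204. The sketch below illustrates this on hypothetical data (one reading per hour over two full days) and then checks the reported counts; the names `readings`, `day_part`, and `night_part` are made up for this example.

```python
import pandas as pd

# Hypothetical data: one timestamp per hour over two full days (48 rows).
stamps = pd.Timestamp("2012-10-03") + pd.to_timedelta(list(range(48)), unit="h")
readings = pd.DataFrame({"date_time": stamps})

# Same criterion as above: daytime is 7 <= hour < 19, nighttime is everything else.
hour = readings["date_time"].dt.hour
is_day = (hour >= 7) & (hour < 19)
day_part = readings[is_day].copy()
night_part = readings[~is_day].copy()

# The two parts cover every row exactly once.
print(len(day_part), len(night_part), len(readings))

# The same check on the shapes reported for the real data.
print(23877 + 24327 == 48204)
```

Using `~is_day` for the nighttime mask guarantees the two subsets are complementary, which is equivalent to the explicit `>= 19` or `< 7` condition.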
Now we're going to compare the traffic volume at night and during the day.
#plotting side-by-side histograms with Matplotlib
plt.figure(figsize=(10,4))
plt.subplot(1, 2, 1)
plt.hist(day_time['traffic_volume'])
plt.xlim([0,7500])
plt.ylim([0,8000])
plt.title('Daytime Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.subplot(1, 2, 2)
plt.hist(night_time['traffic_volume'])
plt.xlim([0,7500])
plt.ylim([0,8000])
plt.title('Nighttime Traffic Volume')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.show()
day_time['traffic_volume'].describe()
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64
night_time['traffic_volume'].describe()
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64
The daytime histogram is left-skewed: most of the values pile up on the right side of the histogram, and the median is higher than the mean. The nighttime histogram is right-skewed: most of the values pile up on the left side, and the mean is higher than the median.
Although there are still nighttime measurements of over 5,000 cars per hour, traffic at night is generally light compared to the daytime (an average of about 1,785 cars per hour versus about 4,762). Our goal is to find indicators of heavy traffic, so we'll focus only on the daytime data going forward.
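The mean-versus-median comparison used above is a rule of thumb, not a formal skewness test. A minimal sketch of it, applied to the mean and median (50%) values reported by `describe()`, might look like this; `skew_direction` is a hypothetical helper written for this illustration.

```python
def skew_direction(mean, median):
    """Rule of thumb: a mean below the median suggests left skew, above suggests right skew."""
    if mean < median:
        return "left-skewed"
    if mean > median:
        return "right-skewed"
    return "roughly symmetric"

# Mean and median (50%) copied from the describe() outputs above.
daytime = skew_direction(mean=4762.047452, median=4820.0)
nighttime = skew_direction(mean=1785.377441, median=1287.0)

print("daytime:", daytime)
print("nighttime:", nighttime)
```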
#extract the month, then average the numeric columns per month
day_time['month'] = day_time['date_time'].dt.month
by_month = day_time.groupby('month').mean(numeric_only=True)
by_month['traffic_volume']
month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64
#plotting a line graph showing monthly traffic volume averages
by_month['traffic_volume'].plot.line()
plt.title('Monthly Traffic Volume Averages')
plt.xlabel('Month')
plt.ylabel('Traffic Volume Averages')
plt.xticks(range(1,13))
plt.show()
The line graph shows that traffic volume averages are high in the warm months (March to June and August to October) and low in the cold months (January, February, November, and December). July is an exception: its average is unusually low for a warm month. Let's see how the traffic volume changed each year in July.
#extract the year, then average the July traffic volume for each year
day_time['year'] = day_time['date_time'].dt.year
only_july = day_time[day_time['month'] == 7]
only_july = only_july.groupby('year').mean(numeric_only=True)
only_july['traffic_volume'].plot.line()
plt.show()
Typically, the traffic is pretty heavy in July, similar to the other warm months. The only exception we see is 2016, which had a sharp decrease in traffic volume. One possible reason for this is road construction; this article from 2016 supports this hypothesis.
As a tentative conclusion, we can say that warm months generally show heavier traffic than cold months. In a warm month, you can expect a traffic volume close to 5,000 cars for each daytime hour.
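The warm-versus-cold comparison can be made explicit with the monthly averages printed earlier. The sketch below takes an unweighted mean of the monthly means, so it ignores that months contribute different numbers of rows. It also assumes warm means March through October, as in this analysis.

```python
# Monthly daytime averages copied from the by_month output above.
monthly_avg = {
    1: 4495.613727, 2: 4711.198394, 3: 4889.409560, 4: 4906.894305,
    5: 4911.121609, 6: 4898.019566, 7: 4595.035744, 8: 4928.302035,
    9: 4870.783145, 10: 4921.234922, 11: 4704.094319, 12: 4374.834566,
}

# Assumption for this sketch: warm months are March through October.
warm_months = range(3, 11)

warm = [avg for month, avg in monthly_avg.items() if month in warm_months]
cold = [avg for month, avg in monthly_avg.items() if month not in warm_months]

warm_mean = sum(warm) / len(warm)
cold_mean = sum(cold) / len(cold)

print(f"warm months: {warm_mean:.0f} cars/hour")
print(f"cold months: {cold_mean:.0f} cars/hour")
```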
Next, we look at a more granular indicator: the day of the week. As the averages below show, traffic is significantly heavier on business days than on weekends.
After that, we'll generate a line plot for the time of day. The weekends would drag down the hourly averages, so we're going to look at business days and weekends separately, splitting the data by day type.
#create a new column for the day of the week (0 == Monday, 6 == Sunday)
day_time['dayofweek'] = day_time['date_time'].dt.dayofweek
by_dayofweek = day_time.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume']
dayofweek
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64
#plotting a line graph showing daily traffic volume averages
by_dayofweek['traffic_volume'].plot.line()
plt.title('Daily Traffic Volume Averages')
plt.xlabel('Day')
plt.ylabel('Traffic Volume Averages')
plt.show()
On business days (Monday through Friday), traffic volume is significantly higher. Except for Monday, every business day averages more than 5,000 cars per hour. Weekend traffic is lighter, with averages below 4,000 cars.
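The gap between business days and weekends can be summarized from the day-of-week averages printed above. This sketch takes an unweighted mean of the daily means and labels the days by hand (0 is Monday, 6 is Sunday); it is an illustration, not part of the original analysis.

```python
# Daytime averages per day of week, copied from the by_dayofweek output above.
day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
daily_avg = [4893.551286, 5189.004782, 5284.454282, 5311.303730,
             5291.600829, 3927.249558, 3436.541789]

# Indices 0-4 are business days, 5-6 are the weekend.
business = daily_avg[:5]
weekend = daily_avg[5:]

business_mean = sum(business) / len(business)
weekend_mean = sum(weekend) / len(weekend)

# Days whose average exceeds 5,000 cars per hour.
over_5000 = [name for name, avg in zip(day_names, daily_avg) if avg > 5000]

print(f"business days: {business_mean:.0f}, weekend: {weekend_mean:.0f}")
print("days above 5,000:", over_5000)
```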
day_time['hour'] = day_time['date_time'].dt.hour
business_days = day_time[day_time['dayofweek'] <= 4].copy() #4 == Friday
weekend = day_time[day_time['dayofweek'] > 4].copy() #5 == Saturday, 6 == Sunday
#getting average traffic volume for business days
by_hour_businessdays = business_days.groupby('hour').mean(numeric_only=True)
#getting average traffic volume for weekends
by_hour_weekends = weekend.groupby('hour').mean(numeric_only=True)
#Plotting two line plots showing average traffic volume changes by time of the day
plt.figure(figsize=(11,4))
plt.subplot(1, 2, 1)
by_hour_businessdays['traffic_volume'].plot.line()
plt.xlim(5,20)
plt.ylim(1500,6300)
plt.title('Hourly Businessday Traffic Volume')
plt.xlabel('Hour')
plt.ylabel('Average traffic volume')
plt.subplot(1, 2, 2)
by_hour_weekends['traffic_volume'].plot.line()
plt.xlim(5,20)
plt.ylim(1500,6300)
plt.title('Hourly Weekend Traffic Volume')
plt.xlabel('Hour')
plt.ylabel('Average traffic volume')
plt.show()
At each hour of the day, the traffic volume is generally higher on business days than on weekends. As expected, the rush hours are around 7 and 16, when most people travel from home to work and back. We see volumes of over 6,000 cars at rush hours.
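Rush hours can also be read off programmatically instead of from the plot. The sketch below uses entirely made-up data (a flat baseline with bumps at 7 and 16 on business days) to show the idea: group by hour, average, and take the hours with the largest means. The DataFrame `demo` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: hourly readings for two full weeks starting on a Monday.
stamps = pd.Timestamp("2012-10-01") + pd.to_timedelta(list(range(24 * 14)), unit="h")
demo = pd.DataFrame({"date_time": stamps})
demo["hour"] = demo["date_time"].dt.hour
demo["dayofweek"] = demo["date_time"].dt.dayofweek

# Made-up volumes: 3,000 cars everywhere, 6,000 at hours 7 and 16 on business days.
is_business_day = demo["dayofweek"] <= 4
is_rush_hour = demo["hour"].isin([7, 16])
demo["traffic_volume"] = 3000 + 3000 * (is_business_day & is_rush_hour).astype(int)

# Average volume per hour on business days, then the two busiest hours.
by_hour = demo[is_business_day].groupby("hour")["traffic_volume"].mean()
rush_hours = sorted(by_hour.nlargest(2).index.tolist())

print(rush_hours)
```

On the real data, the same `nlargest` call could be applied to `by_hour_businessdays['traffic_volume']`.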
To summarize, we found a few time-related indicators of heavy traffic:
-The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
-The traffic is usually heavier on business days compared to weekends.
-On business days, the rush hours are around 7 and 16.
#find the correlation between traffic_volume and the numeric columns
day_time.corr(numeric_only=True)['traffic_volume']
temp              0.128317
rain_1h           0.003697
snow_1h           0.001265
clouds_all       -0.032932
traffic_volume    1.000000
month            -0.022337
year             -0.003557
dayofweek        -0.416453
hour              0.172704
Name: traffic_volume, dtype: float64
Among the weather columns, temperature shows the strongest correlation with traffic_volume, at just +0.13. The other weather columns (rain_1h, snow_1h, clouds_all) don't show any strong correlation with traffic_volume.
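One way to see how weak a correlation of +0.13 is: squaring a Pearson correlation gives the share of variance that a straight-line fit on that one variable would account for. A small worked example using the reported value:

```python
# Pearson correlation between temp and traffic_volume, copied from the output above.
r_temp = 0.128317

# Squared correlation: the share of variance a simple linear fit would explain.
r_squared = r_temp ** 2

print(f"r = {r_temp:.3f}, r^2 = {r_squared:.4f} ({r_squared:.1%} of the variance)")
```

Under this reading, temperature alone would account for well under 2% of the variation in daytime traffic volume.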
Let's generate a scatter plot to visualize the correlation between temp and traffic_volume.
#plotting a scatter plot of temp against traffic_volume
day_time.plot.scatter('traffic_volume', 'temp')
plt.show()
We can conclude that temperature doesn't look like a solid indicator of heavy traffic.
#average the numeric columns for each main weather category
by_weather_main = day_time.groupby('weather_main').mean(numeric_only=True)
by_weather_main['traffic_volume'].plot.barh()
plt.show()
by_weather_description = day_time.groupby('weather_description').mean(numeric_only=True)
#plotting horizontal bar plot
by_weather_description['traffic_volume'].plot.barh(figsize=(6,12))
plt.xlabel('Average traffic volume')
plt.ylabel('Weather description')
plt.show()
Three weather types have an average traffic volume above 5,000: shower snow, light rain and snow, and proximity thunderstorm with drizzle. It's unclear why these weather types have the highest average traffic values; this is bad weather, but not particularly severe. Perhaps more people take their cars out of the garage when the weather is bad instead of riding a bicycle or walking.
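One possible follow-up, not part of the original analysis, is to look at how many rows stand behind each average, since a weather type with only a handful of observations can show a high mean by chance. The sketch below uses made-up rows to show the pattern; the DataFrame `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: one rare weather type with a single high reading.
demo = pd.DataFrame({
    "weather_description": ["shower snow", "sky is clear", "sky is clear", "sky is clear"],
    "traffic_volume": [5600, 4700, 4900, 4800],
})

# Mean and number of observations per weather description.
summary = demo.groupby("weather_description")["traffic_volume"].agg(["mean", "count"])

print(summary.sort_values("mean", ascending=False))
```

On the real data, the same aggregation could be run on `day_time` to see whether the top weather types rest on many or few observations.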
We attempted to identify a few indicators of heavy traffic on the I-94 Interstate highway in this project. We were successful in locating two types of indicators: Time Indicators and Weather Indicators.
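Time indicators
-The traffic is usually heavier during warm months (March–October) compared to cold months (November–February).
-The traffic is usually heavier on business days compared to weekends.
-On business days, the rush hours are around 7 and 16.
Weather indicators
-Shower snow
-Light rain and snow
-Proximity thunderstorm with drizzle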