Finding Heavy Traffic Indicators on I-94

Introduction:

This is an exploratory analysis of traffic data from the I-94 Interstate highway, aimed at finding the main indicators of heavy traffic. These indicators could be weather conditions, the time of day, the day of the week, or specific months.

The dataset documentation mentions that a station located approximately midway between Minneapolis and Saint Paul recorded the traffic data. Also, the station only records westbound traffic (cars moving from east to west).

The data was made available by John Hogue through the UCI Machine Learning Repository.

1.0 Loading the Data:

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Reading the data set:

In [2]:

traffic = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")

Let's take a quick look at the dataset by printing the first and last five rows:

In [3]:

print("Top five rows:")
traffic.head()

Top five rows:

Out[3]:

	holiday	temp	clouds_all	weather_main	weather_description	date_time	traffic_volume
0	None	288.28	40	Clouds	scattered clouds	2012-10-02 09:00:00	5545
1	None	289.36	75	Clouds	broken clouds	2012-10-02 10:00:00	4516
2	None	289.58	90	Clouds	overcast clouds	2012-10-02 11:00:00	4767
3	None	290.13	90	Clouds	overcast clouds	2012-10-02 12:00:00	5026
4	None	291.14	75	Clouds	broken clouds	2012-10-02 13:00:00	4918

In [4]:

print("Last five rows:")
traffic.tail()

Last five rows:

Out[4]:

	holiday	temp	clouds_all	weather_main	weather_description	date_time	traffic_volume
48199	None	283.45	75	Clouds	broken clouds	2018-09-30 19:00:00	3543
48200	None	282.76	90	Clouds	overcast clouds	2018-09-30 20:00:00	2781
48201	None	282.73	90	Thunderstorm	proximity thunderstorm	2018-09-30 21:00:00	2159
48202	None	282.09	90	Clouds	overcast clouds	2018-09-30 22:00:00	1450
48203	None	282.12	90	Clouds	overcast clouds	2018-09-30 23:00:00	954

Looking into the content of the dataset:

In [5]:

traffic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
holiday                48204 non-null object
temp                   48204 non-null float64
rain_1h                48204 non-null float64
snow_1h                48204 non-null float64
clouds_all             48204 non-null int64
weather_main           48204 non-null object
weather_description    48204 non-null object
date_time              48204 non-null object
traffic_volume         48204 non-null int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB

From the output above we conclude the following:
1. The dataset has 48204 rows and 9 columns.
2. There are no null values in the dataset.
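One caveat on the null check: depending on the pandas version, the literal string None in the holiday column may be parsed as a missing value when the CSV is read. The sketch below is only an illustration on a tiny made-up CSV (the two rows are not taken from the dataset); it shows how the keep_default_na option of read_csv keeps such labels as plain text, and how the null count per column can be checked explicitly.

```python
import io
import pandas as pd

# Tiny made-up CSV mimicking two columns of the dataset (values are illustrative).
csv_text = "holiday,traffic_volume\nNone,5545\nLabor Day,4516\n"

# keep_default_na=False stops pandas from treating strings such as "None"
# or "NA" as missing, so the holiday labels stay as plain text.
sample = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Explicit per-column count of missing values.
null_counts = sample.isnull().sum()
print(null_counts)
```

If the counts are all zero, the reading of info() above holds for every column.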

2.0 Exploratory Data Analysis:

We will look into the traffic_volume column:

In [6]:

#Plotting the histogram
traffic["traffic_volume"].plot.hist()
plt.title("Traffic Volume on I-94")
plt.xlabel("Traffic volume")
plt.show()

In [7]:

#looking into the column stats for traffic_volume
traffic["traffic_volume"].describe()

Out[7]:

count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64

Conclusions from the table and plot above:
1. Traffic volumes in the ranges 0-1500 and 4500-5500 occur most frequently. We can look into these ranges.
2. The distribution is not normal; it has two peaks.
3. The average traffic volume is about 3260.
4. 25% of the time the traffic volume is 1193 or lower, and 25% of the time it is 4933 or higher. This variation might reflect the difference between daytime and nighttime.

We will explore the fourth point further.
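The busy ranges noted in point 1 can also be read off numerically, since a histogram is just a count of values per bin. A minimal sketch, assuming a handful of made-up volumes in place of the real column and bin edges chosen only for illustration:

```python
import numpy as np

# Made-up traffic volumes standing in for traffic["traffic_volume"].
volumes = np.array([300, 900, 1200, 3400, 4700, 4900, 5200, 5400])

# Bin edges are an assumption chosen for this illustration.
edges = [0, 1500, 3000, 4500, 6000, 7500]

# np.histogram returns the count per bin and the edges it used.
counts, used_edges = np.histogram(volumes, bins=edges)

for left, right, n in zip(used_edges[:-1], used_edges[1:], counts):
    print(f"{int(left):>4}-{int(right):<4}: {n}")
```

Two separated bins with high counts, as in this toy data, are what a two-peaked distribution looks like in tabular form.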

2.1 Traffic volume : Day vs Night

In [8]:

#Transforming the date_time column to datetime
traffic["date_time"] = pd.to_datetime(traffic["date_time"])

In [9]:

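#Isolating the daytime (7:00-18:59) and nighttime (19:00-6:59) data
#.copy() lets us add columns to these subsets later without a SettingWithCopyWarning
day_traffic = traffic[(traffic["date_time"].dt.hour >= 7) & (traffic["date_time"].dt.hour < 19)].copy()
night_traffic = traffic[(traffic["date_time"].dt.hour < 7) | (traffic["date_time"].dt.hour >= 19)].copy()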

Looking into the stats for daytime and nighttime traffic:

In [10]:

day_traffic["traffic_volume"].describe()

Out[10]:

count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64

In [11]:

night_traffic["traffic_volume"].describe()

Out[11]:

count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64

In [12]:

#Plotting the histograms for day time and night time traffic
plt.figure(figsize = (10 , 5))
plt.subplot(1 , 2 , 1)
plt.hist(day_traffic["traffic_volume"])
plt.title("Traffic Volume : Day")
plt.xlabel("Traffic Volume")
plt.ylabel("Count")
plt.ylim(0 , 8000)
plt.xlim(0 , 7500)

plt.subplot(1 , 2 , 2)
plt.hist(night_traffic["traffic_volume"])
plt.title("Traffic Volume : Night")
plt.xlabel("Traffic Volume")
plt.ylabel("Count")
plt.ylim(0 , 8000)
plt.xlim(0 , 7500)

plt.show()

Conclusions:
1. There is a clear difference between daytime and nighttime traffic volume.
2. Average traffic is 4762 during the day and 1785 during the night.
3. The histograms confirm the above two points.
4. The daytime distribution roughly resembles a normal distribution, though it is skewed to the left (the mean, 4762, sits below the median, 4820). The nighttime distribution is right-skewed: most values are low (the median is 1287) while the maximum reaches 6386.
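The day/night comparison can also be done without two separate frames, by labelling each row and grouping. This is only a sketch on four made-up rows; the column name period is an assumption introduced for the illustration and is not used elsewhere in the notebook.

```python
import pandas as pd

# Four made-up rows standing in for the traffic frame.
sample = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2012-10-02 06:00:00", "2012-10-02 07:00:00",
        "2012-10-02 18:00:00", "2012-10-02 19:00:00",
    ]),
    "traffic_volume": [1000, 5000, 4000, 2000],
})

# between() is inclusive on both ends, so hours 7..18 cover 7:00 to 18:59.
is_day = sample["date_time"].dt.hour.between(7, 18)

# Hypothetical helper column: "day" or "night" for each row.
sample["period"] = is_day.map({True: "day", False: "night"})

means = sample.groupby("period")["traffic_volume"].mean()
print(means)
```

Keeping a single frame with a label column makes it easy to compare the two periods with one groupby call.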

Since our goal is to identify the indicators of heavy traffic, from here on we focus solely on the daytime data; the traffic volume at night is already low.

It is also possible that traffic is heavier on certain days, at certain times of the day, or in certain months.

We will now explore the daytime traffic volume by:
- Month
- Day of the week
- Time of the day

2.2 Traffic volume by month:

We will now look into the traffic for each month, adding a new column "month" to the day_traffic data.

In [13]:

day_traffic["month"] = day_traffic["date_time"].dt.month

In [14]:

#Grouping the data by month and calculating the monthly average
#(selecting the column before .mean() avoids averaging the text columns):
avg_by_month = day_traffic.groupby("month")["traffic_volume"].mean()
avg_by_month

Out[14]:

month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64

In [15]:

#Plotting the line plot for above result
plt.plot(avg_by_month)
plt.xlabel("Month")
plt.ylabel("Average Traffic Volume")
plt.xticks(range(1, 13))
plt.title("Monthly Average Traffic Volume")
plt.show()

Conclusions from the line graph above:
1. The monthly average traffic volume ranges from about 4375 to about 4930.
2. December, January and July have the lowest average traffic.
3. March through June and August through October have high average traffic.
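July stands out as a low month in the middle of the busy season. One way to probe whether such a dip comes from a single unusual year is to average that month separately for each year. The sketch below runs on a few made-up rows, so its numbers say nothing about the real data; the column name year is an assumption for the illustration.

```python
import pandas as pd

# Made-up daytime rows; values are illustrative only.
sample = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2014-07-01 09:00", "2014-07-02 09:00",
        "2015-07-01 09:00", "2015-07-02 09:00",
        "2015-08-01 09:00",
    ]),
    "traffic_volume": [5000, 5200, 3000, 3200, 4900],
})

# Hypothetical helper column holding the calendar year.
sample["year"] = sample["date_time"].dt.year

# Keep only July rows, then average per year.
july = sample[sample["date_time"].dt.month == 7]
july_by_year = july.groupby("year")["traffic_volume"].mean()
print(july_by_year)
```

If one year were far below the others, the monthly dip would be better explained by an event in that year than by a seasonal effect.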

2.3 Traffic volume by day of the week:

In [16]:

#Adding a new column "dayofweek" to the day time traffic dataset:
day_traffic["dayofweek"] = day_traffic["date_time"].dt.dayofweek

In [17]:

#Calculating average traffic by day of week (0 = Monday, 6 = Sunday)
avg_traffic_bydayofweek = day_traffic.groupby("dayofweek")["traffic_volume"].mean()
avg_traffic_bydayofweek

Out[17]:

dayofweek
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64

In [18]:

#Plotting the line plot for above result
plt.plot(avg_traffic_bydayofweek)
plt.xlabel("Day of week (0 = Monday, 6 = Sunday)")
plt.ylabel("Average Traffic Volume")
plt.title("Average Traffic Volume : by day of week")
plt.show()

Conclusion from the plot above:
- The average traffic volume is much higher on weekdays than on weekends; it ranges from about 3400 on Sundays to about 5300 on Thursdays.
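The dayofweek numbers are easy to misread, so it can help to see them next to the day names. A minimal sketch on three dates from the first week of the dataset's time span, using the dt.day_name() accessor:

```python
import pandas as pd

# Three dates chosen for illustration: a Monday, a Thursday and a Sunday.
dates = pd.Series(pd.to_datetime(["2012-10-01", "2012-10-04", "2012-10-07"]))

# dayofweek counts from Monday = 0 to Sunday = 6.
numbers = dates.dt.dayofweek.tolist()

# day_name() gives the English day names by default.
names = dates.dt.day_name().tolist()

for number, name in zip(numbers, names):
    print(number, name)
```

Grouping by day_name() instead of dayofweek would give readable labels, at the cost of losing the natural Monday-to-Sunday ordering.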

2.4 Traffic volume by time of the day:

As seen in the last plot, weekends have much lower average traffic than weekdays. We will therefore calculate the hourly averages separately for business days and weekends.

In [19]:

#Adding a new column "hour" to the day time traffic dataset:
day_traffic["hour"] = day_traffic["date_time"].dt.hour

In [20]:

#Separating the business day and weekend data:
business_days_traffic = day_traffic.loc[day_traffic["dayofweek"] <= 4]
weekends_traffic = day_traffic.loc[day_traffic["dayofweek"] > 4]

In [21]:

#Calculating average traffic by hour for business days:
hourly_avg_traffic_business_days = business_days_traffic.groupby("hour")["traffic_volume"].mean()
hourly_avg_traffic_business_days

Out[21]:

hour
7     6030.413559
8     5503.497970
9     4895.269257
10    4378.419118
11    4633.419470
12    4855.382143
13    4859.180473
14    5152.995778
15    5592.897768
16    6189.473647
17    5784.827133
18    4434.209431
Name: traffic_volume, dtype: float64

In [22]:

#Calculating average traffic by hour for weekends:
hourly_avg_traffic_weekends = weekends_traffic.groupby("hour")["traffic_volume"].mean()
hourly_avg_traffic_weekends

Out[22]:

hour
7     1589.365894
8     2338.578073
9     3111.623917
10    3686.632302
11    4044.154955
12    4372.482883
13    4362.296564
14    4358.543796
15    4342.456881
16    4339.693805
17    4151.919929
18    3811.792279
Name: traffic_volume, dtype: float64

In [23]:

#Plotting the line plots
plt.figure(figsize = (10 , 5))
plt.subplot(1 , 2 , 1)
plt.plot(hourly_avg_traffic_business_days)
plt.title("Hourly average traffic : Business days")
plt.ylim(0 , 6500)

plt.subplot(1 , 2 , 2)
plt.plot(hourly_avg_traffic_weekends)
plt.title("Hourly average traffic : weekends")
plt.ylim(0 , 6500)
plt.show()

Conclusions from the plots above:
- Traffic is clearly much higher on business days than on weekends.
- The rush hours on business days are in the morning before 10 am (peaking at 7 am) and again in the afternoon from 3 pm until after 5 pm (peaking at 4 pm).
- On weekends, traffic rises until around 12 pm, stays nearly constant until 5 pm, and starts decreasing after that.

So far, we have looked at time indicators of traffic volume and reached the following conclusions:
- Traffic is usually heavier during the warm months (March–October) than during the cold months (November–February), with July as an exception.
- Traffic is usually heavier on business days than on weekends.
- On business days, the rush hours are around 7:00 and 16:00.
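The rush hours can also be picked out programmatically rather than by eye. The sketch below copies the business-day hourly averages from Out[21], rounded to whole vehicles, into a Series and asks for the busiest hours; the choice of four hours is an arbitrary cut-off for illustration.

```python
import pandas as pd

# Business-day hourly averages copied (rounded) from the Out[21] table.
hourly = pd.Series(
    [6030, 5503, 4895, 4378, 4633, 4855, 4859, 5153, 5593, 6189, 5785, 4434],
    index=range(7, 19),
    name="traffic_volume",
)

# Single busiest hour of the day.
busiest_hour = hourly.idxmax()

# The four busiest hours, ordered from highest to lowest average volume.
peak_hours = hourly.nlargest(4).index.tolist()

print(busiest_hour, peak_hours)
```

The result contains hours from both the morning and the afternoon, which matches the two rush periods described above.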

2.5 Traffic Volume by weather conditions:

Now we will look into the weather indicators of heavy traffic. We will start with the correlation between traffic_volume and the numeric weather-related columns.

In [24]:

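#Restricting to numeric columns so that corr() does not fail on the text columns
day_traffic.select_dtypes("number").corr()["traffic_volume"]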

Out[24]:

temp              0.128317
rain_1h           0.003697
snow_1h           0.001265
clouds_all       -0.032932
traffic_volume    1.000000
month            -0.022337
dayofweek        -0.416453
hour              0.172704
Name: traffic_volume, dtype: float64

In [25]:

#Plotting the scatter plot between temp and traffic_volume
plt.scatter(day_traffic["traffic_volume"] , day_traffic["temp"])
plt.xlabel("Traffic Volume")
plt.ylabel("Temp")
plt.show()

From the table above we can see that there is little correlation between traffic volume and the numeric weather columns. The strongest is with the temp column (about 0.13), and even that is weak.
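To put the numbers in perspective: the values returned by corr() are Pearson coefficients, and a coefficient of 0.13 squared is about 0.016, so a straight-line fit on temp would account for under 2% of the variation in traffic volume. The sketch below shows what the coefficient is, computing it by hand on made-up values and comparing it with the pandas result; none of the numbers come from the dataset.

```python
import pandas as pd

# Made-up temperatures (Kelvin) and traffic volumes, for illustration only.
temp = pd.Series([280.0, 285.0, 290.0, 295.0, 300.0])
volume = pd.Series([4500.0, 4800.0, 4600.0, 5000.0, 4900.0])

# Pearson r is the covariance divided by the product of the standard deviations.
r_manual = temp.cov(volume) / (temp.std() * volume.std())

# The same coefficient as computed by pandas.
r_pandas = temp.corr(volume)

# Share of variance a straight-line fit would account for.
r_squared = r_pandas ** 2

print(round(r_manual, 6), round(r_pandas, 6), round(r_squared, 6))
```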

Now we will look into the categorical (non-numeric) weather columns for any relation with the traffic volume:

In [26]:

#Calculating average traffic_volume for each category in the 'weather_main' column
day_traffic.groupby('weather_main')["traffic_volume"].mean()

Out[26]:

weather_main
Clear           4778.416260
Clouds          4865.415996
Drizzle         4837.212911
Fog             4372.491713
Haze            4609.893285
Mist            4623.976475
Rain            4815.568462
Smoke           4564.583333
Snow            4396.321183
Squall          4211.000000
Thunderstorm    4648.212860
Name: traffic_volume, dtype: float64

In [27]:

#Plotting the horizontal bar plot
avg_by_weather_main = day_traffic.groupby('weather_main')["traffic_volume"].mean()
plt.barh(avg_by_weather_main.index , avg_by_weather_main.values)
plt.title("Traffic Volume by weather condition")
plt.show()

In [28]:

#Calculating average traffic_volume for each category in the 'weather_description' column
day_traffic.groupby('weather_description')['traffic_volume'].mean()

Out[28]:

weather_description
SQUALLS                                4211.000000
Sky is Clear                           4919.009390
broken clouds                          4824.130326
drizzle                                4737.330935
few clouds                             4839.818023
fog                                    4372.491713
freezing rain                          4314.000000
haze                                   4609.893285
heavy intensity drizzle                4738.586207
heavy intensity rain                   4610.356164
heavy snow                             4411.681250
light intensity drizzle                4890.164049
light intensity shower rain            4558.100000
light rain                             4859.650849
light rain and snow                    5579.750000
light shower snow                      4618.636364
light snow                             4430.858896
mist                                   4623.976475
moderate rain                          4769.643312
overcast clouds                        4861.124952
proximity shower rain                  4901.756757
proximity thunderstorm                 4684.356436
proximity thunderstorm with drizzle    5121.833333
proximity thunderstorm with rain       4501.611111
scattered clouds                       4936.787712
shower drizzle                         4932.666667
shower snow                            5664.000000
sky is clear                           4753.930294
sleet                                  4312.666667
smoke                                  4564.583333
snow                                   4054.065693
thunderstorm                           4724.708333
thunderstorm with drizzle              2297.000000
thunderstorm with heavy rain           4555.760000
thunderstorm with light drizzle        4960.000000
thunderstorm with light rain           4336.130435
thunderstorm with rain                 4522.950000
very heavy rain                        4780.571429
Name: traffic_volume, dtype: float64

In [29]:

#Plotting the horizontal bar plot
plt.figure(figsize = (8,12))
avg_by_weather_description = day_traffic.groupby('weather_description')["traffic_volume"].mean()
plt.barh(avg_by_weather_description.index , avg_by_weather_description.values)
plt.title("Traffic Volume by weather description")
plt.show()
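Some of the weather descriptions may occur only a handful of times, in which case their averages are unreliable; the notebook does not check this. A minimal sketch of how the number of observations could be shown next to each mean, run on a small made-up frame (the rows are illustrative, not taken from the dataset), with 5000 used as an assumed threshold for heavy traffic:

```python
import pandas as pd

# Made-up rows standing in for day_traffic.
sample = pd.DataFrame({
    "weather_description": [
        "shower snow", "sky is clear", "sky is clear",
        "sky is clear", "light rain", "light rain",
    ],
    "traffic_volume": [5664, 4700, 4800, 4900, 5100, 5300],
})

# Mean and number of observations for each description.
summary = sample.groupby("weather_description")["traffic_volume"].agg(["mean", "count"])

# Descriptions averaging above the assumed threshold, best-supported first.
heavy = summary[summary["mean"] > 5000].sort_values("count", ascending=False)
print(heavy)
```

A high mean backed by only one or two rows, like the shower snow row in this toy data, should be read with caution.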

3. Final Conclusions:

- Traffic volume is much higher during the day than at night.
- During the day, traffic varies greatly between business days and weekends.
- The rush hours on business days are in the morning before 10 am and in the afternoon from 3 pm until after 5 pm.
- Overall, traffic volume does not vary much with weather conditions. Neither snow nor thunderstorms as broad categories average higher than clear weather; only a few specific descriptions (shower snow, light rain and snow, and proximity thunderstorm with drizzle) average above 5000.