Guided Project: Data Visualization Fun[damentals]

In this project I look forward to selecting and applying various ways of graphically representing information from I-94 westbound traffic data.

This effort to identify indicators of traffic slowdowns is an example of exploratory analysis.

Summary of Conclusions

Daytime Vs. Nighttime

The Nighttime (7pm - 7am) Traffic Volume Frequency Histogram is Right-Skewed indicating low traffic volume. Data from Nighttime was excluded from the current analysis because it does not offer much insight into high traffic volume indicators.

Month

The Winter months (Nov-Feb) have lower traffic volume, with the lowest in Dec/Jan. This could be a combination of the holidays and the colder weather.

July was an exception to the higher traffic volumes observed in the Spring-Fall (Mar-Oct) with a noticeable dip. This dip is curious but does not provide insight on high traffic volume indicators.

Day of the Week

Business Days (Mon-Fri) showed much higher traffic volume than Weekends (Sat-Sun), and the line graphs of the hourly changes in traffic volume for each displayed two very distinct trends.

Business Day Traffic Volume peaked very high at 7:00 and then again at 16:00.

Weekend Traffic Volume started low in the morning and increased steadily to maintain a steady middle level throughout the afternoon.

Weather

There was mostly just a weak correlation between the numerical weather metrics (temperature, rain, snow, cloud cover) and traffic volume.

However, when looking a little more closely, weather events that specifically combined rain and snow had higher traffic volumes than other events that involved just rain or snow. An explanation could be icy road conditions.

Full Exploratory Analysis

Let's get probing!!

First Peek at the Data Set

There are only 9 columns in the dataset, mostly related to weather. The others give the date and time, whether it was a holiday and, of course, the traffic volume.

The data set is complete with no values missing.

In [1]:
import pandas as pd
dataset = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
print(dataset.head())
print(dataset.tail())
print(dataset.info())
  holiday    temp  rain_1h  snow_1h  clouds_all weather_main  \
0    None  288.28      0.0      0.0          40       Clouds   
1    None  289.36      0.0      0.0          75       Clouds   
2    None  289.58      0.0      0.0          90       Clouds   
3    None  290.13      0.0      0.0          90       Clouds   
4    None  291.14      0.0      0.0          75       Clouds   

  weather_description            date_time  traffic_volume  
0    scattered clouds  2012-10-02 09:00:00            5545  
1       broken clouds  2012-10-02 10:00:00            4516  
2     overcast clouds  2012-10-02 11:00:00            4767  
3     overcast clouds  2012-10-02 12:00:00            5026  
4       broken clouds  2012-10-02 13:00:00            4918  
      holiday    temp  rain_1h  snow_1h  clouds_all  weather_main  \
48199    None  283.45      0.0      0.0          75        Clouds   
48200    None  282.76      0.0      0.0          90        Clouds   
48201    None  282.73      0.0      0.0          90  Thunderstorm   
48202    None  282.09      0.0      0.0          90        Clouds   
48203    None  282.12      0.0      0.0          90        Clouds   

          weather_description            date_time  traffic_volume  
48199           broken clouds  2018-09-30 19:00:00            3543  
48200         overcast clouds  2018-09-30 20:00:00            2781  
48201  proximity thunderstorm  2018-09-30 21:00:00            2159  
48202         overcast clouds  2018-09-30 22:00:00            1450  
48203         overcast clouds  2018-09-30 23:00:00             954  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              48204 non-null  object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
None
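
As a quick sanity check on the "no values missing" claim, here is a minimal sketch on a made-up three-row frame (the name toy and its values are my own, not taken from the CSV). It counts NaN values per column and also counts the placeholder text "None", which isna() would not flag if the holiday column stores it as an ordinary string, as the non-null counts above suggest.

```python
import pandas as pd

# Toy stand-in for the real CSV; all values here are invented for illustration.
toy = pd.DataFrame({
    "holiday": ["None", "None", "Columbus Day"],
    "temp": [288.28, 289.36, 273.15],
    "traffic_volume": [5545, 4516, 455],
})

# Count true missing values (NaN) in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The text "None" is a regular string, so it has to be counted separately.
placeholder_count = int((toy["holiday"] == "None").sum())
print(placeholder_count)
```

Running the same two checks on dataset would show whether "complete" also means "no placeholder values".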

Traffic Volume - Histogram

In [2]:
import matplotlib.pyplot as plt
%matplotlib inline

dataset["traffic_volume"].plot.hist(bins=10)
plt.xlabel("Traffic Volume")
plt.show()

Traffic Volume - Quick Stats

In [3]:
dataset["traffic_volume"].describe()
Out[3]:
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64

Traffic Volume - Initial Observations

Based on the series description and histogram generated above, the most common traffic volumes are less than 1000 cars or closer to 5000 cars.

Certainly time of day has an influence on the number of cars on the road. Given the shape of the histogram, the data may be indicating a nightly average around 1000 cars and a daytime average around 4500.

Traffic Volume : Daytime vs. Nighttime

Preparing the Dataset : Append Hour

The below code converts the date_time column string into datetime format and then appends the hour values as a 10th column of the dataset.

In [4]:
dataset["date_time"] = pd.to_datetime(dataset["date_time"])
dataset["hour"] = dataset["date_time"].dt.hour
print(dataset.head())
  holiday    temp  rain_1h  snow_1h  clouds_all weather_main  \
0    None  288.28      0.0      0.0          40       Clouds   
1    None  289.36      0.0      0.0          75       Clouds   
2    None  289.58      0.0      0.0          90       Clouds   
3    None  290.13      0.0      0.0          90       Clouds   
4    None  291.14      0.0      0.0          75       Clouds   

  weather_description           date_time  traffic_volume  hour  
0    scattered clouds 2012-10-02 09:00:00            5545     9  
1       broken clouds 2012-10-02 10:00:00            4516    10  
2     overcast clouds 2012-10-02 11:00:00            4767    11  
3     overcast clouds 2012-10-02 12:00:00            5026    12  
4       broken clouds 2012-10-02 13:00:00            4918    13  
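
A minimal sketch of the conversion step on two timestamps written in the same text format as the date_time column (the names raw, parsed and hours are illustrative). It shows that the .dt.hour accessor becomes available once the text has been parsed into datetimes.

```python
import pandas as pd

# Two timestamps as plain text, formatted like the date_time column.
raw = pd.Series(["2012-10-02 09:00:00", "2018-09-30 23:00:00"])

# Parse the text into datetime values.
parsed = pd.to_datetime(raw)
print(pd.api.types.is_datetime64_any_dtype(parsed))

# Pull out the hour of day (0-23) from each timestamp.
hours = parsed.dt.hour
print(hours.tolist())
```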

Traffic Volume : Daytime vs. Nighttime

Preparing the Dataset : Split by Hour

Next we'll separate the data into a dayset and a nightset based on 7am onwards being daytime and 7pm onwards being nighttime.

In [5]:
bool_day = (dataset["hour"] >= 7) & (dataset["hour"] < 19)
bool_night = ~bool_day

dayset = dataset.loc[bool_day].copy()
nightset = dataset.loc[bool_night].copy()
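
To convince myself that the two masks split the day cleanly, here is a small sketch on the 24 possible hour values rather than the real data. The names is_day and is_night are illustrative, and Series.between is shown as an equivalent spelling of the daytime test, relying on its default of including both endpoints.

```python
import pandas as pd

# Every possible hour of the day, 0 through 23.
hours = pd.Series(range(24), name="hour")

# Daytime is 07:00 up to (not including) 19:00; nighttime is everything else.
is_day = (hours >= 7) & (hours < 19)
is_night = ~is_day

day_hours = hours[is_day].tolist()
night_hours = hours[is_night].tolist()
print(day_hours)
print(night_hours)

# between(7, 18) includes both ends, so it selects the same hours.
same_mask = is_day.equals(hours.between(7, 18))
print(same_mask)
```

Each hour lands in exactly one of the two groups, so no row of the dataset is dropped or counted twice by the split.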

Traffic Volume: Daytime vs. Nighttime

Histogram Grid

In [6]:
plt.figure(figsize=(12,5))

plt.subplot(1,2,1)
plt.hist(dayset["traffic_volume"], bins=10)
plt.title("Daytime Traffic Volumes")
plt.xlabel("Traffic Volume")
plt.ylabel("Frequency")
plt.xlim([0,8000])
plt.ylim([0,8000])

plt.subplot(1,2,2)
plt.hist(nightset["traffic_volume"], bins=10)
plt.title("Nighttime Traffic Volumes")
plt.xlabel("Traffic Volume")
plt.ylabel("Frequency")
plt.xlim([0,8000])
plt.ylim([0,8000])

plt.show()

Traffic Volume: Daytime vs. Nighttime

Quick Stats

In [7]:
print("Daytime Traffic Volume Statistics:")
print(dayset["traffic_volume"].describe())
print()
print("Nighttime Traffic Volume Statistics:")
print(nightset["traffic_volume"].describe())
Daytime Traffic Volume Statistics:
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64

Nighttime Traffic Volume Statistics:
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64
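
The two subsets should recombine into the overall figures from the earlier quick stats. A small arithmetic check, using only the counts and means printed above (the variable names are mine):

```python
# Figures copied from the describe() outputs in this notebook.
day_count, day_mean = 23877, 4762.047452
night_count, night_mean = 24327, 1785.377441

# The row counts of the two subsets should add up to the full dataset.
total_count = day_count + night_count

# The overall mean is the count-weighted average of the two subset means.
overall_mean = (day_count * day_mean + night_count * night_mean) / total_count

print(total_count)             # 48204
print(round(overall_mean, 2))  # 3259.82
```

Both numbers match the count and mean reported for the full traffic_volume column.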

Traffic Volume: Daytime vs. Nighttime

Observations

The dayset histogram is left-skewed with higher traffic volumes occurring more often during this period.

The nightset histogram is right-skewed with lower traffic volumes occurring more often.

Since nighttime data reflects lower traffic volumes, it will not be as useful to find indicators of heavy traffic which is significantly more common during the day.

Moving forward we'll focus on the dayset data.
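
The skew can also be put into a number. The sketch below uses two invented samples shaped roughly like the day and night histograms (daytime_like and nighttime_like are made-up values, not the real data) and calls Series.skew(), which is negative for a left skew and positive for a right skew. The quick stats above point the same way: the daytime mean (4762) sits below its median (4820), while the nighttime mean (1785) sits above its median (1287).

```python
import pandas as pd

# Invented samples: mostly high values with a few low ones, and the reverse.
daytime_like = pd.Series([1000, 4000, 4500, 5000, 5200, 5500])
nighttime_like = pd.Series([300, 400, 500, 800, 1500, 5000])

# Negative skew means a longer tail on the left, positive on the right.
day_skew = daytime_like.skew()
night_skew = nighttime_like.skew()
print(day_skew < 0, night_skew > 0)

# Rule of thumb: the mean is pulled toward the long tail.
print(daytime_like.mean() < daytime_like.median())
print(nighttime_like.mean() > nighttime_like.median())
```

On the real data the same calls would be dayset["traffic_volume"].skew() and nightset["traffic_volume"].skew().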

Daytime Traffic Volume : Month Influence

Preparing the Data : Append Month

Adding an 11th column to store the month, using a similar method as for hour:

In [8]:
dayset["month"] = dayset["date_time"].dt.month

Daytime Traffic Volume : Month Influence

Generate Statistics : Traffic Volume Monthly Means

Generate the monthly traffic volume averages and plot them.

In [9]:
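# numeric_only=True keeps text columns out of the mean (newer pandas raises on them)
by_month = dayset.groupby("month").mean(numeric_only=True)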
by_month['traffic_volume']
Out[9]:
month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64
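
A minimal sketch of the same group-then-average step on a made-up four-row frame (toy and its values are invented for illustration). Selecting traffic_volume before calling mean() is one way to keep text columns such as weather_main out of the aggregation.

```python
import pandas as pd

# Invented rows: two in January, two in July.
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2016-01-04 08:00", "2016-01-05 08:00",
        "2016-07-05 08:00", "2016-07-06 08:00",
    ]),
    "weather_main": ["Snow", "Clouds", "Clear", "Rain"],
    "traffic_volume": [4000, 5000, 4400, 4800],
})

# Derive the month number, then average only the column of interest.
toy["month"] = toy["date_time"].dt.month
monthly_mean = toy.groupby("month")["traffic_volume"].mean()
print(monthly_mean)
```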

Daytime Traffic Volume : Month Influence

Visualize Data : Line Graph of Monthly Mean Traffic Volume

Plotting the monthly mean values, we can see that traffic volume is pretty steady at around 4900 each month, but it dips around the times when people often take vacation: Nov-Feb (Thanksgiving/Christmas) and July.

In [10]:
plt.plot(by_month["traffic_volume"])
plt.title("Traffic Volume by Month")
plt.xlabel("Month")
plt.ylabel("Mean Traffic Volume")
plt.show()

Daytime Traffic Volume : Day of Week Influence

Preparing the Data : Append Day of Week

Use a similar technique to identify the day of the week of each data row and generate mean traffic volumes for each day of the week (DOW).

In [11]:
dayset['dayofweek'] = dayset['date_time'].dt.dayofweek
by_dayofweek = dayset.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume'] # 0 is Monday, 6 is Sunday
Out[11]:
dayofweek
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64
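
The numeric day codes are easier to read with names attached. The sketch below rebuilds the series from the means printed in Out[11] (by_dow and day_names are my own names) and picks out the busiest and quietest days. The two group averages are simple unweighted averages of the daily means, so they are only a rough summary.

```python
import pandas as pd

# Means copied from Out[11]; position 0 is Monday, position 6 is Sunday.
day_names = ["Mon", "Tues", "Weds", "Thurs", "Fri", "Sat", "Sun"]
by_dow = pd.Series(
    [4893.551286, 5189.004782, 5284.454282, 5311.303730,
     5291.600829, 3927.249558, 3436.541789],
    index=day_names,
)

# Labels of the highest and lowest daily means.
busiest = by_dow.idxmax()
quietest = by_dow.idxmin()
print(busiest, quietest)

# Unweighted averages of the first five days and the last two.
weekday_avg = by_dow.iloc[:5].mean()
weekend_avg = by_dow.iloc[5:].mean()
print(round(weekday_avg), round(weekend_avg))
```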

Daytime Traffic Volume : Day of Week Influence

Visualize Data : Line Graph of Day of Week Mean Traffic Volume

Plotting the mean values for each day of the week we can see that traffic volume is lower on Saturday and Sunday than it is from Monday-Friday.

In [12]:
plt.plot(by_dayofweek["traffic_volume"])
plt.title("Traffic Volume by Day of Week")
plt.xlabel("Day of Week")
plt.ylabel("Mean Traffic Volume")
plt.xticks(ticks=[0,1,2,3,4,5,6], labels=["Mon", "Tues", "Weds", "Thurs", "Fri", "Sat", "Sun"])
plt.show()

Daytime Traffic Volume - Business Days vs. Weekends

Generate Hourly Statistics

Split the dataset again, this time based on the day of week value, making a set for business days and a set for weekends.

Calculate the mean traffic volume per hour for each day of week set.

At first glance we can already see that 7am has roughly 4x higher traffic volume on weekdays (~6000) compared with weekends (~1600).

In [13]:
business = dayset[dayset["dayofweek"] <=4].copy()
weekend = dayset[dayset["dayofweek"] > 4].copy()
by_hour_business = business.groupby("hour").mean(numeric_only=True)
by_hour_weekend = weekend.groupby("hour").mean(numeric_only=True)

print("Traffic Volume per Hour on Business Days (Mon-Fri)")
print(by_hour_business["traffic_volume"])
print()
print("Traffic Volume per Hour on Weekends (Sat-Sun)")
print(by_hour_weekend["traffic_volume"])
Traffic Volume per Hour on Business Days (Mon-Fri)
hour
7     6030.413559
8     5503.497970
9     4895.269257
10    4378.419118
11    4633.419470
12    4855.382143
13    4859.180473
14    5152.995778
15    5592.897768
16    6189.473647
17    5784.827133
18    4434.209431
Name: traffic_volume, dtype: float64

Traffic Volume per Hour on Weekends (Sat-Sun)
hour
7     1589.365894
8     2338.578073
9     3111.623917
10    3686.632302
11    4044.154955
12    4372.482883
13    4362.296564
14    4358.543796
15    4342.456881
16    4339.693805
17    4151.919929
18    3811.792279
Name: traffic_volume, dtype: float64
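
To put numbers on the comparison, the sketch below divides the business-day mean by the weekend mean for each hour and finds each group's peak hour. The values are copied from the output above; the dictionary and variable names are mine.

```python
# Hourly means copied from the two printed series above.
business = {
    7: 6030.413559, 8: 5503.497970, 9: 4895.269257, 10: 4378.419118,
    11: 4633.419470, 12: 4855.382143, 13: 4859.180473, 14: 5152.995778,
    15: 5592.897768, 16: 6189.473647, 17: 5784.827133, 18: 4434.209431,
}
weekend = {
    7: 1589.365894, 8: 2338.578073, 9: 3111.623917, 10: 3686.632302,
    11: 4044.154955, 12: 4372.482883, 13: 4362.296564, 14: 4358.543796,
    15: 4342.456881, 16: 4339.693805, 17: 4151.919929, 18: 3811.792279,
}

# How many times busier a business day is than a weekend, hour by hour.
ratio = {hour: business[hour] / weekend[hour] for hour in business}
print(round(ratio[7], 2))  # 3.79

# Hour with the highest mean volume in each group.
peak_hour_business = max(business, key=business.get)
peak_hour_weekend = max(weekend, key=weekend.get)
print(peak_hour_business, peak_hour_weekend)  # 16 12
```

The gap is widest at 7:00 and narrows to well under 1.5x by midday.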

Daytime Traffic Volume - Business Days vs. Weekends

Visualize Data - Plot Grid of Hourly Traffic Volume Comparison

The plot grid below shows very different hourly traffic volume patterns between Business Days and Weekends, as represented by the different shapes of their line graphs.

On business days the traffic volume starts high early in the morning, comes down mid morning and then slowly builds to another peak late afternoon/early evening. This peak mean vehicle traffic represents the times of day people are most likely to be commuting.

On weekends, however, traffic volume early in the morning is low, climbing steadily until noon when it stabilizes at a medium level for the rest of the afternoon before tapering off slowly into evening.

In [14]:
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(by_hour_business["traffic_volume"])
plt.title("Business Day (M-F) Traffic Volume by Hour")
plt.xlabel("Hour")
plt.xticks(ticks=[8, 10, 12, 14, 16, 18], labels=["8:00", "10:00", "12:00", "14:00", "16:00", "18:00"], rotation=30)
plt.ylabel("Mean Traffic Volume")
plt.ylim([0,7000])

plt.subplot(1,2,2)
plt.plot(by_hour_weekend["traffic_volume"])
plt.title("Weekend (Sat-Sun) Traffic Volume by Hour")
plt.xlabel("Hour")
plt.xticks(ticks=[8, 10, 12, 14, 16, 18], labels=["8:00", "10:00", "12:00", "14:00", "16:00", "18:00"], rotation=30)
plt.ylabel("Mean Traffic Volume")
# same y-range as the business-day panel so the two lines are directly comparable
plt.ylim([0,7000])

plt.show()
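
Side-by-side panels are easiest to compare when they share a y-axis. One way to guarantee that, sketched here with plt.subplots(sharey=True) rather than the plt.subplot calls used in this notebook, is shown below. The hourly values are the means printed earlier, rounded to whole vehicles; the Agg backend line is only there so the sketch runs outside a notebook and can be dropped inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt

hours = list(range(7, 19))
# Hourly means from the earlier output, rounded to whole vehicles.
business = [6030, 5503, 4895, 4378, 4633, 4855, 4859, 5153, 5593, 6189, 5785, 4434]
weekend = [1589, 2339, 3112, 3687, 4044, 4372, 4362, 4359, 4342, 4340, 4152, 3812]

# sharey=True ties the y-axis limits of both panels together.
fig, (ax_business, ax_weekend) = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

ax_business.plot(hours, business)
ax_business.set_title("Business Day Traffic Volume by Hour")
ax_weekend.plot(hours, weekend)
ax_weekend.set_title("Weekend Traffic Volume by Hour")

for ax in (ax_business, ax_weekend):
    ax.set_xlabel("Hour")
ax_business.set_ylabel("Mean Traffic Volume")

# Setting the limits on one panel applies to both because the axis is shared.
ax_business.set_ylim(0, 7000)
```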

Daytime Traffic Volume - Weather Influence

Explore Correlation with Weather Metrics

Viewing the correlation values, we find mostly a weak [absolute] correlation of traffic volume with the numerical weather metrics. Temperature appears to have slightly more influence than precipitation or cloud cover.

In [15]:
print("Correlation with Traffic Volume:")
print(
dayset[["temp","rain_1h", "snow_1h", "clouds_all"]].corrwith(dayset["traffic_volume"])
)
Correlation with Traffic Volume:
temp          0.128317
rain_1h       0.003697
snow_1h       0.001265
clouds_all   -0.032932
dtype: float64
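
For a sense of what these numbers measure, the sketch below computes a Pearson correlation by hand on four invented rows (toy is made up, not the real data) and checks it against Series.corr, which uses Pearson by default. It then squares the temperature correlation reported above; under a simple straight-line fit, that square is a rough guide to the share of variation in traffic volume that temperature would account for.

```python
import pandas as pd

# Invented rows, just to exercise the formula.
toy = pd.DataFrame({
    "temp": [270.0, 280.0, 290.0, 300.0],
    "traffic_volume": [4000.0, 4600.0, 4400.0, 5000.0],
})

# Pearson r: sum of products of deviations over the root of the
# product of the summed squared deviations.
x_dev = toy["temp"] - toy["temp"].mean()
y_dev = toy["traffic_volume"] - toy["traffic_volume"].mean()
manual_r = (x_dev * y_dev).sum() / ((x_dev ** 2).sum() * (y_dev ** 2).sum()) ** 0.5
pandas_r = toy["temp"].corr(toy["traffic_volume"])
print(round(manual_r, 4), round(pandas_r, 4))

# Correlation of temp with traffic_volume, copied from the output above.
temp_r = 0.128317
temp_r_squared = temp_r ** 2
print(round(temp_r_squared, 4))  # 0.0165
```

At under 2%, even the strongest of the four metrics says very little about daytime traffic volume.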

Daytime Traffic Volume - Weather Influence

Bar Plot of Weather Descriptors

Create statistics based on weather qualifiers (type categories).

Looking at the Main Weather Events there are no values that stand out as above average.

However, when looking in more detail, Weather Events that involve both Rain and Snow have higher traffic volume. Specifically, the weather descriptions "shower snow" and "light rain and snow" have higher traffic volume. The explanation could involve icy conditions.

The only other event combining both types of precipitation is "light shower and snow". Assuming the severity of rain follows a hierarchy from least to most severe of [light shower, shower, light rain], a "light shower" combined with snow may not lead to icy conditions.

In [16]:
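# numeric_only=True keeps text columns out of the mean (newer pandas raises on them)
by_weather_main = dayset.groupby('weather_main').mean(numeric_only=True)
by_weather_description = dayset.groupby('weather_description').mean(numeric_only=True)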

by_weather_main["traffic_volume"].plot.barh()
plt.title("Traffic Volume for Main Weather Events")
plt.xlabel("Traffic Volume")
plt.ylabel("Main Weather Events")
plt.show()

by_weather_description["traffic_volume"].plot.barh(figsize=(5,10))
plt.show()
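
A group mean is only as trustworthy as the number of rows behind it, and rare weather descriptions may rest on very few hours. The sketch below, on a made-up six-row frame (toy and its values are invented), reports the mean and the row count together and sorts by the mean so the highest groups come first.

```python
import pandas as pd

# Invented rows: two common descriptions and one rare one.
toy = pd.DataFrame({
    "weather_description": [
        "sky is clear", "sky is clear", "sky is clear",
        "light rain", "light rain", "shower snow",
    ],
    "traffic_volume": [4800, 4900, 4700, 4600, 4800, 5600],
})

# Mean and number of rows per description, highest mean first.
summary = (
    toy.groupby("weather_description")["traffic_volume"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)

# The top group here is backed by a single row.
top_description = summary.index[0]
top_count = int(summary.loc[top_description, "count"])
print(top_description, top_count)
```

Applying the same agg call to dayset would show how many rows sit behind each bar before reading too much into the tallest ones.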

Summary of Conclusions

Daytime Vs. Nighttime

The Nighttime (7pm - 7am) Traffic Volume Frequency Histogram is Right-Skewed indicating low traffic volume. Data from Nighttime was excluded from the current analysis because it does not offer much insight into high traffic volume indicators.

Month

The Winter months (Nov-Feb) have lower traffic volume, with the lowest in Dec/Jan. This could be a combination of the holidays and the colder weather.

July was an exception to the higher traffic volumes observed in the Spring-Fall (Mar-Oct) with a noticeable dip. This dip is curious but does not provide insight on high traffic volume indicators.

Day of the Week

Business Days (Mon-Fri) showed much higher traffic volume than Weekends (Sat-Sun), and the line graphs of the hourly changes in traffic volume for each displayed two very distinct trends.

Business Day Traffic Volume peaked very high at 7:00 and then again at 16:00.

Weekend Traffic Volume started low in the morning and increased steadily to maintain a steady middle level throughout the afternoon.

Weather

There was mostly just a weak correlation between the numerical weather metrics (temperature, rain, snow, cloud cover) and traffic volume.

However, when looking a little more closely, weather events that specifically combined rain and snow had higher traffic volumes than other events that involved just rain or snow. An explanation could be icy road conditions.
