We're going to analyze a dataset about the westbound traffic on the I-94 Interstate highway.
John Hogue made the dataset available, and it can be downloaded from the UCI Machine Learning Repository. The data record hourly westbound traffic volume on Interstate 94 at MN DoT ATR station 301, roughly midway between Minneapolis and St. Paul, MN. Hourly weather features and holidays are included so that their impact on traffic volume can be examined.
The goal of our analysis is to determine a few indicators of heavy traffic on I-94. These indicators can be weather type, time of the day, time of the week, etc. For instance, we may find out that the traffic is usually heavier in the summer or when it snows.
We'll start by reading the csv file and looking at the first and the last five rows, together with some information about the dataset. The different columns refer to:
import pandas as pd
metro = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
print('First 5 rows:\n', metro.head())
print('Last 5 rows:\n', metro.tail())
#info() prints its report itself and returns None, so it is not wrapped in print()
print('Additional info:')
metro.info()
First 5 rows:
   holiday    temp  rain_1h  snow_1h  clouds_all weather_main  \
0    None  288.28      0.0      0.0          40       Clouds
1    None  289.36      0.0      0.0          75       Clouds
2    None  289.58      0.0      0.0          90       Clouds
3    None  290.13      0.0      0.0          90       Clouds
4    None  291.14      0.0      0.0          75       Clouds

  weather_description            date_time  traffic_volume
0    scattered clouds  2012-10-02 09:00:00            5545
1       broken clouds  2012-10-02 10:00:00            4516
2     overcast clouds  2012-10-02 11:00:00            4767
3     overcast clouds  2012-10-02 12:00:00            5026
4       broken clouds  2012-10-02 13:00:00            4918
Last 5 rows:
       holiday    temp  rain_1h  snow_1h  clouds_all  weather_main  \
48199    None  283.45      0.0      0.0          75        Clouds
48200    None  282.76      0.0      0.0          90        Clouds
48201    None  282.73      0.0      0.0          90  Thunderstorm
48202    None  282.09      0.0      0.0          90        Clouds
48203    None  282.12      0.0      0.0          90        Clouds

          weather_description            date_time  traffic_volume
48199           broken clouds  2018-09-30 19:00:00            3543
48200         overcast clouds  2018-09-30 20:00:00            2781
48201  proximity thunderstorm  2018-09-30 21:00:00            2159
48202         overcast clouds  2018-09-30 22:00:00            1450
48203         overcast clouds  2018-09-30 23:00:00             954
Additional info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
The output above shows that the 'date_time' column ranges from '2012-10-02 09:00:00' to '2018-09-30 23:00:00', so we have six years of hourly traffic and weather records.
No null values appear in the data, and the column types vary: strings (including 'date_time', which is not yet stored as a datetime type), floats, and integers.
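The checks described above (missing values, date range, column types) can also be run explicitly instead of being read off the info() report. The following is a minimal sketch on a tiny stand-in frame; `sample` is a hypothetical name, and its three rows are copied from the first rows of the dataset shown above.

```python
import pandas as pd

# Tiny stand-in for the real file (hypothetical name `sample`);
# the rows are copied from the first rows of the dataset.
sample = pd.DataFrame({
    'temp': [288.28, 289.36, 289.58],
    'weather_main': ['Clouds', 'Clouds', 'Clouds'],
    'date_time': ['2012-10-02 09:00:00', '2012-10-02 10:00:00',
                  '2012-10-02 11:00:00'],
    'traffic_volume': [5545, 4516, 4767],
})

# Number of missing values in each column
null_counts = sample.isnull().sum()
print(null_counts)

# Earliest and latest timestamp, after parsing the strings as datetimes
timestamps = pd.to_datetime(sample['date_time'])
print(timestamps.min(), timestamps.max())

# Data type of each column
print(sample.dtypes)
```

On the full dataset, the same three calls would be applied to `metro` instead of `sample`.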
Next, we're going to plot a histogram to visualize the distribution of the traffic_volume column, and see some statistics on this variable.
import matplotlib.pyplot as plt
%matplotlib inline
#Histogram of 'traffic_volume' column
metro['traffic_volume'].plot.hist(title='Traffic volume frequency',xlabel='Traffic volume', ylabel='Frequency', bins=10)
plt.show()
#Additional info
print('Additional statistics:\n', metro['traffic_volume'].describe())
Additional statistics:
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
From the previous histogram and statistics, we can observe that the most frequent traffic volumes in our dataset are either below 1,000 cars per hour or around 5,000 cars per hour. Because the data have two peaks instead of a single one centered on the mean, the distribution is not normal and the mean (about 3,260 cars per hour) does not represent the data well.
Daytime and nighttime probably affect this distribution: at night, fewer than 1,000 cars per hour may use this road, while during the day it may be much busier.
This possibility that nighttime and daytime might influence traffic volume gives our analysis an interesting direction: comparing daytime with nighttime data.
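We'll start by dividing the dataset into two parts: daytime data (hours from 7:00 to 19:00, 12 hours) and nighttime data (hours from 19:00 to 7:00, 12 hours).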
While this is not a perfect criterion for distinguishing between nighttime and daytime, it's a good starting point.
#Transform the 'date_time' column to datetime type
metro['date_time'] = pd.to_datetime(metro['date_time'])
#Isolate the daytime and nighttime data in two different datasets
daytime_data = metro[(metro['date_time'].dt.hour >= 7) & (metro['date_time'].dt.hour < 19)]
nighttime_data = metro[(metro['date_time'].dt.hour < 7) | (metro['date_time'].dt.hour >= 19)]
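As a sanity check on this kind of split, it helps to confirm that every hour falls into exactly one of the two groups. The sketch below does so on a hypothetical stand-in series of 48 consecutive hourly timestamps (`stamps`, `is_day`, and `is_night` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Hypothetical stand-in: one timestamp per hour for two full days
offsets = pd.to_timedelta(list(range(48)), unit='h')
stamps = pd.Series(pd.Timestamp('2012-10-03 00:00:00') + offsets)

# Same criterion as in the notebook: daytime is 7:00 (inclusive) to 19:00 (exclusive)
hours = stamps.dt.hour
is_day = (hours >= 7) & (hours < 19)
is_night = (hours < 7) | (hours >= 19)

# Every row should belong to exactly one of the two groups
print('Daytime rows:', is_day.sum())
print('Nighttime rows:', is_night.sum())
print('Rows in both groups:', (is_day & is_night).sum())
print('Rows in neither group:', (~is_day & ~is_night).sum())
```

Because the two conditions are exact complements, `~is_day` could replace the second mask; writing both out mirrors the notebook's code.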
Now we're going to compare the traffic volume at night and during the day.
plt.figure(figsize=(8,3), constrained_layout=True)
#Subplot for daytime data
plt.subplot(1, 2, 1)
plt.hist(daytime_data['traffic_volume'])
plt.xlim(0,8000)
plt.ylim(0,8000)
plt.title('Daytime traffic volume distribution')
plt.xlabel('Cars per hour')
plt.ylabel('Frequency')
#Subplot for nighttime data
plt.subplot(1, 2, 2)
plt.hist(nighttime_data['traffic_volume'])
plt.xlim(0,8000)
plt.ylim(0,8000)
plt.title('Nighttime traffic volume distribution')
plt.xlabel('Cars per hour')
plt.ylabel('Frequency')
plt.show()
#Additional statistics
print('Daytime data:\n', daytime_data['traffic_volume'].describe(), '\n')
print('Nighttime data:\n', nighttime_data['traffic_volume'].describe())
Daytime data:
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64

Nighttime data:
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64
From the plots and statistics above, we can conclude that traffic differs greatly between daytime and nighttime.
Daytime traffic volume looks roughly bell-shaped and slightly left-skewed (the mean, 4,762, is just below the median, 4,820), while nighttime traffic volume is strongly right-skewed: low values are the most frequent, and the frequency drops as the volume increases.
Therefore, as the traffic at night is generally light (75% of the nighttime values are at or below 2,819 cars per hour), we'll use only the daytime data to continue our analysis.
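The shapes of the two distributions can be cross-checked from the summary statistics alone: when the mean sits below the median the data lean left, and when it sits above the median they lean right. The sketch below applies this rough rule of thumb to the numbers reported by describe() above; `mean_median_gap` is a hypothetical helper, not a pandas function, and it is only a hint, not a formal test of normality.

```python
# Summary statistics copied from the describe() output above
daytime = {'mean': 4762.047452, 'median': 4820.0, 'std': 1174.546482}
nighttime = {'mean': 1785.377441, 'median': 1287.0, 'std': 1441.951197}

def mean_median_gap(stats):
    """Return (mean - median) / std, a rough, unitless hint of skew direction."""
    return (stats['mean'] - stats['median']) / stats['std']

# Negative values hint at a left skew, positive values at a right skew
day_gap = mean_median_gap(daytime)
night_gap = mean_median_gap(nighttime)
print('Daytime gap:', round(day_gap, 2))
print('Nighttime gap:', round(night_gap, 2))
```

With the raw data at hand, `Series.skew()` would give a more standard measure of the same property.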
Previously, we determined that the traffic at night is generally light. Our goal is to find indicators of heavy traffic, so we decided to only focus on the daytime data moving forward.
One of the possible indicators of heavy traffic is time. There might be more people on the road in a certain month, on a certain day, or at a certain time of the day.
We're going to look at a few line plots showing how the traffic volume changed according to the following parameters: month, day of the week, and time of day.
#Create a new column containing the month
#(assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
daytime_data = daytime_data.assign(month=daytime_data['date_time'].dt.month)
#Aggregate the data and average it by month
#(numeric_only=True skips the text columns, which cannot be averaged)
by_month = daytime_data.groupby('month').mean(numeric_only=True)
print(by_month['traffic_volume'])
#Line plot on 'traffic_volume' x month data
plt.plot(by_month['traffic_volume'])
plt.xlabel('Month')
plt.title('Traffic volume x month')
plt.show()
month
1     4495.613727
2     4711.198394
3     4889.409560
4     4906.894305
5     4911.121609
6     4898.019566
7     4595.035744
8     4928.302035
9     4870.783145
10    4921.234922
11    4704.094319
12    4374.834566
Name: traffic_volume, dtype: float64
The table of monthly averages and the line plot above show that traffic volume is high and fairly stable from March to June and from August to October, while the lowest values appear in January, July, and December, probably because of vacation periods.
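Adding a column to a DataFrame that was itself produced by filtering another DataFrame can trigger pandas' SettingWithCopyWarning. One common way to avoid it, sketched here on a hypothetical stand-in frame (`full` and `day` are illustrative names), is to take an explicit copy of the filtered rows before adding columns.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset
full = pd.DataFrame({
    'date_time': pd.to_datetime(['2012-10-02 09:00:00',
                                 '2012-10-02 23:00:00',
                                 '2012-11-05 10:00:00']),
    'traffic_volume': [5545, 954, 4516],
})

# Filter the daytime rows, then take an explicit copy so the result owns its data
hours = full['date_time'].dt.hour
day = full[(hours >= 7) & (hours < 19)].copy()

# Adding a column to the copy leaves the original frame untouched
day['month'] = day['date_time'].dt.month
print(day)
print(full.columns.tolist())
```

`DataFrame.assign` is an alternative that returns a new frame with the extra column instead of modifying one in place.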
We'll now continue with building line plots for another time unit: day of the week.
#Create a new column containing the day of the week
#(assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
daytime_data = daytime_data.assign(dayofweek=daytime_data['date_time'].dt.dayofweek)
#Aggregate the data and average it by day of the week
#(numeric_only=True skips the text columns, which cannot be averaged)
by_dayofweek = daytime_data.groupby('dayofweek').mean(numeric_only=True)
print(by_dayofweek['traffic_volume']) # 0 is Monday, 6 is Sunday
#Line plot on 'traffic_volume' x day of the week data
plt.plot(by_dayofweek['traffic_volume'])
plt.xlabel('Day of week')
plt.title('Traffic volume x day of week')
plt.show()
dayofweek
0    4893.551286
1    5189.004782
2    5284.454282
3    5311.303730
4    5291.600829
5    3927.249558
6    3436.541789
Name: traffic_volume, dtype: float64
The table of daily averages and the line plot above show that traffic volume is high and fairly even from Monday to Friday (slightly lower on Mondays), while the lowest values appear on the weekend, probably because far fewer people commute to work by car on those days.
We'll now generate a line plot for the time of day. The weekends, however, will drag down the average values, so we're going to look at the averages separately. To do that, we'll start by splitting the data based on the day type: business day or weekend.
#Create a new column containing the hour of the day
#(assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
daytime_data = daytime_data.assign(hour=daytime_data['date_time'].dt.hour)
#Split the data in business days and weekends datasets
business_days = daytime_data[daytime_data['dayofweek'] <= 4].copy() # 4 == Friday
weekend = daytime_data[daytime_data['dayofweek'] >= 5].copy() # 5 == Saturday
#Aggregate the data and average it by hour of the day
#(numeric_only=True skips the text columns, which cannot be averaged)
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)
print(by_hour_business['traffic_volume'])
print(by_hour_weekend['traffic_volume'])
#Grid with line plots on 'traffic_volume' x hour of the day data
plt.figure(figsize=(8,3), constrained_layout=True)
#Line plot for business days
plt.subplot(1, 2, 1)
plt.ylim(0,6500)
plt.plot(by_hour_business['traffic_volume'])
plt.xlabel('Hour of the day')
plt.title('Traffic volume x hour (business days)')
#Line plot for weekends
plt.subplot(1, 2, 2)
plt.ylim(0,6500)
plt.plot(by_hour_weekend['traffic_volume'])
plt.xlabel('Hour of the day')
plt.title('Traffic volume x hour (weekend)')
plt.show()
hour
7     6030.413559
8     5503.497970
9     4895.269257
10    4378.419118
11    4633.419470
12    4855.382143
13    4859.180473
14    5152.995778
15    5592.897768
16    6189.473647
17    5784.827133
18    4434.209431
Name: traffic_volume, dtype: float64
hour
7     1589.365894
8     2338.578073
9     3111.623917
10    3686.632302
11    4044.154955
12    4372.482883
13    4362.296564
14    4358.543796
15    4342.456881
16    4339.693805
17    4151.919929
18    3811.792279
Name: traffic_volume, dtype: float64
The tables of hourly averages and the line plots above show that on business days traffic peaks at 7:00 and 16:00 (above 6,000 cars per hour), while on weekends it rises steadily through the morning, levels off at around 4,300 to 4,400 cars per hour from 12:00 to 16:00, and then declines slightly. This difference probably reflects how the car is used: commuting to work on business days, leisure on weekends.
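The peak hours can also be read off programmatically rather than from the plot. The sketch below rebuilds the two series from the hourly averages printed above (the variable names `business` and `weekend_avg` are illustrative) and uses `idxmax` and `nlargest` to locate the busiest hours.

```python
import pandas as pd

hours = list(range(7, 19))

# Hourly averages copied from the printed output above
business = pd.Series(
    [6030.413559, 5503.497970, 4895.269257, 4378.419118,
     4633.419470, 4855.382143, 4859.180473, 5152.995778,
     5592.897768, 6189.473647, 5784.827133, 4434.209431],
    index=hours)
weekend_avg = pd.Series(
    [1589.365894, 2338.578073, 3111.623917, 3686.632302,
     4044.154955, 4372.482883, 4362.296564, 4358.543796,
     4342.456881, 4339.693805, 4151.919929, 3811.792279],
    index=hours)

# Hour with the highest average volume in each group
print('Busiest business-day hour:', business.idxmax())
print('Busiest weekend hour:', weekend_avg.idxmax())

# The two busiest business-day hours, highest first
top_two_business = business.nlargest(2).index.tolist()
print('Top two business-day hours:', top_two_business)
```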
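So far, we've focused on finding time indicators for heavy traffic, and we reached the following conclusions: average daytime traffic is lowest in January, July, and December; it is heavier on business days than on weekends; and on business days it peaks at around 7:00 and 16:00.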
Another possible indicator of heavy traffic is weather. The dataset provides us with a few useful columns about weather: temp, rain_1h, snow_1h, clouds_all, weather_main, weather_description.
A few of these columns are numerical so let's start by looking up their correlation values with traffic_volume.
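The correlation of several numeric columns with traffic_volume can be computed in a single call with `DataFrame.corrwith`, which returns one Pearson coefficient per column. The sketch below uses a small frame of made-up values (`toy` is hypothetical and its numbers are not taken from the dataset), so the coefficients it prints say nothing about the real data.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the dataset
toy = pd.DataFrame({
    'temp': [280.0, 285.0, 290.0, 295.0, 300.0],
    'rain_1h': [0.0, 0.0, 0.3, 0.0, 1.2],
    'clouds_all': [90, 75, 40, 20, 1],
    'traffic_volume': [4100, 4500, 4700, 5200, 5300],
})

weather_cols = ['temp', 'rain_1h', 'clouds_all']

# One Pearson correlation per weather column, each against traffic_volume
correlations = toy[weather_cols].corrwith(toy['traffic_volume'])
print(correlations.round(2))
```

Computing each coefficient with `Series.corr`, one pair at a time, gives the same numbers and keeps each value in its own variable.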
#info() prints its report itself and returns None, so it is not wrapped in print()
daytime_data.info()
corr_traffic_temp = daytime_data['traffic_volume'].corr(daytime_data['temp'])
corr_traffic_rain = daytime_data['traffic_volume'].corr(daytime_data['rain_1h'])
corr_traffic_snow = daytime_data['traffic_volume'].corr(daytime_data['snow_1h'])
corr_traffic_clouds = daytime_data['traffic_volume'].corr(daytime_data['clouds_all'])
print('The correlation of traffic volume with temperature is:', round(corr_traffic_temp,2))
print('The correlation of traffic volume with rain is:', round(corr_traffic_rain,2))
print('The correlation of traffic volume with snow is:', round(corr_traffic_snow,2))
print('The correlation of traffic volume with clouds is:', round(corr_traffic_clouds,2))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23877 entries, 0 to 48198
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              23877 non-null  object
 1   temp                 23877 non-null  float64
 2   rain_1h              23877 non-null  float64
 3   snow_1h              23877 non-null  float64
 4   clouds_all           23877 non-null  int64
 5   weather_main         23877 non-null  object
 6   weather_description  23877 non-null  object
 7   date_time            23877 non-null  datetime64[ns]
 8   traffic_volume       23877 non-null  int64
 9   month                23877 non-null  int64
 10  dayofweek            23877 non-null  int64
 11  hour                 23877 non-null  int64
dtypes: datetime64[ns](1), float64(3), int64(5), object(3)
memory usage: 2.4+ MB
The correlation of traffic volume with temperature is: 0.13
The correlation of traffic volume with rain is: 0.0
The correlation of traffic volume with snow is: 0.0
The correlation of traffic volume with clouds is: -0.03
It's very rare to find correlations of exactly 0, as reported here for rain and snow (the values are rounded to two decimals). Therefore, we'll inspect these variables in scatter plots to see what's happening:
#Scatter plots for snow and rain variables
plt.scatter(x=daytime_data['snow_1h'], y=daytime_data['traffic_volume'])
plt.xlabel('Numeric Amount in mm of snow that occurred in the hour')
plt.ylabel('Traffic volume')
plt.show()
plt.scatter(x=daytime_data['rain_1h'], y=daytime_data['traffic_volume'])
plt.xlabel('Numeric Amount in mm of rain that occurred in the hour')
plt.ylabel('Traffic volume')
plt.show()
The graphs above show that there's a clear outlier in the rain data. Therefore, we'll remove that row and calculate that particular correlation again:
#Correcting data regarding outliers in rain data
daytime_data = daytime_data[daytime_data['rain_1h'] < 50]
#Scatter plot for rain x traffic
plt.scatter(x=daytime_data['rain_1h'], y=daytime_data['traffic_volume'])
plt.xlabel('Numeric Amount in mm of rain that occurred in the hour')
plt.ylabel('Traffic volume')
plt.show()
corr_traffic_rain = daytime_data['traffic_volume'].corr(daytime_data['rain_1h'])
print('The correlation of traffic volume with temperature is:', round(corr_traffic_temp,2))
print('The correlation of traffic volume with rain is:', round(corr_traffic_rain,2))
print('The correlation of traffic volume with snow is:', round(corr_traffic_snow,2))
print('The correlation of traffic volume with clouds is:', round(corr_traffic_clouds,2))
The correlation of traffic volume with temperature is: 0.13
The correlation of traffic volume with rain is: -0.04
The correlation of traffic volume with snow is: 0.0
The correlation of traffic volume with clouds is: -0.03
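The effect that a single extreme value can have on a Pearson correlation coefficient can be reproduced on toy data. In the sketch below all numbers are made up for illustration: six points follow a perfectly linear downward trend, and one extreme rain reading at an ordinary traffic level is then appended. The threshold of 50 mirrors the filter used in the notebook.

```python
import pandas as pd

# Made-up toy values: traffic falls steadily as rain increases
rain = pd.Series([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
traffic = pd.Series([5000, 4900, 4800, 4700, 4600, 4500])

# Append one extreme (hypothetical) rain reading at an ordinary traffic level
rain_out = pd.concat([rain, pd.Series([9000.0])], ignore_index=True)
traffic_out = pd.concat([traffic, pd.Series([4750])], ignore_index=True)

# Correlation with the outlier included
r_with = traffic_out.corr(rain_out)

# Correlation after filtering the outlier with the same kind of threshold
mask = rain_out < 50
r_without = traffic_out[mask].corr(rain_out[mask])

print('With outlier:', round(r_with, 2))
print('Without outlier:', round(r_without, 2))
```

The outlier dominates the variance of the rain values, which pushes the coefficient toward zero even though the remaining points are strongly related.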
It seems that the weather variable with the highest correlation with traffic volume is temperature: the higher the temperature, the higher the traffic (although it's a very weak relation - r = 0.13). We'll proceed now to check the scatterplot of these two variables:
#Scatter plot for traffic x temperature
plt.scatter(x=daytime_data['temp'], y=daytime_data['traffic_volume'])
plt.xlabel('Numeric Average temp in kelvin')
plt.ylabel('Traffic volume')
plt.show()
The graph above shows two outliers with temp = 0 K, which is absolute zero and therefore not a plausible air temperature. We'll remove those rows and redo the graph and the correlation:
#Correcting data regarding outliers in temperature
daytime_data = daytime_data[daytime_data['temp'] > 200]
#Scatter plot for traffic x temperature
plt.scatter(x=daytime_data['temp'], y=daytime_data['traffic_volume'])
plt.xlabel('Numeric Average temp in kelvin')
plt.ylabel('Traffic volume')
plt.show()
#Corrected correlation of traffic x temperature
corr_traffic_temp = daytime_data['traffic_volume'].corr(daytime_data['temp'])
print('The correlation of traffic volume with temperature is:', round(corr_traffic_temp,2))
print('The correlation of traffic volume with rain is:', round(corr_traffic_rain,2))
print('The correlation of traffic volume with snow is:', round(corr_traffic_snow,2))
print('The correlation of traffic volume with clouds is:', round(corr_traffic_clouds,2))
The correlation of traffic volume with temperature is: 0.13
The correlation of traffic volume with rain is: -0.04
The correlation of traffic volume with snow is: 0.0
The correlation of traffic volume with clouds is: -0.03
Therefore, the weather variable with the highest correlation with traffic volume is temperature: the higher the temperature, the higher the traffic (although it's a very weak relation - r = 0.13). So, it's not a reliable indicator of heavy traffic.
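Since the temp column is recorded in kelvin, converting a few values to degrees Celsius makes it easier to judge whether a reading is plausible. The helper below is a hypothetical convenience function, not part of the notebook; it uses the standard offset of 273.15 between the two scales.

```python
def kelvin_to_celsius(kelvin):
    """Convert a temperature from kelvin to degrees Celsius."""
    return kelvin - 273.15

# 288.28 K is the first temperature shown in the dataset preview
first_reading_c = kelvin_to_celsius(288.28)
print('288.28 K in Celsius:', round(first_reading_c, 2))

# 0 K is absolute zero, far below any real air temperature
absolute_zero_c = kelvin_to_celsius(0.0)
print('0 K in Celsius:', absolute_zero_c)

# The notebook's cutoff of 200 K, expressed in Celsius
cutoff_c = kelvin_to_celsius(200.0)
print('200 K in Celsius:', round(cutoff_c, 2))
```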
To see if we can find more useful data, we'll look next at the categorical weather-related columns: weather_main and weather_description.
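#Grouping and averaging for categorical weather variables
#(numeric_only=True skips the text columns, which cannot be averaged)
by_weather_main = daytime_data.groupby('weather_main').mean(numeric_only=True)
by_weather_description = daytime_data.groupby('weather_description').mean(numeric_only=True)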
#Horizontal bar plot for weather_main: Categorical Short textual description of the current weather
by_weather_main['traffic_volume'].plot(kind='barh')
plt.show()
#Horizontal bar plot for weather_description: Categorical Longer textual description of the current weather
by_weather_description['traffic_volume'].plot(kind='barh', figsize=(6, 12))
plt.show()
The second graph shows that 'shower snow' and 'light rain and snow' exceed 5,000 cars per hour on average, making them the weather descriptions with the highest traffic volumes.
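When comparing group averages like these, it can be useful to look at how many rows stand behind each average, because a category with very few observations can show a high mean by chance. The sketch below shows one way to get both numbers at once with `agg`; the frame `toy` and all of its values are made up, and only the category name 'shower snow' is borrowed from the text above.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the dataset
toy = pd.DataFrame({
    'weather_description': ['shower snow', 'sky is clear', 'sky is clear',
                            'sky is clear', 'light rain', 'light rain'],
    'traffic_volume': [5600, 4800, 5100, 4600, 4700, 4900],
})

# Mean traffic volume and number of rows for each weather description
summary = (toy.groupby('weather_description')['traffic_volume']
              .agg(['mean', 'count'])
              .sort_values('mean', ascending=False))
print(summary)
```

On the real data, the same call on `daytime_data` would show whether the top categories rest on many hours or only a handful.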
From the analysis we have conducted, we can conclude that traffic volume on this road is affected by: