The goal of this project is to obtain indicators, and possible causes or effects, of heavy or light traffic in the Minneapolis-St Paul area in Minnesota on I-94. The factors considered are rain, snow, clouds, temperature, categories and descriptions of weather, holidays, day of the week, hour of the day, and month of the year.
I will be using the Metro Interstate Traffic Volume dataset, which can be found here: Data Set
John Hauge made the dataset available. The dataset comes from MN DoT ATR station 301, halfway between St Paul and Minneapolis (mid-metro and an area of dense population). The station records westbound traffic, from St Paul toward Minneapolis.
Minnesota is a cold state in the north-central USA along the Canadian border, with long snowy winters and short, moderately hot summers. I have lived in Minnesota for the majority of my life, including in the western half of the Twin Cities metro area.
The data set spans 2012-2018, but it effectively forms two separate sets: 2012-2014 and 2015-2018, with a gap in between.
The data set has the following columns:
index | column | description |
---|---|---|
0 | holiday | Name of holiday, if any, found only at the midnight hour of the day |
1 | temp | Average temperature in Kelvin (K) |
2 | rain_1h | mm/hour of rain |
3 | snow_1h | mm/hour of snow |
4 | clouds_all | Percentage of cloud cover |
5 | weather_main | Main weather type |
6 | weather_description | Description of the weather |
7 | date_time | Date and time in datetime format |
8 | traffic_volume | Traffic volume in vehicles/hour |
# Read in the data
import pandas as pd
traffic=pd.read_csv("Metro_Interstate_Traffic_Volume.csv",delimiter=',')
# Display the columns and data type, determine if there are any null values
traffic.info()
# No null values, all columns have the same length
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
# Examine the head of the data set to see examples of the type of data
traffic.head(5)
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
# Examine the possible values for 'weather_main' and the number of times they occur
traffic['weather_main'].value_counts()
Clouds          15164
Clear           13391
Mist             5950
Rain             5672
Snow             2876
Drizzle          1821
Haze             1360
Thunderstorm     1034
Fog               912
Smoke              20
Squall              4
Name: weather_main, dtype: int64
# Examine the possible values for 'holiday' and the number of times they occur
traffic['holiday'].value_counts()
None                         48143
Labor Day                        7
Thanksgiving Day                 6
Christmas Day                    6
New Years Day                    6
Martin Luther King Jr Day        6
Columbus Day                     5
Veterans Day                     5
Washingtons Birthday             5
Memorial Day                     5
Independence Day                 5
State Fair                       5
Name: holiday, dtype: int64
# Examine the possible values for 'weather_description' and the number of times they occur
traffic['weather_description'].value_counts()
sky is clear 11665 mist 5950 overcast clouds 5081 broken clouds 4666 scattered clouds 3461 light rain 3372 few clouds 1956 light snow 1946 Sky is Clear 1726 moderate rain 1664 haze 1360 light intensity drizzle 1100 fog 912 proximity thunderstorm 673 drizzle 651 heavy snow 616 heavy intensity rain 467 snow 293 proximity shower rain 136 thunderstorm 125 heavy intensity drizzle 64 thunderstorm with heavy rain 63 thunderstorm with light rain 54 proximity thunderstorm with rain 52 thunderstorm with rain 37 smoke 20 very heavy rain 18 thunderstorm with light drizzle 15 light intensity shower rain 13 proximity thunderstorm with drizzle 13 light shower snow 11 light rain and snow 6 shower drizzle 6 SQUALLS 4 sleet 3 thunderstorm with drizzle 2 freezing rain 2 shower snow 1 Name: weather_description, dtype: int64
# Convert the traffic datetime column to a datetime object
traffic['date_time']=pd.to_datetime(traffic['date_time'])
# Create a numerical year column
traffic['year']=traffic['date_time'].dt.strftime('%Y').astype(int)
# Create numerical month column
traffic['month']=traffic['date_time'].dt.strftime("%m").astype(int)
# Create numerical date column
traffic['day']=traffic['date_time'].dt.strftime('%d').astype(int)
# Create numerical hour column
traffic['hour']=traffic['date_time'].dt.strftime('%H').astype(int)
# Create numerical minute column
traffic['minute']=traffic['date_time'].dt.strftime('%M').astype(int)
# Create numerical second column
traffic['second']=traffic['date_time'].dt.strftime('%S').astype(int)
# Create a weekday column in the format Sunday, Monday... as strings
traffic['weekday']=traffic['date_time'].dt.strftime('%A').astype(str)
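As an aside, the same columns can be pulled straight from the pandas `.dt` accessor without the `strftime`/`astype` round-trip; a minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Tiny frame standing in for the traffic data (hypothetical sample timestamps)
df = pd.DataFrame({'date_time': pd.to_datetime(['2012-10-02 09:00:00',
                                                '2012-10-03 00:00:00'])})

# Numeric components come directly off .dt, no string formatting needed
df['year'] = df['date_time'].dt.year
df['month'] = df['date_time'].dt.month
df['day'] = df['date_time'].dt.day
df['hour'] = df['date_time'].dt.hour
df['weekday'] = df['date_time'].dt.day_name()  # e.g. 'Tuesday'
```

Both approaches produce the same integer columns; the accessor version is just more direct.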
# Confirm creation and formatting. Notice that rows 21 and 22 have a gap hour between them (06:00 jumps to 08:00). We will return to this later
traffic.head(30)
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | year | month | day | hour | minute | second | weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 | 2012 | 10 | 2 | 9 | 0 | 0 | Tuesday |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 | 2012 | 10 | 2 | 10 | 0 | 0 | Tuesday |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 | 2012 | 10 | 2 | 11 | 0 | 0 | Tuesday |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 | 2012 | 10 | 2 | 12 | 0 | 0 | Tuesday |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 | 2012 | 10 | 2 | 13 | 0 | 0 | Tuesday |
5 | None | 291.72 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 14:00:00 | 5181 | 2012 | 10 | 2 | 14 | 0 | 0 | Tuesday |
6 | None | 293.17 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 15:00:00 | 5584 | 2012 | 10 | 2 | 15 | 0 | 0 | Tuesday |
7 | None | 293.86 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 16:00:00 | 6015 | 2012 | 10 | 2 | 16 | 0 | 0 | Tuesday |
8 | None | 294.14 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-02 17:00:00 | 5791 | 2012 | 10 | 2 | 17 | 0 | 0 | Tuesday |
9 | None | 293.10 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-02 18:00:00 | 4770 | 2012 | 10 | 2 | 18 | 0 | 0 | Tuesday |
10 | None | 290.97 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-02 19:00:00 | 3539 | 2012 | 10 | 2 | 19 | 0 | 0 | Tuesday |
11 | None | 289.38 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 20:00:00 | 2784 | 2012 | 10 | 2 | 20 | 0 | 0 | Tuesday |
12 | None | 288.61 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 21:00:00 | 2361 | 2012 | 10 | 2 | 21 | 0 | 0 | Tuesday |
13 | None | 287.16 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 22:00:00 | 1529 | 2012 | 10 | 2 | 22 | 0 | 0 | Tuesday |
14 | None | 285.45 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 23:00:00 | 963 | 2012 | 10 | 2 | 23 | 0 | 0 | Tuesday |
15 | None | 284.63 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 00:00:00 | 506 | 2012 | 10 | 3 | 0 | 0 | 0 | Wednesday |
16 | None | 283.47 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 01:00:00 | 321 | 2012 | 10 | 3 | 1 | 0 | 0 | Wednesday |
17 | None | 281.18 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 02:00:00 | 273 | 2012 | 10 | 3 | 2 | 0 | 0 | Wednesday |
18 | None | 281.09 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 03:00:00 | 367 | 2012 | 10 | 3 | 3 | 0 | 0 | Wednesday |
19 | None | 279.53 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 04:00:00 | 814 | 2012 | 10 | 3 | 4 | 0 | 0 | Wednesday |
20 | None | 278.62 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 05:00:00 | 2718 | 2012 | 10 | 3 | 5 | 0 | 0 | Wednesday |
21 | None | 278.23 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 06:00:00 | 5673 | 2012 | 10 | 3 | 6 | 0 | 0 | Wednesday |
22 | None | 278.12 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 08:00:00 | 6511 | 2012 | 10 | 3 | 8 | 0 | 0 | Wednesday |
23 | None | 282.48 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 09:00:00 | 5471 | 2012 | 10 | 3 | 9 | 0 | 0 | Wednesday |
24 | None | 291.97 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 12:00:00 | 5097 | 2012 | 10 | 3 | 12 | 0 | 0 | Wednesday |
25 | None | 293.23 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 13:00:00 | 4887 | 2012 | 10 | 3 | 13 | 0 | 0 | Wednesday |
26 | None | 294.31 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 14:00:00 | 5337 | 2012 | 10 | 3 | 14 | 0 | 0 | Wednesday |
27 | None | 295.17 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 15:00:00 | 5692 | 2012 | 10 | 3 | 15 | 0 | 0 | Wednesday |
28 | None | 295.13 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 16:00:00 | 6137 | 2012 | 10 | 3 | 16 | 0 | 0 | Wednesday |
29 | None | 293.66 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-03 18:00:00 | 4623 | 2012 | 10 | 3 | 18 | 0 | 0 | Wednesday |
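The gap flagged above (06:00 jumping straight to 08:00 between rows 21 and 22) can also be found programmatically with `diff`; a sketch on hypothetical hourly timestamps, assuming the frame is sorted by `date_time`:

```python
import pandas as pd

# Hypothetical hourly timestamps with one missing hour (07:00)
times = pd.to_datetime(['2012-10-03 05:00', '2012-10-03 06:00',
                        '2012-10-03 08:00', '2012-10-03 09:00'])
df = pd.DataFrame({'date_time': times})

# Consecutive differences; anything over one hour marks the row after a gap
gaps = df[df['date_time'].diff() > pd.Timedelta(hours=1)]
print(gaps)  # the 08:00 row, whose predecessor is 06:00
```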
# View rows where the time is not in the form of HH:MM:00 (seconds is not zero)
traffic[traffic['second']!=0]
# There are none of them, so all rows have seconds=0
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | year | month | day | hour | minute | second | weekday |
---|
# Drop the seconds column since we don't need it
traffic=traffic.drop(columns='second')
# View rows where minutes is not equal to zero
traffic[traffic['minute']!=0]
# All rows are of the form HH:00:00, so all rows have minutes=0 and seconds =0
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | year | month | day | hour | minute | weekday |
---|
# Drop the minute column, we don't need it
traffic=traffic.drop(columns='minute')
#Examine the number of unique times in traffic
len(traffic['date_time'].unique())
40575
# Drop the duplicate times in place, keeping the first of each set of duplicates
# (ignore_index=False preserves the original index values)
traffic.drop_duplicates(subset='date_time',keep='first',inplace=True,ignore_index=False)
# Confirm the number of unique times is unchanged
len(traffic['date_time'].unique())
# it is
40575
# Confirm the number of unique times matches the length of traffic
len(traffic.index)
# it does
40575
# Years between 2012 and 2018 are present in the data
traffic['year'].unique()
array([2012, 2013, 2014, 2015, 2016, 2017, 2018])
# The number of days per month looks reasonable, with 31 being the maximum found in the dataset
len(traffic['day'].unique())
31
# All 12 months are accounted for across the years
traffic['month'].unique()
array([10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# All 24 hours of the day are present across the dataset
len(traffic['hour'].unique())
24
5*365*24 # A rough lower bound: five full years of hourly data would give 43,800 distinct times,
# and there are only 40,575 distinct times here, even though the data actually spans about six years. So clearly there is some missing data.
43800
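A tighter estimate uses the actual span of the data rather than a flat five years; a sketch with `pd.date_range`, assuming the endpoints below are the dataset's first and last hourly timestamps (October 2012 through September 2018):

```python
import pandas as pd

# Assumed dataset endpoints: first timestamp from .head(), last from the span noted above
start = pd.Timestamp('2012-10-02 09:00')
end = pd.Timestamp('2018-09-30 23:00')

# Every hour in the span, inclusive of both endpoints
expected = len(pd.date_range(start, end, freq='h'))
observed = 40575
print(expected, expected - observed)  # expected hours vs. hours missing
```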
# Examine the number of hours per day that are present in the data set for each year, month and day
num_hours_per_day=traffic.groupby(['year','month','day']).agg({'hour':'count'})
# Wow! There are a lot of hours for which we have no data on many days!
num_hours_per_day.head(5)
hour | |||
---|---|---|---|
year | month | day | |
2012 | 10 | 2 | 15 |
3 | 20 | ||
4 | 24 | ||
5 | 22 | ||
6 | 23 |
# There are whole weeks in places where we have essentially no information about the traffic patterns
num_hours_per_day[num_hours_per_day['hour']<6]
# While this is a matter of taste, I feel like it might be beneficial to drop these rows.
hour | |||
---|---|---|---|
year | month | day | |
2013 | 1 | 29 | 1 |
8 | 31 | 1 | |
9 | 1 | 1 | |
3 | 3 | ||
10 | 1 | 2 | |
2 | 1 | ||
27 | 1 | ||
2014 | 6 | 12 | 5 |
16 | 2 | ||
17 | 2 | ||
19 | 4 | ||
21 | 3 | ||
22 | 3 | ||
23 | 4 | ||
24 | 4 | ||
25 | 4 | ||
27 | 3 | ||
28 | 4 | ||
29 | 5 | ||
30 | 2 | ||
7 | 24 | 3 | |
8 | 2 | 3 | |
8 | 2 | ||
2015 | 6 | 11 | 1 |
12 | 3 | ||
13 | 2 | ||
14 | 1 | ||
19 | 1 | ||
20 | 1 | ||
22 | 2 |
# Set the index of traffic to the same keys used here so I can use num_hours_per_day as a mask for dropping the rows
traffic=traffic.set_index(['year','month','day'])
# Drop the rows where there are fewer than 6 hours of traffic data per day
traffic=traffic.drop(index=traffic[num_hours_per_day['hour']<6].index)
C:\Users\Steve\AppData\Local\Temp\ipykernel_1172\411133967.py:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index. traffic=traffic.drop(index=traffic[num_hours_per_day['hour']<6].index)
#examine the length of the new table. Rows were dropped
len(traffic.index)
40501
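The boolean-mask approach above triggers the reindexing `UserWarning`; an equivalent warning-free pattern uses a groupby `transform` to count hours per day row-wise, so no index juggling is needed. A sketch on a hypothetical toy frame (threshold lowered to 2 to fit the toy data; the notebook uses 6):

```python
import pandas as pd

# Toy frame: one day with 3 hourly rows, one day with only 1
df = pd.DataFrame({'year': [2013]*4, 'month': [1]*4,
                   'day': [28, 28, 28, 29],
                   'hour': [6, 7, 8, 9]})

# Row-aligned count of hours for each (year, month, day) group
hours_per_day = df.groupby(['year', 'month', 'day'])['hour'].transform('count')

# Keep only days meeting the threshold
df = df[hours_per_day >= 2]
```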
# Examine how many hours of traffic there are per year
num_hours_per_year=traffic.groupby('year')['hour'].count()
# No years seem to be altogether missing data
num_hours_per_year
year
2012    2103
2013    7284
2014    4448
2015    3582
2016    7838
2017    8713
2018    6533
Name: hour, dtype: int64
# Reset the index so that year, month, and day are columns again
traffic=traffic.reset_index()
# Some of these years clearly had a fair amount of missing data. Let's create a pivot with the days counted.
num_days_per_month_and_year=pd.pivot_table(data =traffic, values='day', index='year',columns='month',aggfunc='nunique',dropna=True)
# Examine distinct days per month
num_days_per_month_and_year
month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
year | ||||||||||||
2012 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 30.0 | 30.0 | 31.0 |
2013 | 30.0 | 28.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 | 30.0 | 28.0 | 24.0 | 25.0 | 31.0 |
2014 | 31.0 | 28.0 | 31.0 | 29.0 | 28.0 | 17.0 | 28.0 | 6.0 | NaN | NaN | NaN | NaN |
2015 | NaN | NaN | NaN | NaN | NaN | 7.0 | 31.0 | 31.0 | 30.0 | 28.0 | 30.0 | 31.0 |
2016 | 31.0 | 29.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 |
2017 | 31.0 | 28.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 |
2018 | 31.0 | 28.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 | 31.0 | 30.0 | NaN | NaN | NaN |
Data does not begin until October 2012 and ends in September 2018. There is missing data from August 2014 to June 2015. There are a few other missing days here and there, but most notably June 2014 has about 13 days missing. It might be helpful to split this into a 2012-2014 and a 2015-2018 data set.
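The split could be done with a pair of boolean masks on `date_time`; a minimal sketch on hypothetical rows, where the cutoff of 2015-01-01 is an assumption (any date inside the August 2014 to June 2015 gap would work):

```python
import pandas as pd

# Hypothetical timestamps on either side of the gap
df = pd.DataFrame({'date_time': pd.to_datetime(['2013-05-01', '2014-07-15',
                                                '2015-07-01', '2017-03-01'])})

cutoff = pd.Timestamp('2015-01-01')        # assumed split point inside the gap
early = df[df['date_time'] < cutoff]       # 2012-2014 portion
late = df[df['date_time'] >= cutoff]       # 2015-2018 portion
```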
During 2012-2014 I was living in St Cloud, Minnesota, until June 2014. During 2015-2018, I spent nearly all of that time living in Baton Rouge, Louisiana, unable to drive due to frequent seizures, and lived in Minnesota only for the last few months.
# Other than possibly haze, fog, smoke, and mist, these all look distinct and well labelled without any typos or
# capitalization issues. There is no missing data. It's unclear whether haze should be combined with fog, smoke, or mist.
# It's also unclear whether a Thunderstorm and a Squall are the same in Minnesota. Probably not, but I would have
# expected to see something like "straight line winds", "tornado", or "heavy flooding" instead of "Squall".
# I am going to leave these values alone. I don't have enough information to change them, and they are probably reasonably
# accurate by the definitions of this study
traffic.value_counts('weather_main')
weather_main
Clouds          15110
Clear           13333
Rain             4750
Mist             3065
Snow             2297
Haze              763
Drizzle           515
Thunderstorm      452
Fog               200
Smoke              15
Squall              1
dtype: int64
# I'm going to use personal taste and combine some of these.
# Sky is Clear should become sky is clear
# light intensity drizzle should become drizzle
# light rain and snow should become freezing rain
# sleet should become freezing rain
# light intensity shower rain should become light rain
# light shower snow should become light snow
# heavy intensity drizzle should become light rain
# heavy intensity rain should become heavy rain
# the thunderstorm-with-rain/drizzle variants should collapse into thunderstorm or proximity thunderstorm
traffic.value_counts("weather_description")
weather_description sky is clear 11629 overcast clouds 5071 broken clouds 4644 scattered clouds 3448 mist 3065 light rain 2824 few clouds 1947 Sky is Clear 1704 light snow 1588 moderate rain 1372 haze 763 heavy snow 500 heavy intensity rain 404 light intensity drizzle 360 proximity thunderstorm 338 fog 200 snow 192 drizzle 138 proximity shower rain 119 thunderstorm 59 thunderstorm with light rain 23 very heavy rain 18 heavy intensity drizzle 17 smoke 15 thunderstorm with heavy rain 14 light intensity shower rain 12 light shower snow 11 proximity thunderstorm with rain 9 thunderstorm with rain 5 light rain and snow 3 sleet 3 thunderstorm with light drizzle 3 proximity thunderstorm with drizzle 1 freezing rain 1 SQUALLS 1 dtype: int64
weather_desc_change_dict={'Sky is Clear':'sky is clear','light intensity drizzle':'drizzle','light rain and snow':'freezing rain',
'sleet':'freezing rain','light intensity shower rain':'light rain','light shower snow':'light snow',
'heavy intensity drizzle':'light rain','proximity thunderstorm with rain':'proximity thunderstorm',
'proximity thunderstorm with drizzle':'proximity thunderstorm','thunderstorm with rain':'thunderstorm',
'thunderstorm with light rain':'thunderstorm','thunderstorm with heavy rain':'thunderstorm',
'heavy intensity rain':'heavy rain','thunderstorm with light drizzle':'thunderstorm'}
for key in weather_desc_change_dict.keys():
key_bool=traffic['weather_description']==key
traffic.loc[key_bool,['weather_description']]=weather_desc_change_dict[key]
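The loop above can be collapsed into a single `Series.replace` call, which applies the same mapping in one pass; a sketch on a small sample of descriptions:

```python
import pandas as pd

# A few sample descriptions and a subset of the mapping used above
s = pd.Series(['Sky is Clear', 'sleet', 'light rain', 'heavy intensity rain'])
mapping = {'Sky is Clear': 'sky is clear',
           'sleet': 'freezing rain',
           'heavy intensity rain': 'heavy rain'}

# Values not in the dict pass through unchanged
s = s.replace(mapping)
```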
traffic['weather_description'].value_counts()
sky is clear 13333 overcast clouds 5071 broken clouds 4644 scattered clouds 3448 mist 3065 light rain 2853 few clouds 1947 light snow 1599 moderate rain 1372 haze 763 heavy snow 500 drizzle 498 heavy rain 404 proximity thunderstorm 348 fog 200 snow 192 proximity shower rain 119 thunderstorm 104 very heavy rain 18 smoke 15 freezing rain 7 SQUALLS 1 Name: weather_description, dtype: int64
#view the SQUALLS row
traffic[traffic['weather_description']=='SQUALLS']
year | month | day | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4879 | 2013 | 5 | 12 | None | 284.28 | 0.0 | 0.0 | 75 | Squall | SQUALLS | 2013-05-12 02:00:00 | 420 | 2 | Sunday |
# Create a Fahrenheit temperature column
traffic['tempF']=1.8*(traffic['temp']-273.15)+32
traffic.describe()
year | month | day | temp | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
count | 40501.000000 | 40501.000000 | 40501.000000 | 40501.000000 | 40501.000000 | 40501.000000 | 40501.000000 | 40501.000000 | 40501.000000 | 40501.000000 |
mean | 2015.482408 | 6.489000 | 15.667267 | 281.296086 | 0.318494 | 0.000118 | 44.227970 | 3291.463939 | 11.513913 | 46.662956 |
std | 1.889364 | 3.376214 | 8.762967 | 13.818862 | 48.857183 | 0.005681 | 38.682702 | 1984.939812 | 6.946877 | 24.873951 |
min | 2012.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -459.670000 |
25% | 2014.000000 | 4.000000 | 8.000000 | 271.820000 | 0.000000 | 0.000000 | 1.000000 | 1249.000000 | 5.000000 | 29.606000 |
50% | 2016.000000 | 7.000000 | 16.000000 | 282.810000 | 0.000000 | 0.000000 | 40.000000 | 3430.000000 | 12.000000 | 49.388000 |
75% | 2017.000000 | 9.000000 | 23.000000 | 292.270000 | 0.000000 | 0.000000 | 90.000000 | 4953.000000 | 18.000000 | 66.416000 |
max | 2018.000000 | 12.000000 | 31.000000 | 310.070000 | 9831.300000 | 0.510000 | 100.000000 | 7280.000000 | 23.000000 | 98.456000 |
# A temperature value of -460 F is definitely not possible. That's absolute zero (see the Kelvin temperature column).
# We should fix this by making a cut on the temperature. A reasonable absolute lowest temperature for Minnesota in the winter
# is -35 F. Even considering windchill, this would be outstandingly cold for Minneapolis; schools would be shut down statewide.
# A windchill of -35 F has happened once in my lifetime, but the actual temperature was not that low.
# To be safe, we will make the cut at -40 F.
traffic=traffic[traffic['tempF']>-40]
traffic.describe()
# The new minimum temperature is -21.5 F. That corresponds to an ordinary but extremely chilly January day.
# It seems the incorrectly entered data is likely gone. The maximum temperature of 98.4 F is very reasonable for
# Minnesota in July. An average temperature of about 50 F is also very reasonable.
year | month | day | temp | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
count | 40491.000000 | 40491.000000 | 40491.000000 | 40491.000000 | 40491.000000 | 40491.000000 | 40491.000000 | 40491.000000 | 40491.000000 | 40491.000000 |
mean | 2015.482774 | 6.490208 | 15.667778 | 281.365558 | 0.318573 | 0.000118 | 44.238893 | 3291.951273 | 11.515497 | 46.788004 |
std | 1.889454 | 3.375748 | 8.761144 | 13.094304 | 48.863216 | 0.005682 | 38.681233 | 1984.812181 | 6.946959 | 23.569747 |
min | 2012.000000 | 1.000000 | 1.000000 | 243.390000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -21.568000 |
25% | 2014.000000 | 4.000000 | 8.000000 | 271.838000 | 0.000000 | 0.000000 | 1.000000 | 1250.000000 | 5.000000 | 29.638400 |
50% | 2016.000000 | 7.000000 | 16.000000 | 282.820000 | 0.000000 | 0.000000 | 40.000000 | 3431.000000 | 12.000000 | 49.406000 |
75% | 2017.000000 | 9.000000 | 23.000000 | 292.275000 | 0.000000 | 0.000000 | 90.000000 | 4953.000000 | 18.000000 | 66.425000 |
max | 2018.000000 | 12.000000 | 31.000000 | 310.070000 | 9831.300000 | 0.510000 | 100.000000 | 7280.000000 | 23.000000 | 98.456000 |
# Let's rename the kelvin column to indicate that it's in kelvin.
traffic=traffic.rename(columns={'temp':'tempK'})
# Drop all rows with temperatures below -30 F which is a reasonable lower bound for MN
traffic=traffic.drop(index=traffic[traffic['tempF']<-30].index)
#Display temperatures below -20 F. There should be only a few and they should not be much below it.
traffic[traffic['tempF']<-20]
year | month | day | holiday | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
24928 | 2016 | 12 | 18 | None | 244.22 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2016-12-18 06:00:00 | 800 | 6 | Sunday | -20.074 |
24929 | 2016 | 12 | 18 | None | 243.62 | 0.0 | 0.0 | 1 | Haze | haze | 2016-12-18 07:00:00 | 1037 | 7 | Sunday | -21.154 |
24930 | 2016 | 12 | 18 | None | 243.39 | 0.0 | 0.0 | 1 | Haze | haze | 2016-12-18 08:00:00 | 1462 | 8 | Sunday | -21.568 |
Rain has an enormous standard deviation, and if we examine the maximum value, we see that it is almost 10,000 mm in an hour. That's nearly ten meters! Clearly that is wrong! Let's histogram this with 100 bins and see where the outliers are.
# The histogram of the number of mm of rain per hour indicates that all values were under 100 mm/hour except the one where it
# supposedly rained nearly 10,000 mm/hour. Let's cut anything over 300 mm/hour.
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('fivethirtyeight')
traffic['rain_1h'].hist(bins=100)
plt.ylim([0,10])
(0.0, 10.0)
#view the data where rain_1h is large
traffic[traffic['rain_1h']>300]
# It's completely impossible that nearly 10 meters of rain per hour fell in Minneapolis/St Paul, Minnesota. I'm not sure what this value
# should be, so we'll remove the row from the table.
year | month | day | holiday | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
21150 | 2016 | 7 | 11 | None | 302.11 | 9831.3 | 0.0 | 75 | Rain | very heavy rain | 2016-07-11 17:00:00 | 5535 | 17 | Monday | 84.128 |
# make that cut
traffic=traffic[traffic['rain_1h']<300]
# rain_1h has a new maximum value of 55.63 mm/h and a minimum of 0, which is a reasonable minimum. Its mean is
# well under 1 mm/h, which is plausible.
traffic.describe()
year | month | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
count | 40490.000000 | 40490.000000 | 40490.000000 | 40490.000000 | 40490.000000 | 40490.000000 | 40490.000000 | 40490.000000 | 40490.000000 | 40490.000000 |
mean | 2015.482761 | 6.490195 | 15.667893 | 281.365045 | 0.075773 | 0.000118 | 44.238133 | 3291.895876 | 11.515362 | 46.787082 |
std | 1.889475 | 3.375789 | 8.761221 | 13.094060 | 0.768339 | 0.005682 | 38.681409 | 1984.805387 | 6.946991 | 23.569307 |
min | 2012.000000 | 1.000000 | 1.000000 | 243.390000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -21.568000 |
25% | 2014.000000 | 4.000000 | 8.000000 | 271.838000 | 0.000000 | 0.000000 | 1.000000 | 1250.000000 | 5.000000 | 29.638400 |
50% | 2016.000000 | 7.000000 | 16.000000 | 282.820000 | 0.000000 | 0.000000 | 40.000000 | 3431.000000 | 12.000000 | 49.406000 |
75% | 2017.000000 | 9.000000 | 23.000000 | 292.270000 | 0.000000 | 0.000000 | 90.000000 | 4953.000000 | 18.000000 | 66.416000 |
max | 2018.000000 | 12.000000 | 31.000000 | 310.070000 | 55.630000 | 0.510000 | 100.000000 | 7280.000000 | 23.000000 | 98.456000 |
Snow now looks okay as well. It ranges from 0 to .51 mm/hour with an average of much less than a mm/hour. Clouds range from 0 to 100 percent with an average of 44 percent which is very reasonable. Hours range from 0 to 23 with an average of 11.5. Very reasonable. Days range from 1 to 31 with an average of 15. Very reasonable. Months range from 1 to 12 with an average of 7. Very reasonable. Years range from 2012 to 2018 with an average of 2015.5. Very reasonable.
# Display a few holidays that are actually holidays.
traffic[traffic['holiday']!='None'].head()
year | month | day | holiday | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
126 | 2012 | 10 | 8 | Columbus Day | 273.08 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-08 | 455 | 0 | Monday | 31.874 |
926 | 2012 | 11 | 12 | Veterans Day | 288.12 | 0.0 | 0.0 | 87 | Clear | sky is clear | 2012-11-12 | 1000 | 0 | Monday | 58.946 |
1164 | 2012 | 11 | 22 | Thanksgiving Day | 278.54 | 0.0 | 0.0 | 20 | Mist | mist | 2012-11-22 | 919 | 0 | Thursday | 41.702 |
1938 | 2012 | 12 | 25 | Christmas Day | 264.40 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-12-25 | 803 | 0 | Tuesday | 16.250 |
2103 | 2013 | 1 | 1 | New Years Day | 263.49 | 0.0 | 0.0 | 58 | Clouds | broken clouds | 2013-01-01 | 1439 | 0 | Tuesday | 14.612 |
All of these times are at midnight, so I need to make holidays labelled throughout the entire day.
traffic['date_only']=traffic['date_time'].dt.strftime('%Y-%m-%d')
# Create a dataframe of holidays that are actually holidays, at midnight
holidays=traffic[(traffic['holiday']!='None')&(traffic['hour']==0)]
# Use a left join to create a column where every day that's a holiday has 24 hours of holiday.
traffic_holidays_all_day=traffic.merge(holidays,how='left',left_on='date_only',right_on='date_only')
#Select cells with non-holidays in holiday_y
not_holiday_bool=traffic_holidays_all_day['holiday_y'].isna()
traffic_holidays_all_day[not_holiday_bool].head(5)
year_x | month_x | day_x | holiday_x | tempK_x | rain_1h_x | snow_1h_x | clouds_all_x | weather_main_x | weather_description_x | ... | rain_1h_y | snow_1h_y | clouds_all_y | weather_main_y | weather_description_y | date_time_y | traffic_volume_y | hour_y | weekday_y | tempF_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 10 | 2 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | ... | NaN | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN |
1 | 2012 | 10 | 2 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | ... | NaN | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN |
2 | 2012 | 10 | 2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | ... | NaN | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN |
3 | 2012 | 10 | 2 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | ... | NaN | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN |
4 | 2012 | 10 | 2 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | ... | NaN | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN |
5 rows × 31 columns
#Set cells with non-holidays in holiday_y equal to 'None' instead of NaN.
traffic_holidays_all_day.loc[not_holiday_bool,'holiday_y']='None'
#Select the rows where there was a real holiday during that day, and confirm that the hours are now over the whole day
traffic_holidays_all_day['holiday_y'].value_counts()
None                         39287
Independence Day               120
State Fair                     119
Labor Day                      118
Memorial Day                   117
Washingtons Birthday           115
Christmas Day                  113
New Years Day                  112
Veterans Day                   108
Thanksgiving Day               107
Columbus Day                   105
Martin Luther King Jr Day       69
Name: holiday_y, dtype: int64
# Determine which columns should be renamed or dropped
traffic_holidays_all_day.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 40490 entries, 0 to 40489 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year_x 40490 non-null int64 1 month_x 40490 non-null int64 2 day_x 40490 non-null int64 3 holiday_x 40490 non-null object 4 tempK_x 40490 non-null float64 5 rain_1h_x 40490 non-null float64 6 snow_1h_x 40490 non-null float64 7 clouds_all_x 40490 non-null int64 8 weather_main_x 40490 non-null object 9 weather_description_x 40490 non-null object 10 date_time_x 40490 non-null datetime64[ns] 11 traffic_volume_x 40490 non-null int64 12 hour_x 40490 non-null int32 13 weekday_x 40490 non-null object 14 tempF_x 40490 non-null float64 15 date_only 40490 non-null object 16 year_y 1203 non-null float64 17 month_y 1203 non-null float64 18 day_y 1203 non-null float64 19 holiday_y 40490 non-null object 20 tempK_y 1203 non-null float64 21 rain_1h_y 1203 non-null float64 22 snow_1h_y 1203 non-null float64 23 clouds_all_y 1203 non-null float64 24 weather_main_y 1203 non-null object 25 weather_description_y 1203 non-null object 26 date_time_y 1203 non-null datetime64[ns] 27 traffic_volume_y 1203 non-null float64 28 hour_y 1203 non-null float64 29 weekday_y 1203 non-null object 30 tempF_y 1203 non-null float64 dtypes: datetime64[ns](2), float64(14), int32(1), int64(5), object(9) memory usage: 9.7+ MB
#Drop the columns that are mostly NaNs. Also drop holiday_x rather than holiday_y, because the goal was to create holiday_y as the new holiday column with the hours filled in
traffic_holidays_all_day=traffic_holidays_all_day.drop(columns=['tempF_y','weekday_y','hour_y','traffic_volume_y','date_time_y','weather_description_y','weather_main_y','snow_1h_y','rain_1h_y','tempK_y','holiday_x','day_y','month_y','year_y','clouds_all_y'])
#Print the info again
traffic_holidays_all_day.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40490 entries, 0 to 40489
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   year_x                 40490 non-null  int64
 1   month_x                40490 non-null  int64
 2   day_x                  40490 non-null  int64
 3   tempK_x                40490 non-null  float64
 4   rain_1h_x              40490 non-null  float64
 5   snow_1h_x              40490 non-null  float64
 6   clouds_all_x           40490 non-null  int64
 7   weather_main_x         40490 non-null  object
 8   weather_description_x  40490 non-null  object
 9   date_time_x            40490 non-null  datetime64[ns]
 10  traffic_volume_x       40490 non-null  int64
 11  hour_x                 40490 non-null  int32
 12  weekday_x              40490 non-null  object
 13  tempF_x                40490 non-null  float64
 14  date_only              40490 non-null  object
 15  holiday_y              40490 non-null  object
dtypes: datetime64[ns](1), float64(4), int32(1), int64(5), object(5)
memory usage: 5.1+ MB
#Examine the merged data; it looks as expected
traffic_holidays_all_day.head(5)
year_x | month_x | day_x | tempK_x | rain_1h_x | snow_1h_x | clouds_all_x | weather_main_x | weather_description_x | date_time_x | traffic_volume_x | hour_x | weekday_x | tempF_x | date_only | holiday_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 10 | 2 | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 | 9 | Tuesday | 59.234 | 2012-10-02 | None |
1 | 2012 | 10 | 2 | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 | 10 | Tuesday | 61.178 | 2012-10-02 | None |
2 | 2012 | 10 | 2 | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 | 11 | Tuesday | 61.574 | 2012-10-02 | None |
3 | 2012 | 10 | 2 | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 | 12 | Tuesday | 62.564 | 2012-10-02 | None |
4 | 2012 | 10 | 2 | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 | 13 | Tuesday | 64.382 | 2012-10-02 | None |
#Create a dictionary for renaming columns
col_dict={}
for col in traffic_holidays_all_day.columns:
    # Remove the exact '_x' suffix; str.strip('_x') would remove characters, not the suffix
    col_modified=col.removesuffix('_x')
    col_dict[col]=col_modified
col_dict['holiday_y']='holiday'
#Display that dictionary to confirm success
col_dict
{'year_x': 'year', 'month_x': 'month', 'day_x': 'day', 'tempK_x': 'tempK', 'rain_1h_x': 'rain_1h', 'snow_1h_x': 'snow_1h', 'clouds_all_x': 'clouds_all', 'weather_main_x': 'weather_main', 'weather_description_x': 'weather_description', 'date_time_x': 'date_time', 'traffic_volume_x': 'traffic_volume', 'hour_x': 'hour', 'weekday_x': 'weekday', 'tempF_x': 'tempF', 'date_only': 'date_only', 'holiday_y': 'holiday'}
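A side note on suffix removal, since it is an easy pitfall: `str.strip('_x')` removes any leading or trailing characters from the set `{'_', 'x'}` rather than the literal two-character suffix, and only happens to produce correct names here because none of the cleaned names begin or end with those characters. A short demonstration (the `max_x` column is hypothetical):

```python
# str.strip('_x') removes leading/trailing characters from the set {'_','x'},
# not the literal two-character suffix
assert 'year_x'.strip('_x') == 'year'       # happens to look correct
assert 'max_x'.strip('_x') == 'ma'          # a hypothetical 'max_x' column would be mangled
# str.removesuffix (Python 3.9+) removes only the exact suffix
assert 'max_x'.removesuffix('_x') == 'max'
assert 'date_only'.removesuffix('_x') == 'date_only'
```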
#Rename columns
traffic_holidays_all_day=traffic_holidays_all_day.rename(mapper=col_dict, axis=1)
#Display head of dataframe
traffic_holidays_all_day.head(5)
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 10 | 2 | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 | 9 | Tuesday | 59.234 | 2012-10-02 | None |
1 | 2012 | 10 | 2 | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 | 10 | Tuesday | 61.178 | 2012-10-02 | None |
2 | 2012 | 10 | 2 | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 | 11 | Tuesday | 61.574 | 2012-10-02 | None |
3 | 2012 | 10 | 2 | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 | 12 | Tuesday | 62.564 | 2012-10-02 | None |
4 | 2012 | 10 | 2 | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 | 13 | Tuesday | 64.382 | 2012-10-02 | None |
traffic_holidays_all_day.groupby('holiday').agg({'hour':['min','max','count']})
hour | |||
---|---|---|---|
min | max | count | |
holiday | |||
Christmas Day | 0 | 23 | 113 |
Columbus Day | 0 | 23 | 105 |
Independence Day | 0 | 23 | 120 |
Labor Day | 0 | 23 | 118 |
Martin Luther King Jr Day | 0 | 23 | 69 |
Memorial Day | 0 | 23 | 117 |
New Years Day | 0 | 23 | 112 |
None | 0 | 23 | 39287 |
State Fair | 0 | 23 | 119 |
Thanksgiving Day | 0 | 23 | 107 |
Veterans Day | 0 | 23 | 108 |
Washingtons Birthday | 0 | 23 | 115 |
# Assign cleaned dataframe back to traffic
traffic = traffic_holidays_all_day.copy()
traffic.groupby('holiday').agg({'hour':['min','max','count']})
hour | |||
---|---|---|---|
min | max | count | |
holiday | |||
Christmas Day | 0 | 23 | 113 |
Columbus Day | 0 | 23 | 105 |
Independence Day | 0 | 23 | 120 |
Labor Day | 0 | 23 | 118 |
Martin Luther King Jr Day | 0 | 23 | 69 |
Memorial Day | 0 | 23 | 117 |
New Years Day | 0 | 23 | 112 |
None | 0 | 23 | 39287 |
State Fair | 0 | 23 | 119 |
Thanksgiving Day | 0 | 23 | 107 |
Veterans Day | 0 | 23 | 108 |
Washingtons Birthday | 0 | 23 | 115 |
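The min of 0 and max of 23 for every holiday confirm that the merge filled in all 24 hours of each holiday; the same check can be asserted programmatically rather than read off the table. A sketch of that pattern on toy data:

```python
import pandas as pd

# Toy stand-in: one holiday and one non-holiday day, each with all 24 hours
toy = pd.DataFrame({
    'holiday': ['Labor Day'] * 24 + ['None'] * 24,
    'hour': list(range(24)) * 2,
})
coverage = toy.groupby('holiday')['hour'].agg(['min', 'max'])
# Every group should span the full 0-23 range of hours
assert (coverage['min'] == 0).all() and (coverage['max'] == 23).all()
```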
# Examine the number of distinct days with data for each month and year
num_days_per_month_and_year
month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
year | ||||||||||||
2012 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 30.0 | 30.0 | 31.0 |
2013 | 30.0 | 28.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 | 30.0 | 28.0 | 24.0 | 25.0 | 31.0 |
2014 | 31.0 | 28.0 | 31.0 | 29.0 | 28.0 | 17.0 | 28.0 | 6.0 | NaN | NaN | NaN | NaN |
2015 | NaN | NaN | NaN | NaN | NaN | 7.0 | 31.0 | 31.0 | 30.0 | 28.0 | 30.0 | 31.0 |
2016 | 31.0 | 29.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 |
2017 | 31.0 | 28.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 |
2018 | 31.0 | 28.0 | 31.0 | 30.0 | 31.0 | 30.0 | 31.0 | 31.0 | 30.0 | NaN | NaN | NaN |
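The `num_days_per_month_and_year` table was built in an earlier cell; for reference, a pivot of this shape can be produced with `groupby`/`nunique`/`unstack`. A sketch on toy data (the real call would use the `traffic` frame's `year`, `month`, and `day` columns):

```python
import pandas as pd

toy = pd.DataFrame({
    'year':  [2012, 2012, 2012, 2013],
    'month': [10, 10, 11, 1],
    'day':   [1, 2, 5, 7],
})
# Distinct days observed per (year, month), with months pivoted into columns;
# (year, month) pairs with no data appear as NaN, as in the table above
num_days = toy.groupby(['year', 'month'])['day'].nunique().unstack('month')
```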
# Set boundary dates and times based on the ranges where data exists
import datetime as dt
start_date_first_era=dt.datetime(year=2012,month=10,day=1,hour=0)
end_date_first_era=dt.datetime(year=2014,month=7, day=31,hour=23)
start_date_second_era=dt.datetime(year=2015,month=7,day=1,hour=0)
end_date_second_era=dt.datetime(year=2018,month=9,day=30,hour=23)
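These boundaries are applied with strict `>` and `<` comparisons below, which would drop any row landing exactly on a boundary timestamp; `Series.between` keeps both endpoints by default. A minimal sketch of the inclusive alternative, on toy timestamps:

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    '2012-10-01 00:00', '2013-05-01 12:00', '2014-07-31 23:00',
]))
start = pd.Timestamp(2012, 10, 1, 0)
end = pd.Timestamp(2014, 7, 31, 23)
# between() is inclusive of both endpoints by default
mask = times.between(start, end)
assert mask.all()  # all three rows kept, including the two boundary hours
```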
# create a dataframe for the years 2012-2014
first_era=traffic[(traffic['date_time']>start_date_first_era)&(traffic['date_time']<end_date_first_era)]
# Yep, it's got the right years
first_era.describe()
year | month | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
count | 13706.000000 | 13706.000000 | 13706.000000 | 13706.000000 | 13706.000000 | 13706.0 | 13706.000000 | 13706.000000 | 13706.000000 | 13706.000000 |
mean | 2013.161681 | 6.227929 | 15.584489 | 277.633343 | 0.098579 | 0.0 | 50.030206 | 3282.542390 | 11.574639 | 40.070018 |
std | 0.665166 | 3.607685 | 8.640059 | 13.556025 | 0.955713 | 0.0 | 37.280729 | 2013.190157 | 6.946558 | 24.400844 |
min | 2012.000000 | 1.000000 | 1.000000 | 244.820000 | 0.000000 | 0.0 | 0.000000 | 125.000000 | 0.000000 | -18.994000 |
25% | 2013.000000 | 3.000000 | 8.000000 | 267.980000 | 0.000000 | 0.0 | 1.000000 | 1221.250000 | 6.000000 | 22.694000 |
50% | 2013.000000 | 6.000000 | 15.000000 | 276.793000 | 0.000000 | 0.0 | 64.000000 | 3341.500000 | 12.000000 | 38.557400 |
75% | 2014.000000 | 10.000000 | 23.000000 | 288.860000 | 0.000000 | 0.0 | 90.000000 | 5000.000000 | 18.000000 | 60.278000 |
max | 2014.000000 | 12.000000 | 31.000000 | 308.240000 | 55.630000 | 0.0 | 100.000000 | 7217.000000 | 23.000000 | 95.162000 |
# Create a dataframe for 2015-2018
second_era=traffic[(traffic['date_time']>start_date_second_era)&(traffic['date_time']<end_date_second_era)]
# This dataframe also has the right years
second_era.describe()
year | month | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
count | 26525.000000 | 26525.000000 | 26525.000000 | 26525.000000 | 26525.000000 | 26525.000000 | 26525.000000 | 26525.000000 | 26525.000000 | 26525.000000 |
mean | 2016.691197 | 6.621414 | 15.700811 | 283.162597 | 0.062665 | 0.000179 | 41.443242 | 3294.036946 | 11.477135 | 50.022675 |
std | 0.982650 | 3.256759 | 8.789997 | 12.426552 | 0.645017 | 0.007020 | 39.072992 | 1969.737938 | 6.946017 | 22.367793 |
min | 2015.000000 | 1.000000 | 1.000000 | 243.390000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -21.568000 |
25% | 2016.000000 | 4.000000 | 8.000000 | 273.840000 | 0.000000 | 0.000000 | 1.000000 | 1260.000000 | 5.000000 | 33.242000 |
50% | 2017.000000 | 7.000000 | 16.000000 | 285.530000 | 0.000000 | 0.000000 | 40.000000 | 3468.000000 | 11.000000 | 54.284000 |
75% | 2017.000000 | 9.000000 | 23.000000 | 293.210000 | 0.000000 | 0.000000 | 90.000000 | 4931.000000 | 18.000000 | 68.108000 |
max | 2018.000000 | 12.000000 | 31.000000 | 310.070000 | 31.750000 | 0.510000 | 100.000000 | 7280.000000 | 23.000000 | 98.456000 |
# Consider the distribution of the traffic volume, counted over each unique time (year, month, day, hour, minute) set
import matplotlib.pyplot as plt
%matplotlib inline
first_era.hist('traffic_volume')
plt.xlabel('Traffic Volume, 2012-2014')
plt.ylabel('Count (by Time, Binned)')
plt.title('Empirical Distribution of Traffic Volume')
plt.show()
# Consider the statistics of the 'traffic_volume' column
first_era['traffic_volume'].describe()
# A traffic volume of zero is conceivable, perhaps when the road is closed;
# however, the minimum here is 125, so zero does not occur in this restricted data set.
count    13706.000000
mean      3282.542390
std       2013.190157
min        125.000000
25%       1221.250000
50%       3341.500000
75%       5000.000000
max       7217.000000
Name: traffic_volume, dtype: float64
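The absence of zero-volume hours can be checked directly instead of being inferred from the minimum. A sketch of the check on toy values (the real call would use `first_era['traffic_volume']`):

```python
import pandas as pd

toy = pd.DataFrame({'traffic_volume': [125, 3341, 0, 7217]})
# Hours where no vehicles passed -- candidates for closures or sensor outages
n_zero = (toy['traffic_volume'] == 0).sum()
assert n_zero == 1
# On first_era the same expression yields 0, consistent with the minimum of 125 above
```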
# Create a dataframe of the nighttime rows (before 7 am or after 7 pm, i.e., hours 0-6 and 20-23)
night=first_era[(first_era['hour']<7) | (first_era['hour'] > 19)]
night_2=second_era[(second_era['hour']<7) | (second_era['hour'] > 19)]
# Statistics of the night time data set
night.describe()
year | month | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
count | 6306.000000 | 6306.000000 | 6306.000000 | 6306.000000 | 6306.000000 | 6306.0 | 6306.000000 | 6306.000000 | 6306.000000 | 6306.000000 |
mean | 2013.161116 | 6.237393 | 15.618459 | 276.306807 | 0.098145 | 0.0 | 48.373930 | 1632.048842 | 9.861560 | 37.682253 |
std | 0.663098 | 3.606884 | 8.619584 | 13.087763 | 1.058121 | 0.0 | 38.404096 | 1403.340710 | 9.109747 | 23.557974 |
min | 2012.000000 | 1.000000 | 1.000000 | 244.820000 | 0.000000 | 0.0 | 0.000000 | 125.000000 | 0.000000 | -18.994000 |
25% | 2013.000000 | 3.000000 | 8.000000 | 266.992500 | 0.000000 | 0.0 | 1.000000 | 464.000000 | 2.000000 | 20.916500 |
50% | 2013.000000 | 6.000000 | 15.000000 | 276.175000 | 0.000000 | 0.0 | 64.000000 | 1072.500000 | 5.000000 | 37.445000 |
75% | 2014.000000 | 10.000000 | 23.000000 | 287.377500 | 0.000000 | 0.0 | 90.000000 | 2603.750000 | 21.000000 | 57.609500 |
max | 2014.000000 | 12.000000 | 31.000000 | 304.500000 | 55.630000 | 0.0 | 100.000000 | 6386.000000 | 23.000000 | 88.430000 |
night_2.describe()
year | month | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
count | 12263.000000 | 12263.000000 | 12263.000000 | 12263.000000 | 12263.000000 | 12263.000000 | 12263.000000 | 12263.000000 | 12263.000000 | 12263.000000 |
mean | 2016.681644 | 6.615755 | 15.709696 | 281.553044 | 0.069744 | 0.000179 | 35.781049 | 1688.733426 | 9.720378 | 47.125480 |
std | 0.983021 | 3.264972 | 8.787532 | 11.798577 | 0.701718 | 0.007304 | 39.430960 | 1429.068756 | 9.077559 | 21.237438 |
min | 2015.000000 | 1.000000 | 1.000000 | 244.220000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -20.074000 |
25% | 2016.000000 | 4.000000 | 8.000000 | 272.955000 | 0.000000 | 0.000000 | 1.000000 | 508.000000 | 2.000000 | 31.649000 |
50% | 2017.000000 | 7.000000 | 16.000000 | 283.896000 | 0.000000 | 0.000000 | 5.000000 | 1109.000000 | 5.000000 | 51.342800 |
75% | 2017.000000 | 9.000000 | 23.000000 | 291.290000 | 0.000000 | 0.000000 | 90.000000 | 2692.000000 | 21.000000 | 64.652000 |
max | 2018.000000 | 12.000000 | 31.000000 | 306.290000 | 27.570000 | 0.510000 | 100.000000 | 6236.000000 | 23.000000 | 91.652000 |
# Create a dataframe of the daytime rows (7 am through 7 pm inclusive, i.e., hours 7-19)
day = first_era[(first_era['hour']>=7) & (first_era['hour']<=19)]
day_2=second_era[(second_era['hour']>=7)&(second_era['hour']<=19)]
# Statistics of the day time data set
day.describe()
year | month | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
count | 7400.000000 | 7400.000000 | 7400.000000 | 7400.000000 | 7400.000000 | 7400.0 | 7400.000000 | 7400.000000 | 7400.000000 | 7400.000000 |
mean | 2013.162162 | 6.219865 | 15.555541 | 278.763767 | 0.098950 | 0.0 | 51.441622 | 4689.030541 | 13.034459 | 42.104781 |
std | 0.666967 | 3.608592 | 8.657945 | 13.843569 | 0.858936 | 0.0 | 36.238922 | 1236.442544 | 3.745849 | 24.918425 |
min | 2012.000000 | 1.000000 | 1.000000 | 245.700000 | 0.000000 | 0.0 | 0.000000 | 478.000000 | 7.000000 | -17.410000 |
25% | 2013.000000 | 3.000000 | 8.000000 | 268.850000 | 0.000000 | 0.0 | 8.000000 | 3959.000000 | 10.000000 | 24.260000 |
50% | 2013.000000 | 6.000000 | 15.000000 | 277.540000 | 0.000000 | 0.0 | 64.000000 | 4780.500000 | 13.000000 | 39.902000 |
75% | 2014.000000 | 10.000000 | 23.000000 | 290.680000 | 0.000000 | 0.0 | 90.000000 | 5589.250000 | 16.000000 | 63.554000 |
max | 2014.000000 | 12.000000 | 31.000000 | 308.240000 | 44.450000 | 0.0 | 100.000000 | 7217.000000 | 19.000000 | 95.162000 |
day_2.describe()
year | month | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
count | 14262.000000 | 14262.000000 | 14262.000000 | 14262.000000 | 14262.000000 | 14262.000000 | 14262.000000 | 14262.000000 | 14262.00000 | 14262.000000 |
mean | 2016.699411 | 6.626280 | 15.693171 | 284.546551 | 0.056578 | 0.000180 | 46.311808 | 4674.336839 | 12.98766 | 52.513792 |
std | 0.982291 | 3.249787 | 8.792417 | 12.781518 | 0.591894 | 0.006766 | 38.096845 | 1157.094380 | 3.73464 | 23.006732 |
min | 2015.000000 | 1.000000 | 1.000000 | 243.390000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7.00000 | -21.568000 |
25% | 2016.000000 | 4.000000 | 8.000000 | 274.580000 | 0.000000 | 0.000000 | 1.000000 | 4099.250000 | 10.00000 | 34.574000 |
50% | 2017.000000 | 7.000000 | 16.000000 | 287.290000 | 0.000000 | 0.000000 | 40.000000 | 4744.000000 | 13.00000 | 57.452000 |
75% | 2017.000000 | 9.000000 | 23.000000 | 295.060000 | 0.000000 | 0.000000 | 90.000000 | 5426.000000 | 16.00000 | 71.438000 |
max | 2018.000000 | 12.000000 | 31.000000 | 310.070000 | 31.750000 | 0.510000 | 100.000000 | 7280.000000 | 19.00000 | 98.456000 |
In 2015-2018, both the daytime and nighttime minima reach 0 vehicles/hour (in 2012-2014 they are 478 and 125, respectively), while the daytime mean (about 4,700 vehicles/hour) is nearly three times the nighttime mean (about 1,650). The daytime and nighttime maxima are comparable.
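The day/night gap can be quantified with the means reported by `describe()` above (2012-2014); plain arithmetic, no dataset required:

```python
# Means taken from the describe() outputs above (2012-2014)
day_mean = 4689.030541    # daytime mean traffic volume
night_mean = 1632.048842  # nighttime mean traffic volume
ratio = day_mean / night_mean
assert 2.8 < ratio < 3.0  # closer to three times than to two
```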
# Examine the beginning of the daytime dataset
day.head(5)
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012 | 10 | 2 | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 | 9 | Tuesday | 59.234 | 2012-10-02 | None |
1 | 2012 | 10 | 2 | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 | 10 | Tuesday | 61.178 | 2012-10-02 | None |
2 | 2012 | 10 | 2 | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 | 11 | Tuesday | 61.574 | 2012-10-02 | None |
3 | 2012 | 10 | 2 | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 | 12 | Tuesday | 62.564 | 2012-10-02 | None |
4 | 2012 | 10 | 2 | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 | 13 | Tuesday | 64.382 | 2012-10-02 | None |
# Examine the daytime dataset
day.tail(5)
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13698 | 2014 | 7 | 31 | 300.58 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2014-07-31 15:00:00 | 6161 | 15 | Thursday | 81.374 | 2014-07-31 | None |
13699 | 2014 | 7 | 31 | 301.00 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2014-07-31 16:00:00 | 6608 | 16 | Thursday | 82.130 | 2014-07-31 | None |
13700 | 2014 | 7 | 31 | 300.59 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2014-07-31 17:00:00 | 6526 | 17 | Thursday | 81.392 | 2014-07-31 | None |
13701 | 2014 | 7 | 31 | 300.09 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2014-07-31 18:00:00 | 5127 | 18 | Thursday | 80.492 | 2014-07-31 | None |
13702 | 2014 | 7 | 31 | 299.21 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2014-07-31 19:00:00 | 3929 | 19 | Thursday | 78.908 | 2014-07-31 | None |
# Examine the beginning of the nighttime dataset and ensure that it has at least one day which only includes hours 0-6 and 20-24
night.head(30)
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 2012 | 10 | 2 | 289.38 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 20:00:00 | 2784 | 20 | Tuesday | 61.214 | 2012-10-02 | None |
12 | 2012 | 10 | 2 | 288.61 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 21:00:00 | 2361 | 21 | Tuesday | 59.828 | 2012-10-02 | None |
13 | 2012 | 10 | 2 | 287.16 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 22:00:00 | 1529 | 22 | Tuesday | 57.218 | 2012-10-02 | None |
14 | 2012 | 10 | 2 | 285.45 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 23:00:00 | 963 | 23 | Tuesday | 54.140 | 2012-10-02 | None |
15 | 2012 | 10 | 3 | 284.63 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 00:00:00 | 506 | 0 | Wednesday | 52.664 | 2012-10-03 | None |
16 | 2012 | 10 | 3 | 283.47 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 01:00:00 | 321 | 1 | Wednesday | 50.576 | 2012-10-03 | None |
17 | 2012 | 10 | 3 | 281.18 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 02:00:00 | 273 | 2 | Wednesday | 46.454 | 2012-10-03 | None |
18 | 2012 | 10 | 3 | 281.09 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 03:00:00 | 367 | 3 | Wednesday | 46.292 | 2012-10-03 | None |
19 | 2012 | 10 | 3 | 279.53 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 04:00:00 | 814 | 4 | Wednesday | 43.484 | 2012-10-03 | None |
20 | 2012 | 10 | 3 | 278.62 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 05:00:00 | 2718 | 5 | Wednesday | 41.846 | 2012-10-03 | None |
21 | 2012 | 10 | 3 | 278.23 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 06:00:00 | 5673 | 6 | Wednesday | 41.144 | 2012-10-03 | None |
31 | 2012 | 10 | 3 | 288.19 | 0.0 | 0.0 | 20 | Clouds | few clouds | 2012-10-03 20:00:00 | 2898 | 20 | Wednesday | 59.072 | 2012-10-03 | None |
32 | 2012 | 10 | 3 | 287.10 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 21:00:00 | 2637 | 21 | Wednesday | 57.110 | 2012-10-03 | None |
33 | 2012 | 10 | 3 | 286.25 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 22:00:00 | 1777 | 22 | Wednesday | 55.580 | 2012-10-03 | None |
34 | 2012 | 10 | 3 | 285.26 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 23:00:00 | 1015 | 23 | Wednesday | 53.798 | 2012-10-03 | None |
35 | 2012 | 10 | 4 | 284.55 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 00:00:00 | 598 | 0 | Thursday | 52.520 | 2012-10-04 | None |
36 | 2012 | 10 | 4 | 283.47 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 01:00:00 | 369 | 1 | Thursday | 50.576 | 2012-10-04 | None |
37 | 2012 | 10 | 4 | 283.17 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 02:00:00 | 312 | 2 | Thursday | 50.036 | 2012-10-04 | None |
38 | 2012 | 10 | 4 | 282.04 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 03:00:00 | 367 | 3 | Thursday | 48.002 | 2012-10-04 | None |
39 | 2012 | 10 | 4 | 281.69 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 04:00:00 | 835 | 4 | Thursday | 47.372 | 2012-10-04 | None |
40 | 2012 | 10 | 4 | 281.32 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 05:00:00 | 2726 | 5 | Thursday | 46.706 | 2012-10-04 | None |
41 | 2012 | 10 | 4 | 280.74 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-04 06:00:00 | 5689 | 6 | Thursday | 45.662 | 2012-10-04 | None |
55 | 2012 | 10 | 4 | 292.43 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-04 20:00:00 | 3037 | 20 | Thursday | 66.704 | 2012-10-04 | None |
56 | 2012 | 10 | 4 | 291.77 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-04 21:00:00 | 2822 | 21 | Thursday | 65.516 | 2012-10-04 | None |
57 | 2012 | 10 | 4 | 291.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-04 22:00:00 | 1992 | 22 | Thursday | 64.778 | 2012-10-04 | None |
58 | 2012 | 10 | 4 | 291.12 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-04 23:00:00 | 1166 | 23 | Thursday | 64.346 | 2012-10-04 | None |
59 | 2012 | 10 | 5 | 290.63 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-05 00:00:00 | 627 | 0 | Friday | 63.464 | 2012-10-05 | None |
60 | 2012 | 10 | 5 | 290.46 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-05 01:00:00 | 388 | 1 | Friday | 63.158 | 2012-10-05 | None |
61 | 2012 | 10 | 5 | 289.26 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2012-10-05 03:00:00 | 368 | 3 | Friday | 60.998 | 2012-10-05 | None |
62 | 2012 | 10 | 5 | 283.21 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-05 05:00:00 | 2489 | 5 | Friday | 50.108 | 2012-10-05 | None |
# Examine the end of the nighttime dataset
night.tail(5)
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13688 | 2014 | 7 | 31 | 288.26 | 0.00 | 0.0 | 1 | Clear | sky is clear | 2014-07-31 05:00:00 | 3154 | 5 | Thursday | 59.198 | 2014-07-31 | None |
13689 | 2014 | 7 | 31 | 288.17 | 0.00 | 0.0 | 1 | Clear | sky is clear | 2014-07-31 06:00:00 | 5633 | 6 | Thursday | 59.036 | 2014-07-31 | None |
13703 | 2014 | 7 | 31 | 297.20 | 8.64 | 0.0 | 1 | Rain | heavy rain | 2014-07-31 20:00:00 | 3437 | 20 | Thursday | 75.290 | 2014-07-31 | None |
13704 | 2014 | 7 | 31 | 294.87 | 0.00 | 0.0 | 1 | Clear | sky is clear | 2014-07-31 21:00:00 | 3222 | 21 | Thursday | 71.096 | 2014-07-31 | None |
13705 | 2014 | 7 | 31 | 293.38 | 0.00 | 0.0 | 1 | Clear | sky is clear | 2014-07-31 22:00:00 | 2610 | 22 | Thursday | 68.414 | 2014-07-31 | None |
# Distributions of daytime traffic and nighttime traffic, counts accumulated by traffic volume for each distinct time
# (year, month, day, hour, minute) set in the daytime and nighttime datasets
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
plt.grid(visible=True)
plt.hist(day['traffic_volume'])
plt.title('Daytime traffic, 2012-2014')
plt.xlabel('Traffic Volume')
plt.ylabel('Count of Distinct Times')
plt.ylim([0,2500])
plt.subplot(2,2,2)
plt.grid(visible=True)
plt.hist(night['traffic_volume'])
plt.title('Nighttime traffic, 2012-2014')
plt.xlabel('Traffic Volume')
plt.ylabel('Count of Distinct Times')
plt.ylim([0,2500])
plt.subplot(2,2,3)
plt.grid(visible=True)
plt.hist(day_2['traffic_volume'])
plt.title('Daytime traffic, 2015-2018')
plt.xlabel('Traffic Volume')
plt.ylabel('Count of Distinct Times')
plt.ylim([0,5000])
plt.subplot(2,2,4)
plt.grid(visible=True)
plt.hist(night_2['traffic_volume'])
plt.title('Nighttime traffic, 2015-2018')
plt.xlabel('Traffic Volume')
plt.ylabel('Count of Distinct Times')
plt.ylim([0,5000])
plt.show()
Nighttime traffic volume is generally far lower than daytime volume. The daytime distribution peaks around 5,000 cars/hour. The nighttime distribution has a strong peak near 0 cars/hour, plus a secondary feature around 2,800 cars/hour and a small feature around 5,800 cars/hour.
# Consider only the times where the nighttime traffic volume is above 2000 cars/hour.
# This captures both the ~2800 cars/hour feature and the ~5800 cars/hour feature
high_traffic_night=night[night['traffic_volume']>2000]
high_traffic_night_2=night_2[night_2['traffic_volume']>2000]
# Count of distinct hours where there are high volumes of traffic above the threshold (2000 cars/hour)
plt.figure(figsize=(15,5))
plt.suptitle('Distinct Hours Where There Are More than 2000 Cars/Hour at Night')
plt.subplot(1,2,1)
high_traffic_night['hour'].hist()
plt.title('2012-2014')
plt.xlabel("Hour")
plt.ylabel("Count of Distinct Hours (Given High Traffic Volumes)")
plt.subplot(1,2,2)
high_traffic_night_2['hour'].hist()
plt.title('2015-2018')
plt.xlabel("Hour")
plt.ylabel("Count of Distinct Hours (Given High Traffic Volumes)")
plt.show()
The night hours with high traffic volumes do indeed cluster just before the morning rush hour. High nighttime volumes also appear in the early evening, during the tail of the evening rush hour and at times when people may be going out to shows or bars.
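The histograms above can be backed by a direct tabulation of which hours carry the high nighttime volumes. A sketch of the tabulation on toy hours (the real frame is `high_traffic_night`):

```python
import pandas as pd

# Toy hours drawn from the pre-7am and post-7pm windows
toy = pd.DataFrame({'hour': [5, 6, 6, 6, 20, 20, 21]})
counts = toy['hour'].value_counts()
assert counts.idxmax() == 6  # the hour just before the morning rush dominates here
```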
# Create a table of the mean values of the numeric columns for each month
# (numeric_only avoids errors on the string columns in newer pandas)
by_month=day.groupby('month').mean(numeric_only=True)
by_month_2=day_2.groupby('month').mean(numeric_only=True)
# Display the traffic volume column of this table 2012-2014
by_month['traffic_volume']
month
1     4330.043724
2     4518.420438
3     4671.420470
4     4804.885753
5     4803.237845
6     4841.262834
7     4822.823120
8     5057.757310
9     4925.348837
10    5019.176157
11    4693.224964
12    4264.806536
Name: traffic_volume, dtype: float64
# Display traffic volume 2015-2018
by_month_2['traffic_volume']
month
1     4417.767308
2     4731.617978
3     4924.857007
4     4878.801843
5     4788.243882
6     4822.766122
7     4381.357947
8     4754.691204
9     4727.758781
10    4712.266857
11    4568.072375
12    4420.231072
Name: traffic_volume, dtype: float64
# Traffic volume vs month; note that January and December are lower in both eras, and July is lower in 2015-2018
plt.plot(by_month.index,by_month['traffic_volume'], label="2012-2014")
plt.plot(by_month_2.index,by_month_2['traffic_volume'],label='2015-2018')
plt.title('Daytime Traffic Volume by Month')
plt.xlabel('Month')
plt.ylabel('Mean Daytime Traffic Volume')
plt.legend()
plt.show()
In 2012-2014, traffic volume was much lower in January and December and somewhat higher in August through October. In 2015-2018, traffic volume was much lower in January, December, and July.
Many Minnesotans go up north in July to fish, boat, swim, or hunt. In my experience, the resulting westbound congestion is most common on Friday afternoons, a bit further west along I-94. The slowdown and the exodus from the city should decrease traffic volume on Friday afternoons in July; I examine this hypothesis below. It cannot, however, explain why the overall July volume is low in only one of the two data sets.
Road repair is another possibility. In 2016, I-94 between Minneapolis and St Paul was closed for repairs from Friday, July 22 to Sunday, July 24 (I-94 Closure). That would decrease July traffic volume only in the second-era dataset, specifically over the July 22-24, 2016 weekend. This can be tested by comparing very low-traffic days in July against those outside July, using the percentage of hours the road may have been closed.
Weather is also a candidate. On July 5, 2016, a severe storm with 60-70 mph winds, flooding, and tornadoes and hail in neighboring areas of the state caused 250,000 Twin Cities homes to lose power (Storm article). Such a storm could certainly have had after-effects, which I will evaluate as well. It would decrease second-era traffic volume in July, specifically on July 5, 2016 and over the following days to a week.
Student holiday travel cannot explain the difference between July in 2012-2014 and July in 2015-2018: enrollment at the University of Minnesota, Twin Cities changed by only about 2% in any year of either period, while July traffic volume dropped about 10% in the second era (U of MN enrollment chart).
A likely hypothesis for winter patterns is poor weather. I will examine possible effects.
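The road-closure hypothesis suggests a concrete test: compare mean volume during the reported July 22-24, 2016 closure window against the rest of July 2016. A sketch with toy values (the real comparison would use `july_traffic_2` and its `date_only` column; the volumes below are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    'date_only': ['2016-07-22', '2016-07-23', '2016-07-15', '2016-07-16'],
    'traffic_volume': [1200, 900, 4800, 4500],  # illustrative values only
})
closure = {'2016-07-22', '2016-07-23', '2016-07-24'}
in_closure = toy['date_only'].isin(closure)
# Mean volume during vs. outside the closure window
during = toy.loc[in_closure, 'traffic_volume'].mean()
outside = toy.loc[~in_closure, 'traffic_volume'].mean()
assert during < outside
```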
# Create a table for daytime traffic in July
july_traffic=day[day['month']==7]
july_traffic_2=day_2[day_2['month']==7]
# Display the beginning of the July table to confirm the creation of the weekday column
july_traffic.head(5)
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6010 | 2013 | 7 | 1 | 289.15 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2013-07-01 07:00:00 | 6783 | 7 | Monday | 60.800 | 2013-07-01 | None |
6011 | 2013 | 7 | 1 | 291.59 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2013-07-01 08:00:00 | 5742 | 8 | Monday | 65.192 | 2013-07-01 | None |
6012 | 2013 | 7 | 1 | 294.15 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2013-07-01 09:00:00 | 4764 | 9 | Monday | 69.800 | 2013-07-01 | None |
6013 | 2013 | 7 | 1 | 295.60 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2013-07-01 10:00:00 | 4669 | 10 | Monday | 72.410 | 2013-07-01 | None |
6014 | 2013 | 7 | 1 | 297.15 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2013-07-01 11:00:00 | 4863 | 11 | Monday | 75.200 | 2013-07-01 | None |
july_traffic_2.head(5)
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13970 | 2015 | 7 | 1 | 287.58 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2015-07-01 07:00:00 | 6261 | 7 | Wednesday | 57.974 | 2015-07-01 | None |
13971 | 2015 | 7 | 1 | 288.58 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2015-07-01 09:00:00 | 4409 | 9 | Wednesday | 59.774 | 2015-07-01 | None |
13972 | 2015 | 7 | 1 | 289.24 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2015-07-01 10:00:00 | 4273 | 10 | Wednesday | 60.962 | 2015-07-01 | None |
13973 | 2015 | 7 | 1 | 289.44 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2015-07-01 11:00:00 | 4469 | 11 | Wednesday | 61.322 | 2015-07-01 | None |
13974 | 2015 | 7 | 1 | 290.53 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2015-07-01 12:00:00 | 4625 | 12 | Wednesday | 63.284 | 2015-07-01 | None |
# Find the mean traffic volume for each weekday and hour
july_traffic_by_weekday=july_traffic.groupby(['weekday','hour']).mean(numeric_only=True)
july_traffic_by_weekday_2=july_traffic_2.groupby(['weekday','hour']).mean(numeric_only=True)
# Reset the index so that weekday and hour are columns rather than indices
july_traffic_by_weekday=july_traffic_by_weekday.reset_index()
july_traffic_by_weekday_2=july_traffic_by_weekday_2.reset_index()
# Display the 2012-2014 table to confirm the columns and see the means
july_traffic_by_weekday[['weekday','hour','traffic_volume']]
weekday | hour | traffic_volume | |
---|---|---|---|
0 | Friday | 7 | 5399.625000 |
1 | Friday | 8 | 4869.375000 |
2 | Friday | 9 | 4510.428571 |
3 | Friday | 10 | 4487.125000 |
4 | Friday | 11 | 4757.500000 |
... | ... | ... | ... |
86 | Wednesday | 15 | 6002.500000 |
87 | Wednesday | 16 | 6586.900000 |
88 | Wednesday | 17 | 6265.000000 |
89 | Wednesday | 18 | 4833.200000 |
90 | Wednesday | 19 | 3598.500000 |
91 rows × 3 columns
#Display the 2015-2018 table
july_traffic_by_weekday_2[['weekday','hour','traffic_volume']]
weekday | hour | traffic_volume | |
---|---|---|---|
0 | Friday | 7 | 5605.388889 |
1 | Friday | 8 | 4918.666667 |
2 | Friday | 9 | 4490.777778 |
3 | Friday | 10 | 4389.444444 |
4 | Friday | 11 | 4712.500000 |
... | ... | ... | ... |
86 | Wednesday | 15 | 5341.058824 |
87 | Wednesday | 16 | 6063.529412 |
88 | Wednesday | 17 | 5761.133333 |
89 | Wednesday | 18 | 4369.588235 |
90 | Wednesday | 19 | 3311.294118 |
91 rows × 3 columns
# Plot traffic volume vs hour in July for each weekday on the same plot, labelling each line with the weekday name
# This is to look for variations by weekday in overall traffic volume and in July rush-hour traffic.
plt.figure(figsize=(15,5))
weekdays=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
plt.subplot(1,2,1)
for i in range(7):
this_weekday=july_traffic_by_weekday.loc[(july_traffic_by_weekday['weekday']==weekdays[i]),['hour','traffic_volume']]
plt.plot(this_weekday['hour'],this_weekday['traffic_volume'],label=weekdays[i])
plt.xlabel('Hour')
plt.ylabel('Traffic Volume')
plt.title('Traffic Volume by Hour in July, 2012-2014')
plt.legend()
plt.subplot(1,2,2)
for i in range(7):
this_weekday=july_traffic_by_weekday_2.loc[(july_traffic_by_weekday_2['weekday']==weekdays[i]),['hour','traffic_volume']]
plt.plot(this_weekday['hour'],this_weekday['traffic_volume'],label=weekdays[i])
plt.xlabel('Hour')
plt.ylabel('Traffic Volume')
plt.title('Traffic Volume by Hour in July, 2015-2018')
plt.legend()
plt.show()
In 2012-2014, Friday shows a slight decrease in traffic volume relative to the other weekdays — weak evidence for "going up north" traffic, or simply for people leaving work early or working from home on Fridays. In 2015-2018, Friday is not significantly different from the other weekdays, but overall weekday traffic volume is 10-20% (roughly 1,000 vehicles/hour) lower, while weekend volume is about the same.
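The 10-20% era-to-era drop can be made explicit by merging the two weekday tables and computing a percent change. A minimal sketch with made-up numbers (the values below are illustrative, not the real means; with the real tables one would merge on both `weekday` and `hour`):

```python
import pandas as pd

# Stand-ins for july_traffic_by_weekday and july_traffic_by_weekday_2
# (values invented for illustration)
era1 = pd.DataFrame({'weekday': ['Friday', 'Monday'],
                     'traffic_volume': [5400.0, 5000.0]})
era2 = pd.DataFrame({'weekday': ['Friday', 'Monday'],
                     'traffic_volume': [4860.0, 4500.0]})

# Merge on weekday and compute the percent change between eras
merged = era1.merge(era2, on='weekday', suffixes=('_1', '_2'))
merged['pct_change'] = ((merged['traffic_volume_2'] - merged['traffic_volume_1'])
                        / merged['traffic_volume_1'] * 100)
```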
# Create a table that contains Friday daytime traffic
friday_traffic=day[day['weekday']=='Friday']
friday_traffic_2=day_2[day_2['weekday']=='Friday']
# Display the beginning of the Friday table, 2012-2014
friday_traffic.head(5)
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
64 | 2012 | 10 | 5 | 281.69 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-05 07:00:00 | 6628 | 7 | Friday | 47.372 | 2012-10-05 | None |
65 | 2012 | 10 | 5 | 280.57 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-05 08:00:00 | 5534 | 8 | Friday | 45.356 | 2012-10-05 | None |
66 | 2012 | 10 | 5 | 280.02 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-05 09:00:00 | 5403 | 9 | Friday | 44.366 | 2012-10-05 | None |
67 | 2012 | 10 | 5 | 280.11 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-05 10:00:00 | 4770 | 10 | Friday | 44.528 | 2012-10-05 | None |
68 | 2012 | 10 | 5 | 280.00 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-05 11:00:00 | 5217 | 11 | Friday | 44.330 | 2012-10-05 | None |
# Display the beginning of the Friday table, 2015-2018
friday_traffic_2.head(5)
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14017 | 2015 | 7 | 3 | 287.42 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2015-07-03 07:00:00 | 2368 | 7 | Friday | 57.686 | 2015-07-03 | Independence Day |
14018 | 2015 | 7 | 3 | 288.55 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2015-07-03 08:00:00 | 2523 | 8 | Friday | 59.720 | 2015-07-03 | Independence Day |
14019 | 2015 | 7 | 3 | 292.11 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2015-07-03 09:00:00 | 3227 | 9 | Friday | 66.128 | 2015-07-03 | Independence Day |
14020 | 2015 | 7 | 3 | 294.18 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2015-07-03 10:00:00 | 3721 | 10 | Friday | 69.854 | 2015-07-03 | Independence Day |
14021 | 2015 | 7 | 3 | 295.86 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2015-07-03 11:00:00 | 4034 | 11 | Friday | 72.878 | 2015-07-03 | Independence Day |
# Find the mean of the Friday daytime traffic table over each month and hour
# And reset the index so that month and hour are columns rather than indices
friday_traffic_by_month=friday_traffic.groupby(['month','hour']).mean(numeric_only=True).reset_index()
friday_traffic_by_month_2=friday_traffic_2.groupby(['month','hour']).mean(numeric_only=True).reset_index()
# Display the head of the Friday daytime traffic by month table, 2012-2014
friday_traffic_by_month.head(5)
month | hour | year | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 2013.555556 | 15.888889 | 260.575556 | 0.0 | 0.0 | 50.222222 | 6061.111111 | 9.366000 |
1 | 1 | 8 | 2013.555556 | 15.888889 | 260.871111 | 0.0 | 0.0 | 48.000000 | 5555.222222 | 9.898000 |
2 | 1 | 9 | 2013.571429 | 14.428571 | 263.411429 | 0.0 | 0.0 | 56.000000 | 4851.000000 | 14.470571 |
3 | 1 | 10 | 2013.555556 | 15.888889 | 264.498111 | 0.0 | 0.0 | 62.888889 | 4338.111111 | 16.426600 |
4 | 1 | 11 | 2013.555556 | 15.888889 | 265.303667 | 0.0 | 0.0 | 66.777778 | 4775.333333 | 17.876600 |
# Display the head of the Friday daytime traffic by month table, 2015-2018
friday_traffic_by_month_2.head(5)
month | hour | year | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | tempF | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 2017.090909 | 15.090909 | 262.870909 | 0.0 | 0.0 | 49.636364 | 5616.363636 | 13.497636 |
1 | 1 | 8 | 2017.000000 | 16.250000 | 262.702500 | 0.0 | 0.0 | 64.666667 | 5061.083333 | 13.194500 |
2 | 1 | 9 | 2017.090909 | 15.090909 | 263.310000 | 0.0 | 0.0 | 40.454545 | 4504.454545 | 14.288000 |
3 | 1 | 10 | 2016.923077 | 15.615385 | 264.613846 | 0.0 | 0.0 | 53.538462 | 4239.307692 | 16.634923 |
4 | 1 | 11 | 2017.333333 | 17.444444 | 264.118889 | 0.0 | 0.0 | 42.333333 | 4823.888889 | 15.744000 |
# Plot the volume of traffic versus hour for Fridays over various months.
# This is to look for variations by month in overall traffic volume and Friday evening rush-hour traffic.
plt.figure(figsize=(15,5))
months=['January','February','March','April','May','June','July','August','September','October','November','December']
colors=['b','b','b','b','b','b','r','b','b','b','b','b']
plt.subplot(1,2,1)
for i in range(1,13):
this_month=friday_traffic_by_month[friday_traffic_by_month['month']==i]
plt.plot(this_month['hour'],this_month['traffic_volume'],label=months[i-1],color=colors[i-1])
plt.xlabel('Hour')
plt.ylabel('Traffic Volume')
plt.title('Friday Traffic vs Hour by Month, 2012-2014')
plt.legend()
plt.subplot(1,2,2)
for i in range(1,13):
this_month=friday_traffic_by_month_2[friday_traffic_by_month_2['month']==i]
plt.plot(this_month['hour'],this_month['traffic_volume'],label=months[i-1],color=colors[i-1])
plt.xlabel('Hour')
plt.ylabel('Traffic Volume')
plt.title('Friday Traffic vs Hour by Month, 2015-2018')
plt.legend()
plt.show()
Not only do Fridays in July have a lower peak rush-hour volume than other July weekdays, but July Fridays also have a moderately lower peak rush-hour volume than Fridays in other months. This may be due to people heading north for the weekend or taking time off to enjoy the summer, though it is clearly not the only effect at work.
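The "moderately lower peak" claim can be checked numerically by taking, for each month, the maximum of the hourly means. A sketch on a toy version of `friday_traffic_by_month` (numbers invented for illustration):

```python
import pandas as pd

# Toy hourly means for two months (illustrative values only)
toy = pd.DataFrame({'month': [6, 6, 7, 7],
                    'hour': [16, 17, 16, 17],
                    'traffic_volume': [6200.0, 6400.0, 5600.0, 5900.0]})

# Peak mean hourly volume per month
peak_by_month = toy.groupby('month')['traffic_volume'].max()
```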
# Find the mean of the temperature for each month over all distinct times
temp_by_month=day.groupby('month').agg({'tempF':'mean'})
temp_by_month_2=day_2.groupby('month').agg({'tempF':'mean'})
temp_by_month
tempF | |
---|---|
month | |
1 | 13.880680 |
2 | 14.808607 |
3 | 28.023668 |
4 | 42.479968 |
5 | 60.381068 |
6 | 69.096300 |
7 | 74.764510 |
8 | 74.191316 |
9 | 68.294128 |
10 | 49.648295 |
11 | 35.130075 |
12 | 18.665929 |
# Plot average temperature in Fahrenheit versus month
plt.figure()
plt.plot(temp_by_month.index,temp_by_month['tempF'],label='2012-2014')
plt.plot(temp_by_month_2.index,temp_by_month_2['tempF'],label='2015-2018')
plt.title('Average Temperature vs Month')
plt.xlabel('Month')
plt.ylabel('Average temperature (F)')
plt.legend()
plt.show()
The temperature was warmest in July-August in 2012-2014 and in July in 2015-2018. Roughly speaking, the warmer months correspond to higher traffic volumes.
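That correspondence can be quantified with a Pearson correlation between the monthly averages. A minimal sketch with illustrative numbers (not the dataset's actual values); the full by-month correlations are computed below with `by_month.corr()`:

```python
import pandas as pd

# Hypothetical monthly means: warmer months paired with higher volumes
monthly = pd.DataFrame({'avg_temp': [14.0, 28.0, 60.0, 74.0, 68.0, 18.0],
                        'traffic_volume': [4330.0, 4670.0, 4800.0,
                                           4820.0, 4930.0, 4260.0]})

# Pearson correlation between monthly temperature and traffic volume
r = monthly['avg_temp'].corr(monthly['traffic_volume'])
```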
# Assign the average temperature to a column 'avg_temp' in the by_month dataframe
by_month['avg_temp']=temp_by_month
by_month_2['avg_temp']=temp_by_month_2
# Sort the by_month dataframe so that the highest average temperatures are at the top
by_month.sort_values('avg_temp',ascending=False)['avg_temp']
month 7 74.764510 8 74.191316 6 69.096300 9 68.294128 5 60.381068 10 49.648295 4 42.479968 11 35.130075 3 28.023668 12 18.665929 2 14.808607 1 13.880680 Name: avg_temp, dtype: float64
# Second era
by_month_2.sort_values('avg_temp',ascending=False)['avg_temp']
month 7 75.585278 6 72.714827 8 72.298632 9 67.131799 5 63.900824 10 53.545826 4 46.386700 11 40.416629 3 35.858915 2 24.483318 12 22.455108 1 19.130729 Name: avg_temp, dtype: float64
July is in fact the warmest month in both eras.
#Select the months that are not July from the daytime dataframe
other_months=day[day['month']!=7]
july_only = day[day['month']==7]
other_months_2=day_2[day_2['month']!=7]
july_only_2=day_2[day_2['month']==7]
# Plot histograms of traffic volume, in and out of July, for each era
import seaborn as sns
sns.histplot(other_months['traffic_volume'],stat='percent')
plt.title('Traffic Volume in Months other than July in 2012-2014')
plt.show()
sns.histplot(july_only['traffic_volume'],stat='percent')
plt.title('Traffic Volume in July in 2012-2014')
plt.show()
sns.histplot(other_months_2['traffic_volume'],stat='percent')
plt.title('Traffic Volume in Months other than July in 2015-2018')
plt.show()
sns.histplot(july_only_2['traffic_volume'],stat='percent')
plt.title('Traffic Volume in July in 2015-2018')
plt.show()
Notice that, only in July and only in 2015-2018, there are a significant number of hours when the traffic volume is zero or nearly zero vehicles/hour, suggesting the road was closed. That is consistent with the earlier finding that there was construction during this period. Let's examine these hours.
# Hours when the traffic volume was nearly zero (less than 200) in July in the second data era (2015-2018)
july_only_2[july_only_2['traffic_volume']<200]
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14544 | 2015 | 7 | 25 | 293.36 | 0.00 | 0.0 | 1 | Mist | mist | 2015-07-25 07:00:00 | 190 | 7 | Saturday | 68.378 | 2015-07-25 | None |
14545 | 2015 | 7 | 25 | 295.27 | 0.00 | 0.0 | 1 | Clear | sky is clear | 2015-07-25 08:00:00 | 1 | 8 | Saturday | 71.816 | 2015-07-25 | None |
14546 | 2015 | 7 | 25 | 295.82 | 0.00 | 0.0 | 1 | Clear | sky is clear | 2015-07-25 09:00:00 | 1 | 9 | Saturday | 72.806 | 2015-07-25 | None |
21413 | 2016 | 7 | 23 | 299.11 | 0.00 | 0.0 | 75 | Clouds | broken clouds | 2016-07-23 09:00:00 | 15 | 9 | Saturday | 78.728 | 2016-07-23 | None |
21414 | 2016 | 7 | 23 | 299.12 | 0.00 | 0.0 | 75 | Thunderstorm | proximity thunderstorm | 2016-07-23 10:00:00 | 3 | 10 | Saturday | 78.746 | 2016-07-23 | None |
21415 | 2016 | 7 | 23 | 297.10 | 11.23 | 0.0 | 75 | Rain | heavy rain | 2016-07-23 11:00:00 | 24 | 11 | Saturday | 75.110 | 2016-07-23 | None |
21416 | 2016 | 7 | 23 | 295.30 | 9.42 | 0.0 | 90 | Rain | heavy rain | 2016-07-23 12:00:00 | 2 | 12 | Saturday | 71.870 | 2016-07-23 | None |
21417 | 2016 | 7 | 23 | 295.42 | 13.46 | 0.0 | 90 | Rain | heavy rain | 2016-07-23 13:00:00 | 10 | 13 | Saturday | 72.086 | 2016-07-23 | None |
21418 | 2016 | 7 | 23 | 295.46 | 0.25 | 0.0 | 90 | Rain | heavy rain | 2016-07-23 14:00:00 | 2 | 14 | Saturday | 72.158 | 2016-07-23 | None |
21419 | 2016 | 7 | 23 | 295.55 | 0.00 | 0.0 | 75 | Rain | light rain | 2016-07-23 15:00:00 | 3 | 15 | Saturday | 72.320 | 2016-07-23 | None |
21420 | 2016 | 7 | 23 | 295.22 | 10.67 | 0.0 | 90 | Rain | heavy rain | 2016-07-23 16:00:00 | 7 | 16 | Saturday | 71.726 | 2016-07-23 | None |
21421 | 2016 | 7 | 23 | 295.54 | 0.51 | 0.0 | 90 | Rain | moderate rain | 2016-07-23 17:00:00 | 5 | 17 | Saturday | 72.302 | 2016-07-23 | None |
21422 | 2016 | 7 | 23 | 296.68 | 0.00 | 0.0 | 40 | Rain | light rain | 2016-07-23 18:00:00 | 0 | 18 | Saturday | 74.354 | 2016-07-23 | None |
21423 | 2016 | 7 | 23 | 297.48 | 0.00 | 0.0 | 40 | Clouds | scattered clouds | 2016-07-23 19:00:00 | 1 | 19 | Saturday | 75.794 | 2016-07-23 | None |
21437 | 2016 | 7 | 24 | 297.40 | 0.00 | 0.0 | 40 | Clouds | scattered clouds | 2016-07-24 09:00:00 | 15 | 9 | Sunday | 75.650 | 2016-07-24 | None |
21439 | 2016 | 7 | 24 | 299.45 | 0.00 | 0.0 | 40 | Clouds | scattered clouds | 2016-07-24 11:00:00 | 24 | 11 | Sunday | 79.340 | 2016-07-24 | None |
21441 | 2016 | 7 | 24 | 300.04 | 0.00 | 0.0 | 75 | Clouds | broken clouds | 2016-07-24 13:00:00 | 100 | 13 | Sunday | 80.402 | 2016-07-24 | None |
21444 | 2016 | 7 | 24 | 300.88 | 0.00 | 0.0 | 40 | Clouds | scattered clouds | 2016-07-24 16:00:00 | 44 | 16 | Sunday | 81.914 | 2016-07-24 | None |
All of these times are on the weekend: July 25, 2015, hours 7-9 am (traffic near zero for the latter two hours); July 23, 2016, 9 am through 7 pm (traffic near zero all day); and July 24, 2016, at 9 am, 11 am, 1 pm, and 4 pm (traffic very light). These look like possible road closures or accidents, and the 2016 times fall within the known road-repair period.
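Counting suspect hours per date helps separate a day-long closure from an isolated bad reading. A sketch using toy records (dates and volumes invented) and the same 200 vehicles/hour cutoff used below:

```python
import pandas as pd

# Toy hourly records (illustrative, not the real data)
toy = pd.DataFrame({'date_only': ['2016-07-23'] * 3 + ['2016-07-24'] * 2,
                    'traffic_volume': [15, 3, 24, 4400, 100]})

# Number of near-zero hours (< 200 vehicles/hour) on each date
suspect_hours = (toy[toy['traffic_volume'] < 200]
                 .groupby('date_only')
                 .size())
```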
# The mean traffic volume excluding the times of road repair or accidents
july_mean_excluding_shutdowns=july_only_2[july_only_2['traffic_volume']>200]['traffic_volume'].mean()
july_mean_excluding_shutdowns
4430.989240506329
The mean traffic volume in July of the second era, excluding the times of road repair or accidents, is roughly 4,431 vehicles/hour.
# Traffic volume vs month, note that January, December, and July have lower traffic volumes
plt.plot(by_month.index,by_month['traffic_volume'], label="2012-2014")
plt.plot(by_month_2.index,by_month_2['traffic_volume'],label='2015-2018')
plt.plot(7,july_mean_excluding_shutdowns,label='July 2015-2018, excluding shutdowns',color='g',marker='o')
plt.title('Daytime Traffic Volume by Month')
plt.xlabel('Month')
plt.ylabel('Daytime Traffic Volume')
plt.legend()
plt.show()
Excluding shutdowns due to road repair or accidents alone does not resolve the July 2015-2018 drop in traffic volume.
# Identify the main weather types and count their occurrences by month in the daytime data
weather_by_month=day.groupby('month')['weather_main'].value_counts()
weather_by_month_2=day_2.groupby('month')['weather_main'].value_counts()
# Display the primary weather types and their frequency by month
weather_by_month
month weather_main 1 Clouds 385 Clear 127 Snow 76 Mist 44 Rain 41 ... 12 Clear 76 Rain 61 Haze 42 Fog 9 Drizzle 5 Name: weather_main, Length: 89, dtype: int64
# for example, in the second era it seems that December was a lot wetter
weather_by_month_2
month weather_main 1 Clouds 339 Clear 300 Snow 168 Mist 148 Haze 67 ... 12 Mist 107 Haze 38 Rain 36 Drizzle 18 Fog 2 Name: weather_main, Length: 103, dtype: int64
# Create a dataframe out of the series where weather is grouped and counted by month
weather_by_month_df=pd.DataFrame(weather_by_month)
weather_by_month_df_2=pd.DataFrame(weather_by_month_2)
# Assign the count column the name 'frequency_of_weather' to indicate that it describes how frequently the weather occurs
weather_by_month_df.columns=['frequency_of_weather']
weather_by_month_df_2.columns=['frequency_of_weather']
# Reset the indices of the weather_by_month_df dataframe so that 'month' and 'weather_main' are columns
weather_by_month_df=weather_by_month_df.reset_index()
weather_by_month_df_2=weather_by_month_df_2.reset_index()
# Display the dataframe to make sure that the data makes sense and the columns are correct
weather_by_month_df
month | weather_main | frequency_of_weather | |
---|---|---|---|
0 | 1 | Clouds | 385 |
1 | 1 | Clear | 127 |
2 | 1 | Snow | 76 |
3 | 1 | Mist | 44 |
4 | 1 | Rain | 41 |
... | ... | ... | ... |
84 | 12 | Clear | 76 |
85 | 12 | Rain | 61 |
86 | 12 | Haze | 42 |
87 | 12 | Fog | 9 |
88 | 12 | Drizzle | 5 |
89 rows × 3 columns
# second era looks okay as well
weather_by_month_df_2
month | weather_main | frequency_of_weather | |
---|---|---|---|
0 | 1 | Clouds | 339 |
1 | 1 | Clear | 300 |
2 | 1 | Snow | 168 |
3 | 1 | Mist | 148 |
4 | 1 | Haze | 67 |
... | ... | ... | ... |
98 | 12 | Mist | 107 |
99 | 12 | Haze | 38 |
100 | 12 | Rain | 36 |
101 | 12 | Drizzle | 18 |
102 | 12 | Fog | 2 |
103 rows × 3 columns
# Display info about the dataframe to determine the data types and whether any values are null (looks okay)
weather_by_month_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 89 entries, 0 to 88 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 month 89 non-null int64 1 weather_main 89 non-null object 2 frequency_of_weather 89 non-null int64 dtypes: int64(2), object(1) memory usage: 2.2+ KB
# Info about second era (looks okay)
weather_by_month_df_2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 103 entries, 0 to 102 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 month 103 non-null int64 1 weather_main 103 non-null object 2 frequency_of_weather 103 non-null int64 dtypes: int64(2), object(1) memory usage: 2.5+ KB
# Create a dictionary with the most common weather type for each month (numerical) and the number of times it occurred
def common_weather_by_month(weather_by_month_df):
common_weather={}
months=range(1,13)
for month in months:
maxcounts=0
commonweather=''
for index in weather_by_month_df.index:
dfmonth=weather_by_month_df.iloc[index]['month']
dfweather=weather_by_month_df.iloc[index]['weather_main']
dfcounts=weather_by_month_df.iloc[index]['frequency_of_weather']
if dfmonth==month:
if dfcounts>maxcounts:
maxcounts=dfcounts
commonweather=dfweather
common_weather[month]=[commonweather,maxcounts]
return common_weather
# Get the common weather for the first era
common_weather=common_weather_by_month(weather_by_month_df)
# Get the common weather for the second era
common_weather_2=common_weather_by_month(weather_by_month_df_2)
# Display that dictionary, in order sorted by numerical month for the first era
common_weather
{1: ['Clouds', 385], 2: ['Clouds', 410], 3: ['Clouds', 399], 4: ['Clouds', 364], 5: ['Clouds', 362], 6: ['Clouds', 167], 7: ['Clouds', 327], 8: ['Clear', 163], 9: ['Clear', 103], 10: ['Clouds', 298], 11: ['Clouds', 488], 12: ['Clouds', 344]}
common_weather_2
{1: ['Clouds', 339], 2: ['Clear', 362], 3: ['Clouds', 431], 4: ['Clouds', 397], 5: ['Clouds', 574], 6: ['Clouds', 542], 7: ['Clouds', 788], 8: ['Clouds', 547], 9: ['Clouds', 546], 10: ['Clouds', 431], 11: ['Clear', 352], 12: ['Snow', 302]}
The most common weather type per month doesn't give a good indication of why July, December, or January has lower traffic than the other months, or why traffic changed between eras: the most common type was usually clouds. In the first era, clear weather was most common in August and September while temperatures were warm, which may help explain why traffic volume increased in those months.
In the second era, August and September were cloudy. The cloudy weather in August and September of 2015-2018 correlates with, but does not necessarily cause, the lower traffic volume of the second era.
One can imagine that snow might reduce traffic volume in December, but snow was not the most common type in December and January of both eras, so it can't explain the whole effect.
print(day.head())
print(day_2.head())
year month day tempK rain_1h snow_1h clouds_all weather_main \ 0 2012 10 2 288.28 0.0 0.0 40 Clouds 1 2012 10 2 289.36 0.0 0.0 75 Clouds 2 2012 10 2 289.58 0.0 0.0 90 Clouds 3 2012 10 2 290.13 0.0 0.0 90 Clouds 4 2012 10 2 291.14 0.0 0.0 75 Clouds weather_description date_time traffic_volume hour weekday \ 0 scattered clouds 2012-10-02 09:00:00 5545 9 Tuesday 1 broken clouds 2012-10-02 10:00:00 4516 10 Tuesday 2 overcast clouds 2012-10-02 11:00:00 4767 11 Tuesday 3 overcast clouds 2012-10-02 12:00:00 5026 12 Tuesday 4 broken clouds 2012-10-02 13:00:00 4918 13 Tuesday tempF date_only holiday 0 59.234 2012-10-02 None 1 61.178 2012-10-02 None 2 61.574 2012-10-02 None 3 62.564 2012-10-02 None 4 64.382 2012-10-02 None year month day tempK rain_1h snow_1h clouds_all weather_main \ 13970 2015 7 1 287.58 0.0 0.0 75 Clouds 13971 2015 7 1 288.58 0.0 0.0 40 Clouds 13972 2015 7 1 289.24 0.0 0.0 40 Clouds 13973 2015 7 1 289.44 0.0 0.0 75 Clouds 13974 2015 7 1 290.53 0.0 0.0 1 Clear weather_description date_time traffic_volume hour \ 13970 broken clouds 2015-07-01 07:00:00 6261 7 13971 scattered clouds 2015-07-01 09:00:00 4409 9 13972 scattered clouds 2015-07-01 10:00:00 4273 10 13973 broken clouds 2015-07-01 11:00:00 4469 11 13974 sky is clear 2015-07-01 12:00:00 4625 12 weekday tempF date_only holiday 13970 Wednesday 57.974 2015-07-01 None 13971 Wednesday 59.774 2015-07-01 None 13972 Wednesday 60.962 2015-07-01 None 13973 Wednesday 61.322 2015-07-01 None 13974 Wednesday 63.284 2015-07-01 None
# Create a table of numerical month versus main weather type, counting each occurrence
month_weather_pivot=pd.pivot_table(day,values='traffic_volume',index='month',columns='weather_main',aggfunc='count')
month_weather_pivot_2=pd.pivot_table(day_2,values='traffic_volume',index='month',columns='weather_main',aggfunc='count')
# Display that pivot table, first era
month_weather_pivot
weather_main | Clear | Clouds | Drizzle | Fog | Haze | Mist | Rain | Snow | Thunderstorm |
---|---|---|---|---|---|---|---|---|---|
month | |||||||||
1 | 127.0 | 385.0 | 7.0 | NaN | 29.0 | 44.0 | 41.0 | 76.0 | NaN |
2 | 162.0 | 410.0 | NaN | NaN | 40.0 | 32.0 | 16.0 | 25.0 | NaN |
3 | 166.0 | 399.0 | 2.0 | NaN | 11.0 | 16.0 | 83.0 | 46.0 | NaN |
4 | 137.0 | 364.0 | 1.0 | NaN | 7.0 | 20.0 | 164.0 | 51.0 | NaN |
5 | 189.0 | 362.0 | 13.0 | NaN | 11.0 | 33.0 | 135.0 | 10.0 | 8.0 |
6 | 124.0 | 167.0 | 2.0 | NaN | 5.0 | 20.0 | 167.0 | NaN | 2.0 |
7 | 261.0 | 327.0 | 2.0 | 1.0 | 4.0 | 19.0 | 101.0 | NaN | 3.0 |
8 | 163.0 | 147.0 | NaN | 2.0 | 3.0 | 12.0 | 13.0 | NaN | 2.0 |
9 | 103.0 | 67.0 | 1.0 | NaN | 3.0 | 8.0 | 32.0 | NaN | 1.0 |
10 | 111.0 | 298.0 | 19.0 | 1.0 | 4.0 | 41.0 | 79.0 | 7.0 | 2.0 |
11 | 112.0 | 488.0 | 2.0 | NaN | 10.0 | 40.0 | 14.0 | 22.0 | 1.0 |
12 | 76.0 | 344.0 | 5.0 | 9.0 | 42.0 | 93.0 | 61.0 | 135.0 | NaN |
# Display pivot table, second era
month_weather_pivot_2
weather_main | Clear | Clouds | Drizzle | Fog | Haze | Mist | Rain | Smoke | Snow | Thunderstorm |
---|---|---|---|---|---|---|---|---|---|---|
month | ||||||||||
1 | 300.0 | 339.0 | 11.0 | 2.0 | 67.0 | 148.0 | 5.0 | NaN | 168.0 | NaN |
2 | 362.0 | 355.0 | 5.0 | 3.0 | 33.0 | 76.0 | 18.0 | NaN | 126.0 | 1.0 |
3 | 340.0 | 431.0 | 21.0 | 3.0 | 18.0 | 72.0 | 63.0 | NaN | 97.0 | 4.0 |
4 | 322.0 | 397.0 | 20.0 | NaN | 37.0 | 57.0 | 156.0 | NaN | 89.0 | 7.0 |
5 | 284.0 | 574.0 | 23.0 | 3.0 | 26.0 | 50.0 | 195.0 | 3.0 | 1.0 | 26.0 |
6 | 308.0 | 542.0 | 8.0 | NaN | 13.0 | 29.0 | 210.0 | NaN | NaN | 53.0 |
7 | 506.0 | 788.0 | 6.0 | 7.0 | 26.0 | 38.0 | 185.0 | 3.0 | NaN | 39.0 |
8 | 544.0 | 547.0 | 16.0 | 14.0 | 19.0 | 107.0 | 323.0 | 2.0 | NaN | 31.0 |
9 | 515.0 | 546.0 | 19.0 | 3.0 | 37.0 | 63.0 | 306.0 | NaN | NaN | 20.0 |
10 | 405.0 | 431.0 | 10.0 | 8.0 | 11.0 | 23.0 | 137.0 | NaN | 20.0 | 8.0 |
11 | 352.0 | 345.0 | 34.0 | 8.0 | 14.0 | 108.0 | 55.0 | NaN | 57.0 | 8.0 |
12 | 215.0 | 299.0 | 18.0 | 2.0 | 38.0 | 107.0 | 36.0 | NaN | 302.0 | NaN |
By eye, heavy snow appears to correlate with low traffic volumes on a by-month basis, while rain, if anything, correlates with higher traffic volume.
# Create a column of average traffic volume by month
day_traffic_by_month=day.groupby('month').agg({'traffic_volume':'mean'})['traffic_volume']
day_traffic_by_month_2=day_2.groupby('month').agg({'traffic_volume':'mean'})['traffic_volume']
# append that average traffic volume column to the end of the pivot tables for each era
month_weather_pivot['traffic_volume']=day_traffic_by_month
month_weather_pivot_2['traffic_volume']=day_traffic_by_month_2
# Correlate the traffic volume to the number of hours within a day of each weather condition, 2012-2014
month_weather_pivot.corr()['traffic_volume']
weather_main Clear 0.258648 Clouds -0.557272 Drizzle 0.186578 Fog -0.930919 Haze -0.895330 Mist -0.711984 Rain 0.196833 Snow -0.808171 Thunderstorm -0.202460 traffic_volume 1.000000 Name: traffic_volume, dtype: float64
# Correlate the traffic volume to the number of hours within a day of each weather condition, 2015-2018
month_weather_pivot_2.corr()['traffic_volume']
weather_main Clear 0.026986 Clouds -0.009368 Drizzle 0.161860 Fog 0.094164 Haze -0.442299 Mist -0.433574 Rain 0.334614 Smoke -0.434291 Snow -0.619592 Thunderstorm -0.241370 traffic_volume 1.000000 Name: traffic_volume, dtype: float64
Fog, haze, mist, and snow all have very strong negative correlations with traffic volume in 2012-2014. Clouds were moderately negatively correlated in that era. There was a very weak negative correlation with rain and thunderstorms, and a very weak positive correlation with clear weather.
In the second era, snow has the strongest negative correlation. Haze, mist, and smoke were moderately negatively correlated. Rain was weakly positively correlated, and thunderstorms weakly negatively correlated. There was essentially no relationship for drizzle, fog, clouds, or clear weather.
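One caveat when reading these correlations: the pivot tables contain NaN wherever a weather type never occurred in a month, and `DataFrame.corr()` drops those months pairwise rather than treating them as zero hours. Filling with 0 first can change the numbers, as this small sketch (invented values) shows:

```python
import numpy as np
import pandas as pd

# Four 'months' of snow-hour counts and mean traffic volume (illustrative);
# NaN marks months in which snow never occurred
pivot = pd.DataFrame({'Snow': [76.0, np.nan, np.nan, 135.0],
                      'traffic_volume': [4330.0, 4840.0, 5060.0, 4260.0]})

r_pairwise = pivot['Snow'].corr(pivot['traffic_volume'])           # only 2 months used
r_filled = pivot['Snow'].fillna(0).corr(pivot['traffic_volume'])   # all 4 months used
```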
# Correlation between numerical weather data and traffic volume in daytime data averaged over each month, 2012-2014
by_month.corr()['traffic_volume']
year -0.107079 day -0.134033 tempK 0.840164 rain_1h 0.157328 snow_1h NaN clouds_all -0.590495 traffic_volume 1.000000 hour 0.618568 tempF 0.840164 avg_temp 0.840164 Name: traffic_volume, dtype: float64
In 2012-2014, temperature was strongly positively correlated with traffic volume. Time of day also had a moderately strong positive correlation, and number of clouds had a moderate negative correlation.
by_month_2.corr()['traffic_volume']
year 0.443250 day -0.243115 tempK 0.245040 rain_1h 0.092921 snow_1h -0.562595 clouds_all -0.164297 traffic_volume 1.000000 hour 0.110069 tempF 0.245040 avg_temp 0.245040 Name: traffic_volume, dtype: float64
In 2015-2018, traffic volume trended upward with year. Snow was moderately negatively correlated with traffic volume, and temperature was weakly positively correlated.
# Examination of by-month dataframe to look for outlier data that might explain the drop in July, December,\
# and January traffic volume
by_month
year | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | avg_temp | |
---|---|---|---|---|---|---|---|---|---|---|
month | ||||||||||
1 | 2013.535966 | 15.468265 | 263.083711 | 0.036911 | 0.0 | 53.732017 | 4330.043724 | 12.930889 | 13.880680 | 13.880680 |
2 | 2013.528467 | 14.455474 | 263.599226 | 0.010234 | 0.0 | 53.986861 | 4518.420438 | 13.093431 | 14.808607 | 14.808607 |
3 | 2013.550484 | 16.536653 | 270.940927 | 0.051549 | 0.0 | 55.123098 | 4671.420470 | 13.052559 | 28.023668 | 28.023668 |
4 | 2013.487903 | 15.112903 | 278.972204 | 0.184946 | 0.0 | 56.950269 | 4804.885753 | 13.018817 | 42.479968 | 42.479968 |
5 | 2013.474376 | 16.734560 | 288.917260 | 0.143141 | 0.0 | 52.052562 | 4803.237845 | 13.021025 | 60.381068 | 60.381068 |
6 | 2013.287474 | 12.677618 | 293.759055 | 0.463778 | 0.0 | 41.967146 | 4841.262834 | 13.145791 | 69.096300 | 69.096300 |
7 | 2013.472145 | 15.603064 | 296.908061 | 0.159123 | 0.0 | 35.746518 | 4822.823120 | 13.038997 | 74.764510 | 74.764510 |
8 | 2013.000000 | 14.695906 | 296.589620 | 0.011871 | 0.0 | 26.152047 | 5057.757310 | 13.046784 | 74.191316 | 74.191316 |
9 | 2013.000000 | 16.237209 | 293.313405 | 0.024140 | 0.0 | 29.730233 | 4925.348837 | 13.167442 | 68.294128 | 68.294128 |
10 | 2012.350534 | 15.528470 | 282.954609 | 0.021566 | 0.0 | 59.028470 | 5019.176157 | 13.083630 | 49.648295 | 49.648295 |
11 | 2012.442671 | 16.692308 | 274.888930 | 0.007881 | 0.0 | 58.509434 | 4693.224964 | 13.001451 | 35.130075 | 35.130075 |
12 | 2012.484967 | 15.928105 | 265.742183 | 0.063190 | 0.0 | 63.826144 | 4264.806536 | 12.964706 | 18.665929 | 18.665929 |
by_month_2
year | day | tempK | rain_1h | snow_1h | clouds_all | traffic_volume | hour | tempF | avg_temp | |
---|---|---|---|---|---|---|---|---|---|---|
month | ||||||||||
1 | 2017.162500 | 15.934615 | 266.000405 | 0.000240 | 0.000606 | 52.514423 | 4417.767308 | 12.891346 | 19.130729 | 19.130729 |
2 | 2017.109295 | 14.277835 | 268.974065 | 0.000000 | 0.000000 | 44.000000 | 4731.617978 | 12.933606 | 24.483318 | 24.483318 |
3 | 2017.147760 | 16.182078 | 275.293842 | 0.000286 | 0.000000 | 51.154433 | 4924.857007 | 12.939943 | 35.858915 | 35.858915 |
4 | 2017.073733 | 15.701382 | 281.142611 | 0.033217 | 0.000000 | 51.004608 | 4878.801843 | 12.936406 | 46.386700 | 46.386700 |
5 | 2017.020253 | 15.778059 | 290.872680 | 0.028549 | 0.000000 | 49.000000 | 4788.243882 | 13.008439 | 63.900824 | 63.900824 |
6 | 2017.006019 | 15.560619 | 295.769348 | 0.035770 | 0.000000 | 46.304385 | 4822.766122 | 12.994841 | 72.714827 | 72.714827 |
7 | 2016.503755 | 16.045056 | 297.364043 | 0.083667 | 0.000000 | 39.734043 | 4381.357947 | 13.006884 | 75.585278 | 75.585278 |
8 | 2016.502183 | 15.993762 | 295.538129 | 0.193213 | 0.000000 | 39.998752 | 4754.691204 | 12.996257 | 72.298632 | 72.298632 |
9 | 2016.541418 | 15.359841 | 292.667666 | 0.162346 | 0.000000 | 41.190192 | 4727.758781 | 13.049039 | 67.131799 | 67.131799 |
10 | 2016.134853 | 15.428300 | 285.119903 | 0.003172 | 0.000000 | 41.622982 | 4712.266857 | 13.107312 | 53.545826 | 53.545826 |
11 | 2016.191641 | 15.482161 | 277.825905 | 0.000000 | 0.000000 | 47.213048 | 4568.072375 | 12.973496 | 40.416629 | 40.416629 |
12 | 2016.172075 | 16.294002 | 267.847282 | 0.003097 | 0.001908 | 60.940020 | 4420.231072 | 12.964602 | 22.455108 | 22.455108 |
There is nothing in these factors that specifically indicates outliers. Recall there was evidence for low Friday rush-hour traffic volume in July and, to a lesser extent, in December and January. Let's also consider the "Squall" weather type, recalling that it was missing rain data.
traffic[traffic['weather_main']=='Squall']
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4879 | 2013 | 5 | 12 | 284.28 | 0.0 | 0.0 | 75 | Squall | SQUALLS | 2013-05-12 02:00:00 | 420 | 2 | Sunday | 52.034 | 2013-05-12 | None |
The only squall occurred in May of 2013. This would not decrease traffic in July of 2015-2018.
Of the factors examined, the one remaining that correlates with the change in July traffic volume is the low Friday rush-hour traffic in July (and, to a lesser extent, in December and January), inspired by the idea of "going north" traffic, or of taking Friday off when the weather is good or there is a holiday.
# Plot the weekday traffic volume for each date, traffic volume vs weekday
plt.figure(figsize=(16,6))
plt.subplot(1,2,1)
plt.scatter(day['weekday'],day['traffic_volume'])
plt.title('Daytime traffic volume by day of the week')
plt.xlabel('Weekday')
plt.ylabel('Traffic Volume, 2012-2014')
plt.subplot(1,2,2)
plt.scatter(day_2['weekday'],day_2['traffic_volume'])
plt.title('Daytime traffic volume by day of the week')
plt.xlabel('Weekday')
plt.ylabel('Traffic Volume, 2015-2018')
plt.show()
# Another view of traffic by weekday: total volume via a pivot table
day_traffic_pivot=pd.pivot_table(day,index='weekday',values='traffic_volume',aggfunc='sum')
day_traffic_pivot_2=pd.pivot_table(day_2,index='weekday',values='traffic_volume',aggfunc='sum')
# Examine that sum
day_traffic_pivot
traffic_volume | |
---|---|
weekday | |
Friday | 5686084 |
Monday | 5044676 |
Saturday | 4188405 |
Sunday | 3591076 |
Thursday | 5430317 |
Tuesday | 5179749 |
Wednesday | 5578519 |
day_traffic_pivot_2
traffic_volume | |
---|---|
weekday | |
Friday | 10540884 |
Monday | 9663382 |
Saturday | 8019445 |
Sunday | 7185724 |
Thursday | 10661550 |
Tuesday | 10168932 |
Wednesday | 10425475 |
# Transform the pivot table with traffic volume by weekday string into traffic volume by numerical weekday
# with Sunday as weekday 0 using a dictionary for weekdays and a for loop, first era
weekdays={'Sunday':0,'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6}
for d in weekdays.keys():
for d2 in day_traffic_pivot.index:
if d==d2:
day_traffic_pivot.loc[d,'daynum']=weekdays[d]
# Transform the pivot table with traffic volume by weekday string into traffic volume by numerical weekday
# with Sunday as weekday 0 using a dictionary for weekdays and a for loop, second era
weekdays={'Sunday':0,'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6}
for d in weekdays.keys():
for d2 in day_traffic_pivot_2.index:
if d==d2:
day_traffic_pivot_2.loc[d,'daynum']=weekdays[d]
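The two nested loops above can be collapsed into a single vectorized `Index.map` call. A sketch of the equivalent idiom, using a small toy pivot table in place of the notebook's `day_traffic_pivot` (which is built upstream):

```python
import pandas as pd

# Toy stand-in for day_traffic_pivot: traffic volume summed by weekday name
pivot = pd.DataFrame(
    {'traffic_volume': [5686084, 5044676, 4188405, 3591076]},
    index=pd.Index(['Friday', 'Monday', 'Saturday', 'Sunday'], name='weekday'),
)

weekdays = {'Sunday': 0, 'Monday': 1, 'Tuesday': 2, 'Wednesday': 3,
            'Thursday': 4, 'Friday': 5, 'Saturday': 6}

# Index.map applies the dictionary to every index label at once,
# replacing the two nested for loops
pivot['daynum'] = pivot.index.map(weekdays)
pivot = pivot.sort_values('daynum')
print(pivot)
```

The same two lines work unchanged on `day_traffic_pivot_2`.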
day_traffic_pivot
traffic_volume | daynum | |
---|---|---|
weekday | ||
Friday | 5686084 | 5.0 |
Monday | 5044676 | 1.0 |
Saturday | 4188405 | 6.0 |
Sunday | 3591076 | 0.0 |
Thursday | 5430317 | 4.0 |
Tuesday | 5179749 | 2.0 |
Wednesday | 5578519 | 3.0 |
day_traffic_pivot_2
traffic_volume | daynum | |
---|---|---|
weekday | ||
Friday | 10540884 | 5.0 |
Monday | 9663382 | 1.0 |
Saturday | 8019445 | 6.0 |
Sunday | 7185724 | 0.0 |
Thursday | 10661550 | 4.0 |
Tuesday | 10168932 | 2.0 |
Wednesday | 10425475 | 3.0 |
# Sort the pivot table by numerical weekday
day_traffic_pivot=day_traffic_pivot.sort_values('daynum')
day_traffic_pivot_2=day_traffic_pivot_2.sort_values('daynum')
# Display sorted dataframe, first era
day_traffic_pivot
traffic_volume | daynum | |
---|---|---|
weekday | ||
Sunday | 3591076 | 0.0 |
Monday | 5044676 | 1.0 |
Tuesday | 5179749 | 2.0 |
Wednesday | 5578519 | 3.0 |
Thursday | 5430317 | 4.0 |
Friday | 5686084 | 5.0 |
Saturday | 4188405 | 6.0 |
# Reset the sorted table so that the string weekday is a column
day_traffic_pivot=day_traffic_pivot.reset_index()
day_traffic_pivot_2=day_traffic_pivot_2.reset_index()
# Display second era table with reset index
day_traffic_pivot_2
weekday | traffic_volume | daynum | |
---|---|---|---|
0 | Sunday | 7185724 | 0.0 |
1 | Monday | 9663382 | 1.0 |
2 | Tuesday | 10168932 | 2.0 |
3 | Wednesday | 10425475 | 3.0 |
4 | Thursday | 10661550 | 4.0 |
5 | Friday | 10540884 | 5.0 |
6 | Saturday | 8019445 | 6.0 |
# Plot string weekday versus traffic volume as a line plot to examine trends by weekday
plt.plot(day_traffic_pivot['weekday'],day_traffic_pivot['traffic_volume'],label='2012-2014')
plt.plot(day_traffic_pivot_2['weekday'],day_traffic_pivot_2['traffic_volume'],label='2015-2018')
plt.title('Total traffic volume vs weekday')
plt.xlabel('Weekday')
plt.ylabel('Total traffic volume')
plt.legend()
plt.show()
Friday has the highest total traffic volume in the first era and Thursday in the second, with the other weekdays close behind; the weekend has by far the lowest traffic volume. 2012-2014 has a lower total (rather than mean) traffic volume than 2015-2018, as expected since it spans fewer years.
# Create a dataframe for business days and a dataframe for weekend days.
# Note: the 'day' column is the day of the month, so filter on the 'weekday' strings instead
workdays=['Monday','Tuesday','Wednesday','Thursday','Friday']
business_days = day.copy()[day['weekday'].isin(workdays)]
business_days_2=day_2.copy()[day_2['weekday'].isin(workdays)]# second era
weekend = day.copy()[day['weekday'].isin(['Saturday','Sunday'])]
weekend_2=day_2.copy()[day_2['weekday'].isin(['Saturday','Sunday'])]# second era
# Group the business days and the weekend days by hour and take the mean
# (newer pandas versions require .mean(numeric_only=True) to skip the non-numeric columns)
by_hour_business = business_days.groupby('hour').mean()
by_hour_business_2=business_days_2.groupby('hour').mean()
by_hour_weekend = weekend.groupby('hour').mean()
by_hour_weekend_2=weekend_2.groupby('hour').mean()
# print the traffic volume of the business days by hour and the traffic volume of the weekend days by hour
print(by_hour_business['traffic_volume'])
print(by_hour_business_2['traffic_volume'])
print(by_hour_weekend['traffic_volume'])
print(by_hour_weekend_2['traffic_volume'])
hour
7     4525.329545
8     4513.423913
9     4336.655172
10    4126.241379
11    4420.505495
12    4729.565217
13    4698.065934
14    4954.797872
15    5215.688889
16    5533.516129
17    5244.621053
18    4204.629213
19    3177.615385
Name: traffic_volume, dtype: float64
hour
7     4540.797814
8     4371.602273
9     4253.275281
10    4146.598870
11    4458.830601
12    4682.994595
13    4697.779006
14    4867.255556
15    5149.377778
16    5554.848315
17    5199.280899
18    4218.099448
19    3313.122222
Name: traffic_volume, dtype: float64
hour
7     4601.800000
8     4502.473684
9     4363.105263
10    4154.750000
11    4420.250000
12    4671.666667
13    4813.200000
14    5060.954545
15    5381.800000
16    5746.545455
17    5366.454545
18    4267.285714
19    3248.681818
Name: traffic_volume, dtype: float64
hour
7     5112.394737
8     4705.794118
9     4521.405405
10    4287.878788
11    4496.538462
12    4735.947368
13    4698.378378
14    4896.472222
15    5308.162162
16    5834.583333
17    5453.868421
18    4308.111111
19    3312.315789
Name: traffic_volume, dtype: float64
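The four printed Series are easier to compare side by side in one table. A sketch using `pd.concat` with labeled keys (toy values stand in for the hourly means computed above):

```python
import pandas as pd

# Toy stand-ins for the hourly mean traffic volumes (a few hours only)
by_hour_business = pd.Series([4525.3, 4513.4, 5533.5], index=[7, 8, 16], name='traffic_volume')
by_hour_weekend = pd.Series([4601.8, 4502.5, 5746.5], index=[7, 8, 16], name='traffic_volume')

# concat with a dict of labels aligns the series on the shared hour index,
# with the dict keys becoming column names
compare = pd.concat({'business': by_hour_business, 'weekend': by_hour_weekend}, axis=1)
print(compare)
```

The same pattern extends to all four series (both eras) by adding more entries to the dict.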
#plot business days by hour on the left and weekend days by hour on the right
plt.figure(figsize=(14,14))
plt.subplot(2,2,1)
plt.plot(by_hour_business.index,by_hour_business['traffic_volume'])
plt.title('Business days, 2012-2014')
plt.xlabel('Time (Hours)')
plt.ylabel('Traffic Volume')
plt.subplot(2,2,2)
plt.plot(by_hour_weekend.index,by_hour_weekend['traffic_volume'])
plt.title('Weekend days, 2012-2014')
plt.xlabel('Time (Hours)')
plt.ylabel('Traffic Volume')
plt.subplot(2,2,3)
plt.plot(by_hour_business_2.index,by_hour_business_2['traffic_volume'])
plt.title('Business days, 2015-2018')
plt.xlabel('Time (Hours)')
plt.ylabel('Traffic Volume')
plt.subplot(2,2,4)
plt.plot(by_hour_weekend_2.index,by_hour_weekend_2['traffic_volume'])
plt.title('Weekend days, 2015-2018')
plt.xlabel('Time (Hours)')
plt.ylabel('Traffic Volume')
plt.show()
On both business days and weekends there is a morning rush hour and a much larger evening rush hour; on weekends the evening rush hour is just a little larger than on business days. The weekend volume was a little larger in the second era.
There is more traffic by day than at night. The highest volumes at night also occur at rush hour, where we made the cut between day and night. There is more traffic in the fall, spring, and summer (the warmer months) than in the winter, except July in 2015-2018, when there was a drop in traffic volume. There are two rush hours, centered around 7 am and 4 pm, on both weekends and weekdays. However, weekend traffic has a higher volume at the evening rush hour despite the lower overall volume.
FRIDAY RUSH HOUR TRAFFIC IN JULY
There is evidence that in July, Friday rush hour traffic is lower than other days of the week in July, and lower than other months on Friday. This may be due to leaving work early on beautiful summer days, or going north during the summer to the lakes, based on common Minnesotan lore and experiences.
TEMPERATURE
The temperature peaks in July around 90 F but is moderately hot in June and August as well. It is quite cold (around -20 F) in December and January. There is a very small effect where, for temperatures over 75 F, traffic volume never falls below a certain floor that increases with temperature.
COUNT OF WEATHER TYPES
In the first era, in August and September, clear weather was the most common, while temperatures were warm. This may explain why traffic volume increased in those months.
The cloudy weather in August and September from 2015-2018 correlates with, but does not necessarily cause, the lower traffic volume of the second era.
On a by-month basis, fog, haze, mist, and snow all have very strong negative correlations with traffic volume in 2012-2014. Clouds were moderately negatively correlated in that era. There was a very weak negative correlation with rain and thunderstorms, and a very weak positive correlation with clear weather.
In the second era, on a by-month basis, snow has the strongest negative correlation. Haze, mist, and smoke were moderately negatively correlated. Rain was weakly positively correlated, and thunderstorms were weakly negatively correlated. There was essentially no relationship for drizzle, fog, clouds, or clear weather.
NUMERICAL WEATHER INFO
In 2012-2014, temperature was strongly positively correlated with traffic volume. Time of day also had a moderately strong positive correlation, and cloud cover had a moderate negative correlation.
In 2015-2018, there was an increasing trend in traffic volume by year. Snow was moderately negatively correlated with traffic volume, and temperature was weakly positively correlated.
The positive correlation with temperature and the negative correlation with snow may explain the low traffic volume in December and January (people don't like to drive in bad weather). Or this may be mere correlation.
ASSESSMENT OF THE EFFECT IN JULY, JANUARY, AND DECEMBER
I believe that so far the effect of lower traffic volume from 2015-2018 in July correlates most strongly to Friday traffic volumes in July. It may also be due to students, especially University of Minnesota students, not driving in the summer.
While there was a road closure July 22-24, 2016, and there seems to have been a similar closure July 25, 2015, excluding these closures does not resolve the decrease in July traffic volume in 2015-2018.
The squall in the data is a tempting explanation, but is not a good one, since it takes place in May 2013 rather than July between 2015-2018.
Students not driving over school holidays could potentially explain the December/January traffic patterns. But so could cold and snow, which have moderate correlations with traffic patterns. Correlation does not imply causation in any of these cases, but it may be something to consider.
Later, I will consider the effects of holiday traffic, which may also impact December, January, and July; as well as weather on an hour-by-hour basis.
# Find the correlation of numerical daytime traffic indicators with traffic volume
# (newer pandas versions require corr(numeric_only=True) to skip the non-numeric columns)
print(day.corr()['traffic_volume'])
print(day_2.corr()['traffic_volume'])
# Note that temperature has a slight correlation with traffic volume on a by-hour basis, but it is not
# really significant. On a by-month basis, there are significant correlations.
year             -0.009739
month             0.025114
day              -0.018625
tempK             0.184307
rain_1h          -0.017748
snow_1h                NaN
clouds_all       -0.029770
traffic_volume    1.000000
hour             -0.003676
tempF             0.184307
Name: traffic_volume, dtype: float64
year              0.038428
month            -0.039868
day               0.005344
tempK             0.059745
rain_1h          -0.023861
snow_1h          -0.003869
clouds_all       -0.003793
traffic_volume    1.000000
hour             -0.000460
tempF             0.059745
Name: traffic_volume, dtype: float64
There are no significant correlations whatsoever on a by-hour basis with numerical weather data. The highest correlation is with temperature (F or K), at a correlation coefficient of r = 0.18 in the first era and only r = 0.06 in the second.
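To judge whether a small coefficient like r = 0.06 is distinguishable from noise, `scipy.stats.pearsonr` returns both the correlation coefficient and a p-value. A sketch on synthetic data (the notebook itself only uses `DataFrame.corr`, which reports r alone):

```python
import numpy as np
from scipy import stats

# Synthetic hourly data with a deliberately weak positive temperature effect
rng = np.random.default_rng(0)
temp = rng.normal(60, 15, size=500)                      # toy temperatures in F
volume = 4500 + 20 * temp + rng.normal(0, 800, size=500) # toy traffic volumes

# pearsonr returns the correlation coefficient r AND a separate significance p-value
r, p = stats.pearsonr(temp, volume)
print(f'r = {r:.3f}, p = {p:.3g}')
```

With ~14,000 daytime hours per era, even tiny coefficients can be "significant" in the p-value sense while explaining almost none of the variance (r² of 0.06² is under 0.4%).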
# Plot the numerical daytime indicator with the largest correlation: temperature.
# Temperature has implausible outliers -- a weather temperature of zero Kelvin is impossible --
# and these skew the correlation. So zoom in on the rest of the data and visually inspect it.
# There does not appear to be a correlation. Both eras are overlaid on a single plot.
plt.scatter(day['traffic_volume'],day['tempF'],color='blue',label='2012-2014',marker='o',alpha=0.2)
plt.scatter(day_2['traffic_volume'],day_2['tempF'],color='red',label='2015-2018',marker='x',alpha=0.2)
plt.title('Traffic Volume vs Temperature')
plt.ylabel('Temperature (F)')
plt.xlabel('Traffic Volume')
plt.legend()
plt.ylim([0,100])
plt.xlim([0,7500])
plt.show()
I don't think there is a real correlation. There does seem to be a cutoff: for temperatures above 75 F, traffic volume is never below a certain value, and that floor increases with temperature. This could create a very small effect in the hotter months.
Temperature has no significant correlation with traffic volume on a by-hour or by-day basis after dropping the unrealistic values.
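Dropping the implausible readings before computing the correlation can be done with a simple threshold filter. A sketch on a toy frame; the 220 K cutoff (about -64 F, colder than Minnesota's record low) is an assumption, and any bound between real weather and the 0 K sensor errors works:

```python
import pandas as pd

# Toy frame with one impossible 0 K reading mixed into plausible Kelvin temperatures
df = pd.DataFrame({'tempK': [284.3, 0.0, 265.1, 301.2],
                   'traffic_volume': [4200, 3900, 2800, 5100]})

# Keep only physically plausible temperatures before correlating
plausible = df[df['tempK'] > 220]
print(plausible['tempK'].corr(plausible['traffic_volume']))
```

The same filter applied to the real `day`/`day_2` frames removes the zero-Kelvin outliers without touching legitimate cold-weather rows.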
# Continue our analysis with the full daytime dataframe-- there is no need to go back and redo this.
# Find the mean traffic volume for the daytime main weather and weather description over each individual time
by_weather_main = day.groupby('weather_main').mean().sort_values('traffic_volume')
by_weather_description = day.groupby('weather_description').mean().sort_values('traffic_volume')
by_weather_main_2 = day_2.groupby('weather_main').mean().sort_values('traffic_volume')
by_weather_description_2 = day_2.groupby('weather_description').mean().sort_values('traffic_volume')
# Create a bar plot of the traffic volume by main weather type
by_weather_main['traffic_volume'].plot.barh()
<AxesSubplot:ylabel='weather_main'>
by_weather_main_2['traffic_volume'].plot.barh()
<AxesSubplot:ylabel='weather_main'>
day['traffic_volume'].mean()
4689.030540540541
day_2['traffic_volume'].mean()
4674.336839152994
No main weather type has a mean traffic volume exceeding 5000 vehicles/hour. In 2012-2014, traffic volume was low during fog and snow. In 2015-2018, traffic volume was low during snow and smoke, and high during fog. Smoke was only present during the second era. The mean daytime traffic volume was 4689 vehicles/hour in the first era and 4674 vehicles/hour in the second.
# Determine the times when smoke was present
day_2[day_2['weather_main']=='Smoke']
year | month | day | tempK | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | hour | weekday | tempF | date_only | holiday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14099 | 2015 | 7 | 6 | 294.68 | 0.0 | 0.0 | 40 | Smoke | smoke | 2015-07-06 17:00:00 | 5861 | 17 | Monday | 70.754 | 2015-07-06 | None |
14100 | 2015 | 7 | 6 | 294.54 | 0.0 | 0.0 | 75 | Smoke | smoke | 2015-07-06 18:00:00 | 4264 | 18 | Monday | 70.502 | 2015-07-06 | None |
14101 | 2015 | 7 | 6 | 294.02 | 0.0 | 0.0 | 1 | Smoke | smoke | 2015-07-06 19:00:00 | 3040 | 19 | Monday | 69.566 | 2015-07-06 | None |
19601 | 2016 | 5 | 7 | 283.83 | 0.0 | 0.0 | 40 | Smoke | smoke | 2016-05-07 07:00:00 | 2182 | 7 | Saturday | 51.224 | 2016-05-07 | None |
19602 | 2016 | 5 | 7 | 285.60 | 0.0 | 0.0 | 1 | Smoke | smoke | 2016-05-07 08:00:00 | 3139 | 8 | Saturday | 54.410 | 2016-05-07 | None |
19603 | 2016 | 5 | 7 | 287.35 | 0.0 | 0.0 | 75 | Smoke | smoke | 2016-05-07 09:00:00 | 4085 | 9 | Saturday | 57.560 | 2016-05-07 | None |
39447 | 2018 | 8 | 18 | 298.34 | 0.0 | 0.0 | 1 | Smoke | smoke | 2018-08-18 12:00:00 | 4448 | 12 | Saturday | 77.342 | 2018-08-18 | None |
39448 | 2018 | 8 | 18 | 300.14 | 0.0 | 0.0 | 75 | Smoke | smoke | 2018-08-18 13:00:00 | 4438 | 13 | Saturday | 80.582 | 2018-08-18 | None |
Smoke was present on July 6, 2015; May 7, 2016; and August 18, 2018. The traffic volume was most remarkably low in May 2016. It would be challenging to claim that this accounted for the whole July effect in 2015-2018.
# Create a bar plot of traffic volume by weather description, alternating colors to increase readability
# (repeat the two colors enough times to cover an odd number of bars, then truncate)
n=len(by_weather_description.index)
by_weather_description['traffic_volume'].plot.barh(figsize=(10,12),color=(['blue','red']*n)[:n])
<AxesSubplot:ylabel='weather_description'>
Freezing rain has a traffic volume exceeding 5000 vehicles/hour. Light snow, snow, and very heavy rain have low volumes. The low volumes seem sensible to a Minnesotan, because traffic slows down quite a bit in these conditions.
# Find the unique holidays in the traffic dataframe
traffic['holiday'].unique()
array(['None', 'Columbus Day', 'Veterans Day', 'Thanksgiving Day', 'Christmas Day', 'New Years Day', 'Washingtons Birthday', 'Memorial Day', 'Independence Day', 'State Fair', 'Labor Day', 'Martin Luther King Jr Day'], dtype=object)
# Find the mean traffic volume of each holiday over individual times, and also the number of individual
# times for each holiday in the full data set -- not the daytime data set
by_holiday=traffic.groupby('holiday').agg({'traffic_volume':['mean','count']})['traffic_volume']
#Display this count and mean
by_holiday
# Notice the counts: each holiday appears roughly, but not exactly, 24 hours times its number of
# occurrences -- the holiday label covers whole days, but some hours are missing. Why is that?
mean | count | |
---|---|---|
holiday | ||
Christmas Day | 1764.283186 | 113 |
Columbus Day | 3393.533333 | 105 |
Independence Day | 2155.175000 | 120 |
Labor Day | 2346.228814 | 118 |
Martin Luther King Jr Day | 2732.362319 | 69 |
Memorial Day | 2095.649573 | 117 |
New Years Day | 1794.526786 | 112 |
None | 3314.345789 | 39287 |
State Fair | 3837.100840 | 119 |
Thanksgiving Day | 1995.990654 | 107 |
Veterans Day | 3301.759259 | 108 |
Washingtons Birthday | 2830.547826 | 115 |
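The raw dataset marks a holiday only on the midnight row of the day (see the column table at the top), yet the counts above are close to 24 hours per occurrence, because the label appears to have been broadcast to every hour of the holiday's date earlier in the analysis (upstream code not shown in this section). A sketch of that broadcast on a toy frame, following the notebook's `date_only`/`holiday` naming:

```python
import pandas as pd

# Toy frame: the raw data labels only the midnight hour of a holiday
df = pd.DataFrame({
    'date_only': ['2016-12-25'] * 3 + ['2016-12-26'] * 2,
    'hour': [0, 1, 2, 0, 1],
    'holiday': ['Christmas Day', 'None', 'None', 'None', 'None'],
})

# Build a date -> holiday mapping from the midnight rows,
# then apply it to every hour of that date
date_to_holiday = df.loc[df['holiday'] != 'None'].set_index('date_only')['holiday']
df['holiday'] = df['date_only'].map(date_to_holiday).fillna('None')
print(df)
```

After the broadcast, every hour of 2016-12-25 carries 'Christmas Day' while non-holiday dates stay 'None'.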
# Select the set of actual holidays, this time from the daytime dataframe rather than the nighttime dataframe
holidays_only=day[day['holiday']!='None']
holidays_only_2=day_2[day_2['holiday']!='None']
# Find the mean traffic volume for each holiday
daytime_holiday_traffic=holidays_only.groupby('holiday').agg({'traffic_volume':'mean'})
daytime_holiday_traffic_2=holidays_only_2.groupby('holiday').agg({'traffic_volume':'mean'})
# Sort the holidays by traffic volume from most to least
daytime_holiday_traffic=daytime_holiday_traffic.sort_values('traffic_volume',ascending=False)
daytime_holiday_traffic_2=daytime_holiday_traffic_2.sort_values('traffic_volume',ascending=False)
# Display that table (dataframe)
daytime_holiday_traffic
traffic_volume | |
---|---|
holiday | |
State Fair | 5456.307692 |
Columbus Day | 4993.210526 |
Veterans Day | 4333.480000 |
Martin Luther King Jr Day | 4100.818182 |
Washingtons Birthday | 3943.653846 |
Labor Day | 3473.333333 |
Memorial Day | 2877.576923 |
Thanksgiving Day | 2710.500000 |
Independence Day | 2679.923077 |
New Years Day | 2214.880000 |
Christmas Day | 2140.640000 |
daytime_holiday_traffic_2
traffic_volume | |
---|---|
holiday | |
State Fair | 5389.903846 |
Veterans Day | 4970.500000 |
Columbus Day | 4895.289474 |
Washingtons Birthday | 4046.777778 |
Martin Luther King Jr Day | 3958.846154 |
Labor Day | 3403.134615 |
Memorial Day | 3050.805556 |
Independence Day | 2839.807692 |
Thanksgiving Day | 2677.968750 |
Christmas Day | 2666.485714 |
New Years Day | 2639.457143 |
# Create a bar chart from least to most of the traffic volume by holiday
# Find the mean traffic volume for each holiday
daytime_holiday_traffic=holidays_only.groupby(['holiday']).agg({'traffic_volume':'mean'})
daytime_holiday_traffic_2=holidays_only_2.groupby(['holiday']).agg({'traffic_volume':'mean'})
plt.figure(figsize=(10,6))
plt.title('Daytime Holiday Traffic')
plt.xlabel('Traffic Volume')
plt.ylabel('Holiday')
#daytime_holiday_traffic.plot.barh()
plt.barh(daytime_holiday_traffic.index,daytime_holiday_traffic['traffic_volume'], label='2012-2014',alpha=0.5,color='g')
plt.barh(daytime_holiday_traffic_2.index,daytime_holiday_traffic_2['traffic_volume'],label='2015-2018',alpha=0.5,color='b')
plt.legend()
plt.show()
Of the holidays, the State Fair had the most traffic, with over 5000 vehicles/hour in both 2012-2014 and 2015-2018. Veterans Day and Columbus Day also had a fair amount of traffic. Christmas Day, Thanksgiving, Independence Day, and New Years Day had the least, at around 2000-2800 vehicles/hour, with lower volumes in the first era for all of these except Thanksgiving.
This seems reasonable based on experience-- people tend to stay home during the daytime on Christmas, Thanksgiving, and New Years Day. The State Fair is a wildly popular attraction in Minnesota, leading to heavy traffic between Minneapolis and St Paul, since it is held in mid-St Paul and people come from all over the Twin Cities and out-state Minnesota, many commuting along I-94. Veterans Day and Columbus Day are one day school holidays in the Minneapolis/St Paul school system.
Independence Day is a low-traffic holiday, but it had more traffic in 2015-2018 than in 2012-2014, so it cannot explain the decrease in traffic in July of the second era.
len(daytime_holiday_traffic.index)
11
# Non holidays
non_holidays=day[day['holiday']=='None']
non_holidays_2=day_2[day_2['holiday']=='None']
# Prepare to plot the non holidays by hour in the final cell of our graph
non_holidays_by_hr=non_holidays.groupby('hour').agg({'traffic_volume':'mean'})
non_holidays_by_hr_2=non_holidays_2.groupby('hour').agg({'traffic_volume':'mean'})
plt.figure(figsize=(14,14))
holiday_list=list(daytime_holiday_traffic.index)
num_holidays=len(holiday_list)
for i in range(num_holidays):
plt.subplot(3,4,i+1)
#print(i, holiday_list[i])
this_holiday=holidays_only[holidays_only['holiday']==holiday_list[i]]
this_holiday_2=holidays_only_2[holidays_only_2['holiday']==holiday_list[i]]
#print(this_holiday['holiday'].value_counts())
#print(this_holiday[['hour','traffic_volume']])
this_holiday_avg_by_hr=this_holiday.groupby('hour').agg({'traffic_volume':'mean'})
this_holiday_avg_by_hr_2=this_holiday_2.groupby('hour').agg({'traffic_volume':'mean'})
plt.plot(this_holiday_avg_by_hr.index,this_holiday_avg_by_hr['traffic_volume'],label='2012-2014')
plt.plot(this_holiday_avg_by_hr_2.index,this_holiday_avg_by_hr_2['traffic_volume'],label='2015-2018')
if (i>7):
plt.xlabel('Hour')
if (i%4==0):
plt.ylabel('Traffic Volume')
plt.title(holiday_list[i])
plt.legend()
plt.subplot(3,4,12)
plt.plot(non_holidays_by_hr.index,non_holidays_by_hr['traffic_volume'],label='2012-2014')
plt.plot(non_holidays_by_hr_2.index,non_holidays_by_hr_2['traffic_volume'],label='2015-2018')
plt.title('Non-Holidays')
plt.xlabel('Hour')
plt.ylabel('Traffic Volume')
plt.legend()
plt.show()
Christmas Day, Labor Day, Independence Day, Memorial Day, and New Years Day had a pattern of peak traffic at midday. Thanksgiving had a plateau of peak traffic following midday. The other days had normal morning and evening rush hours. This provides evidence for light rush hour travel on days when people do not work. Peak Thanksgiving traffic decreased, and New Years and Christmas traffic increased, in the second era.
plt.figure(figsize=(14,14))
holiday_list=list(daytime_holiday_traffic.index)
for i in range(num_holidays):
this_holiday=holidays_only[holidays_only['holiday']==holiday_list[i]]
this_holiday_avg_by_hr=this_holiday.groupby('hour').agg({'traffic_volume':'mean'})
plt.plot(this_holiday_avg_by_hr.index,this_holiday_avg_by_hr['traffic_volume'],label=holiday_list[i])
plt.xlabel('Hour')
plt.ylabel('Traffic Volume')
plt.title('Traffic Volume by Holidays, 2012-2014')
plt.legend()
plt.show()
plt.figure(figsize=(14,14))
holiday_list=list(daytime_holiday_traffic_2.index)
for i in range(num_holidays):
this_holiday_2=holidays_only_2[holidays_only_2['holiday']==holiday_list[i]]
this_holiday_avg_by_hr_2=this_holiday_2.groupby('hour').agg({'traffic_volume':'mean'})
plt.plot(this_holiday_avg_by_hr_2.index,this_holiday_avg_by_hr_2['traffic_volume'],label=holiday_list[i])
plt.xlabel('Hour')
plt.ylabel('Traffic Volume')
plt.title('Traffic Volume by Holidays, 2015-2018')
plt.legend()
plt.show()
On holidays with more traffic, there is both a morning and evening rush hour. These are the State Fair, Columbus Day, Veterans Day, Martin Luther King Jr Day, and Washington's Birthday. The traffic patterns are similar to ordinary days. On other holidays (Labor Day, Memorial Day, Independence Day, New Years Day, Thanksgiving Day, Christmas Day), the traffic is lowest in the morning, peaks around 2pm, and tapers off somewhat into the evening. This likely corresponds to working days vs days people take off to celebrate, on average.
There is more traffic by day than at night. The highest volumes at night also occur at rush hour, where we made the cut between day and night. There is more traffic in the fall, spring, and summer (the warmer months) than in the winter, except July in 2015-2018, when there was a drop in traffic volume. There are two rush hours, centered around 7 am and 4 pm, on both weekends and weekdays. However, weekend traffic has a higher volume at the evening rush hour despite the lower overall volume.
FRIDAY RUSH HOUR TRAFFIC IN JULY
There is evidence that in July, Friday rush hour traffic is lower than other days of the week in July, and lower than other months on Friday. This may be due to leaving work early on beautiful summer days, or going north during the summer to the lakes, based on common Minnesotan lore and experiences.
TEMPERATURE
The temperature peaks in July around 90 F but is moderately hot in June and August as well. It is quite cold (around -20 F) in December and January. There is a very small effect where, for temperatures over 75 F, traffic volume never falls below a certain floor that increases with temperature.
COUNT OF WEATHER TYPES
In the first era, in August and September, clear weather was the most common, while temperatures were warm. This may explain why traffic volume increased in those months.
The cloudy weather in August and September from 2015-2018 correlates with, but does not necessarily cause, the lower traffic volume of the second era.
On a by-month basis, fog, haze, mist, and snow all have very strong negative correlations with traffic volume in 2012-2014. Clouds were moderately negatively correlated in that era. There was a very weak negative correlation with rain and thunderstorms, and a very weak positive correlation with clear weather.
In the second era, on a by-month basis, snow has the strongest negative correlation. Haze, mist, and smoke were moderately negatively correlated. Rain was weakly positively correlated, and thunderstorms were weakly negatively correlated. There was essentially no relationship for drizzle, fog, clouds, or clear weather.
NUMERICAL WEATHER INFO
In 2012-2014, temperature was strongly positively correlated with traffic volume. Time of day also had a moderately strong positive correlation, and cloud cover had a moderate negative correlation.
In 2015-2018, there was an increasing trend in traffic volume by year. Snow was moderately negatively correlated with traffic volume, and temperature was weakly positively correlated.
The positive correlation with temperature and the negative correlation with snow may explain the low traffic volume in December and January (people don't like to drive in bad weather). Or this may be mere correlation.
WEATHER ON A BY-HOUR BASIS
There are no significant correlations of traffic volume with numerical weather indicators on a by-hour basis.
No main weather type has a mean traffic volume exceeding 5000 vehicles/hour. In 2012-2014, traffic volume was low during fog and snow. In 2015-2018, traffic volume was low during snow and smoke, and high during fog. Smoke was only present during the second era. The mean daytime traffic volume was 4689 vehicles/hour in the first era and 4674 vehicles/hour in the second.
Freezing rain has a traffic volume exceeding 5000 vehicles/hour. Light snow, snow, and very heavy rain have low volumes. The low volumes seem sensible to a Minnesotan, because traffic slows down quite a bit in these conditions.
HOLIDAY TRAFFIC
The holidays are New Years Day, Martin Luther King Jr Day, Washington's Birthday, Memorial Day, Independence Day, State Fair, Labor Day, Columbus Day, Veterans Day, Thanksgiving Day, and Christmas Day.
Of the holidays, the State Fair had the most traffic, with over 5000 vehicles/hour in both 2012-2014 and 2015-2018. Veterans Day and Columbus Day also had a fair amount of traffic. Christmas Day, Thanksgiving, Independence Day, and New Years Day had the least, at around 2000-2800 vehicles/hour, with lower volumes in the first era for all of these except Thanksgiving.
Christmas Day, Labor Day, Independence Day, Memorial Day, and New Years Day had a pattern of peak traffic at midday (around 2 pm). Thanksgiving had a plateau of peak traffic following midday. The other days had normal morning and evening rush hours. People likely have these specific days off, so there is no commute to and from work.
Peak Thanksgiving traffic decreased, and New Years and Christmas traffic increased, in the second era.
On holidays with more traffic, there is both a morning and evening rush hour. These are the State Fair, Columbus Day, Veterans Day, Martin Luther King Jr Day, and Washington's Birthday.
ASSESSMENT OF THE EFFECT IN JULY, JANUARY, AND DECEMBER
While there was a road closure July 22-24, 2016, and there seems to have been a similar closure July 25, 2015, excluding these closures does not resolve the decrease in July traffic volume in 2015-2018.
The squall in the data is a tempting explanation, but is not a good one, since it takes place in May 2013 rather than July between 2015-2018.
Smoke was present on July 6, 2015; May 7, 2016; and August 18, 2018. The traffic volume was most remarkably low in May 2016. It would be challenging to claim that this accounted for the whole July effect in 2015-2018.
The low traffic on Independence Day itself cannot explain the drop in July traffic from the first era to the second because there is a higher average traffic volume in the second era on Independece Day.
I believe that so far the effect of lower traffic volume from 2015-2018 in July correlates most strongly to Friday traffic volumes in July. It may also be due to students, especially University of Minnesota students, not driving in the summer.