Guided Project 4: Finding Heavy Traffic Indicators on I-94¶

Introduction¶

The purpose of our analysis is to explore methods, features, and visualizations that can be useful for identifying indicators of relationships between data.
We will use as a sample traffic intensity on the I-94 Interstate highway.

Data Set Informatrion:\

The dataset is made by John Hogue and can be download from the UCI Machine Learning Repository.\

Hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301,
roughly midway between Minneapolis and St Paul, MN.
Hourly weather features and holidays included for impacts on traffic volume from 2012-2018.

Attibute Information:

Attribute	Description
holiday	Categorical US National holidays plus regional holiday, Minnesota State Fair
temp	Numeric Average temp in kelvin
rain_1h	Numeric Amount in mm of rain that occurred in the hour
snow_1h	Numeric Amount in mm of snow that occurred in the hour
clouds_all	Numeric Percentage of cloud cover
weather_main	Categorical Short textual description of the current weather
weather_description	Categorical Longer textual description of the current weather
date_time	DateTime Hour of the data collected in local CST time
traffic_volume	Numeric Hourly I-94 ATR 301 reported westbound traffic volume

Chapter A. Preparation of the File and Examination the Data¶

In [161]:

import pandas as pd
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
%matplotlib inline

In [162]:

traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              48204 non-null  object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB

In [163]:

print("The file contains", len(traffic.columns), "columns and", len(traffic.index), "rows" )

The file contains 9 columns and 48204 rows

In [164]:

#Examination the first and the last five rows
traffic

Out[164]:

	holiday	temp	rain_1h	snow_1h	clouds_all	weather_main	weather_description	date_time	traffic_volume
0	None	288.28	0.0	0.0	40	Clouds	scattered clouds	2012-10-02 09:00:00	5545
1	None	289.36	0.0	0.0	75	Clouds	broken clouds	2012-10-02 10:00:00	4516
2	None	289.58	0.0	0.0	90	Clouds	overcast clouds	2012-10-02 11:00:00	4767
3	None	290.13	0.0	0.0	90	Clouds	overcast clouds	2012-10-02 12:00:00	5026
4	None	291.14	0.0	0.0	75	Clouds	broken clouds	2012-10-02 13:00:00	4918
...	...	...	...	...	...	...	...	...	...
48199	None	283.45	0.0	0.0	75	Clouds	broken clouds	2018-09-30 19:00:00	3543
48200	None	282.76	0.0	0.0	90	Clouds	overcast clouds	2018-09-30 20:00:00	2781
48201	None	282.73	0.0	0.0	90	Thunderstorm	proximity thunderstorm	2018-09-30 21:00:00	2159
48202	None	282.09	0.0	0.0	90	Clouds	overcast clouds	2018-09-30 22:00:00	1450
48203	None	282.12	0.0	0.0	90	Clouds	overcast clouds	2018-09-30 23:00:00	954

48204 rows × 9 columns

Examination the dataset showed following:

Dataset contains 9 columns and 48204 rows.
Data was stored in different data types: float64 - 3 columns, int64 - 2 columns, object- 4 columns.
No null data.
Temperature is stored in Kelvin, though I'd prefer Celsius granulation, it does not matter as we need to find only influences and dependencies,
Date time is stored as objects (string) and must be converted to datetime type.
3 columns contain categorical data (holiday, weather_main, weather_description)

The data set covers 6 years (from 2012 to 2018 included).
6 years * 365 days * 24 hours = 52560 .
There are 48204 rows in the data set.
Thus, (52560 - 48204)/ 24 = 181.5 days are missing in the data. Is it about 6 months, and we do not even what season they are.
So, our further analysis cannot be taken as the iron fact.

Step A.1. Research and removing inaccurate data

In [165]:

traffic['rain_1h'].describe()

Out[165]:

count    48204.000000
mean         0.334264
std         44.789133
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max       9831.300000
Name: rain_1h, dtype: float64

In [166]:

traffic[traffic['rain_1h'] >= 50]

Out[166]:

	holiday	temp	rain_1h	snow_1h	clouds_all	weather_main	weather_description	date_time	traffic_volume
8247	None	289.10	55.63	0.0	68	Rain	very heavy rain	2013-08-07 02:00:00	315
24872	None	302.11	9831.30	0.0	75	Rain	very heavy rain	2016-07-11 17:00:00	5535

Maximum rain about 10000 mm in the hour is 10 m in the hour, it's much much more than the world record (Weather Records).
Since it is just one line, we can remove it.

In [167]:

traffic['temp'].describe()

Out[167]:

count    48204.000000
mean       281.205870
std         13.338232
min          0.000000
25%        272.160000
50%        282.450000
75%        291.806000
max        310.070000
Name: temp, dtype: float64

In [168]:

traffic[traffic['temp'] < 230]

Out[168]:

	holiday	weather_main	weather_description	date_time	traffic_volume
11898	None	Clear	sky is clear	2014-01-31 03:00:00	361
11899	None	Clear	sky is clear	2014-01-31 04:00:00	734
11900	None	Clear	sky is clear	2014-01-31 05:00:00	2557
11901	None	Clear	sky is clear	2014-01-31 06:00:00	5150
11946	None	Clear	sky is clear	2014-02-02 03:00:00	291
11947	None	Clear	sky is clear	2014-02-02 04:00:00	284
11948	None	Clear	sky is clear	2014-02-02 05:00:00	434
11949	None	Clear	sky is clear	2014-02-02 06:00:00	739
11950	None	Clear	sky is clear	2014-02-02 07:00:00	962
11951	None	Clear	sky is clear	2014-02-02 08:00:00	1670

Temperature 0 degrees in Kelvin is -273 Celsius. The maximum record in the area is about -42 C (what is about 231 K).
Since there are just 11 line with inaccurate data, we can remove them also.

In [169]:

traffic.to_csv('traffic_cleaned.csv', index= False)
traffic_cleaned = pd.read_csv('traffic_cleaned.csv')

In [170]:

traffic_cleaned = traffic_cleaned[traffic_cleaned['rain_1h'] < 100]

In [225]:

traffic_cleaned = traffic_cleaned[traffic_cleaned['temp'] != 0]

In [172]:

traffic_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48193 entries, 0 to 48203
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   holiday              48193 non-null  object        
 1   temp                 48193 non-null  float64       
 2   rain_1h              48193 non-null  float64       
 3   snow_1h              48193 non-null  float64       
 4   clouds_all           48193 non-null  int64         
 5   weather_main         48193 non-null  object        
 6   weather_description  48193 non-null  object        
 7   date_time            48193 non-null  datetime64[ns]
 8   traffic_volume       48193 non-null  int64         
 9   hour                 48193 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(3), object(3)
memory usage: 4.0+ MB

In [173]:

print("The file contains", len(traffic_cleaned.columns), "columns and", len(traffic_cleaned.index), "rows" )

The file contains 10 columns and 48193 rows

Chapter B. Analysis and Visualization¶

Step B.1. Observation of Traffic Volume

Traffic Volume is the number of vehicles crossing a section of road per unit time at any selected period (Definition).

In [174]:

print('There are',  len(traffic_cleaned['traffic_volume'].unique()), 'unique values in the"traffic_volume" column.')

There are 6704 unique values in the"traffic_volume" column.

In [175]:

traffic_cleaned['traffic_volume'].describe()

Out[175]:

count    48193.000000
mean      3260.174029
std       1986.754010
min          0.000000
25%       1194.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64

The maximum value is 7280, let's take it roughly as 7500, which can be divided by 3: 7500/3 = 2500.
Now we will decide the following:

traffic in the range from 0 to 2500 is light,
traffic in the range from 2500 to 5000 is average,
traffic over 5000 is heavy\

Also, x-coordinates with 8000 units will suffice for the maximum values in the "traffic volume" columns.
We will not calculate the maximum units for y-coordinate spine since we can count on the Python ability.

In [176]:

# To make our chart reasonably readable, let's divide the unique values (6784 units ~ 6800) by 34 (6800/200)
fig = plt.figure(figsize = (12, 5))
traffic_cleaned['traffic_volume'].plot.hist(bins = 34)
plt.title('Intensity of Traffic Volume', fontsize = 20)
plt.xlabel('Traffic Volume (vehicles per hour)', size = 15)
plt.ylabel('Frequency in Data', size = 15)
plt.axhline(y = 1000)
plt.axhline(y = 2000)

Out[176]:

<matplotlib.lines.Line2D at 0x170a754fe80>

In [177]:

print("Data contains", len(traffic_cleaned['traffic_volume']), "lines with non-null values\
in the 'traffic_volume' column.\nThe maximun amount is", traffic_cleaned['traffic_volume'].max(),\
"vehicles per hour.\nThe minimum amount is", traffic_cleaned['traffic_volume'].min(),\
"vehicles per hour.\nThe average amount is", round((traffic_cleaned['traffic_volume'].mean())),\
"vehicles per hour.")

Data contains 48193 lines with non-null valuesin the 'traffic_volume' column.
The maximun amount is 7280 vehicles per hour.
The minimum amount is 0 vehicles per hour.
The average amount is 3260 vehicles per hour.

Looking at the graph above, we can say that the most frequent cases (over 4000 cases) are when the traffic is low (less than 500 cars per hour).
At the same time, cases of heavy traffic (about 7,000 vehicles per hour) are rare (less than 1000 cases). We see three frequency peaks: 500-600 vehicles per hour ( about 2200 cases), about 3000 (about 1800 cases) and about 4800-4900 vehicles per hour (about 2200 cases).
Since we have no comparison with any factors so far, we cannot yet see the cause of the facts.

Step B.2. Comparing daytime with nighttime data

Comparing daytime and nighttime traffic is a natural way to think about what could cause traffic volume volatility.

Step B.2.1. Extracting daytime data from the data set

In [178]:

# transformation the 'date_time' column to datetime type
traffic_cleaned['date_time'] = pd.to_datetime(traffic_cleaned['date_time'])

In [179]:

traffic_cleaned['date_time'].dt.hour.nunique()

Out[179]:

In [226]:

# Adding a 'hour' column 
traffic_cleaned['hour'] = traffic_cleaned['date_time'].dt.hour

In [228]:

fig = plt.figure(figsize = (15, 25))

sns.relplot(data=traffic_cleaned, x='traffic_volume', y='hour')

plt.title('Intensity of Traffic Volume per Hour', fontsize = 20)
plt.xlabel('Traffic Volume (vehicles per hour)', size = 15)
plt.ylabel('Hour of a Day', size = 15)
plt.axvline(x = 6000)
plt.axvline(x = 5000)

plt.show()

<Figure size 1080x1800 with 0 Axes>

The chart above showed the following:

there are time from 0:00 to 5:00 am with relatively little traffic (less than 3500 cars per hour).\
from 6:00 to 18:00 included traffic is really dense (even 6000 and much more) with a slight drop at 10 am to 5500 approximately.\
from 19:00 to 24:00 hours the traffic intensity is about average ( (4500 - 5000 vehicles per hour )

Thus, in accordance with our main goal of analyzing the causes of heavy traffic, we can count the hours from 6:00 to 19:00 included as the daytime, and set aside the rest of the time as night time.

In [182]:

daytime_hours = list(range(6, 19))
traffic_daytime = traffic_cleaned.loc[traffic_cleaned['hour'].isin(daytime_hours)].copy()

traffic_nighttime = traffic_cleaned.loc[-(traffic_cleaned['hour'].isin(daytime_hours))].copy()

In [183]:

print("Daytime hours:\n", sorted(traffic_daytime['hour'].unique()),\
          "\nNighttime hours:\n", sorted(traffic_nighttime['hour'].unique())
     )

Daytime hours:
 [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] 
Nighttime hours:
 [0, 1, 2, 3, 4, 5, 19, 20, 21, 22, 23]

Step B.2.2. Comparing two parts of the data

In [184]:

print(traffic_daytime['traffic_volume'].max(),'\n'\
, traffic_nighttime['traffic_volume'].max())

7280 
 4939

In [185]:

traffic_daytime['traffic_volume'].describe()

Out[185]:

count    25959.000000
mean      4712.453561
std       1281.148328
min          0.000000
25%       4232.000000
50%       4850.000000
75%       5597.000000
max       7280.000000
Name: traffic_volume, dtype: float64

In [186]:

traffic_nighttime['traffic_volume'].describe()

Out[186]:

count    22234.000000
mean      1564.585095
std       1140.972961
min          0.000000
25%        484.000000
50%       1117.000000
75%       2693.000000
max       4939.000000
Name: traffic_volume, dtype: float64

In [187]:

print (len(traffic_daytime['traffic_volume'].unique()), "\n"\
        ,len(traffic_nighttime['traffic_volume'].unique())
)

5137 
 3642

Maxumal amount of unique values in daytime dataset is 5137, which is roughly 5200.
5200 / 200 = 26, so, in our chart below bins = 26.
Maxumal amount of unique values in nighttime dataset is 3642, which is roughly 3800.
3800 / 200 = 26, so, in our chart below bins = 19

In [188]:

# Creating the charts
plt.figure(figsize = (13, 6)) 

plt.subplot(1,2,1)
plt.hist(traffic_daytime['traffic_volume'], bins = 26)
plt.title('Traffic Volume at Day')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.xlim([0, 8000])
plt.ylim([0, 6000])

plt.subplot(1,2,2)
plt.hist(traffic_nighttime['traffic_volume'], bins = 19)
plt.title('Traffic Volume at Night')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.xlim([0, 8000])
plt.ylim([0, 6000])

plt.show()

The "Traffic Volume at Day" histogram is quite close to the normal distribution, although the peak is shifted slightly to the right.
The chart "Traffic Volume at Night" can be estimated as left skewed with an additional peak in the middle.

In [189]:

print ("The maxumal traffic intensity  (",  traffic_nighttime['traffic_volume'].max()," cars per hour) at night is close to the average traffic volume during the day ( ", \
    round(traffic_daytime['traffic_volume'].mean()), "cars per hour).\nThus, we can safely concentrate only on daytime.")

The maxumal traffic intensity  ( 4939  cars per hour) at night is close to the average traffic volume during the day (  4712 cars per hour).
Thus, we can safely concentrate only on daytime.

Step B.3. The Average Traffic Volume per Month

In this section we will explore traffic variability during the year.

In [190]:

traffic_daytime['month'] = traffic_daytime['date_time'].dt.month
by_month = round((traffic_daytime.groupby('month').mean()), 2)
by_month['traffic_volume']

Out[190]:

month
1     4454.12
2     4674.38
3     4808.60
4     4855.75
5     4864.85
6     4844.95
7     4547.78
8     4880.41
9     4822.25
10    4885.18
11    4655.13
12    4315.10
Name: traffic_volume, dtype: float64

In [191]:

fig = plt.figure(figsize = (10, 6))
by_month['traffic_volume'].plot.line()
plt.title('The Average Traffic Volume per Month', fontsize = 20)
plt.ylabel('Average Traffic Volume', size = 15)
plt.xlabel('Months', size = 15)
plt.xticks(ticks= by_month.index,
               labels= ['Jan','Feb','Mar','Apr',
                        'May','Jun','Jul','Aug',
                        'Sep','Oct','Nov','Dec'],
               size= 12,
               rotation= 45)

Out[191]:

([<matplotlib.axis.XTick at 0x170a6a355e0>,
  <matplotlib.axis.XTick at 0x170a6a35ca0>,
  <matplotlib.axis.XTick at 0x170a6283130>,
  <matplotlib.axis.XTick at 0x170a6a3e370>,
  <matplotlib.axis.XTick at 0x170a6a3ec70>,
  <matplotlib.axis.XTick at 0x170a6633850>,
  <matplotlib.axis.XTick at 0x170a6a3e9d0>,
  <matplotlib.axis.XTick at 0x170a6633910>,
  <matplotlib.axis.XTick at 0x170a72f4c10>,
  <matplotlib.axis.XTick at 0x170a6a4fa60>,
  <matplotlib.axis.XTick at 0x170a6a4f5e0>,
  <matplotlib.axis.XTick at 0x170a4b4c4f0>],
 [Text(1, 0, 'Jan'),
  Text(2, 0, 'Feb'),
  Text(3, 0, 'Mar'),
  Text(4, 0, 'Apr'),
  Text(5, 0, 'May'),
  Text(6, 0, 'Jun'),
  Text(7, 0, 'Jul'),
  Text(8, 0, 'Aug'),
  Text(9, 0, 'Sep'),
  Text(10, 0, 'Oct'),
  Text(11, 0, 'Nov'),
  Text(12, 0, 'Dec')])

In [192]:

by_month['traffic_volume'].describe()

Out[192]:

count      12.00000
mean     4717.37500
std       189.93909
min      4315.10000
25%      4628.29250
50%      4815.42500
75%      4858.02500
max      4885.18000
Name: traffic_volume, dtype: float64

In [193]:

tv_by_m_mean = round(by_month['traffic_volume']).mean()
tv_by_m_max = round(by_month['traffic_volume']).max()
tv_by_m_min = round(by_month['traffic_volume']).min()

prop_max =round(( tv_by_m_max / tv_by_m_mean)*100)
prop_min =round(( tv_by_m_min / tv_by_m_mean)*100)

print("The maxumal traffic volume is", prop_max, "% of average.\nThe minimal traffic intensity is", prop_min, "% of average")  

The maxumal traffic volume is 104 % of average.
The minimal traffic intensity is 91 % of average

The line chart above showed the following:

traffic intensity is increasing in the spring months,
intensity stays on the one level until October,
though in July there is a noticeable drop,
and in December the fall is simply dramatic.

However, if we look at the numbers, we will see:

the maximum traffic intensity is only 4% higher than the average,
the minimum volume of traffic is only 9% less than the average intensity.

Thus, we can talk about the stability of traffic intensity throughout the year.
This means that those connected with this highway, from traffic police to the smallest cafe along the road, must be prepared for a lot of work throughout the year actually without decreading.

Step B.4. The Average Traffic Volume per a Time Unit

In this section, we will try to find what cause traffic volume volatility during a week, a day of a week and a hour of a day.

Step B.4.1. The Average Traffic Volume per a Day of a Week

In [194]:

traffic_daytime["day_of_week"]= traffic_daytime['date_time'].dt.dayofweek
by_dayofweek = traffic_daytime.groupby('day_of_week').mean().round(2)
by_dayofweek['traffic_volume'] 

Out[194]:

day_of_week
0    4908.91
1    5214.63
2    5305.49
3    5325.30
4    5283.93
5    3723.53
6    3224.11
Name: traffic_volume, dtype: float64

In [195]:

fig = plt.figure(figsize = (8, 5))
by_dayofweek['traffic_volume'].plot.line()
plt.title('The Average Traffic Volume per Day of a Week', fontsize = 20)
plt.ylabel('Average Traffic Volume', size = 15)
plt.xlabel('Days of a Week', size = 15)
plt.xticks(ticks = by_dayofweek.index,
           labels = ['Mon','Tue','Wed','Thu',
                    'Fri','Sat','Sun'],
           fontsize = 12)

Out[195]:

([<matplotlib.axis.XTick at 0x170a5e71c40>,
  <matplotlib.axis.XTick at 0x170a5e715b0>,
  <matplotlib.axis.XTick at 0x170a723f0d0>,
  <matplotlib.axis.XTick at 0x170a319ec40>,
  <matplotlib.axis.XTick at 0x170a32ba730>,
  <matplotlib.axis.XTick at 0x170a319e2e0>,
  <matplotlib.axis.XTick at 0x170a32bac70>],
 [Text(0, 0, 'Mon'),
  Text(1, 0, 'Tue'),
  Text(2, 0, 'Wed'),
  Text(3, 0, 'Thu'),
  Text(4, 0, 'Fri'),
  Text(5, 0, 'Sat'),
  Text(6, 0, 'Sun')])

In [196]:

by_dayofweek['traffic_volume'] . describe()

Out[196]:

count       7.000000
mean     4712.271429
std       869.652300
min      3224.110000
25%      4316.220000
50%      5214.630000
75%      5294.710000
max      5325.300000
Name: traffic_volume, dtype: float64

In [197]:

tv_by_d_mean = round(by_dayofweek['traffic_volume'].mean())
tv_by_d_max = round(by_dayofweek['traffic_volume'].max())
tv_by_d_min = round(by_dayofweek['traffic_volume'].min())

prop_d_max = round((tv_by_d_max / tv_by_d_mean)*100)
prop_d_min = round(( tv_by_d_min / tv_by_d_mean)*100)

print("The maxumal traffic volume  per day of a week", prop_d_max, "% of average.\nThe minimal \
traffic intensity is", prop_d_min, "% of average")  

The maxumal traffic volume  per day of a week 113 % of average.
The minimal traffic intensity is 68 % of average

The line chart above showed the following:

traffic intensity is slightly increasing from Monday to Friday,
during weekdays it is falling dramatically.

Looking at the numbers, we can confirm:

traffic is dense during weekdays,
32% below average means a significant reduction in weekend traffic.

Step B.4.2. Weekday Traffic vs Weekend Traffic

In [198]:

traffic_daytime['hour'] = traffic_daytime['date_time'].dt.hour
weekdays = traffic_daytime.copy()[traffic_daytime['day_of_week'] <= 4] # 4 == Friday
weekend = traffic_daytime.copy()[traffic_daytime['day_of_week'] >= 5] # 5 == Saturday
by_hour_weekdays = weekdays.groupby('hour').mean()
by_hour_weekend = weekend.groupby('hour').mean()

print('Average traffic intensity per hour at weekdays:\n', by_hour_weekdays['traffic_volume'], '\n\n\
Average traffic intensity per hour at weekend:\n', by_hour_weekend['traffic_volume'])

Average traffic intensity per hour at weekdays:
 hour
6     5366.128360
7     6030.413559
8     5503.497970
9     4895.269257
10    4378.419118
11    4633.419470
12    4855.382143
13    4859.180473
14    5152.995778
15    5592.897768
16    6189.473647
17    5785.009489
18    4434.209431
Name: traffic_volume, dtype: float64 

Average traffic intensity per hour at weekend:
 hour
6     1089.686767
7     1590.406302
8     2339.690516
9     3111.623917
10    3686.632302
11    4044.154955
12    4372.482883
13    4362.296564
14    4358.543796
15    4342.456881
16    4339.693805
17    4151.919929
18    3811.792279
Name: traffic_volume, dtype: float64

In [199]:

fig = plt.figure(figsize = (10,5))
plt.title('Weekday Traffic vs Weekend Traffic', fontsize = 20)
plt.ylabel('Average Traffic Volume (cars per hour)', size = 15)
plt.xlabel('Day Hours', size = 15)
plt.yticks(fontsize = 12)

sns.set_style("ticks")
sns.lineplot(data = by_hour_weekdays['traffic_volume'],
             label = "Weekdays", marker='o')
sns.lineplot(data = by_hour_weekend['traffic_volume'],
             label = "Weekend", marker = 'o')

plt.axhline(y = 5000) # the average traffic volume (roughly)

plt.legend()

Out[199]:

<matplotlib.legend.Legend at 0x170a4a8d580>

In [200]:

by_hour_weekdays['traffic_volume'].describe()

Out[200]:

count      13.000000
mean     5205.868959
std       591.392769
min      4378.419118
25%      4855.382143
50%      5152.995778
75%      5592.897768
max      6189.473647
Name: traffic_volume, dtype: float64

In [201]:

by_hour_weekend['traffic_volume'].describe()

Out[201]:

count      13.000000
mean     3507.798530
std      1135.134726
min      1089.686767
25%      3111.623917
50%      4044.154955
75%      4342.456881
max      4372.482883
Name: traffic_volume, dtype: float64

In [202]:

by_h_max_weekdays = round(by_hour_weekdays['traffic_volume'].max())
by_h_min_weekdays = round(by_hour_weekdays['traffic_volume'].min())
by_h_mean_weekdays = round(by_hour_weekdays['traffic_volume'].mean())

by_h_max_weekend = round(by_hour_weekend['traffic_volume'].max())
by_h_mean_weekend = round(by_hour_weekend['traffic_volume'].mean())

weekdays_vs_weekend = round((by_h_mean_weekdays /by_h_mean_weekend )*100)
prop_h_max_weekdays = round((by_h_max_weekdays / tv_by_d_mean)*100)
prop_h_min_weekdays = round(( by_h_min_weekdays / tv_by_d_mean)*100)
prop_h_mean_weekdays = round(( by_h_mean_weekdays / tv_by_d_mean)*100)
prop_h_max_weekend = round((by_h_max_weekend / tv_by_d_mean)*100)
prop_h_mean_weekend = round(( by_h_mean_weekend / tv_by_d_mean)*100)

In [203]:

print("The statistic calculations showed following:\n\
       - The average traffic of weekdays more than weekend trafic on", weekdays_vs_weekend - 100, "% .\n \
      - The maxumal traffic on weekdays is more than average traffic in general on",  prop_h_max_weekdays - 100, "%.\n\
      - The minimal traffic volume on weekdays is less than average traffic in general on", 100 - prop_h_min_weekdays, "%.\n\
      -  The average weekdays traffic is more than average traffic in general on", prop_h_mean_weekdays - 100, "%.\n\
      -  The maxumal traffic volume on weekend is less than average traffic in general on", 100 - prop_h_max_weekend, "%.\n\
      -  The average traffic volume on weekend is less than average traffic in general on", 100 - prop_h_mean_weekend, "%."
)  

The statistic calculations showed following:
       - The average traffic of weekdays more than weekend trafic on 48 % .
       - The maxumal traffic on weekdays is more than average traffic in general on 31 %.
      - The minimal traffic volume on weekdays is less than average traffic in general on 7 %.
      -  The average weekdays traffic is more than average traffic in general on 10 %.
      -  The maxumal traffic volume on weekend is less than average traffic in general on 7 %.
      -  The average traffic volume on weekend is less than average traffic in general on 26 %.

The line chart above showed the following:

on working days, the traffic intensity is more than 4000 cars per hour from morning to evening,
there is a drop at 10:00 when traffic alights to about a minimum,
there are two peaks at 7:00 and 16:00 with maximum traffic intensity,
on weekends, traffic starts from a minimum and smoothly increases to the maxumum until 12:00,
remains on this level until 16:00, then drops slightly until 18:00 .

Looking at the numbers, we can confirm the following:

traffic is dense during every day of weekdays,
usually, the highway is the busiest around 7:00 am and 16:00 on weekdays,
weekend traffic is much less intense and on average does not even reach the average daily traffic volume.

Step B.5. Correlation of Weather Condition with Traffic Intensity

In this section, we will try to figure out how weather conditions affect traffic volume.
We will work with the business days stored in the 'weekdays' dataframe.
Explanation:
Our goal is to find circumstances that affect traffic. On weekends, of course, people prefer to stay at home if the weather is bad,
therefore, the traffic will be even less intense than usual. We have to calculate whether bad weather affects traffic on weekdays.

Step B.5.1 Corellation between traffic volume and weather conditions stored as numerical data

In [204]:

round(weekdays['traffic_volume'].mean())

Out[204]:

In [205]:

weekdays.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18572 entries, 0 to 48147
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   holiday              18572 non-null  object        
 1   temp                 18572 non-null  float64       
 2   rain_1h              18572 non-null  float64       
 3   snow_1h              18572 non-null  float64       
 4   clouds_all           18572 non-null  int64         
 5   weather_main         18572 non-null  object        
 6   weather_description  18572 non-null  object        
 7   date_time            18572 non-null  datetime64[ns]
 8   traffic_volume       18572 non-null  int64         
 9   hour                 18572 non-null  int64         
 10  month                18572 non-null  int64         
 11  day_of_week          18572 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(5), object(3)
memory usage: 1.8+ MB

In [206]:

# what columns are numerical we took from the paragraph Attribute Information above
weather_numerical = weekdays[['temp','rain_1h','snow_1h','clouds_all', 'traffic_volume']]
weather_numerical.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18572 entries, 0 to 48147
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   temp            18572 non-null  float64
 1   rain_1h         18572 non-null  float64
 2   snow_1h         18572 non-null  float64
 3   clouds_all      18572 non-null  int64  
 4   traffic_volume  18572 non-null  int64  
dtypes: float64(3), int64(2)
memory usage: 870.6 KB

Description of correlation methods can be found here
Syntax of .corr() methods is here.
Description of correlation coefficients can be read here
Climate data can be found in Wikipedia

In [207]:

#Spearman methods measures the degree of association between two variables.
weather_numerical.corr(method = 'spearman')['traffic_volume'].round(2).sort_values()

Out[207]:

clouds_all       -0.09
rain_1h          -0.02
snow_1h          -0.02
temp              0.11
traffic_volume    1.00
Name: traffic_volume, dtype: float64

The above correlation calculation showed the following:

Cloudiness, snow falling in mm per hour and rain also in mm per hour have a negative impact on traffic,
this means that the worse the weather got, the less traffic volume got.
Temperature has a positive effect on traffic, i.e. the higher the temperature, the higher the traffic intensity.\

However, the influence of weather conditions on traffic is insignificant:\

For cloudiness, rain and snowfall, the correlation coefficients are close to zero (less than -0.1). We can say that these weather conditions do not have a linear relationship (or have a very weak linear relationship) with traffic.
We can say that the temperature with its correlation coefficient of +0.11 has a weak linear relationship with traffic.

To better understand the numbers, let's visualize the relationship between traffic and weather conditions on graphs

In [208]:

weather_columns = ['clouds_all',   'rain_1h',  'snow_1h', 'temp']
weather_names = [ "Cloud Coverage", "Rain", "Snowfall", "Average Temperature"]
colors = ['black', 'green', 'blue', 'red']
units = ["%", "mm/h", "mm/h", "K"]

fig = plt.figure(figsize = (12,10))
for i in range(1,5):
    sns.set_style("ticks")
    plt.subplot(2,2,i)
    sns.scatterplot(data = weather_numerical,
                    x = 'traffic_volume',y = weather_columns[i-1],
                    color = colors[i-1])
    plt.title(label = "Traffic vs. {}".format(weather_names[i-1]),
              fontsize=20)
    plt.xlabel(xlabel = "Traffic Volume",
           fontsize = 15)
    plt.ylabel(ylabel = "{} ({})".format(weather_names[i-1],units[i-1]),
           fontsize=15)

fig.tight_layout(pad = 2)
plt.show()

Step B.5.1.1. Checking out the weather conditions in the data (rain and snow)

In [209]:

traffic_cleaned['snow_1h'].value_counts().sort_index()

Out[209]:

0.00    48130
0.05       14
0.06       12
0.08        2
0.10        6
0.13        6
0.17        3
0.21        1
0.25        6
0.32        5
0.44        2
0.51        6
Name: snow_1h, dtype: int64

0.5 mm per hour does not look like snowfall. And 48130 hours without snow (for 5.5 years) does not correlate in any way with the description of the climate (snowfall in April-October is a rather trivial phenomenon in this area).

In [210]:

print("The given data set contains values which does not equel zero :\n",\
round((len(traffic_cleaned[traffic_cleaned['rain_1h'] != 0])/len(traffic_cleaned))*100), "%  in the rain_1h column,\n",\
round(((len(traffic_cleaned[traffic_cleaned['snow_1h'] != 0])/len(traffic_cleaned))*100), 3), "% in the 'snow_1h' column,\n",\
round((len(traffic_cleaned[traffic_cleaned['clouds_all'] != 0])/len(traffic_cleaned))*100), "% in the 'clouds_all' column,\n",\
round((len(traffic_cleaned[traffic_cleaned['temp'] != 0])/len(traffic_cleaned))*100),  "%  in the 'temp' column.")

The given data set contains values which does not equel zero :
 7 %  in the rain_1h column,
 0.131 % in the 'snow_1h' column,
 96 % in the 'clouds_all' column,
 100 %  in the 'temp' column.

Now we can better understand our analysis:

cloudiness does not affect traffic,
temperature has a weak positive effect,
we can't solve anything about snow and rain, because we actually have no data,
we can only assume that these weather factors can affect traffic congestion any way (should be negative - the more intense the precipitation, the less traffic intensity become, but we cannot be that sure).

Step B.5.2. Corellation between traffic volume and weather conditions stored as categorical data

In this section, we will try to figure out how complex weather conditions affect traffic volume. We again will work with the business days stored in the 'weekdays' dataframe.

In [211]:

by_weather_main = weekdays.groupby('weather_main').mean().reset_index().sort_values('traffic_volume')
by_weather_description = weekdays.groupby('weather_description').mean().reset_index().sort_values('traffic_volume')

Step B.5.2.1. Exploration the "weather_main" column

In [212]:

by_weather_main.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11 entries, 9 to 2
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   weather_main    11 non-null     object 
 1   temp            11 non-null     float64
 2   rain_1h         11 non-null     float64
 3   snow_1h         11 non-null     float64
 4   clouds_all      11 non-null     float64
 5   traffic_volume  11 non-null     float64
 6   hour            11 non-null     float64
 7   month           11 non-null     float64
 8   day_of_week     11 non-null     float64
dtypes: float64(8), object(1)
memory usage: 880.0+ bytes

In [213]:

by_weather_main['traffic_volume'].value_counts().sort_index()

Out[213]:

4211.000000    1
4772.062237    1
4990.468217    1
5185.315278    1
5204.408946    1
5211.857143    1
5219.994286    1
5237.311249    1
5251.874354    1
5265.739767    1
5281.033755    1
Name: traffic_volume, dtype: int64

Step B.5.2.2. Exploration the "weather_description" column

In [214]:

by_weather_description.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37 entries, 0 to 26
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   weather_description  37 non-null     object 
 1   temp                 37 non-null     float64
 2   rain_1h              37 non-null     float64
 3   snow_1h              37 non-null     float64
 4   clouds_all           37 non-null     float64
 5   traffic_volume       37 non-null     float64
 6   hour                 37 non-null     float64
 7   month                37 non-null     float64
 8   day_of_week          37 non-null     float64
dtypes: float64(8), object(1)
memory usage: 2.9+ KB

In [215]:

by_weather_description['traffic_volume'].value_counts().sort_index()

Out[215]:

4211.000000    1
4314.000000    1
4451.106195    1
4618.636364    1
4654.222222    1
4788.295042    1
4856.567100    1
4932.666667    1
4980.142857    1
4990.468217    1
5162.235294    1
5173.384181    1
5185.315278    1
5191.116556    1
5204.408946    1
5205.983051    1
5211.857143    1
5216.000000    1
5222.000000    1
5231.235294    1
5234.427885    1
5235.021991    1
5236.800000    1
5245.875000    1
5250.180473    1
5251.957279    1
5254.884615    1
5260.458711    1
5272.462500    1
5283.446903    1
5283.620614    1
5329.078298    1
5338.305158    1
5387.600000    1
5517.923077    1
5579.750000    1
5664.000000    1
Name: traffic_volume, dtype: int64

Step B.5.2.3. Correlation between traffic and complex weather conditions

In [216]:

fig = plt.figure(figsize = (10, 6))
sns.barplot(data = by_weather_main,  x = 'traffic_volume', y = 'weather_main')                          
sns.set_style("ticks")

plt.title(label = "Traffic Volume by Main Weather Conditions",
          fontsize = 20)
plt.xlabel(xlabel = "Average Traffic Volume",  fontsize = 15)
plt.ylabel(ylabel = "Main Weather Conditions", size = 15)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

plt.axvline(x= 5200) # average traffic volume at weekdays

plt.show()

It seems that clouds, clear sky and drizzle hace the most influence on the traffic volume.
However, the ifluence is rather weak.

In [217]:

fig = plt.figure(figsize = (12, 14))
sns.barplot(data = by_weather_description,  x = 'traffic_volume', y = 'weather_description' )              
sns.set_style("ticks")

plt.title(label = "Traffic Volume by Detailed Weather Conditions",
          fontsize = 20)
plt.xlabel(xlabel = "Average Traffic Volume",  fontsize = 15)
plt.ylabel(ylabel = "Description of  Weather Conditions", size = 15)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 10)

plt.axvline(x= 5200) # average traffic volume at weekdays

plt.show()

It appears that "shower snow", "light rain and snow" and "proximity thunderstorm with rain" have the biggest impact on traffic.
However, this does not seem significant.

Step B.6. Traffic Intensity during Holidays

In this section, we will try to figure out how traffic volume behaves during holidays. We will work with the cleaned dataset.

In [218]:

traffic_cleaned['holiday'].value_counts()

Out[218]:

None                         48132
Labor Day                        7
Thanksgiving Day                 6
Christmas Day                    6
New Years Day                    6
Martin Luther King Jr Day        6
Columbus Day                     5
Veterans Day                     5
Washingtons Birthday             5
Memorial Day                     5
Independence Day                 5
State Fair                       5
Name: holiday, dtype: int64

In [219]:

len(traffic_cleaned[traffic_cleaned['holiday']!= 'None'])

Out[219]:

In [220]:

holidays = traffic_cleaned[traffic_cleaned['holiday'] != 'None']

In [221]:

by_holidays = holidays.groupby('holiday').mean().reset_index().sort_values('traffic_volume')

In [222]:

by_holidays['traffic_volume'].describe()

Out[222]:

count      11.000000
mean      855.200866
std       262.772492
min       519.400000
25%       635.000000
50%       827.500000
75%      1044.571429
max      1356.000000
Name: traffic_volume, dtype: float64

In [223]:

fig = plt.figure(figsize = (10, 6))
sns.barplot(data = by_holidays,  x = 'traffic_volume', y = 'holiday')                          
sns.set_style("ticks")

plt.title(label = "Traffic Volume by Holidays",
          fontsize = 20)
plt.xlabel(xlabel = "Average Traffic Volume",  fontsize = 15)
plt.ylabel(ylabel = "Holidays", size = 15)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)

plt.axvline(x= 855) # average traffic volume at weekdays

plt.show()

In [224]:

print("The statistic calculations showed following:\n\
       - The average traffic on holidays is", round((holidays['traffic_volume'].mean() / weekend['traffic_volume'].mean()), 2)*100  , "%  of average traffic of weekends.\n \
       - The maximum traffic volume on holidays is ", round((holidays['traffic_volume'].max() / holidays['traffic_volume'].mean()), 2)*100, "% of average traffic on holidays.\n\
       - The maximun traffic on holidays is", round((holidays['traffic_volume'].max() / weekend['traffic_volume'].max()), 2)*100  , "%  of maximum traffic of weekends."
     )      

The statistic calculations showed following:
       - The average traffic on holidays is 25.0 %  of average traffic of weekends.
        - The maximum traffic volume on holidays is  178.0 % of average traffic on holidays.
       - The maximun traffic on holidays is 23.0 %  of maximum traffic of weekends.

The graph and statistical calculation above showed that the surveyed highway is almost empty on holidays (average 855 cars per hour versus 3507 on regular weekends).
The most intensive traffic falls on the New Year holidays (1356 cars per hour).

Conclusion¶

In this project we learned:

how to find the correlation between the numerical characteristics of the object (the correlation between the verbal characteristics still needs to be studied),
how to visualize the correlation and choose the best way to demonstrate it (format, style, labels etc.),
how to understand and describe the result of the analysis.

For example, as a conclusion from the above analysis, we can say:

the intensity of traffic strongly depends on the time of day,
traffic on weekdays is much more intense than on holidays and weekends,
weather conditions either do not affect the traffic volume, or have a weak effect.

It can also be said that no matter how reliable the data may seem, before starting the analysis, it is necessary to check their validation.