The purpose of our analysis is to explore methods, features, and visualizations that can be useful for identifying indicators of relationships between data.
We will use as a sample traffic intensity on the I-94 Interstate highway.
Data Set Informatrion:\
The dataset is made by John Hogue and can be download from the UCI Machine Learning Repository.\
Hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301,
roughly midway between Minneapolis and St Paul, MN.
Hourly weather features and holidays included for impacts on traffic volume from 2012-2018.
Attibute Information:
Attribute | Description |
---|---|
holiday | Categorical US National holidays plus regional holiday, Minnesota State Fair |
temp | Numeric Average temp in kelvin |
rain_1h | Numeric Amount in mm of rain that occurred in the hour |
snow_1h | Numeric Amount in mm of snow that occurred in the hour |
clouds_all | Numeric Percentage of cloud cover |
weather_main | Categorical Short textual description of the current weather |
weather_description | Categorical Longer textual description of the current weather |
date_time | DateTime Hour of the data collected in local CST time |
traffic_volume | Numeric Hourly I-94 ATR 301 reported westbound traffic volume |
import pandas as pd
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
%matplotlib inline
traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48204 entries, 0 to 48203 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 holiday 48204 non-null object 1 temp 48204 non-null float64 2 rain_1h 48204 non-null float64 3 snow_1h 48204 non-null float64 4 clouds_all 48204 non-null int64 5 weather_main 48204 non-null object 6 weather_description 48204 non-null object 7 date_time 48204 non-null object 8 traffic_volume 48204 non-null int64 dtypes: float64(3), int64(2), object(4) memory usage: 3.3+ MB
print("The file contains", len(traffic.columns), "columns and", len(traffic.index), "rows" )
The file contains 9 columns and 48204 rows
#Examination the first and the last five rows
traffic
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
48204 rows × 9 columns
Examination the dataset showed following:
The data set covers 6 years (from 2012 to 2018 included).
6 years * 365 days * 24 hours = 52560 .
There are 48204 rows in the data set.
Thus, (52560 - 48204)/ 24 = 181.5 days are missing in the data. Is it about 6 months, and we do not even what season they are.
So, our further analysis cannot be taken as the iron fact.
Step A.1. Research and removing inaccurate data
traffic['rain_1h'].describe()
count 48204.000000 mean 0.334264 std 44.789133 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 9831.300000 Name: rain_1h, dtype: float64
traffic[traffic['rain_1h'] >= 50]
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
8247 | None | 289.10 | 55.63 | 0.0 | 68 | Rain | very heavy rain | 2013-08-07 02:00:00 | 315 |
24872 | None | 302.11 | 9831.30 | 0.0 | 75 | Rain | very heavy rain | 2016-07-11 17:00:00 | 5535 |
Maximum rain about 10000 mm in the hour is 10 m in the hour, it's much much more than the world record (Weather Records).
Since it is just one line, we can remove it.
traffic['temp'].describe()
count 48204.000000 mean 281.205870 std 13.338232 min 0.000000 25% 272.160000 50% 282.450000 75% 291.806000 max 310.070000 Name: temp, dtype: float64
traffic[traffic['temp'] < 230]
holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume | |
---|---|---|---|---|---|---|---|---|---|
11898 | None | 0.0 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2014-01-31 03:00:00 | 361 |
11899 | None | 0.0 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2014-01-31 04:00:00 | 734 |
11900 | None | 0.0 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2014-01-31 05:00:00 | 2557 |
11901 | None | 0.0 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2014-01-31 06:00:00 | 5150 |
11946 | None | 0.0 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2014-02-02 03:00:00 | 291 |
11947 | None | 0.0 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2014-02-02 04:00:00 | 284 |
11948 | None | 0.0 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2014-02-02 05:00:00 | 434 |
11949 | None | 0.0 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2014-02-02 06:00:00 | 739 |
11950 | None | 0.0 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2014-02-02 07:00:00 | 962 |
11951 | None | 0.0 | 0.0 | 0.0 | 0 | Clear | sky is clear | 2014-02-02 08:00:00 | 1670 |
Temperature 0 degrees in Kelvin is -273 Celsius. The maximum record in the area is about -42 C (what is about 231 K).
Since there are just 11 line with inaccurate data, we can remove them also.
traffic.to_csv('traffic_cleaned.csv', index= False)
traffic_cleaned = pd.read_csv('traffic_cleaned.csv')
traffic_cleaned = traffic_cleaned[traffic_cleaned['rain_1h'] < 100]
traffic_cleaned = traffic_cleaned[traffic_cleaned['temp'] != 0]
traffic_cleaned.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 48193 entries, 0 to 48203 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 holiday 48193 non-null object 1 temp 48193 non-null float64 2 rain_1h 48193 non-null float64 3 snow_1h 48193 non-null float64 4 clouds_all 48193 non-null int64 5 weather_main 48193 non-null object 6 weather_description 48193 non-null object 7 date_time 48193 non-null datetime64[ns] 8 traffic_volume 48193 non-null int64 9 hour 48193 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(3), object(3) memory usage: 4.0+ MB
print("The file contains", len(traffic_cleaned.columns), "columns and", len(traffic_cleaned.index), "rows" )
The file contains 10 columns and 48193 rows
Step B.1. Observation of Traffic Volume
Traffic Volume is the number of vehicles crossing a section of road per unit time at any selected period (Definition).
print('There are', len(traffic_cleaned['traffic_volume'].unique()), 'unique values in the"traffic_volume" column.')
There are 6704 unique values in the"traffic_volume" column.
traffic_cleaned['traffic_volume'].describe()
count 48193.000000 mean 3260.174029 std 1986.754010 min 0.000000 25% 1194.000000 50% 3380.000000 75% 4933.000000 max 7280.000000 Name: traffic_volume, dtype: float64
The maximum value is 7280, let's take it roughly as 7500, which can be divided by 3: 7500/3 = 2500.
Now we will decide the following:
Also, x-coordinates with 8000 units will suffice for the maximum values in the "traffic volume" columns.
We will not calculate the maximum units for y-coordinate spine since we can count on the Python ability.
# To make our chart reasonably readable, let's divide the unique values (6784 units ~ 6800) by 34 (6800/200)
fig = plt.figure(figsize = (12, 5))
traffic_cleaned['traffic_volume'].plot.hist(bins = 34)
plt.title('Intensity of Traffic Volume', fontsize = 20)
plt.xlabel('Traffic Volume (vehicles per hour)', size = 15)
plt.ylabel('Frequency in Data', size = 15)
plt.axhline(y = 1000)
plt.axhline(y = 2000)
<matplotlib.lines.Line2D at 0x170a754fe80>
print("Data contains", len(traffic_cleaned['traffic_volume']), "lines with non-null values\
in the 'traffic_volume' column.\nThe maximun amount is", traffic_cleaned['traffic_volume'].max(),\
"vehicles per hour.\nThe minimum amount is", traffic_cleaned['traffic_volume'].min(),\
"vehicles per hour.\nThe average amount is", round((traffic_cleaned['traffic_volume'].mean())),\
"vehicles per hour.")
Data contains 48193 lines with non-null valuesin the 'traffic_volume' column. The maximun amount is 7280 vehicles per hour. The minimum amount is 0 vehicles per hour. The average amount is 3260 vehicles per hour.
Looking at the graph above, we can say that the most frequent cases (over 4000 cases) are when the traffic is low (less than 500 cars per hour).
At the same time, cases of heavy traffic (about 7,000 vehicles per hour) are rare (less than 1000 cases).
We see three frequency peaks: 500-600 vehicles per hour ( about 2200 cases), about 3000 (about 1800 cases) and about 4800-4900 vehicles per hour (about 2200 cases).
Since we have no comparison with any factors so far, we cannot yet see the cause of the facts.
Step B.2. Comparing daytime with nighttime data
Comparing daytime and nighttime traffic is a natural way to think about what could cause traffic volume volatility.
Step B.2.1. Extracting daytime data from the data set
# transformation the 'date_time' column to datetime type
traffic_cleaned['date_time'] = pd.to_datetime(traffic_cleaned['date_time'])
traffic_cleaned['date_time'].dt.hour.nunique()
24
# Adding a 'hour' column
traffic_cleaned['hour'] = traffic_cleaned['date_time'].dt.hour
fig = plt.figure(figsize = (15, 25))
sns.relplot(data=traffic_cleaned, x='traffic_volume', y='hour')
plt.title('Intensity of Traffic Volume per Hour', fontsize = 20)
plt.xlabel('Traffic Volume (vehicles per hour)', size = 15)
plt.ylabel('Hour of a Day', size = 15)
plt.axvline(x = 6000)
plt.axvline(x = 5000)
plt.show()
<Figure size 1080x1800 with 0 Axes>
The chart above showed the following:
Thus, in accordance with our main goal of analyzing the causes of heavy traffic, we can count the hours from 6:00 to 19:00 included as the daytime, and set aside the rest of the time as night time.
daytime_hours = list(range(6, 19))
traffic_daytime = traffic_cleaned.loc[traffic_cleaned['hour'].isin(daytime_hours)].copy()
traffic_nighttime = traffic_cleaned.loc[-(traffic_cleaned['hour'].isin(daytime_hours))].copy()
print("Daytime hours:\n", sorted(traffic_daytime['hour'].unique()),\
"\nNighttime hours:\n", sorted(traffic_nighttime['hour'].unique())
)
Daytime hours: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] Nighttime hours: [0, 1, 2, 3, 4, 5, 19, 20, 21, 22, 23]
Step B.2.2. Comparing two parts of the data
print(traffic_daytime['traffic_volume'].max(),'\n'\
, traffic_nighttime['traffic_volume'].max())
7280 4939
traffic_daytime['traffic_volume'].describe()
count 25959.000000 mean 4712.453561 std 1281.148328 min 0.000000 25% 4232.000000 50% 4850.000000 75% 5597.000000 max 7280.000000 Name: traffic_volume, dtype: float64
traffic_nighttime['traffic_volume'].describe()
count 22234.000000 mean 1564.585095 std 1140.972961 min 0.000000 25% 484.000000 50% 1117.000000 75% 2693.000000 max 4939.000000 Name: traffic_volume, dtype: float64
print (len(traffic_daytime['traffic_volume'].unique()), "\n"\
,len(traffic_nighttime['traffic_volume'].unique())
)
5137 3642
Maxumal amount of unique values in daytime dataset is 5137, which is roughly 5200.
5200 / 200 = 26, so, in our chart below bins = 26.
Maxumal amount of unique values in nighttime dataset is 3642, which is roughly 3800.
3800 / 200 = 26, so, in our chart below bins = 19
# Creating the charts
plt.figure(figsize = (13, 6))
plt.subplot(1,2,1)
plt.hist(traffic_daytime['traffic_volume'], bins = 26)
plt.title('Traffic Volume at Day')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.xlim([0, 8000])
plt.ylim([0, 6000])
plt.subplot(1,2,2)
plt.hist(traffic_nighttime['traffic_volume'], bins = 19)
plt.title('Traffic Volume at Night')
plt.xlabel('Traffic Volume')
plt.ylabel('Frequency')
plt.xlim([0, 8000])
plt.ylim([0, 6000])
plt.show()
The "Traffic Volume at Day" histogram is quite close to the normal distribution, although the peak is shifted slightly to the right.
The chart "Traffic Volume at Night" can be estimated as left skewed with an additional peak in the middle.
print ("The maxumal traffic intensity (", traffic_nighttime['traffic_volume'].max()," cars per hour) at night is close to the average traffic volume during the day ( ", \
round(traffic_daytime['traffic_volume'].mean()), "cars per hour).\nThus, we can safely concentrate only on daytime.")
The maxumal traffic intensity ( 4939 cars per hour) at night is close to the average traffic volume during the day ( 4712 cars per hour). Thus, we can safely concentrate only on daytime.
Step B.3. The Average Traffic Volume per Month
In this section we will explore traffic variability during the year.
traffic_daytime['month'] = traffic_daytime['date_time'].dt.month
by_month = round((traffic_daytime.groupby('month').mean()), 2)
by_month['traffic_volume']
month 1 4454.12 2 4674.38 3 4808.60 4 4855.75 5 4864.85 6 4844.95 7 4547.78 8 4880.41 9 4822.25 10 4885.18 11 4655.13 12 4315.10 Name: traffic_volume, dtype: float64
fig = plt.figure(figsize = (10, 6))
by_month['traffic_volume'].plot.line()
plt.title('The Average Traffic Volume per Month', fontsize = 20)
plt.ylabel('Average Traffic Volume', size = 15)
plt.xlabel('Months', size = 15)
plt.xticks(ticks= by_month.index,
labels= ['Jan','Feb','Mar','Apr',
'May','Jun','Jul','Aug',
'Sep','Oct','Nov','Dec'],
size= 12,
rotation= 45)
([<matplotlib.axis.XTick at 0x170a6a355e0>, <matplotlib.axis.XTick at 0x170a6a35ca0>, <matplotlib.axis.XTick at 0x170a6283130>, <matplotlib.axis.XTick at 0x170a6a3e370>, <matplotlib.axis.XTick at 0x170a6a3ec70>, <matplotlib.axis.XTick at 0x170a6633850>, <matplotlib.axis.XTick at 0x170a6a3e9d0>, <matplotlib.axis.XTick at 0x170a6633910>, <matplotlib.axis.XTick at 0x170a72f4c10>, <matplotlib.axis.XTick at 0x170a6a4fa60>, <matplotlib.axis.XTick at 0x170a6a4f5e0>, <matplotlib.axis.XTick at 0x170a4b4c4f0>], [Text(1, 0, 'Jan'), Text(2, 0, 'Feb'), Text(3, 0, 'Mar'), Text(4, 0, 'Apr'), Text(5, 0, 'May'), Text(6, 0, 'Jun'), Text(7, 0, 'Jul'), Text(8, 0, 'Aug'), Text(9, 0, 'Sep'), Text(10, 0, 'Oct'), Text(11, 0, 'Nov'), Text(12, 0, 'Dec')])
by_month['traffic_volume'].describe()
count 12.00000 mean 4717.37500 std 189.93909 min 4315.10000 25% 4628.29250 50% 4815.42500 75% 4858.02500 max 4885.18000 Name: traffic_volume, dtype: float64
tv_by_m_mean = round(by_month['traffic_volume']).mean()
tv_by_m_max = round(by_month['traffic_volume']).max()
tv_by_m_min = round(by_month['traffic_volume']).min()
prop_max =round(( tv_by_m_max / tv_by_m_mean)*100)
prop_min =round(( tv_by_m_min / tv_by_m_mean)*100)
print("The maxumal traffic volume is", prop_max, "% of average.\nThe minimal traffic intensity is", prop_min, "% of average")
The maxumal traffic volume is 104 % of average. The minimal traffic intensity is 91 % of average
The line chart above showed the following:
However, if we look at the numbers, we will see:
Thus, we can talk about the stability of traffic intensity throughout the year.
This means that those connected with this highway, from traffic police to the smallest cafe along the road, must be prepared for a lot of work throughout the year actually without decreading.
Step B.4. The Average Traffic Volume per a Time Unit
In this section, we will try to find what cause traffic volume volatility during a week, a day of a week and a hour of a day.
Step B.4.1. The Average Traffic Volume per a Day of a Week
traffic_daytime["day_of_week"]= traffic_daytime['date_time'].dt.dayofweek
by_dayofweek = traffic_daytime.groupby('day_of_week').mean().round(2)
by_dayofweek['traffic_volume']
day_of_week 0 4908.91 1 5214.63 2 5305.49 3 5325.30 4 5283.93 5 3723.53 6 3224.11 Name: traffic_volume, dtype: float64
fig = plt.figure(figsize = (8, 5))
by_dayofweek['traffic_volume'].plot.line()
plt.title('The Average Traffic Volume per Day of a Week', fontsize = 20)
plt.ylabel('Average Traffic Volume', size = 15)
plt.xlabel('Days of a Week', size = 15)
plt.xticks(ticks = by_dayofweek.index,
labels = ['Mon','Tue','Wed','Thu',
'Fri','Sat','Sun'],
fontsize = 12)
([<matplotlib.axis.XTick at 0x170a5e71c40>, <matplotlib.axis.XTick at 0x170a5e715b0>, <matplotlib.axis.XTick at 0x170a723f0d0>, <matplotlib.axis.XTick at 0x170a319ec40>, <matplotlib.axis.XTick at 0x170a32ba730>, <matplotlib.axis.XTick at 0x170a319e2e0>, <matplotlib.axis.XTick at 0x170a32bac70>], [Text(0, 0, 'Mon'), Text(1, 0, 'Tue'), Text(2, 0, 'Wed'), Text(3, 0, 'Thu'), Text(4, 0, 'Fri'), Text(5, 0, 'Sat'), Text(6, 0, 'Sun')])
by_dayofweek['traffic_volume'] . describe()
count 7.000000 mean 4712.271429 std 869.652300 min 3224.110000 25% 4316.220000 50% 5214.630000 75% 5294.710000 max 5325.300000 Name: traffic_volume, dtype: float64
tv_by_d_mean = round(by_dayofweek['traffic_volume'].mean())
tv_by_d_max = round(by_dayofweek['traffic_volume'].max())
tv_by_d_min = round(by_dayofweek['traffic_volume'].min())
prop_d_max = round((tv_by_d_max / tv_by_d_mean)*100)
prop_d_min = round(( tv_by_d_min / tv_by_d_mean)*100)
print("The maxumal traffic volume per day of a week", prop_d_max, "% of average.\nThe minimal \
traffic intensity is", prop_d_min, "% of average")
The maxumal traffic volume per day of a week 113 % of average. The minimal traffic intensity is 68 % of average
The line chart above showed the following:
Looking at the numbers, we can confirm:
Step B.4.2. Weekday Traffic vs Weekend Traffic
traffic_daytime['hour'] = traffic_daytime['date_time'].dt.hour
weekdays = traffic_daytime.copy()[traffic_daytime['day_of_week'] <= 4] # 4 == Friday
weekend = traffic_daytime.copy()[traffic_daytime['day_of_week'] >= 5] # 5 == Saturday
by_hour_weekdays = weekdays.groupby('hour').mean()
by_hour_weekend = weekend.groupby('hour').mean()
print('Average traffic intensity per hour at weekdays:\n', by_hour_weekdays['traffic_volume'], '\n\n\
Average traffic intensity per hour at weekend:\n', by_hour_weekend['traffic_volume'])
Average traffic intensity per hour at weekdays: hour 6 5366.128360 7 6030.413559 8 5503.497970 9 4895.269257 10 4378.419118 11 4633.419470 12 4855.382143 13 4859.180473 14 5152.995778 15 5592.897768 16 6189.473647 17 5785.009489 18 4434.209431 Name: traffic_volume, dtype: float64 Average traffic intensity per hour at weekend: hour 6 1089.686767 7 1590.406302 8 2339.690516 9 3111.623917 10 3686.632302 11 4044.154955 12 4372.482883 13 4362.296564 14 4358.543796 15 4342.456881 16 4339.693805 17 4151.919929 18 3811.792279 Name: traffic_volume, dtype: float64
fig = plt.figure(figsize = (10,5))
plt.title('Weekday Traffic vs Weekend Traffic', fontsize = 20)
plt.ylabel('Average Traffic Volume (cars per hour)', size = 15)
plt.xlabel('Day Hours', size = 15)
plt.yticks(fontsize = 12)
sns.set_style("ticks")
sns.lineplot(data = by_hour_weekdays['traffic_volume'],
label = "Weekdays", marker='o')
sns.lineplot(data = by_hour_weekend['traffic_volume'],
label = "Weekend", marker = 'o')
plt.axhline(y = 5000) # the average traffic volume (roughly)
plt.legend()
<matplotlib.legend.Legend at 0x170a4a8d580>
by_hour_weekdays['traffic_volume'].describe()
count 13.000000 mean 5205.868959 std 591.392769 min 4378.419118 25% 4855.382143 50% 5152.995778 75% 5592.897768 max 6189.473647 Name: traffic_volume, dtype: float64
by_hour_weekend['traffic_volume'].describe()
count 13.000000 mean 3507.798530 std 1135.134726 min 1089.686767 25% 3111.623917 50% 4044.154955 75% 4342.456881 max 4372.482883 Name: traffic_volume, dtype: float64
by_h_max_weekdays = round(by_hour_weekdays['traffic_volume'].max())
by_h_min_weekdays = round(by_hour_weekdays['traffic_volume'].min())
by_h_mean_weekdays = round(by_hour_weekdays['traffic_volume'].mean())
by_h_max_weekend = round(by_hour_weekend['traffic_volume'].max())
by_h_mean_weekend = round(by_hour_weekend['traffic_volume'].mean())
weekdays_vs_weekend = round((by_h_mean_weekdays /by_h_mean_weekend )*100)
prop_h_max_weekdays = round((by_h_max_weekdays / tv_by_d_mean)*100)
prop_h_min_weekdays = round(( by_h_min_weekdays / tv_by_d_mean)*100)
prop_h_mean_weekdays = round(( by_h_mean_weekdays / tv_by_d_mean)*100)
prop_h_max_weekend = round((by_h_max_weekend / tv_by_d_mean)*100)
prop_h_mean_weekend = round(( by_h_mean_weekend / tv_by_d_mean)*100)
print("The statistic calculations showed following:\n\
- The average traffic of weekdays more than weekend trafic on", weekdays_vs_weekend - 100, "% .\n \
- The maxumal traffic on weekdays is more than average traffic in general on", prop_h_max_weekdays - 100, "%.\n\
- The minimal traffic volume on weekdays is less than average traffic in general on", 100 - prop_h_min_weekdays, "%.\n\
- The average weekdays traffic is more than average traffic in general on", prop_h_mean_weekdays - 100, "%.\n\
- The maxumal traffic volume on weekend is less than average traffic in general on", 100 - prop_h_max_weekend, "%.\n\
- The average traffic volume on weekend is less than average traffic in general on", 100 - prop_h_mean_weekend, "%."
)
The statistic calculations showed following: - The average traffic of weekdays more than weekend trafic on 48 % . - The maxumal traffic on weekdays is more than average traffic in general on 31 %. - The minimal traffic volume on weekdays is less than average traffic in general on 7 %. - The average weekdays traffic is more than average traffic in general on 10 %. - The maxumal traffic volume on weekend is less than average traffic in general on 7 %. - The average traffic volume on weekend is less than average traffic in general on 26 %.
The line chart above showed the following:
Looking at the numbers, we can confirm the following:
Step B.5. Correlation of Weather Condition with Traffic Intensity
In this section, we will try to figure out how weather conditions affect traffic volume.
We will work with the business days stored in the 'weekdays' dataframe.
Explanation:
Our goal is to find circumstances that affect traffic. On weekends, of course, people prefer to stay at home if the weather is bad,
therefore, the traffic will be even less intense than usual. We have to calculate whether bad weather affects traffic on weekdays.
Step B.5.1 Corellation between traffic volume and weather conditions stored as numerical data
round(weekdays['traffic_volume'].mean())
5205
weekdays.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 18572 entries, 0 to 48147 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 holiday 18572 non-null object 1 temp 18572 non-null float64 2 rain_1h 18572 non-null float64 3 snow_1h 18572 non-null float64 4 clouds_all 18572 non-null int64 5 weather_main 18572 non-null object 6 weather_description 18572 non-null object 7 date_time 18572 non-null datetime64[ns] 8 traffic_volume 18572 non-null int64 9 hour 18572 non-null int64 10 month 18572 non-null int64 11 day_of_week 18572 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(5), object(3) memory usage: 1.8+ MB
# what columns are numerical we took from the paragraph Attribute Information above
weather_numerical = weekdays[['temp','rain_1h','snow_1h','clouds_all', 'traffic_volume']]
weather_numerical.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 18572 entries, 0 to 48147 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 temp 18572 non-null float64 1 rain_1h 18572 non-null float64 2 snow_1h 18572 non-null float64 3 clouds_all 18572 non-null int64 4 traffic_volume 18572 non-null int64 dtypes: float64(3), int64(2) memory usage: 870.6 KB
#Spearman methods measures the degree of association between two variables.
weather_numerical.corr(method = 'spearman')['traffic_volume'].round(2).sort_values()
clouds_all -0.09 rain_1h -0.02 snow_1h -0.02 temp 0.11 traffic_volume 1.00 Name: traffic_volume, dtype: float64
The above correlation calculation showed the following:
However, the influence of weather conditions on traffic is insignificant:\
To better understand the numbers, let's visualize the relationship between traffic and weather conditions on graphs
weather_columns = ['clouds_all', 'rain_1h', 'snow_1h', 'temp']
weather_names = [ "Cloud Coverage", "Rain", "Snowfall", "Average Temperature"]
colors = ['black', 'green', 'blue', 'red']
units = ["%", "mm/h", "mm/h", "K"]
fig = plt.figure(figsize = (12,10))
for i in range(1,5):
sns.set_style("ticks")
plt.subplot(2,2,i)
sns.scatterplot(data = weather_numerical,
x = 'traffic_volume',y = weather_columns[i-1],
color = colors[i-1])
plt.title(label = "Traffic vs. {}".format(weather_names[i-1]),
fontsize=20)
plt.xlabel(xlabel = "Traffic Volume",
fontsize = 15)
plt.ylabel(ylabel = "{} ({})".format(weather_names[i-1],units[i-1]),
fontsize=15)
fig.tight_layout(pad = 2)
plt.show()
Step B.5.1.1. Checking out the weather conditions in the data (rain and snow)
traffic_cleaned['snow_1h'].value_counts().sort_index()
0.00 48130 0.05 14 0.06 12 0.08 2 0.10 6 0.13 6 0.17 3 0.21 1 0.25 6 0.32 5 0.44 2 0.51 6 Name: snow_1h, dtype: int64
0.5 mm per hour does not look like snowfall. And 48130 hours without snow (for 5.5 years) does not correlate in any way with the description of the climate (snowfall in April-October is a rather trivial phenomenon in this area).
print("The given data set contains values which does not equel zero :\n",\
round((len(traffic_cleaned[traffic_cleaned['rain_1h'] != 0])/len(traffic_cleaned))*100), "% in the rain_1h column,\n",\
round(((len(traffic_cleaned[traffic_cleaned['snow_1h'] != 0])/len(traffic_cleaned))*100), 3), "% in the 'snow_1h' column,\n",\
round((len(traffic_cleaned[traffic_cleaned['clouds_all'] != 0])/len(traffic_cleaned))*100), "% in the 'clouds_all' column,\n",\
round((len(traffic_cleaned[traffic_cleaned['temp'] != 0])/len(traffic_cleaned))*100), "% in the 'temp' column.")
The given data set contains values which does not equel zero : 7 % in the rain_1h column, 0.131 % in the 'snow_1h' column, 96 % in the 'clouds_all' column, 100 % in the 'temp' column.
Now we can better understand our analysis:
Step B.5.2. Corellation between traffic volume and weather conditions stored as categorical data
In this section, we will try to figure out how complex weather conditions affect traffic volume. We again will work with the business days stored in the 'weekdays' dataframe.
by_weather_main = weekdays.groupby('weather_main').mean().reset_index().sort_values('traffic_volume')
by_weather_description = weekdays.groupby('weather_description').mean().reset_index().sort_values('traffic_volume')
Step B.5.2.1. Exploration the "weather_main" column
by_weather_main.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 11 entries, 9 to 2 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 weather_main 11 non-null object 1 temp 11 non-null float64 2 rain_1h 11 non-null float64 3 snow_1h 11 non-null float64 4 clouds_all 11 non-null float64 5 traffic_volume 11 non-null float64 6 hour 11 non-null float64 7 month 11 non-null float64 8 day_of_week 11 non-null float64 dtypes: float64(8), object(1) memory usage: 880.0+ bytes
by_weather_main['traffic_volume'].value_counts().sort_index()
4211.000000 1 4772.062237 1 4990.468217 1 5185.315278 1 5204.408946 1 5211.857143 1 5219.994286 1 5237.311249 1 5251.874354 1 5265.739767 1 5281.033755 1 Name: traffic_volume, dtype: int64
Step B.5.2.2. Exploration the "weather_description" column
by_weather_description.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 37 entries, 0 to 26 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 weather_description 37 non-null object 1 temp 37 non-null float64 2 rain_1h 37 non-null float64 3 snow_1h 37 non-null float64 4 clouds_all 37 non-null float64 5 traffic_volume 37 non-null float64 6 hour 37 non-null float64 7 month 37 non-null float64 8 day_of_week 37 non-null float64 dtypes: float64(8), object(1) memory usage: 2.9+ KB
by_weather_description['traffic_volume'].value_counts().sort_index()
4211.000000 1 4314.000000 1 4451.106195 1 4618.636364 1 4654.222222 1 4788.295042 1 4856.567100 1 4932.666667 1 4980.142857 1 4990.468217 1 5162.235294 1 5173.384181 1 5185.315278 1 5191.116556 1 5204.408946 1 5205.983051 1 5211.857143 1 5216.000000 1 5222.000000 1 5231.235294 1 5234.427885 1 5235.021991 1 5236.800000 1 5245.875000 1 5250.180473 1 5251.957279 1 5254.884615 1 5260.458711 1 5272.462500 1 5283.446903 1 5283.620614 1 5329.078298 1 5338.305158 1 5387.600000 1 5517.923077 1 5579.750000 1 5664.000000 1 Name: traffic_volume, dtype: int64
Step B.5.2.3. Correlation between traffic and complex weather conditions
fig = plt.figure(figsize = (10, 6))
sns.barplot(data = by_weather_main, x = 'traffic_volume', y = 'weather_main')
sns.set_style("ticks")
plt.title(label = "Traffic Volume by Main Weather Conditions",
fontsize = 20)
plt.xlabel(xlabel = "Average Traffic Volume", fontsize = 15)
plt.ylabel(ylabel = "Main Weather Conditions", size = 15)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.axvline(x= 5200) # average traffic volume at weekdays
plt.show()
It seems that clouds, clear sky and drizzle hace the most influence on the traffic volume.
However, the ifluence is rather weak.
fig = plt.figure(figsize = (12, 14))
sns.barplot(data = by_weather_description, x = 'traffic_volume', y = 'weather_description' )
sns.set_style("ticks")
plt.title(label = "Traffic Volume by Detailed Weather Conditions",
fontsize = 20)
plt.xlabel(xlabel = "Average Traffic Volume", fontsize = 15)
plt.ylabel(ylabel = "Description of Weather Conditions", size = 15)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 10)
plt.axvline(x= 5200) # average traffic volume at weekdays
plt.show()
It appears that "shower snow", "light rain and snow" and "proximity thunderstorm with rain" have the biggest impact on traffic.
However, this does not seem significant.
Step B.6. Traffic Intensity during Holidays
In this section, we will try to figure out how traffic volume behaves during holidays. We will work with the cleaned dataset.
traffic_cleaned['holiday'].value_counts()
None 48132 Labor Day 7 Thanksgiving Day 6 Christmas Day 6 New Years Day 6 Martin Luther King Jr Day 6 Columbus Day 5 Veterans Day 5 Washingtons Birthday 5 Memorial Day 5 Independence Day 5 State Fair 5 Name: holiday, dtype: int64
len(traffic_cleaned[traffic_cleaned['holiday']!= 'None'])
61
holidays = traffic_cleaned[traffic_cleaned['holiday'] != 'None']
by_holidays = holidays.groupby('holiday').mean().reset_index().sort_values('traffic_volume')
by_holidays['traffic_volume'].describe()
count 11.000000 mean 855.200866 std 262.772492 min 519.400000 25% 635.000000 50% 827.500000 75% 1044.571429 max 1356.000000 Name: traffic_volume, dtype: float64
fig = plt.figure(figsize = (10, 6))
sns.barplot(data = by_holidays, x = 'traffic_volume', y = 'holiday')
sns.set_style("ticks")
plt.title(label = "Traffic Volume by Holidays",
fontsize = 20)
plt.xlabel(xlabel = "Average Traffic Volume", fontsize = 15)
plt.ylabel(ylabel = "Holidays", size = 15)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.axvline(x= 855) # average traffic volume at weekdays
plt.show()
print("The statistic calculations showed following:\n\
- The average traffic on holidays is", round((holidays['traffic_volume'].mean() / weekend['traffic_volume'].mean()), 2)*100 , "% of average traffic of weekends.\n \
- The maximum traffic volume on holidays is ", round((holidays['traffic_volume'].max() / holidays['traffic_volume'].mean()), 2)*100, "% of average traffic on holidays.\n\
- The maximun traffic on holidays is", round((holidays['traffic_volume'].max() / weekend['traffic_volume'].max()), 2)*100 , "% of maximum traffic of weekends."
)
The statistic calculations showed following: - The average traffic on holidays is 25.0 % of average traffic of weekends. - The maximum traffic volume on holidays is 178.0 % of average traffic on holidays. - The maximun traffic on holidays is 23.0 % of maximum traffic of weekends.
The graph and statistical calculation above showed that the surveyed highway is almost empty on holidays (average 855 cars per hour versus 3507 on regular weekends).
The most intensive traffic falls on the New Year holidays (1356 cars per hour).
In this project we learned:
For example, as a conclusion from the above analysis, we can say:
It can also be said that no matter how reliable the data may seem, before starting the analysis, it is necessary to check their validation.