Interstate 94 (I-94) is an east–west Interstate Highway connecting the Great Lakes and northern Great Plains regions of the United States. Its western terminus is in Billings, Montana, at a junction with I-90; its eastern terminus is in Port Huron, Michigan, where it meets with I-69 and crosses the Blue Water Bridge into Sarnia, Ontario, Canada, where the route becomes Ontario Highway 402. It thus lies along the primary overland route from Seattle (via I-90) to Toronto (via Ontario Highway 401), and is the only east–west Interstate highway to have a direct connection to Canada.
The traffic records for I-94 were compiled into a dataset created by (). In this project, we will use this dataset to look for indicators of heavy traffic on I-94.
According to the dataset description, all records come from a station located midway between Minneapolis and St Paul, MN, and they cover westbound traffic only (cars travelling from east to west). So we'll look for indicators of heavy westbound traffic near that station, using the columns described below:
holiday (Categorical): US National holidays plus regional holiday, Minnesota State Fair
temp (Numeric): Average temp in kelvin
rain_1h (Numeric): Amount in mm of rain that occurred in the hour
snow_1h (Numeric): Amount in mm of snow that occurred in the hour
clouds_all (Numeric): Percentage of cloud cover
weather_main (Categorical): Short textual description of the current weather
weather_description (Categorical): Longer textual description of the current weather
date_time (DateTime): Hour of the data collected in local CST time
traffic_volume (Numeric): Hourly I-94 ATR 301 reported westbound traffic volume
Now, let's get started!
We'll start the project by loading the data and looking at some basic information about it.
## Using pandas, seaborn, matplotlib libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
## Read the CSV file
i94 = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
## Check basic information
i94.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
From this basic information, we can see that the data has 48204 rows and 9 columns, and that no column has missing values. The date_time column, however, is stored as object (text) rather than as a datetime type. Let's look at the first five rows of the dataset.
## Check first 5 rows of dataset
i94.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
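The row count, the missing-value check and the dtype of date_time can also be read off programmatically instead of from the info() printout. The sketch below uses a small hypothetical DataFrame named sample with made-up values (it is not the real CSV), so only the calls themselves carry over to the actual data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the traffic data (values are made up,
# and one temp value is deliberately missing).
sample = pd.DataFrame({
    "temp": [288.28, 289.36, None],
    "date_time": ["2012-10-02 09:00:00", "2012-10-02 10:00:00", "2012-10-02 11:00:00"],
    "traffic_volume": [5545, 4516, 4767],
})

# shape returns (number of rows, number of columns).
n_rows, n_cols = sample.shape

# isna().sum() counts the missing values in each column.
missing_per_column = sample.isna().sum()

# date_time is still text at this point, so this check is False.
is_datetime_before = pd.api.types.is_datetime64_any_dtype(sample["date_time"])

# After conversion the same check is True.
converted = pd.to_datetime(sample["date_time"])
is_datetime_after = pd.api.types.is_datetime64_any_dtype(converted)

print(n_rows, n_cols)
print(missing_per_column)
print(is_datetime_before, is_datetime_after)
```

On the real data the same three checks would be applied to i94 instead of sample.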
We will explore 'traffic_volume' first. Instead of printing the values, we'll look at the distribution in a graph and then check the summary statistics. Below, we use a histogram to get a quick overview of the data:
## Drawing histogram for 'traffic_volume':
i94['traffic_volume'].plot.hist()
plt.show()
## Check the distribution by number:
i94['traffic_volume'].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
Based on the histogram and the summary statistics, we can see that:
About 25% of the hours had 1193 vehicles or fewer
About 25% of the hours had 4933 vehicles or more, roughly four times as many
A hypothesis: judging from the histogram and the first 5 rows of the dataset, the daytime (roughly 8:00 to 18:00) probably carries more traffic than the rest of the day. The spread between the quartiles above points in the same direction.
Since the time of day may be an indicator of traffic conditions, this gives us a direction for the analysis: compare the records taken during the day with those taken at night.
We can divide the day into daytime and nighttime as follows:
Daytime: hours 7 through 19
Nighttime: hours 20 through 6
## Convert date_time from object to datetime:
i94['date_time'] = pd.to_datetime(i94['date_time'])
## Check the dtype of 'date_time'
i94.info()
i94.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  datetime64[ns]
 8   traffic_volume       48204 non-null  int64
dtypes: datetime64[ns](1), float64(3), int64(2), object(3)
memory usage: 3.3+ MB
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
## Isolate the hour from the datetime data:
hours = i94['date_time'].dt.hour
## Partition the data into daytime/nighttime (between() includes both endpoints):
daytime = i94[hours.between(7,19)]
nighttime = i94[(hours <7)|(hours >19)]
## Check data:
daytime.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
nighttime.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
---|---|---|---|---|---|---|---|---|---|
11 | None | 289.38 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 20:00:00 | 2784 |
12 | None | 288.61 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 21:00:00 | 2361 |
13 | None | 287.16 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 22:00:00 | 1529 |
14 | None | 285.45 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-02 23:00:00 | 963 |
15 | None | 284.63 | 0.0 | 0.0 | 1 | Clear | sky is clear | 2012-10-03 00:00:00 | 506 |
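Because between() includes both endpoints, it is worth confirming that the two masks split the hours cleanly, with no hour counted twice or left out. The following is a minimal sketch on a hypothetical frame named toy that holds one made-up reading for each hour of a single day.

```python
import pandas as pd

# Hypothetical data: one made-up reading for every hour of a single day.
toy = pd.DataFrame({
    "date_time": pd.to_datetime([f"2012-10-02 {h:02d}:00:00" for h in range(24)]),
    "traffic_volume": range(24),
})

hours = toy["date_time"].dt.hour

# between(7, 19) is inclusive on both ends: hours 7, 8, ..., 19.
is_day = hours.between(7, 19)
# Everything else: hours 0..6 and 20..23.
is_night = (hours < 7) | (hours > 19)

day_hours = hours[is_day].tolist()
night_hours = hours[is_night].tolist()

# A clean partition has no overlap and no gap.
no_overlap = not (is_day & is_night).any()
no_gap = bool((is_day | is_night).all())

print(day_hours)
print(night_hours)
print(no_overlap, no_gap)
```

With these boundaries the daytime part covers 13 hourly slots and the nighttime part covers 11, so the two parts are close to, but not exactly, equal in length.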
We've split the dataset into two parts: one contains only the daytime records (hourly timestamps from 07:00 through 19:00), and the other contains only the nighttime records. Now we're going to compare the traffic volume in the daytime with the nighttime:
## Drawing histogram in grid graph
plt.figure(figsize=[30,10]) ## Set the size of the figure
## Drawing the first histogram graph: daytime
plt.subplot(1,2,1)
plt.hist(daytime['traffic_volume'])
plt.title('Traffic Volume During Daytime (times)')
plt.xlabel('Volume (times)')
plt.ylabel('Frequency')
## Drawing the second graph at the right
plt.subplot(1,2,2)
plt.hist(nighttime['traffic_volume'])
plt.title('Traffic Volume During Night Time (times)')
plt.xlabel('Volume (times)')
plt.ylabel('Frequency')
plt.show()
## Check the distribution of the two datasets:
daytime['traffic_volume'].describe()
count    25838.000000
mean      4649.292360
std       1202.321987
min          0.000000
25%       4021.000000
50%       4736.000000
75%       5458.000000
max       7280.000000
Name: traffic_volume, dtype: float64
Combining the histogram (the left graph) with the summary statistics, we get:
Sub-conclusion 1: During the daytime, traffic volume is high: the median is 4736 and 75% of the hourly counts are at least 4021. We will check the nighttime data to see whether the time of day really is an indicator.
nighttime['traffic_volume'].describe()
count    22366.000000
mean      1654.648484
std       1425.175292
min          0.000000
25%        486.000000
50%       1056.500000
75%       2630.750000
max       6386.000000
Name: traffic_volume, dtype: float64
Similarly, combining the histogram (the right graph) with the summary statistics gives us:
Sub-conclusion 2: At night, traffic volume is much lower than during the day: the median is about 1056 and 75% of the hourly counts are below 2631.
CONCLUSION: The time of day can be considered an indicator of heavy traffic, specifically the daytime. Nighttime traffic is much lighter, so it tells us little about heavy traffic.
We've concluded that the daytime is associated with heavy traffic on I-94. Other time factors may also play a role in traffic volume, for example the month or the day of the week. Now we're going to explore them.
Starting with the month, we examine how the mean traffic volume relates to the month of the year, using the code below:
## Isolating the month, create a new column:
i94['month'] = i94['date_time'].dt.month
## Group the data by the month column and take the mean of the numeric columns:
by_month = i94.groupby('month').mean(numeric_only=True)
by_month
month | temp | rain_1h | snow_1h | clouds_all | traffic_volume |
---|---|---|---|---|---|
1 | 264.907683 | 0.012154 | 0.000716 | 55.393909 | 3051.081378 |
2 | 265.577242 | 0.002204 | 0.000000 | 49.326716 | 3197.945547 |
3 | 272.764798 | 0.012138 | 0.000000 | 54.839705 | 3308.388611 |
4 | 278.722985 | 0.084137 | 0.000000 | 57.634656 | 3304.372388 |
5 | 288.172498 | 0.123616 | 0.000000 | 51.648557 | 3366.319432 |
6 | 293.351019 | 0.257601 | 0.000000 | 42.975610 | 3419.077413 |
7 | 295.276199 | 2.348936 | 0.000000 | 36.036288 | 3205.481752 |
8 | 293.615149 | 0.311969 | 0.000000 | 37.697807 | 3394.241891 |
9 | 291.190540 | 0.286053 | 0.000000 | 40.749152 | 3303.049334 |
10 | 282.966573 | 0.051330 | 0.000000 | 50.069105 | 3390.678376 |
11 | 275.795938 | 0.003049 | 0.000000 | 53.813619 | 3167.592784 |
12 | 267.097371 | 0.051226 | 0.001847 | 64.189456 | 3024.257943 |
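A grouped mean only makes sense for numeric columns, and recent pandas versions (2.0 and later, as far as I know) raise an error instead of silently dropping text columns such as weather_main. Two common ways to restrict the aggregation are sketched below on a hypothetical frame named toy with made-up values; the names by_month_all and by_month_volume are illustrative only.

```python
import pandas as pd

# Hypothetical rows with made-up values: one text column and one numeric column.
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2013-01-05 08:00", "2013-01-20 08:00",
        "2013-06-03 08:00", "2013-06-17 08:00",
    ]),
    "weather_main": ["Snow", "Clouds", "Clear", "Rain"],
    "traffic_volume": [3000, 3100, 3400, 3500],
})
toy["month"] = toy["date_time"].dt.month

# Option A: keep the whole frame but aggregate numeric columns only.
by_month_all = toy.groupby("month").mean(numeric_only=True)

# Option B: select the single column of interest before aggregating.
by_month_volume = toy.groupby("month")["traffic_volume"].mean()

print(by_month_all)
print(by_month_volume)
```

Option B is the lighter choice when only traffic_volume is plotted afterwards; Option A keeps the other numeric columns available for inspection.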
## Visualize the relation between month and volume with a line plot:
plt.plot(by_month['traffic_volume'])
plt.title('Traffic Volume by Month (times)')
plt.ylabel('Volume (times)')
plt.xlabel('Month')
plt.show()
At a quick glance, the monthly means are higher from March through October (about 3300 to 3420), with the maximum in June and a noticeable dip in July, and lower from November through February (about 3020 to 3200). One possible explanation is that people are more active in the warmer months, which increases the demand for travel. In other words, the month of the year seems to matter, although the differences between months are modest.
Next, we will examine how the day of the week affects the traffic volume on I-94, in the same way as we did for the month.
## Create new column: day of week
i94['dayofweek']=i94['date_time'].dt.dayofweek #0 is Monday, 6 is Sunday
## Create a new dataframe with the mean of the numeric columns for each day of the week:
by_dayofweek = i94.groupby('dayofweek').mean(numeric_only=True)
by_dayofweek['traffic_volume']
dayofweek
0    3309.387161
1    3488.555799
2    3583.196681
3    3637.899663
4    3656.358836
5    2773.638120
6    2368.588329
Name: traffic_volume, dtype: float64
## Visualize the relation between day of week and traffic volume:
plt.plot(by_dayofweek['traffic_volume'])
plt.title('Traffic Volume by Day of Week (times)')
plt.xlabel('Day')
plt.ylabel('Volume (times)')
plt.show()
As the graph shows, the mean traffic volume rises steadily from Monday (0 on the x-axis) to Friday (4), where it peaks at about 3656. It then drops sharply to about 2774 on Saturday (5) and reaches its minimum of about 2369 on Sunday (6). Friday is therefore the turning point: it is the busiest day of the week, and it is followed by a significant decrease over the weekend.
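The integer codes on the x-axis are easier to read once they are mapped to day names. The sketch below reuses the seven means printed in the output above; the DAY_NAMES list and the variable names are hypothetical helpers, not part of the original notebook.

```python
import pandas as pd

# Hypothetical lookup list; position 0 is Monday, matching dt.dayofweek.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Daily means copied from the by_dayofweek output (0 = Monday, 6 = Sunday).
daily_mean = pd.Series(
    [3309.387161, 3488.555799, 3583.196681, 3637.899663,
     3656.358836, 2773.638120, 2368.588329],
    index=range(7),
)

# Replace the integer codes with day names so the result reads without a lookup.
labelled = daily_mean.rename(index=dict(enumerate(DAY_NAMES)))

busiest_day = labelled.idxmax()
quietest_day = labelled.idxmin()
friday_to_saturday_drop = labelled["Friday"] - labelled["Saturday"]

print(busiest_day, quietest_day)
print(round(friday_to_saturday_drop, 1))
```

Plotting labelled instead of the integer-indexed series would put the day names directly on the x-axis.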
Averaging all seven days together lets the weekend drag down the business-day figures, so let's take a closer look at two separate datasets, one for business days and one for weekend days, to see the difference in traffic volume more clearly.
## Split the data into business days and weekend days:
business_days = i94[i94['dayofweek']<=4].copy()
weekend_days = i94[i94['dayofweek']>4].copy()
## Group by day of week and retrieve the mean values of the two datasets
by_businessdays = business_days.groupby('dayofweek').mean(numeric_only=True)
by_weekenddays = weekend_days.groupby('dayofweek').mean(numeric_only=True)
## Check some data:
by_businessdays['traffic_volume'].describe()
count       5.000000
mean     3535.079628
std       142.036450
min      3309.387161
25%      3488.555799
50%      3583.196681
75%      3637.899663
max      3656.358836
Name: traffic_volume, dtype: float64
by_weekenddays['traffic_volume'].describe()
count       2.000000
mean     2571.113225
std       286.413454
min      2368.588329
25%      2469.850777
50%      2571.113225
75%      2672.375673
max      2773.638120
Name: traffic_volume, dtype: float64
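Note that describe() here summarizes the per-day means, so each day counts equally regardless of how many hourly records it has. An alternative is to label every row as business day or weekend and average the rows directly. The sketch below contrasts the two on a hypothetical frame named toy with made-up volumes; the day_type column is an illustrative name, not something the dataset provides.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with made-up volumes.
# 2012-10-05 is a Friday, 2012-10-06 a Saturday, 2012-10-07 a Sunday.
toy = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2012-10-05 08:00", "2012-10-05 09:00",
        "2012-10-06 08:00",
        "2012-10-07 08:00", "2012-10-07 09:00",
    ]),
    "traffic_volume": [5000, 4000, 3000, 2000, 1000],
})
toy["dayofweek"] = toy["date_time"].dt.dayofweek  # 0 = Monday, 6 = Sunday
toy["day_type"] = np.where(toy["dayofweek"] >= 5, "weekend", "business day")

# Row-level mean: every hourly record has the same weight.
by_type = toy.groupby("day_type")["traffic_volume"].mean()

# Mean of per-day means: every day has the same weight.
weekend_rows = toy[toy["day_type"] == "weekend"]
mean_of_day_means = weekend_rows.groupby("dayofweek")["traffic_volume"].mean().mean()

print(by_type)
print(mean_of_day_means)
```

In this toy example the weekend figure is 2000 when rows are averaged directly and 2250 when the Saturday and Sunday means are averaged, because Sunday has more rows than Saturday. Which one to report depends on whether days or hourly records are the unit of interest.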
Now let's plot these two datasets and compare them.
## Use grid plot to compare
plt.figure(figsize=[20,10])
plt.subplot(1,2,1)
plt.plot(by_businessdays['traffic_volume'])
plt.title('Traffic Volume in Business Days (times)')
plt.xlabel('Day')
plt.ylabel('Volume (times)')
## The second subplot of the grid
plt.subplot(1,2,2)
plt.plot(by_weekenddays['traffic_volume'])
plt.ylim([0,3700])
plt.title('Traffic Volume in Weekend Days (times)')
plt.xlabel('Day')
plt.ylabel('Volume (times)')
plt.show()
Combining the summary statistics with the line graphs, we can see a significant difference in traffic volume between business days and weekend days. On business days the volume is lowest on Monday and rises steadily to its peak on Friday (4 on the x-axis), the end of the business week.
In the weekend graph the minimum is reached on Sunday. The slope from Saturday to Sunday looks gentle because the sharp drop has already happened between Friday and Saturday (5 on the x-axis), where the mean falls below 3000. Note that the weekend line connects only two points, so it is necessarily straight. In summary, traffic volume increases through the business week and peaks on Friday.
CONCLUSION 2: Traffic volume varies somewhat by month, and it is higher on business days, which is when traffic is heavy.
Having looked at time, we now explore the data further to find out what else can affect traffic. Let's do this by examining the correlation coefficients between traffic volume and the other columns.
## Compute the correlation matrix of the numeric columns
i94.corr(numeric_only=True)
| | temp | rain_1h | snow_1h | clouds_all | traffic_volume | month | dayofweek |
---|---|---|---|---|---|---|---|
temp | 1.000000 | 0.009069 | -0.019755 | -0.101976 | 0.130299 | 0.223738 | -0.007708 |
rain_1h | 0.009069 | 1.000000 | -0.000090 | 0.004818 | 0.004714 | 0.001298 | -0.006920 |
snow_1h | -0.019755 | -0.000090 | 1.000000 | 0.027931 | 0.000733 | 0.020412 | -0.014928 |
clouds_all | -0.101976 | 0.004818 | 0.027931 | 1.000000 | 0.067054 | -0.009133 | -0.039715 |
traffic_volume | 0.130299 | 0.004714 | 0.000733 | 0.067054 | 1.000000 | -0.002533 | -0.149544 |
month | 0.223738 | 0.001298 | 0.020412 | -0.009133 | -0.002533 | 1.000000 | 0.010741 |
dayofweek | -0.007708 | -0.006920 | -0.014928 | -0.039715 | -0.149544 | 0.010741 | 1.000000 |
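For reference, each entry in this table is a Pearson correlation coefficient: the covariance of two columns divided by the product of their standard deviations. The sketch below reproduces that calculation by hand on two short hypothetical series (made-up numbers) and compares it with the pandas result; pearson_r is an illustrative helper, not a pandas function.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations (made-up numbers, for illustration only).
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])


def pearson_r(a: pd.Series, b: pd.Series) -> float:
    """Pearson r: sum of centered products over the root of the summed squares."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)


r_manual = pearson_r(x, y)
r_pandas = x.corr(y)  # pandas uses the Pearson method by default

print(r_manual, r_pandas)
```

Squaring r gives the share of variance that a straight-line fit would explain. For the temp entry above, 0.13 squared is about 0.017, i.e. under 2%, which is a useful reminder of how weak such a relationship is.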
Recall that the correlation coefficient r ranges from -1 to 1: the relationship is perfectly linear when r = 1 or r = -1, and the closer |r| is to 1, the stronger the linear relationship between two factors. In the correlation table above, r(traffic_volume, temp) = 0.13 and r(traffic_volume, dayofweek) = -0.149. These are the largest values in the traffic_volume row, but both are weak, so temp and dayofweek have at most a slight linear relationship with traffic volume. Let's look at the temp factor first.
## Convert the temp from kelvin to Celsius
i94['temp'] = i94['temp'] - 273.15
## Graphing the relation of traffic volume with temp:
plt.scatter(i94['traffic_volume'],i94['temp'])
plt.title('Traffic Volume by Ambient Temperature (times)')
plt.xlabel('Volume (times)')
plt.ylabel('Temperature (Celsius)')
plt.ylim([-40,40])
plt.show()
The scatter plot does not reveal any clear pattern, which is consistent with the weak correlation (about 0.13, similar in magnitude to that of dayofweek). We will try the next column: clouds_all.
## Graphing the relation of traffic volume with cloud cover percentage:
plt.scatter(i94['traffic_volume'],i94['clouds_all'])
plt.title('Traffic Volume by Cloud Cover Percentage (times)')
plt.xlabel('Volume (times)')
plt.ylabel('Cloud Cover (%)')
plt.show()
Since the scatter plot of traffic volume against cloud cover percentage does not show anything either, we move on to the next factors: weather_main and weather_description. We're going to calculate the average traffic volume associated with each unique value in these two columns:
## Calculate the mean traffic volume for each weather condition:
by_weathermain = i94.groupby('weather_main').mean(numeric_only=True)
by_weatherdesc = i94.groupby('weather_description').mean(numeric_only=True)
## Check data:
by_weathermain['traffic_volume']
weather_main
Clear           3055.908819
Clouds          3618.449749
Drizzle         3290.727073
Fog             2703.720395
Haze            3502.101471
Mist            2932.956639
Rain            3317.905501
Smoke           3237.650000
Snow            3016.844228
Squall          2061.750000
Thunderstorm    3001.620890
Name: traffic_volume, dtype: float64
by_weathermain['traffic_volume'].describe()
count      11.000000
mean     3067.239524
std       424.607888
min      2061.750000
25%      2967.288764
50%      3055.908819
75%      3304.316287
max      3618.449749
Name: traffic_volume, dtype: float64
by_weatherdesc['traffic_volume'].head()
weather_description
SQUALLS          2061.750000
Sky is Clear     3423.148899
broken clouds    3661.142092
drizzle          3094.858679
few clouds       3691.453476
Name: traffic_volume, dtype: float64
by_weatherdesc['traffic_volume'].describe()
count      38.000000
mean     3350.650074
std       708.136232
min      2061.750000
25%      2866.053160
50%      3220.126983
75%      3632.773235
max      5664.000000
Name: traffic_volume, dtype: float64
Now we'll plot the relationship between traffic volume and weather_main / weather_description. This time we'll use horizontal bar plots:
## Drawing plot of weather main and traffic volume
by_weathermain['traffic_volume'].plot.barh()
plt.title('Traffic Volume by Weather Status')
plt.xlabel('Volume (times)')
plt.ylabel('Weather Status')
plt.show()
As the graph shows, the mean traffic volume is highest for Clouds (about 3618), followed by Haze (about 3502), while Clear is lower at about 3056. One possible reading is that people drive more in cool, overcast weather than in clear weather, but see the comparison with weather_description below.
Rain, Smoke and Drizzle also have mean volumes above 3000. Squall has the lowest mean (about 2062), which is plausible because few people want to drive in stormy conditions. In other words, traffic tends to be heavier when the weather is reasonably dry, cool and not too severe, although the differences between most categories are modest.
Finally, we examine the relationship between traffic volume and the specific weather descriptions.
## Drawing plot of specific weather description and traffic volume
by_weatherdesc['traffic_volume'].plot.barh(figsize=[30,60])
plt.title('Traffic Volume by Specific Weather Situation')
plt.xlabel('Volume (times)')
plt.ylabel('Weather Situation')
plt.show()
There's a curious point here: the mean traffic volume is highest (over 5000) for the description shower snow. We can also see that 'sky is clear' and 'Sky is Clear' describe the same condition but are counted separately, because the grouping is case-sensitive; this condition appears to come second, after shower snow. The minimum is, again, SQUALLS.
The weather_main column is a shortened summary of the conditions that weather_description records in detail, so we treat the weather_main graph above as a reference only. The reason is that the two comparisons disagree in places: Clear ranks relatively low by weather_main, while the clear-sky descriptions rank high by weather_description.
Because weather_description records the weather at the time of each reading in more detail, we take it as the better guide and conclude that traffic volume is high in clear conditions. As for shower snow, we accept the result provisionally and wait for more data to check whether it holds.
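Both issues raised here, the duplicated clear-sky label and the surprising shower snow mean, can be examined with a small amount of cleanup. The sketch below runs on a hypothetical frame named toy with made-up volumes. It normalizes capitalization so identical labels share one group, and it reports the number of rows behind each mean. Whether shower snow really rests on very few rows is not shown in the outputs above, so the count column is the thing to check on the real data.

```python
import pandas as pd

# Hypothetical rows (made-up volumes): the same label in two capitalizations
# and one label that occurs only once.
toy = pd.DataFrame({
    "weather_description": [
        "sky is clear", "Sky is Clear", "sky is clear",
        "shower snow", "broken clouds", "broken clouds",
    ],
    "traffic_volume": [3000, 3400, 3200, 5664, 3600, 3700],
})

# Normalize capitalization and surrounding whitespace before grouping.
toy["weather_clean"] = toy["weather_description"].str.strip().str.lower()

# Report the sample size next to each mean; a mean over very few rows is unreliable.
summary = toy.groupby("weather_clean")["traffic_volume"].agg(["mean", "count"])

print(summary.sort_values("mean", ascending=False))
```

A category with a high mean but a very small count should be read with caution, and could be filtered out (for example with a minimum-count threshold of one's choosing) before ranking the categories.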