This project looks at a dataset of westbound traffic on the I-94 Interstate highway. The goal is to identify the indicators most associated with heavy traffic, exploring them along two lines: time and weather.
Load the .csv file into a pandas dataframe.
import pandas as pd
metro = pd.read_csv("Metro_Interstate_Traffic_Volume.csv")
Examine the first five rows.
metro.head()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 0 | None | 288.28 | 0.0 | 0.0 | 40 | Clouds | scattered clouds | 2012-10-02 09:00:00 | 5545 |
| 1 | None | 289.36 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 10:00:00 | 4516 |
| 2 | None | 289.58 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 11:00:00 | 4767 |
| 3 | None | 290.13 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2012-10-02 12:00:00 | 5026 |
| 4 | None | 291.14 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2012-10-02 13:00:00 | 4918 |
Examine the last five rows.
metro.tail()
| | holiday | temp | rain_1h | snow_1h | clouds_all | weather_main | weather_description | date_time | traffic_volume |
|---|---|---|---|---|---|---|---|---|---|
| 48199 | None | 283.45 | 0.0 | 0.0 | 75 | Clouds | broken clouds | 2018-09-30 19:00:00 | 3543 |
| 48200 | None | 282.76 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 20:00:00 | 2781 |
| 48201 | None | 282.73 | 0.0 | 0.0 | 90 | Thunderstorm | proximity thunderstorm | 2018-09-30 21:00:00 | 2159 |
| 48202 | None | 282.09 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 22:00:00 | 1450 |
| 48203 | None | 282.12 | 0.0 | 0.0 | 90 | Clouds | overcast clouds | 2018-09-30 23:00:00 | 954 |
Get information about dataset and columns.
metro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              48204 non-null  object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB
The dataset records seven possible factors that could influence the state of traffic at a given time: whether it is a holiday, the temperature, rain, snow, clouds, weather, and time. The other columns are a description of the weather and traffic level.
Import the matplotlib library to make graphs.
import matplotlib.pyplot as plt
%matplotlib inline
Examine distribution of traffic_volume column using a histogram.
metro["traffic_volume"].plot.hist()
plt.title("Traffic Volume")
plt.xlabel("Number of Cars")
plt.show()
The distribution of the Traffic Volume column is bimodal and slightly skewed to the right. The center of the data is around 3500.
Examine descriptive statistics for the traffic_volume column.
metro["traffic_volume"].describe()
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64
Hourly traffic on I-94 ranges from 0 to 7,280 cars. Three quarters of the hourly counts are at or below 4,933 cars.
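The share of hours below a given volume can also be computed directly instead of being read off the quartiles. This is a minimal sketch on a small made-up series standing in for metro["traffic_volume"]; the helper name share_below is hypothetical.

```python
import pandas as pd


def share_below(volumes: pd.Series, threshold: float) -> float:
    """Return the fraction of observations strictly below threshold."""
    # A boolean Series averages to the proportion of True values
    return float((volumes < threshold).mean())


# Made-up hourly counts, not taken from the dataset
sample = pd.Series([300, 1200, 3400, 4900, 5600, 6100, 2800, 4500])
print(share_below(sample, 5000))
```

On the real data the call would be share_below(metro["traffic_volume"], 5000).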
Next, determine whether daytime or nighttime affects traffic volume. Daytime is taken as the hours 7 a.m. through 7 p.m. (hour values 7 to 19, both endpoints included by between()), and nighttime as the remaining hours.
metro["date_time"] = pd.to_datetime(metro["date_time"])
hour = metro["date_time"].dt.hour
d_metro = metro[hour.between(7, 19)].copy()
n_metro = metro[~hour.between(7, 19)].copy()
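Series.between keeps both endpoints by default, so a filter such as hour.between(7, 19) also keeps the 19:00 rows. The sketch below illustrates this on made-up timestamps (one per hour of a single day) and shows the inclusive="left" variant, available in recent pandas versions, that would stop before 19:00. The variable names are illustrative only.

```python
import pandas as pd

# Made-up timestamps: one row per hour of a single day
times = pd.Series(
    pd.to_datetime([f"2012-10-02 {h:02d}:00:00" for h in range(24)])
)
hour = times.dt.hour

# Default behaviour: hours 7 through 19, both endpoints kept
day_inclusive = hour.between(7, 19)

# Half-open alternative: hours 7 through 18, 19 excluded
day_half_open = hour.between(7, 19, inclusive="left")

print(day_inclusive.sum(), day_half_open.sum())
```

Which variant is appropriate depends on whether the 7 p.m. hour should count as daytime; the statistics in this project use the default, inclusive form.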
Plot the histograms of traffic volume for daytime and nighttime side by side.
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.hist(d_metro["traffic_volume"])
plt.title("Daytime Traffic Volume")
plt.xlabel("Number of Cars")
plt.ylabel("Frequency")
plt.xlim([0,7500])
plt.ylim([0,8000])
plt.subplot(1,2,2)
plt.hist(n_metro["traffic_volume"])
plt.title("Nighttime Traffic Volume")
plt.xlabel("Number of Cars")
plt.ylabel("Frequency")
plt.xlim([0,7500])
plt.ylim([0,8000])
plt.show()
The daytime traffic volume is unimodal and mostly symmetric, with a slight skew to the left. However, the nighttime traffic volume is highly skewed to the right. The center of values in the daytime traffic volume is much higher than the center of values in the nighttime traffic volume.
Summary statistics for daytime and nighttime traffic volumes.
print(d_metro["traffic_volume"].describe(),"\n")
print(n_metro["traffic_volume"].describe())
count    25838.000000
mean      4649.292360
std       1202.321987
min          0.000000
25%       4021.000000
50%       4736.000000
75%       5458.000000
max       7280.000000
Name: traffic_volume, dtype: float64

count    22366.000000
mean      1654.648484
std       1425.175292
min          0.000000
25%        486.000000
50%       1056.500000
75%       2630.750000
max       6386.000000
Name: traffic_volume, dtype: float64
The average traffic volume for daytime is 4649.29 cars per hour, while for nighttime it is 1654.65 cars per hour. Since the goal is to identify indicators of heavy traffic, the rest of the analysis focuses on daytime data.
Get average traffic volume per month using daytime data.
pd.options.mode.chained_assignment = None
d_metro["month"] = d_metro.loc[:,"date_time"].dt.month
df_month = d_metro.groupby("month").mean(numeric_only=True)
month_traf = df_month["traffic_volume"]
print(month_traf)
month
1     4385.217310
2     4593.187798
3     4761.529676
4     4771.232816
5     4788.966639
6     4791.087488
7     4502.628360
8     4818.434690
9     4755.709916
10    4809.481678
11    4588.910486
12    4276.567081
Name: traffic_volume, dtype: float64
Line graph of average traffic volume per month using daytime data.
plt.plot(month_traf)
plt.xticks(ticks = [1,2,3,4,5,6,7,8,9,10,11,12], labels = ["Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"], rotation = 30)
plt.title("Average Traffic Volume by Month")
plt.xlabel("Month")
plt.ylabel("Traffic Volume")
plt.show()
Traffic tends to be higher in the spring, summer, and fall months, and lower in the winter months. However, July is an exception to this trend.
Line plot of average traffic volume by day of the week.
d_metro["day"] = d_metro["date_time"].dt.dayofweek
day_metro = d_metro.groupby("day").mean(numeric_only=True)
day_traffic = day_metro["traffic_volume"]
plt.plot(day_traffic)
plt.title("Traffic Volume by Day")
plt.xticks(ticks = [0,1,2,3,4,5,6], labels = ["Mon","Tue","Wed","Thur","Fri","Sat","Sun"])
plt.xlabel("Day of Week")
plt.ylabel("Traffic Volume")
plt.show()
Traffic tends to be much higher on business days than on weekends. The most likely explanation is that people commute to work on business days and stay home on weekends.
d_metro["hour"] = d_metro["date_time"].dt.hour
business_day = d_metro[d_metro["day"] <= 4]
weekend = d_metro[d_metro["day"] >= 5]
business_df = business_day.groupby("hour").mean(numeric_only=True)
business_traf = business_df["traffic_volume"]
weekend_df = weekend.groupby("hour").mean(numeric_only=True)
weekend_traf = weekend_df["traffic_volume"]
plt.figure(figsize = (12,6))
plt.subplot(1,2,1)
plt.plot(business_traf)
plt.title("Business Day Traffic Volume")
plt.xlabel("Time of Day")
plt.ylabel("Traffic Volume")
plt.ylim([1500,6500])
plt.subplot(1,2,2)
plt.plot(weekend_traf)
plt.title("Weekend Traffic Volume")
plt.xlabel("Time of Day")
plt.ylabel("Traffic Volume")
plt.ylim([1500,6500])
plt.show()
Business day traffic volume is higher for all hours of the day than weekend traffic volume. On business days, the rush hours are 7 a.m. and 4 p.m.
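The rush hours can be read off the hourly averages programmatically rather than by eye. This is a minimal sketch on made-up hourly averages, not the values plotted above; splitting the day at noon to separate a morning peak from an afternoon peak is an assumption made for illustration.

```python
import pandas as pd

# Made-up business-day averages, indexed by hour of day
hourly_avg = pd.Series(
    [4100, 4800, 4500, 4400, 4600, 5200, 6100, 5700, 4300],
    index=[6, 7, 8, 9, 12, 15, 16, 17, 19],
)

# Assumption: hours before 12 count as morning, the rest as afternoon
morning = hourly_avg[hourly_avg.index < 12]
afternoon = hourly_avg[hourly_avg.index >= 12]

# idxmax returns the index label (the hour) of the largest average
morning_peak = morning.idxmax()
afternoon_peak = afternoon.idxmax()
print(morning_peak, afternoon_peak)
```

With the project's data, business_traf would take the place of hourly_avg.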
The next step is to analyze traffic volume with respect to weather, starting with the correlation between traffic volume and each numeric weather column.
d_metro.corr(numeric_only=True)["traffic_volume"]["temp":"clouds_all"]
temp          0.118084
rain_1h       0.004020
snow_1h       0.003768
clouds_all   -0.033410
Name: traffic_volume, dtype: float64
It appears that the higher the temperature, the heavier the traffic. However, this positive correlation is very weak (r = 0.118). A scatterplot helps check it.
plt.scatter(d_metro["temp"], d_metro["traffic_volume"])
plt.xlabel("Temperature (Kelvins)")
plt.ylabel("Traffic Volume")
plt.title("Temperature vs. Traffic Volume")
plt.show()
No numeric weather column is a reliable indicator of heavy traffic. The scatterplot and the correlation of temperature vs. traffic volume are heavily influenced by two outlier values.
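To see how much a pair of outliers can move a correlation, it helps to compute r with and without them. The sketch below uses made-up readings in which two rows report 0 K; the 200 K cutoff for "plausible" temperatures is an assumption for illustration, not a value taken from this project.

```python
import pandas as pd

# Made-up readings: two rows report 0 K, which is physically implausible
df = pd.DataFrame({
    "temp": [0.0, 0.0, 270.0, 280.0, 290.0, 300.0],
    "traffic_volume": [3000, 3100, 4500, 4700, 4700, 4500],
})

# Pearson correlation using every row, outliers included
r_all = df["temp"].corr(df["traffic_volume"])

# Assumption: readings at or below 200 K are errors, not real weather
plausible = df[df["temp"] > 200]
r_plausible = plausible["temp"].corr(plausible["traffic_volume"])

print(round(r_all, 3), round(r_plausible, 3))
```

In this toy data the two 0 K rows create a strong apparent correlation that disappears once they are filtered out. Whether the same happens with the real dataset would need to be checked on d_metro itself.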
The next step is to look at the categorical columns weather_main and weather_description to identify traffic volume patterns.
wthr_main = d_metro.groupby("weather_main").mean(numeric_only=True)
wthr_desc = d_metro.groupby("weather_description").mean(numeric_only=True)
plt.barh(wthr_main.index, wthr_main["traffic_volume"])
plt.xlabel("Traffic Volume")
plt.ylabel("Weather Main Condition")
plt.title("Weather Main vs Traffic Volume")
plt.show()
plt.figure(figsize=(10,12))
plt.barh(wthr_desc.index, wthr_desc["traffic_volume"])
plt.xlabel("Traffic Volume")
plt.ylabel("Weather Description")
plt.title("Weather Description vs Traffic Volume")
plt.show()
The bar plots above suggest that some more severe types of weather, such as snow and thunderstorms, are associated with heavier traffic.
print(wthr_desc[wthr_desc["traffic_volume"] > 5000].index.to_list())
['light rain and snow', 'proximity thunderstorm with drizzle', 'shower snow']
Snow showers, light rain and snow, and proximity thunderstorms with drizzle all have average traffic volumes of over 5,000 cars per hour.
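A group mean says nothing about how many hours it was computed from, so it can be worth reporting the count next to each average before drawing conclusions. This is a minimal sketch on made-up observations; the min_count threshold is a hypothetical choice.

```python
import pandas as pd

# Made-up observations, not taken from the dataset
obs = pd.DataFrame({
    "weather_description": [
        "shower snow",
        "overcast clouds", "overcast clouds", "overcast clouds",
        "light rain", "light rain",
    ],
    "traffic_volume": [5600, 4700, 4900, 4500, 4600, 4800],
})

# Mean and number of observations per weather description
summary = obs.groupby("weather_description")["traffic_volume"].agg(["mean", "count"])

# Hypothetical threshold: ignore categories with fewer rows than this
min_count = 2
reliable = summary[summary["count"] >= min_count]
print(reliable)
```

Applied to d_metro, the same agg call would show how many daytime hours stand behind each weather description.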
Graph of how various factors combined affect traffic volume.
import seaborn as sns
sns.set_theme()
d_metro["day_type"] = d_metro["date_time"].dt.dayofweek.map(lambda d: "Business Day" if d <= 4 else "Weekend")
season_by_month = {12: "Winter", 1: "Winter", 2: "Winter", 3: "Spring", 4: "Spring", 5: "Spring",
                   6: "Summer", 7: "Summer", 8: "Summer", 9: "Fall", 10: "Fall", 11: "Fall"}
d_metro["season"] = d_metro["month"].map(season_by_month)
sns.relplot(data = d_metro, x = "hour", y = "traffic_volume",
hue = "season", palette = "RdYlGn", style = "day_type")
plt.title("Factors Affecting Traffic Volume")
plt.show()
The business day values tend to be closer to the top of the graph, while the weekend values are at the bottom. Because of the range of values in the hour column, it is difficult to tell how much the hour affects the traffic, although a general trend is visible. Most of the winter values are at the bottom of the graph, which is expected.
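When individual points overlap this much, averaging first can make the hourly pattern easier to read. The sketch below builds a table of mean traffic volume by hour and day type from made-up rows; the column names mirror the ones used in this project, and the values are invented.

```python
import pandas as pd

# Made-up rows with the same column names as d_metro
df = pd.DataFrame({
    "hour": [7, 7, 7, 7, 16, 16, 16, 16],
    "day_type": ["Business Day", "Business Day", "Weekend", "Weekend"] * 2,
    "traffic_volume": [6000, 6200, 1500, 1700, 6100, 6300, 4300, 4500],
})

# One row per hour, one column per day type, cells hold the mean volume
hourly = df.pivot_table(
    index="hour", columns="day_type", values="traffic_volume", aggfunc="mean"
)
print(hourly)
```

Calling hourly.plot() on such a table draws one line per day type, which avoids the overplotting of a raw scatter.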
Conclusion: with regard to time indicators, daytime traffic is higher during the spring, summer, and fall months than in the winter months, with July as an exception. Traffic is higher on business days than on weekends, and rush hours on business days are 7 a.m. and 4 p.m. The numeric weather columns are not reliable indicators, but some weather conditions are associated with heavier traffic: snow showers, light rain and snow, and proximity thunderstorms with drizzle all average more than 5,000 cars per hour.
Thank you for taking the time to look through my project. Any and all feedback is greatly appreciated.