Many U.S. cities have communal bike sharing stations where you can rent bicycles by the hour or day. Washington, D.C. is one of these cities. The District collects detailed data on the number of bicycles people rent by the hour and day.
Hadi Fanaee-T at the University of Porto compiled this data into a CSV file, which we'll work with in this project. The file contains 17379 rows, with each row representing the number of bike rentals for a single hour of a single day. The data can be downloaded from the University of California, Irvine's website.
Here are the descriptions for the relevant columns:

- instant - A unique sequential ID number for each row
- dteday - The date of the rentals
- season - The season in which the rentals occurred
- yr - The year the rentals occurred (0: 2011, 1: 2012)
- mnth - The month the rentals occurred
- hr - The hour the rentals occurred
- holiday - Whether or not the day was a holiday
- weekday - The day of the week
- workingday - Whether or not the day was a working day (neither a weekend nor a holiday)
- weathersit - The weather situation (categorical, from 1 = clear to 4 = heavy rain/snow)
- temp - The temperature, normalized
- atemp - The adjusted ("feels like") temperature, normalized
- hum - The humidity, normalized
- windspeed - The wind speed, normalized
- casual - The number of casual riders (no subscription)
- registered - The number of registered riders
- cnt - The total number of bike rentals (casual + registered)
In this project, we'll try to predict the total number of bikes people rented in a given hour. We'll predict the cnt column using all of the other columns, except for casual and registered. To accomplish this, we'll create a few different machine learning models and evaluate their performance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# Read in the dataframe
bike_rentals = pd.read_csv("hour.csv", parse_dates=["dteday"])
# Display basic info and the head of the dataframe
bike_rentals.info()
display(bike_rentals.head())
# Display a histogram of our variable of interest, "cnt"
plt.hist(bike_rentals["cnt"])
plt.xlabel("cnt")
plt.ylabel("Frequency")
plt.show()
# Display the correlation values of the rest of the columns with "cnt"
print('Correlation values of each column with "cnt" column:')
print(bike_rentals.corr(numeric_only=True)["cnt"])
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 17379 entries, 0 to 17378
    Data columns (total 17 columns):
     #   Column      Non-Null Count  Dtype
    ---  ------      --------------  -----
     0   instant     17379 non-null  int64
     1   dteday      17379 non-null  datetime64[ns]
     2   season      17379 non-null  int64
     3   yr          17379 non-null  int64
     4   mnth        17379 non-null  int64
     5   hr          17379 non-null  int64
     6   holiday     17379 non-null  int64
     7   weekday     17379 non-null  int64
     8   workingday  17379 non-null  int64
     9   weathersit  17379 non-null  int64
     10  temp        17379 non-null  float64
     11  atemp       17379 non-null  float64
     12  hum         17379 non-null  float64
     13  windspeed   17379 non-null  float64
     14  casual      17379 non-null  int64
     15  registered  17379 non-null  int64
     16  cnt         17379 non-null  int64
    dtypes: datetime64[ns](1), float64(4), int64(12)
    memory usage: 2.3 MB
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |
    Correlation values of each column with "cnt" column:
    instant       0.278379
    season        0.178056
    yr            0.250495
    mnth          0.120638
    hr            0.394071
    holiday      -0.030927
    weekday       0.026900
    workingday    0.030284
    weathersit   -0.142426
    temp          0.404772
    atemp         0.400929
    hum          -0.322911
    windspeed     0.093234
    casual        0.694564
    registered    0.972151
    cnt           1.000000
    Name: cnt, dtype: float64
Next, we'll create a feature that categorizes the hour of the day as morning, afternoon, evening, or night. This bundles similar times together, enabling the model to make better decisions.
# Create a function to categorize the time of day
def assign_label(hour):
    """
    Takes the hour of the day as input and returns an integer from 1 to 4
    depending on whether it's a morning, afternoon, evening, or night hour.

    Args:
        hour: Hour of the day, as an integer (0-23)

    Returns:
        int: 1 (morning), 2 (afternoon), 3 (evening), or 4 (night)
    """
    if 6 <= hour < 12:
        return 1
    elif 12 <= hour < 18:
        return 2
    elif 18 <= hour < 24:
        return 3
    else:  # 0 <= hour < 6
        return 4
# Apply the function to the dataframe
bike_rentals["time_label"] = bike_rentals["hr"].apply(assign_label)
# Display head of the dataframe
display(bike_rentals.head())
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | time_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 | 4 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 | 4 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 | 4 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 | 4 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 | 4 |
# Compute the number of rows that train dataframe will have
train_size = round(bike_rentals.shape[0] * .8)
# Select a random sample to be the train dataframe
train = bike_rentals.sample(n=train_size, random_state=0)
# Select the remaining rows to be the test dataframe
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
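The manual sample/filter split above can also be done in one call with scikit-learn's `train_test_split`. A minimal sketch on a synthetic frame (the toy column values below are illustrative, not the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the bike_rentals frame (synthetic values)
df = pd.DataFrame({"temp": [0.2, 0.4, 0.6, 0.8, 0.5],
                   "cnt": [16, 40, 90, 120, 60]})

# 80/20 split; random_state makes it reproducible, like sample(random_state=0)
train_df, test_df = train_test_split(df, train_size=0.8, random_state=0)
print(len(train_df), len(test_df))  # -> 4 1
```

Both approaches give a random 80/20 partition; `train_test_split` additionally supports stratification if we ever wanted balanced splits across a category.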
# List the features to train the model on
features = ["temp", "hum", "workingday", "yr", "time_label"]
# Instantiate a linear regression model
lr = LinearRegression()
# Fit the model using the train dataset
lr.fit(train[features], train["cnt"])
# Predict on the test dataset
predictions = lr.predict(test[features])
# Calculate root mean squared error and display it
rmse = mean_squared_error(test["cnt"], predictions) ** 0.5
print("RMSE value of the linear regression model:", round(rmse))
    RMSE value of the linear regression model: 141
Our linear regression model returned an RMSE value of 141, which gives us a baseline to improve on.
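One pitfall worth flagging here: in Python, `mse ** 1/2` parses as `(mse ** 1) / 2`, which halves the MSE instead of taking its square root. Using `np.sqrt` (or explicit parentheses, `** (1/2)`) makes the intent unambiguous. A small self-contained check:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                       # explicit square root
print(round(mse, 4), round(rmse, 4))      # -> 1.6667 1.291
```

Because `**` binds tighter than `/`, the parenthesized form and `np.sqrt` agree, while the unparenthesized form silently returns half the MSE.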
Next, we'll try a decision tree algorithm to see whether it improves prediction accuracy.
# Loop over different values of min_samples_leaf to search for the optimal one
rmse_values = []
parameter_list = [1, 5, 10, 15, 20, 25, 50, 100]
for parameter in parameter_list:
    # Instantiate a decision tree model
    dt = DecisionTreeRegressor(min_samples_leaf=parameter)
    # Fit the model using the train dataset
    dt.fit(train[features], train["cnt"])
    # Predict on the test dataset
    predictions = dt.predict(test[features])
    # Calculate root mean squared error and store it
    rmse = mean_squared_error(test["cnt"], predictions) ** 0.5
    rmse_values.append(rmse)
# Plot the results of the different models
plt.plot(parameter_list, rmse_values)
plt.xlabel("min_samples_leaf value")
plt.ylabel("RMSE value")
plt.show()
# Find the best parameter and display it
best_par = parameter_list[rmse_values.index(min(rmse_values))]
print("The best model has a min_samples_leaf parameter of", best_par, "and an RMSE value of", round(min(rmse_values)))
    The best model has a min_samples_leaf parameter of 25 and an RMSE value of 119
Our decision tree lowered the RMSE to 119, a clear improvement over the linear regression baseline.
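The manual loop above works, but scoring each candidate against the same held-out test set risks tuning to it. As a sketch of an alternative (shown on synthetic data, not the bike rental frame), scikit-learn's `GridSearchCV` cross-validates each `min_samples_leaf` candidate using only the training data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for train[features] / train["cnt"]
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X[:, 0] * 100 + rng.normal(scale=5.0, size=200)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 10, 25, 50]},
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

With this setup the test set stays untouched until the final evaluation, so the reported RMSE is a fairer estimate of generalization error.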
Finally, we'll try a random forest approach using the same min_samples_leaf value to see whether we can improve further.
# Instantiate a random forest model
rf = RandomForestRegressor(min_samples_leaf=25)
# Fit the model using the train dataset
rf.fit(train[features], train["cnt"])
# Predict on the test dataset
predictions = rf.predict(test[features])
# Calculate root mean squared error and display it
rmse = mean_squared_error(test["cnt"], predictions) ** 0.5
print("RMSE value of the random forest model:", round(rmse))
    RMSE value of the random forest model: 116
Our random forest performed better still, decreasing the RMSE to 116.
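Beyond the headline RMSE, a fitted forest exposes `feature_importances_`, which shows how much each input contributes to the splits. A sketch on synthetic data (with the real frame you would fit on `train[features]` and read off the same attribute):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the first column dominates the target by construction
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = X[:, 0] * 100 + X[:, 1] * 10 + rng.normal(scale=1.0, size=300)

rf = RandomForestRegressor(min_samples_leaf=25, random_state=0)
rf.fit(X, y)
for name, importance in zip(["x0", "x1", "x2"], rf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The importances sum to 1, so they can be read as each feature's share of the forest's explanatory power; here `x0` dominates, matching how the target was generated.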
After comparing the different machine learning models on our bike rental data, we can conclude that the random forest achieved the best accuracy, with the lowest RMSE of the three.