Many U.S. cities have communal bike sharing stations where you can rent bicycles by the hour or day. Washington, D.C. is one of these cities. The District collects detailed data on the number of bicycles people rent by the hour and day.
Hadi Fanaee-T at the University of Porto compiled this data into a CSV file, which we'll work with in this project. The file contains 17379 rows, with each row representing the number of bike rentals for a single hour of a single day. The data can be downloaded from the University of California, Irvine's website.
Here are the descriptions for the relevant columns:

- instant - A unique sequential ID number for each row
- dteday - The date of the rentals
- season - The season in which the rentals occurred
- yr - The year the rentals occurred (0: 2011, 1: 2012)
- mnth - The month the rentals occurred
- hr - The hour the rentals occurred
- holiday - Whether or not the day was a holiday
- weekday - The day of the week
- workingday - Whether or not the day was a working day (neither a weekend nor a holiday)
- weathersit - The weather situation (categorical, from 1 = clear to 4 = heavy rain/snow)
- temp - The temperature, normalized
- atemp - The adjusted ("feels like") temperature, normalized
- hum - The humidity, normalized
- windspeed - The wind speed, normalized
- casual - The number of casual riders (no subscription)
- registered - The number of registered riders
- cnt - The total number of bike rentals (casual + registered)
In this project, we'll try to predict the total number of bikes people rented in a given hour. We'll predict the cnt column using all of the other columns, except for casual and registered. To accomplish this, we'll create a few different machine learning models and evaluate their performance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# Read in the dataframe
bike_rentals = pd.read_csv("hour.csv", parse_dates=["dteday"])
# Display basic info and the head of the dataframe
bike_rentals.info()
display(bike_rentals.head())
# Display a histogram of our variable of interest, "cnt"
plt.hist(bike_rentals["cnt"])
plt.xlabel("cnt")
plt.ylabel("Frequency")
plt.show()
# Display the correlation values of the rest of the columns with "cnt"
print('Correlation values of each column with "cnt" column:')
print(bike_rentals.corr(numeric_only=True)["cnt"])
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 17379 entries, 0 to 17378
    Data columns (total 17 columns):
     #   Column      Non-Null Count  Dtype
    ---  ------      --------------  -----
     0   instant     17379 non-null  int64
     1   dteday      17379 non-null  datetime64[ns]
     2   season      17379 non-null  int64
     3   yr          17379 non-null  int64
     4   mnth        17379 non-null  int64
     5   hr          17379 non-null  int64
     6   holiday     17379 non-null  int64
     7   weekday     17379 non-null  int64
     8   workingday  17379 non-null  int64
     9   weathersit  17379 non-null  int64
     10  temp        17379 non-null  float64
     11  atemp       17379 non-null  float64
     12  hum         17379 non-null  float64
     13  windspeed   17379 non-null  float64
     14  casual      17379 non-null  int64
     15  registered  17379 non-null  int64
     16  cnt         17379 non-null  int64
    dtypes: datetime64[ns](1), float64(4), int64(12)
    memory usage: 2.3 MB
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |
    Correlation values of each column with "cnt" column:
    instant       0.278379
    season        0.178056
    yr            0.250495
    mnth          0.120638
    hr            0.394071
    holiday      -0.030927
    weekday       0.026900
    workingday    0.030284
    weathersit   -0.142426
    temp          0.404772
    atemp         0.400929
    hum          -0.322911
    windspeed     0.093234
    casual        0.694564
    registered    0.972151
    cnt           1.000000
    Name: cnt, dtype: float64
Next, we'll create a feature that categorizes the hour of the day as morning, afternoon, evening, or night. This bundles similar times together, enabling the model to make better decisions.
# Create a function to categorize the time of day
def assign_label(hour):
    """
    Takes the hour of the day as input and returns an integer from 1 to 4
    depending on whether it's a morning, afternoon, evening, or night hour.

    Args:
        hour: Hour of the day, as an integer (0-23)

    Returns:
        int: 1 (morning), 2 (afternoon), 3 (evening), or 4 (night)
    """
    if 6 <= hour < 12:
        return 1
    elif 12 <= hour < 18:
        return 2
    elif 18 <= hour < 24:
        return 3
    else:  # 0 <= hour < 6
        return 4
# Apply the function to the dataframe
bike_rentals["time_label"] = bike_rentals["hr"].apply(assign_label)
# Display head of the dataframe
display(bike_rentals.head())
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | time_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 | 4 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 | 4 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 | 4 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 | 4 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 | 4 |
# Compute the number of rows that train dataframe will have
train_size = round(bike_rentals.shape[0] * .8)
# Select a random sample to be the train dataframe
train = bike_rentals.sample(n=train_size, random_state=0)
# Select the remaining rows to be the test dataframe
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
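The manual sample/filter split above can also be done in one call with scikit-learn's `train_test_split`. A minimal sketch on a synthetic frame (the toy column values below are illustrative, not the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the bike_rentals frame (synthetic values)
df = pd.DataFrame({"temp": [0.2, 0.4, 0.6, 0.8, 0.5],
                   "cnt": [16, 40, 90, 120, 60]})

# 80/20 split; random_state makes it reproducible, like sample(random_state=0)
train_df, test_df = train_test_split(df, train_size=0.8, random_state=0)
print(len(train_df), len(test_df))  # -> 4 1
```

Both approaches give a random 80/20 partition; `train_test_split` additionally supports stratification if we ever wanted balanced splits across a category.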
# List the features to train the model on
features = ["temp", "hum", "workingday", "yr", "time_label"]
# Instantiate a linear regression model
lr = LinearRegression()
# Fit the model using the train dataset
lr.fit(train[features], train["cnt"])
# Predict on the test dataset
predictions = lr.predict(test[features])
# Calculate root mean squared error and display it
rmse = mean_squared_error(test["cnt"], predictions) ** 0.5
print("RMSE value of the linear regression model:", round(rmse))
    RMSE value of the linear regression model: 141
Our linear regression model returned an RMSE value of 141, which gives us a baseline to improve on.
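One pitfall worth flagging here: in Python, `mse ** 1/2` parses as `(mse ** 1) / 2`, which halves the MSE instead of taking its square root. Using `np.sqrt` (or explicit parentheses, `** (1/2)`) makes the intent unambiguous. A small self-contained check:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                       # explicit square root
print(round(mse, 4), round(rmse, 4))      # -> 1.6667 1.291
```

Because `**` binds tighter than `/`, the parenthesized form and `np.sqrt` agree, while the unparenthesized form silently returns half the MSE.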
Next, we'll try a decision tree algorithm to see whether it improves prediction accuracy.
# Loop over different values of min_samples_leaf to search for the optimal one
rmse_values = []
parameter_list = [1, 5, 10, 15, 20, 25, 50, 100]
for parameter in parameter_list:
    # Instantiate a decision tree model
    dt = DecisionTreeRegressor(min_samples_leaf=parameter)
    # Fit the model using the train dataset
    dt.fit(train[features], train["cnt"])
    # Predict on the test dataset
    predictions = dt.predict(test[features])
    # Calculate root mean squared error and store it
    rmse = mean_squared_error(test["cnt"], predictions) ** 0.5
    rmse_values.append(rmse)
# Plot the results of the different models
plt.plot(parameter_list, rmse_values)
plt.xlabel("min_samples_leaf value")
plt.ylabel("RMSE value")
plt.show()
# Find the best parameter and display it
best_par = parameter_list[rmse_values.index(min(rmse_values))]
print("The best model has a min_samples_leaf parameter of", best_par, "and an RMSE value of", round(min(rmse_values)))
    The best model has a min_samples_leaf parameter of 25 and an RMSE value of 119
Our decision tree lowered the RMSE to 119, a clear improvement over the linear regression baseline.
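The manual loop above works, but scoring each candidate against the same held-out test set risks tuning to it. As a sketch of an alternative (shown on synthetic data, not the bike rental frame), scikit-learn's `GridSearchCV` cross-validates each `min_samples_leaf` candidate using only the training data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for train[features] / train["cnt"]
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X[:, 0] * 100 + rng.normal(scale=5.0, size=200)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 10, 25, 50]},
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

With this setup the test set stays untouched until the final evaluation, so the reported RMSE is a fairer estimate of generalization error.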
Finally, we'll try a random forest approach using the same min_samples_leaf value to see whether we can improve further.
# Instantiate a random forest model
rf = RandomForestRegressor(min_samples_leaf=25)
# Fit the model using the train dataset
rf.fit(train[features], train["cnt"])
# Predict on the test dataset
predictions = rf.predict(test[features])
# Calculate root mean squared error and display it
rmse = mean_squared_error(test["cnt"], predictions) ** 0.5
print("RMSE value of the random forest model:", round(rmse))
    RMSE value of the random forest model: 116
Our random forest performed better still, decreasing the RMSE to 116.
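Beyond the headline RMSE, a fitted forest exposes `feature_importances_`, which shows how much each input contributes to the splits. A sketch on synthetic data (with the real frame you would fit on `train[features]` and read off the same attribute):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the first column dominates the target by construction
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = X[:, 0] * 100 + X[:, 1] * 10 + rng.normal(scale=1.0, size=300)

rf = RandomForestRegressor(min_samples_leaf=25, random_state=0)
rf.fit(X, y)
for name, importance in zip(["x0", "x1", "x2"], rf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The importances sum to 1, so they can be read as each feature's share of the forest's explanatory power; here `x0` dominates, matching how the target was generated.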
After comparing the different machine learning models on our bike rental data, we can conclude that the random forest achieved the best accuracy, with the lowest RMSE of the three.