Bike sharing systems are the new generation of traditional bike rentals, where the whole process from membership to rental and return has become automatic. Through these systems, users can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising more than 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.
Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data they generate make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.
In this project we will try to predict the total number of bikes people rented in a given hour using different machine learning models. The dataset we are going to use was compiled by Hadi Fanaee-T at the University of Porto and can be downloaded from the University of California Irvine's website.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
%matplotlib inline
bike_rentals = pd.read_csv('bike_rental_hour.csv')
bike_rentals.head()
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |
The columns in the dataframe are:

- instant: record index
- dteday: date
- season: season of the year (1 to 4)
- yr: year (0: 2011, 1: 2012)
- mnth: month (1 to 12)
- hr: hour of the day (0 to 23)
- holiday: whether or not the day was a holiday
- weekday: day of the week
- workingday: whether the day was a working day (neither a weekend nor a holiday)
- weathersit: weather situation, from 1 (clear) to 4 (heavy rain or snow)
- temp: normalized temperature
- atemp: normalized feeling ("apparent") temperature
- hum: normalized humidity
- windspeed: normalized wind speed
- casual: number of rentals by casual (non-registered) users
- registered: number of rentals by registered users
- cnt: total number of rentals (casual plus registered)
plt.hist(bike_rentals['cnt'])
plt.title('Distribution of the total number of bike rentals')
plt.xlabel('Total number of bike rentals')
plt.ylabel('Frequency')
plt.show()
The distribution is right-skewed; we also see that, out of the 17,379 rows of data, almost 7,000 have between 0 and 100 bike rentals.
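We can sanity-check that claim directly (the exact bin edges depend on matplotlib's defaults, so cnt <= 100 is only an approximation of the first bar):

# Count the rows that fall roughly into the first histogram bin
print("Total rows:", len(bike_rentals))
print("Rows with cnt <= 100:", (bike_rentals['cnt'] <= 100).sum())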
cnt_corr = bike_rentals.corr()['cnt']  # correlation of every numeric column with cnt
top_corr_features = cnt_corr.index     # all numeric columns (corr() skips dteday)
plt.figure(figsize=(20, 20))
sns.heatmap(bike_rentals[top_corr_features].corr(), annot=True, cmap='RdYlGn')
plt.show()
bike_rentals.corr()["cnt"].reset_index().sort_values(by='cnt')
| | index | cnt |
| --- | --- | --- |
| 11 | hum | -0.322911 |
| 8 | weathersit | -0.142426 |
| 5 | holiday | -0.030927 |
| 6 | weekday | 0.026900 |
| 7 | workingday | 0.030284 |
| 12 | windspeed | 0.093234 |
| 3 | mnth | 0.120638 |
| 1 | season | 0.178056 |
| 2 | yr | 0.250495 |
| 0 | instant | 0.278379 |
| 4 | hr | 0.394071 |
| 10 | atemp | 0.400929 |
| 9 | temp | 0.404772 |
| 13 | casual | 0.694564 |
| 14 | registered | 0.972151 |
| 15 | cnt | 1.000000 |
We see that the columns hr, atemp, temp, casual, and registered have strong correlations with the cnt column, while weathersit, hum, and holiday are negatively correlated with it.
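If we wanted to select features programmatically instead of by eye, we could keep the columns whose absolute correlation with cnt exceeds some threshold (a sketch; the 0.3 cutoff is an arbitrary choice of mine):

# Columns strongly correlated with cnt, excluding the target and its two components
strong = cnt_corr[cnt_corr.abs() > 0.3].drop(['casual', 'registered', 'cnt'])
print(strong)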
bike_rentals.isnull().sum()
instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64
The dataset is already clean and there are no NaN values. However, if we feed the data as-is into a model, the error will be high, because the model won't correctly interpret certain columns, for example hr. This column holds values from 0 to 23 representing the hour of the day at which a bike was rented; a model reading these raw numbers will not understand that certain hours are related to each other. So the best thing is to create a label based on ranges like this:
def assign_label(hour):
    # Bucket the hour of the day (0-23) into four time-of-day labels
    if 6 <= hour < 12:
        return 1   # morning
    elif 12 <= hour < 18:
        return 2   # afternoon
    elif 18 <= hour < 24:
        return 3   # evening
    else:
        return 4   # night (0-5)
bike_rentals['time_label'] = bike_rentals['hr'].apply(assign_label)
bike_rentals.head()
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | time_label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 | 4 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 | 4 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 | 4 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 | 4 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 | 4 |
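The same binning could also be written with pd.cut (an equivalent sketch, not the approach used above; the labels are ordered to match assign_label's scheme):

# Bins are (-1, 5], (5, 11], (11, 17], (17, 23] -> night, morning, afternoon, evening
alt_label = pd.cut(bike_rentals['hr'], bins=[-1, 5, 11, 17, 23], labels=[4, 1, 2, 3]).astype(int)
print((alt_label == bike_rentals['time_label']).all())  # True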
Now that the data is cleaned and ready to use, we will select the metrics we need to analyze the performance of the machine learning algorithms. I have read about this, and according to a post on Towards Data Science the ideal thing to do is to use R2, MSE, and RMSE. I'm going to report all three of them, but I'm only going to use RMSE in the analysis.
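Since every model below is evaluated in exactly the same way, a small helper like this could keep the code DRY (a sketch of my own, not something the cells below depend on; RMSE is just the square root of MSE, and R2 equals 1 minus the MSE divided by the variance of the target):

def report_metrics(model, X_train, y_train, X_test, y_test):
    # Print MSE, RMSE and R2 on the train and test sets of an already fitted model
    for name, X, y in [("TRAIN", X_train, y_train), ("TEST", X_test, y_test)]:
        pred = model.predict(X)
        mse = mean_squared_error(y, pred)
        print(name + " ERRORS:")
        print("MSE:", mse)
        print("RMSE:", np.sqrt(mse))
        print("R2 SCORE:", r2_score(y, pred))
        print("-" * 20)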
Now that we have the metrics, let's make our train and test samples using the sample method.
train = bike_rentals.sample(frac=0.8)
test = bike_rentals[~bike_rentals.index.isin(train.index)]
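As an aside, scikit-learn's train_test_split does the same thing (a sketch; passing a fixed random_state would also make the numbers below reproducible, which the sample call above does not):

from sklearn.model_selection import train_test_split
train, test = train_test_split(bike_rentals, train_size=0.8, random_state=1)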
A linear regression model might work fairly well on this data because many of the columns are highly correlated with the cnt column. We are going to use all the columns except casual and registered, because cnt is simply the sum of those two.
features = bike_rentals.columns
features = features.drop(['dteday', 'casual', 'registered', 'cnt'])
target = 'cnt'
lr = LinearRegression()
lr.fit(train[features], train[target])
train_pred = lr.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
train_r2score = r2_score(train[target], train_pred)
test_pred = lr.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
test_r2score = r2_score(test[target], test_pred)
print("TRAIN ERRORS:")
print("MSE:", train_mse)
print("RMSE:", train_rmse)
print("R2 SCORE:", train_r2score)
print("-"*20)
print("TEST ERRORS:")
print("MSE:", test_mse)
print("RMSE:", test_rmse)
print("R2 SCORE:", test_r2score)
TRAIN ERRORS:
MSE: 17446.86070303239
RMSE: 132.08656518750266
R2 SCORE: 0.4665272762764584
--------------------
TEST ERRORS:
MSE: 17606.931959062764
RMSE: 132.6911148459563
R2 SCORE: 0.47720411497526505
The train and test errors are pretty close to each other: the test RMSE is only about 0.6 points higher than the train RMSE. So even though the model isn't overfit, it doesn't predict as well as we had hoped. Let's keep only the columns whose correlation with cnt is around 0.4 or higher, plus the negatively correlated ones.
feat = ['hum', 'weathersit', 'holiday', 'hr', 'atemp', 'temp']
lr = LinearRegression()
lr.fit(train[feat], train[target])
train_pred = lr.predict(train[feat])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
train_r2score = r2_score(train[target], train_pred)
test_pred = lr.predict(test[feat])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
test_r2score = r2_score(test[target], test_pred)
print("TRAIN ERRORS:")
print("MSE:", train_mse)
print("RMSE:", train_rmse)
print("R2 SCORE:", train_r2score)
print("-"*20)
print("TEST ERRORS:")
print("MSE:", test_mse)
print("RMSE:", test_rmse)
print("R2 SCORE:", test_r2score)
TRAIN ERRORS:
MSE: 22176.56176518386
RMSE: 148.91796992030163
R2 SCORE: 0.32190718954730113
--------------------
TEST ERRORS:
MSE: 22032.18177280162
RMSE: 148.43241483180694
R2 SCORE: 0.34580686767470237
The error increased. The model isn't overfit or underfit; it simply does not generalize well, which means that 0.47 is probably the best R2 score we can get with linear regression. We can also say that linear regression isn't the most adequate model for this dataset: rentals depend on the hour of the day in a strongly non-linear way (they peak around the morning and evening commutes), which a linear model cannot capture.
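We can see that non-linearity directly with a quick illustrative plot (not part of the original analysis):

# Mean rentals per hour of the day: the two commute peaks are something
# no straight line in hr (or in the coarse time_label) can fit
bike_rentals.groupby('hr')['cnt'].mean().plot(kind='bar')
plt.title('Mean rentals by hour of the day')
plt.show()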
Now let's use a Decision Tree Regressor to see how well this model works with this data.
clf = DecisionTreeRegressor(random_state=1)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
train_r2score = r2_score(train[target], train_pred)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
test_r2score = r2_score(test[target], test_pred)
print("TRAIN ERRORS:")
print("MSE:", train_mse)
print("RMSE:", train_rmse)
print("R2 SCORE:", train_r2score)
print("-"*20)
print("TEST ERRORS:")
print("MSE:", test_mse)
print("RMSE:", test_rmse)
print("R2 SCORE:", test_r2score)
TRAIN ERRORS:
MSE: 0.0
RMSE: 0.0
R2 SCORE: 1.0
--------------------
TEST ERRORS:
MSE: 3094.752589182969
RMSE: 55.63050052968218
R2 SCORE: 0.9081086970429444
The zero error on the train portion tells us that the model is overfit; by the way, this is a good example of why it's useful to look at all three metrics instead of only one. Let's experiment with some parameters of the Decision Tree Regressor to see if we can improve its accuracy.
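One way to see how far the unconstrained tree grew is to inspect its fitted tree_ attribute (an illustrative check, not in the original notebook):

# A very deep tree with thousands of nodes has essentially memorized the training set
print("Tree depth:", clf.tree_.max_depth)
print("Number of nodes:", clf.tree_.node_count)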
min_samples_leaf defaults to 1; let's make a for loop and try values from 3 to 10.
samples = list(range(3, 11))
t_mse = []
t_rmse = []
t_r2score = []
tt_mse = []
tt_rmse = []
tt_r2score = []
for i in samples:
clf = DecisionTreeRegressor(random_state=1, min_samples_leaf=i)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
train_r2score = r2_score(train[target], train_pred)
t_r2score.append(train_r2score)
t_rmse.append(train_rmse)
t_mse.append(train_mse)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
test_r2score = r2_score(test[target], test_pred)
tt_r2score.append(test_r2score)
tt_rmse.append(test_rmse)
tt_mse.append(test_mse)
print("Train Mean MSE:", np.mean(t_mse))
print("Test Mean MSE:", np.mean(tt_mse))
print('-'*20)
print("Train Mean RMSE:", np.mean(t_rmse))
print("Test Mean RMSE:", np.mean(tt_rmse))
print('-'*20)
print("Train Mean R2SCORE:", np.mean(t_r2score))
print("Test Mean R2SCORE:", np.mean(tt_r2score))
plt.plot(samples, t_rmse, color="red")
plt.plot(samples, tt_rmse, color="blue")
plt.title("Training Vs Test RMSE")
plt.xlabel("Samples")
plt.ylabel("Error")
plt.legend(["Training", "Test"], loc=0)
Train Mean MSE: 1201.671289332573
Test Mean MSE: 2555.735141820479
--------------------
Train Mean RMSE: 34.27076966935901
Test Mean RMSE: 50.54968435286302
--------------------
Train Mean R2SCORE: 0.9632564925775324
Test Mean R2SCORE: 0.9241135355970345
The model looks overfit. The R2 score difference is small (about 0.039 points), but the RMSE values and the graph show a sizeable gap between train and test.
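Instead of eyeballing the plot, we can also pick the leaf size with the lowest test RMSE (a quick check of my own):

# Index of the smallest test RMSE among the values we tried
best = int(np.argmin(tt_rmse))
print("Best min_samples_leaf:", samples[best], "with test RMSE:", tt_rmse[best])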
min_samples_split defaults to 2; let's make a for loop and try values from 3 to 10.
samples = list(range(3, 11))
t_mse = []
t_rmse = []
t_r2score = []
tt_mse = []
tt_rmse = []
tt_r2score = []
for i in samples:
clf = DecisionTreeRegressor(random_state=1, min_samples_split=i)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
train_r2score = r2_score(train[target], train_pred)
t_r2score.append(train_r2score)
t_rmse.append(train_rmse)
t_mse.append(train_mse)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
test_r2score = r2_score(test[target], test_pred)
tt_r2score.append(test_r2score)
tt_rmse.append(test_rmse)
tt_mse.append(test_mse)
print("Train Mean MSE:", np.mean(t_mse))
print("Test Mean MSE:", np.mean(tt_mse))
print('-'*20)
print("Train Mean RMSE:", np.mean(t_rmse))
print("Test Mean RMSE:", np.mean(tt_rmse))
print('-'*20)
print("Train Mean R2SCORE:", np.mean(t_r2score))
print("Test Mean R2SCORE:", np.mean(tt_r2score))
Train Mean MSE: 382.93792288106715
Test Mean MSE: 2739.235787580488
--------------------
Train Mean RMSE: 18.723942266392868
Test Mean RMSE: 52.316556926686204
--------------------
Train Mean R2SCORE: 0.9882909057271896
Test Mean R2SCORE: 0.9186649212259579
plt.plot(samples, t_rmse, color="red")
plt.plot(samples, tt_rmse, color="blue")
plt.title("Training Vs Test RMSE")
plt.xlabel("Samples")
plt.ylabel("Error")
plt.legend(["Training", "Test"], loc=0)
The RMSE values tell us that the model is overfit, even though the R2 scores are pretty similar. Looking at the graph, the training error increases with each split-size value while the test error decreases in small steps. We know that single decision trees tend to overfit, so let's build a Random Forest Regressor, which averages many trees trained on bootstrapped samples with random feature subsets and should therefore generalize better.
Let's train a Random Forest Regressor with the number of estimators ranging from 10 to 109 and check the errors.
estimators = list(range(10, 110))
t_mse = []
t_rmse = []
t_r2score = []
tt_mse = []
tt_rmse = []
tt_r2score = []
for i in estimators:
clf = RandomForestRegressor(n_estimators=i, random_state=1)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
t_mse.append(train_mse)
train_rmse = np.sqrt(train_mse)
t_rmse.append(train_rmse)
train_r2score = r2_score(train[target], train_pred)
t_r2score.append(train_r2score)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
tt_mse.append(test_mse)
test_rmse = np.sqrt(test_mse)
tt_rmse.append(test_rmse)
test_r2score = r2_score(test[target], test_pred)
tt_r2score.append(test_r2score)
print("Train Mean MSE:", np.mean(t_mse))
print("Test Mean MSE:", np.mean(tt_mse))
print('-'*20)
print("Train Mean RMSE:", np.mean(t_rmse))
print("Test Mean RMSE:", np.mean(tt_rmse))
print('-'*20)
print("Train Mean R2SCORE:", np.mean(t_r2score))
print("Test Mean R2SCORE:", np.mean(tt_r2score))
Train Mean MSE: 255.0067634042666
Test Mean MSE: 1622.397616316688
--------------------
Train Mean RMSE: 15.954551616553571
Test Mean RMSE: 40.27583985198987
--------------------
Train Mean R2SCORE: 0.9922026572598499
Test Mean R2SCORE: 0.9518267691579441
plt.plot(estimators, t_rmse, color="red")
plt.plot(estimators, tt_rmse, color="blue")
plt.title("Training Vs Test RMSE")
plt.xlabel("Estimators")
plt.ylabel("Error")
plt.legend(["Training", "Test"], loc=0)
The model looks overfit even though both errors tend to decrease as the number of estimators grows. However, the R2 scores are pretty close, with only about a 0.04-point difference. Let's try the same parameters we used with the Decision Tree Regressor.
min_samples_leaf defaults to 1; let's make a for loop and try values from 2 to 10, this time using 100 estimators.
leafs = list(range(2, 11))
t_mse = []
t_rmse = []
t_r2score = []
tt_mse = []
tt_rmse = []
tt_r2score = []
for i in leafs:
clf = RandomForestRegressor(n_estimators=100, min_samples_leaf=i, random_state=1)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
t_mse.append(train_mse)
train_rmse = np.sqrt(train_mse)
t_rmse.append(train_rmse)
train_r2score = r2_score(train[target], train_pred)
t_r2score.append(train_r2score)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
tt_mse.append(test_mse)
test_rmse = np.sqrt(test_mse)
tt_rmse.append(test_rmse)
test_r2score = r2_score(test[target], test_pred)
tt_r2score.append(test_r2score)
print("Train Mean MSE:", np.mean(t_mse))
print("Test Mean MSE:", np.mean(tt_mse))
print('-'*20)
print("Train Mean RMSE:", np.mean(t_rmse))
print("Test Mean RMSE:", np.mean(tt_rmse))
print('-'*20)
print("Train Mean R2SCORE:", np.mean(t_r2score))
print("Test Mean R2SCORE:", np.mean(tt_r2score))
Train Mean MSE: 1061.962955615979
Test Mean MSE: 1860.3477872116387
--------------------
Train Mean RMSE: 32.1050735173245
Test Mean RMSE: 43.09244600658648
--------------------
Train Mean R2SCORE: 0.9675283548101297
Test Mean R2SCORE: 0.9447614058979481
plt.plot(leafs, t_rmse, color="red")
plt.plot(leafs, tt_rmse, color="blue")
plt.title("Training Vs Test RMSE, 100 estimators")
plt.xlabel("Samples")
plt.ylabel("Error")
plt.legend(["Training", "Test"], loc=0)
min_samples_split defaults to 2; let's make a for loop and try values from 3 to 10.
splits = list(range(3, 11))
t_mse = []
t_rmse = []
t_r2score = []
tt_mse = []
tt_rmse = []
tt_r2score = []
for i in splits:
clf = RandomForestRegressor(n_estimators=100, min_samples_split=i, random_state=1)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
t_mse.append(train_mse)
train_rmse = np.sqrt(train_mse)
t_rmse.append(train_rmse)
train_r2score = r2_score(train[target], train_pred)
t_r2score.append(train_r2score)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
tt_mse.append(test_mse)
test_rmse = np.sqrt(test_mse)
tt_rmse.append(test_rmse)
test_r2score = r2_score(test[target], test_pred)
tt_r2score.append(test_r2score)
print("Train Mean MSE:", np.mean(t_mse))
print("Test Mean MSE:", np.mean(tt_mse))
print('-'*20)
print("Train Mean RMSE:", np.mean(t_rmse))
print("Test Mean RMSE:", np.mean(tt_rmse))
print('-'*20)
print("Train Mean R2SCORE:", np.mean(t_r2score))
print("Test Mean R2SCORE:", np.mean(tt_r2score))
Train Mean MSE: 490.63672792743193
Test Mean MSE: 1629.5071542052597
--------------------
Train Mean RMSE: 21.919418564546255
Test Mean RMSE: 40.36423374795747
--------------------
Train Mean R2SCORE: 0.9849977989701746
Test Mean R2SCORE: 0.9516156683732522
plt.plot(splits, t_rmse, color="red")
plt.plot(splits, tt_rmse, color="blue")
plt.title("Training Vs Test RMSE")
plt.xlabel("Samples")
plt.ylabel("Error")
plt.legend(["Training", "Test"], loc=0)
test_rmse_errors = [132.6911148459563, 148.43241483180694, 55.63050052968218,
                    50.54968435286302, 52.316556926686204, 40.27583985198987,
                    43.09244600658648, 40.36423374795747]
models = ['Linear Regression', 'Linear Regression with Feature selection',
          'Decision Tree Regressor', 'Decision Tree Regressor (min_samples_leaf)',
          'Decision Tree Regressor (min_samples_split)', 'Random Forest Regressor',
          'Random Forest Regressor (min_samples_leaf)',
          'Random Forest Regressor (min_samples_split)']
sns.barplot(x=test_rmse_errors, y=models, orient='h')
plt.title("Test RMSE Errors")
plt.show()
It's clear that the Random Forest Regressors have the lowest RMSE values: the gap between the best Decision Tree (about 50.5) and the best Random Forest (about 40.3) is roughly 10 units.
In the Decision Trees, experimenting with the parameters helped to reduce the RMSE; in the Random Forest, however, it caused a slight increase. The feature selection in linear regression did not help at all.
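Out of curiosity, we can also ask the forest which features it actually relied on (a sketch; clf here is whichever random forest the last loop fitted):

# Feature importances of the most recently fitted random forest
importances = pd.Series(clf.feature_importances_, index=features)
print(importances.sort_values(ascending=False))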
Some of the models looked overfit if we judge by the MSE and RMSE, and for most of them that gap was hard to reduce. At first it seems strange that the R2 scores did not show the same signs of overfitting, but the two metrics are consistent: R2 is computed from the same squared error, just normalized by the variance of cnt, and since cnt varies a lot, a noticeable RMSE gap translates into only a small R2 gap.
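A quick back-of-the-envelope check of that last point (illustrative arithmetic; r2_score actually uses the variance of each evaluation set, so these are only approximations):

# Convert the base random forest's mean RMSEs back into approximate R2 values
var_cnt = bike_rentals['cnt'].var()
for rmse in (15.95, 40.28):  # train and test mean RMSE from the n_estimators loop
    print(rmse, "->", round(1 - rmse ** 2 / var_cnt, 4))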