Many American cities have communal bike sharing stations where you can rent bicycles by the hour or day. Washington, D.C. is one of these cities. The District collects detailed data on the number of bicycles people rent by the hour and day.
The data set we will be working with (bike_rental_hour.csv) contains 17379 rows, with each row representing the number of bike rentals for a single hour of a single day.
The following is a description of the columns in the dataset:

- `instant` - A unique sequential ID number for each row
- `dteday` - The date of the rentals
- `season` - The season in which the rentals occurred
- `yr` - The year the rentals occurred
- `mnth` - The month the rentals occurred
- `hr` - The hour the rentals occurred
- `holiday` - Whether or not the day was a holiday
- `weekday` - The day of the week (as a number, 0 to 6)
- `workingday` - Whether or not the day was a working day
- `weathersit` - The weather (as a categorical variable)
- `temp` - The temperature, on a 0-1 scale
- `atemp` - The adjusted temperature
- `hum` - The humidity, on a 0-1 scale
- `windspeed` - The wind speed, on a 0-1 scale
- `casual` - The number of casual riders (people who hadn't previously signed up with the bike sharing program)
- `registered` - The number of registered riders (people who had already signed up)
- `cnt` - The total number of bike rentals (casual + registered)

In this project we'll try to predict the total number of bikes people rented in a given hour. We'll predict the `cnt` column using all the columns except `casual` and `registered`.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
%matplotlib inline
bike_rentals = pd.read_csv('bike_rental_hour.csv')
bike_rentals.head(3)
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
print('The data set has', bike_rentals.shape[0], 'rows, and', bike_rentals.shape[1], 'columns.')
The data set has 17379 rows, and 17 columns.
The following histogram depicts the `cnt` column of `bike_rentals`:
plt.hist(bike_rentals['cnt'])
plt.show()
Using the `corr()` method, we can find out how each column correlates with `cnt`.
bike_rentals.corr()['cnt'].sort_values()
hum          -0.322911
weathersit   -0.142426
holiday      -0.030927
weekday       0.026900
workingday    0.030284
windspeed     0.093234
mnth          0.120638
season        0.178056
yr            0.250495
instant       0.278379
hr            0.394071
atemp         0.400929
temp          0.404772
casual        0.694564
registered    0.972151
cnt           1.000000
Name: cnt, dtype: float64
The following columns have a correlation coefficient with `cnt` greater than 0.3:

- `hr`
- `atemp`
- `temp`
- `casual`
- `registered`
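The same shortlist can be pulled out programmatically rather than read off by eye; a minimal sketch (`cnt_corr` is just an illustrative name):
cnt_corr = bike_rentals.corr()['cnt'].drop('cnt')
# keep only the predictors whose correlation with `cnt` exceeds 0.3
print(cnt_corr[cnt_corr > 0.3].sort_values())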
## time_label Feature

The `hr` column in `bike_rentals` contains the hour during which the bikes were rented, from 0 to 23. A model will treat each hour as a distinct value, without understanding that certain hours are related. We can introduce some order into the process by creating a new column with labels for morning, afternoon, evening, and night. This will bundle similar times together, enabling the model to make better decisions.
# `assign_label` takes a numeric hour and returns `1`, `2`, `3`, or `4` for
# morning (6-12), afternoon (12-18), evening (18-24), and night (0-6),
# respectively
def assign_label(hr):
    if 0 <= hr < 6:
        return 4  # night
    elif 6 <= hr < 12:
        return 1  # morning
    elif 12 <= hr < 18:
        return 2  # afternoon
    else:
        return 3  # evening (18 to 23)
# apply `assign_label` function to each item in the `hr` column and
# assign the result to `time_label` column of `bike_rentals`
bike_rentals['time_label'] = bike_rentals['hr'].apply(assign_label)
bike_rentals['time_label'].value_counts()
2    4375
3    4368
1    4360
4    4276
Name: time_label, dtype: int64
- The `instant` column has unique sequential IDs for each row, hence we should discard it from the list of predictors.
- `dteday` should also be discarded, since we have other columns describing the year, month, and weekday/workingday details of the rental.
- We should drop the `casual` and `registered` columns, because `cnt` is derived from them.
- `cnt` should be skipped, as it's the target column.

Hence, the following is the list of predictors we can use for training and prediction:
features = ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday',
'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'time_label']
We'll split `bike_rentals` into training and testing sets, with 80% of the rows in the training set and the remaining rows in the testing set.

The mean squared error (MSE) metric makes the most sense for evaluating our error: it works on continuous numeric data, which fits our target well. We'll report its square root, the root mean squared error (RMSE), so the error is in the same units as `cnt`.
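As a quick illustration of the metric itself, computed by hand on toy numbers (not our data):
# RMSE by hand: mean of the squared residuals, then the square root
actual = np.array([10, 20, 30])
predicted = np.array([12, 18, 33])
print(np.sqrt(((actual - predicted) ** 2).mean()))  # sqrt(17/3) ≈ 2.38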
train = bike_rentals.sample(frac=0.8, random_state=1)
print('Shape of train dataframe: ', train.shape)
Shape of train dataframe: (13903, 18)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
print('Shape of test dataframe: ', test.shape)
Shape of test dataframe: (3476, 18)
Since several columns correlate moderately to strongly with `cnt` (as seen earlier), linear regression might work fairly well on this data.
lr = LinearRegression()
lr.fit(train[features], train['cnt'])
train_predictions = lr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = lr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Linear Regression Model')
print('------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Linear Regression Model
------------------------
Train RMSE:  132.6124243916765
Test RMSE:  130.5946379586095
The error is quite high. This is likely because `cnt` is heavily skewed: hours with extremely low or extremely high rental counts produce large residuals, which the squared-error metric penalizes heavily.
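A quick sanity check of that hypothesis is to look at the spread of `cnt` (output not shown):
# a large gap between the median, mean, and max of `cnt` would point
# to the long tail suspected above
print(bike_rentals['cnt'].describe())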
lr_results = ['default', round(train_rmse, 2), round(test_rmse, 2)]
Decision trees can capture the nonlinear relationships in this data that linear regression cannot, so they tend to predict outcomes more reliably here. We'll apply one to our data set.
# default settings
dtr = DecisionTreeRegressor()
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_1 = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_1 = np.sqrt(test_mse)
print('Decision Tree - with default settings')
print('-----------------------------------')
print('Train RMSE: ', train_rmse_1)
print('Test RMSE: ', test_rmse_1)
Decision Tree - with default settings
-----------------------------------
Train RMSE:  0.5693317794626366
Test RMSE:  58.713680838510236
# `min_samples_leaf` parameter set to `5`
dtr = DecisionTreeRegressor(min_samples_leaf=5)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_2 = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_2 = np.sqrt(test_mse)
print('Decision Tree - with min_samples_leaf=5')
print('-----------------------------------------')
print('Train RMSE: ', train_rmse_2)
print('Test RMSE: ', test_rmse_2)
Decision Tree - with min_samples_leaf=5
-----------------------------------------
Train RMSE:  33.41484543668025
Test RMSE:  55.05695050722245
# `min_samples_leaf` parameter set to `2`
dtr = DecisionTreeRegressor(min_samples_leaf=2)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_3 = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_3 = np.sqrt(test_mse)
print('Decision Tree - with min_samples_leaf=2')
print('----------------------------------------')
print('Train RMSE: ', train_rmse_3)
print('Test RMSE: ', test_rmse_3)
Decision Tree - with min_samples_leaf=2
----------------------------------------
Train RMSE:  18.832608268417342
Test RMSE:  56.40828005943384
# using both min_samples_leaf and max_depth parameters
dtr = DecisionTreeRegressor(min_samples_leaf=2, max_depth=2)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_4 = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_4 = np.sqrt(test_mse)
print('Decision Tree - with min_samples_leaf=2, max_depth=2')
print('----------------------------------------')
print('Train RMSE: ', train_rmse_4)
print('Test RMSE: ', test_rmse_4)
Decision Tree - with min_samples_leaf=2, max_depth=2
----------------------------------------
Train RMSE:  139.98027858198864
Test RMSE:  135.73585137094457
Compared to linear regression, the decision tree regressor achieves a much lower error. Note that with default settings the tree overfits badly (train RMSE near zero, test RMSE around 59); the best results are obtained with the `min_samples_leaf` parameter set to 5, which regularizes the tree by requiring at least five samples per leaf.
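Rather than copy-pasting a cell for each setting, the same comparison could be wrapped in a small helper; a minimal sketch (the `evaluate` helper is illustrative, not part of the original workflow):
def evaluate(**params):
    # fit a decision tree with the given parameters and return
    # (train RMSE, test RMSE), mirroring the cells above
    model = DecisionTreeRegressor(**params)
    model.fit(train[features], train['cnt'])
    train_rmse = np.sqrt(mean_squared_error(train['cnt'],
                                            model.predict(train[features])))
    test_rmse = np.sqrt(mean_squared_error(test['cnt'],
                                           model.predict(test[features])))
    return train_rmse, test_rmse

for leaf in [1, 2, 5, 10, 20]:
    print(leaf, evaluate(min_samples_leaf=leaf))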
## saving the result for concluding remarks
dt_results = ['min_samples_leaf=5', round(train_rmse_2, 2), round(test_rmse_2, 2)]
The random forest algorithm improves on the decision tree by averaging the predictions of many trees, which reduces overfitting, and it tends to be much more accurate than simple models like linear regression. We'll apply it to our data set next.
# default settings
rfr = RandomForestRegressor(n_estimators=10)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_1 = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_1 = np.sqrt(test_mse)
print('Random Forests - with default settings')
print('----------------------------------')
print('Train RMSE: ', train_rmse_1)
print('Test RMSE: ', test_rmse_1)
Random Forests - with default settings
----------------------------------
Train RMSE:  18.853830913022996
Test RMSE:  46.3275419770308
# setting `min_samples_leaf` parameter to 5
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=5)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_2 = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_2 = np.sqrt(test_mse)
print('Random Forests - with min_samples_leaf=5')
print('----------------------------------')
print('Train RMSE: ', train_rmse_2)
print('Test RMSE: ', test_rmse_2)
Random Forests - with min_samples_leaf=5
----------------------------------
Train RMSE:  34.8643507298322
Test RMSE:  46.44190434594647
# setting `min_samples_leaf` parameter to 2
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=2)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_3 = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_3 = np.sqrt(test_mse)
print('Random Forests - with min_samples_leaf=2')
print('----------------------------------------')
print('Train RMSE: ', train_rmse_3)
print('Test RMSE: ', test_rmse_3)
Random Forests - with min_samples_leaf=2
----------------------------------------
Train RMSE:  24.4655602063916
Test RMSE:  45.74983905441973
# setting `min_samples_leaf` parameter to 5 and `max_depth` to 3
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=5, max_depth=3)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_4 = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_4 = np.sqrt(test_mse)
print('Random Forests - with min_samples_leaf=5, max_depth=3')
print('------------------------------------------------------')
print('Train RMSE: ', train_rmse_4)
print('Test RMSE: ', test_rmse_4)
Random Forests - with min_samples_leaf=5, max_depth=3
------------------------------------------------------
Train RMSE:  127.8808840609669
Test RMSE:  125.55884996971498
The random forest regressor improves over the decision tree regressor, and it yields its lowest error when `min_samples_leaf` is 2. (Constraining `max_depth` to 3, as in the last run, makes the forest underfit badly.)
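Random forests also expose a `feature_importances_` attribute, which indicates how much each predictor contributed to the trees' splits. A short sketch (we refit the best setting first, since `rfr` currently holds the `max_depth=3` model from the last cell):
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=2)
rfr.fit(train[features], train['cnt'])
# rank the predictors by their contribution to the forest's splits
importances = pd.Series(rfr.feature_importances_, index=features)
print(importances.sort_values(ascending=False))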
## saving the result for concluding remarks
rf_results = ['min_samples_leaf=2', round(train_rmse_3, 2), round(test_rmse_3, 2)]
Next we'll experiment with smaller sets of predictors. Recall the columns most correlated with `cnt`: `hr`, `atemp`, and `temp`. In these experiments we'll use `time_label` in place of `hr`, and since `atemp` and `temp` are almost the same we'll use `temp` alone. First, let's see how the purely weather-related columns perform on their own.

features = ['temp', 'hum', 'windspeed']
# default settings
dtr = DecisionTreeRegressor()
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - default settings, predictors=[temp, hum, windspeed]')
print('-------------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - default settings, predictors=[temp, hum, windspeed]
-------------------------------------------------------------------------
Train RMSE:  112.80870625300098
Test RMSE:  184.15448005236038
# `min_samples_leaf` parameter set to `5`
dtr = DecisionTreeRegressor(min_samples_leaf=5)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - with min_samples_leaf=5, predictors=[temp, hum, windspeed]')
print('-------------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - with min_samples_leaf=5, predictors=[temp, hum, windspeed]
-------------------------------------------------------------------------
Train RMSE:  135.13446508959527
Test RMSE:  163.6345940526717
# `min_samples_leaf` parameter set to `2`
dtr = DecisionTreeRegressor(min_samples_leaf=2)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - with predictors=[temp, hum, windspeed], min_samples_leaf=2')
print('-------------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - with predictors=[temp, hum, windspeed], min_samples_leaf=2
-------------------------------------------------------------------------
Train RMSE:  123.4735214080467
Test RMSE:  170.8897244627551
We can observe that the error has increased compared to the earlier feature list, so we should discard this set of predictors. Let's check whether random forests fare any better with it.
# setting `min_samples_leaf` parameter to 5
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=5)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Random Forests - with predictors=[temp, hum, windspeed], min_samples_leaf=5')
print('---------------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Random Forests - with predictors=[temp, hum, windspeed], min_samples_leaf=5
---------------------------------------------------------------------------
Train RMSE:  136.8351931212382
Test RMSE:  153.50974300842228
# setting `min_samples_leaf` parameter to 2
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=2)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Random Forests - with predictors=[temp, hum, windspeed], min_samples_leaf=2')
print('---------------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Random Forests - with predictors=[temp, hum, windspeed], min_samples_leaf=2
---------------------------------------------------------------------------
Train RMSE:  127.0283035632838
Test RMSE:  160.24482158164565
For random forests, too, the error is higher than the one obtained with the earlier feature list.
Next we'll change the features list to `time_label` and `temp`, and apply the decision tree algorithm with a few settings.
features = ['time_label', 'temp']
dtr = DecisionTreeRegressor()
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - with predictors=[time_label, temp]')
print('--------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - with predictors=[time_label, temp]
--------------------------------------------------
Train RMSE:  134.92603367210947
Test RMSE:  132.19620283052873
dtr = DecisionTreeRegressor(min_samples_leaf=5)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - with predictors=[time_label, temp], min_samples_leaf=5')
print('----------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - with predictors=[time_label, temp], min_samples_leaf=5
----------------------------------------------------------------------
Train RMSE:  134.97363925396965
Test RMSE:  132.13711955120485
dtr = DecisionTreeRegressor(min_samples_leaf=2)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - with predictors=[time_label, temp], min_samples_leaf=2')
print('----------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - with predictors=[time_label, temp], min_samples_leaf=2
----------------------------------------------------------------------
Train RMSE:  134.9484658080092
Test RMSE:  132.135078408519
The predictors `time_label` and `temp` also diminish the accuracy, so we discard them as well.
The steps above can be summarized as follows.
result = pd.DataFrame(index=['Linear Regression', 'Decision Tree', 'Random Forests'],
columns=['settings', 'Train RMSE', 'Test RMSE'],
data = np.array([lr_results, dt_results, rf_results]))
result
| | settings | Train RMSE | Test RMSE |
|---|---|---|---|
| Linear Regression | default | 132.61 | 130.59 |
| Decision Tree | min_samples_leaf=5 | 33.41 | 55.06 |
| Random Forests | min_samples_leaf=2 | 24.47 | 45.75 |
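As a next step, k-fold cross-validation would give a more robust error estimate than our single train/test split; a minimal sketch using scikit-learn's `cross_val_score` (the fold count and model setting here are illustrative):
from sklearn.model_selection import cross_val_score

# restore the full predictor list (it was narrowed in the experiments above)
features = ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday',
            'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'time_label']
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=2)
# `neg_mean_squared_error` returns negated MSE, so flip the sign
# before taking the square root
scores = cross_val_score(rfr, bike_rentals[features], bike_rentals['cnt'],
                         scoring='neg_mean_squared_error', cv=5)
print(np.sqrt(-scores).mean())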