Many American cities have communal bike sharing stations where you can rent bicycles by the hour or day. Washington, D.C. is one of these cities. The District collects detailed data on the number of bicycles people rent by the hour and day.
The data set we will be working with (bike_rental_hour.csv) contains 17379 rows, with each row representing the number of bike rentals for a single hour of a single day.
The following is a description of the columns in the dataset:

- `instant` - A unique sequential ID number for each row
- `dteday` - The date of the rentals
- `season` - The season in which the rentals occurred
- `yr` - The year the rentals occurred
- `mnth` - The month the rentals occurred
- `hr` - The hour the rentals occurred
- `holiday` - Whether or not the day was a holiday
- `weekday` - The day of the week (as a number, 0 to 6)
- `workingday` - Whether or not the day was a working day
- `weathersit` - The weather (as a categorical variable)
- `temp` - The temperature, on a 0-1 scale
- `atemp` - The adjusted temperature
- `hum` - The humidity, on a 0-1 scale
- `windspeed` - The wind speed, on a 0-1 scale
- `casual` - The number of casual riders (people who hadn't previously signed up with the bike sharing program)
- `registered` - The number of registered riders (people who had already signed up)
- `cnt` - The total number of bike rentals (casual + registered)

In this project we'll try to predict the total number of bikes people rented in a given hour. We'll predict the `cnt` column using all the columns except `casual` and `registered`.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
%matplotlib inline
bike_rentals = pd.read_csv('bike_rental_hour.csv')
bike_rentals.head(3)
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
print('The data set has', bike_rentals.shape[0], 'rows, and', bike_rentals.shape[1], 'columns.')
The data set has 17379 rows, and 17 columns.
The following histogram depicts the `cnt` column of `bike_rentals`:
plt.hist(bike_rentals['cnt'])
plt.show()
Using the `corr()` method, we can find out how each column correlates with `cnt`.
bike_rentals.corr()['cnt'].sort_values()
hum          -0.322911
weathersit   -0.142426
holiday      -0.030927
weekday       0.026900
workingday    0.030284
windspeed     0.093234
mnth          0.120638
season        0.178056
yr            0.250495
instant       0.278379
hr            0.394071
atemp         0.400929
temp          0.404772
casual        0.694564
registered    0.972151
cnt           1.000000
Name: cnt, dtype: float64
The following columns have a correlation coefficient with `cnt` greater than 0.3:

- `hr`
- `atemp`
- `temp`
- `casual`
- `registered`
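The same shortlist can be pulled out programmatically rather than read off by eye; a minimal sketch (`cnt_corr` is just an illustrative name):
cnt_corr = bike_rentals.corr()['cnt'].drop('cnt')
# keep only the predictors whose correlation with `cnt` exceeds 0.3
print(cnt_corr[cnt_corr > 0.3].sort_values())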
## time_label Feature

The `hr` column in `bike_rentals` contains the hour during which the bikes were rented, from 0 to 23. A model will treat each hour as a distinct value, without understanding that certain hours are related. We can introduce some order into the process by creating a new column with labels for morning, afternoon, evening, and night. This will bundle similar times together, enabling the model to make better decisions.
# `assign_label` takes a numeric hour and returns `1`, `2`, `3`, or `4` for
# morning (6-12), afternoon (12-18), evening (18-24), and night (0-6),
# respectively
def assign_label(hr):
    if 0 <= hr < 6:
        return 4  # night
    elif 6 <= hr < 12:
        return 1  # morning
    elif 12 <= hr < 18:
        return 2  # afternoon
    else:
        return 3  # evening (18 to 23)
# apply `assign_label` function to each item in the `hr` column and
# assign the result to `time_label` column of `bike_rentals`
bike_rentals['time_label'] = bike_rentals['hr'].apply(assign_label)
bike_rentals['time_label'].value_counts()
2    4375
3    4368
1    4360
4    4276
Name: time_label, dtype: int64
- The `instant` column has unique sequential IDs for each row, hence we should discard it from the list of predictors.
- `dteday` should also be discarded, since we have other columns describing the year, month, and weekday/workingday details of the rental.
- We should drop the `casual` and `registered` columns, because `cnt` is derived from them.
- `cnt` should be skipped, as it's the target column.

Hence, the following is the list of predictors we can use for training and prediction:
features = ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday',
'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'time_label']
We'll split `bike_rentals` into training and testing sets, with 80% of the rows in the training set and the remaining rows in the testing set.

The mean squared error (MSE) metric makes the most sense for evaluating our error: it works on continuous numeric data, which fits our target well. We'll report its square root, the root mean squared error (RMSE), so the error is in the same units as `cnt`.
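As a quick illustration of the metric itself, computed by hand on toy numbers (not our data):
# RMSE by hand: mean of the squared residuals, then the square root
actual = np.array([10, 20, 30])
predicted = np.array([12, 18, 33])
print(np.sqrt(((actual - predicted) ** 2).mean()))  # sqrt(17/3) ≈ 2.38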
train = bike_rentals.sample(frac=0.8, random_state=1)
print('Shape of train dataframe: ', train.shape)
Shape of train dataframe: (13903, 18)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
print('Shape of test dataframe: ', test.shape)
Shape of test dataframe: (3476, 18)
Since several columns correlate moderately to strongly with `cnt` (as seen earlier), linear regression might work fairly well on this data.
lr = LinearRegression()
lr.fit(train[features], train['cnt'])
train_predictions = lr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = lr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Linear Regression Model')
print('------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Linear Regression Model
------------------------
Train RMSE:  132.6124243916765
Test RMSE:  130.5946379586095
The error is quite high. This is likely because `cnt` is heavily skewed: hours with extremely low or extremely high rental counts produce large residuals, which the squared-error metric penalizes heavily.
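A quick sanity check of that hypothesis is to look at the spread of `cnt` (output not shown):
# a large gap between the median, mean, and max of `cnt` would point
# to the long tail suspected above
print(bike_rentals['cnt'].describe())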
lr_results = ['default', round(train_rmse, 2), round(test_rmse, 2)]
Decision trees can capture the nonlinear relationships in this data that linear regression cannot, so they tend to predict outcomes more reliably here. We'll apply one to our data set.
# default settings
dtr = DecisionTreeRegressor()
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_1 = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_1 = np.sqrt(test_mse)
print('Decision Tree - with default settings')
print('-----------------------------------')
print('Train RMSE: ', train_rmse_1)
print('Test RMSE: ', test_rmse_1)
Decision Tree - with default settings
-----------------------------------
Train RMSE:  0.5693317794626366
Test RMSE:  58.713680838510236
# `min_samples_leaf` parameter set to `5`
dtr = DecisionTreeRegressor(min_samples_leaf=5)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_2 = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_2 = np.sqrt(test_mse)
print('Decision Tree - with min_samples_leaf=5')
print('-----------------------------------------')
print('Train RMSE: ', train_rmse_2)
print('Test RMSE: ', test_rmse_2)
Decision Tree - with min_samples_leaf=5
-----------------------------------------
Train RMSE:  33.41484543668025
Test RMSE:  55.05695050722245
# `min_samples_leaf` parameter set to `2`
dtr = DecisionTreeRegressor(min_samples_leaf=2)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_3 = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_3 = np.sqrt(test_mse)
print('Decision Tree - with min_samples_leaf=2')
print('----------------------------------------')
print('Train RMSE: ', train_rmse_3)
print('Test RMSE: ', test_rmse_3)
Decision Tree - with min_samples_leaf=2
----------------------------------------
Train RMSE:  18.832608268417342
Test RMSE:  56.40828005943384
# using both min_samples_leaf and max_depth parameters
dtr = DecisionTreeRegressor(min_samples_leaf=2, max_depth=2)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_4 = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_4 = np.sqrt(test_mse)
print('Decision Tree - with min_samples_leaf=2, max_depth=2')
print('----------------------------------------')
print('Train RMSE: ', train_rmse_4)
print('Test RMSE: ', test_rmse_4)
Decision Tree - with min_samples_leaf=2, max_depth=2
----------------------------------------
Train RMSE:  139.98027858198864
Test RMSE:  135.73585137094457
Compared to linear regression, the decision tree regressor achieves a much lower error. Note that with default settings the tree overfits badly (train RMSE near zero, test RMSE around 59); the best results are obtained with the `min_samples_leaf` parameter set to 5, which regularizes the tree by requiring at least five samples per leaf.
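Rather than copy-pasting a cell for each setting, the same comparison could be wrapped in a small helper; a minimal sketch (the `evaluate` helper is illustrative, not part of the original workflow):
def evaluate(**params):
    # fit a decision tree with the given parameters and return
    # (train RMSE, test RMSE), mirroring the cells above
    model = DecisionTreeRegressor(**params)
    model.fit(train[features], train['cnt'])
    train_rmse = np.sqrt(mean_squared_error(train['cnt'],
                                            model.predict(train[features])))
    test_rmse = np.sqrt(mean_squared_error(test['cnt'],
                                           model.predict(test[features])))
    return train_rmse, test_rmse

for leaf in [1, 2, 5, 10, 20]:
    print(leaf, evaluate(min_samples_leaf=leaf))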
## saving the result for concluding remarks
dt_results = ['min_samples_leaf=5', round(train_rmse_2, 2), round(test_rmse_2, 2)]
The random forest algorithm improves on the decision tree by averaging the predictions of many trees, which reduces overfitting, and it tends to be much more accurate than simple models like linear regression. We'll apply it to our data set next.
# default settings
rfr = RandomForestRegressor(n_estimators=10)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_1 = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_1 = np.sqrt(test_mse)
print('Random Forests - with default settings')
print('----------------------------------')
print('Train RMSE: ', train_rmse_1)
print('Test RMSE: ', test_rmse_1)
Random Forests - with default settings
----------------------------------
Train RMSE:  18.853830913022996
Test RMSE:  46.3275419770308
# setting `min_samples_leaf` parameter to 5
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=5)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_2 = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_2 = np.sqrt(test_mse)
print('Random Forests - with min_samples_leaf=5')
print('----------------------------------')
print('Train RMSE: ', train_rmse_2)
print('Test RMSE: ', test_rmse_2)
Random Forests - with min_samples_leaf=5
----------------------------------
Train RMSE:  34.8643507298322
Test RMSE:  46.44190434594647
# setting `min_samples_leaf` parameter to 2
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=2)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_3 = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_3 = np.sqrt(test_mse)
print('Random Forests - with min_samples_leaf=2')
print('----------------------------------------')
print('Train RMSE: ', train_rmse_3)
print('Test RMSE: ', test_rmse_3)
Random Forests - with min_samples_leaf=2
----------------------------------------
Train RMSE:  24.4655602063916
Test RMSE:  45.74983905441973
# setting `min_samples_leaf` parameter to 5 and `max_depth` to 3
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=5, max_depth=3)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse_4 = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse_4 = np.sqrt(test_mse)
print('Random Forests - with min_samples_leaf=5, max_depth=3')
print('------------------------------------------------------')
print('Train RMSE: ', train_rmse_4)
print('Test RMSE: ', test_rmse_4)
Random Forests - with min_samples_leaf=5, max_depth=3
------------------------------------------------------
Train RMSE:  127.8808840609669
Test RMSE:  125.55884996971498
The random forest regressor improves over the decision tree regressor, and it yields its lowest error when `min_samples_leaf` is 2. (Constraining `max_depth` to 3, as in the last run, makes the forest underfit badly.)
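Random forests also expose a `feature_importances_` attribute, which indicates how much each predictor contributed to the trees' splits. A short sketch (we refit the best setting first, since `rfr` currently holds the `max_depth=3` model from the last cell):
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=2)
rfr.fit(train[features], train['cnt'])
# rank the predictors by their contribution to the forest's splits
importances = pd.Series(rfr.feature_importances_, index=features)
print(importances.sort_values(ascending=False))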
## saving the result for concluding remarks
rf_results = ['min_samples_leaf=2', round(train_rmse_3, 2), round(test_rmse_3, 2)]
Next we'll experiment with smaller sets of predictors. Recall the columns most correlated with `cnt`: `hr`, `atemp`, and `temp`. In these experiments we'll use `time_label` in place of `hr`, and since `atemp` and `temp` are almost the same we'll use `temp` alone. First, let's see how the purely weather-related columns perform on their own.

features = ['temp', 'hum', 'windspeed']
# default settings
dtr = DecisionTreeRegressor()
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - default settings, predictors=[temp, hum, windspeed]')
print('-------------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - default settings, predictors=[temp, hum, windspeed]
-------------------------------------------------------------------------
Train RMSE:  112.80870625300098
Test RMSE:  184.15448005236038
# `min_samples_leaf` parameter set to `5`
dtr = DecisionTreeRegressor(min_samples_leaf=5)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - with min_samples_leaf=5, predictors=[temp, hum, windspeed]')
print('-------------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - with min_samples_leaf=5, predictors=[temp, hum, windspeed]
-------------------------------------------------------------------------
Train RMSE:  135.13446508959527
Test RMSE:  163.6345940526717
# `min_samples_leaf` parameter set to `2`
dtr = DecisionTreeRegressor(min_samples_leaf=2)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - with predictors=[temp, hum, windspeed], min_samples_leaf=2')
print('-------------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - with predictors=[temp, hum, windspeed], min_samples_leaf=2
-------------------------------------------------------------------------
Train RMSE:  123.4735214080467
Test RMSE:  170.8897244627551
We can observe that the error has increased compared to the earlier feature list, so we should discard this set of predictors. Let's check whether random forests fare any better with it.
# setting `min_samples_leaf` parameter to 5
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=5)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Random Forests - with predictors=[temp, hum, windspeed], min_samples_leaf=5')
print('---------------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Random Forests - with predictors=[temp, hum, windspeed], min_samples_leaf=5
---------------------------------------------------------------------------
Train RMSE:  136.8351931212382
Test RMSE:  153.50974300842228
# setting `min_samples_leaf` parameter to 2
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=2)
rfr.fit(train[features], train['cnt'])
train_predictions = rfr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = rfr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Random Forests - with predictors=[temp, hum, windspeed], min_samples_leaf=2')
print('---------------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Random Forests - with predictors=[temp, hum, windspeed], min_samples_leaf=2
---------------------------------------------------------------------------
Train RMSE:  127.0283035632838
Test RMSE:  160.24482158164565
For random forests, too, the error is higher than the one obtained with the earlier feature list.
Next we'll change the features list to `time_label` and `temp`, and apply the decision tree algorithm with a few settings.
features = ['time_label', 'temp']
dtr = DecisionTreeRegressor()
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - with predictors=[time_label, temp]')
print('--------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - with predictors=[time_label, temp]
--------------------------------------------------
Train RMSE:  134.92603367210947
Test RMSE:  132.19620283052873
dtr = DecisionTreeRegressor(min_samples_leaf=5)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - with predictors=[time_label, temp], min_samples_leaf=5')
print('----------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - with predictors=[time_label, temp], min_samples_leaf=5
----------------------------------------------------------------------
Train RMSE:  134.97363925396965
Test RMSE:  132.13711955120485
dtr = DecisionTreeRegressor(min_samples_leaf=2)
dtr.fit(train[features], train['cnt'])
train_predictions = dtr.predict(train[features])
train_mse = mean_squared_error(train['cnt'], train_predictions)
train_rmse = np.sqrt(train_mse)
test_predictions = dtr.predict(test[features])
test_mse = mean_squared_error(test['cnt'], test_predictions)
test_rmse = np.sqrt(test_mse)
print('Decision Tree - with predictors=[time_label, temp], min_samples_leaf=2')
print('----------------------------------------------------------------------')
print('Train RMSE: ', train_rmse)
print('Test RMSE: ', test_rmse)
Decision Tree - with predictors=[time_label, temp], min_samples_leaf=2
----------------------------------------------------------------------
Train RMSE:  134.9484658080092
Test RMSE:  132.135078408519
The predictors `time_label` and `temp` also diminish the accuracy, so we discard them as well.
The steps above can be summarized as follows.
result = pd.DataFrame(index=['Linear Regression', 'Decision Tree', 'Random Forests'],
columns=['settings', 'Train RMSE', 'Test RMSE'],
data = np.array([lr_results, dt_results, rf_results]))
result
| | settings | Train RMSE | Test RMSE |
|---|---|---|---|
| Linear Regression | default | 132.61 | 130.59 |
| Decision Tree | min_samples_leaf=5 | 33.41 | 55.06 |
| Random Forests | min_samples_leaf=2 | 24.47 | 45.75 |
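As a next step, k-fold cross-validation would give a more robust error estimate than our single train/test split; a minimal sketch using scikit-learn's `cross_val_score` (the fold count and model setting here are illustrative):
from sklearn.model_selection import cross_val_score

# restore the full predictor list (it was narrowed in the experiments above)
features = ['season', 'yr', 'mnth', 'hr', 'holiday', 'weekday', 'workingday',
            'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'time_label']
rfr = RandomForestRegressor(n_estimators=10, min_samples_leaf=2)
# `neg_mean_squared_error` returns negated MSE, so flip the sign
# before taking the square root
scores = cross_val_score(rfr, bike_rentals[features], bike_rentals['cnt'],
                         scoring='neg_mean_squared_error', cv=5)
print(np.sqrt(-scores).mean())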