Bike sharing systems are the new generation of traditional bike rentals, where the whole process from membership to rental and return has become automatic. Through these systems, users can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising more than 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.
Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data they generate make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.
In this project we will try to predict the total number of bikes people rented in a given hour using different machine learning models. The dataset we are going to use was compiled by Hadi Fanaee-T at the University of Porto and can be downloaded from the University of California Irvine's website.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
%matplotlib inline
bike_rentals = pd.read_csv('bike_rental_hour.csv')
bike_rentals.head()
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |
The columns in the dataframe are:

- instant: record index
- dteday: date
- season: season of the year (1 to 4)
- yr: year (0: 2011, 1: 2012)
- mnth: month (1 to 12)
- hr: hour of the day (0 to 23)
- holiday: whether or not the day was a holiday
- weekday: day of the week
- workingday: whether the day was a working day (neither a weekend nor a holiday)
- weathersit: weather situation, from 1 (clear) to 4 (heavy rain or snow)
- temp: normalized temperature
- atemp: normalized feeling ("apparent") temperature
- hum: normalized humidity
- windspeed: normalized wind speed
- casual: number of rentals by casual (non-registered) users
- registered: number of rentals by registered users
- cnt: total number of rentals (casual plus registered)
plt.hist(bike_rentals['cnt'])
plt.title('Distribution of the total number of bike rentals')
plt.xlabel('Total number of bike rentals')
plt.ylabel('Frequency')
plt.show()
The distribution is right-skewed; we also see that, out of the 17,379 rows of data, almost 7,000 have between 0 and 100 bike rentals.
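We can sanity-check that claim directly (the exact bin edges depend on matplotlib's defaults, so cnt <= 100 is only an approximation of the first bar):

# Count the rows that fall roughly into the first histogram bin
print("Total rows:", len(bike_rentals))
print("Rows with cnt <= 100:", (bike_rentals['cnt'] <= 100).sum())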
cnt_corr = bike_rentals.corr()['cnt']  # correlation of every numeric column with cnt
top_corr_features = cnt_corr.index     # all numeric columns (corr() skips dteday)
plt.figure(figsize=(20, 20))
sns.heatmap(bike_rentals[top_corr_features].corr(), annot=True, cmap='RdYlGn')
plt.show()
bike_rentals.corr()["cnt"].reset_index().sort_values(by='cnt')
| | index | cnt |
| --- | --- | --- |
| 11 | hum | -0.322911 |
| 8 | weathersit | -0.142426 |
| 5 | holiday | -0.030927 |
| 6 | weekday | 0.026900 |
| 7 | workingday | 0.030284 |
| 12 | windspeed | 0.093234 |
| 3 | mnth | 0.120638 |
| 1 | season | 0.178056 |
| 2 | yr | 0.250495 |
| 0 | instant | 0.278379 |
| 4 | hr | 0.394071 |
| 10 | atemp | 0.400929 |
| 9 | temp | 0.404772 |
| 13 | casual | 0.694564 |
| 14 | registered | 0.972151 |
| 15 | cnt | 1.000000 |
We see that the columns hr, atemp, temp, casual, and registered have strong correlations with the cnt column, while weathersit, hum, and holiday are negatively correlated with it.
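If we wanted to select features programmatically instead of by eye, we could keep the columns whose absolute correlation with cnt exceeds some threshold (a sketch; the 0.3 cutoff is an arbitrary choice of mine):

# Columns strongly correlated with cnt, excluding the target and its two components
strong = cnt_corr[cnt_corr.abs() > 0.3].drop(['casual', 'registered', 'cnt'])
print(strong)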
bike_rentals.isnull().sum()
instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64
The dataset is already clean and there are no NaN values. However, if we feed the data as-is into a model, the error will be high, because the model won't correctly interpret certain columns, for example hr. This column holds values from 0 to 23 representing the hour of the day at which a bike was rented; a model reading these raw numbers will not understand that certain hours are related to each other. So the best thing is to create a label based on ranges like this:
def assign_label(hour):
    # Bucket the hour of the day (0-23) into four time-of-day labels
    if 6 <= hour < 12:
        return 1   # morning
    elif 12 <= hour < 18:
        return 2   # afternoon
    elif 18 <= hour < 24:
        return 3   # evening
    else:
        return 4   # night (0-5)
bike_rentals['time_label'] = bike_rentals['hr'].apply(assign_label)
bike_rentals.head()
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | time_label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 | 4 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 | 4 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 | 4 |
| 3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 | 4 |
| 4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 | 4 |
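The same binning could also be written with pd.cut (an equivalent sketch, not the approach used above; the labels are ordered to match assign_label's scheme):

# Bins are (-1, 5], (5, 11], (11, 17], (17, 23] -> night, morning, afternoon, evening
alt_label = pd.cut(bike_rentals['hr'], bins=[-1, 5, 11, 17, 23], labels=[4, 1, 2, 3]).astype(int)
print((alt_label == bike_rentals['time_label']).all())  # True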
Now that the data is cleaned and ready to use, we will select the metrics we need to analyze the performance of the machine learning algorithms. I have read about this, and according to a post on Towards Data Science the ideal thing to do is to use R2, MSE, and RMSE. I'm going to report all three of them, but I'm only going to use RMSE in the analysis.
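Since every model below is evaluated in exactly the same way, a small helper like this could keep the code DRY (a sketch of my own, not something the cells below depend on; RMSE is just the square root of MSE, and R2 equals 1 minus the MSE divided by the variance of the target):

def report_metrics(model, X_train, y_train, X_test, y_test):
    # Print MSE, RMSE and R2 on the train and test sets of an already fitted model
    for name, X, y in [("TRAIN", X_train, y_train), ("TEST", X_test, y_test)]:
        pred = model.predict(X)
        mse = mean_squared_error(y, pred)
        print(name + " ERRORS:")
        print("MSE:", mse)
        print("RMSE:", np.sqrt(mse))
        print("R2 SCORE:", r2_score(y, pred))
        print("-" * 20)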
Now that we have the metrics, let's make our train and test samples using the sample method.
train = bike_rentals.sample(frac=0.8)
test = bike_rentals[~bike_rentals.index.isin(train.index)]
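As an aside, scikit-learn's train_test_split does the same thing (a sketch; passing a fixed random_state would also make the numbers below reproducible, which the sample call above does not):

from sklearn.model_selection import train_test_split
train, test = train_test_split(bike_rentals, train_size=0.8, random_state=1)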
A linear regression model might work fairly well on this data because many of the columns are highly correlated with the cnt column. We are going to use all the columns except casual and registered, because cnt is simply the sum of those two.
features = bike_rentals.columns
features = features.drop(['dteday', 'casual', 'registered', 'cnt'])
target = 'cnt'
lr = LinearRegression()
lr.fit(train[features], train[target])
train_pred = lr.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
train_r2score = r2_score(train[target], train_pred)
test_pred = lr.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
test_r2score = r2_score(test[target], test_pred)
print("TRAIN ERRORS:")
print("MSE:", train_mse)
print("RMSE:", train_rmse)
print("R2 SCORE:", train_r2score)
print("-"*20)
print("TEST ERRORS:")
print("MSE:", test_mse)
print("RMSE:", test_rmse)
print("R2 SCORE:", test_r2score)
TRAIN ERRORS:
MSE: 17446.86070303239
RMSE: 132.08656518750266
R2 SCORE: 0.4665272762764584
--------------------
TEST ERRORS:
MSE: 17606.931959062764
RMSE: 132.6911148459563
R2 SCORE: 0.47720411497526505
The train and test errors are pretty close to each other: the test RMSE is only about 0.6 points higher than the train RMSE. So even though the model isn't overfit, it doesn't predict as well as we had hoped. Let's keep only the columns whose correlation with cnt is around 0.4 or higher, plus the negatively correlated ones.
feat = ['hum', 'weathersit', 'holiday', 'hr', 'atemp', 'temp']
lr = LinearRegression()
lr.fit(train[feat], train[target])
train_pred = lr.predict(train[feat])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
train_r2score = r2_score(train[target], train_pred)
test_pred = lr.predict(test[feat])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
test_r2score = r2_score(test[target], test_pred)
print("TRAIN ERRORS:")
print("MSE:", train_mse)
print("RMSE:", train_rmse)
print("R2 SCORE:", train_r2score)
print("-"*20)
print("TEST ERRORS:")
print("MSE:", test_mse)
print("RMSE:", test_rmse)
print("R2 SCORE:", test_r2score)
TRAIN ERRORS:
MSE: 22176.56176518386
RMSE: 148.91796992030163
R2 SCORE: 0.32190718954730113
--------------------
TEST ERRORS:
MSE: 22032.18177280162
RMSE: 148.43241483180694
R2 SCORE: 0.34580686767470237
The error increased. The model isn't overfit or underfit; it simply does not generalize well, which means that 0.47 is probably the best R2 score we can get with linear regression. We can also say that linear regression isn't the most adequate model for this dataset: rentals depend on the hour of the day in a strongly non-linear way (they peak around the morning and evening commutes), which a linear model cannot capture.
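We can see that non-linearity directly with a quick illustrative plot (not part of the original analysis):

# Mean rentals per hour of the day: the two commute peaks are something
# no straight line in hr (or in the coarse time_label) can fit
bike_rentals.groupby('hr')['cnt'].mean().plot(kind='bar')
plt.title('Mean rentals by hour of the day')
plt.show()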
Now let's use a Decision Tree Regressor to see how well this model works with this data.
clf = DecisionTreeRegressor(random_state=1)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
train_r2score = r2_score(train[target], train_pred)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
test_r2score = r2_score(test[target], test_pred)
print("TRAIN ERRORS:")
print("MSE:", train_mse)
print("RMSE:", train_rmse)
print("R2 SCORE:", train_r2score)
print("-"*20)
print("TEST ERRORS:")
print("MSE:", test_mse)
print("RMSE:", test_rmse)
print("R2 SCORE:", test_r2score)
TRAIN ERRORS:
MSE: 0.0
RMSE: 0.0
R2 SCORE: 1.0
--------------------
TEST ERRORS:
MSE: 3094.752589182969
RMSE: 55.63050052968218
R2 SCORE: 0.9081086970429444
The zero error on the train portion tells us that the model is overfit; by the way, this is a good example of why it's useful to look at all three metrics instead of only one. Let's experiment with some parameters of the Decision Tree Regressor to see if we can improve its accuracy.
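One way to see how far the unconstrained tree grew is to inspect its fitted tree_ attribute (an illustrative check, not in the original notebook):

# A very deep tree with thousands of nodes has essentially memorized the training set
print("Tree depth:", clf.tree_.max_depth)
print("Number of nodes:", clf.tree_.node_count)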
min_samples_leaf defaults to 1; let's make a for loop and try values from 3 to 10.
samples = list(range(3, 11))
t_mse = []
t_rmse = []
t_r2score = []
tt_mse = []
tt_rmse = []
tt_r2score = []
for i in samples:
clf = DecisionTreeRegressor(random_state=1, min_samples_leaf=i)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
train_r2score = r2_score(train[target], train_pred)
t_r2score.append(train_r2score)
t_rmse.append(train_rmse)
t_mse.append(train_mse)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
test_r2score = r2_score(test[target], test_pred)
tt_r2score.append(test_r2score)
tt_rmse.append(test_rmse)
tt_mse.append(test_mse)
print("Train Mean MSE:", np.mean(t_mse))
print("Test Mean MSE:", np.mean(tt_mse))
print('-'*20)
print("Train Mean RMSE:", np.mean(t_rmse))
print("Test Mean RMSE:", np.mean(tt_rmse))
print('-'*20)
print("Train Mean R2SCORE:", np.mean(t_r2score))
print("Test Mean R2SCORE:", np.mean(tt_r2score))
plt.plot(samples, t_rmse, color="red")
plt.plot(samples, tt_rmse, color="blue")
plt.title("Training Vs Test RMSE")
plt.xlabel("Samples")
plt.ylabel("Error")
plt.legend(["Training", "Test"], loc=0)
Train Mean MSE: 1201.671289332573
Test Mean MSE: 2555.735141820479
--------------------
Train Mean RMSE: 34.27076966935901
Test Mean RMSE: 50.54968435286302
--------------------
Train Mean R2SCORE: 0.9632564925775324
Test Mean R2SCORE: 0.9241135355970345
The model looks overfit. The R2 score difference is small (about 0.039 points), but the RMSE values and the graph show a sizeable gap between train and test.
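Instead of eyeballing the plot, we can also pick the leaf size with the lowest test RMSE (a quick check of my own):

# Index of the smallest test RMSE among the values we tried
best = int(np.argmin(tt_rmse))
print("Best min_samples_leaf:", samples[best], "with test RMSE:", tt_rmse[best])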
min_samples_split defaults to 2; let's make a for loop and try values from 3 to 10.
samples = list(range(3, 11))
t_mse = []
t_rmse = []
t_r2score = []
tt_mse = []
tt_rmse = []
tt_r2score = []
for i in samples:
clf = DecisionTreeRegressor(random_state=1, min_samples_split=i)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
train_rmse = np.sqrt(train_mse)
train_r2score = r2_score(train[target], train_pred)
t_r2score.append(train_r2score)
t_rmse.append(train_rmse)
t_mse.append(train_mse)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
test_rmse = np.sqrt(test_mse)
test_r2score = r2_score(test[target], test_pred)
tt_r2score.append(test_r2score)
tt_rmse.append(test_rmse)
tt_mse.append(test_mse)
print("Train Mean MSE:", np.mean(t_mse))
print("Test Mean MSE:", np.mean(tt_mse))
print('-'*20)
print("Train Mean RMSE:", np.mean(t_rmse))
print("Test Mean RMSE:", np.mean(tt_rmse))
print('-'*20)
print("Train Mean R2SCORE:", np.mean(t_r2score))
print("Test Mean R2SCORE:", np.mean(tt_r2score))
Train Mean MSE: 382.93792288106715
Test Mean MSE: 2739.235787580488
--------------------
Train Mean RMSE: 18.723942266392868
Test Mean RMSE: 52.316556926686204
--------------------
Train Mean R2SCORE: 0.9882909057271896
Test Mean R2SCORE: 0.9186649212259579
plt.plot(samples, t_rmse, color="red")
plt.plot(samples, tt_rmse, color="blue")
plt.title("Training Vs Test RMSE")
plt.xlabel("Samples")
plt.ylabel("Error")
plt.legend(["Training", "Test"], loc=0)
The RMSE values tell us that the model is overfit, even though the R2 scores are pretty similar. Looking at the graph, the training error increases with each split-size value while the test error decreases in small steps. We know that single decision trees tend to overfit, so let's build a Random Forest Regressor, which averages many trees trained on bootstrapped samples with random feature subsets and should therefore generalize better.
Let's train a Random Forest Regressor with the number of estimators ranging from 10 to 109 and check the errors.
estimators = list(range(10, 110))
t_mse = []
t_rmse = []
t_r2score = []
tt_mse = []
tt_rmse = []
tt_r2score = []
for i in estimators:
clf = RandomForestRegressor(n_estimators=i, random_state=1)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
t_mse.append(train_mse)
train_rmse = np.sqrt(train_mse)
t_rmse.append(train_rmse)
train_r2score = r2_score(train[target], train_pred)
t_r2score.append(train_r2score)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
tt_mse.append(test_mse)
test_rmse = np.sqrt(test_mse)
tt_rmse.append(test_rmse)
test_r2score = r2_score(test[target], test_pred)
tt_r2score.append(test_r2score)
print("Train Mean MSE:", np.mean(t_mse))
print("Test Mean MSE:", np.mean(tt_mse))
print('-'*20)
print("Train Mean RMSE:", np.mean(t_rmse))
print("Test Mean RMSE:", np.mean(tt_rmse))
print('-'*20)
print("Train Mean R2SCORE:", np.mean(t_r2score))
print("Test Mean R2SCORE:", np.mean(tt_r2score))
Train Mean MSE: 255.0067634042666
Test Mean MSE: 1622.397616316688
--------------------
Train Mean RMSE: 15.954551616553571
Test Mean RMSE: 40.27583985198987
--------------------
Train Mean R2SCORE: 0.9922026572598499
Test Mean R2SCORE: 0.9518267691579441
plt.plot(estimators, t_rmse, color="red")
plt.plot(estimators, tt_rmse, color="blue")
plt.title("Training Vs Test RMSE")
plt.xlabel("Estimators")
plt.ylabel("Error")
plt.legend(["Training", "Test"], loc=0)
The model looks overfit even though both errors tend to decrease as the number of estimators grows. However, the R2 scores are pretty close, with only about a 0.04-point difference. Let's try the same parameters we used with the Decision Tree Regressor.
min_samples_leaf defaults to 1; let's make a for loop and try values from 2 to 10, this time using 100 estimators.
leafs = list(range(2, 11))
t_mse = []
t_rmse = []
t_r2score = []
tt_mse = []
tt_rmse = []
tt_r2score = []
for i in leafs:
clf = RandomForestRegressor(n_estimators=100, min_samples_leaf=i, random_state=1)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
t_mse.append(train_mse)
train_rmse = np.sqrt(train_mse)
t_rmse.append(train_rmse)
train_r2score = r2_score(train[target], train_pred)
t_r2score.append(train_r2score)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
tt_mse.append(test_mse)
test_rmse = np.sqrt(test_mse)
tt_rmse.append(test_rmse)
test_r2score = r2_score(test[target], test_pred)
tt_r2score.append(test_r2score)
print("Train Mean MSE:", np.mean(t_mse))
print("Test Mean MSE:", np.mean(tt_mse))
print('-'*20)
print("Train Mean RMSE:", np.mean(t_rmse))
print("Test Mean RMSE:", np.mean(tt_rmse))
print('-'*20)
print("Train Mean R2SCORE:", np.mean(t_r2score))
print("Test Mean R2SCORE:", np.mean(tt_r2score))
Train Mean MSE: 1061.962955615979
Test Mean MSE: 1860.3477872116387
--------------------
Train Mean RMSE: 32.1050735173245
Test Mean RMSE: 43.09244600658648
--------------------
Train Mean R2SCORE: 0.9675283548101297
Test Mean R2SCORE: 0.9447614058979481
plt.plot(leafs, t_rmse, color="red")
plt.plot(leafs, tt_rmse, color="blue")
plt.title("Training Vs Test RMSE, 100 estimators")
plt.xlabel("Samples")
plt.ylabel("Error")
plt.legend(["Training", "Test"], loc=0)
min_samples_split defaults to 2; let's make a for loop and try values from 3 to 10.
splits = list(range(3, 11))
t_mse = []
t_rmse = []
t_r2score = []
tt_mse = []
tt_rmse = []
tt_r2score = []
for i in splits:
clf = RandomForestRegressor(n_estimators=100, min_samples_split=i, random_state=1)
clf.fit(train[features], train[target])
train_pred = clf.predict(train[features])
train_mse = mean_squared_error(train[target], train_pred)
t_mse.append(train_mse)
train_rmse = np.sqrt(train_mse)
t_rmse.append(train_rmse)
train_r2score = r2_score(train[target], train_pred)
t_r2score.append(train_r2score)
test_pred = clf.predict(test[features])
test_mse = mean_squared_error(test[target], test_pred)
tt_mse.append(test_mse)
test_rmse = np.sqrt(test_mse)
tt_rmse.append(test_rmse)
test_r2score = r2_score(test[target], test_pred)
tt_r2score.append(test_r2score)
print("Train Mean MSE:", np.mean(t_mse))
print("Test Mean MSE:", np.mean(tt_mse))
print('-'*20)
print("Train Mean RMSE:", np.mean(t_rmse))
print("Test Mean RMSE:", np.mean(tt_rmse))
print('-'*20)
print("Train Mean R2SCORE:", np.mean(t_r2score))
print("Test Mean R2SCORE:", np.mean(tt_r2score))
Train Mean MSE: 490.63672792743193
Test Mean MSE: 1629.5071542052597
--------------------
Train Mean RMSE: 21.919418564546255
Test Mean RMSE: 40.36423374795747
--------------------
Train Mean R2SCORE: 0.9849977989701746
Test Mean R2SCORE: 0.9516156683732522
plt.plot(splits, t_rmse, color="red")
plt.plot(splits, tt_rmse, color="blue")
plt.title("Training Vs Test RMSE")
plt.xlabel("Samples")
plt.ylabel("Error")
plt.legend(["Training", "Test"], loc=0)
test_rmse_errors = [132.6911148459563, 148.43241483180694, 55.63050052968218,
                    50.54968435286302, 52.316556926686204, 40.27583985198987,
                    43.09244600658648, 40.36423374795747]
models = ['Linear Regression', 'Linear Regression with Feature selection',
          'Decision Tree Regressor', 'Decision Tree Regressor (min_samples_leaf)',
          'Decision Tree Regressor (min_samples_split)', 'Random Forest Regressor',
          'Random Forest Regressor (min_samples_leaf)',
          'Random Forest Regressor (min_samples_split)']
sns.barplot(x=test_rmse_errors, y=models, orient='h')
plt.title("Test RMSE Errors")
plt.show()
It's clear that the Random Forest Regressors have the lowest RMSE values: the gap between the best Decision Tree (about 50.5) and the best Random Forest (about 40.3) is roughly 10 units.
In the Decision Trees, experimenting with the parameters helped to reduce the RMSE; in the Random Forest, however, it caused a slight increase. The feature selection in linear regression did not help at all.
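Out of curiosity, we can also ask the forest which features it actually relied on (a sketch; clf here is whichever random forest the last loop fitted):

# Feature importances of the most recently fitted random forest
importances = pd.Series(clf.feature_importances_, index=features)
print(importances.sort_values(ascending=False))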
Some of the models looked overfit if we judge by the MSE and RMSE, and for most of them that gap was hard to reduce. At first it seems strange that the R2 scores did not show the same signs of overfitting, but the two metrics are consistent: R2 is computed from the same squared error, just normalized by the variance of cnt, and since cnt varies a lot, a noticeable RMSE gap translates into only a small R2 gap.
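A quick back-of-the-envelope check of that last point (illustrative arithmetic; r2_score actually uses the variance of each evaluation set, so these are only approximations):

# Convert the base random forest's mean RMSEs back into approximate R2 values
var_cnt = bike_rentals['cnt'].var()
for rmse in (15.95, 40.28):  # train and test mean RMSE from the n_estimators loop
    print(rmse, "->", round(1 - rmse ** 2 / var_cnt, 4))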