Predicting Bike Rentals
Many U.S. cities have communal bike sharing stations where you can rent bicycles by the hour or day. Washington, D.C. is one of these cities. The District collects detailed data on the number of bicycles people rent by the hour and day.
Hadi Fanaee-T at the University of Porto compiled this data into a CSV file. The file contains 17380 rows, with each row representing the number of bike rentals for a single hour of a single day. You can download the data here from the University of California, Irvine's website.
In this project, the goal is to develop a predictor model for the total number of bikes people rented in a given hour.
I will generate the following machine learning models and evaluate their performance in terms of magnitude of prediction 'error' (lower is better):
# import key python libary modules
# to execute various code commands.
import pandas as pd
import numpy as np
import math
import random
import string
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import seaborn as sns
from numpy.random import seed, randint
from IPython.display import HTML
from IPython.display import display, Markdown
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import warnings
warnings.filterwarnings('ignore')
# read data file provided and observe data structure and contents.
bike_rentals = pd.read_csv('bike_rental_hour.csv', na_values=['NaN'])
print(bike_rentals.info(), '\n')
display(bike_rentals.head())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 17379 entries, 0 to 17378 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 instant 17379 non-null int64 1 dteday 17379 non-null object 2 season 17379 non-null int64 3 yr 17379 non-null int64 4 mnth 17379 non-null int64 5 hr 17379 non-null int64 6 holiday 17379 non-null int64 7 weekday 17379 non-null int64 8 workingday 17379 non-null int64 9 weathersit 17379 non-null int64 10 temp 17379 non-null float64 11 atemp 17379 non-null float64 12 hum 17379 non-null float64 13 windspeed 17379 non-null float64 14 casual 17379 non-null int64 15 registered 17379 non-null int64 16 cnt 17379 non-null int64 dtypes: float64(4), int64(12), object(1) memory usage: 2.3+ MB None
instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |
Here are the descriptions for the relevant columns:
The 'target' column (dependent variable) is labeled 'cnt'. There are 14 other variables in the data file (described above) to consider as 'independent variables'. Columns labeled 'instant, 'casual', and 'registered' will not be included.
Start with a visual representation for distribution of bike rental count per hour by means of histogram.
# generate histogram to show distribution of bike rental count per hour.
ax = bike_rentals.hist(column='cnt', bins=14, grid=False,\
figsize=(12,8), color='#86bf91', zorder=2, rwidth=0.9)
ax = ax[0]
for x in ax:
# Despine
x.spines['right'].set_visible(False)
x.spines['top'].set_visible(False)
x.spines['left'].set_visible(False)
# Switch off ticks
x.tick_params(axis='both', which='both', bottom='off', top='off',\
labelbottom='on', left='off', right='off', labelleft='on')
# Draw horizontal axis lines
vals = x.get_yticks()
for tick in vals:
x.axhline(y=tick, linestyle='dashed', color='#eeeeee', zorder=1)
# Remove title
x.set_title('Distribution of Bike Rentals Per Hour', pad=20, weight='bold', fontsize=20)
# Set x-axis label
x.set_xlabel('Bike Rentals Per Hour', labelpad=15, size=15)
plt.xlim(xmin=0.0, xmax = 1150.0)
plt.xticks(np.arange(30, 1125, 70), fontsize=12)
# Set y-axis label
x.set_ylabel('Frequency of Occurrence', labelpad=15, size=15)
plt.yticks(fontsize=12)
# Format y-axis label
x.yaxis.set_major_formatter(StrMethodFormatter('{x:,g}'))
# summarize distribution.
min_rentals = bike_rentals['cnt'].min()
max_rentals = bike_rentals['cnt'].max()
print('Minimum rentals per hour =', min_rentals, '\n')
print('Maximum rentals per hour =', max_rentals, '\n')
count = 0
for n in bike_rentals['cnt']:
if n < 143:
count = count + 1
else:
None
percent = (count*100)/len(bike_rentals['cnt'])
print('{:.1f}''%'.format(percent), 'of the bike rentals per hour are less than 143.')
Minimum rentals per hour = 1 Maximum rentals per hour = 977 50.1% of the bike rentals per hour are less than 143.
The shape of the histogram above is basically exponential.
About half (50%) of the total bike rentals (17379) were less than 143 rentals per hour.
# convert hours to time intervals of 6 hours.
def assign_label(hour):
if hour >=0 and hour < 6:
return 4
elif hour >=6 and hour < 12:
return 1
elif hour >= 12 and hour < 18:
return 2
elif hour >= 18 and hour <=24:
return 3
bike_rentals['time_label'] = bike_rentals['hr'].apply(assign_label)
print(bike_rentals['time_label'].value_counts(), '\n')
print(bike_rentals.info())
2 4375 3 4368 1 4360 4 4276 Name: time_label, dtype: int64 <class 'pandas.core.frame.DataFrame'> RangeIndex: 17379 entries, 0 to 17378 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 instant 17379 non-null int64 1 dteday 17379 non-null object 2 season 17379 non-null int64 3 yr 17379 non-null int64 4 mnth 17379 non-null int64 5 hr 17379 non-null int64 6 holiday 17379 non-null int64 7 weekday 17379 non-null int64 8 workingday 17379 non-null int64 9 weathersit 17379 non-null int64 10 temp 17379 non-null float64 11 atemp 17379 non-null float64 12 hum 17379 non-null float64 13 windspeed 17379 non-null float64 14 casual 17379 non-null int64 15 registered 17379 non-null int64 16 cnt 17379 non-null int64 17 time_label 17379 non-null int64 dtypes: float64(4), int64(13), object(1) memory usage: 2.4+ MB None
Calculate absolute correlation values between independent variables and sort them in order of strength of relationship from lowest to highest.
bike_rentals_subset = bike_rentals.drop(['instant', 'dteday'], axis=1)
corrs = abs(bike_rentals_subset.corr()['cnt'])
sorted_corrs = corrs.sort_values(ascending=True)
print(sorted_corrs)
weekday 0.026900 workingday 0.030284 holiday 0.030927 windspeed 0.093234 mnth 0.120638 weathersit 0.142426 season 0.178056 yr 0.250495 hum 0.322911 time_label 0.378318 hr 0.394071 atemp 0.400929 temp 0.404772 casual 0.694564 registered 0.972151 cnt 1.000000 Name: cnt, dtype: float64
Plot a correlation matrix heatmap to help weed out any variables that have high correlation with other variables in the data set.
# plot seaborn heatmap.
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12,10))
corrmat = bike_rentals[sorted_corrs.index]
corr2_df = corrmat.corr(method ='pearson')
ax = sns.heatmap(corr2_df)
The diagonal of white squares reflect the correlations of the variables with themselves (i.e. perfect correlation of 1.0)
The scale on the right side shows as correlation increases towards 1.0 (i.e. perfect relationship), color becomes lighter. Based on the correlation matrix heatmap (i.e. lighter color shade comparisons), we can tell that the following pair of columns are strongly correlated:
It certainly makes sense that 'temp' and 'atemp' are highly correlated. The variable 'atemp' is defined as adjusted temperature. I'm going to guess that it relates to adjusted temperature based on humidity. On that basis I will remove 'atemp' as a variable since we already have humidity ('hum') available.
There is also fairly high correlation between 'mnth' and 'season' as well. I'll remove 'season' as 'mnth' has finer increments (a little closer to continuous data than 'season').
# drop irrelevant variables.
final_corr_cols = sorted_corrs.drop(['season', 'atemp', 'casual', 'registered'])
print(final_corr_cols, '\n')
features11 = final_corr_cols.drop(['cnt']).index
display(Markdown('<h3><span style="color:blue"> Eleven Bike Rental Features </span></h3>'))
print(features11, '\n')
weekday 0.026900 workingday 0.030284 holiday 0.030927 windspeed 0.093234 mnth 0.120638 weathersit 0.142426 yr 0.250495 hum 0.322911 time_label 0.378318 hr 0.394071 temp 0.404772 cnt 1.000000 Name: cnt, dtype: float64
Index(['weekday', 'workingday', 'holiday', 'windspeed', 'mnth', 'weathersit', 'yr', 'hum', 'time_label', 'hr', 'temp'], dtype='object')
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. However, we have more than one independent variable here, so this should be classified as Multiple Linear Regression.
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Multiple regression is an extension of linear (OLS) regression that uses just one explanatory variable.
I will begin with multiple linear regression and will choose 'Mean Square Error' (MSE) as my error metric.
# split DataFrame file into 'Train' and 'Test' DataFrames.
# include randomize shuffling to prevent confounding of results.
# random state is a seed value
train = bike_rentals.sample(frac=0.8, random_state=1)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
# confirm correct split.
print('Train Data', train.shape, '\n')
print('Test Data', test.shape, '\n')
target = 'cnt'
clean_test = test[final_corr_cols.index].dropna()
# use sklearn linear model to calculate magnitude of errors (MSE).
lr = LinearRegression()
lr.fit(train[features11], train['cnt'])
train_predictions = lr.predict(train[features11])
test_predictions = lr.predict(clean_test[features11])
train_mse = mean_squared_error(train_predictions, train[target])
test_mse = mean_squared_error(test_predictions, clean_test[target])
print('Train Data Mean Square Error =', '{:.0f}'.format(train_mse), '\n')
print('Test Data Mean Square Error =', '{:.0f}'.format(test_mse))
Train Data (13903, 18) Test Data (3476, 18) Train Data Mean Square Error = 17764 Test Data Mean Square Error = 17236
The magnitude of 'Test Data' error (MSE), 17236, is very high relative to the range of bike rentals per hour: 1 to 977.
Mean Square Error by itself doesn't tell me a whole lot.
I prefer to determine R-Squared values. R-Squared value tells me the percentage of the total 'target' (bike rentals per hour) variation is explained by the independent variables in the model. I will expound on this further after R-Squared calculations.
# 'X' relates to indicators and 'Y' relates to target 'cnt'.
X = train[features11]
Y = train['cnt']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
# with statsmodels
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
display(Markdown('<h3><span style="color:blue"> {} Features </span></h3>'.format(11)))
print(features11, '\n')
print_model = model.summary()
print(print_model)
Index(['weekday', 'workingday', 'holiday', 'windspeed', 'mnth', 'weathersit', 'yr', 'hum', 'time_label', 'hr', 'temp'], dtype='object') OLS Regression Results ============================================================================== Dep. Variable: cnt R-squared: 0.465 Model: OLS Adj. R-squared: 0.464 Method: Least Squares F-statistic: 1097. Date: Tue, 27 Jul 2021 Prob (F-statistic): 0.00 Time: 21:16:01 Log-Likelihood: -87747. No. Observations: 13903 AIC: 1.755e+05 Df Residuals: 13891 BIC: 1.756e+05 Df Model: 11 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 119.6335 7.678 15.581 0.000 104.583 134.684 weekday 2.1554 0.567 3.799 0.000 1.043 3.268 workingday 4.0223 2.517 1.598 0.110 -0.912 8.957 holiday -25.6396 7.140 -3.591 0.000 -39.635 -11.644 windspeed 11.1141 9.892 1.124 0.261 -8.275 30.504 mnth 4.8514 0.344 14.103 0.000 4.177 5.526 weathersit -13.2459 2.004 -6.610 0.000 -17.174 -9.318 yr 84.4071 2.274 37.116 0.000 79.950 88.865 hum -154.1554 7.274 -21.192 0.000 -168.414 -139.897 time_label -47.7509 1.047 -45.629 0.000 -49.802 -45.700 hr 6.6070 0.175 37.781 0.000 6.264 6.950 temp 288.5194 6.125 47.108 0.000 276.514 300.525 ============================================================================== Omnibus: 2732.302 Durbin-Watson: 2.030 Prob(Omnibus): 0.000 Jarque-Bera (JB): 5460.778 Skew: 1.182 Prob(JB): 0.00 Kurtosis: 4.959 Cond. No. 153. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The R-Squared and Adjusted R-Squared values above are very close: 0.465 and 0.464 respectively.
The total explained variability of bike rentals per hour offered by the independent variables is only about 46%. That means that there is 54% unexplained variability in the model. This shows me why the magnitude of error from Multiple Linear Regression is high.
In summary, there are other variables that were not included in the data set that would account for the 54% unexplained variability.
Other such variables not included in the data set that could be significant in a predictor model are:
The regression table above shows two variables as having low strength of relationship regarding P>|t| values:
I will remove these and recalculate R-Squared.
features9 = ['weekday', 'holiday', 'mnth', 'weathersit',
'yr', 'hum', 'time_label', 'hr', 'temp']
X = train[features9]
Y = train['cnt']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
# with statsmodels
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
display(Markdown('<h3><span style="color:blue"> {} Features </span></h3>'.format(9)))
print(features9, '\n')
print_model = model.summary()
print(print_model)
['weekday', 'holiday', 'mnth', 'weathersit', 'yr', 'hum', 'time_label', 'hr', 'temp'] OLS Regression Results ============================================================================== Dep. Variable: cnt R-squared: 0.465 Model: OLS Adj. R-squared: 0.464 Method: Least Squares F-statistic: 1340. Date: Tue, 27 Jul 2021 Prob (F-statistic): 0.00 Time: 21:16:01 Log-Likelihood: -87749. No. Observations: 13903 AIC: 1.755e+05 Df Residuals: 13893 BIC: 1.756e+05 Df Model: 9 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 125.4344 6.890 18.204 0.000 111.928 138.941 weekday 2.1692 0.567 3.823 0.000 1.057 3.281 holiday -28.3768 6.924 -4.099 0.000 -41.948 -14.805 mnth 4.8153 0.343 14.037 0.000 4.143 5.488 weathersit -12.7482 1.978 -6.446 0.000 -16.625 -8.872 yr 84.3142 2.273 37.094 0.000 79.859 88.770 hum -156.4968 6.978 -22.429 0.000 -170.174 -142.820 time_label -47.8100 1.045 -45.734 0.000 -49.859 -45.761 hr 6.6156 0.175 37.868 0.000 6.273 6.958 temp 288.8919 6.116 47.236 0.000 276.904 300.880 ============================================================================== Omnibus: 2753.668 Durbin-Watson: 2.030 Prob(Omnibus): 0.000 Jarque-Bera (JB): 5552.804 Skew: 1.186 Prob(JB): 0.00 Kurtosis: 4.989 Cond. No. 117. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
After removing the two insignificant variables, there was no change in R Squared values.
X = test[features9]
Y = test['cnt']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
# with statsmodels
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
display(Markdown('<h3><span style="color:blue"> {} Features </span></h3>'.format(9)))
print(features9, '\n')
print_model = model.summary()
print(print_model)
['weekday', 'holiday', 'mnth', 'weathersit', 'yr', 'hum', 'time_label', 'hr', 'temp'] OLS Regression Results ============================================================================== Dep. Variable: cnt R-squared: 0.458 Model: OLS Adj. R-squared: 0.457 Method: Least Squares F-statistic: 325.5 Date: Tue, 27 Jul 2021 Prob (F-statistic): 0.00 Time: 21:16:01 Log-Likelihood: -21881. No. Observations: 3476 AIC: 4.378e+04 Df Residuals: 3466 BIC: 4.384e+04 Df Model: 9 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 127.2848 13.101 9.715 0.000 101.598 152.972 weekday 0.4529 1.120 0.405 0.686 -1.742 2.648 holiday -29.6735 12.613 -2.353 0.019 -54.404 -4.943 mnth 5.0590 0.670 7.555 0.000 3.746 6.372 weathersit -9.5172 3.940 -2.415 0.016 -17.243 -1.792 yr 73.7933 4.490 16.436 0.000 64.990 82.596 hum -163.0459 13.678 -11.921 0.000 -189.863 -136.229 time_label -45.6442 2.036 -22.419 0.000 -49.636 -41.652 hr 6.5363 0.343 19.040 0.000 5.863 7.209 temp 292.8830 11.973 24.461 0.000 269.407 316.359 ============================================================================== Omnibus: 742.916 Durbin-Watson: 1.672 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1581.468 Skew: 1.235 Prob(JB): 0.00 Kurtosis: 5.195 Cond. No. 114. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The R-Squared values for the Test Data were very close to those of the Train Data: 0.458 and 0.457 vs. 0.465 and 0.464.
However, the Test Data showed one more insignificant variable: 'weekday' (P>|t| = 0.686). I will drop it and recalculate R-Squared.
features8 = ['holiday', 'mnth', 'weathersit', 'yr', 'hum', 'time_label', 'hr', 'temp']
X = test[features8]
Y = test['cnt']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
# with statsmodels
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
display(Markdown('<h3><span style="color:blue"> {} Features </span></h3>'.format(8)))
print(features8, '\n')
print_model = model.summary()
print(print_model)
['holiday', 'mnth', 'weathersit', 'yr', 'hum', 'time_label', 'hr', 'temp'] OLS Regression Results ============================================================================== Dep. Variable: cnt R-squared: 0.458 Model: OLS Adj. R-squared: 0.457 Method: Least Squares F-statistic: 366.3 Date: Tue, 27 Jul 2021 Prob (F-statistic): 0.00 Time: 21:16:01 Log-Likelihood: -21881. No. Observations: 3476 AIC: 4.378e+04 Df Residuals: 3467 BIC: 4.383e+04 Df Model: 8 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 128.7462 12.592 10.225 0.000 104.058 153.434 holiday -30.2969 12.517 -2.420 0.016 -54.839 -5.755 mnth 5.0691 0.669 7.577 0.000 3.757 6.381 weathersit -9.4758 3.938 -2.406 0.016 -17.198 -1.754 yr 73.7665 4.489 16.434 0.000 64.966 82.567 hum -163.3729 13.652 -11.967 0.000 -190.140 -136.606 time_label -45.6433 2.036 -22.421 0.000 -49.635 -41.652 hr 6.5364 0.343 19.042 0.000 5.863 7.209 temp 292.8991 11.972 24.466 0.000 269.426 316.372 ============================================================================== Omnibus: 741.353 Durbin-Watson: 1.672 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1575.970 Skew: 1.233 Prob(JB): 0.00 Kurtosis: 5.191 Cond. No. 110. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The final model is composed of 8 variables and offers a total of about 46% explained variation of the total variation of bike rentals per hour.
Not very good!!
Decision Trees are great for obtaining non-linear relationships between input features and the target variable.
The inner working of a Decision Tree can be thought of as a bunch of if-else conditions.
It starts at the very top with one node. This node then splits into a left and right node — decision nodes. These nodes then split into their respective right and left nodes.
At the end of the leaf node, the average of the observation that occurs within that area is computed. The most bottom nodes are referred to as leaves or terminal nodes.
Now I will use the Decision Tree Algorithm to see if it will reduce prediction error compared to error obtained from Multiple Linear Regression.
I will create 'for loops' to cover a range of classifier values for:
# initialize the classifiers.
j = 10
for n in range(2,10):
clf = DecisionTreeClassifier(random_state=1, max_depth=n, min_samples_leaf=j)
clf2 = DecisionTreeClassifier(random_state=1, max_depth=n, min_samples_leaf=n)
clf.fit(train[features11], train['cnt'])
clf2.fit(train[features11], train['cnt'])
predictions = clf.predict(test[features11])
predictions2 = clf2.predict(test[features11])
print('max_depth =', n, ' min_samples_leaf =', j,\
' error =', np.mean((predictions - test['cnt']) ** 2), '\n')
print('max_depth =', n, ' min_samples_leaf =', n,\
' error =', np.mean((predictions2 - test['cnt']) ** 2), '\n')
j = j - 1
# create a second loop for other combinations of classifiers.
k = 10
for m in range(2,10):
clf = DecisionTreeClassifier(random_state=1, max_depth=(k), min_samples_leaf=m)
clf.fit(train[features11], train['cnt'])
predictions = clf.predict(test[features11])
print('max_depth =', k, ' min_samples_leaf =', m, ' error =', np.mean((predictions - test['cnt']) ** 2), '\n')
k = k - 1
max_depth = 2 min_samples_leaf = 10 error = 26841.366513233603 max_depth = 2 min_samples_leaf = 2 error = 26841.366513233603 max_depth = 3 min_samples_leaf = 9 error = 26127.992807825085 max_depth = 3 min_samples_leaf = 3 error = 26127.992807825085 max_depth = 4 min_samples_leaf = 8 error = 21809.697640966628 max_depth = 4 min_samples_leaf = 4 error = 21809.697640966628 max_depth = 5 min_samples_leaf = 7 error = 17534.776467203683 max_depth = 5 min_samples_leaf = 5 error = 17534.776467203683 max_depth = 6 min_samples_leaf = 6 error = 18490.93584579977 max_depth = 6 min_samples_leaf = 6 error = 18490.93584579977 max_depth = 7 min_samples_leaf = 5 error = 15019.472094361336 max_depth = 7 min_samples_leaf = 7 error = 15020.88751438435 max_depth = 8 min_samples_leaf = 4 error = 11355.302359033372 max_depth = 8 min_samples_leaf = 8 error = 11391.896432681242 max_depth = 9 min_samples_leaf = 3 error = 11137.639528193326 max_depth = 9 min_samples_leaf = 9 error = 11198.666283084005 max_depth = 10 min_samples_leaf = 2 error = 9931.891254315306 max_depth = 9 min_samples_leaf = 3 error = 11137.639528193326 max_depth = 8 min_samples_leaf = 4 error = 11355.302359033372 max_depth = 7 min_samples_leaf = 5 error = 15019.472094361336 max_depth = 6 min_samples_leaf = 6 error = 18490.93584579977 max_depth = 5 min_samples_leaf = 7 error = 17534.776467203683 max_depth = 4 min_samples_leaf = 8 error = 21809.697640966628 max_depth = 3 min_samples_leaf = 9 error = 26127.992807825085
The best combination of 'max_depth' and 'min_samples_leaf' for Decision Tree Algorithm that yielded overall lowest error of 9932 rounded was:
Random forest is an ensemble of decision trees. This is to say that many trees, constructed in a certain “random” way form a Random Forest.
One slight drawback with Random Forest Regression is that the predicted values are never outside the training set values for the target variable.
Finally, I will test run Random Forest Regression and observe magnitude of errors.
I will create 'for loops' to cover a range of classifier values for:
# initialize the classifiers.
j = 10
for n in range(2,10):
clf = RandomForestClassifier(random_state=1, n_estimators=n, min_samples_leaf=j)
clf2 = RandomForestClassifier(random_state=1, n_estimators=n, min_samples_leaf=n)
clf.fit(train[features11], train['cnt'])
clf2.fit(train[features11], train['cnt'])
predictions = clf.predict(test[features11])
predictions2 = clf2.predict(test[features11])
print('n_estimators =', n, ' min_samples_leaf =', j,\
' error =', np.mean((predictions - test['cnt']) ** 2), '\n')
print('n_estimators =', n, ' min_samples_leaf =', n,\
' error =', np.mean((predictions2 - test['cnt']) ** 2), '\n')
j = j - 1
# create a second loop for other combinations of classifiers.
k = 10
for m in range(2,10):
clf = RandomForestClassifier(random_state=1, n_estimators=m, min_samples_leaf=m)
clf.fit(train[features11], train["cnt"])
predictions = clf.predict(test[features11])
print('n_estimators =', k, ' min_samples_leaf =', m, ' error =', np.mean((predictions - test['cnt']) ** 2), '\n')
k = k - 1
n_estimators = 2 min_samples_leaf = 10 error = 14598.880034522439 n_estimators = 2 min_samples_leaf = 2 error = 14919.599827387801 n_estimators = 3 min_samples_leaf = 9 error = 15234.825949367088 n_estimators = 3 min_samples_leaf = 3 error = 13593.939010356731 n_estimators = 4 min_samples_leaf = 8 error = 13944.859896432681 n_estimators = 4 min_samples_leaf = 4 error = 13280.714326812427 n_estimators = 5 min_samples_leaf = 7 error = 14616.012370540851 n_estimators = 5 min_samples_leaf = 5 error = 12912.723532796317 n_estimators = 6 min_samples_leaf = 6 error = 10773.460299194476 n_estimators = 6 min_samples_leaf = 6 error = 10773.460299194476 n_estimators = 7 min_samples_leaf = 5 error = 11518.689585730725 n_estimators = 7 min_samples_leaf = 7 error = 11693.867951668584 n_estimators = 8 min_samples_leaf = 4 error = 10679.961162255466 n_estimators = 8 min_samples_leaf = 8 error = 11412.537686996548 n_estimators = 9 min_samples_leaf = 3 error = 10721.050345224396 n_estimators = 9 min_samples_leaf = 9 error = 10464.697928653624 n_estimators = 10 min_samples_leaf = 2 error = 14919.599827387801 n_estimators = 9 min_samples_leaf = 3 error = 13593.939010356731 n_estimators = 8 min_samples_leaf = 4 error = 13280.714326812427 n_estimators = 7 min_samples_leaf = 5 error = 12912.723532796317 n_estimators = 6 min_samples_leaf = 6 error = 10773.460299194476 n_estimators = 5 min_samples_leaf = 7 error = 11693.867951668584 n_estimators = 4 min_samples_leaf = 8 error = 11412.537686996548 n_estimators = 3 min_samples_leaf = 9 error = 10464.697928653624
The best combination of 'n_estimators' and 'min_samples_leaf' for Random Forest Algorithm that yielded overall lowest error of 10465 rounded was:
I have read that Random Forest Regression should usually yield lower error than Decision Tree Algorithm. Well, with the specific classifiers and range of values I chose for each in this case, that did not work out to be so.
I realize I could have chosen many more classifiers and increased ranges for the last two regression types: Decision Tree and Random Forest. I decided to limit the amount of each.
Both Decision Tree and Random Forest Regression yielded lower errors than Multiple Linear Regression.
The Decision Tree Algorithm yielded overall lowest prediction error (9932) among the three model types. However, that is still a fairly high error.
I currently do not have sufficient experience in using Decision Tree or Random Forest Regression to explain why they may often yield lower prediction error than Multiple Linear Regression.
HOWEVER: If the provided data file has a poor selection of predictor variables, it really doesn't matter how many classifiers we introduce and modify for either Decision Tree or Random Forest Regression to try and 'squeeze out' as much error as possible. The lowest error quantified will still most likely be high because of the low total explained variability (R-Squared) from the chosen variables.
That is typically the result when performing 'Happenstance Data Analysis' which is most likely what we have here. I would hazard to guess that the variables chosen for this data file were based on what was available, not necessarily chosen under controlled conditions for a 'Designed Experiment'.