Notebook

Predicting Bike Rentals

Introduction¶

Many U.S. cities have communal bike sharing stations where you can rent bicycles by the hour or day. Washington, D.C. is one of these cities. The District collects detailed data on the number of bicycles people rent by the hour and day.

Hadi Fanaee-T at the University of Porto compiled this data into a CSV file. The file contains 17380 rows, with each row representing the number of bike rentals for a single hour of a single day. You can download the data here from the University of California, Irvine's website.

Project Goal¶

In this project, the goal is to develop a predictor model for the total number of bikes people rented in a given hour.

I will generate the following machine learning models and evaluate their performance in terms of magnitude of prediction 'error' (lower is better):

Multiple Linear Regression
Decision Tree Algorithm
Random Forest Regression

In [1]:

# import key python libary modules
# to execute various code commands.

import pandas as pd
import numpy as np
import math
import random
import string
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import seaborn as sns
from numpy.random import seed, randint
from IPython.display import HTML
from IPython.display import display, Markdown
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings('ignore')

# read data file provided and observe data structure and contents.
bike_rentals = pd.read_csv('bike_rental_hour.csv', na_values=['NaN'])
print(bike_rentals.info(), '\n')
display(bike_rentals.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB
None

	instant	dteday	season	mnth	hr	weekday	weathersit	temp	atemp	hum	casual	registered	cnt
0	1	2011-01-01	1	1	0	6	1	0.24	0.2879	0.81	3	13	16
1	2	2011-01-01	1	1	1	6	1	0.22	0.2727	0.80	8	32	40
2	3	2011-01-01	1	1	2	6	1	0.22	0.2727	0.80	5	27	32
3	4	2011-01-01	1	1	3	6	1	0.24	0.2879	0.75	3	10	13
4	5	2011-01-01	1	1	4	6	1	0.24	0.2879	0.75	0	1	1

Data Column Descriptions¶

Here are the descriptions for the relevant columns:

instant - A unique sequential ID number for each row
dteday - The date of the rentals
season - The season in which the rentals occurred
yr - The year the rentals occurred
mnth - The month the rentals occurred
hr - The hour the rentals occurred
holiday - Whether or not the day was a holiday
weekday - The day of the week (as a number, 0 to 7)
rkingday - Whether or not the day was a working day
weathersit - The weather (as a categorical variable)
temp - The temperature, on a 0-1 scale
atemp - The adjusted temperature
hum - The humidity, on a 0-1 scale
windspeed - The wind speed, on a 0-1 scale
casual - The number of casual riders (people who hadn't previously signed up with the bike sharing program)
registered - The number of registered riders (people who had already signed up)
cnt - The total number of bike rentals (casual + registered)

The 'target' column (dependent variable) is labeled 'cnt'. There are 14 other variables in the data file (described above) to consider as 'independent variables'. Columns labeled 'instant, 'casual', and 'registered' will not be included.

Histogram for Bike Rental Count Per Hour¶

Start with a visual representation for distribution of bike rental count per hour by means of histogram.

In [2]:

# generate histogram to show distribution of bike rental count per hour.
ax = bike_rentals.hist(column='cnt', bins=14, grid=False,\
                       figsize=(12,8), color='#86bf91', zorder=2, rwidth=0.9)

ax = ax[0]
for x in ax:

    # Despine
    x.spines['right'].set_visible(False)
    x.spines['top'].set_visible(False)
    x.spines['left'].set_visible(False)

    # Switch off ticks
    x.tick_params(axis='both', which='both', bottom='off', top='off',\
                  labelbottom='on', left='off', right='off', labelleft='on')

    # Draw horizontal axis lines
    vals = x.get_yticks()
    for tick in vals:
        x.axhline(y=tick, linestyle='dashed', color='#eeeeee', zorder=1)

    # Remove title
    x.set_title('Distribution of Bike Rentals Per Hour', pad=20, weight='bold', fontsize=20)

    # Set x-axis label
    x.set_xlabel('Bike Rentals Per Hour', labelpad=15, size=15)
    plt.xlim(xmin=0.0, xmax = 1150.0)
    plt.xticks(np.arange(30, 1125, 70), fontsize=12)

    # Set y-axis label
    x.set_ylabel('Frequency of Occurrence', labelpad=15, size=15)
    plt.yticks(fontsize=12)

    # Format y-axis label
    x.yaxis.set_major_formatter(StrMethodFormatter('{x:,g}'))

In [3]:

# summarize distribution.
min_rentals = bike_rentals['cnt'].min()
max_rentals = bike_rentals['cnt'].max()
print('Minimum rentals per hour =', min_rentals, '\n')
print('Maximum rentals per hour =', max_rentals, '\n')

count = 0
for n in bike_rentals['cnt']:
    if n < 143:
        count = count + 1
    else:
        None
percent = (count*100)/len(bike_rentals['cnt'])
print('{:.1f}''%'.format(percent), 'of the bike rentals per hour are less than 143.')

Minimum rentals per hour = 1 

Maximum rentals per hour = 977 

50.1% of the bike rentals per hour are less than 143.

Observations¶

The shape of the histogram above is basically exponential.

About half (50%) of the total bike rentals (17379) were less than 143 rentals per hour.

Create Time Intervals¶

In [4]:

# convert hours to time intervals of 6 hours.
def assign_label(hour):
    if hour >=0 and hour < 6:
        return 4
    elif hour >=6 and hour < 12:
        return 1
    elif hour >= 12 and hour < 18:
        return 2
    elif hour >= 18 and hour <=24:
        return 3

bike_rentals['time_label'] = bike_rentals['hr'].apply(assign_label)
print(bike_rentals['time_label'].value_counts(), '\n')
print(bike_rentals.info())

2    4375
3    4368
1    4360
4    4276
Name: time_label, dtype: int64 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 18 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
 17  time_label  17379 non-null  int64  
dtypes: float64(4), int64(13), object(1)
memory usage: 2.4+ MB
None

Compute Pairwise Correlation¶

Calculate absolute correlation values between independent variables and sort them in order of strength of relationship from lowest to highest.

In [5]:

bike_rentals_subset = bike_rentals.drop(['instant', 'dteday'], axis=1)

corrs = abs(bike_rentals_subset.corr()['cnt'])
sorted_corrs = corrs.sort_values(ascending=True)
print(sorted_corrs)

weekday       0.026900
workingday    0.030284
holiday       0.030927
windspeed     0.093234
mnth          0.120638
weathersit    0.142426
season        0.178056
yr            0.250495
hum           0.322911
time_label    0.378318
hr            0.394071
atemp         0.400929
temp          0.404772
casual        0.694564
registered    0.972151
cnt           1.000000
Name: cnt, dtype: float64

Correlation Matrix Heatmap¶

Plot a correlation matrix heatmap to help weed out any variables that have high correlation with other variables in the data set.

In [6]:

# plot seaborn heatmap.
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12,10))

corrmat = bike_rentals[sorted_corrs.index]
corr2_df = corrmat.corr(method ='pearson')

ax = sns.heatmap(corr2_df)

Observations and Action¶

The diagonal of white squares reflect the correlations of the variables with themselves (i.e. perfect correlation of 1.0)

The scale on the right side shows as correlation increases towards 1.0 (i.e. perfect relationship), color becomes lighter. Based on the correlation matrix heatmap (i.e. lighter color shade comparisons), we can tell that the following pair of columns are strongly correlated:

'mnth' and 'season'
'temp' and 'atemp'

It certainly makes sense that 'temp' and 'atemp' are highly correlated. The variable 'atemp' is defined as adjusted temperature. I'm going to guess that it relates to adjusted temperature based on humidity. On that basis I will remove 'atemp' as a variable since we already have humidity ('hum') available.

There is also fairly high correlation between 'mnth' and 'season' as well. I'll remove 'season' as 'mnth' has finer increments (a little closer to continuous data than 'season').

In [7]:

# drop irrelevant variables.
final_corr_cols = sorted_corrs.drop(['season', 'atemp', 'casual', 'registered'])
print(final_corr_cols, '\n')
features11 = final_corr_cols.drop(['cnt']).index
display(Markdown('<h3><span style="color:blue"> Eleven Bike Rental Features  </span></h3>'))
print(features11, '\n')

weekday       0.026900
workingday    0.030284
holiday       0.030927
windspeed     0.093234
mnth          0.120638
weathersit    0.142426
yr            0.250495
hum           0.322911
time_label    0.378318
hr            0.394071
temp          0.404772
cnt           1.000000
Name: cnt, dtype: float64

Eleven Bike Rental Features

Index(['weekday', 'workingday', 'holiday', 'windspeed', 'mnth', 'weathersit',
       'yr', 'hum', 'time_label', 'hr', 'temp'],
      dtype='object')

Multiple Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. However, we have more than one independent variable here, so this should be classified as Multiple Linear Regression.

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Multiple regression is an extension of linear (OLS) regression that uses just one explanatory variable.

I will begin with multiple linear regression and will choose 'Mean Square Error' (MSE) as my error metric.

In [8]:

# split DataFrame file into 'Train' and 'Test' DataFrames.
# include randomize shuffling to prevent confounding of results.
# random state is a seed value
train = bike_rentals.sample(frac=0.8, random_state=1)
test = bike_rentals.loc[~bike_rentals.index.isin(train.index)]
# confirm correct split.
print('Train Data', train.shape, '\n')
print('Test Data', test.shape, '\n')

target = 'cnt'
clean_test = test[final_corr_cols.index].dropna()

# use sklearn linear model to calculate magnitude of errors (MSE).
lr = LinearRegression()
lr.fit(train[features11], train['cnt'])

train_predictions = lr.predict(train[features11])
test_predictions = lr.predict(clean_test[features11])

train_mse = mean_squared_error(train_predictions, train[target])
test_mse = mean_squared_error(test_predictions, clean_test[target])

print('Train Data Mean Square Error =', '{:.0f}'.format(train_mse), '\n')
print('Test Data Mean Square Error =', '{:.0f}'.format(test_mse))

Train Data (13903, 18) 

Test Data (3476, 18) 

Train Data Mean Square Error = 17764 

Test Data Mean Square Error = 17236

Multiple Linear Regression Error (MSE)¶

The magnitude of 'Test Data' error (MSE), 17236, is very high relative to the range of bike rentals per hour: 1 to 977.

Mean Square Error by itself doesn't tell me a whole lot.

I prefer to determine R-Squared values. R-Squared value tells me the percentage of the total 'target' (bike rentals per hour) variation is explained by the independent variables in the model. I will expound on this further after R-Squared calculations.

Determine R-Squared Values¶

In [9]:

# 'X' relates to indicators and 'Y' relates to target 'cnt'.
X = train[features11]
Y = train['cnt']
 
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)

# with statsmodels
X = sm.add_constant(X) # adding a constant
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
display(Markdown('<h3><span style="color:blue"> {} Features  </span></h3>'.format(11)))
print(features11, '\n')

print_model = model.summary()
print(print_model)

11 Features

Index(['weekday', 'workingday', 'holiday', 'windspeed', 'mnth', 'weathersit',
       'yr', 'hum', 'time_label', 'hr', 'temp'],
      dtype='object') 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    cnt   R-squared:                       0.465
Model:                            OLS   Adj. R-squared:                  0.464
Method:                 Least Squares   F-statistic:                     1097.
Date:                Tue, 27 Jul 2021   Prob (F-statistic):               0.00
Time:                        21:16:01   Log-Likelihood:                -87747.
No. Observations:               13903   AIC:                         1.755e+05
Df Residuals:                   13891   BIC:                         1.756e+05
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        119.6335      7.678     15.581      0.000     104.583     134.684
weekday        2.1554      0.567      3.799      0.000       1.043       3.268
workingday     4.0223      2.517      1.598      0.110      -0.912       8.957
holiday      -25.6396      7.140     -3.591      0.000     -39.635     -11.644
windspeed     11.1141      9.892      1.124      0.261      -8.275      30.504
mnth           4.8514      0.344     14.103      0.000       4.177       5.526
weathersit   -13.2459      2.004     -6.610      0.000     -17.174      -9.318
yr            84.4071      2.274     37.116      0.000      79.950      88.865
hum         -154.1554      7.274    -21.192      0.000    -168.414    -139.897
time_label   -47.7509      1.047    -45.629      0.000     -49.802     -45.700
hr             6.6070      0.175     37.781      0.000       6.264       6.950
temp         288.5194      6.125     47.108      0.000     276.514     300.525
==============================================================================
Omnibus:                     2732.302   Durbin-Watson:                   2.030
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5460.778
Skew:                           1.182   Prob(JB):                         0.00
Kurtosis:                       4.959   Cond. No.                         153.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Initial R-Squared Values¶

The R-Squared and Adjusted R-Squared values above are very close: 0.465 and 0.464 respectively.

The total explained variability of bike rentals per hour offered by the independent variables is only about 46%. That means that there is 54% unexplained variability in the model. This shows me why the magnitude of error from Multiple Linear Regression is high.

In summary, there are other variables that were not included in the data set that would account for the 54% unexplained variability.

Other such variables not included in the data set that could be significant in a predictor model are:

rental rates
rental rate specials
type of bike
bike availability
age of renter
gender of renter
duration of rental
...

Trim Insignificant Variables¶

The regression table above shows two variables as having low strength of relationship regarding P>|t| values:

workingday: P>|t| = 0.110
windspeed: P>|t| = 0.261

I will remove these and recalculate R-Squared.

In [10]:

features9 = ['weekday', 'holiday', 'mnth', 'weathersit',
       'yr', 'hum', 'time_label', 'hr', 'temp']

X = train[features9]
Y = train['cnt']
 
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)

# with statsmodels
X = sm.add_constant(X) # adding a constant
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
display(Markdown('<h3><span style="color:blue"> {} Features  </span></h3>'.format(9)))
print(features9, '\n')

print_model = model.summary()
print(print_model)

9 Features

['weekday', 'holiday', 'mnth', 'weathersit', 'yr', 'hum', 'time_label', 'hr', 'temp'] 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    cnt   R-squared:                       0.465
Model:                            OLS   Adj. R-squared:                  0.464
Method:                 Least Squares   F-statistic:                     1340.
Date:                Tue, 27 Jul 2021   Prob (F-statistic):               0.00
Time:                        21:16:01   Log-Likelihood:                -87749.
No. Observations:               13903   AIC:                         1.755e+05
Df Residuals:                   13893   BIC:                         1.756e+05
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        125.4344      6.890     18.204      0.000     111.928     138.941
weekday        2.1692      0.567      3.823      0.000       1.057       3.281
holiday      -28.3768      6.924     -4.099      0.000     -41.948     -14.805
mnth           4.8153      0.343     14.037      0.000       4.143       5.488
weathersit   -12.7482      1.978     -6.446      0.000     -16.625      -8.872
yr            84.3142      2.273     37.094      0.000      79.859      88.770
hum         -156.4968      6.978    -22.429      0.000    -170.174    -142.820
time_label   -47.8100      1.045    -45.734      0.000     -49.859     -45.761
hr             6.6156      0.175     37.868      0.000       6.273       6.958
temp         288.8919      6.116     47.236      0.000     276.904     300.880
==============================================================================
Omnibus:                     2753.668   Durbin-Watson:                   2.030
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5552.804
Skew:                           1.186   Prob(JB):                         0.00
Kurtosis:                       4.989   Cond. No.                         117.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Trimmed R-Squared Values¶

After removing the two insignificant variables, there was no change in R Squared values.

Now to Check the Results on 'Test' Data ...¶

In [11]:

X = test[features9]
Y = test['cnt']
 
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)

# with statsmodels
X = sm.add_constant(X) # adding a constant
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
display(Markdown('<h3><span style="color:blue"> {} Features  </span></h3>'.format(9)))
print(features9, '\n')
print_model = model.summary()
print(print_model)

9 Features

['weekday', 'holiday', 'mnth', 'weathersit', 'yr', 'hum', 'time_label', 'hr', 'temp'] 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    cnt   R-squared:                       0.458
Model:                            OLS   Adj. R-squared:                  0.457
Method:                 Least Squares   F-statistic:                     325.5
Date:                Tue, 27 Jul 2021   Prob (F-statistic):               0.00
Time:                        21:16:01   Log-Likelihood:                -21881.
No. Observations:                3476   AIC:                         4.378e+04
Df Residuals:                    3466   BIC:                         4.384e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        127.2848     13.101      9.715      0.000     101.598     152.972
weekday        0.4529      1.120      0.405      0.686      -1.742       2.648
holiday      -29.6735     12.613     -2.353      0.019     -54.404      -4.943
mnth           5.0590      0.670      7.555      0.000       3.746       6.372
weathersit    -9.5172      3.940     -2.415      0.016     -17.243      -1.792
yr            73.7933      4.490     16.436      0.000      64.990      82.596
hum         -163.0459     13.678    -11.921      0.000    -189.863    -136.229
time_label   -45.6442      2.036    -22.419      0.000     -49.636     -41.652
hr             6.5363      0.343     19.040      0.000       5.863       7.209
temp         292.8830     11.973     24.461      0.000     269.407     316.359
==============================================================================
Omnibus:                      742.916   Durbin-Watson:                   1.672
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1581.468
Skew:                           1.235   Prob(JB):                         0.00
Kurtosis:                       5.195   Cond. No.                         114.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Test Data R-Squared Results¶

The R-Squared values for the Test Data were very close to those of the Train Data: 0.458 and 0.457 vs. 0.465 and 0.464.

However, the Test Data showed one more insignificant variable: 'weekday' (P>|t| = 0.686). I will drop it and recalculate R-Squared.

In [12]:

features8 = ['holiday', 'mnth', 'weathersit', 'yr', 'hum', 'time_label', 'hr', 'temp']

X = test[features8]
Y = test['cnt']
 
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)

# with statsmodels
X = sm.add_constant(X) # adding a constant
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
display(Markdown('<h3><span style="color:blue"> {} Features  </span></h3>'.format(8)))
print(features8, '\n')
print_model = model.summary()
print(print_model)

8 Features

['holiday', 'mnth', 'weathersit', 'yr', 'hum', 'time_label', 'hr', 'temp'] 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    cnt   R-squared:                       0.458
Model:                            OLS   Adj. R-squared:                  0.457
Method:                 Least Squares   F-statistic:                     366.3
Date:                Tue, 27 Jul 2021   Prob (F-statistic):               0.00
Time:                        21:16:01   Log-Likelihood:                -21881.
No. Observations:                3476   AIC:                         4.378e+04
Df Residuals:                    3467   BIC:                         4.383e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        128.7462     12.592     10.225      0.000     104.058     153.434
holiday      -30.2969     12.517     -2.420      0.016     -54.839      -5.755
mnth           5.0691      0.669      7.577      0.000       3.757       6.381
weathersit    -9.4758      3.938     -2.406      0.016     -17.198      -1.754
yr            73.7665      4.489     16.434      0.000      64.966      82.567
hum         -163.3729     13.652    -11.967      0.000    -190.140    -136.606
time_label   -45.6433      2.036    -22.421      0.000     -49.635     -41.652
hr             6.5364      0.343     19.042      0.000       5.863       7.209
temp         292.8991     11.972     24.466      0.000     269.426     316.372
==============================================================================
Omnibus:                      741.353   Durbin-Watson:                   1.672
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1575.970
Skew:                           1.233   Prob(JB):                         0.00
Kurtosis:                       5.191   Cond. No.                         110.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Final Test Data R-Squared Values¶

The final model is composed of 8 variables and offers a total of about 46% explained variation of the total variation of bike rentals per hour.

Not very good!!

Decision Tree Algorithm

Decision%20Tree.jpg

Decision Trees are great for obtaining non-linear relationships between input features and the target variable.

The inner working of a Decision Tree can be thought of as a bunch of if-else conditions.

It starts at the very top with one node. This node then splits into a left and right node — decision nodes. These nodes then split into their respective right and left nodes.

At the end of the leaf node, the average of the observation that occurs within that area is computed. The most bottom nodes are referred to as leaves or terminal nodes.

Now I will use the Decision Tree Algorithm to see if it will reduce prediction error compared to error obtained from Multiple Linear Regression.

I will create 'for loops' to cover a range of classifier values for:

'max_depth'
'min_samples_leaf'

In [13]:

# initialize the classifiers.
j = 10
for n in range(2,10):
    clf = DecisionTreeClassifier(random_state=1, max_depth=n, min_samples_leaf=j)
    clf2 = DecisionTreeClassifier(random_state=1, max_depth=n, min_samples_leaf=n)
    clf.fit(train[features11], train['cnt'])
    clf2.fit(train[features11], train['cnt'])
    predictions = clf.predict(test[features11])
    predictions2 = clf2.predict(test[features11])
    print('max_depth =', n, '    min_samples_leaf =', j,\
          '    error =', np.mean((predictions - test['cnt']) ** 2), '\n')
    print('max_depth =', n, '    min_samples_leaf =', n,\
          '    error =', np.mean((predictions2 - test['cnt']) ** 2), '\n')   
    j = j - 1

# create a second loop for other combinations of classifiers.
k = 10
for m in range(2,10):
    clf = DecisionTreeClassifier(random_state=1, max_depth=(k), min_samples_leaf=m)
    clf.fit(train[features11], train['cnt'])
    predictions = clf.predict(test[features11])
    print('max_depth =', k, '    min_samples_leaf =', m, '    error =', np.mean((predictions - test['cnt']) ** 2), '\n')
    k = k - 1

max_depth = 2     min_samples_leaf = 10     error = 26841.366513233603 

max_depth = 2     min_samples_leaf = 2     error = 26841.366513233603 

max_depth = 3     min_samples_leaf = 9     error = 26127.992807825085 

max_depth = 3     min_samples_leaf = 3     error = 26127.992807825085 

max_depth = 4     min_samples_leaf = 8     error = 21809.697640966628 

max_depth = 4     min_samples_leaf = 4     error = 21809.697640966628 

max_depth = 5     min_samples_leaf = 7     error = 17534.776467203683 

max_depth = 5     min_samples_leaf = 5     error = 17534.776467203683 

max_depth = 6     min_samples_leaf = 6     error = 18490.93584579977 

max_depth = 6     min_samples_leaf = 6     error = 18490.93584579977 

max_depth = 7     min_samples_leaf = 5     error = 15019.472094361336 

max_depth = 7     min_samples_leaf = 7     error = 15020.88751438435 

max_depth = 8     min_samples_leaf = 4     error = 11355.302359033372 

max_depth = 8     min_samples_leaf = 8     error = 11391.896432681242 

max_depth = 9     min_samples_leaf = 3     error = 11137.639528193326 

max_depth = 9     min_samples_leaf = 9     error = 11198.666283084005 

max_depth = 10     min_samples_leaf = 2     error = 9931.891254315306 

max_depth = 9     min_samples_leaf = 3     error = 11137.639528193326 

max_depth = 8     min_samples_leaf = 4     error = 11355.302359033372 

max_depth = 7     min_samples_leaf = 5     error = 15019.472094361336 

max_depth = 6     min_samples_leaf = 6     error = 18490.93584579977 

max_depth = 5     min_samples_leaf = 7     error = 17534.776467203683 

max_depth = 4     min_samples_leaf = 8     error = 21809.697640966628 

max_depth = 3     min_samples_leaf = 9     error = 26127.992807825085

Decision Tree Results¶

The best combination of 'max_depth' and 'min_samples_leaf' for Decision Tree Algorithm that yielded overall lowest error of 9932 rounded was:

'max_depth' = 10
'min_samples_leaf' = 2

Random Forest Regression

Random%20Forest2.jpg

Random forest is an ensemble of decision trees. This is to say that many trees, constructed in a certain “random” way form a Random Forest.

Each tree is created from a different sample of rows and at each node, a different sample of features is selected for splitting.
Each of the trees makes its own individual prediction.
These predictions are then averaged to produce a single result.

One slight drawback with Random Forest Regression is that the predicted values are never outside the training set values for the target variable.

Finally, I will test run Random Forest Regression and observe magnitude of errors.

I will create 'for loops' to cover a range of classifier values for:

'n_estimators'
'min_samples_leaf'

In [14]:

# initialize the classifiers.
j = 10
for n in range(2,10):
    clf = RandomForestClassifier(random_state=1, n_estimators=n, min_samples_leaf=j)
    clf2 = RandomForestClassifier(random_state=1, n_estimators=n, min_samples_leaf=n)
    clf.fit(train[features11], train['cnt'])
    clf2.fit(train[features11], train['cnt'])
    predictions = clf.predict(test[features11])
    predictions2 = clf2.predict(test[features11])
    print('n_estimators =', n, '    min_samples_leaf =', j,\
          '    error =', np.mean((predictions - test['cnt']) ** 2), '\n')
    print('n_estimators =', n, '    min_samples_leaf =', n,\
          '    error =', np.mean((predictions2 - test['cnt']) ** 2), '\n')    
    j = j - 1

# create a second loop for other combinations of classifiers.    
k = 10
for m in range(2,10):
    clf = RandomForestClassifier(random_state=1, n_estimators=m, min_samples_leaf=m)
    clf.fit(train[features11], train["cnt"])
    predictions = clf.predict(test[features11])
    print('n_estimators =', k, '    min_samples_leaf =', m, '    error =', np.mean((predictions - test['cnt']) ** 2), '\n')
    k = k - 1

n_estimators = 2     min_samples_leaf = 10     error = 14598.880034522439 

n_estimators = 2     min_samples_leaf = 2     error = 14919.599827387801 

n_estimators = 3     min_samples_leaf = 9     error = 15234.825949367088 

n_estimators = 3     min_samples_leaf = 3     error = 13593.939010356731 

n_estimators = 4     min_samples_leaf = 8     error = 13944.859896432681 

n_estimators = 4     min_samples_leaf = 4     error = 13280.714326812427 

n_estimators = 5     min_samples_leaf = 7     error = 14616.012370540851 

n_estimators = 5     min_samples_leaf = 5     error = 12912.723532796317 

n_estimators = 6     min_samples_leaf = 6     error = 10773.460299194476 

n_estimators = 6     min_samples_leaf = 6     error = 10773.460299194476 

n_estimators = 7     min_samples_leaf = 5     error = 11518.689585730725 

n_estimators = 7     min_samples_leaf = 7     error = 11693.867951668584 

n_estimators = 8     min_samples_leaf = 4     error = 10679.961162255466 

n_estimators = 8     min_samples_leaf = 8     error = 11412.537686996548 

n_estimators = 9     min_samples_leaf = 3     error = 10721.050345224396 

n_estimators = 9     min_samples_leaf = 9     error = 10464.697928653624 

n_estimators = 10     min_samples_leaf = 2     error = 14919.599827387801 

n_estimators = 9     min_samples_leaf = 3     error = 13593.939010356731 

n_estimators = 8     min_samples_leaf = 4     error = 13280.714326812427 

n_estimators = 7     min_samples_leaf = 5     error = 12912.723532796317 

n_estimators = 6     min_samples_leaf = 6     error = 10773.460299194476 

n_estimators = 5     min_samples_leaf = 7     error = 11693.867951668584 

n_estimators = 4     min_samples_leaf = 8     error = 11412.537686996548 

n_estimators = 3     min_samples_leaf = 9     error = 10464.697928653624

Random Forest Regression Results¶

The best combination of 'n_estimators' and 'min_samples_leaf' for Random Forest Algorithm that yielded overall lowest error of 10465 rounded was:

'n_estimators' = 3 or 9
'min_samples_leaf' = 9

I have read that Random Forest Regression should usually yield lower error than Decision Tree Algorithm. Well, with the specific classifiers and range of values I chose for each in this case, that did not work out to be so.

I realize I could have chosen many more classifiers and increased ranges for the last two regression types: Decision Tree and Random Forest. I decided to limit the amount of each.

Conclusions

Both Decision Tree and Random Forest Regression yielded lower errors than Multiple Linear Regression.

The Decision Tree Algorithm yielded overall lowest prediction error (9932) among the three model types. However, that is still a fairly high error.

I currently do not have sufficient experience in using Decision Tree or Random Forest Regression to explain why they may often yield lower prediction error than Multiple Linear Regression.

HOWEVER: If the provided data file has a poor selection of predictor variables, it really doesn't matter how many classifiers we introduce and modify for either Decision Tree or Random Forest Regression to try and 'squeeze out' as much error as possible. The lowest error quantified will still most likely be high because of the low total explained variability (R-Squared) from the chosen variables.

That is typically the result when performing 'Happenstance Data Analysis' which is most likely what we have here. I would hazard to guess that the variables chosen for this data file were based on what was available, not necessarily chosen under controlled conditions for a 'Designed Experiment'.