House Prices: Advanced Regression Techniques¶

This goal of this project was to predict sales prices and practice feature engineering, RFs, and gradient boosting. The dataset was part of the House Prices Kaggle Competition.

Table of Contents¶

1 Summary
2 Introduction
3 Loading & Exploring the Data
- 3.1 Loading Required Libraries and Reading the Data into Python
- 3.2 Data Structure
4 Exploring the Variables
- 4.1 Exploring the Response Variable: SalePrice
- 4.2 Log-Transformation of the Response Variable
5 Data Imputation
- 5.1 Completeness of the Data
- 5.2 Impute the Missing Data
  - 5.2.1 Missing Values Corresponding to Lack of Specific Feature
  - 5.2.2 Mode Imputation: Replacing Missing Values with Most Frequent Value
6 Feature Engineering
7 LotFrontage Imputation
8 Preparing the Data for Modelling
9 SalePrice Modelling

Model Accuracy: (RMSE = 0.12220)¶

Top submissions on the public leaderboard range between 0.10-0.12

Competition Description¶

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Competition Evaluation¶

As part of a Kaggle competition dataset, the accuracy of the sales prices was evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.

1 Summary¶

I started this competition by focusing on getting a thorough understanding of the dataset. Particular attention was paid to impute the missing values within the dataset. The EDA process is detailed as well as visualized.

In this project, I created a predictive model that has been trained on data collected from homes in Ames, Iowa. Three algorithms were used, and their validation set RMSE and test set RMSE are listed below:

Regression Model	Validation RMSE	Test RMSE
Ridge	0.1130	0.12528
Lasso	0.1125	0.12679
XGBoost	0.1238	0.12799

Ensemble		0.12220

The Ridge regression model performed the best as a single model, likely due to the high multicollinearity. However, combining it with the Lasso and XGBoost regression models resulting in a higher prediction accuracy and a lower RMSE (0.12220 vs 0.12528).

2 Introduction¶

The dataset used for this project is the Ames Housing dataset that was compiled by Dean De Cock for use in data science education. It is an alternative to the popular but older Boston Housing dataset.

The Ames Housing dataset is also used in the Advanced Regression Techniques challenge on the Kaggle Website. These competitions is a great way to improve my skills and measure my progress as a data scientist.

Kaggle describes the competition as follows:

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

House Image

3 Loading & Exploring the Data Structure¶

3.1 Loading Required Libraries and Reading the Data into Python¶

Loading Python packages used in the project

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from time import time

from math import sqrt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import scipy.stats as st
from scipy.special import boxcox1p

from sklearn.cluster import KMeans
from sklearn import svm
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import RobustScaler, LabelEncoder, StandardScaler
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from xgboost.sklearn import XGBRegressor

%matplotlib inline

Now, we read in the csv's as datarames into Python.

In [2]:

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

3.2 Data Structure¶

In total, there are 81 columns/variables in the train dataset, including the response variable (SalePrice). I am only displaying a subset of the variables, as all of them will be discussed in more detail throughout the notebook.

The train dataset consists of character and integer variables. Many of these variables are ordinal factors, despite being represented as character or integer variables. These will require cleaning and/or feature engineering later.

In [3]:

print("Dimensions of Train Dataset:" + str(train.shape))
print("Dimensions of Test Dataset:" + str(test.shape))

Dimensions of Train Dataset:(1460, 81)
Dimensions of Test Dataset:(1459, 80)

In [4]:

train.iloc[:,0:10].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 10 columns):
Id             1460 non-null int64
MSSubClass     1460 non-null int64
MSZoning       1460 non-null object
LotFrontage    1201 non-null float64
LotArea        1460 non-null int64
Street         1460 non-null object
Alley          91 non-null object
LotShape       1460 non-null object
LandContour    1460 non-null object
Utilities      1460 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 114.1+ KB

Next, we are going to define a few variables that will be used in later analyses as well as being required for the submission file.

In [5]:

y_train = train['SalePrice']
test_id = test['Id']

ntrain = train.shape[0]
ntest = test.shape[0]

Lastly, we are going to merge the train and test datasets to explore the data as well as impute any missing values.

In [6]:

all_data = pd.concat((train, test), sort=True).reset_index(drop=True)
all_data['Dataset'] = np.repeat(['Train', 'Test'], [ntrain, ntest], axis=0)
all_data.drop('Id', axis=1,inplace=True)

4 Exploring the Variables¶

4.1 Exploring the Response Variable: SalePrice¶

The probability distribution plot show that the sale prices are right skewed. This is to be expected as few people can afford very expensive houses.

In [7]:

sns.set_style('whitegrid')
sns.distplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], axlabel="Normal Distribution", fit=st.norm, fit_kws={"color":"red"})
plt.title('Distribution of Sales Price in Dollars')
(mu, sigma) = st.norm.fit(train['SalePrice'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and  $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best', fancybox=True)
plt.show()

st.probplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], plot=plt)
plt.show()

Linear models tend to work better with normally distributed data. As such, we need to transform the response variable to make it more normally distributed.

4.2 Log-Transformation of the Response Variable¶

In [8]:

all_data['SalePrice'] = np.log1p(all_data['SalePrice'])

sns.distplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], axlabel="Normal Distribution", fit=st.norm, fit_kws={"color":"red"})

plt.title('Distribution of Transformed Sales Price in Dollars')
(mu, sigma) = st.norm.fit(train['SalePrice'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and  $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best', fancybox=True)
plt.show()

st.probplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], plot=plt)
plt.show()

The skew is highly corrected and the distribution of the log-transformed sale prices appears more normally distributed.

5 Data Imputation¶

5.1 Completeness of the Data¶

We first need to find which variables contain missing values.

In [9]:

cols_with_missing_values = all_data.isnull().sum().sort_values(ascending=False)
display(pd.DataFrame(cols_with_missing_values[cols_with_missing_values[cols_with_missing_values > 0].index], 
                     columns=["Number of Missing Values"]))

	Number of Missing Values
PoolQC	2909
MiscFeature	2814
Alley	2721
Fence	2348
SalePrice	1459
FireplaceQu	1420
LotFrontage	486
GarageFinish	159
GarageQual	159
GarageYrBlt	159
GarageCond	159
GarageType	157
BsmtCond	82
BsmtExposure	82
BsmtQual	81
BsmtFinType2	80
BsmtFinType1	79
MasVnrType	24
MasVnrArea	23
MSZoning	4
Utilities	2
BsmtFullBath	2
BsmtHalfBath	2
Functional	2
Electrical	1
KitchenQual	1
Exterior2nd	1
Exterior1st	1
GarageCars	1
GarageArea	1
TotalBsmtSF	1
BsmtUnfSF	1
BsmtFinSF2	1
BsmtFinSF1	1
SaleType	1

In [10]:

cols_with_missing_values = all_data.isnull().sum().sort_values(ascending=False)
cols_with_missing_values = all_data[cols_with_missing_values[cols_with_missing_values > 0].index]
msno.bar(cols_with_missing_values)
plt.show()

5.2 Impute the Missing Data¶

5.2.1 Missing Values Corresponding to Lack of Specific Feature¶

PoolQC: data description of the variables states that NA represents "no pool". This makes sense given the vast number of missing values (>99%) and that a majority of houses do not have a pool.

In [11]:

all_data['PoolQC'].replace(np.nan, 'None', regex=True, inplace=True)

MiscFeature: data description of the variables states that NA represents "no miscellaneous feature".

In [12]:

all_data['MiscFeature'].replace(np.nan, 'None', regex=True, inplace=True)

Alley: data description of the variables states that NA represents "no alley access".

In [13]:

all_data['Alley'].replace(np.nan, 'None', regex=True, inplace=True)

Fence: data description of the variables states that NA represents "no fence".

In [14]:

all_data['Fence'].replace(np.nan, 'None', regex=True, inplace=True)

FireplaceQU: data description of the variables states that NA represents "no fireplace".

In [15]:

all_data['FireplaceQu'].replace(np.nan, 'None', regex=True, inplace=True)

GarageType, GarageFinish, GarageQual, and GarageCond: Missing values are likely due to lack of a garage. Replacing missing data with None.

In [16]:

garage_categorical = ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']
for variable in garage_categorical:
    all_data[variable].replace(np.nan, 'None', regex=True, inplace=True)

GarageYrBlt, GarageCars, and GarageArea: Missing values are likely due to lack of a basement. Replacing missing data with 0 (since no garage means no cars in a garage).

In [17]:

garage_numeric = ['GarageYrBlt', 'GarageCars', 'GarageArea']
for variable in garage_numeric:
    all_data[variable].replace(np.nan, 0, regex=True, inplace=True)

BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, and BsmtFinType2: Missing values are likely due to lack of a basement. Replacing missing data with None.

In [18]:

basement_categorical = ['BsmtQual','BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
for variable in basement_categorical:
    all_data[variable].replace(np.nan, 'None', regex=True, inplace=True)

BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, and BsmtHalfBath: Missing values are likely due to lack of a basement. Replacing missing data with 0.

In [19]:

basement_numeric = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']
for variable in basement_numeric:
    all_data[variable].replace(np.nan, 0, regex=True, inplace=True)

MasVnrType and MasVnrArea: Missing values are likely due to lack of masonry vaneer. We will replace the missing values with None for the type and 0 for the area.

In [20]:

all_data['MasVnrType'].replace(np.nan, 'None', regex=True, inplace=True)
all_data['MasVnrArea'].replace(np.nan, 0, regex=True, inplace=True)

Functional: data description of the variables states that NA represents "typical".

In [21]:

all_data['Functional'].replace(np.nan, 'Typ', regex=True, inplace=True)

5.2.2 Mode Imputation: Replacing Missing Values with Most Frequent Value¶

Using a mode imputation, we replace the missing values of a categorical variable with the mode of the non-missing cases of that variable. While it does have the advantage of being fast, it comes at the cost of a reduction in variance within the dataset.

Due to the low number of missing values imputed using this method, the bias introduced on the mean and standard deviation, as well as correlations with other variables are minimal.

First, we define a function "mode_impute_and_plot" to simplify the process of visualizing the categorical variables and replacing the missing variables with the most frequent value.

In [22]:

def mode_impute_and_plot(variable):
    print('# of missing values: ' + str(all_data[variable].isna().sum()))
    plt.figure(figsize=(8,4))
    ax = sns.countplot(all_data[variable])
    ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
    plt.tight_layout()
    plt.show()
    
    all_data[variable].replace(np.nan, all_data[variable].mode()[0], regex=True, inplace=True)

Now we can proceed to replace the missing values for the following variables:


MSZoning	Utilities	Electrical	[Exterior1st and Exterior2nd](#Exterior1st and Exterior2nd)	KitchenQual	SaleType

MSZoning: "RL" is by far the most common value. Missing values will be replace with "RL".

In [23]:

mode_impute_and_plot('MSZoning')

# of missing values: 4

Utilities: With the exception of one "NoSeWa" value, all records for this variable are "AllPub". Since "NoSeWa" is only in the training set, this feature will not help in predictive modelling. As such, we can safely remove it.

In [24]:

mode_impute_and_plot('Utilities')

all_data = all_data.drop('Utilities', axis=1)

# of missing values: 2

Electrical: Only one missing value, replace it with "SBrkr", the most common value.

In [25]:

mode_impute_and_plot('Electrical')

# of missing values: 1

Exterior1st and Exterior2nd: There is only one missing value for both Exterior 1 & 2. Missing values will be replaced by the most common value.

In [26]:

mode_impute_and_plot('Exterior1st')

# of missing values: 1

In [27]:

mode_impute_and_plot('Exterior2nd')

# of missing values: 1

KitchenQual: Only one missing value, replace it with "TA", the most common value.

In [28]:

mode_impute_and_plot('KitchenQual')

# of missing values: 1

SaleType: Replace missing values with "WD", the most common value.

In [29]:

mode_impute_and_plot('SaleType')

# of missing values: 1

Are there any remaining missing values?

In [30]:

cols_with_missing_values = all_data.isnull().sum().sort_values(ascending=False)
display(pd.DataFrame(cols_with_missing_values[cols_with_missing_values[cols_with_missing_values > 0].index], columns=["Number of Missing Values"]))

	Number of Missing Values
SalePrice	1459
LotFrontage	486

Due to its numeric nature and the large number of missing values, the LotFrontage variable will be imputed separately using an SVM algorithm (See Section 7 LotFrontage Imputation).

The remaining variables are all complete! Now to move on to feature engineering.

6 Feature Engineering¶

6.1 Mixed Conditions¶

In order to simplify and boost the accuracy of the preditive models, we will merge the two conditions into one variable: MixedConditions

The data descriptions states:

Condition1 represents proximity to various conditions.
Condition2 represents proximity to various conditions (if more than one is present).

If a property does not have one or multiple conditions, then it is classified as normal. However, designation of "normal" are condition 1 or condition 2 is strictly alphabetical.

For example, if a property is in proximity to a feeder street ("Feedr") and no other condition, then the data would appear as follows:

Condition1	Condition2
Feedr	Norm

However, if a property is within 200' of East-West Railroad (RRNe) and no other condition, then the data would appear as follows:

Condition1	Condition2
Norm	RRNe

Once we merge Conditions 1 & 2 into the MixedConditions variable, we will remove them from the analysis.

In [31]:

all_data['MixedConditions'] = all_data['Condition1'] + ' - ' + all_data['Condition2']
all_data.drop(labels=['Condition1', 'Condition2'], axis=1, inplace=True)

6.2 Mixed Exterior¶

The Exterior1st and Exterior2nd features are similar to the Conditions feature we merged and remove above. Properties with multiple types of exterior covering the house are assigned to Exterior1st or Exterior2nd alphabetically.

As such, we will conduct the same process to merge the two columns into a single MixedExterior variable and remove them from the analysis.

In [32]:

all_data['MixedExterior'] = all_data['Exterior1st'] + ' - ' + all_data['Exterior2nd']
all_data.drop(labels=['Exterior1st', 'Exterior2nd'], axis=1, inplace=True)

6.3 Total Square Feet¶

One of the important factors that people consider when buying a house is the total living space in square feet. Since the total square feet is not explicitly listed, we will add a new variable by adding up the square footage of the basement, first floor, and the second floor.

In [33]:

SF_df = all_data[['TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'SalePrice']]
SF_df = SF_df[~SF_df['SalePrice'].isnull()]
SF_vars = list(SF_df)
del SF_vars[-1]

SF_summary = []

for SF_type in SF_vars:
    corr_val = np.corrcoef(SF_df[SF_type], SF_df['SalePrice'])[1][0]
    SF_summary.append(corr_val)
    
pd.DataFrame([SF_summary], columns=SF_vars, index=['SalePrice Correlation'])

Out[33]:

	TotalBsmtSF	1stFlrSF	2ndFlrSF
SalePrice Correlation	0.612134	0.596981	0.3193

In [34]:

all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

plt.figure(figsize=(10,6))
ax = sns.regplot(all_data['TotalSF'], all_data['SalePrice'], line_kws={'color': 'red'})
ax.set(ylabel='ln(SalePrice)')
plt.show()

print("Pearson Correlation Coefficient: %.3f" %  (np.corrcoef(all_data['TotalSF'].iloc[:ntrain], train['SalePrice']))[1][0])

Pearson Correlation Coefficient: 0.782

There is a very strong correlation (r = 0.78) between the TotalSF and the SalePrice, with the exception of two outliers. These outliers should be removed to increase the accuracy of our model.

6.4 Total Number of Bathrooms¶

There are 4 bathroom variables. Individually, these variables are not very important. Together, however, these predictors are likely to become a strong one.

A full bath is made up of four parts: a sink, shower, a bathtub, and a toilet. A half-bath, also called a powder room or guest bath, only has a sink and a toilet. As such, half-bathrooms will have half the value of a full bath.

In [35]:

bath_df = all_data[['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'SalePrice']]
bath_df = bath_df[~bath_df['SalePrice'].isnull()]
bath_vars = list(bath_df)
del bath_vars[-1]

bath_summary = []

for bath_type in bath_vars:
    corr_val = np.corrcoef(bath_df[bath_type], bath_df['SalePrice'])[1][0]
    bath_summary.append(corr_val)
    
pd.DataFrame([bath_summary], columns=bath_vars, index=['SalePrice Correlation'])

Out[35]:

	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath
SalePrice Correlation	0.236224	-0.005149	0.594771	0.313982

In [36]:

all_data['TotalBath'] = all_data['BsmtFullBath'] + (all_data['BsmtHalfBath']*0.5) + all_data['FullBath'] + (all_data['HalfBath']*0.5)

plt.figure(figsize=(10,4))
sns.countplot(all_data['TotalBath'], color='grey')
plt.tight_layout()
plt.show()

plt.figure(figsize=(10,6))
sns.regplot(all_data['TotalBath'].iloc[:ntrain], train['SalePrice'], line_kws={'color': 'red'})
ax.set(ylabel='ln(SalePrice)')
plt.show()

print("Pearson Correlation Coefficient: %.3f" %  (np.corrcoef(all_data['TotalBath'].iloc[:ntrain], train['SalePrice']))[1][0])

Pearson Correlation Coefficient: 0.632

We can see a high positive correlation (r = 0.67) between TotalBath and SalePrice. The new variable correlation is much higher than any of the four original bathroom variables.

6.5 Binning the Neighbourhoods¶

In [37]:

neighborhood_prices = all_data.iloc[:ntrain].copy()
neighborhood_prices['SalePrice'] = np.exp(all_data['SalePrice'])
neighborhood_prices = neighborhood_prices[['Neighborhood', 'SalePrice']].groupby('Neighborhood').median().sort_values('SalePrice')

plt.figure(figsize=(15,7))
ax = sns.barplot(x= neighborhood_prices.index, y=neighborhood_prices['SalePrice'], color='grey')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

In order to bin the neighbourhoods into approriate clusters, we will use K-Mean clustering. K-Means is an unsupervised machine learning algorithm that groups a dataset into a user-specified number (k) of clusters. One potential issue with the algorithm is that it will cluster the data into k clusters, even if k is not the right number of cluster to use.

Therefore, we need to identity the optimal number of k clusters to use. To do so, we will use the Elbow Method.

Briefly, the idea of the elbow method is to run k-means clustering for a range of values and calculate the sum of squared errors (SSE). We then plot a line chart of the SSE for each value of k. Each additional k cluster will result in a lower SSE, but will eventually exhibit significantly diminished return.

The goal is the choose a small value of k with a low SSE, after which the subsequent k values exhibit diminishing returns. If the line plot looks like an arm, then the "elbow" of the arm is the optimal k value.

The plot below indicates that the optimal number of neighbourhood clusters is k = 3.

In [38]:

SS_distances = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(neighborhood_prices)
    SS_distances.append(km.inertia_)
    

plt.plot(K, SS_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()

Now let's see how the binned neighborhoods look like.

In [39]:

neighborhood_prices['Cluster'] = KMeans(n_clusters=3).fit(neighborhood_prices).labels_

plt.figure(figsize=(15,7))
ax = sns.barplot(x= neighborhood_prices.index, y=neighborhood_prices['SalePrice'], 
                 hue=neighborhood_prices['Cluster'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

Lastly, we have to create a new variable in the dataset based on the new cluster labels.

In [40]:

neighborhood_dict = dict(zip(neighborhood_prices.index, neighborhood_prices.Cluster))
all_data['Neighborhood_Class'] = all_data['Neighborhood']
all_data['Neighborhood_Class'].replace(neighborhood_dict, inplace = True)

7 LotFrontage Imputation¶

Due to the numeric nature and sheer number of missing values in the LotFrontage column, we will impute the missing values using an SVM Regressor algorithm to predict the missing values.

The first step is to subset the dataset into two groups. The train dataset (train_LotFrontage) contains the complete records while the test dataset (test_LotFrontage) contains the missing values. Next, we examine the structure of the newly created datasets. There are 486 missing values in the full dataset, or roughly 16% of the data.

7.1 LotFrontage Data Structure¶

In [41]:

train_LotFrontage = all_data[~all_data.LotFrontage.isnull()]
test_LotFrontage = all_data[all_data.LotFrontage.isnull()]

print("\n")
print("Dimensions of Train LotFrontage Dataset:" + str(train_LotFrontage.shape))
print("Dimensions of Test LotFrontage Dataset:" + str(test_LotFrontage.shape))
display(pd.DataFrame(all_data['LotFrontage'].describe()).transpose())


Dimensions of Train LotFrontage Dataset:(2433, 81)
Dimensions of Test LotFrontage Dataset:(486, 81)

	count	mean	std	min	25%	50%	75%	max
LotFrontage	2433.0	69.305795	23.344905	21.0	59.0	68.0	80.0	313.0

Now let's examine the distribution of LotFrontages values in the Train LotFrontage dataset through a boxplot and distribution plot. Through these graphs, we can see several interesting observations appear:

There is a cluster of low LotFrontage value properties, shown as a peak on the far left of the distribution plot. The boxplot indicates these values may be outliers, shown as outlier points on the left of the whisker of the boxplot.
There is a long tail of high LotFrontage value properties. These values extend beyond the Median + 1.5IQR range.

In other words, there are outliers on both ends of the LotFrontage value distributions.

In [42]:

fig, ax = plt.subplots(1,2, figsize=(16,4))
sns.boxplot(train_LotFrontage['LotFrontage'], ax=ax[0])

sns.distplot(train_LotFrontage['LotFrontage'], ax=ax[1], fit=st.norm, fit_kws={"color":"red"})
(mu, sigma) = st.norm.fit(train_LotFrontage['LotFrontage'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and  $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best', fancybox=True)
plt.show()

7.2 Outlier Detection & Removal¶

Before we examine the correlations between LotFrontage and other variables, we should remove the outliers seen above.

To do so, we will use the Interquartile Range (IQR) method, where values outside the Median ± 1.5IQR are considered to be outliers. These outliers are then removed from the LotFrontage datasets.

In [43]:

def outlier_detection(data):
    Q1, Q3 = np.percentile(data, [25,75])
    IQR = Q3-Q1
    lower_cutoff = Q1 - (IQR * 1.5)
    upper_cutoff = Q3 + (IQR * 1.5)
    outliers = (data > Q3+1.5*IQR) | (data < Q1-1.5*IQR)
    return outliers

In [44]:

train_LotFrontage_no_outliers = train_LotFrontage[~outlier_detection(train_LotFrontage.LotFrontage)]
fig, ax = plt.subplots(1,2, figsize=(16,4))
sns.boxplot(train_LotFrontage_no_outliers['LotFrontage'], ax=ax[0])

sns.distplot(train_LotFrontage_no_outliers['LotFrontage'], ax=ax[1], fit=st.norm, fit_kws={"color":"red"})
(mu, sigma) = st.norm.fit(train_LotFrontage_no_outliers['LotFrontage'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and  $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best', fancybox=True)
plt.show()

Two new values previously not deemed outliers by the first iteration of the IQR range method are now shown as potential outliers in the boxplot. This is caused by having a new smaller IQR value after removing the first batch of outliers. As such, we will keep these values.

7.3 Determining Relevant Variables of LotFrontage¶

Now we can explore the relationship between the LotFrontage target variable and other features. In order to confirm a relationship between these key features and we will conduct an ANOVA (Analysis of Variance) test to determine statistical significance (in this case, p < 0.01).

Using the type III ANOVA, we are confident the variables listed in the table below are correlated with LotFrontage.

Note: Variables that begin with a number (1stFlrSF, 2ndFlrSF, and 3SsnPorch) cause a syntax error within the ols formula input. I briefly converted them to an appropriate name in the temp_train dataset for the purpose of the ANOVA.

In [45]:

temp_train = train_LotFrontage_no_outliers.copy()
temp_train.rename(columns={'1stFlrSF':'X1stFlrSF', '2ndFlrSF': 'X2ndFlrSF', '3SsnPorch': 'X3SsnPorch'}, inplace=True)

mod = ols('LotFrontage ~ X1stFlrSF + X2ndFlrSF + X3SsnPorch + Alley + BedroomAbvGr + BldgType + BsmtCond + BsmtExposure + BsmtFinSF1 + BsmtFinSF2 + BsmtFinType1 + BsmtFinType2 + BsmtFullBath + BsmtHalfBath + BsmtQual + BsmtUnfSF + CentralAir + Electrical + EnclosedPorch + ExterCond + ExterQual + Fence + FireplaceQu + Fireplaces + Foundation + FullBath + Functional + GarageArea + GarageCars + GarageCond + GarageFinish + GarageQual + GarageType + GarageYrBlt + GrLivArea + HalfBath + Heating + HeatingQC + HouseStyle + KitchenAbvGr + KitchenQual + LandContour + LandSlope + LotArea + LotConfig + LotShape + LowQualFinSF + MSSubClass + MSZoning + MasVnrArea + MasVnrType + MiscFeature + MiscVal + MoSold + Neighborhood + OpenPorchSF + OverallCond + OverallQual + PavedDrive + PoolArea + PoolQC + RoofMatl + RoofStyle + SaleCondition + SaleType + ScreenPorch + Street + TotRmsAbvGrd + TotalBsmtSF + WoodDeckSF + YearBuilt + YearRemodAdd + YrSold + MixedConditions + MixedExterior + TotalSF + TotalBath + Neighborhood_Class', data=temp_train).fit()
aov_table = sm.stats.anova_lm(mod, typ=3)
display(aov_table[aov_table['PR(>F)'] < 0.01])

	sum_sq	df	F	PR(>F)
Alley	1363.913911	2.0	7.503367	5.674399e-04
BldgType	20479.667110	4.0	56.332901	5.337216e-45
GarageCond	1600.666924	5.0	3.522331	3.573473e-03
GarageFinish	1607.144665	3.0	5.894309	5.300349e-04
GarageQual	1851.155770	5.0	4.073542	1.108720e-03
GarageType	4398.086327	6.0	8.065136	1.269477e-08
HouseStyle	3641.202949	7.0	5.723294	1.433756e-06
LotConfig	43527.367185	4.0	119.729624	2.398118e-91
LotShape	2826.722888	3.0	10.367193	9.089207e-07
Neighborhood	14742.824462	23.0	7.052646	7.333663e-22
RoofMatl	2519.609152	5.0	5.544500	4.462709e-05
MixedConditions	4309.317372	18.0	2.634118	2.091806e-04
MixedExterior	12661.271705	79.0	1.763394	5.468941e-05
GarageCars	999.666184	1.0	10.999026	9.284905e-04
GarageYrBlt	1008.361079	1.0	11.094693	8.820554e-04
LotArea	5292.799666	1.0	58.235079	3.629421e-14
MSSubClass	997.020109	1.0	10.969912	9.431071e-04
TotalSF	636.362078	1.0	7.001700	8.209306e-03

7.4 LotFrontage Model Building and Evaluation¶

Now let's build a model using the relevant variables selected above. We will be using Support Vector Regressor to build the model.

The first step is to seperate the target variable, LotFrontage, select the relevant variables, and dummifying the categorical variables.

In [46]:

X = train_LotFrontage_no_outliers[aov_table[aov_table['PR(>F)'] < 0.01].index]
y = train_LotFrontage_no_outliers['LotFrontage']

X = pd.get_dummies(X)
transformer = StandardScaler().fit(X)
X = transformer.transform(X)

In order to determine the accuracy of our predictive model, we will create a Validation dataset. This dataset will have known values for the target LotFrontage variable which can we compare our model's prediction against. Using the mean absolute error, we can measure the difference between our predicted LotFrontage values with the true values.

![Train Test Split](Train Test Split.png)

In [47]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1234)

clf = svm.SVR(kernel='rbf', C=100, gamma=0.001)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print('Mean Absolute Error:  %.3f' % mean_absolute_error(y_test, preds))

Mean Absolute Error:  6.848

These results show that using the SVR model to impute LotFrontage gives an average rror of less than 7 feet.

We will now use the same SVR model to predict the unknown LotFrontage values in the test_Frontage dataset.

In [48]:

model_data = train_LotFrontage_no_outliers.copy()
model_data= model_data.append(test_LotFrontage)
y = model_data['LotFrontage']
model_data = model_data[['MSZoning', 'Alley', 'LotArea', 'LotShape', 'LotConfig', 'Neighborhood', 'MixedConditions', 'GarageType', 'GarageCars', 'GarageArea']]

model_data = pd.get_dummies(model_data)
model_X_train = model_data[~y.isnull()]
model_X_test = model_data[y.isnull()]
model_y_train = y[~y.isnull()]

transformer = StandardScaler().fit(model_X_train)
model_X_train = transformer.transform(model_X_train)
model_X_test = transformer.transform(model_X_test)

clf = svm.SVR(kernel='rbf', C=100, gamma=0.001)
clf.fit(model_X_train, model_y_train)
LotFrontage_preds = clf.predict(model_X_test)

Now that we have the newly predicted LotFrontage values, we can examine the distribution of predicted values relative to the known LotFrontage values in the training dataset. Both distributions have a mean around 70 feet with similar tail lengths on either end.

In [49]:

sns.distplot(model_y_train)
sns.distplot(LotFrontage_preds)

Out[49]:

<matplotlib.axes._subplots.AxesSubplot at 0x1550eb4dba8>

Lastly, we need to add the predicted values back into the original dataset.

In [50]:

all_data.LotFrontage[all_data.LotFrontage.isnull()] = LotFrontage_preds

C:\Users\bashk\Anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

8 Preparing the Data for Modelling¶

8.1 Removing Outliers¶

Although we could be more in depth in our outliet detection and removal, we are only going to remove the two outliers found in the TotalSF variable. These properties exhibited large total square feet values with low SalePrice.

In [51]:

all_data.drop(all_data['TotalSF'].iloc[:ntrain].nlargest().index[:2], axis=0, inplace=True)    
ntrain = train.shape[0]-2

8.2 Correlation Between Numeric Predictors¶

We can see a large correlation between many variables, suggesting a high level of multicollinearity. There are two options to consider:

Use a regression model that deals well with multicollinearity, such as a ridge regression.
Remove highly correlated predictors from the model.

In [52]:

corr = all_data.corr()
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(15, 15))
cmap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap,
            center=0, square=True, linewidths = .5)
plt.show()

8.3 Label Encoding¶

Many of the ordinal variables are presented as numeric values. Therefore we need to label encode these numeric characters into strings.

In [53]:

num_to_str_columns = ['MSSubClass', 'OverallQual', 'OverallCond', 'MoSold', 'YrSold']

for col in num_to_str_columns:
    all_data[col] = all_data[col].astype('str')

In [54]:

cat_cols = ['OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 
            'BsmtFinType2', 'BsmtFinSF2', 'HeatingQC', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 
            'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Fireplaces', 'FireplaceQu', 'GarageFinish', 'GarageCars',
            'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'MoSold', 'YrSold']

for col in cat_cols:
    label = LabelEncoder()
    label.fit(list(all_data[col].values))
    all_data[col] = label.transform(list(all_data[col].values))

print('Shape all_data: {}'.format(all_data.shape))
all_data.head()

Shape all_data: (2917, 81)

Out[54]:

	1stFlrSF	2ndFlrSF	Alley	BedroomAbvGr	BldgType	BsmtCond	BsmtExposure	BsmtFinSF1	...	WoodDeckSF	YearBuilt	YearRemodAdd	YrSold	Dataset	MixedConditions	MixedExterior	TotalSF	TotalBath	Neighborhood_Class
0	856	854	None	3	1Fam	4	3	706.0	...	0	110	53	2	Train	Norm - Norm	VinylSd - VinylSd	2566.0	3.5	0
1	1262	0	None	3	1Fam	4	1	978.0	...	298	83	26	1	Train	Feedr - Norm	MetalSd - MetalSd	2524.0	2.5	0
2	920	866	None	3	1Fam	4	2	486.0	...	0	108	52	2	Train	Norm - Norm	VinylSd - VinylSd	2706.0	3.5	0
3	961	756	None	3	1Fam	1	3	216.0	...	0	25	20	0	Train	Norm - Norm	Wd Sdng - Wd Shng	2473.0	2.0	0
4	1145	1053	None	4	1Fam	4	0	655.0	...	192	107	50	2	Train	Norm - Norm	VinylSd - VinylSd	3343.0	3.5	2

5 rows × 81 columns

8.4 Skewness & Normalization of Numeric Variables¶

Skewness is a measure of asymmetry of a distribution, and can be used to define the extent to which the distribution differs from a normal distribution. Therefore, a normal distribution will have a skewness of 0. As a rule of thumb, if skewness is less than -1 or greater than 1, the distribution is highly skewed.

In order to account for skewness, we will transform the (highly) skewed data into normality using a Log Transformation. We define highly skewed data as variables with a skewness greater than 0.85. This method is similar to the approach used to normalize the SalePrice Response Variable, except we will use log+1 to avoid division by zero issues.

In [55]:

numeric_feats = ['LotFrontage', 'LotArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 
                 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
                 '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'TotalSF']

skewed_feats = all_data[numeric_feats].apply(lambda x: st.skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew Before Transformation' :skewed_feats})

skewness = skewness[abs(skewness) > 1].dropna(axis=0)
skewed_features = skewness.index
for feat in skewed_features:
    all_data[feat] = np.log1p(all_data[feat]+1)

skewed_feats = all_data[skewed_features].apply(lambda x: st.skew(x.dropna())).sort_values(ascending=False)
skewness['Skew After Transformation'] = skewed_feats
skewness

Out[55]:

	Skew Before Transformation	Skew After Transformation
MiscVal	21.939672	5.255296
PoolArea	17.688664	15.655054
LotArea	13.109495	-0.532117
LowQualFinSF	12.084539	8.605557
3SsnPorch	11.372080	8.853408
EnclosedPorch	4.002344	1.983518
ScreenPorch	3.945101	2.954087
OpenPorchSF	2.529358	0.019004
WoodDeckSF	1.844792	0.179665
1stFlrSF	1.257286	0.031286
GrLivArea	1.068750	-0.021271
LotFrontage	1.067631	-0.889232
TotalSF	1.009157	-0.428615

8.5 One Hot Encoding the Categorical Variables¶

The last step needed to prepare the data is to make sure that all categorical predictor variables are converted into a form that is usable by machine learning algorithms. This process is known as 'one-hot encoding' the categorical variables.

The process involved all non-ordinal factors receiving their own separate column with 1's and 0's, and is required by most ML algorithms.

In [56]:

all_data = pd.get_dummies(all_data)
print(all_data.shape)
all_data.head(3)

(2917, 317)

Out[56]:

	1stFlrSF	2ndFlrSF	3SsnPorch	BedroomAbvGr	BsmtCond	BsmtExposure	BsmtFinSF1	BsmtFinType1	BsmtFinType2	...
0	6.754604	854	0.693147	3	4	3	706.0	2	6	...
1	7.142037	0	0.693147	3	4	1	978.0	0	6	...
2	6.826545	866	0.693147	3	4	2	486.0	2	6	...

3 rows × 317 columns

9 SalePrice Modelling¶

Now that the data is correctly processed, we are ready to begin building our predictive models.

9.1 Obtaining Final Train and Test Sets¶

In [57]:

final_y_train = all_data['SalePrice'][~all_data['SalePrice'].isnull()]
final_X_train = all_data[all_data['Dataset_Train'] == 1].drop(['Dataset_Train', 'Dataset_Test', 'SalePrice'], axis=1)
final_X_test = all_data[all_data['Dataset_Test'] == 1].drop(['Dataset_Train', 'Dataset_Test', 'SalePrice'], axis=1)

9.2 Defining a Cross Validation Strategy¶

We will use the cross_val_score function of Sklearn and calculate the Root-Mean-Squared Error (RMSE) as a measure of accuracy.

In [58]:

n_folds = 10

def rmse_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=1234).get_n_splits(final_X_train.values)
    rmse= np.sqrt(-cross_val_score(model, final_X_train.values, final_y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

9.3 Lasso Regression Model¶

This model may be senstitive to outliers, so we need to make it more robust. This will be done using the RobustScaler function on pipeline.

In [59]:

lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.0005, random_state = 1234))

Next, we tested out multiple alpha values and compared their accuracy using the rmse_cv function above. The results below show that using a value of 0.00025 is the optimal value of alpha for a Lasso Regression.

In [60]:

lasso_alpha = [1, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001, 0.0005, 0.00025, 0.0001]
lasso_rmse = []

for value in lasso_alpha:
    lasso = make_pipeline(RobustScaler(), Lasso(alpha = value, max_iter=3000, random_state = 1234))
    lasso_rmse.append(rmse_cv(lasso).mean())
    
lasso_score_table = pd.DataFrame(lasso_rmse,lasso_alpha,columns=['RMSE'])
display(lasso_score_table.transpose())

plt.semilogx(lasso_alpha, lasso_rmse)
plt.xlabel('alpha')
plt.ylabel('score')
plt.show()

print("\nLasso Score: {:.4f} (alpha = {:.5f})\n".format(min(lasso_score_table['RMSE']), lasso_score_table.idxmin()[0]))

C:\Users\bashk\Anaconda\lib\site-packages\sklearn\linear_model\coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)

	1.0	0.5	0.25	0.1	0.05	0.025	0.01	0.005	0.0025	0.001	0.0005	0.00025	0.0001
RMSE	0.398841	0.398841	0.339897	0.239978	0.189281	0.154991	0.135898	0.130271	0.122606	0.114953	0.11294	0.112421	0.114048

Lasso Score: 0.1124 (alpha = 0.00025)

Using the newly defined alpha value, we can optimize the model and predict the missing values of SalePrice. The predictions are then formatted in a appropriate layout for submission to Kaggle.

In [61]:

lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.00025, random_state = 1234))
lasso.fit(final_X_train, final_y_train)
lasso_preds = np.expm1(lasso.predict(final_X_test))

sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = lasso_preds
#sub.to_csv('Lasso Submission 2.csv',index=False)

LASSO Regression Score: 0.12679¶

9.4 Ridge Regression Model¶

Identical steps were taken for the ridge regression model as we took for the Lasso Regression Model. In the case of the Ridge model, the optimal alpha value was 6.

In [62]:

ridge_alpha = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
ridge_rmse = []

for value in ridge_alpha:
    ridge = make_pipeline(RobustScaler(), Ridge(alpha = value, random_state = 1234))
    ridge_rmse.append(rmse_cv(ridge).mean())
    
ridge_score_table = pd.DataFrame(ridge_rmse,ridge_alpha,columns=['RMSE'])
display(ridge_score_table.transpose())

plt.semilogx(ridge_alpha, ridge_rmse)
plt.xlabel('alpha')
plt.ylabel('score')
plt.show()

print("\nRidge Score: {:.4f} (alpha = {:.4f})\n".format(min(ridge_score_table['RMSE']), ridge_score_table.idxmin()[0]))

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
RMSE	0.115314	0.113956	0.113409	0.113163	0.113059	0.113033	0.113051	0.113097	0.11316	0.113234	0.113315	0.113401	0.11349	0.113581	0.113672

Ridge Score: 0.1130 (alpha = 6.0000)

In [63]:

ridge = make_pipeline(RobustScaler(), Ridge(alpha = 6, random_state = 1234))
ridge.fit(final_X_train, final_y_train)
ridge_preds = np.expm1(ridge.predict(final_X_test))

sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = ridge_preds
#sub.to_csv('Ridge Submission.csv',index=False)

Ridge Regression Score: 0.12528¶

9.5 XGBoost Model¶

Since there are multiple hyperparameters to tune in the XGBoost model, we will use the GridSearchCV function of Sklearn to determine the optimal values. Next, I used the train_test_split function to generate a validation set and find the RMSE of the models. This method is used in lieu of the rmse_cv function used above for the Lasso and Ridge Regression models.

In [64]:

xg_X_train, xg_X_test, xg_y_train, xg_y_test = train_test_split(final_X_train, final_y_train, test_size=0.33, random_state=1234)

xg_model = XGBRegressor(n_estimators=100, seed = 1234)
param_dict = {'max_depth': [3,4,5],
              'min_child_weight': [2,3,4],
              'learning_rate': [0.05, 0.1,0.15],
              'gamma': [0.0, 0.1, 0.2]
}

start = time()
grid_search = GridSearchCV(xg_model, param_dict)
grid_search.fit(xg_X_train, xg_y_train)
print("GridSearch took %.2f seconds to complete." % (time()-start))
display(grid_search.best_params_)

GridSearch took 177.74 seconds to complete.

{'gamma': 0.0, 'learning_rate': 0.1, 'max_depth': 4, 'min_child_weight': 4}

Now that the hyperparameter have been chosen, we can calculate the validation RMSE of the XGBoost model.

In [65]:

xg_model = XGBRegressor(n_estimators = 1000,
                        learning_rate = 0.1,
                        max_depth = 4,
                        min_child_weight = 4,
                        gamma = 0,
                        seed = 1234)
start = time()
xg_model.fit(xg_X_train, xg_y_train)
xg_preds = xg_model.predict(xg_X_test)
print("Model took %.2f seconds to complete." % (time()-start))
print("RMSE: %.4f" % sqrt(mean_squared_error(xg_y_test, xg_preds)))

Model took 12.10 seconds to complete.
RMSE: 0.1238

Lastly, we predict the SalePrice using the test data. The predictions are then formatted in a appropriate layout for submission to Kaggle.

In [66]:

xg_model = XGBRegressor(n_estimators = 1000,
                        learning_rate = 0.1,
                        max_depth = 4,
                        min_child_weight = 4,
                        gamma = 0,
                        seed = 1234)
xg_model.fit(final_X_train, final_y_train)
xg_preds = np.expm1(xg_model.predict(final_X_test))

In [67]:

sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = xg_preds
#sub.to_csv('XGBoost Submission.csv',index=False)

XGBoost Regression Model: 0.12799¶

9.6 Ensemble Model¶

Since the Lasso, Ridge, XGBoost algorithms so different, averaging the final Saleprice predictions may improve the accuracy. Since the Ridge regression performed the best with regards to the final RMSE (0.125 vs 0.126 and 0.127), I will assign it's weight to be double that of the other two models.

Our final ensemble model performed better than any individual regression model (RMSE = 0.12220).

In [68]:

lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.00025, random_state = 1234))
lasso.fit(final_X_train, final_y_train)
lasso_preds = np.expm1(lasso.predict(final_X_test))

ridge = make_pipeline(RobustScaler(), Ridge(alpha = 6, random_state = 1234))
ridge.fit(final_X_train, final_y_train)
ridge_preds = np.expm1(ridge.predict(final_X_test))

xg_model = XGBRegressor(n_estimators = 1000,
                        learning_rate = 0.1,
                        max_depth = 4,
                        min_child_weight = 4,
                        gamma = 0,
                        seed = 1234)
xg_model.fit(final_X_train, final_y_train)
xg_preds = np.expm1(xg_model.predict(final_X_test))

weights = [0.5, 0.25, 0.25]

sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = (ridge_preds*weights[0]) + (lasso_preds*weights[1]) + (xg_preds*weights[2])
#sub.to_csv('Ensemble Submission.csv',index=False)