This goal of this project was to predict sales prices and practice feature engineering, RFs, and gradient boosting. The dataset was part of the House Prices Kaggle Competition.
Top submissions on the public leaderboard range between 0.10-0.12
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
As part of a Kaggle competition dataset, the accuracy of the sales prices was evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.
I started this competition by focusing on getting a thorough understanding of the dataset. Particular attention was paid to impute the missing values within the dataset. The EDA process is detailed as well as visualized.
In this project, I created a predictive model that has been trained on data collected from homes in Ames, Iowa. Three algorithms were used, and their validation set RMSE and test set RMSE are listed below:
Regression Model | Validation RMSE | Test RMSE |
---|---|---|
Ridge | 0.1130 | 0.12528 |
Lasso | 0.1125 | 0.12679 |
XGBoost | 0.1238 | 0.12799 |
Ensemble | 0.12220 |
The Ridge regression model performed the best as a single model, likely due to the high multicollinearity. However, combining it with the Lasso and XGBoost regression models resulting in a higher prediction accuracy and a lower RMSE (0.12220 vs 0.12528).
The dataset used for this project is the Ames Housing dataset that was compiled by Dean De Cock for use in data science education. It is an alternative to the popular but older Boston Housing dataset.
The Ames Housing dataset is also used in the Advanced Regression Techniques challenge on the Kaggle Website. These competitions is a great way to improve my skills and measure my progress as a data scientist.
Kaggle describes the competition as follows:
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Loading Python packages used in the project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from time import time
from math import sqrt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import scipy.stats as st
from scipy.special import boxcox1p
from sklearn.cluster import KMeans
from sklearn import svm
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import RobustScaler, LabelEncoder, StandardScaler
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from xgboost.sklearn import XGBRegressor
%matplotlib inline
Now, we read in the csv's as datarames into Python.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
In total, there are 81 columns/variables in the train dataset, including the response variable (SalePrice). I am only displaying a subset of the variables, as all of them will be discussed in more detail throughout the notebook.
The train dataset consists of character and integer variables. Many of these variables are ordinal factors, despite being represented as character or integer variables. These will require cleaning and/or feature engineering later.
print("Dimensions of Train Dataset:" + str(train.shape))
print("Dimensions of Test Dataset:" + str(test.shape))
Dimensions of Train Dataset:(1460, 81) Dimensions of Test Dataset:(1459, 80)
train.iloc[:,0:10].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 10 columns): Id 1460 non-null int64 MSSubClass 1460 non-null int64 MSZoning 1460 non-null object LotFrontage 1201 non-null float64 LotArea 1460 non-null int64 Street 1460 non-null object Alley 91 non-null object LotShape 1460 non-null object LandContour 1460 non-null object Utilities 1460 non-null object dtypes: float64(1), int64(3), object(6) memory usage: 114.1+ KB
Next, we are going to define a few variables that will be used in later analyses as well as being required for the submission file.
y_train = train['SalePrice']
test_id = test['Id']
ntrain = train.shape[0]
ntest = test.shape[0]
Lastly, we are going to merge the train and test datasets to explore the data as well as impute any missing values.
all_data = pd.concat((train, test), sort=True).reset_index(drop=True)
all_data['Dataset'] = np.repeat(['Train', 'Test'], [ntrain, ntest], axis=0)
all_data.drop('Id', axis=1,inplace=True)
The probability distribution plot show that the sale prices are right skewed. This is to be expected as few people can afford very expensive houses.
sns.set_style('whitegrid')
sns.distplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], axlabel="Normal Distribution", fit=st.norm, fit_kws={"color":"red"})
plt.title('Distribution of Sales Price in Dollars')
(mu, sigma) = st.norm.fit(train['SalePrice'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best', fancybox=True)
plt.show()
st.probplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], plot=plt)
plt.show()
Linear models tend to work better with normally distributed data. As such, we need to transform the response variable to make it more normally distributed.
all_data['SalePrice'] = np.log1p(all_data['SalePrice'])
sns.distplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], axlabel="Normal Distribution", fit=st.norm, fit_kws={"color":"red"})
plt.title('Distribution of Transformed Sales Price in Dollars')
(mu, sigma) = st.norm.fit(train['SalePrice'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best', fancybox=True)
plt.show()
st.probplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], plot=plt)
plt.show()
The skew is highly corrected and the distribution of the log-transformed sale prices appears more normally distributed.
We first need to find which variables contain missing values.
cols_with_missing_values = all_data.isnull().sum().sort_values(ascending=False)
display(pd.DataFrame(cols_with_missing_values[cols_with_missing_values[cols_with_missing_values > 0].index],
columns=["Number of Missing Values"]))
Number of Missing Values | |
---|---|
PoolQC | 2909 |
MiscFeature | 2814 |
Alley | 2721 |
Fence | 2348 |
SalePrice | 1459 |
FireplaceQu | 1420 |
LotFrontage | 486 |
GarageFinish | 159 |
GarageQual | 159 |
GarageYrBlt | 159 |
GarageCond | 159 |
GarageType | 157 |
BsmtCond | 82 |
BsmtExposure | 82 |
BsmtQual | 81 |
BsmtFinType2 | 80 |
BsmtFinType1 | 79 |
MasVnrType | 24 |
MasVnrArea | 23 |
MSZoning | 4 |
Utilities | 2 |
BsmtFullBath | 2 |
BsmtHalfBath | 2 |
Functional | 2 |
Electrical | 1 |
KitchenQual | 1 |
Exterior2nd | 1 |
Exterior1st | 1 |
GarageCars | 1 |
GarageArea | 1 |
TotalBsmtSF | 1 |
BsmtUnfSF | 1 |
BsmtFinSF2 | 1 |
BsmtFinSF1 | 1 |
SaleType | 1 |
cols_with_missing_values = all_data.isnull().sum().sort_values(ascending=False)
cols_with_missing_values = all_data[cols_with_missing_values[cols_with_missing_values > 0].index]
msno.bar(cols_with_missing_values)
plt.show()
all_data['PoolQC'].replace(np.nan, 'None', regex=True, inplace=True)
all_data['MiscFeature'].replace(np.nan, 'None', regex=True, inplace=True)
all_data['Alley'].replace(np.nan, 'None', regex=True, inplace=True)
all_data['Fence'].replace(np.nan, 'None', regex=True, inplace=True)
all_data['FireplaceQu'].replace(np.nan, 'None', regex=True, inplace=True)
garage_categorical = ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']
for variable in garage_categorical:
all_data[variable].replace(np.nan, 'None', regex=True, inplace=True)
garage_numeric = ['GarageYrBlt', 'GarageCars', 'GarageArea']
for variable in garage_numeric:
all_data[variable].replace(np.nan, 0, regex=True, inplace=True)
basement_categorical = ['BsmtQual','BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
for variable in basement_categorical:
all_data[variable].replace(np.nan, 'None', regex=True, inplace=True)
basement_numeric = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']
for variable in basement_numeric:
all_data[variable].replace(np.nan, 0, regex=True, inplace=True)
all_data['MasVnrType'].replace(np.nan, 'None', regex=True, inplace=True)
all_data['MasVnrArea'].replace(np.nan, 0, regex=True, inplace=True)
all_data['Functional'].replace(np.nan, 'Typ', regex=True, inplace=True)
Using a mode imputation, we replace the missing values of a categorical variable with the mode of the non-missing cases of that variable. While it does have the advantage of being fast, it comes at the cost of a reduction in variance within the dataset.
Due to the low number of missing values imputed using this method, the bias introduced on the mean and standard deviation, as well as correlations with other variables are minimal.
First, we define a function "mode_impute_and_plot" to simplify the process of visualizing the categorical variables and replacing the missing variables with the most frequent value.
def mode_impute_and_plot(variable):
print('# of missing values: ' + str(all_data[variable].isna().sum()))
plt.figure(figsize=(8,4))
ax = sns.countplot(all_data[variable])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
all_data[variable].replace(np.nan, all_data[variable].mode()[0], regex=True, inplace=True)
Now we can proceed to replace the missing values for the following variables:
MSZoning | Utilities | Electrical | [Exterior1st and Exterior2nd](#Exterior1st and Exterior2nd) | KitchenQual | SaleType |
mode_impute_and_plot('MSZoning')
# of missing values: 4
mode_impute_and_plot('Utilities')
all_data = all_data.drop('Utilities', axis=1)
# of missing values: 2
mode_impute_and_plot('Electrical')
# of missing values: 1
mode_impute_and_plot('Exterior1st')
# of missing values: 1
mode_impute_and_plot('Exterior2nd')
# of missing values: 1
mode_impute_and_plot('KitchenQual')
# of missing values: 1
mode_impute_and_plot('SaleType')
# of missing values: 1
Are there any remaining missing values?
cols_with_missing_values = all_data.isnull().sum().sort_values(ascending=False)
display(pd.DataFrame(cols_with_missing_values[cols_with_missing_values[cols_with_missing_values > 0].index], columns=["Number of Missing Values"]))
Number of Missing Values | |
---|---|
SalePrice | 1459 |
LotFrontage | 486 |
Due to its numeric nature and the large number of missing values, the LotFrontage variable will be imputed separately using an SVM algorithm (See Section 7 LotFrontage Imputation).
In order to simplify and boost the accuracy of the preditive models, we will merge the two conditions into one variable: MixedConditions
The data descriptions states:
If a property does not have one or multiple conditions, then it is classified as normal. However, designation of "normal" are condition 1 or condition 2 is strictly alphabetical.
For example, if a property is in proximity to a feeder street ("Feedr") and no other condition, then the data would appear as follows:
Condition1 | Condition2 |
---|---|
Feedr | Norm |
However, if a property is within 200' of East-West Railroad (RRNe) and no other condition, then the data would appear as follows:
Condition1 | Condition2 |
---|---|
Norm | RRNe |
Once we merge Conditions 1 & 2 into the MixedConditions variable, we will remove them from the analysis.
all_data['MixedConditions'] = all_data['Condition1'] + ' - ' + all_data['Condition2']
all_data.drop(labels=['Condition1', 'Condition2'], axis=1, inplace=True)
The Exterior1st and Exterior2nd features are similar to the Conditions feature we merged and remove above. Properties with multiple types of exterior covering the house are assigned to Exterior1st or Exterior2nd alphabetically.
As such, we will conduct the same process to merge the two columns into a single MixedExterior variable and remove them from the analysis.
all_data['MixedExterior'] = all_data['Exterior1st'] + ' - ' + all_data['Exterior2nd']
all_data.drop(labels=['Exterior1st', 'Exterior2nd'], axis=1, inplace=True)
One of the important factors that people consider when buying a house is the total living space in square feet. Since the total square feet is not explicitly listed, we will add a new variable by adding up the square footage of the basement, first floor, and the second floor.
SF_df = all_data[['TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'SalePrice']]
SF_df = SF_df[~SF_df['SalePrice'].isnull()]
SF_vars = list(SF_df)
del SF_vars[-1]
SF_summary = []
for SF_type in SF_vars:
corr_val = np.corrcoef(SF_df[SF_type], SF_df['SalePrice'])[1][0]
SF_summary.append(corr_val)
pd.DataFrame([SF_summary], columns=SF_vars, index=['SalePrice Correlation'])
TotalBsmtSF | 1stFlrSF | 2ndFlrSF | |
---|---|---|---|
SalePrice Correlation | 0.612134 | 0.596981 | 0.3193 |
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
plt.figure(figsize=(10,6))
ax = sns.regplot(all_data['TotalSF'], all_data['SalePrice'], line_kws={'color': 'red'})
ax.set(ylabel='ln(SalePrice)')
plt.show()
print("Pearson Correlation Coefficient: %.3f" % (np.corrcoef(all_data['TotalSF'].iloc[:ntrain], train['SalePrice']))[1][0])
Pearson Correlation Coefficient: 0.782
There is a very strong correlation (r = 0.78) between the TotalSF and the SalePrice, with the exception of two outliers. These outliers should be removed to increase the accuracy of our model.
There are 4 bathroom variables. Individually, these variables are not very important. Together, however, these predictors are likely to become a strong one.
A full bath is made up of four parts: a sink, shower, a bathtub, and a toilet. A half-bath, also called a powder room or guest bath, only has a sink and a toilet. As such, half-bathrooms will have half the value of a full bath.
bath_df = all_data[['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'SalePrice']]
bath_df = bath_df[~bath_df['SalePrice'].isnull()]
bath_vars = list(bath_df)
del bath_vars[-1]
bath_summary = []
for bath_type in bath_vars:
corr_val = np.corrcoef(bath_df[bath_type], bath_df['SalePrice'])[1][0]
bath_summary.append(corr_val)
pd.DataFrame([bath_summary], columns=bath_vars, index=['SalePrice Correlation'])
BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | |
---|---|---|---|---|
SalePrice Correlation | 0.236224 | -0.005149 | 0.594771 | 0.313982 |
all_data['TotalBath'] = all_data['BsmtFullBath'] + (all_data['BsmtHalfBath']*0.5) + all_data['FullBath'] + (all_data['HalfBath']*0.5)
plt.figure(figsize=(10,4))
sns.countplot(all_data['TotalBath'], color='grey')
plt.tight_layout()
plt.show()
plt.figure(figsize=(10,6))
sns.regplot(all_data['TotalBath'].iloc[:ntrain], train['SalePrice'], line_kws={'color': 'red'})
ax.set(ylabel='ln(SalePrice)')
plt.show()
print("Pearson Correlation Coefficient: %.3f" % (np.corrcoef(all_data['TotalBath'].iloc[:ntrain], train['SalePrice']))[1][0])
Pearson Correlation Coefficient: 0.632
We can see a high positive correlation (r = 0.67) between TotalBath and SalePrice. The new variable correlation is much higher than any of the four original bathroom variables.
neighborhood_prices = all_data.iloc[:ntrain].copy()
neighborhood_prices['SalePrice'] = np.exp(all_data['SalePrice'])
neighborhood_prices = neighborhood_prices[['Neighborhood', 'SalePrice']].groupby('Neighborhood').median().sort_values('SalePrice')
plt.figure(figsize=(15,7))
ax = sns.barplot(x= neighborhood_prices.index, y=neighborhood_prices['SalePrice'], color='grey')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
In order to bin the neighbourhoods into approriate clusters, we will use K-Mean clustering. K-Means is an unsupervised machine learning algorithm that groups a dataset into a user-specified number (k) of clusters. One potential issue with the algorithm is that it will cluster the data into k clusters, even if k is not the right number of cluster to use.
Therefore, we need to identity the optimal number of k clusters to use. To do so, we will use the Elbow Method.
Briefly, the idea of the elbow method is to run k-means clustering for a range of values and calculate the sum of squared errors (SSE). We then plot a line chart of the SSE for each value of k. Each additional k cluster will result in a lower SSE, but will eventually exhibit significantly diminished return.
The goal is the choose a small value of k with a low SSE, after which the subsequent k values exhibit diminishing returns. If the line plot looks like an arm, then the "elbow" of the arm is the optimal k value.
The plot below indicates that the optimal number of neighbourhood clusters is k = 3.
SS_distances = []
K = range(1,10)
for k in K:
km = KMeans(n_clusters=k)
km = km.fit(neighborhood_prices)
SS_distances.append(km.inertia_)
plt.plot(K, SS_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
Now let's see how the binned neighborhoods look like.
neighborhood_prices['Cluster'] = KMeans(n_clusters=3).fit(neighborhood_prices).labels_
plt.figure(figsize=(15,7))
ax = sns.barplot(x= neighborhood_prices.index, y=neighborhood_prices['SalePrice'],
hue=neighborhood_prices['Cluster'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
Lastly, we have to create a new variable in the dataset based on the new cluster labels.
neighborhood_dict = dict(zip(neighborhood_prices.index, neighborhood_prices.Cluster))
all_data['Neighborhood_Class'] = all_data['Neighborhood']
all_data['Neighborhood_Class'].replace(neighborhood_dict, inplace = True)
Due to the numeric nature and sheer number of missing values in the LotFrontage column, we will impute the missing values using an SVM Regressor algorithm to predict the missing values.
The first step is to subset the dataset into two groups. The train dataset (train_LotFrontage) contains the complete records while the test dataset (test_LotFrontage) contains the missing values. Next, we examine the structure of the newly created datasets. There are 486 missing values in the full dataset, or roughly 16% of the data.
train_LotFrontage = all_data[~all_data.LotFrontage.isnull()]
test_LotFrontage = all_data[all_data.LotFrontage.isnull()]
print("\n")
print("Dimensions of Train LotFrontage Dataset:" + str(train_LotFrontage.shape))
print("Dimensions of Test LotFrontage Dataset:" + str(test_LotFrontage.shape))
display(pd.DataFrame(all_data['LotFrontage'].describe()).transpose())
Dimensions of Train LotFrontage Dataset:(2433, 81) Dimensions of Test LotFrontage Dataset:(486, 81)
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
LotFrontage | 2433.0 | 69.305795 | 23.344905 | 21.0 | 59.0 | 68.0 | 80.0 | 313.0 |
Now let's examine the distribution of LotFrontages values in the Train LotFrontage dataset through a boxplot and distribution plot. Through these graphs, we can see several interesting observations appear:
In other words, there are outliers on both ends of the LotFrontage value distributions.
fig, ax = plt.subplots(1,2, figsize=(16,4))
sns.boxplot(train_LotFrontage['LotFrontage'], ax=ax[0])
sns.distplot(train_LotFrontage['LotFrontage'], ax=ax[1], fit=st.norm, fit_kws={"color":"red"})
(mu, sigma) = st.norm.fit(train_LotFrontage['LotFrontage'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best', fancybox=True)
plt.show()
Before we examine the correlations between LotFrontage and other variables, we should remove the outliers seen above.
To do so, we will use the Interquartile Range (IQR) method, where values outside the Median ± 1.5IQR are considered to be outliers. These outliers are then removed from the LotFrontage datasets.
def outlier_detection(data):
Q1, Q3 = np.percentile(data, [25,75])
IQR = Q3-Q1
lower_cutoff = Q1 - (IQR * 1.5)
upper_cutoff = Q3 + (IQR * 1.5)
outliers = (data > Q3+1.5*IQR) | (data < Q1-1.5*IQR)
return outliers
train_LotFrontage_no_outliers = train_LotFrontage[~outlier_detection(train_LotFrontage.LotFrontage)]
fig, ax = plt.subplots(1,2, figsize=(16,4))
sns.boxplot(train_LotFrontage_no_outliers['LotFrontage'], ax=ax[0])
sns.distplot(train_LotFrontage_no_outliers['LotFrontage'], ax=ax[1], fit=st.norm, fit_kws={"color":"red"})
(mu, sigma) = st.norm.fit(train_LotFrontage_no_outliers['LotFrontage'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best', fancybox=True)
plt.show()
Two new values previously not deemed outliers by the first iteration of the IQR range method are now shown as potential outliers in the boxplot. This is caused by having a new smaller IQR value after removing the first batch of outliers. As such, we will keep these values.
Now we can explore the relationship between the LotFrontage target variable and other features. In order to confirm a relationship between these key features and we will conduct an ANOVA (Analysis of Variance) test to determine statistical significance (in this case, p < 0.01).
temp_train = train_LotFrontage_no_outliers.copy()
temp_train.rename(columns={'1stFlrSF':'X1stFlrSF', '2ndFlrSF': 'X2ndFlrSF', '3SsnPorch': 'X3SsnPorch'}, inplace=True)
mod = ols('LotFrontage ~ X1stFlrSF + X2ndFlrSF + X3SsnPorch + Alley + BedroomAbvGr + BldgType + BsmtCond + BsmtExposure + BsmtFinSF1 + BsmtFinSF2 + BsmtFinType1 + BsmtFinType2 + BsmtFullBath + BsmtHalfBath + BsmtQual + BsmtUnfSF + CentralAir + Electrical + EnclosedPorch + ExterCond + ExterQual + Fence + FireplaceQu + Fireplaces + Foundation + FullBath + Functional + GarageArea + GarageCars + GarageCond + GarageFinish + GarageQual + GarageType + GarageYrBlt + GrLivArea + HalfBath + Heating + HeatingQC + HouseStyle + KitchenAbvGr + KitchenQual + LandContour + LandSlope + LotArea + LotConfig + LotShape + LowQualFinSF + MSSubClass + MSZoning + MasVnrArea + MasVnrType + MiscFeature + MiscVal + MoSold + Neighborhood + OpenPorchSF + OverallCond + OverallQual + PavedDrive + PoolArea + PoolQC + RoofMatl + RoofStyle + SaleCondition + SaleType + ScreenPorch + Street + TotRmsAbvGrd + TotalBsmtSF + WoodDeckSF + YearBuilt + YearRemodAdd + YrSold + MixedConditions + MixedExterior + TotalSF + TotalBath + Neighborhood_Class', data=temp_train).fit()
aov_table = sm.stats.anova_lm(mod, typ=3)
display(aov_table[aov_table['PR(>F)'] < 0.01])
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
Alley | 1363.913911 | 2.0 | 7.503367 | 5.674399e-04 |
BldgType | 20479.667110 | 4.0 | 56.332901 | 5.337216e-45 |
GarageCond | 1600.666924 | 5.0 | 3.522331 | 3.573473e-03 |
GarageFinish | 1607.144665 | 3.0 | 5.894309 | 5.300349e-04 |
GarageQual | 1851.155770 | 5.0 | 4.073542 | 1.108720e-03 |
GarageType | 4398.086327 | 6.0 | 8.065136 | 1.269477e-08 |
HouseStyle | 3641.202949 | 7.0 | 5.723294 | 1.433756e-06 |
LotConfig | 43527.367185 | 4.0 | 119.729624 | 2.398118e-91 |
LotShape | 2826.722888 | 3.0 | 10.367193 | 9.089207e-07 |
Neighborhood | 14742.824462 | 23.0 | 7.052646 | 7.333663e-22 |
RoofMatl | 2519.609152 | 5.0 | 5.544500 | 4.462709e-05 |
MixedConditions | 4309.317372 | 18.0 | 2.634118 | 2.091806e-04 |
MixedExterior | 12661.271705 | 79.0 | 1.763394 | 5.468941e-05 |
GarageCars | 999.666184 | 1.0 | 10.999026 | 9.284905e-04 |
GarageYrBlt | 1008.361079 | 1.0 | 11.094693 | 8.820554e-04 |
LotArea | 5292.799666 | 1.0 | 58.235079 | 3.629421e-14 |
MSSubClass | 997.020109 | 1.0 | 10.969912 | 9.431071e-04 |
TotalSF | 636.362078 | 1.0 | 7.001700 | 8.209306e-03 |
Now let's build a model using the relevant variables selected above. We will be using Support Vector Regressor to build the model.
The first step is to seperate the target variable, LotFrontage, select the relevant variables, and dummifying the categorical variables.
X = train_LotFrontage_no_outliers[aov_table[aov_table['PR(>F)'] < 0.01].index]
y = train_LotFrontage_no_outliers['LotFrontage']
X = pd.get_dummies(X)
transformer = StandardScaler().fit(X)
X = transformer.transform(X)
In order to determine the accuracy of our predictive model, we will create a Validation dataset. This dataset will have known values for the target LotFrontage variable which can we compare our model's prediction against. Using the mean absolute error, we can measure the difference between our predicted LotFrontage values with the true values.
![Train Test Split](Train Test Split.png)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1234)
clf = svm.SVR(kernel='rbf', C=100, gamma=0.001)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print('Mean Absolute Error: %.3f' % mean_absolute_error(y_test, preds))
Mean Absolute Error: 6.848
These results show that using the SVR model to impute LotFrontage gives an average rror of less than 7 feet.
We will now use the same SVR model to predict the unknown LotFrontage values in the test_Frontage dataset.
model_data = train_LotFrontage_no_outliers.copy()
model_data= model_data.append(test_LotFrontage)
y = model_data['LotFrontage']
model_data = model_data[['MSZoning', 'Alley', 'LotArea', 'LotShape', 'LotConfig', 'Neighborhood', 'MixedConditions', 'GarageType', 'GarageCars', 'GarageArea']]
model_data = pd.get_dummies(model_data)
model_X_train = model_data[~y.isnull()]
model_X_test = model_data[y.isnull()]
model_y_train = y[~y.isnull()]
transformer = StandardScaler().fit(model_X_train)
model_X_train = transformer.transform(model_X_train)
model_X_test = transformer.transform(model_X_test)
clf = svm.SVR(kernel='rbf', C=100, gamma=0.001)
clf.fit(model_X_train, model_y_train)
LotFrontage_preds = clf.predict(model_X_test)
Now that we have the newly predicted LotFrontage values, we can examine the distribution of predicted values relative to the known LotFrontage values in the training dataset. Both distributions have a mean around 70 feet with similar tail lengths on either end.
sns.distplot(model_y_train)
sns.distplot(LotFrontage_preds)
<matplotlib.axes._subplots.AxesSubplot at 0x1550eb4dba8>
Lastly, we need to add the predicted values back into the original dataset.
all_data.LotFrontage[all_data.LotFrontage.isnull()] = LotFrontage_preds
C:\Users\bashk\Anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy """Entry point for launching an IPython kernel.
Although we could be more in depth in our outliet detection and removal, we are only going to remove the two outliers found in the TotalSF variable. These properties exhibited large total square feet values with low SalePrice.
all_data.drop(all_data['TotalSF'].iloc[:ntrain].nlargest().index[:2], axis=0, inplace=True)
ntrain = train.shape[0]-2
We can see a large correlation between many variables, suggesting a high level of multicollinearity. There are two options to consider:
corr = all_data.corr()
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(15, 15))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap,
center=0, square=True, linewidths = .5)
plt.show()
Many of the ordinal variables are presented as numeric values. Therefore we need to label encode these numeric characters into strings.
num_to_str_columns = ['MSSubClass', 'OverallQual', 'OverallCond', 'MoSold', 'YrSold']
for col in num_to_str_columns:
all_data[col] = all_data[col].astype('str')
cat_cols = ['OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
'BsmtFinType2', 'BsmtFinSF2', 'HeatingQC', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Fireplaces', 'FireplaceQu', 'GarageFinish', 'GarageCars',
'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'MoSold', 'YrSold']
for col in cat_cols:
label = LabelEncoder()
label.fit(list(all_data[col].values))
all_data[col] = label.transform(list(all_data[col].values))
print('Shape all_data: {}'.format(all_data.shape))
all_data.head()
Shape all_data: (2917, 81)
1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | Dataset | MixedConditions | MixedExterior | TotalSF | TotalBath | Neighborhood_Class | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 856 | 854 | 0 | None | 3 | 1Fam | 4 | 3 | 706.0 | 0 | ... | 0 | 110 | 53 | 2 | Train | Norm - Norm | VinylSd - VinylSd | 2566.0 | 3.5 | 0 |
1 | 1262 | 0 | 0 | None | 3 | 1Fam | 4 | 1 | 978.0 | 0 | ... | 298 | 83 | 26 | 1 | Train | Feedr - Norm | MetalSd - MetalSd | 2524.0 | 2.5 | 0 |
2 | 920 | 866 | 0 | None | 3 | 1Fam | 4 | 2 | 486.0 | 0 | ... | 0 | 108 | 52 | 2 | Train | Norm - Norm | VinylSd - VinylSd | 2706.0 | 3.5 | 0 |
3 | 961 | 756 | 0 | None | 3 | 1Fam | 1 | 3 | 216.0 | 0 | ... | 0 | 25 | 20 | 0 | Train | Norm - Norm | Wd Sdng - Wd Shng | 2473.0 | 2.0 | 0 |
4 | 1145 | 1053 | 0 | None | 4 | 1Fam | 4 | 0 | 655.0 | 0 | ... | 192 | 107 | 50 | 2 | Train | Norm - Norm | VinylSd - VinylSd | 3343.0 | 3.5 | 2 |
5 rows × 81 columns
Skewness is a measure of asymmetry of a distribution, and can be used to define the extent to which the distribution differs from a normal distribution. Therefore, a normal distribution will have a skewness of 0. As a rule of thumb, if skewness is less than -1 or greater than 1, the distribution is highly skewed.
In order to account for skewness, we will transform the (highly) skewed data into normality using a Log Transformation. We define highly skewed data as variables with a skewness greater than 0.85. This method is similar to the approach used to normalize the SalePrice Response Variable, except we will use log+1 to avoid division by zero issues.
numeric_feats = ['LotFrontage', 'LotArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
'3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'TotalSF']
skewed_feats = all_data[numeric_feats].apply(lambda x: st.skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew Before Transformation' :skewed_feats})
skewness = skewness[abs(skewness) > 1].dropna(axis=0)
skewed_features = skewness.index
for feat in skewed_features:
all_data[feat] = np.log1p(all_data[feat]+1)
skewed_feats = all_data[skewed_features].apply(lambda x: st.skew(x.dropna())).sort_values(ascending=False)
skewness['Skew After Transformation'] = skewed_feats
skewness
Skew Before Transformation | Skew After Transformation | |
---|---|---|
MiscVal | 21.939672 | 5.255296 |
PoolArea | 17.688664 | 15.655054 |
LotArea | 13.109495 | -0.532117 |
LowQualFinSF | 12.084539 | 8.605557 |
3SsnPorch | 11.372080 | 8.853408 |
EnclosedPorch | 4.002344 | 1.983518 |
ScreenPorch | 3.945101 | 2.954087 |
OpenPorchSF | 2.529358 | 0.019004 |
WoodDeckSF | 1.844792 | 0.179665 |
1stFlrSF | 1.257286 | 0.031286 |
GrLivArea | 1.068750 | -0.021271 |
LotFrontage | 1.067631 | -0.889232 |
TotalSF | 1.009157 | -0.428615 |
The last step needed to prepare the data is to make sure that all categorical predictor variables are converted into a form that is usable by machine learning algorithms. This process is known as 'one-hot encoding' the categorical variables.
The process involved all non-ordinal factors receiving their own separate column with 1's and 0's, and is required by most ML algorithms.
all_data = pd.get_dummies(all_data)
print(all_data.shape)
all_data.head(3)
(2917, 317)
1stFlrSF | 2ndFlrSF | 3SsnPorch | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | BsmtFinType1 | BsmtFinType2 | ... | MixedExterior_Wd Sdng - Stone | MixedExterior_Wd Sdng - Stucco | MixedExterior_Wd Sdng - VinylSd | MixedExterior_Wd Sdng - Wd Sdng | MixedExterior_Wd Sdng - Wd Shng | MixedExterior_WdShing - HdBoard | MixedExterior_WdShing - Plywood | MixedExterior_WdShing - Stucco | MixedExterior_WdShing - Wd Sdng | MixedExterior_WdShing - Wd Shng | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.754604 | 854 | 0.693147 | 3 | 4 | 3 | 706.0 | 0 | 2 | 6 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 7.142037 | 0 | 0.693147 | 3 | 4 | 1 | 978.0 | 0 | 0 | 6 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 6.826545 | 866 | 0.693147 | 3 | 4 | 2 | 486.0 | 0 | 2 | 6 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 317 columns
Now that the data is correctly processed, we are ready to begin building our predictive models.
final_y_train = all_data['SalePrice'][~all_data['SalePrice'].isnull()]
final_X_train = all_data[all_data['Dataset_Train'] == 1].drop(['Dataset_Train', 'Dataset_Test', 'SalePrice'], axis=1)
final_X_test = all_data[all_data['Dataset_Test'] == 1].drop(['Dataset_Train', 'Dataset_Test', 'SalePrice'], axis=1)
We will use the cross_val_score function of Sklearn and calculate the Root-Mean-Squared Error (RMSE) as a measure of accuracy.
n_folds = 10
def rmse_cv(model):
kf = KFold(n_folds, shuffle=True, random_state=1234).get_n_splits(final_X_train.values)
rmse= np.sqrt(-cross_val_score(model, final_X_train.values, final_y_train, scoring="neg_mean_squared_error", cv = kf))
return(rmse)
This model may be senstitive to outliers, so we need to make it more robust. This will be done using the RobustScaler function on pipeline.
lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.0005, random_state = 1234))
Next, we tested out multiple alpha values and compared their accuracy using the rmse_cv function above. The results below show that using a value of 0.00025 is the optimal value of alpha for a Lasso Regression.
lasso_alpha = [1, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001, 0.0005, 0.00025, 0.0001]
lasso_rmse = []
for value in lasso_alpha:
lasso = make_pipeline(RobustScaler(), Lasso(alpha = value, max_iter=3000, random_state = 1234))
lasso_rmse.append(rmse_cv(lasso).mean())
lasso_score_table = pd.DataFrame(lasso_rmse,lasso_alpha,columns=['RMSE'])
display(lasso_score_table.transpose())
plt.semilogx(lasso_alpha, lasso_rmse)
plt.xlabel('alpha')
plt.ylabel('score')
plt.show()
print("\nLasso Score: {:.4f} (alpha = {:.5f})\n".format(min(lasso_score_table['RMSE']), lasso_score_table.idxmin()[0]))
C:\Users\bashk\Anaconda\lib\site-packages\sklearn\linear_model\coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ConvergenceWarning)
1.0 | 0.5 | 0.25 | 0.1 | 0.05 | 0.025 | 0.01 | 0.005 | 0.0025 | 0.001 | 0.0005 | 0.00025 | 0.0001 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RMSE | 0.398841 | 0.398841 | 0.339897 | 0.239978 | 0.189281 | 0.154991 | 0.135898 | 0.130271 | 0.122606 | 0.114953 | 0.11294 | 0.112421 | 0.114048 |
Lasso Score: 0.1124 (alpha = 0.00025)
Using the newly defined alpha value, we can optimize the model and predict the missing values of SalePrice. The predictions are then formatted in a appropriate layout for submission to Kaggle.
lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.00025, random_state = 1234))
lasso.fit(final_X_train, final_y_train)
lasso_preds = np.expm1(lasso.predict(final_X_test))
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = lasso_preds
#sub.to_csv('Lasso Submission 2.csv',index=False)
Identical steps were taken for the ridge regression model as we took for the Lasso Regression Model. In the case of the Ridge model, the optimal alpha value was 6.
ridge_alpha = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
ridge_rmse = []
for value in ridge_alpha:
ridge = make_pipeline(RobustScaler(), Ridge(alpha = value, random_state = 1234))
ridge_rmse.append(rmse_cv(ridge).mean())
ridge_score_table = pd.DataFrame(ridge_rmse,ridge_alpha,columns=['RMSE'])
display(ridge_score_table.transpose())
plt.semilogx(ridge_alpha, ridge_rmse)
plt.xlabel('alpha')
plt.ylabel('score')
plt.show()
print("\nRidge Score: {:.4f} (alpha = {:.4f})\n".format(min(ridge_score_table['RMSE']), ridge_score_table.idxmin()[0]))
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RMSE | 0.115314 | 0.113956 | 0.113409 | 0.113163 | 0.113059 | 0.113033 | 0.113051 | 0.113097 | 0.11316 | 0.113234 | 0.113315 | 0.113401 | 0.11349 | 0.113581 | 0.113672 |
Ridge Score: 0.1130 (alpha = 6.0000)
ridge = make_pipeline(RobustScaler(), Ridge(alpha = 6, random_state = 1234))
ridge.fit(final_X_train, final_y_train)
ridge_preds = np.expm1(ridge.predict(final_X_test))
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = ridge_preds
#sub.to_csv('Ridge Submission.csv',index=False)
Since there are multiple hyperparameters to tune in the XGBoost model, we will use the GridSearchCV function of Sklearn to determine the optimal values. Next, I used the train_test_split function to generate a validation set and find the RMSE of the models. This method is used in lieu of the rmse_cv function used above for the Lasso and Ridge Regression models.
xg_X_train, xg_X_test, xg_y_train, xg_y_test = train_test_split(final_X_train, final_y_train, test_size=0.33, random_state=1234)
xg_model = XGBRegressor(n_estimators=100, seed = 1234)
param_dict = {'max_depth': [3,4,5],
'min_child_weight': [2,3,4],
'learning_rate': [0.05, 0.1,0.15],
'gamma': [0.0, 0.1, 0.2]
}
start = time()
grid_search = GridSearchCV(xg_model, param_dict)
grid_search.fit(xg_X_train, xg_y_train)
print("GridSearch took %.2f seconds to complete." % (time()-start))
display(grid_search.best_params_)
GridSearch took 177.74 seconds to complete.
{'gamma': 0.0, 'learning_rate': 0.1, 'max_depth': 4, 'min_child_weight': 4}
Now that the hyperparameter have been chosen, we can calculate the validation RMSE of the XGBoost model.
xg_model = XGBRegressor(n_estimators = 1000,
learning_rate = 0.1,
max_depth = 4,
min_child_weight = 4,
gamma = 0,
seed = 1234)
start = time()
xg_model.fit(xg_X_train, xg_y_train)
xg_preds = xg_model.predict(xg_X_test)
print("Model took %.2f seconds to complete." % (time()-start))
print("RMSE: %.4f" % sqrt(mean_squared_error(xg_y_test, xg_preds)))
Model took 12.10 seconds to complete. RMSE: 0.1238
Lastly, we predict the SalePrice using the test data. The predictions are then formatted in a appropriate layout for submission to Kaggle.
xg_model = XGBRegressor(n_estimators = 1000,
learning_rate = 0.1,
max_depth = 4,
min_child_weight = 4,
gamma = 0,
seed = 1234)
xg_model.fit(final_X_train, final_y_train)
xg_preds = np.expm1(xg_model.predict(final_X_test))
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = xg_preds
#sub.to_csv('XGBoost Submission.csv',index=False)
Since the Lasso, Ridge, XGBoost algorithms so different, averaging the final Saleprice predictions may improve the accuracy. Since the Ridge regression performed the best with regards to the final RMSE (0.125 vs 0.126 and 0.127), I will assign it's weight to be double that of the other two models.
Our final ensemble model performed better than any individual regression model (RMSE = 0.12220).
lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.00025, random_state = 1234))
lasso.fit(final_X_train, final_y_train)
lasso_preds = np.expm1(lasso.predict(final_X_test))
ridge = make_pipeline(RobustScaler(), Ridge(alpha = 6, random_state = 1234))
ridge.fit(final_X_train, final_y_train)
ridge_preds = np.expm1(ridge.predict(final_X_test))
xg_model = XGBRegressor(n_estimators = 1000,
learning_rate = 0.1,
max_depth = 4,
min_child_weight = 4,
gamma = 0,
seed = 1234)
xg_model.fit(final_X_train, final_y_train)
xg_preds = np.expm1(xg_model.predict(final_X_test))
weights = [0.5, 0.25, 0.25]
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = (ridge_preds*weights[0]) + (lasso_preds*weights[1]) + (xg_preds*weights[2])
#sub.to_csv('Ensemble Submission.csv',index=False)