This goal of this project was to predict sales prices and practice feature engineering, RFs, and gradient boosting. The dataset was part of the House Prices Kaggle Competition.

```
Top submissions on the public leaderboard range between 0.10-0.12
```

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

As part of a Kaggle competition dataset, the accuracy of the sales prices was evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.

I started this competition by focusing on getting a thorough understanding of the dataset. Particular attention was paid to impute the missing values within the dataset. The EDA process is detailed as well as visualized.

In this project, I created a predictive model that has been trained on data collected from homes in Ames, Iowa. Three algorithms were used, and their validation set RMSE and test set RMSE are listed below:

Regression Model | Validation RMSE | Test RMSE |
---|---|---|

Ridge | 0.1130 | 0.12528 |

Lasso | 0.1125 | 0.12679 |

XGBoost | 0.1238 | 0.12799 |

Ensemble |
0.12220 |

The Ridge regression model performed the best as a single model, likely due to the high multicollinearity. However, combining it with the Lasso and XGBoost regression models resulting in a higher prediction accuracy and a lower RMSE (*0.12220* vs *0.12528*).

The dataset used for this project is the Ames Housing dataset that was compiled by *Dean De Cock* for use in data science education. It is an alternative to the popular but older Boston Housing dataset.

The Ames Housing dataset is also used in the Advanced Regression Techniques challenge on the Kaggle Website. These competitions is a great way to improve my skills and measure my progress as a data scientist.

Kaggle describes the competition as follows:

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Loading Python packages used in the project

In [1]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from time import time
from math import sqrt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import scipy.stats as st
from scipy.special import boxcox1p
from sklearn.cluster import KMeans
from sklearn import svm
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import RobustScaler, LabelEncoder, StandardScaler
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from xgboost.sklearn import XGBRegressor
%matplotlib inline
```

Now, we read in the csv's as datarames into Python.

In [2]:

```
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
```

In total, there are 81 columns/variables in the train dataset, including the response variable (SalePrice). I am only displaying a subset of the variables, as all of them will be discussed in more detail throughout the notebook.

The train dataset consists of character and integer variables. Many of these variables are ordinal factors, despite being represented as character or integer variables. These will require cleaning and/or feature engineering later.

In [3]:

```
print("Dimensions of Train Dataset:" + str(train.shape))
print("Dimensions of Test Dataset:" + str(test.shape))
```

In [4]:

```
train.iloc[:,0:10].info()
```

Next, we are going to define a few variables that will be used in later analyses as well as being required for the submission file.

In [5]:

```
y_train = train['SalePrice']
test_id = test['Id']
ntrain = train.shape[0]
ntest = test.shape[0]
```

Lastly, we are going to merge the train and test datasets to explore the data as well as impute any missing values.

In [6]:

```
all_data = pd.concat((train, test), sort=True).reset_index(drop=True)
all_data['Dataset'] = np.repeat(['Train', 'Test'], [ntrain, ntest], axis=0)
all_data.drop('Id', axis=1,inplace=True)
```

The probability distribution plot show that the sale prices are right skewed. This is to be expected as few people can afford very expensive houses.

In [7]:

```
sns.set_style('whitegrid')
sns.distplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], axlabel="Normal Distribution", fit=st.norm, fit_kws={"color":"red"})
plt.title('Distribution of Sales Price in Dollars')
(mu, sigma) = st.norm.fit(train['SalePrice'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best', fancybox=True)
plt.show()
st.probplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], plot=plt)
plt.show()
```

Linear models tend to work better with normally distributed data. As such, we need to transform the response variable to make it more normally distributed.

In [8]:

```
all_data['SalePrice'] = np.log1p(all_data['SalePrice'])
sns.distplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], axlabel="Normal Distribution", fit=st.norm, fit_kws={"color":"red"})
plt.title('Distribution of Transformed Sales Price in Dollars')
(mu, sigma) = st.norm.fit(train['SalePrice'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best', fancybox=True)
plt.show()
st.probplot(all_data['SalePrice'][~all_data['SalePrice'].isnull()], plot=plt)
plt.show()
```

The skew is highly corrected and the distribution of the log-transformed sale prices appears more normally distributed.

We first need to find which variables contain missing values.

In [9]:

```
cols_with_missing_values = all_data.isnull().sum().sort_values(ascending=False)
display(pd.DataFrame(cols_with_missing_values[cols_with_missing_values[cols_with_missing_values > 0].index],
columns=["Number of Missing Values"]))
```

In [10]:

```
cols_with_missing_values = all_data.isnull().sum().sort_values(ascending=False)
cols_with_missing_values = all_data[cols_with_missing_values[cols_with_missing_values > 0].index]
msno.bar(cols_with_missing_values)
plt.show()
```

**PoolQC**: data description of the variables states that NA represents "no pool". This makes sense given the vast number of missing values (>99%) and that a majority of houses do not have a pool.

In [11]:

```
all_data['PoolQC'].replace(np.nan, 'None', regex=True, inplace=True)
```

**MiscFeature**: data description of the variables states that NA represents "no miscellaneous feature".

In [12]:

```
all_data['MiscFeature'].replace(np.nan, 'None', regex=True, inplace=True)
```

**Alley**: data description of the variables states that NA represents "no alley access".

In [13]:

```
all_data['Alley'].replace(np.nan, 'None', regex=True, inplace=True)
```

**Fence**: data description of the variables states that NA represents "no fence".

In [14]:

```
all_data['Fence'].replace(np.nan, 'None', regex=True, inplace=True)
```

**FireplaceQU**: data description of the variables states that NA represents "no fireplace".

In [15]:

```
all_data['FireplaceQu'].replace(np.nan, 'None', regex=True, inplace=True)
```

**GarageType, GarageFinish, GarageQual, and GarageCond**: Missing values are likely due to lack of a garage. Replacing missing data with None.

In [16]:

```
garage_categorical = ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']
for variable in garage_categorical:
all_data[variable].replace(np.nan, 'None', regex=True, inplace=True)
```

**GarageYrBlt, GarageCars, and GarageArea**: Missing values are likely due to lack of a basement. Replacing missing data with 0 (since no garage means no cars in a garage).

In [17]:

```
garage_numeric = ['GarageYrBlt', 'GarageCars', 'GarageArea']
for variable in garage_numeric:
all_data[variable].replace(np.nan, 0, regex=True, inplace=True)
```

**BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, and BsmtFinType2**: Missing values are likely due to lack of a basement. Replacing missing data with None.

In [18]:

```
basement_categorical = ['BsmtQual','BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
for variable in basement_categorical:
all_data[variable].replace(np.nan, 'None', regex=True, inplace=True)
```

**BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath, and BsmtHalfBath**: Missing values are likely due to lack of a basement. Replacing missing data with 0.

In [19]:

```
basement_numeric = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']
for variable in basement_numeric:
all_data[variable].replace(np.nan, 0, regex=True, inplace=True)
```

**MasVnrType and MasVnrArea**: Missing values are likely due to lack of masonry vaneer. We will replace the missing values with None for the type and 0 for the area.

In [20]:

```
all_data['MasVnrType'].replace(np.nan, 'None', regex=True, inplace=True)
all_data['MasVnrArea'].replace(np.nan, 0, regex=True, inplace=True)
```

**Functional**: data description of the variables states that NA represents "typical".

In [21]:

```
all_data['Functional'].replace(np.nan, 'Typ', regex=True, inplace=True)
```

Using a mode imputation, we replace the missing values of a categorical variable with the mode of the non-missing cases of that variable. While it does have the advantage of being fast, it comes at the cost of a reduction in variance within the dataset.

Due to the low number of missing values imputed using this method, the bias introduced on the mean and standard deviation, as well as correlations with other variables are minimal.

First, we define a function "**mode_impute_and_plot**" to simplify the process of visualizing the categorical variables and replacing the missing variables with the most frequent value.

In [22]:

```
def mode_impute_and_plot(variable):
print('# of missing values: ' + str(all_data[variable].isna().sum()))
plt.figure(figsize=(8,4))
ax = sns.countplot(all_data[variable])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
all_data[variable].replace(np.nan, all_data[variable].mode()[0], regex=True, inplace=True)
```

Now we can proceed to replace the missing values for the following variables:

MSZoning | Utilities | Electrical | Exterior1st and Exterior2nd | KitchenQual | SaleType |

**MSZoning**: "RL" is by far the most common value. Missing values will be replace with "RL".

In [23]:

```
mode_impute_and_plot('MSZoning')
```

**Utilities**: With the exception of one "NoSeWa" value, all records for this variable are "AllPub". Since "NoSeWa" is only in the training set,**this feature will not help in predictive modelling**. As such, we can safely remove it.

In [24]:

```
mode_impute_and_plot('Utilities')
all_data = all_data.drop('Utilities', axis=1)
```

**Electrical**: Only one missing value, replace it with "SBrkr", the most common value.

In [25]:

```
mode_impute_and_plot('Electrical')
```

**Exterior1st and Exterior2nd**: There is only one missing value for both Exterior 1 & 2. Missing values will be replaced by the most common value.

In [26]:

```
mode_impute_and_plot('Exterior1st')
```

In [27]:

```
mode_impute_and_plot('Exterior2nd')
```

**KitchenQual**: Only one missing value, replace it with "TA", the most common value.

In [28]:

```
mode_impute_and_plot('KitchenQual')
```

**SaleType**: Replace missing values with "WD", the most common value.

In [29]:

```
mode_impute_and_plot('SaleType')
```

Are there any remaining missing values?

In [30]:

```
cols_with_missing_values = all_data.isnull().sum().sort_values(ascending=False)
display(pd.DataFrame(cols_with_missing_values[cols_with_missing_values[cols_with_missing_values > 0].index], columns=["Number of Missing Values"]))
```

Due to its numeric nature and the large number of missing values, the LotFrontage variable will be imputed separately using an SVM algorithm (See Section 7 LotFrontage Imputation).

The remaining variables are all complete! Now to move on to feature engineering.

In order to simplify and boost the accuracy of the preditive models, we will merge the two conditions into one variable: **MixedConditions**

The data descriptions states:

**Condition1**represents proximity to various conditions.**Condition2**represents proximity to various conditions (if more than one is present).

If a property does not have one or multiple conditions, then it is classified as normal. However, designation of "normal" are condition 1 or condition 2 is strictly alphabetical.

For example, if a property is in proximity to a feeder street ("Feedr") and no other condition, then the data would appear as follows:

Condition1 | Condition2 |
---|---|

Feedr | Norm |

However, if a property is within 200' of East-West Railroad (RRNe) and no other condition, then the data would appear as follows:

Condition1 | Condition2 |
---|---|

Norm | RRNe |

Once we merge Conditions 1 & 2 into the **MixedConditions** variable, we will remove them from the analysis.

In [31]:

```
all_data['MixedConditions'] = all_data['Condition1'] + ' - ' + all_data['Condition2']
all_data.drop(labels=['Condition1', 'Condition2'], axis=1, inplace=True)
```

The Exterior1st and Exterior2nd features are similar to the Conditions feature we merged and remove above. Properties with multiple types of exterior covering the house are assigned to Exterior1st or Exterior2nd alphabetically.

As such, we will conduct the same process to merge the two columns into a single **MixedExterior** variable and remove them from the analysis.

In [32]:

```
all_data['MixedExterior'] = all_data['Exterior1st'] + ' - ' + all_data['Exterior2nd']
all_data.drop(labels=['Exterior1st', 'Exterior2nd'], axis=1, inplace=True)
```

One of the important factors that people consider when buying a house is the total living space in square feet. Since the total square feet is not explicitly listed, we will add a new variable by adding up the square footage of the basement, first floor, and the second floor.

In [33]:

```
SF_df = all_data[['TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'SalePrice']]
SF_df = SF_df[~SF_df['SalePrice'].isnull()]
SF_vars = list(SF_df)
del SF_vars[-1]
SF_summary = []
for SF_type in SF_vars:
corr_val = np.corrcoef(SF_df[SF_type], SF_df['SalePrice'])[1][0]
SF_summary.append(corr_val)
pd.DataFrame([SF_summary], columns=SF_vars, index=['SalePrice Correlation'])
```

Out[33]:

In [34]:

```
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
plt.figure(figsize=(10,6))
ax = sns.regplot(all_data['TotalSF'], all_data['SalePrice'], line_kws={'color': 'red'})
ax.set(ylabel='ln(SalePrice)')
plt.show()
print("Pearson Correlation Coefficient: %.3f" % (np.corrcoef(all_data['TotalSF'].iloc[:ntrain], train['SalePrice']))[1][0])
```

There is a very strong correlation (r = 0.78) between the TotalSF and the SalePrice, with the exception of two outliers. These outliers should be removed to increase the accuracy of our model.

There are 4 bathroom variables. Individually, these variables are not very important. Together, however, these predictors are likely to become a strong one.

A full bath is made up of four parts: a sink, shower, a bathtub, and a toilet. A half-bath, also called a powder room or guest bath, only has a sink and a toilet. As such, half-bathrooms will have half the value of a full bath.

In [35]:

```
bath_df = all_data[['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'SalePrice']]
bath_df = bath_df[~bath_df['SalePrice'].isnull()]
bath_vars = list(bath_df)
del bath_vars[-1]
bath_summary = []
for bath_type in bath_vars:
corr_val = np.corrcoef(bath_df[bath_type], bath_df['SalePrice'])[1][0]
bath_summary.append(corr_val)
pd.DataFrame([bath_summary], columns=bath_vars, index=['SalePrice Correlation'])
```

Out[35]:

In [36]:

```
all_data['TotalBath'] = all_data['BsmtFullBath'] + (all_data['BsmtHalfBath']*0.5) + all_data['FullBath'] + (all_data['HalfBath']*0.5)
plt.figure(figsize=(10,4))
sns.countplot(all_data['TotalBath'], color='grey')
plt.tight_layout()
plt.show()
plt.figure(figsize=(10,6))
sns.regplot(all_data['TotalBath'].iloc[:ntrain], train['SalePrice'], line_kws={'color': 'red'})
ax.set(ylabel='ln(SalePrice)')
plt.show()
print("Pearson Correlation Coefficient: %.3f" % (np.corrcoef(all_data['TotalBath'].iloc[:ntrain], train['SalePrice']))[1][0])
```

We can see a high positive correlation (r = 0.67) between TotalBath and SalePrice. The new variable correlation is much higher than any of the four original bathroom variables.

In [37]:

```
neighborhood_prices = all_data.iloc[:ntrain].copy()
neighborhood_prices['SalePrice'] = np.exp(all_data['SalePrice'])
neighborhood_prices = neighborhood_prices[['Neighborhood', 'SalePrice']].groupby('Neighborhood').median().sort_values('SalePrice')
plt.figure(figsize=(15,7))
ax = sns.barplot(x= neighborhood_prices.index, y=neighborhood_prices['SalePrice'], color='grey')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
```

In order to bin the neighbourhoods into approriate clusters, we will use K-Mean clustering. **K-Means** is an unsupervised machine learning algorithm that groups a dataset into a user-specified number (*k*) of clusters. One potential issue with the algorithm is that it will cluster the data into *k* clusters, even if *k* is not the right number of cluster to use.

Therefore, we need to identity the optimal number of k clusters to use. To do so, we will use the *Elbow Method*.

Briefly, the idea of the elbow method is to run k-means clustering for a range of values and calculate the sum of squared errors (SSE). We then plot a line chart of the SSE for each value of *k*. Each additional *k* cluster will result in a lower SSE, but will eventually exhibit significantly diminished return.

The goal is the choose a small value of k with a low SSE, after which the subsequent k values exhibit diminishing returns. If the line plot looks like an arm, then the "elbow" of the arm is the optimal *k* value.

The plot below indicates that the optimal number of neighbourhood clusters is *k* = 3.

In [38]:

```
SS_distances = []
K = range(1,10)
for k in K:
km = KMeans(n_clusters=k)
km = km.fit(neighborhood_prices)
SS_distances.append(km.inertia_)
plt.plot(K, SS_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()
```

Now let's see how the binned neighborhoods look like.

In [39]:

```
neighborhood_prices['Cluster'] = KMeans(n_clusters=3).fit(neighborhood_prices).labels_
plt.figure(figsize=(15,7))
ax = sns.barplot(x= neighborhood_prices.index, y=neighborhood_prices['SalePrice'],
hue=neighborhood_prices['Cluster'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
```

Lastly, we have to create a new variable in the dataset based on the new cluster labels.

In [40]:

```
neighborhood_dict = dict(zip(neighborhood_prices.index, neighborhood_prices.Cluster))
all_data['Neighborhood_Class'] = all_data['Neighborhood']
all_data['Neighborhood_Class'].replace(neighborhood_dict, inplace = True)
```

Due to the numeric nature and sheer number of missing values in the *LotFrontage* column, we will impute the missing values using an SVM Regressor algorithm to predict the missing values.

The first step is to subset the dataset into two groups. The *train* dataset (train_LotFrontage) contains the complete records while the *test* dataset (test_LotFrontage) contains the missing values. Next, we examine the structure of the newly created datasets. There are 486 missing values in the full dataset, or roughly 16% of the data.

In [41]:

```
train_LotFrontage = all_data[~all_data.LotFrontage.isnull()]
test_LotFrontage = all_data[all_data.LotFrontage.isnull()]
print("\n")
print("Dimensions of Train LotFrontage Dataset:" + str(train_LotFrontage.shape))
print("Dimensions of Test LotFrontage Dataset:" + str(test_LotFrontage.shape))
display(pd.DataFrame(all_data['LotFrontage'].describe()).transpose())
```

Now let's examine the distribution of LotFrontages values in the Train LotFrontage dataset through a boxplot and distribution plot. Through these graphs, we can see several interesting observations appear:

- There is a cluster of low
*LotFrontage*value properties, shown as a peak on the far left of the distribution plot. The boxplot indicates these values may be outliers, shown as outlier points on the left of the whisker of the boxplot. - There is a long tail of high
*LotFrontage*value properties. These values extend beyond the Median + 1.5IQR range.

In other words, there are outliers on both ends of the *LotFrontage* value distributions.

In [42]:

```
fig, ax = plt.subplots(1,2, figsize=(16,4))
sns.boxplot(train_LotFrontage['LotFrontage'], ax=ax[0])
sns.distplot(train_LotFrontage['LotFrontage'], ax=ax[1], fit=st.norm, fit_kws={"color":"red"})
(mu, sigma) = st.norm.fit(train_LotFrontage['LotFrontage'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best', fancybox=True)
plt.show()
```

Before we examine the correlations between *LotFrontage* and other variables, we should remove the outliers seen above.

To do so, we will use the Interquartile Range (IQR) method, where values outside the *Median Â± 1.5IQR* are considered to be outliers. These outliers are then removed from the *LotFrontage* datasets.

In [43]:

```
def outlier_detection(data):
Q1, Q3 = np.percentile(data, [25,75])
IQR = Q3-Q1
lower_cutoff = Q1 - (IQR * 1.5)
upper_cutoff = Q3 + (IQR * 1.5)
outliers = (data > Q3+1.5*IQR) | (data < Q1-1.5*IQR)
return outliers
```

In [44]:

```
train_LotFrontage_no_outliers = train_LotFrontage[~outlier_detection(train_LotFrontage.LotFrontage)]
fig, ax = plt.subplots(1,2, figsize=(16,4))
sns.boxplot(train_LotFrontage_no_outliers['LotFrontage'], ax=ax[0])
sns.distplot(train_LotFrontage_no_outliers['LotFrontage'], ax=ax[1], fit=st.norm, fit_kws={"color":"red"})
(mu, sigma) = st.norm.fit(train_LotFrontage_no_outliers['LotFrontage'])
plt.legend(['Normal Distribution \n ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best', fancybox=True)
plt.show()
```

Two new values previously not deemed outliers by the first iteration of the IQR range method are now shown as potential outliers in the boxplot. This is caused by having a new smaller IQR value after removing the first batch of outliers. As such, we will keep these values.

Now we can explore the relationship between the *LotFrontage* target variable and other features. In order to confirm a relationship between these key features and we will conduct an **ANOVA** (Analysis of Variance) test to determine statistical significance (in this case, *p < 0.01*).

Using the type III ANOVA, we are confident the variables listed in the table below are correlated with LotFrontage.

- Note: Variables that begin with a number (
*1stFlrSF*,*2ndFlrSF*, and*3SsnPorch*) cause a syntax error within the ols formula input. I briefly converted them to an appropriate name in the*temp_train*dataset for the purpose of the ANOVA.

In [45]:

```
temp_train = train_LotFrontage_no_outliers.copy()
temp_train.rename(columns={'1stFlrSF':'X1stFlrSF', '2ndFlrSF': 'X2ndFlrSF', '3SsnPorch': 'X3SsnPorch'}, inplace=True)
mod = ols('LotFrontage ~ X1stFlrSF + X2ndFlrSF + X3SsnPorch + Alley + BedroomAbvGr + BldgType + BsmtCond + BsmtExposure + BsmtFinSF1 + BsmtFinSF2 + BsmtFinType1 + BsmtFinType2 + BsmtFullBath + BsmtHalfBath + BsmtQual + BsmtUnfSF + CentralAir + Electrical + EnclosedPorch + ExterCond + ExterQual + Fence + FireplaceQu + Fireplaces + Foundation + FullBath + Functional + GarageArea + GarageCars + GarageCond + GarageFinish + GarageQual + GarageType + GarageYrBlt + GrLivArea + HalfBath + Heating + HeatingQC + HouseStyle + KitchenAbvGr + KitchenQual + LandContour + LandSlope + LotArea + LotConfig + LotShape + LowQualFinSF + MSSubClass + MSZoning + MasVnrArea + MasVnrType + MiscFeature + MiscVal + MoSold + Neighborhood + OpenPorchSF + OverallCond + OverallQual + PavedDrive + PoolArea + PoolQC + RoofMatl + RoofStyle + SaleCondition + SaleType + ScreenPorch + Street + TotRmsAbvGrd + TotalBsmtSF + WoodDeckSF + YearBuilt + YearRemodAdd + YrSold + MixedConditions + MixedExterior + TotalSF + TotalBath + Neighborhood_Class', data=temp_train).fit()
aov_table = sm.stats.anova_lm(mod, typ=3)
display(aov_table[aov_table['PR(>F)'] < 0.01])
```

Now let's build a model using the relevant variables selected above. We will be using Support Vector Regressor to build the model.

The first step is to seperate the target variable, *LotFrontage*, select the relevant variables, and dummifying the categorical variables.

In [46]:

```
X = train_LotFrontage_no_outliers[aov_table[aov_table['PR(>F)'] < 0.01].index]
y = train_LotFrontage_no_outliers['LotFrontage']
X = pd.get_dummies(X)
transformer = StandardScaler().fit(X)
X = transformer.transform(X)
```

In order to determine the accuracy of our predictive model, we will create a *Validation* dataset. This dataset will have known values for the target *LotFrontage* variable which can we compare our model's prediction against. Using the mean absolute error, we can measure the difference between our predicted *LotFrontage* values with the true values.

In [47]:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1234)
clf = svm.SVR(kernel='rbf', C=100, gamma=0.001)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print('Mean Absolute Error: %.3f' % mean_absolute_error(y_test, preds))
```

These results show that using the SVR model to impute *LotFrontage* gives an average rror of less than 7 feet.

We will now use the same SVR model to predict the unknown *LotFrontage* values in the test_Frontage dataset.

In [48]:

```
model_data = train_LotFrontage_no_outliers.copy()
model_data= model_data.append(test_LotFrontage)
y = model_data['LotFrontage']
model_data = model_data[['MSZoning', 'Alley', 'LotArea', 'LotShape', 'LotConfig', 'Neighborhood', 'MixedConditions', 'GarageType', 'GarageCars', 'GarageArea']]
model_data = pd.get_dummies(model_data)
model_X_train = model_data[~y.isnull()]
model_X_test = model_data[y.isnull()]
model_y_train = y[~y.isnull()]
transformer = StandardScaler().fit(model_X_train)
model_X_train = transformer.transform(model_X_train)
model_X_test = transformer.transform(model_X_test)
clf = svm.SVR(kernel='rbf', C=100, gamma=0.001)
clf.fit(model_X_train, model_y_train)
LotFrontage_preds = clf.predict(model_X_test)
```

Now that we have the newly predicted *LotFrontage* values, we can examine the distribution of predicted values relative to the known *LotFrontage* values in the training dataset. Both distributions have a mean around 70 feet with similar tail lengths on either end.

In [49]:

```
sns.distplot(model_y_train)
sns.distplot(LotFrontage_preds)
```

Out[49]:

Lastly, we need to add the predicted values back into the original dataset.

In [50]:

```
all_data.LotFrontage[all_data.LotFrontage.isnull()] = LotFrontage_preds
```

Although we could be more in depth in our outliet detection and removal, we are only going to remove the two outliers found in the TotalSF variable. These properties exhibited large total square feet values with low SalePrice.

In [51]:

```
all_data.drop(all_data['TotalSF'].iloc[:ntrain].nlargest().index[:2], axis=0, inplace=True)
ntrain = train.shape[0]-2
```

We can see a large correlation between many variables, suggesting a high level of multicollinearity. There are two options to consider:

- Use a regression model that deals well with multicollinearity, such as a ridge regression.
- Remove highly correlated predictors from the model.

In [52]:

```
corr = all_data.corr()
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(15, 15))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap,
center=0, square=True, linewidths = .5)
plt.show()
```

Many of the ordinal variables are presented as numeric values. Therefore we need to label encode these numeric characters into strings.

In [53]:

```
num_to_str_columns = ['MSSubClass', 'OverallQual', 'OverallCond', 'MoSold', 'YrSold']
for col in num_to_str_columns:
all_data[col] = all_data[col].astype('str')
```

In [54]:

```
cat_cols = ['OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
'BsmtFinType2', 'BsmtFinSF2', 'HeatingQC', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Fireplaces', 'FireplaceQu', 'GarageFinish', 'GarageCars',
'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'MoSold', 'YrSold']
for col in cat_cols:
label = LabelEncoder()
label.fit(list(all_data[col].values))
all_data[col] = label.transform(list(all_data[col].values))
print('Shape all_data: {}'.format(all_data.shape))
all_data.head()
```

Out[54]:

**Skewness** is a measure of asymmetry of a distribution, and can be used to define the extent to which the distribution differs from a normal distribution. Therefore, a normal distribution will have a skewness of 0. As a rule of thumb, if skewness is less than -1 or greater than 1, the distribution is highly skewed.

In order to account for skewness, we will transform the (highly) skewed data into normality using a Log Transformation. We define highly skewed data as variables with a skewness greater than 0.85. This method is similar to the approach used to normalize the SalePrice Response Variable, except we will use log+1 to avoid division by zero issues.

In [55]:

```
numeric_feats = ['LotFrontage', 'LotArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
'3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'TotalSF']
skewed_feats = all_data[numeric_feats].apply(lambda x: st.skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew Before Transformation' :skewed_feats})
skewness = skewness[abs(skewness) > 1].dropna(axis=0)
skewed_features = skewness.index
for feat in skewed_features:
all_data[feat] = np.log1p(all_data[feat]+1)
skewed_feats = all_data[skewed_features].apply(lambda x: st.skew(x.dropna())).sort_values(ascending=False)
skewness['Skew After Transformation'] = skewed_feats
skewness
```

Out[55]:

The last step needed to prepare the data is to make sure that all categorical predictor variables are converted into a form that is usable by machine learning algorithms. This process is known as 'one-hot encoding' the categorical variables.

The process involved all non-ordinal factors receiving their own separate column with 1's and 0's, and is required by most ML algorithms.

In [56]:

```
all_data = pd.get_dummies(all_data)
print(all_data.shape)
all_data.head(3)
```

Out[56]:

Now that the data is correctly processed, we are ready to begin building our predictive models.

In [57]:

```
final_y_train = all_data['SalePrice'][~all_data['SalePrice'].isnull()]
final_X_train = all_data[all_data['Dataset_Train'] == 1].drop(['Dataset_Train', 'Dataset_Test', 'SalePrice'], axis=1)
final_X_test = all_data[all_data['Dataset_Test'] == 1].drop(['Dataset_Train', 'Dataset_Test', 'SalePrice'], axis=1)
```

We will use the cross_val_score function of Sklearn and calculate the Root-Mean-Squared Error (RMSE) as a measure of accuracy.

In [58]:

```
n_folds = 10
def rmse_cv(model):
kf = KFold(n_folds, shuffle=True, random_state=1234).get_n_splits(final_X_train.values)
rmse= np.sqrt(-cross_val_score(model, final_X_train.values, final_y_train, scoring="neg_mean_squared_error", cv = kf))
return(rmse)
```

This model may be senstitive to outliers, so we need to make it more robust. This will be done using the **RobustScaler** function on pipeline.

In [59]:

```
lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.0005, random_state = 1234))
```

Next, we tested out multiple alpha values and compared their accuracy using the **rmse_cv** function above. The results below show that using a value of 0.00025 is the optimal value of alpha for a Lasso Regression.

In [60]:

```
lasso_alpha = [1, 0.5, 0.25, 0.1, 0.05, 0.025, 0.01, 0.005, 0.0025, 0.001, 0.0005, 0.00025, 0.0001]
lasso_rmse = []
for value in lasso_alpha:
lasso = make_pipeline(RobustScaler(), Lasso(alpha = value, max_iter=3000, random_state = 1234))
lasso_rmse.append(rmse_cv(lasso).mean())
lasso_score_table = pd.DataFrame(lasso_rmse,lasso_alpha,columns=['RMSE'])
display(lasso_score_table.transpose())
plt.semilogx(lasso_alpha, lasso_rmse)
plt.xlabel('alpha')
plt.ylabel('score')
plt.show()
print("\nLasso Score: {:.4f} (alpha = {:.5f})\n".format(min(lasso_score_table['RMSE']), lasso_score_table.idxmin()[0]))
```

Using the newly defined alpha value, we can optimize the model and predict the missing values of *SalePrice*. The predictions are then formatted in a appropriate layout for submission to Kaggle.

In [61]:

```
lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.00025, random_state = 1234))
lasso.fit(final_X_train, final_y_train)
lasso_preds = np.expm1(lasso.predict(final_X_test))
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = lasso_preds
#sub.to_csv('Lasso Submission 2.csv',index=False)
```

Identical steps were taken for the ridge regression model as we took for the Lasso Regression Model. In the case of the Ridge model, the optimal alpha value was 6.

In [62]:

```
ridge_alpha = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
ridge_rmse = []
for value in ridge_alpha:
ridge = make_pipeline(RobustScaler(), Ridge(alpha = value, random_state = 1234))
ridge_rmse.append(rmse_cv(ridge).mean())
ridge_score_table = pd.DataFrame(ridge_rmse,ridge_alpha,columns=['RMSE'])
display(ridge_score_table.transpose())
plt.semilogx(ridge_alpha, ridge_rmse)
plt.xlabel('alpha')
plt.ylabel('score')
plt.show()
print("\nRidge Score: {:.4f} (alpha = {:.4f})\n".format(min(ridge_score_table['RMSE']), ridge_score_table.idxmin()[0]))
```

In [63]:

```
ridge = make_pipeline(RobustScaler(), Ridge(alpha = 6, random_state = 1234))
ridge.fit(final_X_train, final_y_train)
ridge_preds = np.expm1(ridge.predict(final_X_test))
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = ridge_preds
#sub.to_csv('Ridge Submission.csv',index=False)
```

Since there are multiple hyperparameters to tune in the XGBoost model, we will use the GridSearchCV function of Sklearn to determine the optimal values. Next, I used the train_test_split function to generate a validation set and find the RMSE of the models. This method is used in lieu of the rmse_cv function used above for the Lasso and Ridge Regression models.

In [64]:

```
xg_X_train, xg_X_test, xg_y_train, xg_y_test = train_test_split(final_X_train, final_y_train, test_size=0.33, random_state=1234)
xg_model = XGBRegressor(n_estimators=100, seed = 1234)
param_dict = {'max_depth': [3,4,5],
'min_child_weight': [2,3,4],
'learning_rate': [0.05, 0.1,0.15],
'gamma': [0.0, 0.1, 0.2]
}
start = time()
grid_search = GridSearchCV(xg_model, param_dict)
grid_search.fit(xg_X_train, xg_y_train)
print("GridSearch took %.2f seconds to complete." % (time()-start))
display(grid_search.best_params_)
```

Now that the hyperparameter have been chosen, we can calculate the validation RMSE of the XGBoost model.

In [65]:

```
xg_model = XGBRegressor(n_estimators = 1000,
learning_rate = 0.1,
max_depth = 4,
min_child_weight = 4,
gamma = 0,
seed = 1234)
start = time()
xg_model.fit(xg_X_train, xg_y_train)
xg_preds = xg_model.predict(xg_X_test)
print("Model took %.2f seconds to complete." % (time()-start))
print("RMSE: %.4f" % sqrt(mean_squared_error(xg_y_test, xg_preds)))
```

Lastly, we predict the *SalePrice* using the test data. The predictions are then formatted in a appropriate layout for submission to Kaggle.

In [66]:

```
xg_model = XGBRegressor(n_estimators = 1000,
learning_rate = 0.1,
max_depth = 4,
min_child_weight = 4,
gamma = 0,
seed = 1234)
xg_model.fit(final_X_train, final_y_train)
xg_preds = np.expm1(xg_model.predict(final_X_test))
```

In [67]:

```
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = xg_preds
#sub.to_csv('XGBoost Submission.csv',index=False)
```

Since the Lasso, Ridge, XGBoost algorithms so different, averaging the final *Saleprice* predictions may improve the accuracy. Since the Ridge regression performed the best with regards to the final RMSE (0.125 vs 0.126 and 0.127), I will assign it's weight to be double that of the other two models.

Our final ensemble model performed better than any individual regression model (**RMSE = 0.12220**).

In [68]:

```
lasso = make_pipeline(RobustScaler(), Lasso(alpha = 0.00025, random_state = 1234))
lasso.fit(final_X_train, final_y_train)
lasso_preds = np.expm1(lasso.predict(final_X_test))
ridge = make_pipeline(RobustScaler(), Ridge(alpha = 6, random_state = 1234))
ridge.fit(final_X_train, final_y_train)
ridge_preds = np.expm1(ridge.predict(final_X_test))
xg_model = XGBRegressor(n_estimators = 1000,
learning_rate = 0.1,
max_depth = 4,
min_child_weight = 4,
gamma = 0,
seed = 1234)
xg_model.fit(final_X_train, final_y_train)
xg_preds = np.expm1(xg_model.predict(final_X_test))
weights = [0.5, 0.25, 0.25]
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = (ridge_preds*weights[0]) + (lasso_preds*weights[1]) + (xg_preds*weights[2])
#sub.to_csv('Ensemble Submission.csv',index=False)
```