In this competition we have 79 explanatory variables describing aspects of residential homes in Ames, Iowa. The goal is to predict the sale prices of these houses.
The metric used to evaluate predictions is the Root Mean Squared Logarithmic Error (it penalizes under-prediction more heavily than over-prediction).
The RMSLE is calculated as
$$ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 } $$
where $\epsilon$ is the RMSLE value (score), $n$ is the number of observations, $p_i$ is the prediction and $a_i$ the actual response for observation $i$, and $\log(x)$ is the natural logarithm of $x$.
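As a minimal illustration of the formula (a sketch for exposition only; later in the notebook I define a variant adapted to the log-transformed target):
def rmsle(predicted, actual):
    #Direct translation of the formula above; np.log1p(x) computes log(x + 1).
    #Assumes numpy is imported as np (see the imports below).
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))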
First I explore the data, fill in missing values, and visualize some features; then I try several models for prediction.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib import rcParams
import xgboost as xgb
%matplotlib inline
sns.set_style('whitegrid')
from scipy.stats import spearmanr, skew, pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, RidgeCV, LassoCV
from sklearn import linear_model
data = pd.read_csv('../input/train.csv') #The training set is used as 'data' throughout.
test = pd.read_csv('../input/test.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
data.describe(include='all')
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460.000000 | 1460.000000 | 1460 | 1201.000000 | 1460.000000 | 1460 | 91 | 1460 | 1460 | 1460 | ... | 1460.000000 | 7 | 281 | 54 | 1460.000000 | 1460.000000 | 1460.000000 | 1460 | 1460 | 1460.000000 |
unique | NaN | NaN | 5 | NaN | NaN | 2 | 2 | 4 | 4 | 2 | ... | NaN | 3 | 4 | 4 | NaN | NaN | NaN | 9 | 6 | NaN |
top | NaN | NaN | RL | NaN | NaN | Pave | Grvl | Reg | Lvl | AllPub | ... | NaN | Gd | MnPrv | Shed | NaN | NaN | NaN | WD | Normal | NaN |
freq | NaN | NaN | 1151 | NaN | NaN | 1454 | 50 | 925 | 1311 | 1459 | ... | NaN | 3 | 157 | 49 | NaN | NaN | NaN | 1267 | 1198 | NaN |
mean | 730.500000 | 56.897260 | NaN | 70.049958 | 10516.828082 | NaN | NaN | NaN | NaN | NaN | ... | 2.758904 | NaN | NaN | NaN | 43.489041 | 6.321918 | 2007.815753 | NaN | NaN | 180921.195890 |
std | 421.610009 | 42.300571 | NaN | 24.284752 | 9981.264932 | NaN | NaN | NaN | NaN | NaN | ... | 40.177307 | NaN | NaN | NaN | 496.123024 | 2.703626 | 1.328095 | NaN | NaN | 79442.502883 |
min | 1.000000 | 20.000000 | NaN | 21.000000 | 1300.000000 | NaN | NaN | NaN | NaN | NaN | ... | 0.000000 | NaN | NaN | NaN | 0.000000 | 1.000000 | 2006.000000 | NaN | NaN | 34900.000000 |
25% | 365.750000 | 20.000000 | NaN | 59.000000 | 7553.500000 | NaN | NaN | NaN | NaN | NaN | ... | 0.000000 | NaN | NaN | NaN | 0.000000 | 5.000000 | 2007.000000 | NaN | NaN | 129975.000000 |
50% | 730.500000 | 50.000000 | NaN | 69.000000 | 9478.500000 | NaN | NaN | NaN | NaN | NaN | ... | 0.000000 | NaN | NaN | NaN | 0.000000 | 6.000000 | 2008.000000 | NaN | NaN | 163000.000000 |
75% | 1095.250000 | 70.000000 | NaN | 80.000000 | 11601.500000 | NaN | NaN | NaN | NaN | NaN | ... | 0.000000 | NaN | NaN | NaN | 0.000000 | 8.000000 | 2009.000000 | NaN | NaN | 214000.000000 |
max | 1460.000000 | 190.000000 | NaN | 313.000000 | 215245.000000 | NaN | NaN | NaN | NaN | NaN | ... | 738.000000 | NaN | NaN | NaN | 15500.000000 | 12.000000 | 2010.000000 | NaN | NaN | 755000.000000 |
11 rows × 81 columns
data.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
Quite a lot of variables, many of them categorical, which makes the analysis more complex. And there are a lot of missing values. Or are they really missing? For many features a NaN value simply means the absence of that feature (for example, no garage).
First, let's see which columns have missing values.
data[data.columns[data.isnull().sum() > 0].tolist()].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 19 columns):
LotFrontage     1201 non-null float64
Alley           91 non-null object
MasVnrType      1452 non-null object
MasVnrArea      1452 non-null float64
BsmtQual        1423 non-null object
BsmtCond        1423 non-null object
BsmtExposure    1422 non-null object
BsmtFinType1    1423 non-null object
BsmtFinType2    1422 non-null object
Electrical      1459 non-null object
FireplaceQu     770 non-null object
GarageType      1379 non-null object
GarageYrBlt     1379 non-null float64
GarageFinish    1379 non-null object
GarageQual      1379 non-null object
GarageCond      1379 non-null object
PoolQC          7 non-null object
Fence           281 non-null object
MiscFeature     54 non-null object
dtypes: float64(3), object(16)
memory usage: 216.8+ KB
Now I create lists of the columns that have missing values in train and in test. Then I find the columns with missing values in test but not in train, and vice versa.
list_data = data.columns[data.isnull().sum() > 0].tolist()
list_test = test.columns[test.isnull().sum() > 0].tolist()
test[[i for i in list_test if i not in list_data]].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 15 columns):
MSZoning        1455 non-null object
Utilities       1457 non-null object
Exterior1st     1458 non-null object
Exterior2nd     1458 non-null object
BsmtFinSF1      1458 non-null float64
BsmtFinSF2      1458 non-null float64
BsmtUnfSF       1458 non-null float64
TotalBsmtSF     1458 non-null float64
BsmtFullBath    1457 non-null float64
BsmtHalfBath    1457 non-null float64
KitchenQual     1458 non-null object
Functional      1457 non-null object
GarageCars      1458 non-null float64
GarageArea      1458 non-null float64
SaleType        1458 non-null object
dtypes: float64(8), object(7)
memory usage: 171.1+ KB
data[[i for i in list_data if i not in list_test]].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 1 columns):
Electrical    1459 non-null object
dtypes: object(1)
memory usage: 11.5+ KB
A missing value in the columns listed below most likely means the absence of the feature itself (no alley, no basement, no fireplace, no garage, no pool and so on), so categorical columns get filled with 'None' and numerical ones with 0.
For the other variables, missing values can be replaced with the most common values:
#Create a list of columns to fill with 'None' (categorical) or 0 (numerical).
to_null = ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'GarageYrBlt', 'BsmtFullBath', 'BsmtHalfBath',
'PoolQC', 'Fence', 'MiscFeature']
for col in to_null:
if data[col].dtype == 'object':
data[col].fillna('None',inplace=True)
test[col].fillna('None',inplace=True)
else:
data[col].fillna(0,inplace=True)
test[col].fillna(0,inplace=True)
#Fill NA with common values.
test.loc[test.KitchenQual.isnull(), 'KitchenQual'] = 'TA'
test.loc[test.MSZoning.isnull(), 'MSZoning'] = 'RL'
test.loc[test.Utilities.isnull(), 'Utilities'] = 'AllPub'
test.loc[test.Exterior1st.isnull(), 'Exterior1st'] = 'VinylSd'
test.loc[test.Exterior2nd.isnull(), 'Exterior2nd'] = 'VinylSd'
test.loc[test.Functional.isnull(), 'Functional'] = 'Typ'
test.loc[test.SaleType.isnull(), 'SaleType'] = 'WD'
data.loc[data['Electrical'].isnull(), 'Electrical'] = 'SBrkr'
data.loc[data['LotFrontage'].isnull(), 'LotFrontage'] = data['LotFrontage'].mean()
test.loc[test['LotFrontage'].isnull(), 'LotFrontage'] = test['LotFrontage'].mean()
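A common refinement, sketched below but not applied in this notebook, is to impute LotFrontage from the median frontage of the house's neighborhood, since frontage tends to be fairly homogeneous within a neighborhood (computed here on a fresh copy so the mean fill above stays untouched):
#Alternative sketch only: neighborhood-median imputation for LotFrontage.
raw = pd.read_csv('../input/train.csv')
neigh_median = raw.groupby('Neighborhood')['LotFrontage'].transform('median')
raw['LotFrontage'] = raw['LotFrontage'].fillna(neigh_median)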
There are several additional cases: when a categorical variable is 'None', the related numerical variable should be 0. For example, if there is no veneer (MasVnrType is 'None'), MasVnrArea should be 0.
data.loc[data.MasVnrType == 'None', 'MasVnrArea'] = 0
test.loc[test.MasVnrType == 'None', 'MasVnrArea'] = 0
test.loc[test.BsmtFinType1=='None', 'BsmtFinSF1'] = 0
test.loc[test.BsmtFinType2=='None', 'BsmtFinSF2'] = 0
test.loc[test.BsmtQual=='None', 'BsmtUnfSF'] = 0
test.loc[test.BsmtQual=='None', 'TotalBsmtSF'] = 0
#There is only one row where GarageCars and GarageArea are null, and it seems the house simply has no garage.
test.loc[test.GarageCars.isnull()]
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1116 | 2577 | 70 | RM | 50.0 | 9060 | Pave | None | Reg | Lvl | AllPub | ... | 0 | 0 | None | MnPrv | None | 0 | 3 | 2007 | WD | Alloca |
1 rows × 80 columns
test.loc[test.GarageCars.isnull(), 'GarageCars'] = 0
test.loc[test.GarageArea.isnull(), 'GarageArea'] = 0
First I'll look at correlations between the variables, then visualize some of the data to see the impact of certain features.
corr = data.corr()
plt.figure(figsize=(12, 12))
sns.heatmap(corr, vmax=1)
It seems that only a few pairs of variables are highly correlated. But this chart shows only pairs of numerical columns, so I'll calculate correlations for all variables.
threshold = 0.8 # Minimum absolute correlation to report.
def correlation():
    for pos, i in enumerate(data.columns):
        for j in data.columns[pos + 1:]: #Start after i, so each pair is checked only once.
            if data[i].dtype != 'object' and data[j].dtype != 'object':
                #Pearson for pairs of numerical columns.
                r = pearsonr(data[i], data[j])[0]
            else:
                #Spearman (rank-based) when at least one column is categorical.
                r = spearmanr(data[i], data[j])[0]
            if abs(r) >= threshold:
                yield (r, i, j)
corr_list = list(correlation())
corr_list
D:\Programs\Anaconda3\lib\site-packages\scipy\stats\stats.py:253: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. "values. nan values will be ignored.", RuntimeWarning)
[(0.85848725676346904, 'Exterior1st', 'Exterior2nd'),
 (-0.89606878858916439, 'BsmtFinType2', 'BsmtFinSF2'),
 (0.81952997500503311, 'TotalBsmtSF', '1stFlrSF'),
 (0.82548937430884295, 'GrLivArea', 'TotRmsAbvGrd'),
 (0.88247541428146214, 'GarageCars', 'GarageArea'),
 (-0.99999111097112325, 'PoolArea', 'PoolQC'),
 (0.9028952966055307, 'MiscFeature', 'MiscVal')]
These are the most highly correlated pairs of features. None of them is surprising, and none of these features needs to be removed.
#It seems that SalePrice is skewed, so it needs to be transformed.
sns.distplot(data['SalePrice'], kde=False, color='c', hist_kws={'alpha': 0.9})
#As expected, price rises with overall quality.
sns.regplot(x='OverallQual', y='SalePrice', data=data, color='Orange')
#Price also varies depending on neighborhood.
plt.figure(figsize = (12, 6))
sns.boxplot(x='Neighborhood', y='SalePrice', data=data)
xt = plt.xticks(rotation=30)
#There are many small houses; one- and two-story styles dominate.
plt.figure(figsize = (12, 6))
sns.countplot(x='HouseStyle', data=data)
xt = plt.xticks(rotation=30)
#And most of the houses are single-family, so it isn't surprising that most of them aren't large.
sns.countplot(x='BldgType', data=data)
xt = plt.xticks(rotation=30)
#Most fireplaces are of good or average quality, and nearly half of the houses have no fireplace at all.
pd.crosstab(data.Fireplaces, data.FireplaceQu)
FireplaceQu | Ex | Fa | Gd | None | Po | TA |
---|---|---|---|---|---|---|
Fireplaces | ||||||
0 | 0 | 0 | 0 | 690 | 0 | 0 |
1 | 19 | 28 | 324 | 0 | 20 | 259 |
2 | 4 | 4 | 54 | 0 | 0 | 53 |
3 | 1 | 1 | 2 | 0 | 0 | 1 |
sns.factorplot('HeatingQC', 'SalePrice', hue='CentralAir', data=data)
sns.factorplot('Heating', 'SalePrice', hue='CentralAir', data=data)
Houses with central air conditioning cost more. Interestingly, houses with poor and with good heating quality cost nearly the same if they have central air conditioning. Also, only houses with gas heating have central air conditioning.
#One more interesting point: while paved street access clearly raises prices, the alley surface type matters much less.
fig, ax = plt.subplots(1, 2, figsize = (12, 5))
sns.boxplot(x='Street', y='SalePrice', data=data, ax=ax[0])
sns.boxplot(x='Alley', y='SalePrice', data=data, ax=ax[1])
#While overall quality is roughly normally distributed, the overall condition of most houses is simply average.
fig, ax = plt.subplots(1, 2, figsize = (12, 5))
sns.countplot(x='OverallCond', data=data, ax=ax[0])
sns.countplot(x='OverallQual', data=data, ax=ax[1])
fig, ax = plt.subplots(2, 3, figsize = (16, 12))
ax[0,0].set_title('Gable')
ax[0,1].set_title('Hip')
ax[0,2].set_title('Gambrel')
ax[1,0].set_title('Mansard')
ax[1,1].set_title('Flat')
ax[1,2].set_title('Shed')
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Gable'], jitter=True, ax=ax[0,0])
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Hip'], jitter=True, ax=ax[0,1])
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Gambrel'], jitter=True, ax=ax[0,2])
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Mansard'], jitter=True, ax=ax[1,0])
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Flat'], jitter=True, ax=ax[1,1])
sns.stripplot(x="RoofMatl", y="SalePrice", data=data[data.RoofStyle == 'Shed'], jitter=True, ax=ax[1,2])
These graphs show roof materials broken down by roof style. Most houses have a Gable or Hip roof, and most roofs are made of the standard composite shingle.
sns.stripplot(x="GarageQual", y="SalePrice", data=data, hue='GarageFinish', jitter=True)
Most finished garages have average quality.
sns.pointplot(x="PoolArea", y="SalePrice", hue="PoolQC", data=data)
It is worth noting that there are only 7 pools, each with a different area. While the mean price for most of them is around $200,000-300,000, the house whose pool area is 555 square feet sold for far more. Let's take a look.
#There is only one such pool, and its sale condition is 'Abnorml'.
data.loc[data.PoolArea == 555]
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1182 | 1183 | 60 | RL | 160.0 | 15623 | Pave | None | IR1 | Lvl | AllPub | ... | 555 | Ex | MnPrv | None | 0 | 7 | 2007 | WD | Abnorml | 745000 |
1 rows × 81 columns
fig, ax = plt.subplots(1, 2, figsize = (12, 5))
sns.stripplot(x="SaleType", y="SalePrice", data=data, jitter=True, ax=ax[0])
sns.stripplot(x="SaleCondition", y="SalePrice", data=data, jitter=True, ax=ax[1])
Most of the sold houses are either new or sold under a Warranty Deed. Only a small number of sales are between family members, adjoining land purchases, or allocations.
#MSSubClass contains codes for the dwelling type, so it is clearly a categorical variable.
data['MSSubClass'].unique()
array([ 60, 20, 70, 50, 190, 45, 90, 120, 30, 85, 80, 160, 75, 180, 40], dtype=int64)
data['MSSubClass'] = data['MSSubClass'].astype(str)
test['MSSubClass'] = test['MSSubClass'].astype(str)
Transforming skewed data and creating dummy variables for the categorical features.
for col in data.columns:
    if data[col].dtype != 'object':
        #Log-transform skewed numerical columns (this includes SalePrice).
        if skew(data[col]) > 0.75:
            data[col] = np.log1p(data[col])
    else:
        #Replace each categorical column with prefixed dummy columns.
        dummies = pd.get_dummies(data[col], drop_first=False)
        dummies = dummies.add_prefix("{}_".format(col))
        data.drop(col, axis=1, inplace=True)
        data = data.join(dummies)
for col in test.columns:
    if test[col].dtype != 'object':
        if skew(test[col]) > 0.75:
            test[col] = np.log1p(test[col])
    else:
        dummies = pd.get_dummies(test[col], drop_first=False)
        dummies = dummies.add_prefix("{}_".format(col))
        test.drop(col, axis=1, inplace=True)
        test = test.join(dummies)
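Note that dummifying train and test separately can leave the two frames with different column sets (this comes up below with RoofMatl_ClyTile). A minimal sketch of an alternative, not used here, that aligns the frames right after dummifying (variable names are my own):
#Sketch only: create any column missing on one side and fill it with zeros.
#SalePrice exists only in train, so it is set aside first.
y = data['SalePrice']
X_aligned, test_aligned = data.drop('SalePrice', axis=1).align(test, join='outer', axis=1, fill_value=0)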
Creating some new features might be a good idea, but I decided to do without them: it is time-consuming, the model is good enough as is, and the number of features is quite high already. (For reference, a sketch of a typical engineered feature follows.)
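A popular engineered feature for this dataset is total square footage; here is a minimal sketch, computed on a fresh copy of the raw data because the columns above have already been log-transformed:
#Illustrative only: total area combining the basement and both floors.
raw = pd.read_csv('../input/train.csv')
total_sf = raw['TotalBsmtSF'] + raw['1stFlrSF'] + raw['2ndFlrSF']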
#This is how the data looks now.
data.head()
Id | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 4.189655 | 9.042040 | 7 | 5 | 2003 | 2003 | 5.283204 | 6.561031 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 2 | 4.394449 | 9.169623 | 6 | 8 | 1976 | 1976 | 0.000000 | 6.886532 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 3 | 4.234107 | 9.328212 | 7 | 5 | 2001 | 2002 | 5.093750 | 6.188264 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 4 | 4.110874 | 9.164401 | 7 | 5 | 1915 | 1970 | 0.000000 | 5.379897 | 0.0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 5 | 4.442651 | 9.565284 | 8 | 5 | 2000 | 2000 | 5.860786 | 6.486161 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 318 columns
X_train = data.drop('SalePrice',axis=1)
Y_train = data['SalePrice']
X_test = test
#Function to measure accuracy.
def rmlse(val, target):
    #val holds predicted prices on the original scale; target holds log1p-transformed
    #prices, so np.log1p(np.expm1(target)) simply recovers target.
    return np.sqrt(np.sum(((np.log1p(val) - np.log1p(np.expm1(target)))**2) / len(target)))
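A quick sanity check with made-up values: feeding the true prices back in should give a score of exactly zero.
#Hypothetical values: a perfect prediction scores 0.
y_log = np.log1p(np.array([100000.0, 200000.0]))
print(rmlse(np.expm1(y_log), y_log)) #0.0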
Xtrain, Xtest, ytrain, ytest = train_test_split(X_train, Y_train, test_size=0.33)
I'll try several models.
Ridge is a linear least squares model with L2 regularization (penalizing the squared magnitudes of the coefficients).
RidgeCV is Ridge regression with built-in cross-validation.
Lasso is a linear model trained with L1 regularization (penalizing the absolute values of the coefficients).
LassoCV is a Lasso linear model with iterative fitting along a regularization path.
Random Forest is usually good in cases with many features.
And XGBoost is a library which has become very popular lately and usually gives good results.
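Before fitting on a single split, here is a hedged sketch of how these models could instead be compared with 5-fold cross-validation (the cv_rmse helper is my own; since Y_train is already log1p-transformed, plain RMSE on it matches the competition metric):
#Sketch: mean 5-fold CV score on the log-transformed target; lower is better.
def cv_rmse(model):
    mse = -cross_val_score(model, X_train, Y_train, scoring='neg_mean_squared_error', cv=5)
    return np.sqrt(mse).mean()
#Example usage: print(cv_rmse(Ridge(alpha=10)))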
ridge = Ridge(alpha=10, solver='auto').fit(Xtrain, ytrain)
val_ridge = np.expm1(ridge.predict(Xtest))
rmlse(val_ridge, ytest)
0.13346260272941882
ridge_cv = RidgeCV(alphas=(0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
ridge_cv.fit(Xtrain, ytrain)
val_ridge_cv = np.expm1(ridge_cv.predict(Xtest))
rmlse(val_ridge_cv, ytest)
0.13346260273214372
las = linear_model.Lasso(alpha=0.0005).fit(Xtrain, ytrain)
las_ridge = np.expm1(las.predict(Xtest))
rmlse(las_ridge, ytest)
0.12607216571928639
las_cv = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
las_cv.fit(Xtrain, ytrain)
val_las_cv = np.expm1(las_cv.predict(Xtest))
rmlse(val_las_cv, ytest)
D:\Programs\Anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems. ConvergenceWarning)
0.12607216571928639
model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.2) #the params were tuned using xgb.cv
model_xgb.fit(Xtrain, ytrain)
xgb_preds = np.expm1(model_xgb.predict(Xtest))
rmlse(xgb_preds, ytest)
0.13385573967805864
forest = RandomForestRegressor(min_samples_split=5,
                               min_weight_fraction_leaf=0.0,
                               max_leaf_nodes=None,
                               max_depth=None,
                               n_estimators=300,
                               max_features='auto')
forest.fit(Xtrain, ytrain)
Y_pred_RF = np.expm1(forest.predict(Xtest))
rmlse(Y_pred_RF, ytest)
0.15645551722765741
So linear models perform better than the others, and Lasso is the best.
The Lasso model has one nice property: it performs feature selection, as it assigns zero weights to the least important variables.
coef = pd.Series(las_cv.coef_, index = X_train.columns)
v = coef.loc[las_cv.coef_ != 0].count()
print('So we have ' + str(v) + ' variables')
So we have 126 variables
#Sort features by the absolute value of their weights and keep the v features with nonzero weights.
indices = np.argsort(abs(las_cv.coef_))[::-1][0:v]
#Features to be used. I do this because I want to see how well other models perform with these features.
features = X_train.columns[indices]
for i in features:
if i not in X_test.columns:
print(i)
RoofMatl_ClyTile
There is only one selected feature that isn't in the test data. I'll simply add this column with zero values.
X_test['RoofMatl_ClyTile'] = 0
X = X_train[features]
Xt = X_test[features]
Let's see whether something changed.
Xtrain1, Xtest1, ytrain1, ytest1 = train_test_split(X, Y_train, test_size=0.33)
ridge = Ridge(alpha=5, solver='svd').fit(Xtrain1, ytrain1)
val_ridge = np.expm1(ridge.predict(Xtest1))
rmlse(val_ridge, ytest1)
0.11924552653155912
las_cv = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10)).fit(Xtrain1, ytrain1)
val_las = np.expm1(las_cv.predict(Xtest1))
rmlse(val_las, ytest1)
0.11565867162196054
model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.2) #the params were tuned using xgb.cv
model_xgb.fit(Xtrain1, ytrain1)
xgb_preds = np.expm1(model_xgb.predict(Xtest1))
rmlse(xgb_preds, ytest1)
0.12409445447687704
forest = RandomForestRegressor(min_samples_split=5,
                               min_weight_fraction_leaf=0.0,
                               max_leaf_nodes=None,
                               max_depth=100,
                               n_estimators=300,
                               max_features=None)
forest.fit(Xtrain1, ytrain1)
Y_pred_RF = np.expm1(forest.predict(Xtest1))
rmlse(Y_pred_RF, ytest1)
0.1461884843752316
The scores here actually look a bit better, though part of the difference simply comes from the random seed used when splitting the data. It's time for prediction!
las_cv1 = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
las_cv1.fit(X, Y_train)
lasso_preds = np.expm1(las_cv1.predict(Xt))
#I added XGBoost as it usually improves the predictions.
model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.1)
model_xgb.fit(X, Y_train)
xgb_preds = np.expm1(model_xgb.predict(Xt))
preds = 0.7 * lasso_preds + 0.3 * xgb_preds
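The 0.7/0.3 weights were picked by hand; here is a hedged sketch of how one could check them on the earlier holdout split instead (refitting both models on Xtrain1 only; the variable names are my own):
#Sketch only: refit on the holdout training part, then grid-search the blend weight.
lasso_hold = np.expm1(LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01)).fit(Xtrain1, ytrain1).predict(Xtest1))
xgb_hold = np.expm1(xgb.XGBRegressor(n_estimators=340, max_depth=2,
                                     learning_rate=0.1).fit(Xtrain1, ytrain1).predict(Xtest1))
scores = {w: rmlse(w * lasso_hold + (1 - w) * xgb_hold, ytest1)
          for w in np.round(np.linspace(0, 1, 11), 2)}
print(min(scores, key=scores.get), min(scores.values()))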
submission = pd.DataFrame({
'Id': test['Id'].astype(int),
'SalePrice': preds
})
submission.to_csv('home.csv', index=False)
But the result wasn't very good. After some thought I decided that the problem could lie in feature selection: maybe I selected bad features, or maybe the random seed gave unlucky results. So I decided to try selecting features based on the full training dataset (not just on part of the data).
model_lasso = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10, 100))
model_lasso.fit(X_train, Y_train)
coef = pd.Series(model_lasso.coef_, index = X_train.columns)
v1 = coef.loc[model_lasso.coef_ != 0].count()
print('So we have ' + str(v1) + ' variables')
So we have 120 variables
indices = np.argsort(abs(model_lasso.coef_))[::-1][0:v1]
features_f=X_train.columns[indices]
print('Features in full, but not in val:')
for i in features_f:
if i not in features:
print(i)
print('\n' + 'Features in val, but not in full:')
for i in features:
if i not in features_f:
print(i)
Features in full, but not in val:
1stFlrSF
GarageCond_Fa
SaleType_New
Functional_Maj2
Foundation_BrkTil
MSSubClass_120
SaleType_COD
LandSlope_Mod
SaleCondition_Family
KitchenQual_TA
LotShape_IR1
Heating_Grav
MasVnrType_BrkCmn
BsmtFinType1_Rec
BedroomAbvGr
HeatingQC_TA
Exterior2nd_VinylSd
MasVnrType_Stone

Features in val, but not in full:
SaleCondition_Partial
LandSlope_Gtl
Neighborhood_MeadowV
Alley_None
MSZoning_RL
LandContour_Lvl
MSSubClass_60
BsmtFinType1_GLQ
Foundation_CBlock
SaleType_ConLD
Exterior2nd_HdBoard
Exterior2nd_Wd Shng
BsmtQual_Fa
BsmtFinType1_BLQ
BsmtFinType2_ALQ
Electrical_SBrkr
BsmtFinType1_ALQ
Neighborhood_Gilbert
SaleCondition_Alloca
ExterQual_Gd
BsmtCond_TA
Fence_None
HeatingQC_Gd
LotShape_Reg
The selected features differ quite a lot. I suppose the reason is that in the first case there was too little data relative to the number of features. So I'll use the features obtained from the analysis of the whole training dataset.
for i in features_f:
if i not in X_test.columns:
X_test[i] = 0
print(i)
X = X_train[features_f]
Xt = X_test[features_f]
Now all necessary features are present in both train and test.
model_lasso = LassoCV(alphas=(0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
model_lasso.fit(X, Y_train)
lasso_preds = np.expm1(model_lasso.predict(Xt))
model_xgb = xgb.XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.1) #the params were tuned using xgb.cv
model_xgb.fit(X, Y_train)
xgb_preds = np.expm1(model_xgb.predict(Xt))
solution = pd.DataFrame({"Id": test.Id, "SalePrice": 0.7*lasso_preds + 0.3*xgb_preds}) #Kaggle expects the column name "Id".
solution.to_csv("House_price.csv", index = False)
The best result I got with this model was 0.12922, while the current top results are around 0.10-0.11.