This data set describes the sale of individual residential property in Ames, Iowa from 2006 to 2010
The data set contains 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values. Here is a txt file. containing information on variables involved.
Data Docmentary can be found here Data can also be downloaded from here For in detail information, click here
In this project, we'll build a Linear Regression model and exploring ways to improve it by
Lets import the necessary tools and libraries
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import linear_model
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
# Display Settings
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
data = pd.read_table('AmesHousing.tsv', delimiter="\t")
data.head(3)
Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 5 | 1960 | 1960 | Hip | CompShg | BrkFace | Plywood | Stone | 112.0 | TA | TA | CBlock | TA | Gd | Gd | BLQ | 639.0 | Unf | 0.0 | 441.0 | 1080.0 | GasA | Fa | Y | SBrkr | 1656 | 0 | 0 | 1656 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | TA | 7 | Typ | 2 | Gd | Attchd | 1960.0 | Fin | 2.0 | 528.0 | TA | TA | P | 210 | 62 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2010 | WD | Normal | 215000 |
1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | Norm | 1Fam | 1Story | 5 | 6 | 1961 | 1961 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | TA | TA | CBlock | TA | TA | No | Rec | 468.0 | LwQ | 144.0 | 270.0 | 882.0 | GasA | TA | Y | SBrkr | 896 | 0 | 0 | 896 | 0.0 | 0.0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | NaN | Attchd | 1961.0 | Unf | 1.0 | 730.0 | TA | TA | Y | 140 | 0 | 0 | 0 | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 105000 |
2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 6 | 1958 | 1958 | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | 108.0 | TA | TA | CBlock | TA | TA | No | ALQ | 923.0 | Unf | 0.0 | 406.0 | 1329.0 | GasA | TA | Y | SBrkr | 1329 | 0 | 0 | 1329 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | NaN | Attchd | 1958.0 | Unf | 1.0 | 312.0 | TA | TA | Y | 393 | 36 | 0 | 0 | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order 2930 non-null int64 1 PID 2930 non-null int64 2 MS SubClass 2930 non-null int64 3 MS Zoning 2930 non-null object 4 Lot Frontage 2440 non-null float64 5 Lot Area 2930 non-null int64 6 Street 2930 non-null object 7 Alley 198 non-null object 8 Lot Shape 2930 non-null object 9 Land Contour 2930 non-null object 10 Utilities 2930 non-null object 11 Lot Config 2930 non-null object 12 Land Slope 2930 non-null object 13 Neighborhood 2930 non-null object 14 Condition 1 2930 non-null object 15 Condition 2 2930 non-null object 16 Bldg Type 2930 non-null object 17 House Style 2930 non-null object 18 Overall Qual 2930 non-null int64 19 Overall Cond 2930 non-null int64 20 Year Built 2930 non-null int64 21 Year Remod/Add 2930 non-null int64 22 Roof Style 2930 non-null object 23 Roof Matl 2930 non-null object 24 Exterior 1st 2930 non-null object 25 Exterior 2nd 2930 non-null object 26 Mas Vnr Type 2907 non-null object 27 Mas Vnr Area 2907 non-null float64 28 Exter Qual 2930 non-null object 29 Exter Cond 2930 non-null object 30 Foundation 2930 non-null object 31 Bsmt Qual 2850 non-null object 32 Bsmt Cond 2850 non-null object 33 Bsmt Exposure 2847 non-null object 34 BsmtFin Type 1 2850 non-null object 35 BsmtFin SF 1 2929 non-null float64 36 BsmtFin Type 2 2849 non-null object 37 BsmtFin SF 2 2929 non-null float64 38 Bsmt Unf SF 2929 non-null float64 39 Total Bsmt SF 2929 non-null float64 40 Heating 2930 non-null object 41 Heating QC 2930 non-null object 42 Central Air 2930 non-null object 43 Electrical 2929 non-null object 44 1st Flr SF 2930 non-null int64 45 2nd Flr SF 2930 non-null int64 46 Low Qual Fin SF 2930 non-null int64 47 Gr Liv Area 2930 non-null int64 48 Bsmt Full Bath 2928 non-null float64 49 Bsmt Half Bath 2928 non-null float64 50 Full Bath 2930 non-null int64 51 Half Bath 2930 non-null int64 52 Bedroom AbvGr 2930 non-null int64 53 Kitchen AbvGr 2930 non-null int64 54 Kitchen Qual 2930 non-null object 55 TotRms AbvGrd 2930 non-null int64 56 Functional 2930 non-null object 57 Fireplaces 2930 non-null int64 58 Fireplace Qu 1508 non-null object 59 Garage Type 2773 non-null object 60 Garage Yr Blt 2771 non-null float64 61 Garage Finish 2771 non-null object 62 Garage Cars 2929 non-null float64 63 Garage Area 2929 non-null float64 64 Garage Qual 2771 non-null object 65 Garage Cond 2771 non-null object 66 Paved Drive 2930 non-null object 67 Wood Deck SF 2930 non-null int64 68 Open Porch SF 2930 non-null int64 69 Enclosed Porch 2930 non-null int64 70 3Ssn Porch 2930 non-null int64 71 Screen Porch 2930 non-null int64 72 Pool Area 2930 non-null int64 73 Pool QC 13 non-null object 74 Fence 572 non-null object 75 Misc Feature 106 non-null object 76 Misc Val 2930 non-null int64 77 Mo Sold 2930 non-null int64 78 Yr Sold 2930 non-null int64 79 Sale Type 2930 non-null object 80 Sale Condition 2930 non-null object 81 SalePrice 2930 non-null int64 dtypes: float64(11), int64(28), object(43) memory usage: 1.8+ MB
data.describe()
Order | PID | MS SubClass | Lot Frontage | Lot Area | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | TotRms AbvGrd | Fireplaces | Garage Yr Blt | Garage Cars | Garage Area | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | Mo Sold | Yr Sold | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2930.00000 | 2.930000e+03 | 2930.000000 | 2440.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2907.000000 | 2929.000000 | 2929.000000 | 2929.000000 | 2929.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2928.000000 | 2928.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2771.000000 | 2929.000000 | 2929.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 | 2930.000000 |
mean | 1465.50000 | 7.144645e+08 | 57.387372 | 69.224590 | 10147.921843 | 6.094881 | 5.563140 | 1971.356314 | 1984.266553 | 101.896801 | 442.629566 | 49.722431 | 559.262547 | 1051.614544 | 1159.557679 | 335.455973 | 4.676792 | 1499.690444 | 0.431352 | 0.061134 | 1.566553 | 0.379522 | 2.854266 | 1.044369 | 6.443003 | 0.599317 | 1978.132443 | 1.766815 | 472.819734 | 93.751877 | 47.533447 | 23.011604 | 2.592491 | 16.002048 | 2.243345 | 50.635154 | 6.216041 | 2007.790444 | 180796.060068 |
std | 845.96247 | 1.887308e+08 | 42.638025 | 23.365335 | 7880.017759 | 1.411026 | 1.111537 | 30.245361 | 20.860286 | 179.112611 | 455.590839 | 169.168476 | 439.494153 | 440.615067 | 391.890885 | 428.395715 | 46.310510 | 505.508887 | 0.524820 | 0.245254 | 0.552941 | 0.502629 | 0.827731 | 0.214076 | 1.572964 | 0.647921 | 25.528411 | 0.760566 | 215.046549 | 126.361562 | 67.483400 | 64.139059 | 25.141331 | 56.087370 | 35.597181 | 566.344288 | 2.714492 | 1.316613 | 79886.692357 |
min | 1.00000 | 5.263011e+08 | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 334.000000 | 0.000000 | 0.000000 | 334.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 1895.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 12789.000000 |
25% | 733.25000 | 5.284770e+08 | 20.000000 | 58.000000 | 7440.250000 | 5.000000 | 5.000000 | 1954.000000 | 1965.000000 | 0.000000 | 0.000000 | 0.000000 | 219.000000 | 793.000000 | 876.250000 | 0.000000 | 0.000000 | 1126.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 5.000000 | 0.000000 | 1960.000000 | 1.000000 | 320.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.000000 | 2007.000000 | 129500.000000 |
50% | 1465.50000 | 5.354536e+08 | 50.000000 | 68.000000 | 9436.500000 | 6.000000 | 5.000000 | 1973.000000 | 1993.000000 | 0.000000 | 370.000000 | 0.000000 | 466.000000 | 990.000000 | 1084.000000 | 0.000000 | 0.000000 | 1442.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 | 1.000000 | 6.000000 | 1.000000 | 1979.000000 | 2.000000 | 480.000000 | 0.000000 | 27.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 160000.000000 |
75% | 2197.75000 | 9.071811e+08 | 70.000000 | 80.000000 | 11555.250000 | 7.000000 | 6.000000 | 2001.000000 | 2004.000000 | 164.000000 | 734.000000 | 0.000000 | 802.000000 | 1302.000000 | 1384.000000 | 703.750000 | 0.000000 | 1742.750000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 7.000000 | 1.000000 | 2002.000000 | 2.000000 | 576.000000 | 168.000000 | 70.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 213500.000000 |
max | 2930.00000 | 1.007100e+09 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | 1526.000000 | 2336.000000 | 6110.000000 | 5095.000000 | 2065.000000 | 1064.000000 | 5642.000000 | 3.000000 | 2.000000 | 4.000000 | 2.000000 | 8.000000 | 3.000000 | 15.000000 | 4.000000 | 2207.000000 | 5.000000 | 1488.000000 | 1424.000000 | 742.000000 | 1012.000000 | 508.000000 | 576.000000 | 800.000000 | 17000.000000 | 12.000000 | 2010.000000 | 755000.000000 |
data.describe()['SalePrice'].head()
count 2930.000000 mean 180796.060068 std 79886.692357 min 12789.000000 25% 129500.000000 Name: SalePrice, dtype: float64
# plot
plt.figure(figsize=(20,8))
sns.histplot(data['SalePrice'])
sns.despine()
# Target's range
data['SalePrice'].max() - data['SalePrice'].min()
742211
# Check for the 'too obvious' Outliers
data[data['SalePrice'] > 700000]
Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1760 | 1761 | 528320050 | 60 | RL | 160.0 | 15623 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 10 | 5 | 1996 | 1996 | Hip | CompShg | Wd Sdng | ImStucc | None | 0.0 | Gd | TA | PConc | Ex | TA | Av | GLQ | 2096.0 | Unf | 0.0 | 300.0 | 2396.0 | GasA | Ex | Y | SBrkr | 2411 | 2065 | 0 | 4476 | 1.0 | 0.0 | 3 | 1 | 4 | 1 | Ex | 10 | Typ | 2 | TA | Attchd | 1996.0 | Fin | 3.0 | 813.0 | TA | TA | Y | 171 | 78 | 0 | 0 | 0 | 555 | Ex | MnPrv | NaN | 0 | 7 | 2007 | WD | Abnorml | 745000 |
1767 | 1768 | 528351010 | 60 | RL | 104.0 | 21535 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 10 | 6 | 1994 | 1995 | Gable | WdShngl | HdBoard | HdBoard | BrkFace | 1170.0 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 1455.0 | Unf | 0.0 | 989.0 | 2444.0 | GasA | Ex | Y | SBrkr | 2444 | 1872 | 0 | 4316 | 0.0 | 1.0 | 3 | 1 | 4 | 1 | Ex | 10 | Typ | 2 | Ex | Attchd | 1994.0 | Fin | 3.0 | 832.0 | TA | TA | Y | 382 | 50 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2007 | WD | Normal | 755000 |
- These entries look realistic at first glace. Though they could tarnish the performace of the model, lets keep them
There are several Ordinal Variables atleast one Nominal Variable that could be represented numerically but they are represented by strings.
Lets provide tailor-made mappings to the Variables that should be represented in Numeric format. Although I'd be doing the One-Hot-Encoding later, these variables are Ordinal and hence, the encoding should be in order as well
To do this mapping meticulously, I will take the exhaustive route
LotShape_mapping = {'Reg':4, 'IR1':3, 'IR2':2, 'IR1':1}
data['Lot Shape'] = data['Lot Shape'].map(LotShape_mapping, na_action='ignore')
data['Lot Shape'].value_counts(dropna=False)
4.0 1859 1.0 979 2.0 76 NaN 16 Name: Lot Shape, dtype: int64
Utilities_mapping = {'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1}
data['Utilities'] = data['Utilities'].map(Utilities_mapping, na_action='ignore')
data['Utilities'].value_counts(dropna=False)
4 2927 3 2 2 1 Name: Utilities, dtype: int64
LandSlope_mapping = {'Gtl':3, 'Mod':2, 'Sev':1}
data['Land Slope'] = data['Land Slope'].map(LandSlope_mapping, na_action='ignore')
data['Land Slope'].value_counts(dropna=False)
3 2789 2 125 1 16 Name: Land Slope, dtype: int64
ExterQual_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}
data['Exter Qual'] = data['Exter Qual'].map(ExterQual_mapping, na_action='ignore')
data['Exter Qual'].value_counts(dropna=False)
3 1799 4 989 5 107 2 35 Name: Exter Qual, dtype: int64
ExterCond_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}
data['Exter Cond'] = data['Exter Cond'].map(ExterCond_mapping, na_action='ignore')
data['Exter Cond'].value_counts(dropna=False)
3 2549 4 299 2 67 5 12 1 3 Name: Exter Cond, dtype: int64
BsmtQual_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}
data['Bsmt Qual'] = data['Bsmt Qual'].map(BsmtQual_mapping, na_action='ignore')
data['Bsmt Qual'].value_counts(dropna=False)
3.0 1283 4.0 1219 5.0 258 2.0 88 NaN 80 1.0 2 Name: Bsmt Qual, dtype: int64
BsmtCond_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}
data['Bsmt Cond'] = data['Bsmt Cond'].map(BsmtCond_mapping, na_action='ignore')
data['Bsmt Cond'].value_counts(dropna=False)
3.0 2616 4.0 122 2.0 104 NaN 80 1.0 5 5.0 3 Name: Bsmt Cond, dtype: int64
BsmtExposure_mapping = {'Gd':4, 'Av':3, 'Mn':2, 'No':1, 'NA':0}
data['Bsmt Exposure'] = data['Bsmt Exposure'].map(BsmtExposure_mapping, na_action='ignore')
data['Bsmt Exposure'].value_counts(dropna=False)
1.0 1906 3.0 418 4.0 284 2.0 239 NaN 83 Name: Bsmt Exposure, dtype: int64
Functional_mapping = {'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1 }
data['Functional'] = data['Functional'].map(Functional_mapping, na_action='ignore')
data['Functional'].value_counts(dropna=False)
8 2728 6 70 7 65 5 35 4 19 3 9 1 2 2 2 Name: Functional, dtype: int64
BsmtFin_mapping = {'GLQ':6, 'ALQ':5, 'BLQ':4, 'Rec':3, 'LwQ':2, 'Unf':1, 'NA':0}
data['BsmtFin Type 1'] = data['BsmtFin Type 1'].map(BsmtFin_mapping, na_action='ignore')
data['BsmtFin Type 1'].value_counts(dropna=False)
6.0 859 1.0 851 5.0 429 3.0 288 4.0 269 2.0 154 NaN 80 Name: BsmtFin Type 1, dtype: int64
HeatingQC_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}
data['Heating QC'] = data['Heating QC'].map(HeatingQC_mapping, na_action='ignore')
data['Heating QC'].value_counts(dropna=False)
5 1495 3 864 4 476 2 92 1 3 Name: Heating QC, dtype: int64
CentralAir_mapping = {'Y': 1, 'N':0}
data['Central Air'] = data['Central Air'].map(CentralAir_mapping, na_action='ignore')
data['Central Air'].value_counts(dropna=False)
1 2734 0 196 Name: Central Air, dtype: int64
Electrical_mapping = {'SBrkr':5, 'FuseA':4, 'Mix':3, 'FuseF':2, 'FuseP':1 } # Assumed category 'Mix' to be at medium
data['Electrical'] = data['Electrical'].map(Electrical_mapping, na_action='ignore')
data['Electrical'].value_counts(dropna=False)
5.0 2682 4.0 188 2.0 50 1.0 8 3.0 1 NaN 1 Name: Electrical, dtype: int64
KitchenQual_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}
data['Kitchen Qual'] = data['Kitchen Qual'].map(KitchenQual_mapping, na_action='ignore')
data['Kitchen Qual'].value_counts(dropna=False)
3 1494 4 1160 5 205 2 70 1 1 Name: Kitchen Qual, dtype: int64
FireplaceQu_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}
data['Fireplace Qu'] = data['Fireplace Qu'].map(FireplaceQu_mapping, na_action='ignore')
data['Fireplace Qu'].value_counts(dropna=False)
NaN 1422 4.0 744 3.0 600 2.0 75 1.0 46 5.0 43 Name: Fireplace Qu, dtype: int64
GarageFinish_mapping = {'Fin':3, 'RFn':2, 'UnF':1, 'NA':0}
data['Garage Finish'] = data['Garage Finish'].map(GarageFinish_mapping, na_action='ignore')
data['Garage Finish'].value_counts(dropna=False)
NaN 1390 2.0 812 3.0 728 Name: Garage Finish, dtype: int64
GarageQual_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}
data['Garage Qual'] = data['Garage Qual'].map(GarageQual_mapping, na_action='ignore')
data['Garage Qual'].value_counts(dropna=False)
3.0 2615 NaN 159 2.0 124 4.0 24 1.0 5 5.0 3 Name: Garage Qual, dtype: int64
GarageCond_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}
data['Garage Cond'] = data['Garage Cond'].map(GarageCond_mapping, na_action='ignore')
data['Garage Cond'].value_counts(dropna=False)
3.0 2665 NaN 159 2.0 74 4.0 15 1.0 14 5.0 3 Name: Garage Cond, dtype: int64
PavedDrive_mapping = {'Y':3, 'P':2, 'N':1}
data['Paved Drive'] = data['Paved Drive'].map(PavedDrive_mapping, na_action='ignore')
data['Paved Drive'].value_counts(dropna=False)
3 2652 1 216 2 62 Name: Paved Drive, dtype: int64
PoolQC_mapping = {'Ex':4, 'Gd':3, 'TA':2, 'Fa':1, 'NA':0}
data['Pool QC'] = data['Pool QC'].map(PoolQC_mapping, na_action='ignore')
data['Pool QC'].value_counts(dropna=False)
NaN 2917 3.0 4 4.0 4 2.0 3 1.0 2 Name: Pool QC, dtype: int64
Fence_mapping = {'GdPrv': 4, 'MnPrv':3, 'GdWo':2, 'MnWw':1, 'NA':0}
data['Fence'] = data['Fence'].map(Fence_mapping, na_action='ignore')
data['Fence'].value_counts(dropna=False)
NaN 2358 3.0 330 4.0 118 2.0 112 1.0 12 Name: Fence, dtype: int64
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order 2930 non-null int64 1 PID 2930 non-null int64 2 MS SubClass 2930 non-null int64 3 MS Zoning 2930 non-null object 4 Lot Frontage 2440 non-null float64 5 Lot Area 2930 non-null int64 6 Street 2930 non-null object 7 Alley 198 non-null object 8 Lot Shape 2914 non-null float64 9 Land Contour 2930 non-null object 10 Utilities 2930 non-null int64 11 Lot Config 2930 non-null object 12 Land Slope 2930 non-null int64 13 Neighborhood 2930 non-null object 14 Condition 1 2930 non-null object 15 Condition 2 2930 non-null object 16 Bldg Type 2930 non-null object 17 House Style 2930 non-null object 18 Overall Qual 2930 non-null int64 19 Overall Cond 2930 non-null int64 20 Year Built 2930 non-null int64 21 Year Remod/Add 2930 non-null int64 22 Roof Style 2930 non-null object 23 Roof Matl 2930 non-null object 24 Exterior 1st 2930 non-null object 25 Exterior 2nd 2930 non-null object 26 Mas Vnr Type 2907 non-null object 27 Mas Vnr Area 2907 non-null float64 28 Exter Qual 2930 non-null int64 29 Exter Cond 2930 non-null int64 30 Foundation 2930 non-null object 31 Bsmt Qual 2850 non-null float64 32 Bsmt Cond 2850 non-null float64 33 Bsmt Exposure 2847 non-null float64 34 BsmtFin Type 1 2850 non-null float64 35 BsmtFin SF 1 2929 non-null float64 36 BsmtFin Type 2 2849 non-null object 37 BsmtFin SF 2 2929 non-null float64 38 Bsmt Unf SF 2929 non-null float64 39 Total Bsmt SF 2929 non-null float64 40 Heating 2930 non-null object 41 Heating QC 2930 non-null int64 42 Central Air 2930 non-null int64 43 Electrical 2929 non-null float64 44 1st Flr SF 2930 non-null int64 45 2nd Flr SF 2930 non-null int64 46 Low Qual Fin SF 2930 non-null int64 47 Gr Liv Area 2930 non-null int64 48 Bsmt Full Bath 2928 non-null float64 49 Bsmt Half Bath 2928 non-null float64 50 Full Bath 2930 non-null int64 51 Half Bath 2930 non-null int64 52 Bedroom AbvGr 2930 non-null int64 53 Kitchen AbvGr 2930 non-null int64 54 Kitchen Qual 2930 non-null int64 55 TotRms AbvGrd 2930 non-null int64 56 Functional 2930 non-null int64 57 Fireplaces 2930 non-null int64 58 Fireplace Qu 1508 non-null float64 59 Garage Type 2773 non-null object 60 Garage Yr Blt 2771 non-null float64 61 Garage Finish 1540 non-null float64 62 Garage Cars 2929 non-null float64 63 Garage Area 2929 non-null float64 64 Garage Qual 2771 non-null float64 65 Garage Cond 2771 non-null float64 66 Paved Drive 2930 non-null int64 67 Wood Deck SF 2930 non-null int64 68 Open Porch SF 2930 non-null int64 69 Enclosed Porch 2930 non-null int64 70 3Ssn Porch 2930 non-null int64 71 Screen Porch 2930 non-null int64 72 Pool Area 2930 non-null int64 73 Pool QC 13 non-null float64 74 Fence 572 non-null float64 75 Misc Feature 106 non-null object 76 Misc Val 2930 non-null int64 77 Mo Sold 2930 non-null int64 78 Yr Sold 2930 non-null int64 79 Sale Type 2930 non-null object 80 Sale Condition 2930 non-null object 81 SalePrice 2930 non-null int64 dtypes: float64(23), int64(37), object(22) memory usage: 1.8+ MB
Lets start by setting up a pipeline of functions that will let us quickly iterate on different models
Functions that transform features, select features and train-and-test features
We will build our functions and make the pipeline gradually and will improve in rounds. So Lets begin with the round 1, where we are just writing up minimal code. As mentioned, We will keep improving on them through out this project.
# For now, just return DataFrame
def transform_features(data):
'''
under developement
'''
df = data.copy()
return df
# For now, it just returns the DataFrame's two columns
def select_features(df):
'''
under developement
'''
return data[['Gr Liv Area', 'SalePrice']]
def train_and_test(data):
'''
under developement
'''
train = data.iloc[:1460]
test = data.iloc[1460:]
numeric_train = train.select_dtypes(include=['integer', 'float'])
numeric_test = test.select_dtypes(include=['integer', 'float'])
target = 'SalePrice'
features = numeric_train.columns.drop(target)
lr = LinearRegression()
lr.fit(train[features], train[target])
predictions = lr.predict(test[features])
mse = mean_squared_error(test[target], predictions)
rmse = np.sqrt(mse)
r2 = r2_score(test[target], predictions)
return rmse, r2
# Features Transformation
round1_transformed_data = transform_features(data)
# Features Selection
round1_selected_data = select_features(round1_transformed_data)
# Training and Testing
round1_rmse_data = train_and_test(round1_selected_data)
print('So in this first and lazy attempt, we have got\nRMSE of {}\nR2 of {}'.format(round1_rmse_data[0],
round1_rmse_data[1]))
So in this first and lazy attempt, we have got RMSE of 57088.25161263909 R2 of 0.5073530501149299
Let's now start removing features with missing values, diving deeper into potential categorical features, transforming text and numerical columns
Lets update the transform_features()
function to achieve following goals
remove the features we don't want to use in the model.
transform features into the proper format like
Create new features by combining existing features
data.select_dtypes(include=['integer', 'float']).info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 60 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order 2930 non-null int64 1 PID 2930 non-null int64 2 MS SubClass 2930 non-null int64 3 Lot Frontage 2440 non-null float64 4 Lot Area 2930 non-null int64 5 Lot Shape 2914 non-null float64 6 Utilities 2930 non-null int64 7 Land Slope 2930 non-null int64 8 Overall Qual 2930 non-null int64 9 Overall Cond 2930 non-null int64 10 Year Built 2930 non-null int64 11 Year Remod/Add 2930 non-null int64 12 Mas Vnr Area 2907 non-null float64 13 Exter Qual 2930 non-null int64 14 Exter Cond 2930 non-null int64 15 Bsmt Qual 2850 non-null float64 16 Bsmt Cond 2850 non-null float64 17 Bsmt Exposure 2847 non-null float64 18 BsmtFin Type 1 2850 non-null float64 19 BsmtFin SF 1 2929 non-null float64 20 BsmtFin SF 2 2929 non-null float64 21 Bsmt Unf SF 2929 non-null float64 22 Total Bsmt SF 2929 non-null float64 23 Heating QC 2930 non-null int64 24 Central Air 2930 non-null int64 25 Electrical 2929 non-null float64 26 1st Flr SF 2930 non-null int64 27 2nd Flr SF 2930 non-null int64 28 Low Qual Fin SF 2930 non-null int64 29 Gr Liv Area 2930 non-null int64 30 Bsmt Full Bath 2928 non-null float64 31 Bsmt Half Bath 2928 non-null float64 32 Full Bath 2930 non-null int64 33 Half Bath 2930 non-null int64 34 Bedroom AbvGr 2930 non-null int64 35 Kitchen AbvGr 2930 non-null int64 36 Kitchen Qual 2930 non-null int64 37 TotRms AbvGrd 2930 non-null int64 38 Functional 2930 non-null int64 39 Fireplaces 2930 non-null int64 40 Fireplace Qu 1508 non-null float64 41 Garage Yr Blt 2771 non-null float64 42 Garage Finish 1540 non-null float64 43 Garage Cars 2929 non-null float64 44 Garage Area 2929 non-null float64 45 Garage Qual 2771 non-null float64 46 Garage Cond 2771 non-null float64 47 Paved Drive 2930 non-null int64 48 Wood Deck SF 2930 non-null int64 49 Open Porch SF 2930 non-null int64 50 Enclosed Porch 2930 non-null int64 51 3Ssn Porch 2930 non-null int64 52 Screen Porch 2930 non-null int64 53 Pool Area 2930 non-null int64 54 Pool QC 13 non-null float64 55 Fence 572 non-null float64 56 Misc Val 2930 non-null int64 57 Mo Sold 2930 non-null int64 58 Yr Sold 2930 non-null int64 59 SalePrice 2930 non-null int64 dtypes: float64(23), int64(37) memory usage: 1.3 MB
data.select_dtypes(include=['object']).info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MS Zoning 2930 non-null object 1 Street 2930 non-null object 2 Alley 198 non-null object 3 Land Contour 2930 non-null object 4 Lot Config 2930 non-null object 5 Neighborhood 2930 non-null object 6 Condition 1 2930 non-null object 7 Condition 2 2930 non-null object 8 Bldg Type 2930 non-null object 9 House Style 2930 non-null object 10 Roof Style 2930 non-null object 11 Roof Matl 2930 non-null object 12 Exterior 1st 2930 non-null object 13 Exterior 2nd 2930 non-null object 14 Mas Vnr Type 2907 non-null object 15 Foundation 2930 non-null object 16 BsmtFin Type 2 2849 non-null object 17 Heating 2930 non-null object 18 Garage Type 2773 non-null object 19 Misc Feature 106 non-null object 20 Sale Type 2930 non-null object 21 Sale Condition 2930 non-null object dtypes: object(22) memory usage: 503.7+ KB
Lets use this approach for now. We can always change thresholds and approaches to improve our score.
And for the Cleaning part, we can do following ops:
SalePrice
Since we are working out to make a standard function, our working will involve a lot of experimentation. So its a better idea to retain the original dataset and do experimentation on a copy of it.
df = data.copy()
df.select_dtypes('object').head()
MS Zoning | Street | Alley | Land Contour | Lot Config | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Foundation | BsmtFin Type 2 | Heating | Garage Type | Misc Feature | Sale Type | Sale Condition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | RL | Pave | NaN | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | BrkFace | Plywood | Stone | CBlock | Unf | GasA | Attchd | NaN | WD | Normal |
1 | RH | Pave | NaN | Lvl | Inside | NAmes | Feedr | Norm | 1Fam | 1Story | Gable | CompShg | VinylSd | VinylSd | None | CBlock | LwQ | GasA | Attchd | NaN | WD | Normal |
2 | RL | Pave | NaN | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | CBlock | Unf | GasA | Attchd | Gar2 | WD | Normal |
3 | RL | Pave | NaN | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | BrkFace | BrkFace | None | CBlock | Unf | GasA | Attchd | NaN | WD | Normal |
4 | RL | Pave | NaN | Lvl | Inside | Gilbert | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | None | PConc | Unf | GasA | Attchd | NaN | WD | Normal |
Lets check columns of Integer and Float Dtypes.
df.select_dtypes(['integer', 'float']).head()
Order | PID | MS SubClass | Lot Frontage | Lot Area | Lot Shape | Utilities | Land Slope | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | Exter Qual | Exter Cond | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Val | Mo Sold | Yr Sold | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 526301100 | 20 | 141.0 | 31770 | 1.0 | 4 | 3 | 6 | 5 | 1960 | 1960 | 112.0 | 3 | 3 | 3.0 | 4.0 | 4.0 | 4.0 | 639.0 | 0.0 | 441.0 | 1080.0 | 2 | 1 | 5.0 | 1656 | 0 | 0 | 1656 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | 3 | 7 | 8 | 2 | 4.0 | 1960.0 | 3.0 | 2.0 | 528.0 | 3.0 | 3.0 | 2 | 210 | 62 | 0 | 0 | 0 | 0 | NaN | NaN | 0 | 5 | 2010 | 215000 |
1 | 2 | 526350040 | 20 | 80.0 | 11622 | 4.0 | 4 | 3 | 5 | 6 | 1961 | 1961 | 0.0 | 3 | 3 | 3.0 | 3.0 | 1.0 | 3.0 | 468.0 | 144.0 | 270.0 | 882.0 | 3 | 1 | 5.0 | 896 | 0 | 0 | 896 | 0.0 | 0.0 | 1 | 0 | 2 | 1 | 3 | 5 | 8 | 0 | NaN | 1961.0 | NaN | 1.0 | 730.0 | 3.0 | 3.0 | 3 | 140 | 0 | 0 | 0 | 120 | 0 | NaN | 3.0 | 0 | 6 | 2010 | 105000 |
2 | 3 | 526351010 | 20 | 81.0 | 14267 | 1.0 | 4 | 3 | 6 | 6 | 1958 | 1958 | 108.0 | 3 | 3 | 3.0 | 3.0 | 1.0 | 5.0 | 923.0 | 0.0 | 406.0 | 1329.0 | 3 | 1 | 5.0 | 1329 | 0 | 0 | 1329 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | 4 | 6 | 8 | 0 | NaN | 1958.0 | NaN | 1.0 | 312.0 | 3.0 | 3.0 | 3 | 393 | 36 | 0 | 0 | 0 | 0 | NaN | NaN | 12500 | 6 | 2010 | 172000 |
3 | 4 | 526353030 | 20 | 93.0 | 11160 | 4.0 | 4 | 3 | 7 | 5 | 1968 | 1968 | 0.0 | 4 | 3 | 3.0 | 3.0 | 1.0 | 5.0 | 1065.0 | 0.0 | 1045.0 | 2110.0 | 5 | 1 | 5.0 | 2110 | 0 | 0 | 2110 | 1.0 | 0.0 | 2 | 1 | 3 | 1 | 5 | 8 | 8 | 2 | 3.0 | 1968.0 | 3.0 | 2.0 | 522.0 | 3.0 | 3.0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | 0 | 4 | 2010 | 244000 |
4 | 5 | 527105010 | 60 | 74.0 | 13830 | 1.0 | 4 | 3 | 5 | 5 | 1997 | 1998 | 0.0 | 3 | 3 | 4.0 | 3.0 | 1.0 | 6.0 | 791.0 | 0.0 | 137.0 | 928.0 | 4 | 1 | 5.0 | 928 | 701 | 0 | 1629 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 3 | 6 | 8 | 1 | 3.0 | 1997.0 | 3.0 | 2.0 | 482.0 | 3.0 | 3.0 | 3 | 212 | 34 | 0 | 0 | 0 | 0 | NaN | 3.0 | 0 | 3 | 2010 | 189900 |
By looking closely at numerical features, three of them should catch our attention. The features are Year Built
, Year Remod/Add
and Yr Sold
.
Year Built
is about the year property was built. Year Remod/Add
tells us about last modification if there is any. Whereas Yr Sold
tells us about the year of sale of the property.
There features might become far more impactful if we combine them to calculate "Age of house", and "Years since alteration"
Lets do it for now by creating a new features
age
that will be formed by subtracting Year Built
from Yr Sold
andyears_remod
that will be formed by subtracting Year Remod/Add
from Yr Sold
df['age'] = df['Yr Sold'] - df['Year Built']
df['years_remod'] = df['Yr Sold'] - df['Year Remod/Add']
We rarely get to have a perfect data! On running value_counts() on both of our new columns, we found out some samples with negative values.
# capture these rows
del_us = df[(df['age'] < 0) | (df['years_remod'] < 0)]
del_us
Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice | age | years_remod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1702 | 1703 | 528120010 | 60 | RL | 134.0 | 16659 | Pave | NaN | 1.0 | Lvl | 4 | Corner | 3 | NridgHt | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2007 | 2008 | Gable | CompShg | VinylSd | VinylSd | None | 0.0 | 4 | 3 | PConc | 4.0 | 3.0 | 1.0 | 1.0 | 0.0 | Unf | 0.0 | 1582.0 | 1582.0 | GasA | 5 | 1 | 5.0 | 1582 | 570 | 0 | 2152 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 4 | 7 | 8 | 1 | 4.0 | Detchd | 2007.0 | NaN | 2.0 | 728.0 | 3.0 | 3.0 | 3 | 0 | 368 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2007 | New | Partial | 260116 | 0 | -1 |
2180 | 2181 | 908154195 | 20 | RL | 128.0 | 39290 | Pave | NaN | 1.0 | Bnk | 4 | Inside | 3 | Edwards | Norm | Norm | 1Fam | 1Story | 10 | 5 | 2008 | 2009 | Hip | CompShg | CemntBd | CmentBd | Stone | 1224.0 | 5 | 3 | PConc | 5.0 | 3.0 | 4.0 | 6.0 | 4010.0 | Unf | 0.0 | 1085.0 | 5095.0 | GasA | 5 | 1 | 5.0 | 5095 | 0 | 0 | 5095 | 1.0 | 1.0 | 2 | 1 | 2 | 1 | 5 | 15 | 8 | 2 | 4.0 | Attchd | 2008.0 | 3.0 | 3.0 | 1154.0 | 3.0 | 3.0 | 3 | 546 | 484 | 0 | 0 | 0 | 0 | NaN | NaN | Elev | 17000 | 10 | 2007 | New | Partial | 183850 | -1 | -2 |
2181 | 2182 | 908154205 | 60 | RL | 130.0 | 40094 | Pave | NaN | 1.0 | Bnk | 4 | Inside | 3 | Edwards | PosN | PosN | 1Fam | 2Story | 10 | 5 | 2007 | 2008 | Hip | CompShg | CemntBd | CmentBd | Stone | 762.0 | 5 | 3 | PConc | 5.0 | 3.0 | 4.0 | 6.0 | 2260.0 | Unf | 0.0 | 878.0 | 3138.0 | GasA | 5 | 1 | 5.0 | 3138 | 1538 | 0 | 4676 | 1.0 | 0.0 | 3 | 1 | 3 | 1 | 5 | 11 | 8 | 1 | 4.0 | BuiltIn | 2007.0 | 3.0 | 3.0 | 884.0 | 3.0 | 3.0 | 3 | 208 | 406 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 10 | 2007 | New | Partial | 184750 | 0 | -1 |
# Lets remove these rows.
df.drop(del_us.index, axis=0, inplace=True)
Lets check the columns with Null Values less than 5% and delete them after consideration
# Series of Columns with percentages of Null values in them
null_series = pd.Series(df.isnull().sum()/len(df) * 100)
# Filtering out the columns with more than 5% Null values
null_series[null_series > 5.0].sort_values()
Garage Type 5.363854 Garage Yr Blt 5.432183 Garage Qual 5.432183 Garage Cond 5.432183 Lot Frontage 16.740690 Garage Finish 47.454732 Fireplace Qu 48.582166 Fence 80.457807 Alley 93.235395 Misc Feature 96.412709 Pool QC 99.555859 dtype: float64
len(df.columns)
84
We have an interesting situation here. Columns like Pool QC
, Misc Feature
, Alley
, Fence
, Fireplace QU
, Garage Finish
will be deleted straight away because they contain a lot of Null values.
However, the columns that contain Null Values just above 5% are all related to Garage. By reading Data Documentation, we can see that most of them are text columns of ordinal and nominal variables. And those columns already have NA option provided. So we can not take Null Value as NA either. Also, during the mapping we performed, NA were encoded as Zeros 0 Lets Delete all for them for now
# Cutoff at 5%. (1-0.05 = 0.95)
df = df.dropna(thresh=df.shape[0]*0.95, axis=1)
df.isnull().sum()
Order 0 PID 0 MS SubClass 0 MS Zoning 0 Lot Area 0 Street 0 Lot Shape 16 Land Contour 0 Utilities 0 Lot Config 0 Land Slope 0 Neighborhood 0 Condition 1 0 Condition 2 0 Bldg Type 0 House Style 0 Overall Qual 0 Overall Cond 0 Year Built 0 Year Remod/Add 0 Roof Style 0 Roof Matl 0 Exterior 1st 0 Exterior 2nd 0 Mas Vnr Type 23 Mas Vnr Area 23 Exter Qual 0 Exter Cond 0 Foundation 0 Bsmt Qual 80 Bsmt Cond 80 Bsmt Exposure 83 BsmtFin Type 1 80 BsmtFin SF 1 1 BsmtFin Type 2 81 BsmtFin SF 2 1 Bsmt Unf SF 1 Total Bsmt SF 1 Heating 0 Heating QC 0 Central Air 0 Electrical 1 1st Flr SF 0 2nd Flr SF 0 Low Qual Fin SF 0 Gr Liv Area 0 Bsmt Full Bath 2 Bsmt Half Bath 2 Full Bath 0 Half Bath 0 Bedroom AbvGr 0 Kitchen AbvGr 0 Kitchen Qual 0 TotRms AbvGrd 0 Functional 0 Fireplaces 0 Garage Cars 1 Garage Area 1 Paved Drive 0 Wood Deck SF 0 Open Porch SF 0 Enclosed Porch 0 3Ssn Porch 0 Screen Porch 0 Pool Area 0 Misc Val 0 Mo Sold 0 Yr Sold 0 Sale Type 0 Sale Condition 0 SalePrice 0 age 0 years_remod 0 dtype: int64
Lets segregate our DataFrame into Object and Numeric Dtypes
text_df = df.select_dtypes(include='object')
text_df.head(3)
MS Zoning | Street | Land Contour | Lot Config | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Foundation | BsmtFin Type 2 | Heating | Sale Type | Sale Condition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | RL | Pave | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | BrkFace | Plywood | Stone | CBlock | Unf | GasA | WD | Normal |
1 | RH | Pave | Lvl | Inside | NAmes | Feedr | Norm | 1Fam | 1Story | Gable | CompShg | VinylSd | VinylSd | None | CBlock | LwQ | GasA | WD | Normal |
2 | RL | Pave | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | CBlock | Unf | GasA | WD | Normal |
numeric_df = df.select_dtypes(include=['integer','float'])
numeric_df.head(3)
Order | PID | MS SubClass | Lot Area | Lot Shape | Utilities | Land Slope | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Mas Vnr Area | Exter Qual | Exter Cond | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Garage Cars | Garage Area | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | Mo Sold | Yr Sold | SalePrice | age | years_remod | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 526301100 | 20 | 31770 | 1.0 | 4 | 3 | 6 | 5 | 1960 | 1960 | 112.0 | 3 | 3 | 3.0 | 4.0 | 4.0 | 4.0 | 639.0 | 0.0 | 441.0 | 1080.0 | 2 | 1 | 5.0 | 1656 | 0 | 0 | 1656 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | 3 | 7 | 8 | 2 | 2.0 | 528.0 | 2 | 210 | 62 | 0 | 0 | 0 | 0 | 0 | 5 | 2010 | 215000 | 50 | 50 |
1 | 2 | 526350040 | 20 | 11622 | 4.0 | 4 | 3 | 5 | 6 | 1961 | 1961 | 0.0 | 3 | 3 | 3.0 | 3.0 | 1.0 | 3.0 | 468.0 | 144.0 | 270.0 | 882.0 | 3 | 1 | 5.0 | 896 | 0 | 0 | 896 | 0.0 | 0.0 | 1 | 0 | 2 | 1 | 3 | 5 | 8 | 0 | 1.0 | 730.0 | 3 | 140 | 0 | 0 | 0 | 120 | 0 | 0 | 6 | 2010 | 105000 | 49 | 49 |
2 | 3 | 526351010 | 20 | 14267 | 1.0 | 4 | 3 | 6 | 6 | 1958 | 1958 | 108.0 | 3 | 3 | 3.0 | 3.0 | 1.0 | 5.0 | 923.0 | 0.0 | 406.0 | 1329.0 | 3 | 1 | 5.0 | 1329 | 0 | 0 | 1329 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | 4 | 6 | 8 | 0 | 1.0 | 312.0 | 3 | 393 | 36 | 0 | 0 | 0 | 0 | 12500 | 6 | 2010 | 172000 | 52 | 52 |
For text columns, lets remove every column that contains any Null value for now.
# Select Object dtype, and drop column (axis=1) with any missing values values (how='all')
text_df = text_df.dropna(how='any', axis=1)
text_df.shape
(2927, 17)
For the remaining numerical columns, Lets use
# Take out the columns using isna() and any() and Segregate their names in a list.
impute_these = numeric_df.columns[numeric_df.isna().any()].tolist()
impute_these
['Lot Shape', 'Mas Vnr Area', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Electrical', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Garage Cars', 'Garage Area']
# We will use median to replace the missing missing values of following columns. They were mapped to numerical values
cols_median = ['Lot Shape', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'Electrical', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Garage Cars']
# Use Mean for imputations
cols_mean = []
for col in impute_these:
if col not in cols_median:
cols_mean.append(col)
cols_mean
['Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Garage Area']
for col in cols_median:
numeric_df[col].fillna(numeric_df[col].median(), inplace=True)
for col in cols_mean:
numeric_df[col].fillna(numeric_df[col].mean(), inplace=True)
# Will resolve the SettingWithCopyWarning issue while building function
C:\Users\Dell\anaconda3\lib\site-packages\pandas\core\generic.py:6245: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._update_inplace(new_data)
Lets delete Order
and PID
columns
numeric_df = numeric_df.drop(['Order', 'PID'], axis=1)
Following Columns can potentially Leak Data to our target variable SalePrice
:
Mo Sold
, Sale Condition
, Sale Type
, Yr Sold
. For more information, check out Data Documentation
Lets remove them as well. We will remove them from our Sub-DataFrames. We will concatenate out Sub-DataFrames later
numeric_df = numeric_df.drop(['Mo Sold', 'Yr Sold'], axis=1)
text_df = text_df.drop(['Sale Type', 'Sale Condition'], axis=1)
???Lets keep the deletion of rows with extremities after concatenating both DataFrames, that we will do in the function.
data = pd.read_table('AmesHousing.tsv', delimiter="\t")
def transform_features(data):
'''
Transforms the DataFrame. It replicates
the work done in preceding code.
Args:
data: The Dataframe
Returns:
transformed DataFrame
'''
# Create a Copy
df = data.copy()
# Provide Substitute numbers for Ordinal Values
LotShape_mapping = {'Reg':4, 'IR1':3, 'IR2':2, 'IR1':1}
df['Lot Shape'] = df['Lot Shape'].map(LotShape_mapping, na_action='ignore')
Utilities_mapping = {'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1}
df['Utilities'] = df['Utilities'].map(Utilities_mapping, na_action='ignore')
LandSlope_mapping = {'Gtl':3, 'Mod':2, 'Sev':1}
df['Land Slope'] = df['Land Slope'].map(LandSlope_mapping, na_action='ignore')
ExterQual_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}
df['Exter Qual'] = df['Exter Qual'].map(ExterQual_mapping, na_action='ignore')
ExterCond_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}
df['Exter Cond'] = df['Exter Cond'].map(ExterCond_mapping, na_action='ignore')
BsmtQual_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}
df['Bsmt Qual'] = df['Bsmt Qual'].map(BsmtQual_mapping, na_action='ignore')
BsmtCond_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}
df['Bsmt Cond'] = df['Bsmt Cond'].map(BsmtCond_mapping, na_action='ignore')
BsmtExposure_mapping = {'Gd':4, 'Av':3, 'Mn':2, 'No':1, 'NA':0}
df['Bsmt Exposure'] = df['Bsmt Exposure'].map(BsmtExposure_mapping, na_action='ignore')
BsmtFin_mapping = {'GLQ':6, 'ALQ':5, 'BLQ':4, 'Rec':3, 'LwQ':2, 'Unf':1, 'NA':0}
df['BsmtFin Type 1'] = df['BsmtFin Type 1'].map(BsmtFin_mapping, na_action='ignore')
Functional_mapping = {'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1 }
df['Functional'] = df['Functional'].map(Functional_mapping, na_action='ignore')
HeatingQC_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}
df['Heating QC'] = df['Heating QC'].map(HeatingQC_mapping, na_action='ignore')
CentralAir_mapping = {'Y': 1, 'N':0}
df['Central Air'] = df['Central Air'].map(CentralAir_mapping, na_action='ignore')
Electrical_mapping = {'SBrkr':5, 'FuseA':4, 'Mix':3, 'FuseF':2, 'FuseP':1 } # Assumed category 'Mix' to be at medium
df['Electrical'] = df['Electrical'].map(Electrical_mapping, na_action='ignore')
KitchenQual_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1}
df['Kitchen Qual'] = df['Kitchen Qual'].map(KitchenQual_mapping, na_action='ignore')
FireplaceQu_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}
df['Fireplace Qu'] = df['Fireplace Qu'].map(FireplaceQu_mapping, na_action='ignore')
GarageFinish_mapping = {'Fin':3, 'RFn':2, 'UnF':1, 'NA':0}
df['Garage Finish'] = df['Garage Finish'].map(GarageFinish_mapping, na_action='ignore')
GarageQual_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}
df['Garage Qual'] = df['Garage Qual'].map(GarageQual_mapping, na_action='ignore')
GarageCond_mapping = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'NA':0}
df['Garage Cond'] = df['Garage Cond'].map(GarageCond_mapping, na_action='ignore')
PavedDrive_mapping = {'Y':3, 'P':2, 'N':1}
df['Paved Drive'] = df['Paved Drive'].map(PavedDrive_mapping, na_action='ignore')
PoolQC_mapping = {'Ex':4, 'Gd':3, 'TA':2, 'Fa':1, 'NA':0}
df['Pool QC'] = df['Pool QC'].map(PoolQC_mapping, na_action='ignore')
Fence_mapping = {'GdPrv': 4, 'MnPrv':3, 'GdWo':2, 'MnWw':1, 'NA':0}
df['Fence'] = df['Fence'].map(Fence_mapping, na_action='ignore')
# Create new features and Clean
df['age'] = df['Yr Sold'] - df['Year Built']
df['years_remod'] = df['Yr Sold'] - df['Year Remod/Add']
del_us = df[(df['age'] < 0) | (df['years_remod'] < 0)]
df.drop(del_us.index, axis=0, inplace=True)
df = df.dropna(thresh=df.shape[0]*0.95, axis=1)
# Break the DataFrame WRT Dtype and Clean parts
text_df = df.select_dtypes(include='object')
text_df = text_df.dropna(how='any', axis=1)
numeric_df = df.select_dtypes(include=['integer','float'])
numeric_df = numeric_df.drop(['Order', 'PID', 'Mo Sold', 'Yr Sold'], axis=1)
cols_median = ['Lot Shape', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1',
'Electrical', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Garage Cars']
cols_mean = []
for col in impute_these:
if col not in cols_median:
cols_mean.append(col)
for col in cols_median:
numeric_df[col].fillna(numeric_df[col].median(), inplace=True)
for col in cols_mean:
numeric_df[col].fillna(numeric_df[col].mean(), inplace=True)
text_df = text_df.drop(['Sale Type', 'Sale Condition'], axis=1)
# Concatenate the DataFrame pieces
df = pd.concat([numeric_df, text_df], axis=1)
# Do more Cleaning
extreme_rows = df[df['Gr Liv Area'] > 4000]
df = df.drop(extreme_rows.index, axis = 0)
# Return
return df
The functions select_features()
and train_test()
are still as they are
# We updated transform_features()
round2_transformed_data = transform_features(data)
# We have NOT updated select_features(), So it is still taking only 2 columns
round2_selected_data = select_features(round2_transformed_data)
# We have NOT updated test_and_train() either!
round2_rmse_data = train_and_test(round2_selected_data)
print(round2_rmse_data)
(57088.25161263909, 0.5073530501149299)
print('Our RMSE for this attempt is around {:.2f} and\nWith an impovement in RMSE of around USD {:.2f}\nThats almost {:.1f}% improvement in the RMSE results'.
format(round2_rmse_data[0], round1_rmse_data[0]-round2_rmse_data[0], (round1_rmse_data[0]-round2_rmse_data[0])/round1_rmse_data[0]*100))
print('\nWhereas our R2 score in the previous attempt was\n{:.2f} and now, it has become {:.2f}'.format(round1_rmse_data[1], round2_rmse_data[1]))
Our RMSE for this attempt is around 57088.25 and With an impovement in RMSE of around USD 0.00 Thats almost 0.0% improvement in the RMSE results Whereas our R2 score in the previous attempt was 0.51 and now, it has become 0.51
Its time to move on to Feature Selection for numerical columns
SalePrice
?In the end, we will update the logic to our functions
SalePrice
, we will abandon one of the column that share > 0.7 Correlation to eachother.# Lets Plot a Correlation Matrix
numeric_df = round2_transformed_data.select_dtypes(['integer', 'float'])
plt.figure(figsize=(24,18))
sns.heatmap(numeric_df.corr(), annot=True, fmt='.1g', vmin=-1, vmax=1,
center= 0, cmap= 'coolwarm',cbar_kws= {'orientation': 'horizontal'})
<matplotlib.axes._subplots.AxesSubplot at 0x1e765734df0>
df_corr = numeric_df.corr().abs()
check_c = df_corr[(df_corr >= 0.7) & (df_corr < 1.00)] # Adding condition df_corr < 1.00 to avoid same columns overlapping
cols_c = check_c.loc[check_c[col] != np.nan].any(axis=0)
print('Following columns are collinear with other columns\n{}.\nTo find pairs, Kindly consult the heatmap'.format(cols_c[cols_c==True].index))
Following columns are collinear with other columns Index(['Overall Qual', 'Year Built', 'Year Remod/Add', 'Exter Qual', 'Bsmt Qual', 'Total Bsmt SF', '1st Flr SF', 'Gr Liv Area', 'Kitchen Qual', 'TotRms AbvGrd', 'Garage Cars', 'Garage Area', 'SalePrice', 'age', 'years_remod'], dtype='object'). To find pairs, Kindly consult the heatmap
# check_c shows only the ABSOLUTE Correlations greater than 0.7 and less than 1.00
plt.figure(figsize=(20,8))
sns.heatmap(check_c.isnull(), cbar=False, cmap='crest')
sns.despine()
plt.title('COLLINEARITY CHECK!\nCorrelations that are greater either greater than 0.7\nOr Less than -0.7', fontsize=20)
Text(0.5, 1.0, 'COLLINEARITY CHECK!\nCorrelations that are greater either greater than 0.7\nOr Less than -0.7')
Lets also see how do these columns correlate with the target, SalePrice
numeric_df[cols_c[cols_c == True].index].corr()['SalePrice']
Overall Qual 0.805160 Year Built 0.564935 Year Remod/Add 0.540110 Exter Qual 0.704824 Bsmt Qual 0.649699 Total Bsmt SF 0.658316 1st Flr SF 0.642763 Gr Liv Area 0.719345 Kitchen Qual 0.676754 TotRms AbvGrd 0.498450 Garage Cars 0.652498 Garage Area 0.648163 SalePrice 1.000000 age -0.565314 years_remod -0.541941 Name: SalePrice, dtype: float64
Following are the most Correlated Features with SalePrice
Overall quality
Gr Liv Area
Exter Qual
Kitchen Qual
Total Bsmt SF
Garage Cars
Garage Area
1st Flr SF
Years Before Sale
Years Since Remod
Within these features, there is a high correlation present between them that could be collinearity as well
Total Bsmt SF
and 1st Flr SF
BsmtFin Type 1
and 1st Flr SF
Total Bsmt SF
and drop 1st Flr SF
, and BsmtFin Type 1
Overall Qual
, Exter Qual
, Bsmt Qual
and Kitchen Qual
Overall Qual
and drop Exter Qual
, Bsmt Qual
and Kitchen Qual
Garage Cars
and Garage Area
Garage Area
and drop Garage Cars
BsmtFin Type 1
, Bsmt Full Bath
and BsmtFin SF 1
Bmst Full Bath
and drop BsmtFin Type 1
and BsmtFin SF 1
Gr Liv Area
, TotRmsAbvGrd
and 2nd Flr SF
Gr Liv Area
and drop TotRmsAbvGrd
and 2nd Flr SF
Since the columns age
and years_remod
were engineered from columns Year Built
and Year Remod/Add
, a very high correlation. though negative, can be seen among them.
age
and years_remod
and drop Year Built
and Year Remod/Add
drop_c = ['1st Flr SF', 'BsmtFin Type 1', 'Exter Qual', 'Bsmt Qual', 'Kitchen Qual',
'Garage Cars', 'BsmtFin SF 1', 'TotRms AbvGrd', '2nd Flr SF', 'Year Built',
'Year Remod/Add']
numeric_df = numeric_df.drop(drop_c, axis=1)
# Taking Correlation Coeffients
abs_corr_coeffs = numeric_df.corr()['SalePrice'].abs().sort_values(ascending=False)
# Displaying head only
abs_corr_coeffs.head(10)
SalePrice 1.000000 Overall Qual 0.805160 Gr Liv Area 0.719345 Total Bsmt SF 0.658316 Garage Area 0.648163 age 0.565314 Full Bath 0.544366 years_remod 0.541941 Mas Vnr Area 0.510928 Fireplaces 0.474777 Name: SalePrice, dtype: float64
Lets keep the columns with correlation coeffients above 0.4
We will change it to get better results later.
abs_corr_coeffs[abs_corr_coeffs > 0.4]
SalePrice 1.000000 Overall Qual 0.805160 Gr Liv Area 0.719345 Total Bsmt SF 0.658316 Garage Area 0.648163 age 0.565314 Full Bath 0.544366 years_remod 0.541941 Mas Vnr Area 0.510928 Fireplaces 0.474777 Heating QC 0.454140 Bsmt Exposure 0.400067 Name: SalePrice, dtype: float64
# Drop any columns that are less correlant that 0.4.
round2_transformed_data = round2_transformed_data.drop(abs_corr_coeffs[abs_corr_coeffs < 0.4].index, axis=1)
All nominal variables are the candidates of being converted to Categorical Dtype. Let's check the nominal variables for
We also have to look into columns that are currently in numerical encoding but could do much better with Categorical encoding
# according to the Data Documentary, following are the nominal variables
nominal_features_all = [ 'PID', 'MS SubClass', 'MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood',
'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st',
'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air', 'Garage Type',
'Misc Feature', 'Sale Type', 'Sale Condition']
# We need to retrieve the columns that exist in our functioning DataFrame, i-e, round2_transformed_data
nominal_features = []
for name in nominal_features_all:
if name in round2_transformed_data.columns:
nominal_features.append(name)
nominal_features
['MS Zoning', 'Street', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Foundation', 'Heating']
text_df[nominal_features].head()
MS Zoning | Street | Land Contour | Lot Config | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Foundation | Heating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | RL | Pave | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | BrkFace | Plywood | CBlock | GasA |
1 | RH | Pave | Lvl | Inside | NAmes | Feedr | Norm | 1Fam | 1Story | Gable | CompShg | VinylSd | VinylSd | CBlock | GasA |
2 | RL | Pave | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | Wd Sdng | Wd Sdng | CBlock | GasA |
3 | RL | Pave | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | BrkFace | BrkFace | CBlock | GasA |
4 | RL | Pave | Lvl | Inside | Gilbert | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | PConc | GasA |
# Lets check the frequency tables with value_counts
for col in nominal_features:
print(col,'\n',round2_transformed_data[col].value_counts(), '\n', 'NUNIQUE:',
round2_transformed_data[col].nunique(), '\n', '-'*20)
MS Zoning RL 2267 RM 462 FV 139 RH 27 C (all) 25 A (agr) 2 I (all) 2 Name: MS Zoning, dtype: int64 NUNIQUE: 7 -------------------- Street Pave 2912 Grvl 12 Name: Street, dtype: int64 NUNIQUE: 2 -------------------- Land Contour Lvl 2630 HLS 120 Bnk 114 Low 60 Name: Land Contour, dtype: int64 NUNIQUE: 4 -------------------- Lot Config Inside 2138 Corner 507 CulDSac 180 FR2 85 FR3 14 Name: Lot Config, dtype: int64 NUNIQUE: 5 -------------------- Neighborhood NAmes 443 CollgCr 267 OldTown 239 Edwards 191 Somerst 182 NridgHt 165 Gilbert 165 Sawyer 151 NWAmes 131 SawyerW 125 Mitchel 114 BrkSide 108 Crawfor 103 IDOTRR 93 Timber 72 NoRidge 69 StoneBr 51 SWISU 48 ClearCr 44 MeadowV 37 BrDale 30 Blmngtn 28 Veenker 24 NPkVill 23 Blueste 10 Greens 8 GrnHill 2 Landmrk 1 Name: Neighborhood, dtype: int64 NUNIQUE: 28 -------------------- Condition 1 Norm 2518 Feedr 163 Artery 92 RRAn 50 PosN 38 RRAe 28 PosA 20 RRNn 9 RRNe 6 Name: Condition 1, dtype: int64 NUNIQUE: 9 -------------------- Condition 2 Norm 2895 Feedr 13 Artery 5 PosA 4 PosN 3 RRNn 2 RRAe 1 RRAn 1 Name: Condition 2, dtype: int64 NUNIQUE: 8 -------------------- Bldg Type 1Fam 2419 TwnhsE 233 Duplex 109 Twnhs 101 2fmCon 62 Name: Bldg Type, dtype: int64 NUNIQUE: 5 -------------------- House Style 1Story 1480 2Story 868 1.5Fin 314 SLvl 128 SFoyer 83 2.5Unf 24 1.5Unf 19 2.5Fin 8 Name: House Style, dtype: int64 NUNIQUE: 8 -------------------- Roof Style Gable 2319 Hip 547 Gambrel 22 Flat 20 Mansard 11 Shed 5 Name: Roof Style, dtype: int64 NUNIQUE: 6 -------------------- Roof Matl CompShg 2883 Tar&Grv 23 WdShake 9 WdShngl 6 Membran 1 Metal 1 Roll 1 Name: Roof Matl, dtype: int64 NUNIQUE: 7 -------------------- Exterior 1st VinylSd 1025 MetalSd 450 HdBoard 441 Wd Sdng 419 Plywood 221 CemntBd 124 BrkFace 88 WdShing 56 AsbShng 44 Stucco 42 BrkComm 6 Stone 2 CBlock 2 AsphShn 2 PreCast 1 ImStucc 1 Name: Exterior 1st, dtype: int64 NUNIQUE: 16 -------------------- Exterior 2nd VinylSd 1014 MetalSd 447 HdBoard 405 Wd Sdng 397 Plywood 274 CmentBd 124 Wd Shng 81 BrkFace 47 Stucco 46 AsbShng 38 Brk Cmn 22 ImStucc 14 Stone 6 AsphShn 4 CBlock 3 PreCast 1 Other 1 Name: Exterior 2nd, dtype: int64 NUNIQUE: 17 -------------------- Foundation PConc 1304 CBlock 1244 BrkTil 311 Slab 49 Stone 11 Wood 5 Name: Foundation, dtype: int64 NUNIQUE: 6 -------------------- Heating GasA 2879 GasW 27 Grav 9 Wall 6 OthW 2 Floor 1 Name: Heating, dtype: int64 NUNIQUE: 6 --------------------
text_df.head()
MS Zoning | Street | Land Contour | Lot Config | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Foundation | Heating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | RL | Pave | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | BrkFace | Plywood | CBlock | GasA |
1 | RH | Pave | Lvl | Inside | NAmes | Feedr | Norm | 1Fam | 1Story | Gable | CompShg | VinylSd | VinylSd | CBlock | GasA |
2 | RL | Pave | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | Wd Sdng | Wd Sdng | CBlock | GasA |
3 | RL | Pave | Lvl | Corner | NAmes | Norm | Norm | 1Fam | 1Story | Hip | CompShg | BrkFace | BrkFace | CBlock | GasA |
4 | RL | Pave | Lvl | Inside | Gilbert | Norm | Norm | 1Fam | 2Story | Gable | CompShg | VinylSd | VinylSd | PConc | GasA |
text_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2927 entries, 0 to 2929 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MS Zoning 2927 non-null object 1 Street 2927 non-null object 2 Land Contour 2927 non-null object 3 Lot Config 2927 non-null object 4 Neighborhood 2927 non-null object 5 Condition 1 2927 non-null object 6 Condition 2 2927 non-null object 7 Bldg Type 2927 non-null object 8 House Style 2927 non-null object 9 Roof Style 2927 non-null object 10 Roof Matl 2927 non-null object 11 Exterior 1st 2927 non-null object 12 Exterior 2nd 2927 non-null object 13 Foundation 2927 non-null object 14 Heating 2927 non-null object dtypes: object(15) memory usage: 365.9+ KB
Heating
, Roof Matl
, Condition 2
, Land Contour
, Street
, Central Air
have one value each that has an overwhelming presence.Exterior 1st
, Exterior 2nd
and Neighborhood
contain relatively larger number of unique values.round2_transformed_data.shape
(2924, 38)
# This is the frequency table showing how much the MOST FREQUENT value is occupying of the total column.
top_cat_freq = {}
for col in nominal_features:
top_cat_freq[col] = round2_transformed_data[col].value_counts(normalize=True, ascending=False)[0]
#print(col, round2_transformed_data[col].value_counts(normalize = True, ascending = False)[0])
top_cat_freq = pd.Series(top_cat_freq)
top_cat_freq.sort_values(ascending=False)
Street 0.995896 Condition 2 0.990082 Roof Matl 0.985978 Heating 0.984610 Land Contour 0.899453 Condition 1 0.861149 Bldg Type 0.827291 Roof Style 0.793092 MS Zoning 0.775308 Lot Config 0.731190 House Style 0.506156 Foundation 0.445964 Exterior 1st 0.350547 Exterior 2nd 0.346785 Neighborhood 0.151505 dtype: float64
# Adjust an arbitrary cutoff at 0.85
myraid_cutoff = 0.85
# Because of some unusual errors, I introduced the dictionary rather than pulling out the columns directly
myraid_dict = {}
for col in nominal_features:
myraid = round2_transformed_data[col].value_counts(normalize = True, ascending = False)[0]
myraid_dict[col] = myraid
for key, val in myraid_dict.items():
if val > myraid_cutoff:
nominal_features.remove(key)
round2_transformed_data.drop(key, axis=1, inplace=True)
round2_transformed_data.shape
(2924, 32)
# Adjust an arbitrary cutoff at 10, to avoid too many Categorical Columns
nunique_thresh = 10
nunique_dict = {}
for col in nominal_features:
nunique_dict[col] = round2_transformed_data[col].nunique()
for key, val in nunique_dict.items():
if val > nunique_thresh:
round2_transformed_data.drop(key, axis=1, inplace=True)
nominal_features.remove(key)
print(nunique_dict)
{'MS Zoning': 7, 'Lot Config': 5, 'Neighborhood': 28, 'Bldg Type': 5, 'House Style': 8, 'Roof Style': 6, 'Exterior 1st': 16, 'Exterior 2nd': 17, 'Foundation': 6}
print(round2_transformed_data.shape)
(2924, 29)
Lets look into the numerical variables that can possibly do better being categorical
for col in round2_transformed_data.select_dtypes(['integer', 'float']).columns:
pass
# Backspace Pass and Uncomment to run
# print(col, '\n', round2_transformed_data[col].value_counts(), '\n')
Overall Qual
Year Built
, Year Remod/Add
, Full Bath
, TotRmsAbvGrd
, Fireplaces
, years_remod
, age
1st Flr SF
, Gr Liv Area
, SalePrice
So far, its difficult to see any 'plain' nominal variable in above-given columns. We can't assume Ordinal Variables as to be Nominal ones.
Let's move on for now
round2_transformed_data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2924 entries, 0 to 2929 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Overall Qual 2924 non-null int64 1 Year Built 2924 non-null int64 2 Year Remod/Add 2924 non-null int64 3 Mas Vnr Area 2924 non-null float64 4 Exter Qual 2924 non-null int64 5 Bsmt Qual 2924 non-null float64 6 Bsmt Exposure 2924 non-null float64 7 BsmtFin Type 1 2924 non-null float64 8 BsmtFin SF 1 2924 non-null float64 9 Total Bsmt SF 2924 non-null float64 10 Heating QC 2924 non-null int64 11 1st Flr SF 2924 non-null int64 12 2nd Flr SF 2924 non-null int64 13 Gr Liv Area 2924 non-null int64 14 Full Bath 2924 non-null int64 15 Kitchen Qual 2924 non-null int64 16 TotRms AbvGrd 2924 non-null int64 17 Fireplaces 2924 non-null int64 18 Garage Cars 2924 non-null float64 19 Garage Area 2924 non-null float64 20 SalePrice 2924 non-null int64 21 age 2924 non-null int64 22 years_remod 2924 non-null int64 23 MS Zoning 2924 non-null object 24 Lot Config 2924 non-null object 25 Bldg Type 2924 non-null object 26 House Style 2924 non-null object 27 Roof Style 2924 non-null object 28 Foundation 2924 non-null object dtypes: float64(8), int64(15), object(6) memory usage: 685.3+ KB
# Select just the remaining text columns and convert to categorical
text_cols = round2_transformed_data.select_dtypes(include=['object'])
for col in text_cols:
round2_transformed_data[col] = round2_transformed_data[col].astype('category')
# Create dummy columns and add back to the dataframe!
dummies = pd.get_dummies(round2_transformed_data.select_dtypes(include=['category']))
# Concatenate back
round2_transformed_data = pd.concat([round2_transformed_data, dummies], axis=1)
# Drop the original ones
round2_transformed_data = round2_transformed_data.drop(text_cols, axis=1)
# Check
round2_transformed_data.head(5)
Overall Qual | Year Built | Year Remod/Add | Mas Vnr Area | Exter Qual | Bsmt Qual | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | Total Bsmt SF | Heating QC | 1st Flr SF | 2nd Flr SF | Gr Liv Area | Full Bath | Kitchen Qual | TotRms AbvGrd | Fireplaces | Garage Cars | Garage Area | SalePrice | age | years_remod | MS Zoning_A (agr) | MS Zoning_C (all) | MS Zoning_FV | MS Zoning_I (all) | MS Zoning_RH | MS Zoning_RL | MS Zoning_RM | Lot Config_Corner | Lot Config_CulDSac | Lot Config_FR2 | Lot Config_FR3 | Lot Config_Inside | Bldg Type_1Fam | Bldg Type_2fmCon | Bldg Type_Duplex | Bldg Type_Twnhs | Bldg Type_TwnhsE | House Style_1.5Fin | House Style_1.5Unf | House Style_1Story | House Style_2.5Fin | House Style_2.5Unf | House Style_2Story | House Style_SFoyer | House Style_SLvl | Roof Style_Flat | Roof Style_Gable | Roof Style_Gambrel | Roof Style_Hip | Roof Style_Mansard | Roof Style_Shed | Foundation_BrkTil | Foundation_CBlock | Foundation_PConc | Foundation_Slab | Foundation_Stone | Foundation_Wood | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6 | 1960 | 1960 | 112.0 | 3 | 3.0 | 4.0 | 4.0 | 639.0 | 1080.0 | 2 | 1656 | 0 | 1656 | 1 | 3 | 7 | 2 | 2.0 | 528.0 | 215000 | 50 | 50 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 5 | 1961 | 1961 | 0.0 | 3 | 3.0 | 1.0 | 3.0 | 468.0 | 882.0 | 3 | 896 | 0 | 896 | 1 | 3 | 5 | 0 | 1.0 | 730.0 | 105000 | 49 | 49 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 6 | 1958 | 1958 | 108.0 | 3 | 3.0 | 1.0 | 5.0 | 923.0 | 1329.0 | 3 | 1329 | 0 | 1329 | 1 | 4 | 6 | 0 | 1.0 | 312.0 | 172000 | 52 | 52 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 7 | 1968 | 1968 | 0.0 | 4 | 3.0 | 1.0 | 5.0 | 1065.0 | 2110.0 | 5 | 2110 | 0 | 2110 | 2 | 5 | 8 | 2 | 2.0 | 522.0 | 244000 | 42 | 42 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | 5 | 1997 | 1998 | 0.0 | 3 | 4.0 | 1.0 | 6.0 | 791.0 | 928.0 | 4 | 928 | 701 | 1629 | 2 | 3 | 6 | 1 | 2.0 | 482.0 | 189900 | 13 | 12 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Lets update the function with all the operations we performed in previous cells
def select_features(data, coef_cutoff, cardinality_thresh, myraid_cutoff):
'''
The function replicates the code done in prior cells.
It selects the features we define to be appropriate
Args:
df: DataFrame
coef_cutoff: Selects the minimum Coefficient of Correlation as Cut off
myraid_cutoff: Selects the maximum acceptable presence of the Mode value in its column
nunique_thresh: Selects the maximum number of unique values for a column to be acceptable
Return:
DataFrame with requested changes and Dummy Variables for Nominal Variables
'''
df = data.copy()
# Lists of all the columns with their names segregated into 3 types
numeric_cols = ['Order','Year Built','Year Remod/Add','Bsmt Full Bath','Bsmt Half Bath','Full Bath','Half Bath','Bedroom','Kitchen','TotRmsAbvGrd',
'Fireplaces','Garage Yr Blt','Garage Cars','Mo Sold','Yr Sold','Lot Frontage','Lot Area','Mas Vnr Area','BsmtFin SF 1',
'BsmtFin SF 2','Bsmt Unf SF','Total Bsmt SF','1st Flr SF','2nd Flr SF','Low Qual Fin SF','Gr Liv Area','Garage Area',
'Wood Deck SF','Open Porch SF','Enclosed Porch','3-Ssn Porch','Screen Porch','Pool Area','Misc Val','SalePrice']
ordinal_cols = ['Lot Shape', 'Utilities', 'Land Slope', 'Overall Qual', 'Overall Cond', 'Exter Qual', 'Exter Cond', 'Bsmt Qual', 'Bsmt Cond',
'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFinType 2', 'HeatingQC', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu',
'Garage Finish', 'Garage Qual', 'Garage Cond', 'Paved Drive', 'Pool QC', 'Fence']
nominal_cols = ['PID', 'MS SubClass', 'MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2',
'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating',
'Central Air', 'Garage Type', 'Misc Feature', 'Sale Type', 'Sale Condition']
# Following columns exhibited Colinearity. Drop them
drop_c = ['1st Flr SF', 'BsmtFin Type 1', 'Exter Qual', 'Bsmt Qual', 'Kitchen Qual',
'Garage Cars', 'BsmtFin SF 1', 'TotRms AbvGrd', '2nd Flr SF', 'Year Built',
'Year Remod/Add']
df = df.drop(drop_c, axis=1)
# Since we removed some columns earlier,
numeric_cols = [col for col in numeric_cols if col in df.columns]
numeric_df = df[numeric_cols].copy()
ordinal_cols = [col for col in ordinal_cols if col in df.columns]
ordinal_df = df[ordinal_cols].copy()
nominal_cols = [col for col in nominal_cols if col in df.columns]
nominal_df = df[nominal_cols].copy()
# NUMERICAL DF
# Columns with the minumum Coefficient of Correlation.
abs_corr_coeffs = numeric_df.corr()['SalePrice'].abs().sort_values(ascending=False)
numeric_df = numeric_df.drop(abs_corr_coeffs[abs_corr_coeffs < coef_cutoff].index, axis=1)
# Columns with least Cardinality in order to avoid too many columns
cardinality_dict = {}
for col in nominal_df.columns:
cardinality_dict[col] = nominal_df[col].nunique()
for key, val in cardinality_dict.items():
if val > cardinality_thresh:
nominal_df.drop(key, axis=1, inplace=True)
nominal_cols = nominal_df.copy().columns.tolist()
# CATEGORICAL DF
categorical_df = pd.concat([nominal_df, ordinal_df], axis=1)
# Columns with minimum acceptable presence of mode value in the columns to avoid Low Variance Features
myraid_dict = {}
for col in categorical_df.columns:
sum_val = categorical_df[col].value_counts().sum()
max_val = categorical_df[col].value_counts().max()
myraid = max_val/sum_val
myraid_dict[col] = myraid
for key, val in myraid_dict.items():
if val > myraid_cutoff:
categorical_df.drop(key, axis=1, inplace=True)
nominal_cols_final = [col for col in nominal_cols if col in categorical_df.columns].copy()
nominal_df_final = df[nominal_cols_final].copy()
nominal_df_final_dummies = nominal_df_final.astype('category')
categorical_df.drop(nominal_df_final, axis=1, inplace=True)
df = pd.concat([numeric_df, categorical_df, pd.get_dummies(nominal_df_final_dummies)], axis=1)
return df
# We updated transform_features()
round3_transformed_data = transform_features(data)
# We have updated select_features()
round3_selected_data = select_features(round3_transformed_data, 0.4, 30, 0.85)
round3_selected_data.head()
Full Bath | Fireplaces | Mas Vnr Area | Total Bsmt SF | Gr Liv Area | Garage Area | SalePrice | Lot Shape | Overall Qual | Overall Cond | Bsmt Exposure | MS SubClass_20 | MS SubClass_30 | MS SubClass_40 | MS SubClass_45 | MS SubClass_50 | MS SubClass_60 | MS SubClass_70 | MS SubClass_75 | MS SubClass_80 | MS SubClass_85 | MS SubClass_90 | MS SubClass_120 | MS SubClass_150 | MS SubClass_160 | MS SubClass_180 | MS SubClass_190 | MS Zoning_A (agr) | MS Zoning_C (all) | MS Zoning_FV | MS Zoning_I (all) | MS Zoning_RH | MS Zoning_RL | MS Zoning_RM | Lot Config_Corner | Lot Config_CulDSac | Lot Config_FR2 | Lot Config_FR3 | Lot Config_Inside | Neighborhood_Blmngtn | Neighborhood_Blueste | Neighborhood_BrDale | Neighborhood_BrkSide | Neighborhood_ClearCr | Neighborhood_CollgCr | Neighborhood_Crawfor | Neighborhood_Edwards | Neighborhood_Gilbert | Neighborhood_Greens | Neighborhood_GrnHill | Neighborhood_IDOTRR | Neighborhood_Landmrk | Neighborhood_MeadowV | Neighborhood_Mitchel | Neighborhood_NAmes | Neighborhood_NPkVill | Neighborhood_NWAmes | Neighborhood_NoRidge | Neighborhood_NridgHt | Neighborhood_OldTown | Neighborhood_SWISU | Neighborhood_Sawyer | Neighborhood_SawyerW | Neighborhood_Somerst | Neighborhood_StoneBr | Neighborhood_Timber | Neighborhood_Veenker | Bldg Type_1Fam | Bldg Type_2fmCon | Bldg Type_Duplex | Bldg Type_Twnhs | Bldg Type_TwnhsE | House Style_1.5Fin | House Style_1.5Unf | House Style_1Story | House Style_2.5Fin | House Style_2.5Unf | House Style_2Story | House Style_SFoyer | House Style_SLvl | Roof Style_Flat | Roof Style_Gable | Roof Style_Gambrel | Roof Style_Hip | Roof Style_Mansard | Roof Style_Shed | Exterior 1st_AsbShng | Exterior 1st_AsphShn | Exterior 1st_BrkComm | Exterior 1st_BrkFace | Exterior 1st_CBlock | Exterior 1st_CemntBd | Exterior 1st_HdBoard | Exterior 1st_ImStucc | Exterior 1st_MetalSd | Exterior 1st_Plywood | Exterior 1st_PreCast | Exterior 1st_Stone | Exterior 1st_Stucco | Exterior 1st_VinylSd | Exterior 1st_Wd Sdng | Exterior 1st_WdShing | Exterior 2nd_AsbShng | Exterior 2nd_AsphShn | Exterior 2nd_Brk Cmn | Exterior 2nd_BrkFace | Exterior 2nd_CBlock | Exterior 2nd_CmentBd | Exterior 2nd_HdBoard | Exterior 2nd_ImStucc | Exterior 2nd_MetalSd | Exterior 2nd_Other | Exterior 2nd_Plywood | Exterior 2nd_PreCast | Exterior 2nd_Stone | Exterior 2nd_Stucco | Exterior 2nd_VinylSd | Exterior 2nd_Wd Sdng | Exterior 2nd_Wd Shng | Foundation_BrkTil | Foundation_CBlock | Foundation_PConc | Foundation_Slab | Foundation_Stone | Foundation_Wood | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 112.0 | 1080.0 | 1656 | 528.0 | 215000 | 1.0 | 6 | 5 | 4.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0.0 | 882.0 | 896 | 730.0 | 105000 | 4.0 | 5 | 6 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 108.0 | 1329.0 | 1329 | 312.0 | 172000 | 1.0 | 6 | 6 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 2 | 2 | 0.0 | 2110.0 | 2110 | 522.0 | 244000 | 4.0 | 7 | 5 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | 2 | 1 | 0.0 | 928.0 | 1629 | 482.0 | 189900 | 1.0 | 5 | 5 | 1.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
round3_rmse_data = train_and_test(round3_selected_data)
print(round3_rmse_data)
(27672.832798147665, 0.8764173319410928)
print('Our RMSE for this attempt is around {:.2f} and\nWith an impovement in RMSE of around USD {:.2f}\nThat is almost {:.1f}% improvement than the previous results'.
format(round3_rmse_data[0], round2_rmse_data[0]-round3_rmse_data[0], (round2_rmse_data[0]-round3_rmse_data[0])/round2_rmse_data[0]*100))
Our RMSE for this attempt is around 27672.83 and With an impovement in RMSE of around USD 29415.42 That is almost 51.5% improvement than the previous results
print('\nWhereas our R2 score in the previous attempt was\n{:.2f} and now, it has become {:.2f}'.format(round2_rmse_data[1], round3_rmse_data[1]))
Whereas our R2 score in the previous attempt was 0.51 and now, it has become 0.88
Having given some effort to improve our Feature transformation and selection, let's move on to the training and testing
Now for the final part of the pipeline, training and testing. When iterating on different features, using simple validation is a good idea. Let's add a parameter named k, that controls the type of cross-validation that occurs
def train_and_test(data, k=0):
'''
Trains and Tests the Linear Regression Model for multiple
values of n_splits in case n_splits > 1
Args:
data: The DataFrame
k: n_splits
Returns:
RMSEs and R2 on given number of n_splits
'''
# Take numerical variables and separate features and target
numeric_df = data.select_dtypes(include=['integer', 'float'])
features = numeric_df.columns.drop('SalePrice')
target = 'SalePrice'
# Instantiate LinearRegression() model
lr = LinearRegression()
# HoldOut Validation
if k == 0:
train = data.iloc[:1460]
test = data.iloc[1460:]
lr = LinearRegression()
lr.fit(train[features], train[target])
predictions = lr.predict(test[features])
mse = mean_squared_error(test[target], predictions)
rmse = np.sqrt(mse)
r2 = r2_score(test[target], predictions)
return rmse, r2
# Simple Cross Validation
elif k == 1:
# Shuffle the rows and
df_shuffled = data.sample(frac=1,)
train = data.iloc[:1460]
test = data.iloc[1460:]
# Case 1
lr.fit(train[features], train[target])
predictions_1 = lr.predict(test[features])
mse_1 = mean_squared_error(test[target], predictions_1)
rmse_1 = np.sqrt(mse_1)
r2_1 = r2_score(test[target], predictions_1)
# Case 2
lr.fit(test[features], test[target])
predictions_2 = lr.predict(train[features])
mse_2 = mean_squared_error(train[target], predictions_2)
rmse_2 = np.sqrt(mse_2)
r2_2 = r2_score(train[target], predictions_2)
# Print
print('Set no.1 RMSE:',rmse_1)
print('Set no.2 RMSE:',rmse_2)
print('\n')
print('Set no.1 R2:', r2_1)
print('Set no.2 R2:', r2_2)
# Average
avg_rmse = np.mean([rmse_1, rmse_2])
avg_r2 = np.mean([r2_1, r2_2])
return avg_rmse, avg_r2
# K-Fold Cross Validation
else:
rmses = []
r2s = []
kf = KFold(n_splits=k, shuffle=True)
mses = cross_val_score(lr, data[features], data[target], scoring = 'neg_mean_squared_error', cv =kf)
r2s = cross_val_score(lr, data[features], data[target], scoring = 'r2', cv =kf)
for n in mses:
root = abs(n)**0.5
rmses.append(root)
for m in r2s:
np.append(r2s, m)
avg_rmse = np.mean(rmses)
avg_r2 = np.mean(r2s)
return avg_rmse, avg_r2
# We have updated transform_features() and select_features()
round4_transformed_data = transform_features(data)
round4_selected_data = select_features(round4_transformed_data, 0.4, 30, 0.95)
# We updated train_and_test()
round4_rmse_data = train_and_test(round4_selected_data,9)
print(round4_rmse_data)
(26612.92546342874, 0.884968003557995)
for i in range(20):
round4_rmse_data = train_and_test(round4_selected_data,i)
print(i, round4_rmse_data)
0 (27337.589340987335, 0.879393487430832) Set no.1 RMSE: 27337.589340987335 Set no.2 RMSE: 27497.522043291374 Set no.1 R2: 0.879393487430832 Set no.2 R2: 0.8768455682459111 1 (27417.555692139355, 0.8781195278383715) 2 (27602.496457610236, 0.8783352842079774) 3 (27139.775262714615, 0.8799588142711915) 4 (26754.40654103932, 0.8846465576415803) 5 (26822.128539032943, 0.88272910793457) 6 (26706.3193012581, 0.8863668954352811) 7 (26486.729511930043, 0.8832505730641651) 8 (26703.26199041539, 0.8833696323062765) 9 (26498.547057781132, 0.8841231330456684) 10 (26515.737624577265, 0.8823736757473013) 11 (26638.81943445398, 0.8837644694638215) 12 (26539.392415221988, 0.8839297628702348) 13 (26638.43553591646, 0.8846211171838969) 14 (26548.674581315747, 0.8820688535777317) 15 (26417.236045498612, 0.8838199217901334) 16 (26403.046036924676, 0.8856418276734104) 17 (26539.680902787713, 0.8825590598363491) 18 (26502.633426585315, 0.8840391564319211) 19 (26463.586983704332, 0.8858385616298949)
After having the pipeline built, lets vary the inputs to find the analyse the results
So far, we have created models as well as a chain of flexible models than can be used to find the best combinations of
coef_cutoff
: Selects the minimum Coefficient of Correlation as Cut offmyraid_cutoff
: Selects the maximum acceptable presence of the Mode value in its columnnunique_thresh
: Selects the maximum number of unique values for a column to be acceptablek
: n_splits of Cross Validationselect_features( data , coef_cutoff , cardinality_thresh , myraid_cutoff )
Lets use vast ranges for each to find the spots where we are able to get the ideal scores
# Read in the data again
data = pd.read_table('AmesHousing.tsv', delimiter="\t")
# Initiate an empty list
result_list = []
# A range of Correlation Coefficients
for coef_cutoff in np.arange(0.1,1.0,0.1):
# A range of Cardinality Thresh
for cardinality_thresh in [10,20,30]:
# Range for Myraid Cutoff (It's about the acceptable percentage presence of the mode in the column)
for myraid_cutoff in np.arange(0.5,1.0,0.100):
temp_list = []
round4_transformed_data = transform_features(data)
round4_selected_data = select_features(round4_transformed_data, coef_cutoff, cardinality_thresh , myraid_cutoff)
round4_rmse_data = train_and_test(round4_selected_data, 9)
temp_list.append(coef_cutoff)
temp_list.append(cardinality_thresh)
temp_list.append(myraid_cutoff)
temp_list.append(round4_rmse_data[0])
temp_list.append(round4_rmse_data[1])
result_list.append(temp_list)
# To show the results in formatted floats instead of scientific notation
pd.options.display.float_format = '{:.4f}'.format
results = pd.DataFrame(result_list, columns=['Correlation Coefficient', 'Cardinality Thresh',
'Myraid Cutoff', 'RMSE', 'R2'])
results
Correlation Coefficient | Cardinality Thresh | Myraid Cutoff | RMSE | R2 | |
---|---|---|---|---|---|
0 | 0.1000 | 10 | 0.5000 | 29874.8367 | 0.8554 |
1 | 0.1000 | 10 | 0.6000 | 29339.0784 | 0.8584 |
2 | 0.1000 | 10 | 0.7000 | 28734.2927 | 0.8605 |
3 | 0.1000 | 10 | 0.8000 | 28680.8633 | 0.8633 |
4 | 0.1000 | 10 | 0.9000 | 27916.3381 | 0.8723 |
5 | 0.1000 | 20 | 0.5000 | 28950.1670 | 0.8636 |
6 | 0.1000 | 20 | 0.6000 | 28512.8425 | 0.8659 |
7 | 0.1000 | 20 | 0.7000 | 27912.6818 | 0.8696 |
8 | 0.1000 | 20 | 0.8000 | 27795.4313 | 0.8707 |
9 | 0.1000 | 20 | 0.9000 | 27786.2954 | 0.8764 |
10 | 0.1000 | 30 | 0.5000 | 26376.4776 | 0.8851 |
11 | 0.1000 | 30 | 0.6000 | 25773.9918 | 0.8916 |
12 | 0.1000 | 30 | 0.7000 | 25153.7738 | 0.8964 |
13 | 0.1000 | 30 | 0.8000 | 25301.3139 | 0.8955 |
14 | 0.1000 | 30 | 0.9000 | 25129.7039 | 0.8971 |
15 | 0.2000 | 10 | 0.5000 | 30824.8420 | 0.8462 |
16 | 0.2000 | 10 | 0.6000 | 30113.7412 | 0.8491 |
17 | 0.2000 | 10 | 0.7000 | 29300.6417 | 0.8583 |
18 | 0.2000 | 10 | 0.8000 | 29136.7827 | 0.8605 |
19 | 0.2000 | 10 | 0.9000 | 28539.6717 | 0.8668 |
20 | 0.2000 | 20 | 0.5000 | 29668.5502 | 0.8561 |
21 | 0.2000 | 20 | 0.6000 | 29104.3674 | 0.8592 |
22 | 0.2000 | 20 | 0.7000 | 28512.9610 | 0.8655 |
23 | 0.2000 | 20 | 0.8000 | 28295.7189 | 0.8671 |
24 | 0.2000 | 20 | 0.9000 | 28180.5064 | 0.8712 |
25 | 0.2000 | 30 | 0.5000 | 27580.9401 | 0.8781 |
26 | 0.2000 | 30 | 0.6000 | 26853.5535 | 0.8819 |
27 | 0.2000 | 30 | 0.7000 | 26085.2709 | 0.8898 |
28 | 0.2000 | 30 | 0.8000 | 25902.8740 | 0.8884 |
29 | 0.2000 | 30 | 0.9000 | 25844.8338 | 0.8913 |
30 | 0.3000 | 10 | 0.5000 | 31775.7866 | 0.8360 |
31 | 0.3000 | 10 | 0.6000 | 31077.9497 | 0.8411 |
32 | 0.3000 | 10 | 0.7000 | 30084.7505 | 0.8516 |
33 | 0.3000 | 10 | 0.8000 | 29794.3919 | 0.8552 |
34 | 0.3000 | 10 | 0.9000 | 29116.6049 | 0.8604 |
35 | 0.3000 | 20 | 0.5000 | 30826.3542 | 0.8430 |
36 | 0.3000 | 20 | 0.6000 | 30617.8714 | 0.8516 |
37 | 0.3000 | 20 | 0.7000 | 29399.9049 | 0.8589 |
38 | 0.3000 | 20 | 0.8000 | 29024.8450 | 0.8600 |
39 | 0.3000 | 20 | 0.9000 | 29078.7050 | 0.8624 |
40 | 0.3000 | 30 | 0.5000 | 28297.4902 | 0.8677 |
41 | 0.3000 | 30 | 0.6000 | 27648.5686 | 0.8723 |
42 | 0.3000 | 30 | 0.7000 | 26788.0124 | 0.8818 |
43 | 0.3000 | 30 | 0.8000 | 26853.7018 | 0.8812 |
44 | 0.3000 | 30 | 0.9000 | 26639.7105 | 0.8848 |
45 | 0.4000 | 10 | 0.5000 | 31990.8924 | 0.8328 |
46 | 0.4000 | 10 | 0.6000 | 31377.7029 | 0.8384 |
47 | 0.4000 | 10 | 0.7000 | 30218.3739 | 0.8500 |
48 | 0.4000 | 10 | 0.8000 | 29907.3278 | 0.8514 |
49 | 0.4000 | 10 | 0.9000 | 29156.8711 | 0.8571 |
50 | 0.4000 | 20 | 0.5000 | 31072.0242 | 0.8412 |
51 | 0.4000 | 20 | 0.6000 | 30548.3411 | 0.8469 |
52 | 0.4000 | 20 | 0.7000 | 29419.7135 | 0.8570 |
53 | 0.4000 | 20 | 0.8000 | 29191.6595 | 0.8623 |
54 | 0.4000 | 20 | 0.9000 | 28975.4413 | 0.8627 |
55 | 0.4000 | 30 | 0.5000 | 28618.6679 | 0.8642 |
56 | 0.4000 | 30 | 0.6000 | 27850.6588 | 0.8723 |
57 | 0.4000 | 30 | 0.7000 | 26744.7140 | 0.8809 |
58 | 0.4000 | 30 | 0.8000 | 26824.4584 | 0.8824 |
59 | 0.4000 | 30 | 0.9000 | 26763.8406 | 0.8832 |
60 | 0.5000 | 10 | 0.5000 | 32439.6614 | 0.8298 |
61 | 0.5000 | 10 | 0.6000 | 31570.0048 | 0.8368 |
62 | 0.5000 | 10 | 0.7000 | 30351.5456 | 0.8465 |
63 | 0.5000 | 10 | 0.8000 | 30104.5001 | 0.8523 |
64 | 0.5000 | 10 | 0.9000 | 29359.8399 | 0.8577 |
65 | 0.5000 | 20 | 0.5000 | 31202.7130 | 0.8402 |
66 | 0.5000 | 20 | 0.6000 | 30849.3478 | 0.8424 |
67 | 0.5000 | 20 | 0.7000 | 29568.1206 | 0.8559 |
68 | 0.5000 | 20 | 0.8000 | 29432.0174 | 0.8585 |
69 | 0.5000 | 20 | 0.9000 | 29336.4556 | 0.8572 |
70 | 0.5000 | 30 | 0.5000 | 28799.8538 | 0.8639 |
71 | 0.5000 | 30 | 0.6000 | 28057.8098 | 0.8715 |
72 | 0.5000 | 30 | 0.7000 | 27030.1184 | 0.8796 |
73 | 0.5000 | 30 | 0.8000 | 27033.5430 | 0.8802 |
74 | 0.5000 | 30 | 0.9000 | 26760.4401 | 0.8817 |
75 | 0.6000 | 10 | 0.5000 | 33123.5304 | 0.8198 |
76 | 0.6000 | 10 | 0.6000 | 32354.0023 | 0.8281 |
77 | 0.6000 | 10 | 0.7000 | 31161.9591 | 0.8415 |
78 | 0.6000 | 10 | 0.8000 | 30617.2727 | 0.8461 |
79 | 0.6000 | 10 | 0.9000 | 29947.8215 | 0.8533 |
80 | 0.6000 | 20 | 0.5000 | 32075.7575 | 0.8323 |
81 | 0.6000 | 20 | 0.6000 | 31540.9285 | 0.8370 |
82 | 0.6000 | 20 | 0.7000 | 30077.9992 | 0.8490 |
83 | 0.6000 | 20 | 0.8000 | 29808.9672 | 0.8525 |
84 | 0.6000 | 20 | 0.9000 | 29737.0735 | 0.8554 |
85 | 0.6000 | 30 | 0.5000 | 29316.4992 | 0.8582 |
86 | 0.6000 | 30 | 0.6000 | 28599.8647 | 0.8681 |
87 | 0.6000 | 30 | 0.7000 | 27347.4964 | 0.8773 |
88 | 0.6000 | 30 | 0.8000 | 27106.2895 | 0.8773 |
89 | 0.6000 | 30 | 0.9000 | 27198.6289 | 0.8806 |
90 | 0.7000 | 10 | 0.5000 | 38101.2307 | 0.7631 |
91 | 0.7000 | 10 | 0.6000 | 34677.9186 | 0.8038 |
92 | 0.7000 | 10 | 0.7000 | 32766.0984 | 0.8217 |
93 | 0.7000 | 10 | 0.8000 | 32303.6157 | 0.8293 |
94 | 0.7000 | 10 | 0.9000 | 31480.0787 | 0.8387 |
95 | 0.7000 | 20 | 0.5000 | 34443.5013 | 0.8056 |
96 | 0.7000 | 20 | 0.6000 | 33751.3421 | 0.8142 |
97 | 0.7000 | 20 | 0.7000 | 31954.1836 | 0.8348 |
98 | 0.7000 | 20 | 0.8000 | 31274.5220 | 0.8373 |
99 | 0.7000 | 20 | 0.9000 | 31093.2444 | 0.8417 |
100 | 0.7000 | 30 | 0.5000 | 31099.4209 | 0.8398 |
101 | 0.7000 | 30 | 0.6000 | 30288.7484 | 0.8507 |
102 | 0.7000 | 30 | 0.7000 | 28780.7141 | 0.8643 |
103 | 0.7000 | 30 | 0.8000 | 28296.4102 | 0.8675 |
104 | 0.7000 | 30 | 0.9000 | 28417.1617 | 0.8672 |
105 | 0.8000 | 10 | 0.5000 | 45892.1243 | 0.6552 |
106 | 0.8000 | 10 | 0.6000 | 45685.4890 | 0.6574 |
107 | 0.8000 | 10 | 0.7000 | 43783.6148 | 0.6857 |
108 | 0.8000 | 10 | 0.8000 | 41972.2035 | 0.7124 |
109 | 0.8000 | 10 | 0.9000 | 38245699159853.3203 | -6811756776477574144.0000 |
110 | 0.8000 | 20 | 0.5000 | 117590731670117.0000 | -2454348772123703296.0000 |
111 | 0.8000 | 20 | 0.6000 | 116313764869062.2500 | -79234686101212971008.0000 |
112 | 0.8000 | 20 | 0.7000 | 1852512390479788.7500 | -204539575760350304.0000 |
113 | 0.8000 | 20 | 0.8000 | 180876859078249.8125 | -24870266454043545600.0000 |
114 | 0.8000 | 20 | 0.9000 | 112900216161031.6562 | -20714453197697302528.0000 |
115 | 0.8000 | 30 | 0.5000 | 262103807571182.5625 | -135299703027640827904.0000 |
116 | 0.8000 | 30 | 0.6000 | 429964639444931.7500 | -62269158222819917824.0000 |
117 | 0.8000 | 30 | 0.7000 | 226499332526949.0625 | -43771073783054483456.0000 |
118 | 0.8000 | 30 | 0.8000 | 104941860057478.9844 | -15785681207131865088.0000 |
119 | 0.8000 | 30 | 0.9000 | 79871838359477.6094 | -12201553272608940032.0000 |
120 | 0.9000 | 10 | 0.5000 | 45881.5210 | 0.6542 |
121 | 0.9000 | 10 | 0.6000 | 45879.2348 | 0.6561 |
122 | 0.9000 | 10 | 0.7000 | 43745.5064 | 0.6875 |
123 | 0.9000 | 10 | 0.8000 | 41934.3127 | 0.7142 |
124 | 0.9000 | 10 | 0.9000 | 40844.9056 | 0.7250 |
125 | 0.9000 | 20 | 0.5000 | 266737558126530.1562 | -19506759589265485824.0000 |
126 | 0.9000 | 20 | 0.6000 | 407857191493323.6875 | -8499002089086435328.0000 |
127 | 0.9000 | 20 | 0.7000 | 274274571392602.1562 | -190670933196039979008.0000 |
128 | 0.9000 | 20 | 0.8000 | 29458242850008.5391 | -42287022770408071168.0000 |
129 | 0.9000 | 20 | 0.9000 | 216120253262526.6250 | -68582735563274600448.0000 |
130 | 0.9000 | 30 | 0.5000 | 345209383125116.0000 | -39609920879539732480.0000 |
131 | 0.9000 | 30 | 0.6000 | 106860610429706.7969 | -46357169521402519552.0000 |
132 | 0.9000 | 30 | 0.7000 | 120263052592499.6562 | -9929688962501949440.0000 |
133 | 0.9000 | 30 | 0.8000 | 132187001784381.9375 | -30470530406334513152.0000 |
134 | 0.9000 | 30 | 0.9000 | 272153112659617.0000 | -1758456692703656192.0000 |
results.sort_values(by='R2', ascending=False).head(15)
Correlation Coefficient | Cardinality Thresh | Myraid Cutoff | RMSE | R2 | |
---|---|---|---|---|---|
14 | 0.1000 | 30 | 0.9000 | 25129.7039 | 0.8971 |
12 | 0.1000 | 30 | 0.7000 | 25153.7738 | 0.8964 |
13 | 0.1000 | 30 | 0.8000 | 25301.3139 | 0.8955 |
11 | 0.1000 | 30 | 0.6000 | 25773.9918 | 0.8916 |
29 | 0.2000 | 30 | 0.9000 | 25844.8338 | 0.8913 |
27 | 0.2000 | 30 | 0.7000 | 26085.2709 | 0.8898 |
28 | 0.2000 | 30 | 0.8000 | 25902.8740 | 0.8884 |
10 | 0.1000 | 30 | 0.5000 | 26376.4776 | 0.8851 |
44 | 0.3000 | 30 | 0.9000 | 26639.7105 | 0.8848 |
59 | 0.4000 | 30 | 0.9000 | 26763.8406 | 0.8832 |
58 | 0.4000 | 30 | 0.8000 | 26824.4584 | 0.8824 |
26 | 0.2000 | 30 | 0.6000 | 26853.5535 | 0.8819 |
42 | 0.3000 | 30 | 0.7000 | 26788.0124 | 0.8818 |
74 | 0.5000 | 30 | 0.9000 | 26760.4401 | 0.8817 |
43 | 0.3000 | 30 | 0.8000 | 26853.7018 | 0.8812 |
Its clear that the maximum Cardinality Thresh is resulting in the highest R2 (the lowest RMSE) scores
Cardinality Thresh
can be increased to see if it helps us with achieving better scoresAnother interesting metric could be the ratio between Correlation Coefficient and Myraid Cutoff.
ratio cc/mc
to get the ratioresults['ratio cc/mc'] = results['Correlation Coefficient']/results['Myraid Cutoff']
results.sort_values(by='R2', ascending=False).head(15)
Correlation Coefficient | Cardinality Thresh | Myraid Cutoff | RMSE | R2 | ratio cc/mc | |
---|---|---|---|---|---|---|
14 | 0.1000 | 30 | 0.9000 | 25129.7039 | 0.8971 | 0.1111 |
12 | 0.1000 | 30 | 0.7000 | 25153.7738 | 0.8964 | 0.1429 |
13 | 0.1000 | 30 | 0.8000 | 25301.3139 | 0.8955 | 0.1250 |
11 | 0.1000 | 30 | 0.6000 | 25773.9918 | 0.8916 | 0.1667 |
29 | 0.2000 | 30 | 0.9000 | 25844.8338 | 0.8913 | 0.2222 |
27 | 0.2000 | 30 | 0.7000 | 26085.2709 | 0.8898 | 0.2857 |
28 | 0.2000 | 30 | 0.8000 | 25902.8740 | 0.8884 | 0.2500 |
10 | 0.1000 | 30 | 0.5000 | 26376.4776 | 0.8851 | 0.2000 |
44 | 0.3000 | 30 | 0.9000 | 26639.7105 | 0.8848 | 0.3333 |
59 | 0.4000 | 30 | 0.9000 | 26763.8406 | 0.8832 | 0.4444 |
58 | 0.4000 | 30 | 0.8000 | 26824.4584 | 0.8824 | 0.5000 |
26 | 0.2000 | 30 | 0.6000 | 26853.5535 | 0.8819 | 0.3333 |
42 | 0.3000 | 30 | 0.7000 | 26788.0124 | 0.8818 | 0.4286 |
74 | 0.5000 | 30 | 0.9000 | 26760.4401 | 0.8817 | 0.5556 |
43 | 0.3000 | 30 | 0.8000 | 26853.7018 | 0.8812 | 0.3750 |
# Plot
plt.figure(figsize=(20,8))
sns.lineplot(data=results.sort_values(by='R2', ascending=False).head(100), x='ratio cc/mc', y='R2')
plt.title('Relationship between the ratio of/nCoefficient of Correlation and Myraid Cutoff', fontsize=20)
plt.xlabel('ratio cc/mc', fontsize=15)
plt.ylabel('R2 Score', fontsize=15)
sns.despine()
plt.grid()
Correlation Coefficient
and Myraid Cutoff
!# Read in the data again
data = pd.read_table('AmesHousing.tsv', delimiter="\t")
# Initiate an empty list
result_list_close = []
# A range of Correlation Coefficients
for coef_cutoff in np.arange(0.01,0.20,0.01):
# Range for Myraid Cutoff (It's about the acceptable percentage presence of the mode in the column)
for myraid_cutoff in np.arange(0.8,1.0,0.01):
temp_list = []
round4_transformed_data = transform_features(data)
round4_selected_data = select_features(round4_transformed_data, coef_cutoff, 40, myraid_cutoff)
round4_rmse_data = train_and_test(round4_selected_data, 9)
temp_list.append(coef_cutoff)
temp_list.append(cardinality_thresh)
temp_list.append(myraid_cutoff)
temp_list.append(round4_rmse_data[0])
temp_list.append(round4_rmse_data[1])
result_list_close.append(temp_list)
results_close = pd.DataFrame(result_list_close, columns=['Correlation Coefficient', 'Cardinality Thresh',
'Myraid Cutoff', 'RMSE', 'R2'])
results_close
Correlation Coefficient | Cardinality Thresh | Myraid Cutoff | RMSE | R2 | |
---|---|---|---|---|---|
0 | 0.0100 | 30 | 0.8000 | 25319.5955 | 0.8931 |
1 | 0.0100 | 30 | 0.8100 | 25390.3321 | 0.8941 |
2 | 0.0100 | 30 | 0.8200 | 25403.5577 | 0.8939 |
3 | 0.0100 | 30 | 0.8300 | 25531.9459 | 0.8953 |
4 | 0.0100 | 30 | 0.8400 | 25462.0606 | 0.8941 |
5 | 0.0100 | 30 | 0.8500 | 25335.8613 | 0.8959 |
6 | 0.0100 | 30 | 0.8600 | 25330.5257 | 0.8956 |
7 | 0.0100 | 30 | 0.8700 | 25199.4548 | 0.8945 |
8 | 0.0100 | 30 | 0.8800 | 25353.6194 | 0.8951 |
9 | 0.0100 | 30 | 0.8900 | 25294.7159 | 0.8974 |
10 | 0.0100 | 30 | 0.9000 | 25228.6555 | 0.8959 |
11 | 0.0100 | 30 | 0.9100 | 25072.1735 | 0.8966 |
12 | 0.0100 | 30 | 0.9200 | 25359.7330 | 0.8960 |
13 | 0.0100 | 30 | 0.9300 | 25379.5111 | 0.8953 |
14 | 0.0100 | 30 | 0.9400 | 25056.7559 | 0.9000 |
15 | 0.0100 | 30 | 0.9500 | 25172.4014 | 0.8960 |
16 | 0.0100 | 30 | 0.9600 | 25020.5051 | 0.8970 |
17 | 0.0100 | 30 | 0.9700 | 25054.8722 | 0.8985 |
18 | 0.0100 | 30 | 0.9800 | 25123.3375 | 0.9000 |
19 | 0.0100 | 30 | 0.9900 | 25056.5723 | 0.8967 |
20 | 0.0200 | 30 | 0.8000 | 25333.3243 | 0.8947 |
21 | 0.0200 | 30 | 0.8100 | 25377.5409 | 0.8936 |
22 | 0.0200 | 30 | 0.8200 | 25271.8764 | 0.8946 |
23 | 0.0200 | 30 | 0.8300 | 25676.3231 | 0.8940 |
24 | 0.0200 | 30 | 0.8400 | 25243.8320 | 0.8934 |
25 | 0.0200 | 30 | 0.8500 | 25414.2845 | 0.8939 |
26 | 0.0200 | 30 | 0.8600 | 25476.6015 | 0.8946 |
27 | 0.0200 | 30 | 0.8700 | 25254.5181 | 0.8970 |
28 | 0.0200 | 30 | 0.8800 | 25138.6216 | 0.8946 |
29 | 0.0200 | 30 | 0.8900 | 25388.8613 | 0.8960 |
30 | 0.0200 | 30 | 0.9000 | 25119.6256 | 0.8964 |
31 | 0.0200 | 30 | 0.9100 | 25292.8234 | 0.8962 |
32 | 0.0200 | 30 | 0.9200 | 25143.7931 | 0.8956 |
33 | 0.0200 | 30 | 0.9300 | 25143.0949 | 0.8977 |
34 | 0.0200 | 30 | 0.9400 | 24886.7747 | 0.8967 |
35 | 0.0200 | 30 | 0.9500 | 25177.4022 | 0.8971 |
36 | 0.0200 | 30 | 0.9600 | 25070.7707 | 0.8959 |
37 | 0.0200 | 30 | 0.9700 | 24869.1029 | 0.8986 |
38 | 0.0200 | 30 | 0.9800 | 25155.2075 | 0.8967 |
39 | 0.0200 | 30 | 0.9900 | 25020.5700 | 0.8989 |
40 | 0.0300 | 30 | 0.8000 | 25344.5735 | 0.8934 |
41 | 0.0300 | 30 | 0.8100 | 25362.7865 | 0.8950 |
42 | 0.0300 | 30 | 0.8200 | 25242.2969 | 0.8951 |
43 | 0.0300 | 30 | 0.8300 | 25303.7224 | 0.8939 |
44 | 0.0300 | 30 | 0.8400 | 25299.8104 | 0.8955 |
45 | 0.0300 | 30 | 0.8500 | 25211.6326 | 0.8943 |
46 | 0.0300 | 30 | 0.8600 | 25352.3789 | 0.8949 |
47 | 0.0300 | 30 | 0.8700 | 25267.1844 | 0.8966 |
48 | 0.0300 | 30 | 0.8800 | 25332.5574 | 0.8946 |
49 | 0.0300 | 30 | 0.8900 | 25251.6792 | 0.8962 |
50 | 0.0300 | 30 | 0.9000 | 25088.4363 | 0.8987 |
51 | 0.0300 | 30 | 0.9100 | 25118.3952 | 0.8932 |
52 | 0.0300 | 30 | 0.9200 | 25195.2474 | 0.8945 |
53 | 0.0300 | 30 | 0.9300 | 25140.7994 | 0.8978 |
54 | 0.0300 | 30 | 0.9400 | 24918.9759 | 0.8972 |
55 | 0.0300 | 30 | 0.9500 | 25132.9378 | 0.8957 |
56 | 0.0300 | 30 | 0.9600 | 25069.8472 | 0.8980 |
57 | 0.0300 | 30 | 0.9700 | 25040.0198 | 0.8964 |
58 | 0.0300 | 30 | 0.9800 | 24948.3835 | 0.8979 |
59 | 0.0300 | 30 | 0.9900 | 24982.1514 | 0.8964 |
60 | 0.0400 | 30 | 0.8000 | 25457.3011 | 0.8937 |
61 | 0.0400 | 30 | 0.8100 | 25687.7407 | 0.8950 |
62 | 0.0400 | 30 | 0.8200 | 25342.4870 | 0.8955 |
63 | 0.0400 | 30 | 0.8300 | 25277.0118 | 0.8950 |
64 | 0.0400 | 30 | 0.8400 | 25187.1096 | 0.8923 |
65 | 0.0400 | 30 | 0.8500 | 25271.0769 | 0.8955 |
66 | 0.0400 | 30 | 0.8600 | 25166.0803 | 0.8919 |
67 | 0.0400 | 30 | 0.8700 | 25291.6821 | 0.8953 |
68 | 0.0400 | 30 | 0.8800 | 25194.3690 | 0.8945 |
69 | 0.0400 | 30 | 0.8900 | 25068.8806 | 0.8951 |
70 | 0.0400 | 30 | 0.9000 | 24981.2742 | 0.8971 |
71 | 0.0400 | 30 | 0.9100 | 24952.0524 | 0.8974 |
72 | 0.0400 | 30 | 0.9200 | 25147.7355 | 0.8977 |
73 | 0.0400 | 30 | 0.9300 | 25100.1328 | 0.8953 |
74 | 0.0400 | 30 | 0.9400 | 24931.1303 | 0.8980 |
75 | 0.0400 | 30 | 0.9500 | 24846.3240 | 0.8968 |
76 | 0.0400 | 30 | 0.9600 | 24925.1585 | 0.8977 |
77 | 0.0400 | 30 | 0.9700 | 24925.5928 | 0.8975 |
78 | 0.0400 | 30 | 0.9800 | 24947.6206 | 0.8980 |
79 | 0.0400 | 30 | 0.9900 | 24954.4494 | 0.8978 |
80 | 0.0500 | 30 | 0.8000 | 25287.8627 | 0.8951 |
81 | 0.0500 | 30 | 0.8100 | 25402.5233 | 0.8948 |
82 | 0.0500 | 30 | 0.8200 | 25328.9230 | 0.8957 |
83 | 0.0500 | 30 | 0.8300 | 25496.7254 | 0.8973 |
84 | 0.0500 | 30 | 0.8400 | 25347.2547 | 0.8960 |
85 | 0.0500 | 30 | 0.8500 | 25216.3787 | 0.8921 |
86 | 0.0500 | 30 | 0.8600 | 25410.7218 | 0.8953 |
87 | 0.0500 | 30 | 0.8700 | 25235.7093 | 0.8963 |
88 | 0.0500 | 30 | 0.8800 | 25111.2412 | 0.8975 |
89 | 0.0500 | 30 | 0.8900 | 25256.0988 | 0.8964 |
90 | 0.0500 | 30 | 0.9000 | 25043.3415 | 0.8962 |
91 | 0.0500 | 30 | 0.9100 | 25148.1891 | 0.8955 |
92 | 0.0500 | 30 | 0.9200 | 25033.4180 | 0.8970 |
93 | 0.0500 | 30 | 0.9300 | 25105.7461 | 0.8962 |
94 | 0.0500 | 30 | 0.9400 | 24811.7022 | 0.8986 |
95 | 0.0500 | 30 | 0.9500 | 24917.7888 | 0.8975 |
96 | 0.0500 | 30 | 0.9600 | 25053.3417 | 0.8996 |
97 | 0.0500 | 30 | 0.9700 | 24877.9370 | 0.8986 |
98 | 0.0500 | 30 | 0.9800 | 24953.5953 | 0.8983 |
99 | 0.0500 | 30 | 0.9900 | 25110.9524 | 0.8972 |
100 | 0.0600 | 30 | 0.8000 | 25289.3175 | 0.8934 |
101 | 0.0600 | 30 | 0.8100 | 25480.3831 | 0.8949 |
102 | 0.0600 | 30 | 0.8200 | 25360.5408 | 0.8963 |
103 | 0.0600 | 30 | 0.8300 | 25361.8186 | 0.8944 |
104 | 0.0600 | 30 | 0.8400 | 25356.7221 | 0.8954 |
105 | 0.0600 | 30 | 0.8500 | 25443.6883 | 0.8960 |
106 | 0.0600 | 30 | 0.8600 | 25435.9519 | 0.8923 |
107 | 0.0600 | 30 | 0.8700 | 25141.3856 | 0.8940 |
108 | 0.0600 | 30 | 0.8800 | 25244.9713 | 0.8968 |
109 | 0.0600 | 30 | 0.8900 | 25102.7205 | 0.8956 |
110 | 0.0600 | 30 | 0.9000 | 25060.4129 | 0.8985 |
111 | 0.0600 | 30 | 0.9100 | 25056.8897 | 0.8961 |
112 | 0.0600 | 30 | 0.9200 | 25121.0038 | 0.8957 |
113 | 0.0600 | 30 | 0.9300 | 25040.8626 | 0.8979 |
114 | 0.0600 | 30 | 0.9400 | 24863.4688 | 0.8985 |
115 | 0.0600 | 30 | 0.9500 | 24775.3363 | 0.8982 |
116 | 0.0600 | 30 | 0.9600 | 24969.0290 | 0.8989 |
117 | 0.0600 | 30 | 0.9700 | 24914.2901 | 0.8981 |
118 | 0.0600 | 30 | 0.9800 | 25032.4151 | 0.8962 |
119 | 0.0600 | 30 | 0.9900 | 25077.8580 | 0.8962 |
120 | 0.0700 | 30 | 0.8000 | 25330.4999 | 0.8942 |
121 | 0.0700 | 30 | 0.8100 | 25412.3133 | 0.8943 |
122 | 0.0700 | 30 | 0.8200 | 25460.5777 | 0.8938 |
123 | 0.0700 | 30 | 0.8300 | 25231.1880 | 0.8928 |
124 | 0.0700 | 30 | 0.8400 | 25282.9938 | 0.8946 |
125 | 0.0700 | 30 | 0.8500 | 25342.0241 | 0.8956 |
126 | 0.0700 | 30 | 0.8600 | 25327.5343 | 0.8940 |
127 | 0.0700 | 30 | 0.8700 | 25082.1916 | 0.8966 |
128 | 0.0700 | 30 | 0.8800 | 25308.7553 | 0.8963 |
129 | 0.0700 | 30 | 0.8900 | 25317.3376 | 0.8973 |
130 | 0.0700 | 30 | 0.9000 | 25159.0771 | 0.8942 |
131 | 0.0700 | 30 | 0.9100 | 25114.4776 | 0.8977 |
132 | 0.0700 | 30 | 0.9200 | 25261.8553 | 0.8972 |
133 | 0.0700 | 30 | 0.9300 | 25007.7135 | 0.8957 |
134 | 0.0700 | 30 | 0.9400 | 24996.8801 | 0.8997 |
135 | 0.0700 | 30 | 0.9500 | 25063.2953 | 0.8983 |
136 | 0.0700 | 30 | 0.9600 | 25215.8539 | 0.8973 |
137 | 0.0700 | 30 | 0.9700 | 24771.9368 | 0.8946 |
138 | 0.0700 | 30 | 0.9800 | 24840.9240 | 0.8985 |
139 | 0.0700 | 30 | 0.9900 | 25036.1353 | 0.8963 |
140 | 0.0800 | 30 | 0.8000 | 25355.9836 | 0.8954 |
141 | 0.0800 | 30 | 0.8100 | 25428.8492 | 0.8951 |
142 | 0.0800 | 30 | 0.8200 | 25403.4293 | 0.8946 |
143 | 0.0800 | 30 | 0.8300 | 25261.0324 | 0.8936 |
144 | 0.0800 | 30 | 0.8400 | 25140.9177 | 0.8957 |
145 | 0.0800 | 30 | 0.8500 | 25328.2455 | 0.8953 |
146 | 0.0800 | 30 | 0.8600 | 25380.8979 | 0.8926 |
147 | 0.0800 | 30 | 0.8700 | 25328.3488 | 0.8954 |
148 | 0.0800 | 30 | 0.8800 | 25117.4951 | 0.8952 |
149 | 0.0800 | 30 | 0.8900 | 25266.1902 | 0.8917 |
150 | 0.0800 | 30 | 0.9000 | 25124.5786 | 0.8960 |
151 | 0.0800 | 30 | 0.9100 | 25184.9526 | 0.8947 |
152 | 0.0800 | 30 | 0.9200 | 25091.5817 | 0.8977 |
153 | 0.0800 | 30 | 0.9300 | 25239.7873 | 0.8935 |
154 | 0.0800 | 30 | 0.9400 | 24909.8578 | 0.8964 |
155 | 0.0800 | 30 | 0.9500 | 25087.0659 | 0.8984 |
156 | 0.0800 | 30 | 0.9600 | 25122.2417 | 0.8976 |
157 | 0.0800 | 30 | 0.9700 | 24974.0710 | 0.8974 |
158 | 0.0800 | 30 | 0.9800 | 24992.1790 | 0.8979 |
159 | 0.0800 | 30 | 0.9900 | 25159.6298 | 0.8988 |
160 | 0.0900 | 30 | 0.8000 | 25286.9401 | 0.8936 |
161 | 0.0900 | 30 | 0.8100 | 25203.5317 | 0.8945 |
162 | 0.0900 | 30 | 0.8200 | 25291.8415 | 0.8948 |
163 | 0.0900 | 30 | 0.8300 | 25234.7825 | 0.8942 |
164 | 0.0900 | 30 | 0.8400 | 25257.4933 | 0.8946 |
165 | 0.0900 | 30 | 0.8500 | 25314.1546 | 0.8965 |
166 | 0.0900 | 30 | 0.8600 | 25541.3470 | 0.8934 |
167 | 0.0900 | 30 | 0.8700 | 25067.3912 | 0.8971 |
168 | 0.0900 | 30 | 0.8800 | 25489.1142 | 0.8944 |
169 | 0.0900 | 30 | 0.8900 | 25129.9798 | 0.8955 |
170 | 0.0900 | 30 | 0.9000 | 25184.0899 | 0.8961 |
171 | 0.0900 | 30 | 0.9100 | 25205.9237 | 0.8970 |
172 | 0.0900 | 30 | 0.9200 | 25080.7618 | 0.8962 |
173 | 0.0900 | 30 | 0.9300 | 25127.7912 | 0.8955 |
174 | 0.0900 | 30 | 0.9400 | 24964.7274 | 0.8991 |
175 | 0.0900 | 30 | 0.9500 | 24852.4966 | 0.8987 |
176 | 0.0900 | 30 | 0.9600 | 24981.7119 | 0.8977 |
177 | 0.0900 | 30 | 0.9700 | 24924.3391 | 0.8966 |
178 | 0.0900 | 30 | 0.9800 | 24786.3574 | 0.8991 |
179 | 0.0900 | 30 | 0.9900 | 25102.1824 | 0.8977 |
180 | 0.1000 | 30 | 0.8000 | 25249.1264 | 0.8944 |
181 | 0.1000 | 30 | 0.8100 | 25368.7032 | 0.8964 |
182 | 0.1000 | 30 | 0.8200 | 25168.4508 | 0.8949 |
183 | 0.1000 | 30 | 0.8300 | 25288.8140 | 0.8951 |
184 | 0.1000 | 30 | 0.8400 | 25394.5968 | 0.8945 |
185 | 0.1000 | 30 | 0.8500 | 25240.8144 | 0.8964 |
186 | 0.1000 | 30 | 0.8600 | 25318.4631 | 0.8951 |
187 | 0.1000 | 30 | 0.8700 | 25068.2277 | 0.8954 |
188 | 0.1000 | 30 | 0.8800 | 25291.6713 | 0.8967 |
189 | 0.1000 | 30 | 0.8900 | 25196.6923 | 0.8957 |
190 | 0.1000 | 30 | 0.9000 | 25086.4070 | 0.8959 |
191 | 0.1000 | 30 | 0.9100 | 25085.0352 | 0.8954 |
192 | 0.1000 | 30 | 0.9200 | 25125.1610 | 0.8965 |
193 | 0.1000 | 30 | 0.9300 | 25013.9674 | 0.8976 |
194 | 0.1000 | 30 | 0.9400 | 24925.7541 | 0.8973 |
195 | 0.1000 | 30 | 0.9500 | 24912.1327 | 0.8987 |
196 | 0.1000 | 30 | 0.9600 | 24893.4669 | 0.8978 |
197 | 0.1000 | 30 | 0.9700 | 24847.9441 | 0.8978 |
198 | 0.1000 | 30 | 0.9800 | 24819.6316 | 0.8961 |
199 | 0.1000 | 30 | 0.9900 | 25402.4468 | 0.8970 |
200 | 0.1100 | 30 | 0.8000 | 25302.4186 | 0.8958 |
201 | 0.1100 | 30 | 0.8100 | 25333.3363 | 0.8949 |
202 | 0.1100 | 30 | 0.8200 | 25324.9672 | 0.8959 |
203 | 0.1100 | 30 | 0.8300 | 25096.7135 | 0.8939 |
204 | 0.1100 | 30 | 0.8400 | 25320.8600 | 0.8949 |
205 | 0.1100 | 30 | 0.8500 | 25344.5433 | 0.8950 |
206 | 0.1100 | 30 | 0.8600 | 25360.9010 | 0.8928 |
207 | 0.1100 | 30 | 0.8700 | 25111.8798 | 0.8956 |
208 | 0.1100 | 30 | 0.8800 | 25203.3297 | 0.8964 |
209 | 0.1100 | 30 | 0.8900 | 25176.8837 | 0.8955 |
210 | 0.1100 | 30 | 0.9000 | 25061.2102 | 0.8978 |
211 | 0.1100 | 30 | 0.9100 | 25227.7469 | 0.8977 |
212 | 0.1100 | 30 | 0.9200 | 25150.8792 | 0.8978 |
213 | 0.1100 | 30 | 0.9300 | 25040.8364 | 0.8956 |
214 | 0.1100 | 30 | 0.9400 | 25118.2509 | 0.8977 |
215 | 0.1100 | 30 | 0.9500 | 24924.8750 | 0.8963 |
216 | 0.1100 | 30 | 0.9600 | 24998.7718 | 0.8987 |
217 | 0.1100 | 30 | 0.9700 | 24786.6189 | 0.8981 |
218 | 0.1100 | 30 | 0.9800 | 25147.2220 | 0.8981 |
219 | 0.1100 | 30 | 0.9900 | 25254.5353 | 0.8964 |
220 | 0.1200 | 30 | 0.8000 | 25445.4207 | 0.8940 |
221 | 0.1200 | 30 | 0.8100 | 25518.3016 | 0.8950 |
222 | 0.1200 | 30 | 0.8200 | 25548.9545 | 0.8950 |
223 | 0.1200 | 30 | 0.8300 | 25469.4524 | 0.8947 |
224 | 0.1200 | 30 | 0.8400 | 25474.0338 | 0.8943 |
225 | 0.1200 | 30 | 0.8500 | 25340.9946 | 0.8928 |
226 | 0.1200 | 30 | 0.8600 | 25342.5606 | 0.8945 |
227 | 0.1200 | 30 | 0.8700 | 25227.9081 | 0.8963 |
228 | 0.1200 | 30 | 0.8800 | 25142.2117 | 0.8953 |
229 | 0.1200 | 30 | 0.8900 | 25223.1234 | 0.8959 |
230 | 0.1200 | 30 | 0.9000 | 25258.9367 | 0.8950 |
231 | 0.1200 | 30 | 0.9100 | 25205.6908 | 0.8976 |
232 | 0.1200 | 30 | 0.9200 | 25292.2390 | 0.8934 |
233 | 0.1200 | 30 | 0.9300 | 25195.5771 | 0.8947 |
234 | 0.1200 | 30 | 0.9400 | 25184.2472 | 0.8975 |
235 | 0.1200 | 30 | 0.9500 | 24986.2612 | 0.8947 |
236 | 0.1200 | 30 | 0.9600 | 24853.0997 | 0.8981 |
237 | 0.1200 | 30 | 0.9700 | 25016.2541 | 0.8949 |
238 | 0.1200 | 30 | 0.9800 | 25190.0745 | 0.8961 |
239 | 0.1200 | 30 | 0.9900 | 25137.1456 | 0.8970 |
240 | 0.1300 | 30 | 0.8000 | 25328.8072 | 0.8951 |
241 | 0.1300 | 30 | 0.8100 | 25677.6035 | 0.8920 |
242 | 0.1300 | 30 | 0.8200 | 25612.5971 | 0.8941 |
243 | 0.1300 | 30 | 0.8300 | 25299.9881 | 0.8952 |
244 | 0.1300 | 30 | 0.8400 | 25276.6733 | 0.8933 |
245 | 0.1300 | 30 | 0.8500 | 25395.4655 | 0.8948 |
246 | 0.1300 | 30 | 0.8600 | 25375.5590 | 0.8950 |
247 | 0.1300 | 30 | 0.8700 | 25371.7674 | 0.8953 |
248 | 0.1300 | 30 | 0.8800 | 25316.7328 | 0.8936 |
249 | 0.1300 | 30 | 0.8900 | 25332.7555 | 0.8940 |
250 | 0.1300 | 30 | 0.9000 | 25221.8688 | 0.8967 |
251 | 0.1300 | 30 | 0.9100 | 25228.2105 | 0.8950 |
252 | 0.1300 | 30 | 0.9200 | 25348.1457 | 0.8963 |
253 | 0.1300 | 30 | 0.9300 | 25337.1565 | 0.8975 |
254 | 0.1300 | 30 | 0.9400 | 24855.5039 | 0.8957 |
255 | 0.1300 | 30 | 0.9500 | 25149.2091 | 0.8970 |
256 | 0.1300 | 30 | 0.9600 | 25065.7053 | 0.8965 |
257 | 0.1300 | 30 | 0.9700 | 24998.1724 | 0.8974 |
258 | 0.1300 | 30 | 0.9800 | 25146.8251 | 0.8961 |
259 | 0.1300 | 30 | 0.9900 | 25171.0554 | 0.8917 |
260 | 0.1400 | 30 | 0.8000 | 25519.7418 | 0.8934 |
261 | 0.1400 | 30 | 0.8100 | 25517.0525 | 0.8936 |
262 | 0.1400 | 30 | 0.8200 | 25320.0197 | 0.8944 |
263 | 0.1400 | 30 | 0.8300 | 25263.3055 | 0.8953 |
264 | 0.1400 | 30 | 0.8400 | 25178.4581 | 0.8947 |
265 | 0.1400 | 30 | 0.8500 | 25453.2282 | 0.8944 |
266 | 0.1400 | 30 | 0.8600 | 25345.5451 | 0.8935 |
267 | 0.1400 | 30 | 0.8700 | 25406.7961 | 0.8952 |
268 | 0.1400 | 30 | 0.8800 | 25234.0289 | 0.8948 |
269 | 0.1400 | 30 | 0.8900 | 25258.7209 | 0.8950 |
270 | 0.1400 | 30 | 0.9000 | 25110.6806 | 0.8968 |
271 | 0.1400 | 30 | 0.9100 | 25122.5146 | 0.8958 |
272 | 0.1400 | 30 | 0.9200 | 25247.5922 | 0.8972 |
273 | 0.1400 | 30 | 0.9300 | 25159.7754 | 0.8953 |
274 | 0.1400 | 30 | 0.9400 | 25006.6667 | 0.8976 |
275 | 0.1400 | 30 | 0.9500 | 25002.3834 | 0.8972 |
276 | 0.1400 | 30 | 0.9600 | 25025.8777 | 0.8962 |
277 | 0.1400 | 30 | 0.9700 | 25065.5751 | 0.8986 |
278 | 0.1400 | 30 | 0.9800 | 25033.9803 | 0.8976 |
279 | 0.1400 | 30 | 0.9900 | 25138.8757 | 0.8976 |
280 | 0.1500 | 30 | 0.8000 | 25313.2262 | 0.8943 |
281 | 0.1500 | 30 | 0.8100 | 25364.5597 | 0.8944 |
282 | 0.1500 | 30 | 0.8200 | 25378.7642 | 0.8945 |
283 | 0.1500 | 30 | 0.8300 | 25337.6887 | 0.8935 |
284 | 0.1500 | 30 | 0.8400 | 25359.7424 | 0.8945 |
285 | 0.1500 | 30 | 0.8500 | 25371.3660 | 0.8937 |
286 | 0.1500 | 30 | 0.8600 | 25562.6126 | 0.8934 |
287 | 0.1500 | 30 | 0.8700 | 25315.9071 | 0.8950 |
288 | 0.1500 | 30 | 0.8800 | 25414.9824 | 0.8954 |
289 | 0.1500 | 30 | 0.8900 | 25273.0151 | 0.8957 |
290 | 0.1500 | 30 | 0.9000 | 25280.5025 | 0.8959 |
291 | 0.1500 | 30 | 0.9100 | 25450.5367 | 0.8958 |
292 | 0.1500 | 30 | 0.9200 | 25224.5489 | 0.8959 |
293 | 0.1500 | 30 | 0.9300 | 25538.6673 | 0.8950 |
294 | 0.1500 | 30 | 0.9400 | 25020.0794 | 0.8956 |
295 | 0.1500 | 30 | 0.9500 | 24915.5722 | 0.8954 |
296 | 0.1500 | 30 | 0.9600 | 25013.7292 | 0.8959 |
297 | 0.1500 | 30 | 0.9700 | 25134.7871 | 0.8967 |
298 | 0.1500 | 30 | 0.9800 | 25073.6322 | 0.8964 |
299 | 0.1500 | 30 | 0.9900 | 24830.1082 | 0.8954 |
300 | 0.1600 | 30 | 0.8000 | 25371.8193 | 0.8933 |
301 | 0.1600 | 30 | 0.8100 | 25510.5650 | 0.8959 |
302 | 0.1600 | 30 | 0.8200 | 25654.3858 | 0.8951 |
303 | 0.1600 | 30 | 0.8300 | 25483.7484 | 0.8938 |
304 | 0.1600 | 30 | 0.8400 | 25199.7884 | 0.8955 |
305 | 0.1600 | 30 | 0.8500 | 25307.3109 | 0.8956 |
306 | 0.1600 | 30 | 0.8600 | 25231.0101 | 0.8915 |
307 | 0.1600 | 30 | 0.8700 | 25280.3796 | 0.8955 |
308 | 0.1600 | 30 | 0.8800 | 25268.1001 | 0.8950 |
309 | 0.1600 | 30 | 0.8900 | 25568.0851 | 0.8934 |
310 | 0.1600 | 30 | 0.9000 | 25171.3300 | 0.8964 |
311 | 0.1600 | 30 | 0.9100 | 25130.9337 | 0.8968 |
312 | 0.1600 | 30 | 0.9200 | 25331.8625 | 0.8961 |
313 | 0.1600 | 30 | 0.9300 | 25200.9746 | 0.8947 |
314 | 0.1600 | 30 | 0.9400 | 24910.4047 | 0.8967 |
315 | 0.1600 | 30 | 0.9500 | 24991.1136 | 0.8976 |
316 | 0.1600 | 30 | 0.9600 | 24849.1143 | 0.8976 |
317 | 0.1600 | 30 | 0.9700 | 25364.4195 | 0.8954 |
318 | 0.1600 | 30 | 0.9800 | 24906.6399 | 0.8977 |
319 | 0.1600 | 30 | 0.9900 | 25252.9911 | 0.8957 |
320 | 0.1700 | 30 | 0.8000 | 25508.1059 | 0.8952 |
321 | 0.1700 | 30 | 0.8100 | 25627.3496 | 0.8922 |
322 | 0.1700 | 30 | 0.8200 | 25162.1694 | 0.8942 |
323 | 0.1700 | 30 | 0.8300 | 25313.3254 | 0.8954 |
324 | 0.1700 | 30 | 0.8400 | 25389.5450 | 0.8955 |
325 | 0.1700 | 30 | 0.8500 | 25428.1668 | 0.8934 |
326 | 0.1700 | 30 | 0.8600 | 25398.4563 | 0.8945 |
327 | 0.1700 | 30 | 0.8700 | 25291.4351 | 0.8961 |
328 | 0.1700 | 30 | 0.8800 | 25478.1254 | 0.8933 |
329 | 0.1700 | 30 | 0.8900 | 25391.2349 | 0.8954 |
330 | 0.1700 | 30 | 0.9000 | 25181.2809 | 0.8967 |
331 | 0.1700 | 30 | 0.9100 | 25127.5570 | 0.8962 |
332 | 0.1700 | 30 | 0.9200 | 25235.7226 | 0.8944 |
333 | 0.1700 | 30 | 0.9300 | 25131.2194 | 0.8973 |
334 | 0.1700 | 30 | 0.9400 | 24986.2725 | 0.8971 |
335 | 0.1700 | 30 | 0.9500 | 25109.3512 | 0.8970 |
336 | 0.1700 | 30 | 0.9600 | 25057.2383 | 0.8962 |
337 | 0.1700 | 30 | 0.9700 | 24996.4183 | 0.8970 |
338 | 0.1700 | 30 | 0.9800 | 24866.0483 | 0.8992 |
339 | 0.1700 | 30 | 0.9900 | 25005.6390 | 0.8952 |
340 | 0.1800 | 30 | 0.8000 | 25464.0659 | 0.8928 |
341 | 0.1800 | 30 | 0.8100 | 25683.6299 | 0.8945 |
342 | 0.1800 | 30 | 0.8200 | 25295.2191 | 0.8931 |
343 | 0.1800 | 30 | 0.8300 | 25525.3959 | 0.8908 |
344 | 0.1800 | 30 | 0.8400 | 25653.6839 | 0.8930 |
345 | 0.1800 | 30 | 0.8500 | 25396.1965 | 0.8941 |
346 | 0.1800 | 30 | 0.8600 | 25392.1306 | 0.8944 |
347 | 0.1800 | 30 | 0.8700 | 25517.2482 | 0.8964 |
348 | 0.1800 | 30 | 0.8800 | 25251.3181 | 0.8966 |
349 | 0.1800 | 30 | 0.8900 | 25398.1104 | 0.8956 |
350 | 0.1800 | 30 | 0.9000 | 25145.6083 | 0.8964 |
351 | 0.1800 | 30 | 0.9100 | 25162.0446 | 0.8960 |
352 | 0.1800 | 30 | 0.9200 | 25303.6140 | 0.8939 |
353 | 0.1800 | 30 | 0.9300 | 25075.8085 | 0.8942 |
354 | 0.1800 | 30 | 0.9400 | 25026.3079 | 0.8979 |
355 | 0.1800 | 30 | 0.9500 | 25058.5700 | 0.8964 |
356 | 0.1800 | 30 | 0.9600 | 25099.9346 | 0.8978 |
357 | 0.1800 | 30 | 0.9700 | 25035.6446 | 0.8953 |
358 | 0.1800 | 30 | 0.9800 | 24959.9797 | 0.8950 |
359 | 0.1800 | 30 | 0.9900 | 24972.9613 | 0.8984 |
360 | 0.1900 | 30 | 0.8000 | 25969.8119 | 0.8891 |
361 | 0.1900 | 30 | 0.8100 | 26079.2299 | 0.8875 |
362 | 0.1900 | 30 | 0.8200 | 26038.0914 | 0.8893 |
363 | 0.1900 | 30 | 0.8300 | 26059.7966 | 0.8884 |
364 | 0.1900 | 30 | 0.8400 | 25963.7348 | 0.8846 |
365 | 0.1900 | 30 | 0.8500 | 25809.5009 | 0.8903 |
366 | 0.1900 | 30 | 0.8600 | 25969.0608 | 0.8901 |
367 | 0.1900 | 30 | 0.8700 | 26322.1859 | 0.8907 |
368 | 0.1900 | 30 | 0.8800 | 25750.0185 | 0.8905 |
369 | 0.1900 | 30 | 0.8900 | 25945.6055 | 0.8899 |
370 | 0.1900 | 30 | 0.9000 | 26044.8803 | 0.8895 |
371 | 0.1900 | 30 | 0.9100 | 25828.6522 | 0.8914 |
372 | 0.1900 | 30 | 0.9200 | 25804.7176 | 0.8922 |
373 | 0.1900 | 30 | 0.9300 | 25704.0433 | 0.8917 |
374 | 0.1900 | 30 | 0.9400 | 25650.3304 | 0.8918 |
375 | 0.1900 | 30 | 0.9500 | 25693.0419 | 0.8916 |
376 | 0.1900 | 30 | 0.9600 | 25818.1989 | 0.8901 |
377 | 0.1900 | 30 | 0.9700 | 25626.4920 | 0.8919 |
378 | 0.1900 | 30 | 0.9800 | 25458.3044 | 0.8935 |
379 | 0.1900 | 30 | 0.9900 | 25583.0284 | 0.8900 |
results_close.sort_values(by='R2', ascending=False).head(15)
Correlation Coefficient | Cardinality Thresh | Myraid Cutoff | RMSE | R2 | |
---|---|---|---|---|---|
18 | 0.0100 | 30 | 0.9800 | 25123.3375 | 0.9000 |
14 | 0.0100 | 30 | 0.9400 | 25056.7559 | 0.9000 |
134 | 0.0700 | 30 | 0.9400 | 24996.8801 | 0.8997 |
96 | 0.0500 | 30 | 0.9600 | 25053.3417 | 0.8996 |
338 | 0.1700 | 30 | 0.9800 | 24866.0483 | 0.8992 |
178 | 0.0900 | 30 | 0.9800 | 24786.3574 | 0.8991 |
174 | 0.0900 | 30 | 0.9400 | 24964.7274 | 0.8991 |
39 | 0.0200 | 30 | 0.9900 | 25020.5700 | 0.8989 |
116 | 0.0600 | 30 | 0.9600 | 24969.0290 | 0.8989 |
159 | 0.0800 | 30 | 0.9900 | 25159.6298 | 0.8988 |
175 | 0.0900 | 30 | 0.9500 | 24852.4966 | 0.8987 |
195 | 0.1000 | 30 | 0.9500 | 24912.1327 | 0.8987 |
50 | 0.0300 | 30 | 0.9000 | 25088.4363 | 0.8987 |
216 | 0.1100 | 30 | 0.9600 | 24998.7718 | 0.8987 |
37 | 0.0200 | 30 | 0.9700 | 24869.1029 | 0.8986 |
Cardinality Thresh
to see if we can get even bettercardinality_thresh = np.arange(10,300,10)
c_dict = {}
for c in cardinality_thresh:
round_transformed_data = transform_features(data)
round_selected_data = select_features(round_transformed_data, 0.05, c, 0.94)
round_rmse_data = train_and_test(round_selected_data,9)
c_dict[c] = round_rmse_data[1]
card_ser = pd.Series(c_dict)
# Plot
plt.figure(figsize=(20,8))
sns.lineplot(data=card_ser, x=card_ser.index, y=card_ser.values)
plt.title('Finding the best Cardinal Thresh', fontsize=20)
plt.xlabel('Cardinality Thresh', fontsize=15)
plt.ylabel('R2 Score', fontsize=15)
plt.axhline(card_ser.values.max(), color='red', linestyle='--')
sns.despine()
plt.grid()
Cardinality Thresh
can be close to 100# We updated transform_features()
round_transformed_data = transform_features(data)
round_selected_data = select_features(round_transformed_data, 0.05, 213, 0.94)
round_rmse_data = train_and_test(round_selected_data,9)
print(round_rmse_data)
(24880.337336146637, 0.8995203769925438)
folds_dict = {}
folds = np.arange(3,100)
for cv in folds:
round_transformed_data = transform_features(data)
round_selected_data = select_features(round_transformed_data, 0.05, 213, 0.94)
round_rmse_data = train_and_test(round_selected_data, cv)
folds_dict[cv] = round_rmse_data[0]
folds_ser = pd.Series(folds_dict)
# Plot
plt.figure(figsize=(20,8))
sns.histplot(data=folds_ser, x=folds_ser.index, y=folds_ser.values)
plt.xlabel('Folds', fontsize=15)
plt.ylabel('RMSE', fontsize=15)
sns.despine()
plt.grid()
# We updated transform_features()
round_transformed_data = transform_features(data)
round_selected_data = select_features(round_transformed_data, 0.05, 214, 0.94)
round_rmse_data = train_and_test(round_selected_data,9)
print(round_rmse_data)
(24743.20299773431, 0.8969196329782585)
SalePrice
is above than USD 700,000 or even USD 600,000 could be ousted but I kept themIt is yet to see as well that how this data would respond to Ridge, and Lasso, where we will be having quite a few parameters to optmize with.