The data used for this project is drawn from a paper written in 2017 by Dean de Cock. You can read more about why the data was collected here. You can also read about the different columns in the data here.
The paper presents a data set describing the sale of individual residential property in Ames, Iowa from 2006 to 2010. The data set contains 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values. The data set was provided for students as a final project in an undergraduate regression course.
One important value of an estimated regression equation is its ability to predict the effects on Y of a change in one or more values of the independent variables. The value of this is obvious. Careful policy cannot be made without estimates of the effects that may result. It is the desire for particular results that drive the formation of most policy. Regression models can be, and have been, invaluable aids in forming such policies.
The goal for this project is to use regression analysis in various forms of python coding and available library resources (sklearn.model, statsmodels, ...) to:
# import a whole pile of key python libary
# modules to execute various code commands.
import pandas as pd
import numpy as np
import random
import string
import matplotlib.pyplot as plt
import seaborn as sns
from numpy.random import seed, randint
from IPython.display import HTML
from IPython.display import display, Markdown
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
# read data file provided and observe data structure and contents.
df = pd.read_csv('AmesHousing.txt', delimiter='\t', na_values=['NaN'])
print(df.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 82 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order 2930 non-null int64 1 PID 2930 non-null int64 2 MS SubClass 2930 non-null int64 3 MS Zoning 2930 non-null object 4 Lot Frontage 2440 non-null float64 5 Lot Area 2930 non-null int64 6 Street 2930 non-null object 7 Alley 198 non-null object 8 Lot Shape 2930 non-null object 9 Land Contour 2930 non-null object 10 Utilities 2930 non-null object 11 Lot Config 2930 non-null object 12 Land Slope 2930 non-null object 13 Neighborhood 2930 non-null object 14 Condition 1 2930 non-null object 15 Condition 2 2930 non-null object 16 Bldg Type 2930 non-null object 17 House Style 2930 non-null object 18 Overall Qual 2930 non-null int64 19 Overall Cond 2930 non-null int64 20 Year Built 2930 non-null int64 21 Year Remod/Add 2930 non-null int64 22 Roof Style 2930 non-null object 23 Roof Matl 2930 non-null object 24 Exterior 1st 2930 non-null object 25 Exterior 2nd 2930 non-null object 26 Mas Vnr Type 2907 non-null object 27 Mas Vnr Area 2907 non-null float64 28 Exter Qual 2930 non-null object 29 Exter Cond 2930 non-null object 30 Foundation 2930 non-null object 31 Bsmt Qual 2850 non-null object 32 Bsmt Cond 2850 non-null object 33 Bsmt Exposure 2847 non-null object 34 BsmtFin Type 1 2850 non-null object 35 BsmtFin SF 1 2929 non-null float64 36 BsmtFin Type 2 2849 non-null object 37 BsmtFin SF 2 2929 non-null float64 38 Bsmt Unf SF 2929 non-null float64 39 Total Bsmt SF 2929 non-null float64 40 Heating 2930 non-null object 41 Heating QC 2930 non-null object 42 Central Air 2930 non-null object 43 Electrical 2929 non-null object 44 1st Flr SF 2930 non-null int64 45 2nd Flr SF 2930 non-null int64 46 Low Qual Fin SF 2930 non-null int64 47 Gr Liv Area 2930 non-null int64 48 Bsmt Full Bath 2928 non-null float64 49 Bsmt Half Bath 2928 non-null float64 50 Full Bath 2930 non-null int64 51 Half Bath 2930 non-null int64 52 Bedroom AbvGr 2930 non-null int64 53 Kitchen AbvGr 2930 non-null int64 54 Kitchen Qual 2930 non-null object 55 TotRms AbvGrd 2930 non-null int64 56 Functional 2930 non-null object 57 Fireplaces 2930 non-null int64 58 Fireplace Qu 1508 non-null object 59 Garage Type 2773 non-null object 60 Garage Yr Blt 2771 non-null float64 61 Garage Finish 2771 non-null object 62 Garage Cars 2929 non-null float64 63 Garage Area 2929 non-null float64 64 Garage Qual 2771 non-null object 65 Garage Cond 2771 non-null object 66 Paved Drive 2930 non-null object 67 Wood Deck SF 2930 non-null int64 68 Open Porch SF 2930 non-null int64 69 Enclosed Porch 2930 non-null int64 70 3Ssn Porch 2930 non-null int64 71 Screen Porch 2930 non-null int64 72 Pool Area 2930 non-null int64 73 Pool QC 13 non-null object 74 Fence 572 non-null object 75 Misc Feature 106 non-null object 76 Misc Val 2930 non-null int64 77 Mo Sold 2930 non-null int64 78 Yr Sold 2930 non-null int64 79 Sale Type 2930 non-null object 80 Sale Condition 2930 non-null object 81 SalePrice 2930 non-null int64 dtypes: float64(11), int64(28), object(43) memory usage: 1.8+ MB None
There are certain columns which have a high quantity of missing values as shown in the summary of information above. I will remove those at this point of the analysis that have more than 25% missing values.
df = df.drop(['Alley', 'Fireplace Qu', 'Pool QC',\
'Fence', 'Misc Feature'], axis = 1)
# extract numerical data classified as 'flota64 and 'int64'.
numerical_df = df.select_dtypes(include=['float64', 'int64'])
print(numerical_df.info(), '\n')
print(numerical_df.head(3), '\n')
# identify columns with missing values and isolate ones with less than 5%.
df_null_counts = numerical_df.isnull().sum()
print(df_null_counts), ']n'
fill_missing_values = df[df_null_counts[(df_null_counts>0) & (df_null_counts<146)].index]
# confirm successful isolation
print(fill_missing_values.isnull().sum())
# fill missing values with column mean rather than removing them.
fill_missing_values = fill_missing_values.fillna(fill_missing_values.mean())
# confirm no missing values present in any columns.
fill_missing_values.isna().sum().sum()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 39 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order 2930 non-null int64 1 PID 2930 non-null int64 2 MS SubClass 2930 non-null int64 3 Lot Frontage 2440 non-null float64 4 Lot Area 2930 non-null int64 5 Overall Qual 2930 non-null int64 6 Overall Cond 2930 non-null int64 7 Year Built 2930 non-null int64 8 Year Remod/Add 2930 non-null int64 9 Mas Vnr Area 2907 non-null float64 10 BsmtFin SF 1 2929 non-null float64 11 BsmtFin SF 2 2929 non-null float64 12 Bsmt Unf SF 2929 non-null float64 13 Total Bsmt SF 2929 non-null float64 14 1st Flr SF 2930 non-null int64 15 2nd Flr SF 2930 non-null int64 16 Low Qual Fin SF 2930 non-null int64 17 Gr Liv Area 2930 non-null int64 18 Bsmt Full Bath 2928 non-null float64 19 Bsmt Half Bath 2928 non-null float64 20 Full Bath 2930 non-null int64 21 Half Bath 2930 non-null int64 22 Bedroom AbvGr 2930 non-null int64 23 Kitchen AbvGr 2930 non-null int64 24 TotRms AbvGrd 2930 non-null int64 25 Fireplaces 2930 non-null int64 26 Garage Yr Blt 2771 non-null float64 27 Garage Cars 2929 non-null float64 28 Garage Area 2929 non-null float64 29 Wood Deck SF 2930 non-null int64 30 Open Porch SF 2930 non-null int64 31 Enclosed Porch 2930 non-null int64 32 3Ssn Porch 2930 non-null int64 33 Screen Porch 2930 non-null int64 34 Pool Area 2930 non-null int64 35 Misc Val 2930 non-null int64 36 Mo Sold 2930 non-null int64 37 Yr Sold 2930 non-null int64 38 SalePrice 2930 non-null int64 dtypes: float64(11), int64(28) memory usage: 892.9 KB None Order PID MS SubClass Lot Frontage Lot Area Overall Qual \ 0 1 526301100 20 141.0 31770 6 1 2 526350040 20 80.0 11622 5 2 3 526351010 20 81.0 14267 6 Overall Cond Year Built Year Remod/Add Mas Vnr Area ... Wood Deck SF \ 0 5 1960 1960 112.0 ... 210 1 6 1961 1961 0.0 ... 140 2 6 1958 1958 108.0 ... 393 Open Porch SF Enclosed Porch 3Ssn Porch Screen Porch Pool Area \ 0 62 0 0 0 0 1 0 0 0 120 0 2 36 0 0 0 0 Misc Val Mo Sold Yr Sold SalePrice 0 0 5 2010 215000 1 0 6 2010 105000 2 12500 6 2010 172000 [3 rows x 39 columns] Order 0 PID 0 MS SubClass 0 Lot Frontage 490 Lot Area 0 Overall Qual 0 Overall Cond 0 Year Built 0 Year Remod/Add 0 Mas Vnr Area 23 BsmtFin SF 1 1 BsmtFin SF 2 1 Bsmt Unf SF 1 Total Bsmt SF 1 1st Flr SF 0 2nd Flr SF 0 Low Qual Fin SF 0 Gr Liv Area 0 Bsmt Full Bath 2 Bsmt Half Bath 2 Full Bath 0 Half Bath 0 Bedroom AbvGr 0 Kitchen AbvGr 0 TotRms AbvGrd 0 Fireplaces 0 Garage Yr Blt 159 Garage Cars 1 Garage Area 1 Wood Deck SF 0 Open Porch SF 0 Enclosed Porch 0 3Ssn Porch 0 Screen Porch 0 Pool Area 0 Misc Val 0 Mo Sold 0 Yr Sold 0 SalePrice 0 dtype: int64 Mas Vnr Area 23 BsmtFin SF 1 1 BsmtFin SF 2 1 Bsmt Unf SF 1 Total Bsmt SF 1 Bsmt Full Bath 2 Bsmt Half Bath 2 Garage Cars 1 Garage Area 1 dtype: int64
0
Years Until Remodelled
House Square Feet Data
# drop numerical columns with missing values from original file.
df_temp = df.drop(['Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2',\
'Bsmt Unf SF', 'Total Bsmt SF', 'Bsmt Full Bath',\
'Bsmt Half Bath', 'Garage Cars', 'Garage Area',\
], axis = 1)
# restore numerical columns to original file that had
# missing values replaced with column mean.
df2 = pd.concat([df_temp, fill_missing_values], axis=1)
# add new variables based on calculations.
df2['yrs_until_remod'] = df2['Year Remod/Add'] - df2['Year Built']
df2['Total_SF'] = df2['1st Flr SF'] + df2['2nd Flr SF'] + df2['Total Bsmt SF']
df2['1st_plus_2nd'] = df2['1st Flr SF'] + df2['2nd Flr SF']
print(df2.info())
# drop variables that would unlikely contribute to house price prediciton.
df2 = df2.drop(['PID', 'Lot Frontage', 'Year Built',\
'Year Remod/Add', 'Garage Yr Blt', \
'Misc Val', 'Mo Sold', 'Yr Sold'], axis = 1)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 80 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order 2930 non-null int64 1 PID 2930 non-null int64 2 MS SubClass 2930 non-null int64 3 MS Zoning 2930 non-null object 4 Lot Frontage 2440 non-null float64 5 Lot Area 2930 non-null int64 6 Street 2930 non-null object 7 Lot Shape 2930 non-null object 8 Land Contour 2930 non-null object 9 Utilities 2930 non-null object 10 Lot Config 2930 non-null object 11 Land Slope 2930 non-null object 12 Neighborhood 2930 non-null object 13 Condition 1 2930 non-null object 14 Condition 2 2930 non-null object 15 Bldg Type 2930 non-null object 16 House Style 2930 non-null object 17 Overall Qual 2930 non-null int64 18 Overall Cond 2930 non-null int64 19 Year Built 2930 non-null int64 20 Year Remod/Add 2930 non-null int64 21 Roof Style 2930 non-null object 22 Roof Matl 2930 non-null object 23 Exterior 1st 2930 non-null object 24 Exterior 2nd 2930 non-null object 25 Mas Vnr Type 2907 non-null object 26 Exter Qual 2930 non-null object 27 Exter Cond 2930 non-null object 28 Foundation 2930 non-null object 29 Bsmt Qual 2850 non-null object 30 Bsmt Cond 2850 non-null object 31 Bsmt Exposure 2847 non-null object 32 BsmtFin Type 1 2850 non-null object 33 BsmtFin Type 2 2849 non-null object 34 Heating 2930 non-null object 35 Heating QC 2930 non-null object 36 Central Air 2930 non-null object 37 Electrical 2929 non-null object 38 1st Flr SF 2930 non-null int64 39 2nd Flr SF 2930 non-null int64 40 Low Qual Fin SF 2930 non-null int64 41 Gr Liv Area 2930 non-null int64 42 Full Bath 2930 non-null int64 43 Half Bath 2930 non-null int64 44 Bedroom AbvGr 2930 non-null int64 45 Kitchen AbvGr 2930 non-null int64 46 Kitchen Qual 2930 non-null object 47 TotRms AbvGrd 2930 non-null int64 48 Functional 2930 non-null object 49 Fireplaces 2930 non-null int64 50 Garage Type 2773 non-null object 51 Garage Yr Blt 2771 non-null float64 52 Garage Finish 2771 non-null object 53 Garage Qual 2771 non-null object 54 Garage Cond 2771 non-null object 55 Paved Drive 2930 non-null object 56 Wood Deck SF 2930 non-null int64 57 Open Porch SF 2930 non-null int64 58 Enclosed Porch 2930 non-null int64 59 3Ssn Porch 2930 non-null int64 60 Screen Porch 2930 non-null int64 61 Pool Area 2930 non-null int64 62 Misc Val 2930 non-null int64 63 Mo Sold 2930 non-null int64 64 Yr Sold 2930 non-null int64 65 Sale Type 2930 non-null object 66 Sale Condition 2930 non-null object 67 SalePrice 2930 non-null int64 68 Mas Vnr Area 2930 non-null float64 69 BsmtFin SF 1 2930 non-null float64 70 BsmtFin SF 2 2930 non-null float64 71 Bsmt Unf SF 2930 non-null float64 72 Total Bsmt SF 2930 non-null float64 73 Bsmt Full Bath 2930 non-null float64 74 Bsmt Half Bath 2930 non-null float64 75 Garage Cars 2930 non-null float64 76 Garage Area 2930 non-null float64 77 yrs_until_remod 2930 non-null int64 78 Total_SF 2930 non-null float64 79 1st_plus_2nd 2930 non-null int64 dtypes: float64(12), int64(30), object(38) memory usage: 1.8+ MB None
Data leakage can cause you to create overly optimistic if not completely invalid predictive models.
Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and in turn invalidate the estimated performance of the mode being constructed.
How could data leakage pose a problem?
# use seaborn library to create boxplot.
fig, ax = plt.subplots()
fig.set_size_inches(12, 10)
column=['Gr Liv Area', '1st_plus_2nd', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Total_SF']
ax = sns.boxplot(data=df2[column], palette='colorblind')
plt.title('Boxplot of Square Feet by House Location', fontsize=20, pad=20)
plt.xlabel('House Locations', fontsize=20, labelpad = 18)
plt.ylabel('Square Feet', fontsize=20, labelpad = 18)
plt.yticks(fontsize=13)
plt.xticks(fontsize=13)
(array([0, 1, 2, 3, 4, 5]), [Text(0, 0, 'Gr Liv Area'), Text(1, 0, '1st_plus_2nd'), Text(2, 0, 'Total Bsmt SF'), Text(3, 0, '1st Flr SF'), Text(4, 0, '2nd Flr SF'), Text(5, 0, 'Total_SF')])
The graph above confirms the sum of 1st and 2nd floor square footage is equal to 'Gr Liv Area' square feet. So I will remove columns 'Gr Liv Area' and '1st_plus_2nd'.
Rather than analyzing square feet by each floor separately, I feel it would be most appropriate to analyze total square feet since there are houses with no second floor (i.e. 'Ranch-Type' or 'Bungalow') and some with no basement. So, I will eliminate the columns of square feet by floor.
df2 = df2.drop(['Gr Liv Area', '1st_plus_2nd', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF'], axis = 1)
numerical_df2 = df2.select_dtypes(include=['float64', 'int64'])
print(numerical_df2.info())
numerical_df2.isna().sum().sum()
null_series = numerical_df2.isnull().sum()
print(null_series)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order 2930 non-null int64 1 MS SubClass 2930 non-null int64 2 Lot Area 2930 non-null int64 3 Overall Qual 2930 non-null int64 4 Overall Cond 2930 non-null int64 5 Low Qual Fin SF 2930 non-null int64 6 Full Bath 2930 non-null int64 7 Half Bath 2930 non-null int64 8 Bedroom AbvGr 2930 non-null int64 9 Kitchen AbvGr 2930 non-null int64 10 TotRms AbvGrd 2930 non-null int64 11 Fireplaces 2930 non-null int64 12 Wood Deck SF 2930 non-null int64 13 Open Porch SF 2930 non-null int64 14 Enclosed Porch 2930 non-null int64 15 3Ssn Porch 2930 non-null int64 16 Screen Porch 2930 non-null int64 17 Pool Area 2930 non-null int64 18 SalePrice 2930 non-null int64 19 Mas Vnr Area 2930 non-null float64 20 BsmtFin SF 1 2930 non-null float64 21 BsmtFin SF 2 2930 non-null float64 22 Bsmt Unf SF 2930 non-null float64 23 Bsmt Full Bath 2930 non-null float64 24 Bsmt Half Bath 2930 non-null float64 25 Garage Cars 2930 non-null float64 26 Garage Area 2930 non-null float64 27 yrs_until_remod 2930 non-null int64 28 Total_SF 2930 non-null float64 dtypes: float64(9), int64(20) memory usage: 664.0 KB None Order 0 MS SubClass 0 Lot Area 0 Overall Qual 0 Overall Cond 0 Low Qual Fin SF 0 Full Bath 0 Half Bath 0 Bedroom AbvGr 0 Kitchen AbvGr 0 TotRms AbvGrd 0 Fireplaces 0 Wood Deck SF 0 Open Porch SF 0 Enclosed Porch 0 3Ssn Porch 0 Screen Porch 0 Pool Area 0 SalePrice 0 Mas Vnr Area 0 BsmtFin SF 1 0 BsmtFin SF 2 0 Bsmt Unf SF 0 Bsmt Full Bath 0 Bsmt Half Bath 0 Garage Cars 0 Garage Area 0 yrs_until_remod 0 Total_SF 0 dtype: int64
# compute the pairwise correlation coefficients between all of the columns in train_subset
df2_subset = df2[null_series.index]
corr_df = df2_subset.corr(method ='pearson')
Sale_Price_Corr = abs(corr_df['SalePrice'])
sorted_corrs = Sale_Price_Corr.sort_values(ascending=True)
print(sorted_corrs)
BsmtFin SF 2 0.005890 Order 0.031408 3Ssn Porch 0.032225 Bsmt Half Bath 0.035815 Low Qual Fin SF 0.037660 Pool Area 0.068403 MS SubClass 0.085092 Overall Cond 0.101697 Screen Porch 0.112151 Kitchen AbvGr 0.119814 Enclosed Porch 0.128787 Bedroom AbvGr 0.143913 Bsmt Unf SF 0.182805 yrs_until_remod 0.240168 Lot Area 0.266549 Bsmt Full Bath 0.275894 Half Bath 0.285056 Open Porch SF 0.312951 Wood Deck SF 0.327143 BsmtFin SF 1 0.432794 Fireplaces 0.474558 TotRms AbvGrd 0.495474 Mas Vnr Area 0.505784 Full Bath 0.545604 Garage Area 0.640385 Garage Cars 0.647861 Total_SF 0.793054 Overall Qual 0.799262 SalePrice 1.000000 Name: SalePrice, dtype: float64
The candidate features listed above are sorted by how strongly they're correlated with the SalePrice column. I will keep only the features that have a correlation of 0.3 or higher. This cutoff is a bit arbitrary, however I will start with that.
The next thing I will look for is potential collinearity between some of these feature columns. Collinearity is when 2 feature columns are highly correlated and stand the risk of duplicating information. If we have 2 features that convey the same information using 2 different measures or metrics, we don't need to keep both.
Generate a correlation matrix heatmap using Seaborn to visually compare the correlations and look for problematic pairwise feature correlations is an effective way to pick out potential cases of collinearity. Looking for outlier values in the heatmap makes this visual representation easier than just looking at the table of pairwise comparisons.
# plot seaborn heatmap.
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12,10))
strong_corrs = sorted_corrs[sorted_corrs > 0.3]
corrmat = df2_subset[strong_corrs.index]
corr2_df = corrmat.corr(method ='pearson')
ax = sns.heatmap(corr2_df)
The diagonal of white squares reflect the correlations of the variables with themselves (i.e. perfect correlation of 1.0)
Based on the correlation matrix heatmap (i.e. color shade comparisons), we can tell that the following pair of columns are strongly correlated:
If we read the descriptions of these two columns from the data documentation, we can tell that each of the columns reflects very similar information. I'll remove 'Garage Cars'.
Establish the final numerical house features to test predictability of house sale price.
final_corr_cols = strong_corrs.drop(['Garage Cars'])
print(final_corr_cols, '\n')
features = final_corr_cols.drop(['SalePrice']).index
display(Markdown('<h3><span style="color:blue"> Ten Numerical House Features </span></h3>'))
print(features, '\n')
Open Porch SF 0.312951 Wood Deck SF 0.327143 BsmtFin SF 1 0.432794 Fireplaces 0.474558 TotRms AbvGrd 0.495474 Mas Vnr Area 0.505784 Full Bath 0.545604 Garage Area 0.640385 Total_SF 0.793054 Overall Qual 0.799262 SalePrice 1.000000 Name: SalePrice, dtype: float64
Index(['Open Porch SF', 'Wood Deck SF', 'BsmtFin SF 1', 'Fireplaces', 'TotRms AbvGrd', 'Mas Vnr Area', 'Full Bath', 'Garage Area', 'Total_SF', 'Overall Qual'], dtype='object')
Now I'm ready to split the 'cleaned' numerical data into 'Train' and 'Test' groups. I will split the data equally (50%) between the two.
There seems to be arguments both ways to either split the full data set into train and test without shuffling randomly or to include random shuffling. I have chosen to include randomizing.
# split dataframe file in half into 'Train' and 'Test' dataframes.
# include randimize shuffling to prevent confounding of results.
train=df2.sample(frac=0.5,random_state=1) #random state is a seed value
test=df2.drop(train.index)
# confirm equal split of data.
print(len(train), len(test), '\n')
target = 'SalePrice'
clean_test = test[final_corr_cols.index].dropna()
# use sklearn linear model to calculate magnitude of errors (MSE & RMSE).
lr = LinearRegression()
lr.fit(train[features], train['SalePrice'])
train_predictions = lr.predict(train[features])
test_predictions = lr.predict(clean_test[features])
train_mse = mean_squared_error(train_predictions, train[target])
test_mse = mean_squared_error(test_predictions, clean_test[target])
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
print('Train Data Root Mean Square Error =', '{:.0f}'.format(train_rmse), '\n')
print('Test Data Root Mean Square Error =', '{:.0f}'.format(test_rmse))
1465 1465 Train Data Root Mean Square Error = 37231 Test Data Root Mean Square Error = 34335
RMSE
So what do the two Root Mean Square Error values above, one from 'Train' data and the other from 'Test' data tell us? Well, the value for 'Test' data is lower than the value from 'Train' which I suppose is good (i.e. lower error). But, does it tell us anything about how well the 10 features included in the Multiple Regression model fit as a predictor model for housing sale price? As a numerical individual value, I really don't think so.
R-Squared
In Multiple Linear Regression, adjusted R-Squared (corrected for sample size and regression coefficients) is more appropriate than R-Squared as an increasing number of X variables also increases R-Squared. Adjusted R-Squared is always lower than the R-Squared.
Adjusted-R²
Specifically, we scale (1-R²) by a factor that is directly proportional to the number of regression variables. Greater is the number of regression variables in the model, greater is this scaling factor and greater is the downward adjustment to R².
# create column groups from 1 variable to 10 and compare
# calculated R-Squared values for each.
col1 = ['Overall Qual']
col2 = ['Overall Qual', 'Total_SF']
col3 = ['Overall Qual', 'Total_SF', 'Garage Area']
col4 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath']
col5 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
'Mas Vnr Area']
col6 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
'Mas Vnr Area', 'TotRms AbvGrd']
col7 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces']
col8 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1']
col9 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1',\
'Wood Deck SF']
col10 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1',\
'Wood Deck SF', 'Open Porch SF']
list = [col1, col2, col3, col4, col5, col6, col7, col8, col9, col10]
j = 0
for n in list:
X = test[n]
Y = test['SalePrice']
# with statsmodels
X = sm.add_constant(X) # adding a constant
j = j + 1
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
display(Markdown('<h3><span style="color:blue"> {} Features </span></h3>'.format(j)))
print(n, '\n')
rsquared = model.rsquared
print('R-Squared Value =', "{:.2f}%".format(rsquared*100))
['Overall Qual'] R-Squared Value = 64.03%
['Overall Qual', 'Total_SF'] R-Squared Value = 76.90%
['Overall Qual', 'Total_SF', 'Garage Area'] R-Squared Value = 78.40%
['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath'] R-Squared Value = 78.41%
['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area'] R-Squared Value = 79.06%
['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'TotRms AbvGrd'] R-Squared Value = 79.07%
['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces'] R-Squared Value = 79.84%
['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1'] R-Squared Value = 80.76%
['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1', 'Wood Deck SF'] R-Squared Value = 80.93%
['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1', 'Wood Deck SF', 'Open Porch SF'] R-Squared Value = 80.94%
Of the 10 features included in the model, the following three seem to have little additional impact on total explained variability: 'TotRms AbvGrd', 'Wood Deck SF', 'Open Porch SF'.
I will rerun the Multiple Regression Analysis after removing these three variables.
# limiting the independent variables to seven.
col = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1']
X = test[col]
Y = test['SalePrice']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# with statsmodels
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
display(Markdown('<h3><span style="color:blue"> {} Features </span></h3>'.format(7)))
print(col, '\n')
print_model = model.summary()
print(print_model)
Intercept: -84039.20489769935 Coefficients: [2.20689106e+04 2.99955690e+01 5.00006747e+01 7.77247138e+03 3.30347939e+01 1.03369485e+04 1.88517499e+01]
['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1'] OLS Regression Results ============================================================================== Dep. Variable: SalePrice R-squared: 0.807 Model: OLS Adj. R-squared: 0.806 Method: Least Squares F-statistic: 871.3 Date: Wed, 14 Jul 2021 Prob (F-statistic): 0.00 Time: 16:33:11 Log-Likelihood: -17379. No. Observations: 1465 AIC: 3.477e+04 Df Residuals: 1457 BIC: 3.482e+04 Df Model: 7 Covariance Type: nonrobust ================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------- const -8.404e+04 4386.895 -19.157 0.000 -9.26e+04 -7.54e+04 Overall Qual 2.207e+04 925.548 23.844 0.000 2.03e+04 2.39e+04 Total_SF 29.9956 1.850 16.214 0.000 26.367 33.624 Garage Area 50.0007 5.424 9.219 0.000 39.361 60.640 Full Bath 7772.4714 2142.082 3.628 0.000 3570.577 1.2e+04 Mas Vnr Area 33.0348 5.665 5.832 0.000 21.923 44.147 Fireplaces 1.034e+04 1613.944 6.405 0.000 7171.046 1.35e+04 BsmtFin SF 1 18.8517 2.309 8.164 0.000 14.322 23.381 ============================================================================== Omnibus: 544.756 Durbin-Watson: 1.624 Prob(Omnibus): 0.000 Jarque-Bera (JB): 50591.128 Skew: -0.767 Prob(JB): 0.00 Kurtosis: 31.748 Cond. No. 1.37e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.37e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Finding 7 features out of the original 29 numerical columns that yield an adjusted R-Squared value of 80.6% I would say is very good for a start.
I woud like to investigate 'Sale Condition' as a variable and determine what to do with it.
# determine how many unique 'values' there are for 'Sale Conditions'.
print(df2['Sale Condition'].value_counts(dropna=False))
Normal 2413 Partial 245 Abnorml 190 Family 46 Alloca 24 AdjLand 12 Name: Sale Condition, dtype: int64
I see in the output above that there are 6 types of 'Sale Condition' categories. Let's generate a box plot and compare house sale price for each Sale Condition category.
fig, ax = plt.subplots()
fig.set_size_inches(12, 10)
sns.boxplot(y=df2['SalePrice'], x=df2['Sale Condition'],
data=df2,
palette='colorblind')
plt.title('Boxplot of House Sale Price by Sale Condition', fontsize=20, pad=20)
plt.xlabel('Sale Condition', fontsize=18, labelpad = 18)
plt.ylabel('House Sale Price', fontsize=18, labelpad = 18)
plt.yticks(fontsize=13)
plt.xticks(fontsize=13)
(array([0, 1, 2, 3, 4, 5]), [Text(0, 0, 'Normal'), Text(1, 0, 'Partial'), Text(2, 0, 'Family'), Text(3, 0, 'Abnorml'), Text(4, 0, 'Alloca'), Text(5, 0, 'AdjLand')])
The Sale Condition classed as 'Normal' represents just over 82% of all houses in this data file. Including the other five types of 'Sale Condition' categories shown above may 'muddy' the water regarding house sale price predictability. There may be significant differences in their respective means.
Let's see if we confine the data to Sale Condition = 'Normal' only if the overall model predictability increases (i.e R-Squared).
# filter 'Sale Condition' to one category = 'Normal'.
df2_normal = df2[df2['Sale Condition'] == 'Normal']
print(df2_normal['Sale Condition'].value_counts())
# split dataframe file in half into 'Train' and 'Test' dataframes.
# include randimize shuffling to prevent confounding of results.
train=df2_normal.sample(frac=0.5,random_state=1) #random state is a seed value
test=df2_normal.drop(train.index)
print(len(train), len(test), '\n')
col = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1']
X = test[col]
Y = test['SalePrice']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# with statsmodels
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
display(Markdown('<h3><span style="color:blue"> {} Features </span></h3>'.format(7)))
print(col, '\n')
print_model = model.summary()
print(print_model)
Normal 2413 Name: Sale Condition, dtype: int64 1206 1207 Intercept: -80393.62619124033 Coefficients: [18838.59607641 39.75151283 42.1614011 3792.99175005 39.89610188 7595.27313596 23.99813149]
['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1'] OLS Regression Results ============================================================================== Dep. Variable: SalePrice R-squared: 0.848 Model: OLS Adj. R-squared: 0.847 Method: Least Squares F-statistic: 954.4 Date: Wed, 14 Jul 2021 Prob (F-statistic): 0.00 Time: 16:33:11 Log-Likelihood: -14065. No. Observations: 1207 AIC: 2.815e+04 Df Residuals: 1199 BIC: 2.819e+04 Df Model: 7 Covariance Type: nonrobust ================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------- const -8.039e+04 4066.339 -19.771 0.000 -8.84e+04 -7.24e+04 Overall Qual 1.884e+04 852.965 22.086 0.000 1.72e+04 2.05e+04 Total_SF 39.7515 1.866 21.300 0.000 36.090 43.413 Garage Area 42.1614 5.019 8.400 0.000 32.314 52.009 Full Bath 3792.9918 1921.934 1.974 0.049 22.264 7563.719 Mas Vnr Area 39.8961 5.277 7.560 0.000 29.542 50.250 Fireplaces 7595.2731 1429.507 5.313 0.000 4790.660 1.04e+04 BsmtFin SF 1 23.9981 2.196 10.928 0.000 19.690 28.307 ============================================================================== Omnibus: 401.861 Durbin-Watson: 1.678 Prob(Omnibus): 0.000 Jarque-Bera (JB): 3654.498 Skew: 1.272 Prob(JB): 0.00 Kurtosis: 11.136 Cond. No. 1.36e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.36e+04. This might indicate that there are strong multicollinearity or other numerical problems.
We see that by creating a predictor model within Sale Condition = 'Normal, the adjusted R-Squared value increased from 80.6% to 84.7%
Whether someone would want to limit a house sale price predictor model to only 'Normal' sale condition or across all 6 catagories would depend on how critical it would be to maximize adjusted R-Squared value.
Looking at the output table of coefficients above, I see that coefficient 'Full Bath' has a P>|t| value of 0.049.
This indicates a somewhat 'weak' variable in the equation relative to the others. I will take it out of the equation and see if Adjusted R-Squared lowers.
# omit category 'Full Bath' from equation.
col = ['Overall Qual', 'Total_SF', 'Garage Area', \
'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1']
X = test[col] # here we have 10 variables for multiple regression.
Y = test['SalePrice']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
# with statsmodels
X = sm.add_constant(X) # adding a constant
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
display(Markdown('<h3><span style="color:blue"> {} Features </span></h3>'.format(6)))
print(col, '\n')
print_model = model.summary()
print(print_model)
Intercept: -79844.43133164069 Coefficients: [19114.77818564 41.32715704 42.72905189 39.54209009 7479.86412996 23.02168246]
['Overall Qual', 'Total_SF', 'Garage Area', 'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1'] OLS Regression Results ============================================================================== Dep. Variable: SalePrice R-squared: 0.847 Model: OLS Adj. R-squared: 0.847 Method: Least Squares F-statistic: 1110. Date: Wed, 14 Jul 2021 Prob (F-statistic): 0.00 Time: 16:33:11 Log-Likelihood: -14067. No. Observations: 1207 AIC: 2.815e+04 Df Residuals: 1200 BIC: 2.818e+04 Df Model: 6 Covariance Type: nonrobust ================================================================================ coef std err t P>|t| [0.025 0.975] -------------------------------------------------------------------------------- const -7.984e+04 4061.696 -19.658 0.000 -8.78e+04 -7.19e+04 Overall Qual 1.911e+04 842.421 22.690 0.000 1.75e+04 2.08e+04 Total_SF 41.3272 1.689 24.470 0.000 38.014 44.641 Garage Area 42.7291 5.017 8.517 0.000 32.886 52.572 Mas Vnr Area 39.5421 5.281 7.488 0.000 29.182 49.903 Fireplaces 7479.8641 1430.032 5.231 0.000 4674.223 1.03e+04 BsmtFin SF 1 23.0217 2.142 10.747 0.000 18.819 27.225 ============================================================================== Omnibus: 390.909 Durbin-Watson: 1.681 Prob(Omnibus): 0.000 Jarque-Bera (JB): 3420.680 Skew: 1.241 Prob(JB): 0.00 Kurtosis: 10.865 Cond. No. 1.36e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.36e+04. This might indicate that there are strong multicollinearity or other numerical problems.
The adjusted R-Squared value remained the same: 84.7%. This I would say is quite good with only 6 house features used.
Let's see if we can glean some category type columns from the data set and yield reasonable predictions of house sale price.
# extract non numeric columns from original data set.
text_df2 = df2.select_dtypes(include=['object'])
text_df3 = pd.DataFrame()
print(text_df2.info(), '\n')
# remove 'object' based columns with less than 5 unique values.
# will most likely not be significant in predictor model.
for n in text_df2:
print(n, ' Unique Values Count = ', text_df2[n].nunique())
new = text_df2[n].nunique()
if new > 4:
text_df3 = pd.concat([text_df3,text_df2[n]], axis=1)
else:
None
print('\n')
print(text_df3.head(3), '\n')
print(text_df3.info(), '\n')
df_null_counts2 = text_df3.isnull().sum()
print(df_null_counts2, '\n')
fill_missing_values2 = df[df_null_counts2[df_null_counts2>0].index]
print(fill_missing_values2.isnull().sum(), '\n')
# determine qty. of each unique value.
# this will determine what to change the missing values
# to - the mode among the unique values in each category.
for n in fill_missing_values2:
print(fill_missing_values2[n].value_counts(dropna=False))
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2930 entries, 0 to 2929 Data columns (total 38 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MS Zoning 2930 non-null object 1 Street 2930 non-null object 2 Lot Shape 2930 non-null object 3 Land Contour 2930 non-null object 4 Utilities 2930 non-null object 5 Lot Config 2930 non-null object 6 Land Slope 2930 non-null object 7 Neighborhood 2930 non-null object 8 Condition 1 2930 non-null object 9 Condition 2 2930 non-null object 10 Bldg Type 2930 non-null object 11 House Style 2930 non-null object 12 Roof Style 2930 non-null object 13 Roof Matl 2930 non-null object 14 Exterior 1st 2930 non-null object 15 Exterior 2nd 2930 non-null object 16 Mas Vnr Type 2907 non-null object 17 Exter Qual 2930 non-null object 18 Exter Cond 2930 non-null object 19 Foundation 2930 non-null object 20 Bsmt Qual 2850 non-null object 21 Bsmt Cond 2850 non-null object 22 Bsmt Exposure 2847 non-null object 23 BsmtFin Type 1 2850 non-null object 24 BsmtFin Type 2 2849 non-null object 25 Heating 2930 non-null object 26 Heating QC 2930 non-null object 27 Central Air 2930 non-null object 28 Electrical 2929 non-null object 29 Kitchen Qual 2930 non-null object 30 Functional 2930 non-null object 31 Garage Type 2773 non-null object 32 Garage Finish 2771 non-null object 33 Garage Qual 2771 non-null object 34 Garage Cond 2771 non-null object 35 Paved Drive 2930 non-null object 36 Sale Type 2930 non-null object 37 Sale Condition 2930 non-null object dtypes: object(38) memory usage: 870.0+ KB None MS Zoning Unique Values Count = 7 Street Unique Values Count = 2 Lot Shape Unique Values Count = 4 Land Contour Unique Values Count = 4 Utilities Unique Values Count = 3 Lot Config Unique Values Count = 5 Land Slope Unique Values Count = 3 Neighborhood Unique Values Count = 28 Condition 1 Unique Values Count = 9 Condition 2 Unique Values Count = 8 Bldg Type Unique Values Count = 5 House Style Unique Values Count = 8 Roof Style Unique Values Count = 6 Roof Matl Unique Values Count = 8 Exterior 1st Unique Values Count = 16 Exterior 2nd Unique Values Count = 17 Mas Vnr Type Unique Values Count = 5 Exter Qual Unique Values Count = 4 Exter Cond Unique Values Count = 5 Foundation Unique Values Count = 6 Bsmt Qual Unique Values Count = 5 Bsmt Cond Unique Values Count = 5 Bsmt Exposure Unique Values Count = 4 BsmtFin Type 1 Unique Values Count = 6 BsmtFin Type 2 Unique Values Count = 6 Heating Unique Values Count = 6 Heating QC Unique Values Count = 5 Central Air Unique Values Count = 2 Electrical Unique Values Count = 5 Kitchen Qual Unique Values Count = 5 Functional Unique Values Count = 8 Garage Type Unique Values Count = 6 Garage Finish Unique Values Count = 3 Garage Qual Unique Values Count = 5 Garage Cond Unique Values Count = 5 Paved Drive Unique Values Count = 3 Sale Type Unique Values Count = 10 Sale Condition Unique Values Count = 6 MS Zoning Lot Config Neighborhood Condition 1 Condition 2 Bldg Type \ 0 RL Corner NAmes Norm Norm 1Fam 1 RH Inside NAmes Feedr Norm 1Fam 2 RL Corner NAmes Norm Norm 1Fam House Style Roof Style Roof Matl Exterior 1st ... Heating Heating QC \ 0 1Story Hip CompShg BrkFace ... GasA Fa 1 1Story Gable CompShg VinylSd ... GasA TA 2 1Story Hip CompShg Wd Sdng ... GasA TA Electrical Kitchen Qual Functional Garage Type Garage Qual Garage Cond \ 0 SBrkr TA Typ Attchd TA TA 1 SBrkr TA Typ Attchd TA TA 2 SBrkr Gd Typ Attchd TA TA Sale Type Sale Condition 0 WD Normal 1 WD Normal 2 WD Normal [3 rows x 28 columns] <class 'pandas.core.frame.DataFrame'> Index: 2930 entries, 0 to 2929 Data columns (total 28 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MS Zoning 2930 non-null object 1 Lot Config 2930 non-null object 2 Neighborhood 2930 non-null object 3 Condition 1 2930 non-null object 4 Condition 2 2930 non-null object 5 Bldg Type 2930 non-null object 6 House Style 2930 non-null object 7 Roof Style 2930 non-null object 8 Roof Matl 2930 non-null object 9 Exterior 1st 2930 non-null object 10 Exterior 2nd 2930 non-null object 11 Mas Vnr Type 2907 non-null object 12 Exter Cond 2930 non-null object 13 Foundation 2930 non-null object 14 Bsmt Qual 2850 non-null object 15 Bsmt Cond 2850 non-null object 16 BsmtFin Type 1 2850 non-null object 17 BsmtFin Type 2 2849 non-null object 18 Heating 2930 non-null object 19 Heating QC 2930 non-null object 20 Electrical 2929 non-null object 21 Kitchen Qual 2930 non-null object 22 Functional 2930 non-null object 23 Garage Type 2773 non-null object 24 Garage Qual 2771 non-null object 25 Garage Cond 2771 non-null object 26 Sale Type 2930 non-null object 27 Sale Condition 2930 non-null object dtypes: object(28) memory usage: 663.8+ KB None MS Zoning 0 Lot Config 0 Neighborhood 0 Condition 1 0 Condition 2 0 Bldg Type 0 House Style 0 Roof Style 0 Roof Matl 0 Exterior 1st 0 Exterior 2nd 0 Mas Vnr Type 23 Exter Cond 0 Foundation 0 Bsmt Qual 80 Bsmt Cond 80 BsmtFin Type 1 80 BsmtFin Type 2 81 Heating 0 Heating QC 0 Electrical 1 Kitchen Qual 0 Functional 0 Garage Type 157 Garage Qual 159 Garage Cond 159 Sale Type 0 Sale Condition 0 dtype: int64 Mas Vnr Type 23 Bsmt Qual 80 Bsmt Cond 80 BsmtFin Type 1 80 BsmtFin Type 2 81 Electrical 1 Garage Type 157 Garage Qual 159 Garage Cond 159 dtype: int64 None 1752 BrkFace 880 Stone 249 BrkCmn 25 NaN 23 CBlock 1 Name: Mas Vnr Type, dtype: int64 TA 1283 Gd 1219 Ex 258 Fa 88 NaN 80 Po 2 Name: Bsmt Qual, dtype: int64 TA 2616 Gd 122 Fa 104 NaN 80 Po 5 Ex 3 Name: Bsmt Cond, dtype: int64 GLQ 859 Unf 851 ALQ 429 Rec 288 BLQ 269 LwQ 154 NaN 80 Name: BsmtFin Type 1, dtype: int64 Unf 2499 Rec 106 LwQ 89 NaN 81 BLQ 68 ALQ 53 GLQ 34 Name: BsmtFin Type 2, dtype: int64 SBrkr 2682 FuseA 188 FuseF 50 FuseP 8 Mix 1 NaN 1 Name: Electrical, dtype: int64 Attchd 1731 Detchd 782 BuiltIn 186 NaN 157 Basment 36 2Types 23 CarPort 15 Name: Garage Type, dtype: int64 TA 2615 NaN 159 Fa 124 Gd 24 Po 5 Ex 3 Name: Garage Qual, dtype: int64 TA 2665 NaN 159 Fa 74 Gd 15 Po 14 Ex 3 Name: Garage Cond, dtype: int64
TA = Typical/Average
Gd = Good
Mas Vnr Type - This is masonry veneer type and 60% is None. Therefore I will remove this column.
Bsmt Qual - Split the 80 NaN values between TA and Gd.
Bsmt Cond - 89% is TA, so allocate the missng values to TA.
BsmtFin Type 1 - Split the 80 NaN values between GLQ and Unf.
BsmtFin Type 2 - 85% is Unf, so allocate the missng values to Unf.
Electrical - 92% is SBrkr, so allocate the missng values to SBrkr.
Garage Type - Split the 157 NaN values between Attchd (70%) and Detchd (30%).
Garage Qual - 89% is TA, so allocate the missng values to TA.
Garage Cond - 91% is TA, so allocate the missng values to TA.
fill_missing_values2['Bsmt Qual'].replace('nan', np.nan, inplace=True)
mask = fill_missing_values2['Bsmt Qual'].isna()
ind = fill_missing_values2['Bsmt Qual'].loc[mask].sample(frac=0.5).index
fill_missing_values2.loc[ind, 'Bsmt Qual'] = 'TA'
fill_missing_values2['Bsmt Qual'] = fill_missing_values2['Bsmt Qual'].fillna('Gd')
print(fill_missing_values2['Bsmt Qual']. value_counts(dropna=False))
fill_missing_values2['Bsmt Cond'] = fill_missing_values2['Bsmt Cond'].fillna('TA')
print(fill_missing_values2['Bsmt Cond']. value_counts(dropna=False))
fill_missing_values2['BsmtFin Type 1'].replace('nan', np.nan, inplace=True)
mask = fill_missing_values2['BsmtFin Type 1'].isna()
ind = fill_missing_values2['BsmtFin Type 1'].loc[mask].sample(frac=0.5).index
fill_missing_values2.loc[ind, 'BsmtFin Type 1'] = 'GLQ'
fill_missing_values2['BsmtFin Type 1'] = fill_missing_values2['BsmtFin Type 1'].fillna('Unf')
print(fill_missing_values2['BsmtFin Type 1']. value_counts(dropna=False))
fill_missing_values2['BsmtFin Type 2'] = fill_missing_values2['BsmtFin Type 2'].fillna('Unf')
print(fill_missing_values2['BsmtFin Type 2']. value_counts(dropna=False))
fill_missing_values2['Electrical'] = fill_missing_values2['Electrical'].fillna('SBrkr')
print(fill_missing_values2['Electrical']. value_counts(dropna=False))
fill_missing_values2['Garage Type'].replace('nan', np.nan, inplace=True)
mask = fill_missing_values2['Garage Type'].isna()
ind = fill_missing_values2['Garage Type'].loc[mask].sample(frac=0.7).index
fill_missing_values2.loc[ind, 'Garage Type'] = 'Attchd'
fill_missing_values2['Garage Type'] = fill_missing_values2['Garage Type'].fillna('Detchd')
print(fill_missing_values2['Garage Type']. value_counts(dropna=False))
fill_missing_values2['Garage Qual'] = fill_missing_values2['Garage Qual'].fillna('TA')
print(fill_missing_values2['Garage Qual']. value_counts(dropna=False))
fill_missing_values2['Garage Cond'] = fill_missing_values2['Garage Cond'].fillna('TA')
print(fill_missing_values2['Garage Cond']. value_counts(dropna=False))
print(fill_missing_values2.isnull().sum())
TA 1323 Gd 1259 Ex 258 Fa 88 Po 2 Name: Bsmt Qual, dtype: int64 TA 2696 Gd 122 Fa 104 Po 5 Ex 3 Name: Bsmt Cond, dtype: int64 GLQ 899 Unf 891 ALQ 429 Rec 288 BLQ 269 LwQ 154 Name: BsmtFin Type 1, dtype: int64 Unf 2580 Rec 106 LwQ 89 BLQ 68 ALQ 53 GLQ 34 Name: BsmtFin Type 2, dtype: int64 SBrkr 2683 FuseA 188 FuseF 50 FuseP 8 Mix 1 Name: Electrical, dtype: int64 Attchd 1841 Detchd 829 BuiltIn 186 Basment 36 2Types 23 CarPort 15 Name: Garage Type, dtype: int64 TA 2774 Fa 124 Gd 24 Po 5 Ex 3 Name: Garage Qual, dtype: int64 TA 2824 Fa 74 Gd 15 Po 14 Ex 3 Name: Garage Cond, dtype: int64 Mas Vnr Type 23 Bsmt Qual 0 Bsmt Cond 0 BsmtFin Type 1 0 BsmtFin Type 2 0 Electrical 0 Garage Type 0 Garage Qual 0 Garage Cond 0 dtype: int64
# replace columns with missing values with
# modified ones having no missing values.
df_cat_temp = text_df3.drop(['Mas Vnr Type', 'Bsmt Qual', 'Bsmt Cond', \
'BsmtFin Type 1', 'BsmtFin Type 2', 'Electrical', \
'Garage Type', 'Garage Qual', 'Garage Cond',\
], axis = 1)
text_df3 = pd.concat([df_cat_temp, fill_missing_values2], axis=1)
text_df3 = text_df3.drop(['Mas Vnr Type'], axis = 1)
print(text_df3.info())
# confirm no missing values.
print(text_df3.isnull().sum().sum())
<class 'pandas.core.frame.DataFrame'> Index: 2930 entries, 0 to 2929 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MS Zoning 2930 non-null object 1 Lot Config 2930 non-null object 2 Neighborhood 2930 non-null object 3 Condition 1 2930 non-null object 4 Condition 2 2930 non-null object 5 Bldg Type 2930 non-null object 6 House Style 2930 non-null object 7 Roof Style 2930 non-null object 8 Roof Matl 2930 non-null object 9 Exterior 1st 2930 non-null object 10 Exterior 2nd 2930 non-null object 11 Exter Cond 2930 non-null object 12 Foundation 2930 non-null object 13 Heating 2930 non-null object 14 Heating QC 2930 non-null object 15 Kitchen Qual 2930 non-null object 16 Functional 2930 non-null object 17 Sale Type 2930 non-null object 18 Sale Condition 2930 non-null object 19 Bsmt Qual 2930 non-null object 20 Bsmt Cond 2930 non-null object 21 BsmtFin Type 1 2930 non-null object 22 BsmtFin Type 2 2930 non-null object 23 Electrical 2930 non-null object 24 Garage Type 2930 non-null object 25 Garage Qual 2930 non-null object 26 Garage Cond 2930 non-null object dtypes: object(27) memory usage: 640.9+ KB None 0
X = text_df3
Y = df2['SalePrice']
# create 'dummy' columns in order to be able to
# perform linear regression analysis on categorical data.
X = pd.get_dummies(data=X, drop_first=True)
print('There are', len(X.columns), 'dummy columns created.', '\n')
print(len(X))
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .50, random_state = 40)
model = sm.OLS(Y_test, X_test).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
There are 184 dummy columns created. 2930 OLS Regression Results ============================================================================== Dep. Variable: SalePrice R-squared: 0.822 Model: OLS Adj. R-squared: 0.798 Method: Least Squares F-statistic: 34.14 Date: Wed, 14 Jul 2021 Prob (F-statistic): 0.00 Time: 16:33:12 Log-Likelihood: -17284. No. Observations: 1465 AIC: 3.492e+04 Df Residuals: 1290 BIC: 3.584e+04 Df Model: 174 Covariance Type: nonrobust ========================================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------------ MS Zoning_C (all) 7.772e+04 1.83e+04 4.242 0.000 4.18e+04 1.14e+05 MS Zoning_FV 1.146e+05 1.67e+04 6.877 0.000 8.19e+04 1.47e+05 MS Zoning_I (all) 1.914e+04 3.77e+04 0.507 0.612 -5.49e+04 9.32e+04 MS Zoning_RH 1.229e+05 1.78e+04 6.902 0.000 8.79e+04 1.58e+05 MS Zoning_RL 1.205e+05 1.47e+04 8.168 0.000 9.15e+04 1.49e+05 MS Zoning_RM 1.218e+05 1.51e+04 8.085 0.000 9.23e+04 1.51e+05 Lot Config_CulDSac 1.007e+04 4692.480 2.145 0.032 861.303 1.93e+04 Lot Config_FR2 -1.14e+04 6381.047 -1.786 0.074 -2.39e+04 1120.779 Lot Config_FR3 -1.718e+04 1.51e+04 -1.139 0.255 -4.68e+04 1.24e+04 Lot Config_Inside -1556.7795 2606.623 -0.597 0.550 -6670.465 3556.906 Neighborhood_Blueste -7801.7947 1.67e+04 -0.468 0.640 -4.05e+04 2.49e+04 Neighborhood_BrDale -1.873e+04 1.53e+04 -1.223 0.222 -4.88e+04 1.13e+04 Neighborhood_BrkSide -4.506e+04 1.27e+04 -3.541 0.000 -7e+04 -2.01e+04 Neighborhood_ClearCr 149.3453 1.34e+04 0.011 0.991 -2.62e+04 2.65e+04 Neighborhood_CollgCr -2.714e+04 1.01e+04 -2.688 0.007 -4.7e+04 -7329.576 Neighborhood_Crawfor 1.662e+04 1.13e+04 1.471 0.142 -5545.276 3.88e+04 Neighborhood_Edwards -5.512e+04 1.1e+04 -5.012 0.000 -7.67e+04 -3.35e+04 Neighborhood_Gilbert -3.51e+04 1.07e+04 -3.289 0.001 -5.6e+04 -1.42e+04 Neighborhood_Greens 5.276e+04 3.69e+04 1.430 0.153 -1.96e+04 1.25e+05 Neighborhood_GrnHill 9.973e-10 1.96e-10 5.081 0.000 6.12e-10 1.38e-09 Neighborhood_IDOTRR -5.078e+04 1.39e+04 -3.648 0.000 -7.81e+04 -2.35e+04 Neighborhood_Landmrk -2.621e+04 3.78e+04 -0.693 0.489 -1e+05 4.8e+04 Neighborhood_MeadowV -7.18e+04 1.56e+04 -4.601 0.000 -1.02e+05 -4.12e+04 Neighborhood_Mitchel -4.192e+04 1.12e+04 -3.740 0.000 -6.39e+04 -1.99e+04 Neighborhood_NAmes -3.858e+04 1.07e+04 -3.600 0.000 -5.96e+04 -1.76e+04 Neighborhood_NPkVill 1.702e+04 2.81e+04 0.606 0.545 -3.81e+04 7.21e+04 Neighborhood_NWAmes -1.965e+04 1.14e+04 -1.722 0.085 -4.2e+04 2735.529 Neighborhood_NoRidge 5.284e+04 1.16e+04 4.550 0.000 3.01e+04 7.56e+04 Neighborhood_NridgHt 2.508e+04 1.05e+04 2.386 0.017 4460.308 4.57e+04 Neighborhood_OldTown -5.102e+04 1.26e+04 -4.035 0.000 -7.58e+04 -2.62e+04 Neighborhood_SWISU -3.731e+04 1.31e+04 -2.839 0.005 -6.31e+04 -1.15e+04 Neighborhood_Sawyer -3.895e+04 1.13e+04 -3.446 0.001 -6.11e+04 -1.68e+04 Neighborhood_SawyerW -2.856e+04 1.1e+04 -2.595 0.010 -5.02e+04 -6969.205 Neighborhood_Somerst -5970.6198 1.21e+04 -0.494 0.622 -2.97e+04 1.78e+04 Neighborhood_StoneBr 5.144e+04 1.19e+04 4.315 0.000 2.81e+04 7.48e+04 Neighborhood_Timber -4661.3655 1.13e+04 -0.413 0.680 -2.68e+04 1.75e+04 Neighborhood_Veenker 1.575e+04 1.59e+04 0.994 0.321 -1.53e+04 4.68e+04 Condition 1_Feedr 4821.2500 7373.224 0.654 0.513 -9643.574 1.93e+04 Condition 1_Norm 8019.4388 5917.846 1.355 0.176 -3590.219 1.96e+04 Condition 1_PosA 4.331e+04 1.91e+04 2.265 0.024 5796.701 8.08e+04 Condition 1_PosN 1.471e+04 1.07e+04 1.372 0.170 -6329.459 3.57e+04 Condition 1_RRAe -1.227e+04 1.11e+04 -1.109 0.268 -3.4e+04 9427.839 Condition 1_RRAn 6990.9353 9366.367 0.746 0.456 -1.14e+04 2.54e+04 Condition 1_RRNe -1.342e+04 2.14e+04 -0.626 0.531 -5.55e+04 2.86e+04 Condition 1_RRNn -364.2623 2.23e+04 -0.016 0.987 -4.4e+04 4.33e+04 Condition 2_Feedr 1.9e+04 2.53e+04 0.752 0.452 -3.06e+04 6.86e+04 Condition 2_Norm 8378.1302 2.16e+04 0.389 0.698 -3.39e+04 5.07e+04 Condition 2_PosA 3.544e+04 4.52e+04 0.784 0.433 -5.33e+04 1.24e+05 Condition 2_PosN -1.678e+04 3.42e+04 -0.491 0.623 -8.38e+04 5.02e+04 Condition 2_RRAe 9.18e-11 1.48e-10 0.622 0.534 -1.98e-10 3.81e-10 Condition 2_RRAn 1.651e+04 4.2e+04 0.393 0.694 -6.59e+04 9.89e+04 Condition 2_RRNn 3.047e+04 4.19e+04 0.727 0.467 -5.17e+04 1.13e+05 Bldg Type_2fmCon -1.135e+04 6992.811 -1.624 0.105 -2.51e+04 2365.144 Bldg Type_Duplex 1905.3256 6147.617 0.310 0.757 -1.02e+04 1.4e+04 Bldg Type_Twnhs -5.808e+04 7731.124 -7.513 0.000 -7.32e+04 -4.29e+04 Bldg Type_TwnhsE -4.571e+04 5125.030 -8.920 0.000 -5.58e+04 -3.57e+04 House Style_1.5Unf -1.789e+04 1.19e+04 -1.504 0.133 -4.12e+04 5448.042 House Style_1Story -1.327e+04 3737.956 -3.551 0.000 -2.06e+04 -5939.984 House Style_2.5Fin 2097.6550 1.84e+04 0.114 0.909 -3.39e+04 3.81e+04 House Style_2.5Unf 3.62e+04 1.16e+04 3.115 0.002 1.34e+04 5.9e+04 House Style_2Story 4525.0635 3955.381 1.144 0.253 -3234.622 1.23e+04 House Style_SFoyer -1.592e+04 7405.048 -2.150 0.032 -3.05e+04 -1397.046 House Style_SLvl -1.327e+04 6201.639 -2.139 0.033 -2.54e+04 -1100.577 Roof Style_Gable 2.398e+04 2.15e+04 1.114 0.265 -1.82e+04 6.62e+04 Roof Style_Gambrel 1.739e+04 2.38e+04 0.732 0.464 -2.92e+04 6.4e+04 Roof Style_Hip 4.057e+04 2.17e+04 1.871 0.062 -1972.228 8.31e+04 Roof Style_Mansard 1.622e+04 2.61e+04 0.621 0.534 -3.5e+04 6.74e+04 Roof Style_Shed 6741.2864 3.1e+04 0.217 0.828 -5.42e+04 6.76e+04 Roof Matl_CompShg 6.758e+04 1.89e+04 3.570 0.000 3.04e+04 1.05e+05 Roof Matl_Membran 1.007e+05 3.81e+04 2.645 0.008 2.6e+04 1.75e+05 Roof Matl_Metal 7.109e+04 3.69e+04 1.929 0.054 -1223.448 1.43e+05 Roof Matl_Roll 8.284e+04 3.8e+04 2.181 0.029 8312.239 1.57e+05 Roof Matl_Tar&Grv 9.898e+04 1.98e+04 5.002 0.000 6.02e+04 1.38e+05 Roof Matl_WdShake 1.07e+05 2.38e+04 4.493 0.000 6.03e+04 1.54e+05 Roof Matl_WdShngl 1.521e+05 2.32e+04 6.567 0.000 1.07e+05 1.98e+05 Exterior 1st_AsphShn -4.236e+04 5.45e+04 -0.778 0.437 -1.49e+05 6.45e+04 Exterior 1st_BrkComm 5.465e+04 2.73e+04 2.003 0.045 1114.294 1.08e+05 Exterior 1st_BrkFace 5.277e+04 1.69e+04 3.125 0.002 1.96e+04 8.59e+04 Exterior 1st_CBlock 1.036e+05 4e+04 2.589 0.010 2.51e+04 1.82e+05 Exterior 1st_CemntBd 7.774e+04 2.38e+04 3.262 0.001 3.1e+04 1.24e+05 Exterior 1st_HdBoard 2.162e+04 1.62e+04 1.333 0.183 -1.02e+04 5.34e+04 Exterior 1st_ImStucc -998.7501 4.03e+04 -0.025 0.980 -8e+04 7.8e+04 Exterior 1st_MetalSd 798.1393 1.85e+04 0.043 0.966 -3.54e+04 3.7e+04 Exterior 1st_Plywood 3.003e+04 1.59e+04 1.888 0.059 -1180.718 6.12e+04 Exterior 1st_PreCast 7.169e+04 2.06e+04 3.483 0.001 3.13e+04 1.12e+05 Exterior 1st_Stone 8852.2871 4.82e+04 0.184 0.854 -8.58e+04 1.03e+05 Exterior 1st_Stucco 2.899e+04 1.84e+04 1.573 0.116 -7173.441 6.52e+04 Exterior 1st_VinylSd 9010.0302 1.72e+04 0.525 0.600 -2.47e+04 4.27e+04 Exterior 1st_Wd Sdng 1.444e+04 1.56e+04 0.929 0.353 -1.61e+04 4.5e+04 Exterior 1st_WdShing 1.805e+04 1.71e+04 1.054 0.292 -1.55e+04 5.16e+04 Exterior 2nd_AsphShn 2.656e+04 4.2e+04 0.632 0.528 -5.59e+04 1.09e+05 Exterior 2nd_Brk Cmn -2.08e+04 3.07e+04 -0.677 0.499 -8.11e+04 3.95e+04 Exterior 2nd_BrkFace -4.02e+04 1.94e+04 -2.071 0.039 -7.83e+04 -2125.412 Exterior 2nd_CBlock 1.532e+04 4.48e+04 0.342 0.733 -7.27e+04 1.03e+05 Exterior 2nd_CmentBd -3.239e+04 2.46e+04 -1.318 0.188 -8.06e+04 1.58e+04 Exterior 2nd_HdBoard -6978.1506 1.74e+04 -0.402 0.688 -4.11e+04 2.71e+04 Exterior 2nd_ImStucc -3109.0141 2.14e+04 -0.145 0.885 -4.52e+04 3.9e+04 Exterior 2nd_MetalSd 1.956e+04 1.93e+04 1.015 0.310 -1.82e+04 5.73e+04 Exterior 2nd_Other 3.283e+04 3.99e+04 0.823 0.411 -4.55e+04 1.11e+05 Exterior 2nd_Plywood -5383.4469 1.67e+04 -0.323 0.747 -3.81e+04 2.73e+04 Exterior 2nd_PreCast 7.169e+04 2.06e+04 3.483 0.001 3.13e+04 1.12e+05 Exterior 2nd_Stone -3.235e+04 3.12e+04 -1.036 0.300 -9.36e+04 2.89e+04 Exterior 2nd_Stucco -1.605e+04 1.95e+04 -0.823 0.411 -5.43e+04 2.22e+04 Exterior 2nd_VinylSd 1.071e+04 1.79e+04 0.598 0.550 -2.44e+04 4.58e+04 Exterior 2nd_Wd Sdng 4019.5004 1.67e+04 0.241 0.809 -2.87e+04 3.67e+04 Exterior 2nd_Wd Shng -7688.7615 1.74e+04 -0.442 0.659 -4.18e+04 2.64e+04 Exter Cond_Fa -2.337e+04 1.91e+04 -1.224 0.221 -6.08e+04 1.41e+04 Exter Cond_Gd -1.071e+04 1.81e+04 -0.591 0.555 -4.63e+04 2.49e+04 Exter Cond_Po -2035.1372 4.06e+04 -0.050 0.960 -8.17e+04 7.76e+04 Exter Cond_TA -1.104e+04 1.8e+04 -0.614 0.539 -4.63e+04 2.42e+04 Foundation_CBlock 2378.3209 4607.669 0.516 0.606 -6661.025 1.14e+04 Foundation_PConc 6226.2124 4892.425 1.273 0.203 -3371.770 1.58e+04 Foundation_Slab -2.776e+04 9012.495 -3.080 0.002 -4.54e+04 -1.01e+04 Foundation_Stone 2.063e+04 1.41e+04 1.463 0.144 -7037.951 4.83e+04 Foundation_Wood 9958.1167 2.58e+04 0.386 0.699 -4.06e+04 6.05e+04 Heating_GasA 1.309e+05 1.99e+04 6.579 0.000 9.18e+04 1.7e+05 Heating_GasW 1.611e+05 2.22e+04 7.244 0.000 1.17e+05 2.05e+05 Heating_Grav 1.277e+05 2.55e+04 5.007 0.000 7.77e+04 1.78e+05 Heating_OthW 1.226e+05 3.78e+04 3.241 0.001 4.84e+04 1.97e+05 Heating_Wall 1.38e+05 2.75e+04 5.021 0.000 8.41e+04 1.92e+05 Heating QC_Fa -1.527e+04 6553.711 -2.330 0.020 -2.81e+04 -2414.048 Heating QC_Gd -6667.1623 2992.798 -2.228 0.026 -1.25e+04 -795.878 Heating QC_Po 3753.1146 3.87e+04 0.097 0.923 -7.22e+04 7.97e+04 Heating QC_TA -1.347e+04 2981.541 -4.519 0.000 -1.93e+04 -7624.306 Kitchen Qual_Fa -7.084e+04 9187.102 -7.711 0.000 -8.89e+04 -5.28e+04 Kitchen Qual_Gd -4.806e+04 5397.136 -8.904 0.000 -5.86e+04 -3.75e+04 Kitchen Qual_Po 5.66e-11 3.93e-11 1.441 0.150 -2.05e-11 1.34e-10 Kitchen Qual_TA -6.727e+04 5931.792 -11.340 0.000 -7.89e+04 -5.56e+04 Functional_Maj2 -6.411e+04 2.18e+04 -2.934 0.003 -1.07e+05 -2.13e+04 Functional_Min1 -2.567e+04 1.36e+04 -1.885 0.060 -5.24e+04 1040.244 Functional_Min2 -2.354e+04 1.37e+04 -1.722 0.085 -5.04e+04 3285.456 Functional_Mod -1.83e+04 1.52e+04 -1.203 0.229 -4.82e+04 1.16e+04 Functional_Sal -5.615e+04 3.96e+04 -1.416 0.157 -1.34e+05 2.16e+04 Functional_Sev -9.226e+04 3.8e+04 -2.427 0.015 -1.67e+05 -1.77e+04 Functional_Typ -2.368e+04 1.19e+04 -1.982 0.048 -4.71e+04 -239.588 Sale Type_CWD 523.9004 1.62e+04 0.032 0.974 -3.13e+04 3.24e+04 Sale Type_Con -3.638e+04 3.59e+04 -1.015 0.310 -1.07e+05 3.4e+04 Sale Type_ConLD 1.206e+04 1.21e+04 0.998 0.318 -1.16e+04 3.58e+04 Sale Type_ConLI -1030.9229 1.73e+04 -0.060 0.952 -3.49e+04 3.29e+04 Sale Type_ConLw -1.856e+04 1.8e+04 -1.033 0.302 -5.38e+04 1.67e+04 Sale Type_New 750.2250 2.23e+04 0.034 0.973 -4.3e+04 4.45e+04 Sale Type_Oth 8665.0277 1.88e+04 0.462 0.644 -2.82e+04 4.55e+04 Sale Type_VWD 2.215e-11 4.06e-11 0.546 0.585 -5.75e-11 1.02e-10 Sale Type_WD 121.7545 6439.496 0.019 0.985 -1.25e+04 1.28e+04 Sale Condition_AdjLand 6863.9745 1.95e+04 0.351 0.725 -3.15e+04 4.52e+04 Sale Condition_Alloca 2.228e+04 1.19e+04 1.867 0.062 -1130.113 4.57e+04 Sale Condition_Family 8252.8469 8775.724 0.940 0.347 -8963.408 2.55e+04 Sale Condition_Normal 1.338e+04 4419.296 3.028 0.003 4710.496 2.21e+04 Sale Condition_Partial 2.675e+04 2.14e+04 1.248 0.212 -1.53e+04 6.88e+04 Bsmt Qual_Fa -5.658e+04 8527.603 -6.635 0.000 -7.33e+04 -3.99e+04 Bsmt Qual_Gd -4.634e+04 4848.777 -9.557 0.000 -5.59e+04 -3.68e+04 Bsmt Qual_Po -5.846e+04 3.24e+04 -1.802 0.072 -1.22e+05 5193.998 Bsmt Qual_TA -5.738e+04 5874.823 -9.767 0.000 -6.89e+04 -4.59e+04 Bsmt Cond_Fa -1.145e+04 2.61e+04 -0.438 0.661 -6.27e+04 3.98e+04 Bsmt Cond_Gd -7222.2303 2.6e+04 -0.278 0.781 -5.82e+04 4.38e+04 Bsmt Cond_Po -8412.8794 4.4e+04 -0.191 0.848 -9.48e+04 7.79e+04 Bsmt Cond_TA -9877.2139 2.55e+04 -0.387 0.699 -6e+04 4.02e+04 BsmtFin Type 1_BLQ 5070.6075 4208.352 1.205 0.228 -3185.358 1.33e+04 BsmtFin Type 1_GLQ 1.358e+04 3604.645 3.767 0.000 6505.651 2.06e+04 BsmtFin Type 1_LwQ 406.6923 5158.051 0.079 0.937 -9712.396 1.05e+04 BsmtFin Type 1_Rec -1928.0510 4094.445 -0.471 0.638 -9960.553 6104.451 BsmtFin Type 1_Unf -9188.9446 3416.012 -2.690 0.007 -1.59e+04 -2487.396 BsmtFin Type 2_BLQ -5489.7372 9494.822 -0.578 0.563 -2.41e+04 1.31e+04 BsmtFin Type 2_GLQ 9316.2355 1.24e+04 0.752 0.452 -1.5e+04 3.36e+04 BsmtFin Type 2_LwQ -2532.2160 9417.837 -0.269 0.788 -2.1e+04 1.59e+04 BsmtFin Type 2_Rec -1.057e+04 9362.503 -1.129 0.259 -2.89e+04 7795.787 BsmtFin Type 2_Unf -5783.4160 7646.558 -0.756 0.450 -2.08e+04 9217.637 Electrical_FuseF -1.321e+04 8654.824 -1.527 0.127 -3.02e+04 3766.981 Electrical_FuseP 1.735e+04 1.67e+04 1.042 0.298 -1.53e+04 5e+04 Electrical_Mix 1.423e-11 9.4e-12 1.514 0.130 -4.21e-12 3.27e-11 Electrical_SBrkr 7584.6268 4270.795 1.776 0.076 -793.839 1.6e+04 Garage Type_Attchd 3001.5896 1.39e+04 0.216 0.829 -2.42e+04 3.02e+04 Garage Type_Basment -3002.0647 1.72e+04 -0.175 0.861 -3.67e+04 3.07e+04 Garage Type_BuiltIn 1.908e+04 1.46e+04 1.309 0.191 -9519.408 4.77e+04 Garage Type_CarPort -1.969e+04 2.06e+04 -0.955 0.340 -6.01e+04 2.07e+04 Garage Type_Detchd -8135.3689 1.39e+04 -0.585 0.559 -3.54e+04 1.91e+04 Garage Qual_Fa -2.174e+05 4.66e+04 -4.668 0.000 -3.09e+05 -1.26e+05 Garage Qual_Gd -1.873e+05 4.74e+04 -3.949 0.000 -2.8e+05 -9.43e+04 Garage Qual_Po -1.975e+05 5.39e+04 -3.662 0.000 -3.03e+05 -9.17e+04 Garage Qual_TA -2.053e+05 4.62e+04 -4.446 0.000 -2.96e+05 -1.15e+05 Garage Cond_Fa 1.79e+05 2.36e+04 7.598 0.000 1.33e+05 2.25e+05 Garage Cond_Gd 1.682e+05 2.6e+04 6.470 0.000 1.17e+05 2.19e+05 Garage Cond_Po 1.595e+05 2.67e+04 5.978 0.000 1.07e+05 2.12e+05 Garage Cond_TA 1.735e+05 2.31e+04 7.513 0.000 1.28e+05 2.19e+05 ============================================================================== Omnibus: 188.369 Durbin-Watson: 1.923 Prob(Omnibus): 0.000 Jarque-Bera (JB): 711.512 Skew: 0.584 Prob(JB): 3.14e-155 Kurtosis: 6.208 Cond. No. 1.16e+16 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 1.73e-28. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
I would say an initial adjusted R-Squared value of 79.8% for categorical type data is quite good.
However, there were many dummy columns: 184. I looked over the P>|t| values in the output table above and decided to remove dummy columns that had 'weak' values greater than 0.050.
text_df4 = text_df3.drop(['Condition 1', 'Condition 2', 'Lot Config',\
'House Style', 'Roof Style', 'Exterior 2nd',\
'Exter Cond', 'Functional', 'Electrical',\
'Foundation', 'Garage Type', 'Sale Type',\
'Sale Condition', 'BsmtFin Type 2','Bsmt Cond'\
], axis = 1)
print(len(text_df4))
print(text_df4.columns)
Y = df2['SalePrice']
X = pd.get_dummies(data=text_df4, drop_first=True)
print('There are', len(X.columns), 'dummy columns created.', '\n')
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .50, random_state = 40)
regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
model = sm.OLS(Y_test, X_test).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
2930 Index(['MS Zoning', 'Neighborhood', 'Bldg Type', 'Roof Matl', 'Exterior 1st', 'Heating', 'Heating QC', 'Kitchen Qual', 'Bsmt Qual', 'BsmtFin Type 1', 'Garage Qual', 'Garage Cond'], dtype='object') There are 89 dummy columns created. OLS Regression Results ============================================================================== Dep. Variable: SalePrice R-squared: 0.781 Model: OLS Adj. R-squared: 0.768 Method: Least Squares F-statistic: 59.45 Date: Wed, 14 Jul 2021 Prob (F-statistic): 0.00 Time: 16:33:12 Log-Likelihood: -17433. No. Observations: 1465 AIC: 3.503e+04 Df Residuals: 1381 BIC: 3.548e+04 Df Model: 83 Covariance Type: nonrobust ======================================================================================== coef std err t P>|t| [0.025 0.975] ---------------------------------------------------------------------------------------- MS Zoning_C (all) 7.855e+04 1.53e+04 5.147 0.000 4.86e+04 1.08e+05 MS Zoning_FV 1.119e+05 1.38e+04 8.079 0.000 8.47e+04 1.39e+05 MS Zoning_I (all) 1.068e+05 3.53e+04 3.022 0.003 3.75e+04 1.76e+05 MS Zoning_RH 1.025e+05 1.46e+04 6.996 0.000 7.37e+04 1.31e+05 MS Zoning_RL 1.132e+05 1.14e+04 9.905 0.000 9.08e+04 1.36e+05 MS Zoning_RM 1.188e+05 1.17e+04 10.170 0.000 9.59e+04 1.42e+05 Neighborhood_Blueste -3909.0526 1.72e+04 -0.227 0.821 -3.77e+04 2.99e+04 Neighborhood_BrDale -1.985e+04 1.57e+04 -1.267 0.205 -5.06e+04 1.09e+04 Neighborhood_BrkSide -6.404e+04 1.28e+04 -4.999 0.000 -8.92e+04 -3.89e+04 Neighborhood_ClearCr -7503.7012 1.37e+04 -0.548 0.584 -3.44e+04 1.93e+04 Neighborhood_CollgCr -3.027e+04 1.06e+04 -2.848 0.004 -5.11e+04 -9424.382 Neighborhood_Crawfor 1.195e+04 1.17e+04 1.022 0.307 -1.1e+04 3.49e+04 Neighborhood_Edwards -6.523e+04 1.14e+04 -5.717 0.000 -8.76e+04 -4.28e+04 Neighborhood_Gilbert -2.759e+04 1.11e+04 -2.477 0.013 -4.94e+04 -5738.240 Neighborhood_Greens 4.245e+04 3.81e+04 1.113 0.266 -3.23e+04 1.17e+05 Neighborhood_GrnHill 1.543e-09 2.66e-10 5.801 0.000 1.02e-09 2.06e-09 Neighborhood_IDOTRR -7.089e+04 1.41e+04 -5.030 0.000 -9.85e+04 -4.32e+04 Neighborhood_Landmrk -6480.9426 4.01e+04 -0.162 0.872 -8.51e+04 7.21e+04 Neighborhood_MeadowV -6.335e+04 1.62e+04 -3.922 0.000 -9.5e+04 -3.17e+04 Neighborhood_Mitchel -4.799e+04 1.16e+04 -4.137 0.000 -7.08e+04 -2.52e+04 Neighborhood_NAmes -4.758e+04 1.12e+04 -4.259 0.000 -6.95e+04 -2.57e+04 Neighborhood_NPkVill 1939.1548 1.59e+04 0.122 0.903 -2.93e+04 3.32e+04 Neighborhood_NWAmes -2.142e+04 1.18e+04 -1.814 0.070 -4.46e+04 1739.797 Neighborhood_NoRidge 6.321e+04 1.21e+04 5.244 0.000 3.96e+04 8.69e+04 Neighborhood_NridgHt 2.447e+04 1.11e+04 2.204 0.028 2688.219 4.62e+04 Neighborhood_OldTown -6.391e+04 1.29e+04 -4.956 0.000 -8.92e+04 -3.86e+04 Neighborhood_SWISU -5.158e+04 1.34e+04 -3.862 0.000 -7.78e+04 -2.54e+04 Neighborhood_Sawyer -4.915e+04 1.17e+04 -4.216 0.000 -7.2e+04 -2.63e+04 Neighborhood_SawyerW -3.718e+04 1.14e+04 -3.265 0.001 -5.95e+04 -1.48e+04 Neighborhood_Somerst -8254.2430 1.26e+04 -0.656 0.512 -3.29e+04 1.64e+04 Neighborhood_StoneBr 6.476e+04 1.24e+04 5.233 0.000 4.05e+04 8.9e+04 Neighborhood_Timber -5745.4786 1.19e+04 -0.483 0.629 -2.91e+04 1.76e+04 Neighborhood_Veenker 1.301e+04 1.62e+04 0.805 0.421 -1.87e+04 4.47e+04 Bldg Type_2fmCon -5685.7843 6903.046 -0.824 0.410 -1.92e+04 7855.805 Bldg Type_Duplex -5991.4774 5720.495 -1.047 0.295 -1.72e+04 5230.322 Bldg Type_Twnhs -6.448e+04 7915.962 -8.146 0.000 -8e+04 -4.9e+04 Bldg Type_TwnhsE -5.284e+04 5194.145 -10.173 0.000 -6.3e+04 -4.27e+04 Roof Matl_CompShg 7.702e+04 1.39e+04 5.549 0.000 4.98e+04 1.04e+05 Roof Matl_Membran 1.08e+05 3.55e+04 3.043 0.002 3.84e+04 1.78e+05 Roof Matl_Metal 6.3e+04 3.55e+04 1.774 0.076 -6665.233 1.33e+05 Roof Matl_Roll 1.064e+05 3.67e+04 2.897 0.004 3.44e+04 1.79e+05 Roof Matl_Tar&Grv 9.96e+04 1.74e+04 5.733 0.000 6.55e+04 1.34e+05 Roof Matl_WdShake 1.176e+05 1.99e+04 5.900 0.000 7.85e+04 1.57e+05 Roof Matl_WdShngl 1.483e+05 1.91e+04 7.777 0.000 1.11e+05 1.86e+05 Exterior 1st_AsphShn 5992.4559 3.84e+04 0.156 0.876 -6.93e+04 8.13e+04 Exterior 1st_BrkComm 3.728e+04 2.1e+04 1.776 0.076 -3901.442 7.85e+04 Exterior 1st_BrkFace 3.682e+04 1.04e+04 3.554 0.000 1.65e+04 5.71e+04 Exterior 1st_CBlock 8.831e+04 3.57e+04 2.473 0.014 1.82e+04 1.58e+05 Exterior 1st_CemntBd 4.514e+04 1.07e+04 4.237 0.000 2.42e+04 6.6e+04 Exterior 1st_HdBoard 1.536e+04 9140.801 1.680 0.093 -2572.690 3.33e+04 Exterior 1st_ImStucc -2.453e+04 3.83e+04 -0.641 0.521 -9.96e+04 5.05e+04 Exterior 1st_MetalSd 1.898e+04 8822.900 2.151 0.032 1670.512 3.63e+04 Exterior 1st_Plywood 2.563e+04 9770.808 2.623 0.009 6459.394 4.48e+04 Exterior 1st_PreCast 1.308e+05 4.1e+04 3.193 0.001 5.04e+04 2.11e+05 Exterior 1st_Stone -4.04e+04 3.86e+04 -1.048 0.295 -1.16e+05 3.53e+04 Exterior 1st_Stucco 1.331e+04 1.12e+04 1.187 0.235 -8682.606 3.53e+04 Exterior 1st_VinylSd 1.974e+04 8974.599 2.200 0.028 2139.127 3.73e+04 Exterior 1st_Wd Sdng 1.781e+04 8888.009 2.003 0.045 370.057 3.52e+04 Exterior 1st_WdShing 9286.0695 1.12e+04 0.829 0.407 -1.27e+04 3.13e+04 Heating_GasA 1.486e+05 1.48e+04 10.048 0.000 1.2e+05 1.78e+05 Heating_GasW 1.659e+05 1.77e+04 9.397 0.000 1.31e+05 2e+05 Heating_Grav 1.211e+05 2.2e+04 5.499 0.000 7.79e+04 1.64e+05 Heating_OthW 1.809e+05 3.55e+04 5.089 0.000 1.11e+05 2.51e+05 Heating_Wall 1.035e+05 2.18e+04 4.748 0.000 6.07e+04 1.46e+05 Heating QC_Fa -1.965e+04 6503.085 -3.022 0.003 -3.24e+04 -6896.695 Heating QC_Gd -1.052e+04 3070.148 -3.426 0.001 -1.65e+04 -4497.127 Heating QC_Po -4.995e+04 3.75e+04 -1.331 0.183 -1.24e+05 2.36e+04 Heating QC_TA -1.699e+04 3024.539 -5.616 0.000 -2.29e+04 -1.11e+04 Kitchen Qual_Fa -8.664e+04 9345.350 -9.271 0.000 -1.05e+05 -6.83e+04 Kitchen Qual_Gd -5.822e+04 5570.721 -10.451 0.000 -6.91e+04 -4.73e+04 Kitchen Qual_Po 9.502e-12 2.04e-11 0.466 0.641 -3.05e-11 4.95e-11 Kitchen Qual_TA -8.256e+04 6056.496 -13.632 0.000 -9.44e+04 -7.07e+04 Bsmt Qual_Fa -6.956e+04 8480.149 -8.202 0.000 -8.62e+04 -5.29e+04 Bsmt Qual_Gd -5.182e+04 4975.917 -10.414 0.000 -6.16e+04 -4.21e+04 Bsmt Qual_Po -4.957e+04 3.26e+04 -1.520 0.129 -1.14e+05 1.44e+04 Bsmt Qual_TA -6.735e+04 5890.731 -11.433 0.000 -7.89e+04 -5.58e+04 BsmtFin Type 1_BLQ 7391.9778 4240.091 1.743 0.081 -925.738 1.57e+04 BsmtFin Type 1_GLQ 7554.7384 3612.208 2.091 0.037 468.730 1.46e+04 BsmtFin Type 1_LwQ 941.2438 5122.629 0.184 0.854 -9107.732 1.1e+04 BsmtFin Type 1_Rec -99.0146 4103.114 -0.024 0.981 -8148.024 7949.995 BsmtFin Type 1_Unf -9216.8850 3397.949 -2.712 0.007 -1.59e+04 -2551.185 Garage Qual_Fa -2.071e+05 4.4e+04 -4.710 0.000 -2.93e+05 -1.21e+05 Garage Qual_Gd -1.757e+05 4.5e+04 -3.902 0.000 -2.64e+05 -8.74e+04 Garage Qual_Po -1.931e+05 5.16e+04 -3.744 0.000 -2.94e+05 -9.19e+04 Garage Qual_TA -1.95e+05 4.36e+04 -4.471 0.000 -2.81e+05 -1.09e+05 Garage Cond_Fa 1.779e+05 1.72e+04 10.335 0.000 1.44e+05 2.12e+05 Garage Cond_Gd 1.768e+05 2.05e+04 8.618 0.000 1.37e+05 2.17e+05 Garage Cond_Po 1.833e+05 2.14e+04 8.577 0.000 1.41e+05 2.25e+05 Garage Cond_TA 1.82e+05 1.67e+04 10.922 0.000 1.49e+05 2.15e+05 ============================================================================== Omnibus: 184.219 Durbin-Watson: 1.909 Prob(Omnibus): 0.000 Jarque-Bera (JB): 589.136 Skew: 0.622 Prob(JB): 1.18e-128 Kurtosis: 5.847 Cond. No. 1.29e+16 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 5.16e-29. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
Reducing the number of dummy columns from 184 to 89, reduced the adjusted R-squared from 79.9% to 76.9%.
I would consider this not too big of a loss in return for substantially reducing the number of dummy cvolumns.
# filter categorical data to only within Sale Condition = 'Normal'.
text_df3['SalePrice'] = df2['SalePrice']
text_df3_normal = text_df3[text_df3['Sale Condition'] == 'Normal']
print(text_df3_normal['Sale Condition'].value_counts(dropna=False))
text_df4_normal = text_df3_normal.drop(['Condition 1', 'Condition 2', 'Lot Config',\
'House Style', 'Roof Style', 'Exterior 2nd',\
'Exter Cond', 'Functional', 'Electrical',\
'Foundation', 'Garage Type', 'Sale Type',\
'Sale Condition', 'BsmtFin Type 2','Bsmt Cond'\
], axis = 1)
print(len(text_df4_normal))
text_df5_normal = text_df4_normal.drop(['SalePrice'], axis=1)
print(text_df5_normal.columns)
Y = text_df4_normal['SalePrice']
X2 = pd.get_dummies(data=text_df5_normal, drop_first=True)
print('There are', len(X2.columns), 'dummy columns created.', '\n')
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X2_train, X2_test, Y_train, Y_test = train_test_split(X2, Y, test_size = .50, random_state = 40)
regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X2_train, Y_train)
predicted = regr.predict(X2_test)
model = sm.OLS(Y_test, X2_test).fit()
predictions = model.predict(X2)
print_model = model.summary()
print(print_model)
Normal 2413 Name: Sale Condition, dtype: int64 2413 Index(['MS Zoning', 'Neighborhood', 'Bldg Type', 'Roof Matl', 'Exterior 1st', 'Heating', 'Heating QC', 'Kitchen Qual', 'Bsmt Qual', 'BsmtFin Type 1', 'Garage Qual', 'Garage Cond'], dtype='object') There are 87 dummy columns created. OLS Regression Results ============================================================================== Dep. Variable: SalePrice R-squared: 0.750 Model: OLS Adj. R-squared: 0.733 Method: Least Squares F-statistic: 43.48 Date: Wed, 14 Jul 2021 Prob (F-statistic): 4.94e-285 Time: 16:33:12 Log-Likelihood: -14351. No. Observations: 1207 AIC: 2.886e+04 Df Residuals: 1128 BIC: 2.926e+04 Df Model: 78 Covariance Type: nonrobust ======================================================================================== coef std err t P>|t| [0.025 0.975] ---------------------------------------------------------------------------------------- MS Zoning_C (all) 1.085e+05 2.5e+04 4.345 0.000 5.95e+04 1.58e+05 MS Zoning_FV 1.145e+05 2.03e+04 5.648 0.000 7.47e+04 1.54e+05 MS Zoning_I (all) 1.126e+05 3.91e+04 2.883 0.004 3.6e+04 1.89e+05 MS Zoning_RH 1.268e+05 1.91e+04 6.634 0.000 8.93e+04 1.64e+05 MS Zoning_RL 1.242e+05 1.65e+04 7.517 0.000 9.18e+04 1.57e+05 MS Zoning_RM 1.213e+05 1.72e+04 7.050 0.000 8.76e+04 1.55e+05 Neighborhood_Blueste -2.227e+04 2.15e+04 -1.035 0.301 -6.45e+04 2e+04 Neighborhood_BrDale -3.424e+04 1.81e+04 -1.896 0.058 -6.97e+04 1184.675 Neighborhood_BrkSide -7.483e+04 1.51e+04 -4.949 0.000 -1.04e+05 -4.52e+04 Neighborhood_ClearCr -1.411e+04 1.51e+04 -0.936 0.350 -4.37e+04 1.55e+04 Neighborhood_CollgCr -4.333e+04 1.28e+04 -3.393 0.001 -6.84e+04 -1.83e+04 Neighborhood_Crawfor -2.362e+04 1.43e+04 -1.654 0.098 -5.16e+04 4396.245 Neighborhood_Edwards -7.059e+04 1.34e+04 -5.264 0.000 -9.69e+04 -4.43e+04 Neighborhood_Gilbert -4.151e+04 1.34e+04 -3.100 0.002 -6.78e+04 -1.52e+04 Neighborhood_Greens -348.2828 2.26e+04 -0.015 0.988 -4.47e+04 4.4e+04 Neighborhood_GrnHill 7.842e+04 2.91e+04 2.698 0.007 2.14e+04 1.35e+05 Neighborhood_IDOTRR -7.771e+04 1.71e+04 -4.532 0.000 -1.11e+05 -4.41e+04 Neighborhood_Landmrk 1.171e-10 6.5e-11 1.800 0.072 -1.05e-11 2.45e-10 Neighborhood_MeadowV -7.688e+04 1.81e+04 -4.257 0.000 -1.12e+05 -4.14e+04 Neighborhood_Mitchel -5.87e+04 1.39e+04 -4.236 0.000 -8.59e+04 -3.15e+04 Neighborhood_NAmes -6.244e+04 1.33e+04 -4.708 0.000 -8.85e+04 -3.64e+04 Neighborhood_NPkVill -2.678e+04 1.67e+04 -1.602 0.109 -5.96e+04 6012.781 Neighborhood_NWAmes -3.258e+04 1.4e+04 -2.330 0.020 -6e+04 -5147.418 Neighborhood_NoRidge 5.408e+04 1.4e+04 3.863 0.000 2.66e+04 8.15e+04 Neighborhood_NridgHt 1.426e+04 1.31e+04 1.086 0.278 -1.15e+04 4e+04 Neighborhood_OldTown -6.951e+04 1.54e+04 -4.517 0.000 -9.97e+04 -3.93e+04 Neighborhood_SWISU -7.07e+04 1.58e+04 -4.485 0.000 -1.02e+05 -3.98e+04 Neighborhood_Sawyer -7.12e+04 1.38e+04 -5.155 0.000 -9.83e+04 -4.41e+04 Neighborhood_SawyerW -5.362e+04 1.33e+04 -4.041 0.000 -7.96e+04 -2.76e+04 Neighborhood_Somerst -5235.2425 1.66e+04 -0.315 0.753 -3.79e+04 2.74e+04 Neighborhood_StoneBr 2.043e+04 1.42e+04 1.444 0.149 -7339.072 4.82e+04 Neighborhood_Timber -1.065e+04 1.46e+04 -0.730 0.465 -3.93e+04 1.8e+04 Neighborhood_Veenker -9760.4948 1.96e+04 -0.498 0.619 -4.82e+04 2.87e+04 Bldg Type_2fmCon -1.274e+04 7542.857 -1.689 0.091 -2.75e+04 2056.776 Bldg Type_Duplex -772.1272 6664.154 -0.116 0.908 -1.38e+04 1.23e+04 Bldg Type_Twnhs -5.756e+04 8138.598 -7.073 0.000 -7.35e+04 -4.16e+04 Bldg Type_TwnhsE -5.539e+04 6042.165 -9.166 0.000 -6.72e+04 -4.35e+04 Roof Matl_Membran -2.742e-10 6.99e-11 -3.922 0.000 -4.11e-10 -1.37e-10 Roof Matl_Metal -1.8e-10 9.73e-11 -1.850 0.065 -3.71e-10 1.09e-11 Roof Matl_Roll 1.971e+04 4.02e+04 0.491 0.624 -5.91e+04 9.85e+04 Roof Matl_Tar&Grv 3.238e+04 1.28e+04 2.532 0.011 7289.735 5.75e+04 Roof Matl_WdShake 8511.2381 1.69e+04 0.504 0.614 -2.46e+04 4.16e+04 Roof Matl_WdShngl 2.785e+04 2.64e+04 1.056 0.291 -2.39e+04 7.96e+04 Exterior 1st_AsphShn -2.173e+04 2.89e+04 -0.753 0.452 -7.84e+04 3.49e+04 Exterior 1st_BrkComm 2.964e+04 3.06e+04 0.969 0.333 -3.04e+04 8.97e+04 Exterior 1st_BrkFace 4.041e+04 1.27e+04 3.184 0.001 1.55e+04 6.53e+04 Exterior 1st_CBlock 8.276e+04 3.79e+04 2.181 0.029 8314.362 1.57e+05 Exterior 1st_CemntBd 3.042e+04 1.4e+04 2.177 0.030 3002.746 5.78e+04 Exterior 1st_HdBoard 7463.6995 1.19e+04 0.626 0.532 -1.59e+04 3.09e+04 Exterior 1st_ImStucc -3.786e+04 3.88e+04 -0.976 0.329 -1.14e+05 3.82e+04 Exterior 1st_MetalSd 8397.7574 1.16e+04 0.724 0.469 -1.44e+04 3.12e+04 Exterior 1st_Plywood 2.042e+04 1.26e+04 1.625 0.104 -4230.240 4.51e+04 Exterior 1st_PreCast 1.853e-10 5.2e-11 3.566 0.000 8.33e-11 2.87e-10 Exterior 1st_Stucco -5450.2247 1.41e+04 -0.386 0.700 -3.32e+04 2.23e+04 Exterior 1st_VinylSd 8815.1323 1.18e+04 0.748 0.455 -1.43e+04 3.19e+04 Exterior 1st_Wd Sdng 1.152e+04 1.15e+04 0.998 0.319 -1.11e+04 3.42e+04 Exterior 1st_WdShing -134.6544 1.39e+04 -0.010 0.992 -2.75e+04 2.72e+04 Heating_GasA 2.088e+05 2.8e+04 7.448 0.000 1.54e+05 2.64e+05 Heating_GasW 2.495e+05 2.94e+04 8.473 0.000 1.92e+05 3.07e+05 Heating_Grav 1.856e+05 3.38e+04 5.494 0.000 1.19e+05 2.52e+05 Heating_OthW 5.717e-11 3.84e-11 1.490 0.136 -1.81e-11 1.32e-10 Heating_Wall 1.469e+05 3.52e+04 4.173 0.000 7.78e+04 2.16e+05 Heating QC_Fa -2.376e+04 7329.521 -3.241 0.001 -3.81e+04 -9376.961 Heating QC_Gd -1.095e+04 3441.011 -3.182 0.002 -1.77e+04 -4197.537 Heating QC_Po -3.991e+04 3.76e+04 -1.061 0.289 -1.14e+05 3.39e+04 Heating QC_TA -1.283e+04 3227.587 -3.975 0.000 -1.92e+04 -6497.862 Kitchen Qual_Fa -1.034e+05 1.03e+04 -10.036 0.000 -1.24e+05 -8.32e+04 Kitchen Qual_Gd -5.445e+04 6320.963 -8.614 0.000 -6.69e+04 -4.2e+04 Kitchen Qual_Po -6.085e-12 1.09e-11 -0.560 0.575 -2.74e-11 1.52e-11 Kitchen Qual_TA -7.774e+04 6634.536 -11.718 0.000 -9.08e+04 -6.47e+04 Bsmt Qual_Fa -7.916e+04 9619.650 -8.229 0.000 -9.8e+04 -6.03e+04 Bsmt Qual_Gd -4.715e+04 5818.418 -8.104 0.000 -5.86e+04 -3.57e+04 Bsmt Qual_Po -3.609e+04 6.43e+04 -0.561 0.575 -1.62e+05 9.01e+04 Bsmt Qual_TA -6.296e+04 6669.427 -9.440 0.000 -7.6e+04 -4.99e+04 BsmtFin Type 1_BLQ -3889.3236 4536.134 -0.857 0.391 -1.28e+04 5010.886 BsmtFin Type 1_GLQ 3728.2450 3918.461 0.951 0.342 -3960.047 1.14e+04 BsmtFin Type 1_LwQ 520.0943 5792.363 0.090 0.928 -1.08e+04 1.19e+04 BsmtFin Type 1_Rec 1508.9451 4611.080 0.327 0.744 -7538.313 1.06e+04 BsmtFin Type 1_Unf -9197.2286 3779.375 -2.434 0.015 -1.66e+04 -1781.833 Garage Qual_Fa -8248.8121 2.09e+04 -0.395 0.693 -4.92e+04 3.27e+04 Garage Qual_Gd 3.415e+04 2.25e+04 1.517 0.129 -1e+04 7.83e+04 Garage Qual_Po -3.7e+04 4.45e+04 -0.832 0.405 -1.24e+05 5.02e+04 Garage Qual_TA 4000.0555 2.03e+04 0.197 0.844 -3.59e+04 4.39e+04 Garage Cond_Fa -6282.1362 2.12e+04 -0.296 0.767 -4.8e+04 3.54e+04 Garage Cond_Gd -1.704e+04 2.37e+04 -0.718 0.473 -6.36e+04 2.95e+04 Garage Cond_Po 9512.0120 2.38e+04 0.400 0.689 -3.71e+04 5.62e+04 Garage Cond_TA 6716.6715 2.04e+04 0.329 0.742 -3.33e+04 4.67e+04 ============================================================================== Omnibus: 276.095 Durbin-Watson: 2.001 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1418.389 Skew: 0.963 Prob(JB): 1.00e-308 Kurtosis: 7.949 Cond. No. 1.37e+16 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 3.2e-29. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
It's interesting that there was a slight reduction in Adjusted R-Squared value (76.9% to 73.2%) for categorical data whereas for numeical data there was an increase.
I don't have an explanation for that.
df2_numerical = df2[features]
df_combined = pd.concat([df2_numerical, X], axis=1)
print('There are', len(df_combined.columns), 'numerical and dummy columns created.', '\n')
X = df_combined
Y = df2['SalePrice']
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .50, random_state = 40)
regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
model = sm.OLS(Y_test, X_test).fit()
predictions = model.predict(X)
R2 = model.rsquared
print_model = model.summary()
print(R2)
print(print_model)
There are 99 numerical and dummy columns created. 0.9091012292453284 OLS Regression Results ============================================================================== Dep. Variable: SalePrice R-squared: 0.909 Model: OLS Adj. R-squared: 0.903 Method: Least Squares F-statistic: 147.4 Date: Wed, 14 Jul 2021 Prob (F-statistic): 0.00 Time: 16:33:12 Log-Likelihood: -16790. No. Observations: 1465 AIC: 3.377e+04 Df Residuals: 1371 BIC: 3.427e+04 Df Model: 93 Covariance Type: nonrobust ======================================================================================== coef std err t P>|t| [0.025 0.975] ---------------------------------------------------------------------------------------- Open Porch SF 20.3018 10.768 1.885 0.060 -0.822 41.425 Wood Deck SF 16.0119 5.834 2.745 0.006 4.568 27.456 BsmtFin SF 1 18.5239 2.372 7.808 0.000 13.870 23.178 Fireplaces 4041.6344 1265.329 3.194 0.001 1559.443 6523.826 TotRms AbvGrd 2622.5292 661.540 3.964 0.000 1324.788 3920.270 Mas Vnr Area 24.6705 4.902 5.033 0.000 15.054 34.287 Full Bath 5464.7134 1903.616 2.871 0.004 1730.397 9199.029 Garage Area 33.7477 4.075 8.281 0.000 25.753 41.742 Total_SF 28.7634 1.826 15.754 0.000 25.182 32.345 Overall Qual 1.03e+04 897.242 11.474 0.000 8535.001 1.21e+04 MS Zoning_C (all) 2.54e+04 1e+04 2.536 0.011 5752.664 4.51e+04 MS Zoning_FV 5.107e+04 9283.669 5.501 0.000 3.29e+04 6.93e+04 MS Zoning_I (all) 1.55e+04 2.3e+04 0.673 0.501 -2.97e+04 6.07e+04 MS Zoning_RH 3.491e+04 9722.546 3.591 0.000 1.58e+04 5.4e+04 MS Zoning_RL 3.836e+04 7758.322 4.945 0.000 2.31e+04 5.36e+04 MS Zoning_RM 3.61e+04 7987.933 4.519 0.000 2.04e+04 5.18e+04 Neighborhood_Blueste 1.264e+04 1.12e+04 1.126 0.261 -9390.840 3.47e+04 Neighborhood_BrDale 1347.1087 1.04e+04 0.130 0.897 -1.9e+04 2.17e+04 Neighborhood_BrkSide -8867.8292 8450.784 -1.049 0.294 -2.54e+04 7710.038 Neighborhood_ClearCr 7647.9629 8955.433 0.854 0.393 -9919.872 2.52e+04 Neighborhood_CollgCr -5542.2506 6986.466 -0.793 0.428 -1.92e+04 8163.071 Neighborhood_Crawfor 2.284e+04 7644.191 2.988 0.003 7845.442 3.78e+04 Neighborhood_Edwards -1.35e+04 7572.481 -1.783 0.075 -2.84e+04 1353.648 Neighborhood_Gilbert -590.9699 7282.476 -0.081 0.935 -1.49e+04 1.37e+04 Neighborhood_Greens 3.157e+04 2.48e+04 1.273 0.203 -1.71e+04 8.02e+04 Neighborhood_GrnHill -3.803e-10 9.24e-11 -4.114 0.000 -5.62e-10 -1.99e-10 Neighborhood_IDOTRR -1.515e+04 9261.233 -1.636 0.102 -3.33e+04 3017.459 Neighborhood_Landmrk 5061.9801 2.6e+04 0.195 0.846 -4.59e+04 5.6e+04 Neighborhood_MeadowV -1.113e+04 1.06e+04 -1.047 0.295 -3.2e+04 9724.297 Neighborhood_Mitchel -7522.1580 7655.764 -0.983 0.326 -2.25e+04 7496.123 Neighborhood_NAmes -1.199e+04 7397.484 -1.621 0.105 -2.65e+04 2523.067 Neighborhood_NPkVill 7502.2339 1.04e+04 0.724 0.469 -1.28e+04 2.78e+04 Neighborhood_NWAmes -2.11e+04 7725.641 -2.731 0.006 -3.63e+04 -5942.982 Neighborhood_NoRidge 2.307e+04 8058.976 2.863 0.004 7264.624 3.89e+04 Neighborhood_NridgHt 8619.0458 7339.920 1.174 0.240 -5779.645 2.3e+04 Neighborhood_OldTown -1.822e+04 8472.203 -2.151 0.032 -3.48e+04 -1600.793 Neighborhood_SWISU -1.071e+04 8769.045 -1.222 0.222 -2.79e+04 6489.568 Neighborhood_Sawyer -6494.5450 7709.365 -0.842 0.400 -2.16e+04 8628.885 Neighborhood_SawyerW -4023.6034 7468.926 -0.539 0.590 -1.87e+04 1.06e+04 Neighborhood_Somerst -7900.8433 8241.671 -0.959 0.338 -2.41e+04 8266.808 Neighborhood_StoneBr 4.274e+04 8087.805 5.285 0.000 2.69e+04 5.86e+04 Neighborhood_Timber -4207.0559 7778.087 -0.541 0.589 -1.95e+04 1.11e+04 Neighborhood_Veenker -1.888e+04 1.06e+04 -1.774 0.076 -3.98e+04 1999.795 Bldg Type_2fmCon -1.466e+04 4555.086 -3.219 0.001 -2.36e+04 -5727.760 Bldg Type_Duplex -2.536e+04 4036.460 -6.283 0.000 -3.33e+04 -1.74e+04 Bldg Type_Twnhs -2.879e+04 5253.901 -5.480 0.000 -3.91e+04 -1.85e+04 Bldg Type_TwnhsE -2.433e+04 3537.217 -6.879 0.000 -3.13e+04 -1.74e+04 Roof Matl_CompShg 1.929e+04 9241.513 2.087 0.037 1156.717 3.74e+04 Roof Matl_Membran 6.771e+04 2.31e+04 2.928 0.003 2.24e+04 1.13e+05 Roof Matl_Metal 2.369e+04 2.32e+04 1.021 0.308 -2.18e+04 6.92e+04 Roof Matl_Roll 1.599e+04 2.41e+04 0.663 0.507 -3.13e+04 6.33e+04 Roof Matl_Tar&Grv 3.698e+04 1.14e+04 3.236 0.001 1.46e+04 5.94e+04 Roof Matl_WdShake 3.233e+04 1.31e+04 2.468 0.014 6632.421 5.8e+04 Roof Matl_WdShngl 6.258e+04 1.28e+04 4.872 0.000 3.74e+04 8.78e+04 Exterior 1st_AsphShn 2.096e+04 2.49e+04 0.841 0.400 -2.79e+04 6.99e+04 Exterior 1st_BrkComm -4506.4695 1.37e+04 -0.330 0.742 -3.13e+04 2.23e+04 Exterior 1st_BrkFace 1.311e+04 6778.901 1.934 0.053 -184.361 2.64e+04 Exterior 1st_CBlock 5.722e+04 2.32e+04 2.461 0.014 1.16e+04 1.03e+05 Exterior 1st_CemntBd 1.413e+04 6970.717 2.027 0.043 456.150 2.78e+04 Exterior 1st_HdBoard -5052.6089 5957.403 -0.848 0.397 -1.67e+04 6634.003 Exterior 1st_ImStucc -2.622e+04 2.5e+04 -1.049 0.295 -7.53e+04 2.28e+04 Exterior 1st_MetalSd 673.7221 5746.047 0.117 0.907 -1.06e+04 1.19e+04 Exterior 1st_Plywood -2347.2648 6374.201 -0.368 0.713 -1.49e+04 1.02e+04 Exterior 1st_PreCast 7.743e+04 2.67e+04 2.903 0.004 2.51e+04 1.3e+05 Exterior 1st_Stone -1.233e+04 2.5e+04 -0.492 0.623 -6.15e+04 3.68e+04 Exterior 1st_Stucco 2544.5360 7296.113 0.349 0.727 -1.18e+04 1.69e+04 Exterior 1st_VinylSd -321.3025 5852.141 -0.055 0.956 -1.18e+04 1.12e+04 Exterior 1st_Wd Sdng -5321.5633 5809.087 -0.916 0.360 -1.67e+04 6074.098 Exterior 1st_WdShing -5007.7650 7276.786 -0.688 0.491 -1.93e+04 9267.076 Heating_GasA 5.297e+04 9924.084 5.337 0.000 3.35e+04 7.24e+04 Heating_GasW 4.977e+04 1.19e+04 4.184 0.000 2.64e+04 7.31e+04 Heating_Grav 4.248e+04 1.45e+04 2.935 0.003 1.41e+04 7.09e+04 Heating_OthW 4.252e+04 2.34e+04 1.814 0.070 -3467.524 8.85e+04 Heating_Wall 7.082e+04 1.44e+04 4.923 0.000 4.26e+04 9.9e+04 Heating QC_Fa -1.076e+04 4226.869 -2.546 0.011 -1.91e+04 -2471.661 Heating QC_Gd -2756.7916 2004.725 -1.375 0.169 -6689.451 1175.868 Heating QC_Po -1.514e+04 2.43e+04 -0.623 0.533 -6.28e+04 3.25e+04 Heating QC_TA -8192.6848 1976.887 -4.144 0.000 -1.21e+04 -4314.634 Kitchen Qual_Fa -3.7e+04 6277.411 -5.895 0.000 -4.93e+04 -2.47e+04 Kitchen Qual_Gd -2.991e+04 3709.966 -8.063 0.000 -3.72e+04 -2.26e+04 Kitchen Qual_Po 3.443e-11 2.01e-11 1.714 0.087 -4.99e-12 7.38e-11 Kitchen Qual_TA -3.733e+04 4116.143 -9.069 0.000 -4.54e+04 -2.93e+04 Bsmt Qual_Fa -3.054e+04 5595.860 -5.457 0.000 -4.15e+04 -1.96e+04 Bsmt Qual_Gd -2.643e+04 3299.828 -8.009 0.000 -3.29e+04 -2e+04 Bsmt Qual_Po -2420.1087 2.11e+04 -0.114 0.909 -4.39e+04 3.91e+04 Bsmt Qual_TA -3.06e+04 3938.270 -7.769 0.000 -3.83e+04 -2.29e+04 BsmtFin Type 1_BLQ -115.4930 2765.819 -0.042 0.967 -5541.189 5310.203 BsmtFin Type 1_GLQ 4015.4595 2371.161 1.693 0.091 -636.037 8666.956 BsmtFin Type 1_LwQ -5308.4878 3369.415 -1.575 0.115 -1.19e+04 1301.280 BsmtFin Type 1_Rec -5191.8383 2702.452 -1.921 0.055 -1.05e+04 109.550 BsmtFin Type 1_Unf -174.3089 2596.436 -0.067 0.946 -5267.727 4919.109 Garage Qual_Fa -1.264e+05 2.89e+04 -4.372 0.000 -1.83e+05 -6.97e+04 Garage Qual_Gd -1.166e+05 2.96e+04 -3.944 0.000 -1.75e+05 -5.86e+04 Garage Qual_Po -1.305e+05 3.37e+04 -3.868 0.000 -1.97e+05 -6.43e+04 Garage Qual_TA -1.233e+05 2.87e+04 -4.301 0.000 -1.8e+05 -6.7e+04 Garage Cond_Fa 6.408e+04 1.17e+04 5.490 0.000 4.12e+04 8.7e+04 Garage Cond_Gd 6.739e+04 1.37e+04 4.907 0.000 4.04e+04 9.43e+04 Garage Cond_Po 5.565e+04 1.43e+04 3.879 0.000 2.75e+04 8.38e+04 Garage Cond_TA 7.144e+04 1.13e+04 6.298 0.000 4.92e+04 9.37e+04 ============================================================================== Omnibus: 155.772 Durbin-Watson: 1.955 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1248.394 Skew: 0.034 Prob(JB): 8.22e-272 Kurtosis: 7.522 Cond. No. 2.83e+19 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 1.34e-29. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
Wow! A nice increase in Adjusted R-Squared value to 90.3% for the combined data types!
df2_numerical2 = df2_normal[features]
df_combined = pd.concat([df2_numerical2, X2], axis=1)
print('There are', len(df_combined.columns), 'numerical and dummy columns created.', '\n')
X = df_combined
Y = text_df4_normal['SalePrice']
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .50, random_state = 40)
regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
model = sm.OLS(Y_test, X_test).fit()
predictions = model.predict(X)
R2 = model.rsquared
print_model = model.summary()
print(R2)
print(print_model)
There are 97 numerical and dummy columns created. 0.9167741975924575 OLS Regression Results ============================================================================== Dep. Variable: SalePrice R-squared: 0.917 Model: OLS Adj. R-squared: 0.910 Method: Least Squares F-statistic: 139.9 Date: Wed, 14 Jul 2021 Prob (F-statistic): 0.00 Time: 16:33:12 Log-Likelihood: -13688. No. Observations: 1207 AIC: 2.755e+04 Df Residuals: 1118 BIC: 2.801e+04 Df Model: 88 Covariance Type: nonrobust ======================================================================================== coef std err t P>|t| [0.025 0.975] ---------------------------------------------------------------------------------------- Open Porch SF 36.8597 10.786 3.417 0.001 15.697 58.023 Wood Deck SF 28.6674 5.418 5.292 0.000 18.038 39.297 BsmtFin SF 1 18.0298 2.559 7.045 0.000 13.008 23.052 Fireplaces 5563.2346 1249.703 4.452 0.000 3111.207 8015.262 TotRms AbvGrd 2424.6744 650.033 3.730 0.000 1149.252 3700.097 Mas Vnr Area 27.2985 4.787 5.703 0.000 17.906 36.691 Full Bath 1759.9739 1837.162 0.958 0.338 -1844.700 5364.648 Garage Area 28.2365 4.173 6.767 0.000 20.049 36.424 Total_SF 31.6868 1.759 18.010 0.000 28.235 35.139 Overall Qual 9434.9449 873.553 10.801 0.000 7720.957 1.11e+04 MS Zoning_C (all) 2720.8656 1.48e+04 0.183 0.855 -2.64e+04 3.19e+04 MS Zoning_FV 2.848e+04 1.21e+04 2.359 0.018 4795.595 5.22e+04 MS Zoning_I (all) -6105.6821 2.29e+04 -0.266 0.790 -5.11e+04 3.89e+04 MS Zoning_RH 2.357e+04 1.15e+04 2.043 0.041 937.888 4.62e+04 MS Zoning_RL 2.319e+04 1e+04 2.317 0.021 3549.366 4.28e+04 MS Zoning_RM 1.595e+04 1.04e+04 1.528 0.127 -4531.386 3.64e+04 Neighborhood_Blueste 2422.5880 1.26e+04 0.192 0.848 -2.23e+04 2.72e+04 Neighborhood_BrDale 429.4518 1.07e+04 0.040 0.968 -2.06e+04 2.14e+04 Neighborhood_BrkSide -1.215e+04 8943.024 -1.359 0.174 -2.97e+04 5394.958 Neighborhood_ClearCr -913.7958 8853.141 -0.103 0.918 -1.83e+04 1.65e+04 Neighborhood_CollgCr -1.014e+04 7524.299 -1.348 0.178 -2.49e+04 4619.264 Neighborhood_Crawfor 3596.7432 8382.840 0.429 0.668 -1.29e+04 2e+04 Neighborhood_Edwards -1.452e+04 7963.161 -1.824 0.068 -3.01e+04 1100.896 Neighborhood_Gilbert -5754.1292 7844.948 -0.733 0.463 -2.11e+04 9638.351 Neighborhood_Greens 6110.1197 1.34e+04 0.458 0.647 -2.01e+04 3.23e+04 Neighborhood_GrnHill 1.166e+05 1.7e+04 6.878 0.000 8.33e+04 1.5e+05 Neighborhood_IDOTRR -1.252e+04 1.01e+04 -1.235 0.217 -3.24e+04 7362.550 Neighborhood_Landmrk 2.547e-11 4.63e-11 0.550 0.583 -6.54e-11 1.16e-10 Neighborhood_MeadowV -3283.5589 1.07e+04 -0.306 0.760 -2.44e+04 1.78e+04 Neighborhood_Mitchel -1.22e+04 8185.906 -1.491 0.136 -2.83e+04 3857.468 Neighborhood_NAmes -1.945e+04 7856.447 -2.475 0.013 -3.49e+04 -4032.254 Neighborhood_NPkVill -2301.9005 9749.512 -0.236 0.813 -2.14e+04 1.68e+04 Neighborhood_NWAmes -2.11e+04 8180.734 -2.579 0.010 -3.71e+04 -5044.493 Neighborhood_NoRidge 2.318e+04 8307.113 2.790 0.005 6878.596 3.95e+04 Neighborhood_NridgHt 7312.4096 7703.803 0.949 0.343 -7803.131 2.24e+04 Neighborhood_OldTown -2.097e+04 9079.112 -2.310 0.021 -3.88e+04 -3157.249 Neighborhood_SWISU -1.806e+04 9279.565 -1.947 0.052 -3.63e+04 143.048 Neighborhood_Sawyer -1.488e+04 8198.978 -1.814 0.070 -3.1e+04 1210.927 Neighborhood_SawyerW -1.476e+04 7801.044 -1.892 0.059 -3.01e+04 546.082 Neighborhood_Somerst 375.9867 9718.144 0.039 0.969 -1.87e+04 1.94e+04 Neighborhood_StoneBr 1.576e+04 8283.393 1.902 0.057 -493.936 3.2e+04 Neighborhood_Timber 486.6076 8529.129 0.057 0.955 -1.62e+04 1.72e+04 Neighborhood_Veenker 6.5250 1.15e+04 0.001 1.000 -2.25e+04 2.25e+04 Bldg Type_2fmCon -1.602e+04 4474.601 -3.580 0.000 -2.48e+04 -7241.156 Bldg Type_Duplex -2.321e+04 4239.439 -5.476 0.000 -3.15e+04 -1.49e+04 Bldg Type_Twnhs -2.732e+04 4824.966 -5.663 0.000 -3.68e+04 -1.79e+04 Bldg Type_TwnhsE -2.268e+04 3678.040 -6.168 0.000 -2.99e+04 -1.55e+04 Roof Matl_Membran -1.226e-10 3.62e-11 -3.386 0.001 -1.94e-10 -5.16e-11 Roof Matl_Metal -4.376e-11 4.26e-11 -1.027 0.305 -1.27e-10 3.99e-11 Roof Matl_Roll -3916.0750 2.34e+04 -0.167 0.867 -4.99e+04 4.21e+04 Roof Matl_Tar&Grv 1.658e+04 7465.319 2.221 0.027 1931.572 3.12e+04 Roof Matl_WdShake 1175.0583 9867.517 0.119 0.905 -1.82e+04 2.05e+04 Roof Matl_WdShngl 2.916e+04 1.56e+04 1.874 0.061 -1366.337 5.97e+04 Exterior 1st_AsphShn 1.445e+04 1.69e+04 0.858 0.391 -1.86e+04 4.75e+04 Exterior 1st_BrkComm 2.537e+04 1.78e+04 1.424 0.155 -9584.071 6.03e+04 Exterior 1st_BrkFace 2.448e+04 7441.494 3.289 0.001 9877.437 3.91e+04 Exterior 1st_CBlock 3.892e+04 2.21e+04 1.761 0.079 -4448.732 8.23e+04 Exterior 1st_CemntBd 8104.1408 8154.578 0.994 0.321 -7895.860 2.41e+04 Exterior 1st_HdBoard -2097.8774 6972.085 -0.301 0.764 -1.58e+04 1.16e+04 Exterior 1st_ImStucc -2.359e+04 2.28e+04 -1.036 0.300 -6.83e+04 2.11e+04 Exterior 1st_MetalSd 4204.2140 6753.348 0.623 0.534 -9046.451 1.75e+04 Exterior 1st_Plywood -1870.4736 7352.423 -0.254 0.799 -1.63e+04 1.26e+04 Exterior 1st_PreCast -1.096e-11 2.76e-11 -0.396 0.692 -6.52e-11 4.33e-11 Exterior 1st_Stucco -1562.3802 8274.193 -0.189 0.850 -1.78e+04 1.47e+04 Exterior 1st_VinylSd 2888.3940 6883.072 0.420 0.675 -1.06e+04 1.64e+04 Exterior 1st_Wd Sdng -1252.1381 6752.181 -0.185 0.853 -1.45e+04 1.2e+04 Exterior 1st_WdShing 1839.9761 8156.956 0.226 0.822 -1.42e+04 1.78e+04 Heating_GasA 2.805e+04 1.7e+04 1.648 0.100 -5346.703 6.14e+04 Heating_GasW 2.23e+04 1.8e+04 1.237 0.217 -1.31e+04 5.77e+04 Heating_Grav 2.609e+04 2.01e+04 1.299 0.194 -1.33e+04 6.55e+04 Heating_OthW 3.487e-11 1.38e-11 2.532 0.011 7.85e-12 6.19e-11 Heating_Wall 5.03e+04 2.09e+04 2.408 0.016 9308.230 9.13e+04 Heating QC_Fa -3639.2942 4281.086 -0.850 0.395 -1.2e+04 4760.573 Heating QC_Gd -1740.8398 2018.133 -0.863 0.389 -5700.594 2218.915 Heating QC_Po -1.35e+04 2.18e+04 -0.618 0.537 -5.64e+04 2.94e+04 Heating QC_TA -3367.1913 1890.807 -1.781 0.075 -7077.121 342.738 Kitchen Qual_Fa -3.782e+04 6246.000 -6.055 0.000 -5.01e+04 -2.56e+04 Kitchen Qual_Gd -3.038e+04 3726.586 -8.152 0.000 -3.77e+04 -2.31e+04 Kitchen Qual_Po 2.105e-11 1.7e-11 1.241 0.215 -1.22e-11 5.43e-11 Kitchen Qual_TA -3.77e+04 4001.877 -9.421 0.000 -4.56e+04 -2.99e+04 Bsmt Qual_Fa -2.399e+04 5737.936 -4.182 0.000 -3.53e+04 -1.27e+04 Bsmt Qual_Gd -1.792e+04 3490.543 -5.133 0.000 -2.48e+04 -1.11e+04 Bsmt Qual_Po 1.201e+04 3.74e+04 0.321 0.748 -6.13e+04 8.54e+04 Bsmt Qual_TA -2.274e+04 4049.648 -5.616 0.000 -3.07e+04 -1.48e+04 BsmtFin Type 1_BLQ -3406.7497 2665.320 -1.278 0.201 -8636.343 1822.844 BsmtFin Type 1_GLQ 944.7496 2300.554 0.411 0.681 -3569.140 5458.639 BsmtFin Type 1_LwQ -5071.0262 3433.707 -1.477 0.140 -1.18e+04 1666.209 BsmtFin Type 1_Rec -3492.4030 2728.033 -1.280 0.201 -8845.044 1860.238 BsmtFin Type 1_Unf -2532.4512 2643.337 -0.958 0.338 -7718.911 2654.009 Garage Qual_Fa 557.7551 1.22e+04 0.046 0.964 -2.35e+04 2.46e+04 Garage Qual_Gd 8025.3972 1.32e+04 0.609 0.542 -1.78e+04 3.39e+04 Garage Qual_Po -5521.7926 2.59e+04 -0.214 0.831 -5.62e+04 4.52e+04 Garage Qual_TA 2112.1843 1.19e+04 0.177 0.859 -2.12e+04 2.55e+04 Garage Cond_Fa -17.8318 1.24e+04 -0.001 0.999 -2.44e+04 2.44e+04 Garage Cond_Gd -767.7169 1.38e+04 -0.055 0.956 -2.79e+04 2.64e+04 Garage Cond_Po 976.9037 1.39e+04 0.070 0.944 -2.63e+04 2.83e+04 Garage Cond_TA 4982.1890 1.19e+04 0.418 0.676 -1.84e+04 2.84e+04 ============================================================================== Omnibus: 273.816 Durbin-Watson: 2.075 Prob(Omnibus): 0.000 Jarque-Bera (JB): 2136.856 Skew: 0.825 Prob(JB): 0.00 Kurtosis: 9.306 Cond. No. 1.28e+16 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The smallest eigenvalue is 5.4e-23. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
So, we've reached a maximum Adjusted R-Squared value of 91.0% It looks to be a good sign that for all combined data types regarding the chosen independent variables, there wasn't much difference in total explained variablity whether predicting across all Sale Condition types or just within 'Normal' type.
That's enough analysis for now. Don't want to over anaylze and create pseudo results.
Data leakage can cause you to create overly optimistic if not completely invalid predictive models.
Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and in turn invalidate the estimated performance of the mode being constructed.
How could data leakage pose a problem?
Was there any data leakage in this project analysis? I can't say for sure as I tried to prevent it.
pd.set_option('display.max_colwidth', 90)
tab = pd.read_csv('R Squared.csv')
display(tab)
pd.reset_option('display.max_colwidth')
Num | Data Description | Features | Dummy Columns | R-Squared | Adjusted R-Squared | |
---|---|---|---|---|---|---|
0 | 1 | Numerical Data | 7 | 0 | 80.70% | 80.60% |
1 | 2 | Numerical Data within Sale Condition 'Normal' | 6 | 0 | 84.70% | 84.70% |
2 | 3 | Categorical Data with 28 Features/184 Dummy Columns | 28 | 184 | 82.20% | 79.80% |
3 | 4 | Categorical Data with 12 Features/89 Dummy Columns | 12 | 89 | 78.10% | 76.80% |
4 | 5 | Categorical Data within Sale Condition 'Normal' | 12 | 87 | 74.90% | 73.20% |
5 | 6 | Numerical + Categorical Data Combined (1 + 3 above) | 19 | 89 | 90.90% | 90.30% |
6 | 7 | Numerical a+Categorical Data Combined (2 + 4 above) | 19 | 89 | 91.70% | 91.00% |
It took a lot of time and work to screen through and clean the data numerical and non-numerical columns and to come up with various predictor model results as shown in the summary table above.
There were some judgement calls to make along the way in terms of what to drop and what to include.
With a final Adjusted R_Squared value of 91% for combined numerical and non-numerical data, I would say that is very good!