Visual analytics and diagnostics of model fit for linear regression¶

Dr. Tirthajyoti Sarkar¶

Fremont, CA 94536, June 2019¶

In this notebook, we show the basic visual analytics and diagnostic tests that should be run after fitting a linear regression model, for checking the goodness of fit by verifying the fundamental assumptions of linear regression - linearity, independence, constant variance, and normality. In particular, following analytics are shown,

Pairwise scatterplot of the data matrix
Correlation matrix and heatmap
Creating a new dataset of predicting features and their statistical significance (based on p-values)
Residuals vs. predicting variables plots
Fitted vs. residuals plot
Histogram of the normalized residuals
Q-Q plot of the normalized residuals
Shapiro-Wilk normality test on the residuals
Cook's distance plot of the residuals
Variance inflation factor (VIF) of the predicting features

Requirements¶

Python 3.6+
Numpy
Pandas
Seaborn
Statsmodels
Scipy

The dataset¶

We analyze the Concrete Compressive Strength Data Set from UCI ML repository in this notebook.

*Abstract*: Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. Can we predict the strength from other measurement values?

Data Set Information:¶

Number of instances: 1030
Number of Attributes: 9
Attribute breakdown 8 quantitative input variables, and 1 quantitative output variable
Missing Attribute Values: None

Import the libraries and read the data in¶

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm

In [2]:

# The dataset path may be different in your situation. Please use the correct path
df = pd.read_excel("Datasets/Concrete_Data.xls")

In [3]:

df.head(10)

Out[3]:

	Cement (component 1)(kg in a m^3 mixture)	Blast Furnace Slag (component 2)(kg in a m^3 mixture)	Water (component 4)(kg in a m^3 mixture)	Superplasticizer (component 5)(kg in a m^3 mixture)	Coarse Aggregate (component 6)(kg in a m^3 mixture)	Fine Aggregate (component 7)(kg in a m^3 mixture)	Age (day)	Concrete compressive strength(MPa, megapascals)
0	540.0	0.0	162.0	2.5	1040.0	676.0	28	79.986111
1	540.0	0.0	162.0	2.5	1055.0	676.0	28	61.887366
2	332.5	142.5	228.0	0.0	932.0	594.0	270	40.269535
3	332.5	142.5	228.0	0.0	932.0	594.0	365	41.052780
4	198.6	132.4	192.0	0.0	978.4	825.5	360	44.296075
5	266.0	114.0	228.0	0.0	932.0	670.0	90	47.029847
6	380.0	95.0	228.0	0.0	932.0	594.0	365	43.698299
7	380.0	95.0	228.0	0.0	932.0	594.0	28	36.447770
8	266.0	114.0	228.0	0.0	932.0	670.0	28	45.854291
9	475.0	0.0	228.0	0.0	932.0	594.0	28	39.289790

In [4]:

df.describe().T

Out[4]:

	count	mean	std	min	25%	50%	75%	max
Cement (component 1)(kg in a m^3 mixture)	1030.0	281.165631	104.507142	102.000000	192.375000	272.900000	350.000000	540.000000
Blast Furnace Slag (component 2)(kg in a m^3 mixture)	1030.0	73.895485	86.279104	0.000000	0.000000	22.000000	142.950000	359.400000
Fly Ash (component 3)(kg in a m^3 mixture)	1030.0	54.187136	63.996469	0.000000	0.000000	0.000000	118.270000	200.100000
Water (component 4)(kg in a m^3 mixture)	1030.0	181.566359	21.355567	121.750000	164.900000	185.000000	192.000000	247.000000
Superplasticizer (component 5)(kg in a m^3 mixture)	1030.0	6.203112	5.973492	0.000000	0.000000	6.350000	10.160000	32.200000
Coarse Aggregate (component 6)(kg in a m^3 mixture)	1030.0	972.918592	77.753818	801.000000	932.000000	968.000000	1029.400000	1145.000000
Fine Aggregate (component 7)(kg in a m^3 mixture)	1030.0	773.578883	80.175427	594.000000	730.950000	779.510000	824.000000	992.600000
Age (day)	1030.0	45.662136	63.169912	1.000000	7.000000	28.000000	56.000000	365.000000
Concrete compressive strength(MPa, megapascals)	1030.0	35.817836	16.705679	2.331808	23.707115	34.442774	46.136287	82.599225

Taking a peek at the relationship between the predicting variables and the response¶

In [5]:

for c in df.columns[:-1]:
    plt.figure(figsize=(8,5))
    plt.title("{} vs. \nConcrete Compressive Strength".format(c),fontsize=16)
    plt.scatter(x=df[c],y=df['Concrete compressive strength(MPa, megapascals) '],color='blue',edgecolor='k')
    plt.grid(True)
    plt.xlabel(c,fontsize=14)
    plt.ylabel('Concrete compressive strength\n(MPa, megapascals)',fontsize=14)
    plt.show()

Creating a copy with suitable column names for processing with `statsmodels.OLS()`¶

In [6]:

df1 = df.copy()

In [7]:

df1.columns=['Component'+str(i) for i in range(1,8)]+['Age']+['y']

In [8]:

df1.head()

Out[8]:

	Component1	Component2	Component4	Component5	Component6	Component7	Age	y
0	540.0	0.0	162.0	2.5	1040.0	676.0	28	79.986111
1	540.0	0.0	162.0	2.5	1055.0	676.0	28	61.887366
2	332.5	142.5	228.0	0.0	932.0	594.0	270	40.269535
3	332.5	142.5	228.0	0.0	932.0	594.0	365	41.052780
4	198.6	132.4	192.0	0.0	978.4	825.5	360	44.296075

Pairwise scatter plots¶

In [9]:

from seaborn import pairplot

In [10]:

pairplot(df1)

Out[10]:

<seaborn.axisgrid.PairGrid at 0x1e4cfeaf400>

Correlation matrix and heatmap to visually check for multicollinearity ¶

In [11]:

corr = df1[:-1].corr()

In [12]:

corr

Out[12]:

	Component1	Component2	Component3	Component4	Component5	Component6	Component7	Age	y
Component1	1.000000	-0.275153	-0.397439	-0.081408	0.092856	-0.109725	-0.222755	0.081899	0.497813
Component2	-0.275153	1.000000	-0.323720	0.107065	0.043261	-0.283862	-0.281564	-0.044166	0.134895
Component3	-0.397439	-0.323720	1.000000	-0.257488	0.377248	-0.009475	0.079138	-0.154284	-0.105688
Component4	-0.081408	0.107065	-0.257488	1.000000	-0.658118	-0.181344	-0.450683	0.277965	-0.289554
Component5	0.092856	0.043261	0.377248	-0.658118	1.000000	-0.266031	0.222580	-0.192630	0.366218
Component6	-0.109725	-0.283862	-0.009475	-0.181344	-0.266031	1.000000	-0.178881	-0.003398	-0.165365
Component7	-0.222755	-0.281564	0.079138	-0.450683	0.222580	-0.178881	1.000000	-0.156143	-0.167284
Age	0.081899	-0.044166	-0.154284	0.277965	-0.192630	-0.003398	-0.156143	1.000000	0.328841
y	0.497813	0.134895	-0.105688	-0.289554	0.366218	-0.165365	-0.167284	0.328841	1.000000

In [13]:

from statsmodels.graphics.correlation import plot_corr

In [14]:

fig=plot_corr(corr,xnames=corr.columns)

Creating a formula string for using in the `statsmodels.OLS()`¶

In [15]:

formula_str = df1.columns[-1]+' ~ '+'+'.join(df1.columns[:-1])

In [16]:

formula_str

Out[16]:

'y ~ Component1+Component2+Component3+Component4+Component5+Component6+Component7+Age'

Construct and fit the model. Print summary of the fitted model¶

In [17]:

model=sm.ols(formula=formula_str, data=df1)

In [18]:

fitted = model.fit()

In [19]:

print(fitted.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.615
Model:                            OLS   Adj. R-squared:                  0.612
Method:                 Least Squares   F-statistic:                     204.3
Date:                Mon, 03 Jun 2019   Prob (F-statistic):          6.76e-206
Time:                        22:53:37   Log-Likelihood:                -3869.0
No. Observations:                1030   AIC:                             7756.
Df Residuals:                    1021   BIC:                             7800.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -23.1638     26.588     -0.871      0.384     -75.338      29.010
Component1     0.1198      0.008     14.110      0.000       0.103       0.136
Component2     0.1038      0.010     10.245      0.000       0.084       0.124
Component3     0.0879      0.013      6.988      0.000       0.063       0.113
Component4    -0.1503      0.040     -3.741      0.000      -0.229      -0.071
Component5     0.2907      0.093      3.110      0.002       0.107       0.474
Component6     0.0180      0.009      1.919      0.055      -0.000       0.036
Component7     0.0202      0.011      1.883      0.060      -0.001       0.041
Age            0.1142      0.005     21.046      0.000       0.104       0.125
==============================================================================
Omnibus:                        5.379   Durbin-Watson:                   1.281
Prob(Omnibus):                  0.068   Jarque-Bera (JB):                5.305
Skew:                          -0.174   Prob(JB):                       0.0705
Kurtosis:                       3.045   Cond. No.                     1.06e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.06e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

A new Result dataframe: p-values and statistical significance of the features¶

In [20]:

df_result=pd.DataFrame()

In [21]:

df_result['pvalues']=fitted.pvalues[1:]

In [22]:

df_result['Features']=df.columns[:-1]

In [23]:

df_result.set_index('Features',inplace=True)

In [24]:

def yes_no(b):
    if b:
        return 'Yes'
    else:
        return 'No'

In [25]:

df_result['Statistically significant?']= df_result['pvalues'].apply(yes_no)

In [26]:

df_result

Out[26]:

	pvalues	Statistically significant?
Features
Cement (component 1)(kg in a m^3 mixture)	1.962842e-41	Yes
Blast Furnace Slag (component 2)(kg in a m^3 mixture)	1.633083e-23	Yes
Fly Ash (component 3)(kg in a m^3 mixture)	5.029952e-12	Yes
Water (component 4)(kg in a m^3 mixture)	1.937337e-04	Yes
Superplasticizer (component 5)(kg in a m^3 mixture)	1.920901e-03	Yes
Coarse Aggregate (component 6)(kg in a m^3 mixture)	5.522656e-02	Yes
Fine Aggregate (component 7)(kg in a m^3 mixture)	5.996805e-02	Yes
Age (day)	5.841045e-82	Yes

All the predicting variables are statisticall significant with the threshold of p-value <0.01 ¶

Residuals vs. predicting variables plots¶

In [30]:

for c in df.columns[:-1]:
    plt.figure(figsize=(8,5))
    plt.title("{} vs. \nModel residuals".format(c),fontsize=16)
    plt.scatter(x=df[c],y=fitted.resid,color='blue',edgecolor='k')
    plt.grid(True)
    xmin=min(df[c])
    xmax = max(df[c])
    plt.hlines(y=0,xmin=xmin*0.9,xmax=xmax*1.1,color='red',linestyle='--',lw=3)
    plt.xlabel(c,fontsize=14)
    plt.ylabel('Residuals',fontsize=14)
    plt.show()

Residual plots show some bit of clustering but overall the assumptions linearity and independence seem to hold because the distribution seem random around the 0 axis.¶

Fitted vs. residuals¶

In [32]:

plt.figure(figsize=(8,5))
p=plt.scatter(x=fitted.fittedvalues,y=fitted.resid,edgecolor='k')
xmin=min(fitted.fittedvalues)
xmax = max(fitted.fittedvalues)
plt.hlines(y=0,xmin=xmin*0.9,xmax=xmax*1.1,color='red',linestyle='--',lw=3)
plt.xlabel("Fitted values",fontsize=15)
plt.ylabel("Residuals",fontsize=15)
plt.title("Fitted vs. residuals plot",fontsize=18)
plt.grid(True)
plt.show()

The fitted vs. residuals plot shows violation of the constant variance assumption - Heteroscedasticity.¶

Histogram of normalized residuals¶

In [34]:

plt.figure(figsize=(8,5))
plt.hist(fitted.resid_pearson,bins=20,edgecolor='k')
plt.ylabel('Count',fontsize=15)
plt.xlabel('Normalized residuals',fontsize=15)
plt.title("Histogram of normalized residuals",fontsize=18)
plt.show()

Q-Q plot of the residuals¶

In [35]:

from statsmodels.graphics.gofplots import qqplot

In [36]:

plt.figure(figsize=(8,5))
fig=qqplot(fitted.resid_pearson,line='45',fit='True')
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles",fontsize=15)
plt.ylabel("Sample quantiles",fontsize=15)
plt.title("Q-Q plot of normalized residuals",fontsize=18)
plt.grid(True)
plt.show()

<Figure size 576x360 with 0 Axes>

The Q-Q plot (and the histogram above) shows that the normality assumption is satisfied pretty good¶

Normality (Shapiro-Wilk) test of the residuals¶

In [37]:

from scipy.stats import shapiro

In [38]:

_,p=shapiro(fitted.resid)

In [39]:

if p<0.01:
    print("The residuals seem to come from Gaussian process")
else:
    print("The normality assumption may not hold")

The residuals seem to come from Gaussian process

Cook's distance (checking for outliers in residuals)¶

In [40]:

from statsmodels.stats.outliers_influence import OLSInfluence as influence

In [41]:

inf=influence(fitted)

In [45]:

(c, p) = inf.cooks_distance
plt.figure(figsize=(8,5))
plt.title("Cook's distance plot for the residuals",fontsize=16)
plt.stem(np.arange(len(c)), c, markerfmt=",")
plt.grid(True)
plt.show()

There are few data points with residuals being possible outliers¶

Variance inflation factor¶

In [43]:

from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

In [44]:

for i in range(len(df1.columns[:-1])):
    v=vif(np.matrix(df1[:-1]),i)
    print("Variance inflation factor for {}: {}".format(df.columns[i],round(v,2)))

Variance inflation factor for Cement (component 1)(kg in a m^3 mixture): 26.23
Variance inflation factor for Blast Furnace Slag (component 2)(kg in a m^3 mixture): 4.44
Variance inflation factor for Fly Ash (component 3)(kg in a m^3 mixture): 4.56
Variance inflation factor for Water  (component 4)(kg in a m^3 mixture): 92.59
Variance inflation factor for Superplasticizer (component 5)(kg in a m^3 mixture): 5.52
Variance inflation factor for Coarse Aggregate  (component 6)(kg in a m^3 mixture): 85.97
Variance inflation factor for Fine Aggregate (component 7)(kg in a m^3 mixture): 73.46
Variance inflation factor for Age (day): 2.43