Variable transformation aims to reduce strong right or left skew so that a variable's distribution is closer to normal. We first refit the previous model, then check the distribution of its residuals and of the numeric variables we have selected (cadmium, zinc and om).
library(leaps)
library(sp)
data(meuse)
meuse <- na.omit(meuse)
var_selec <- c("cadmium","zinc","om","ffreq","lime","lead")
## creating the subset of selected variables
data_selec <- meuse[var_selec]
## ffreq is a factor with levels 1, 2, 3; as.numeric() returns the matching integer codes
data_selec$ffreq <- as.numeric(data_selec$ffreq)
## replicating previous model
model_1 <- lm(lead ~ ., data = data_selec)
summary(model_1)
##
## Call:
## lm(formula = lead ~ ., data = data_selec)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -77.514 -12.536  -0.082  13.660  59.689
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  35.5617     7.0328   5.057 1.26e-06 ***
## cadmium     -13.0918     1.5459  -8.469 2.43e-14 ***
## zinc          0.4300     0.0128  33.606  < 2e-16 ***
## om           -2.3889     0.8193  -2.916 0.004110 **
## ffreq       -10.5112     2.9432  -3.571 0.000481 ***
## lime1       -25.0169     5.8700  -4.262 3.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.72 on 146 degrees of freedom
## Multiple R-squared: 0.9597, Adjusted R-squared: 0.9584
## F-statistic: 696.1 on 5 and 146 DF, p-value: < 2.2e-16
We check the distribution of the residuals and of the selected variables to determine whether variable transformation is needed.
qqnorm(residuals(model_1),
       ylab = "Sample Quantiles for residuals")
qqline(residuals(model_1),
       col = "red")
The Q-Q plot shows that the residuals are not normally distributed, so variable transformation is needed.
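A Q-Q plot is judged by eye, so a formal test can serve as a numerical cross-check. The sketch below is an addition, not part of the original analysis: it assumes the same data preparation as above and applies the Shapiro-Wilk test (`shapiro.test()` from base R's stats package) to the residuals. The name `sw` is introduced here for illustration only.

```r
library(sp)
data(meuse)
meuse <- na.omit(meuse)

# Rebuild the subset and the model exactly as in the code above
data_selec <- meuse[c("cadmium", "zinc", "om", "ffreq", "lime", "lead")]
data_selec$ffreq <- as.numeric(data_selec$ffreq)
model_1 <- lm(lead ~ ., data = data_selec)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(model_1))
print(sw)
```

A small p-value would count as evidence against normality and would agree with the reading of the Q-Q plot. The test grows more sensitive as the sample size increases, so it is best read alongside the plot rather than instead of it.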
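## density plot for each of the three numeric predictors
for (v in c("cadmium", "zinc", "om")) {
  d <- density(data_selec[[v]])
  plot(d, main = paste("Density of", v))
}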
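To put a number on what the density plots show, sample skewness can be computed for each variable. The sketch below is illustrative only. `skewness()` is a hypothetical helper defined here (base R does not provide one), and the natural log is used as one common candidate transformation, not necessarily the one this analysis ends up adopting.

```r
library(sp)
data(meuse)
meuse <- na.omit(meuse)

# Moment-based sample skewness (hypothetical helper, not from a package):
# third central moment divided by the second central moment to the power 1.5
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

num_vars <- c("cadmium", "zinc", "om")

# Skewness of each variable before and after a log transformation
skew_table <- data.frame(
  variable = num_vars,
  raw      = sapply(num_vars, function(v) skewness(meuse[[v]])),
  logged   = sapply(num_vars, function(v) skewness(log(meuse[[v]]))),
  row.names = NULL
)
print(skew_table)
```

Values near zero indicate a roughly symmetric distribution, positive values a right skew and negative values a left skew. Comparing the `raw` and `logged` columns gives a quick indication of whether a log transformation moves each variable closer to symmetry.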