This demo illustrates fitting nonlinear regression functions with basis expansions: polynomial bases, piecewise polynomials, and regression splines, including a robust (least absolute deviations) spline fit. See splinesP.ipynb for some of these examples in Python. See Elements of Statistical Learning, Section 5.2, for a discussion of splines.
## load some data
library(L1pack)
library(splines)
library(data.table)
set.seed(1234)
## read data, order times, fold lc, put in x,y,sig variables
dat <- as.data.frame(fread("http://longjp.github.io/statcomp/lectures/star.dat"))
x <- dat[,1]
y <- dat[,2]
sig <- dat[,3]
Our goal is to model $y$ as a function of $x$. Clearly this is a nonlinear relationship.
options(repr.plot.width=7,repr.plot.height=4)
par(mar=c(5,5,1,1))
plot(x,y,ylab="y",xlab="x")
segments(x,y + sig,x,y - sig,col='grey')
One option is to represent $y$ as a linear function of $x$, $x^2$, $x^3$, ..., i.e. as a polynomial in $x$ up to some order. This does not work well for any reasonable order because low orders are not flexible enough to capture the local structure, while high orders make the design matrix severely ill-conditioned (the columns $x^j$ become nearly collinear). We demonstrate this issue now.
## polynomial basis
p <- 6
X <- vapply(0:(p-1),function(j) x^j,rep(0,length(x))) ## columns: 1, x, x^2, ..., x^(p-1)
coeffs <- matrix(lm.fit(X,y)$coefficients,ncol=1)
## for p > about 15, some coefficients are NA, this is
## due to X being nearly column rank deficient which
## prevents lm.fit from finding a unique soln
print(sum(is.na(coeffs)))
coeffs[is.na(coeffs)] <- 0
preds <- X%*%coeffs
[1] 0
options(repr.plot.width=7,repr.plot.height=4)
par(mar=c(5,5,1,1))
plot(x,y,ylab="y",xlab="x")
segments(x,y + sig,x,y - sig,col='grey')
points(x,preds,col='red',type='l',lwd=2)
The $X$ matrix is often called the design matrix. If the $p$ columns of $X$ do not span a $p$-dimensional subspace of $\mathbb{R}^n$, then $X^TX$ will not be of full rank (i.e. not invertible), and the normal equations will not have a unique solution. In particular, $X^TX$ has $0$ as an eigenvalue. A similar phenomenon occurs with polynomial bases, where the matrix is "almost" singular: the smallest eigenvalues of $X^TX$ are nearly $0$.
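As a quick numerical sketch of this near-singularity (using an evenly spaced grid rather than the star data), the condition number of the polynomial design matrix explodes as the order grows:

```r
## condition number (ratio of largest to smallest singular value)
## of the polynomial design for increasing order
xx <- seq(0.01,1,length.out=200)
for (ord in c(5,10,15)) {
  Xp <- vapply(0:(ord-1),function(j) xx^j,numeric(length(xx)))
  cat("order =",ord," kappa =",kappa(Xp),"\n")
}
```

A huge condition number means the smallest eigenvalues of $X^TX$ are nearly $0$ relative to the largest, which is exactly the numerical trouble `lm.fit` runs into for large $p$.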
a <- rnorm(100)
A <- cbind(1,a,a^2,a+a^2)
colnames(A) <- c("1","x","x^2","x^2 + x")
z <- A%*%matrix(1:4,ncol=1) + rnorm(100)
par(mar=c(5,5,1,1))
plot(a,z)
lmf <- lm(z~A-1)
summary(lmf)
Call:
lm(formula = z ~ A - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.86127 -0.67476 -0.08071  0.63893  1.77840 

Coefficients: (1 not defined because of singularities)
         Estimate Std. Error t value Pr(>|t|)    
A1        0.91350    0.11050   8.267 7.25e-13 ***
Ax        6.02400    0.09599  62.757  < 2e-16 ***
Ax^2      6.98725    0.06380 109.521  < 2e-16 ***
Ax^2 + x       NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9045 on 97 degrees of freedom
Multiple R-squared:  0.9969,	Adjusted R-squared:  0.9968 
F-statistic: 1.029e+04 on 3 and 97 DF,  p-value: < 2.2e-16
## we can determine that some columns are functions of other columns
alias(lmf)
Model :
z ~ A - 1

Complete :
         A1 Ax Ax^2
Ax^2 + x  0  1  1
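Another way to detect the redundancy is the numerical rank of the design (a quick check reusing the `A` defined above):

```r
## A has 4 columns but rank 3, since the last column
## is exactly the sum of columns 2 and 3
qr(A)$rank
```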
We fit local polynomials using our own basis functions. In practice it is better to use B-splines (see below) because they are optimized for speed and numerical stability.
range(x)
## creates a design matrix for local constant fitting
p <- 20 ## number of segments
breaks <- (1:p)/p
## idea: fit low degree polynomial regression separately to segments
options(repr.plot.width=7,repr.plot.height=4)
par(mar=c(5,5,1,1))
plot(x,y,ylab="y",xlab="x")
abline(v=breaks)
which_segment <- vapply(x,function(xi) sum(xi > breaks),0) + 1 ## segment index of each observation
basis <- diag(p)
X <- t(vapply(which_segment,function(s){basis[s,]},rep(0,p))) ## one indicator column per segment
dim(X)
## here linear model just takes average of values
## in segments defined by breaks
fit <- lm(y~X-1)
options(repr.plot.width=7,repr.plot.height=4)
par(mar=c(5,5,1,1))
plot(x,y,ylab="y",xlab="x")
points(x,fit$fitted.values,col='red',type='l',lwd=2)
abline(v=breaks)
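Because each basis column is a segment indicator, the least squares coefficients are exactly the per-segment means of $y$. A quick sanity check using the objects defined above:

```r
## fitted values should coincide with the mean of y within each segment
seg_means <- tapply(y,which_segment,mean)
all.equal(unname(seg_means[as.character(which_segment)]),
          unname(fit$fitted.values))
```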
X <- t(vapply(which_segment,function(s){basis[s,]},rep(0,p))) ## indicator columns, recreated
X2 <- X*x
head(X)
head(X2)
head(X): each of the first six rows is the indicator (1, 0, 0, ..., 0) for segment 1.
head(X2): the same rows with the 1 replaced by the observation's x value in column 1
(0.01237213, 0.01504452, 0.01676609, 0.01895597, 0.01997927, 0.02133762) and 0 elsewhere.
X <- cbind(X,X2)
## fit piecewise linear
fit <- lm(y~X-1)
options(repr.plot.width=7,repr.plot.height=4)
par(mar=c(5,5,1,1))
plot(x,y,ylab="y",xlab="x")
points(x,fit$fitted.values,col='red',type='l',lwd=2)
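The piecewise-linear fit need not be continuous at the breaks, which is the motivation for splines. A quick look at the jump sizes (a sketch using the `fit`, `breaks`, and `p` defined above, assuming every segment contains at least two observations so all coefficients are estimable):

```r
## compare left and right limits of the fitted lines at each interior break
beta <- coef(fit)                   ## p segment intercepts, then p segment slopes
k <- 1:(p-1)                        ## interior breaks
left  <- beta[k]   + beta[p+k]   * breaks[k]
right <- beta[k+1] + beta[p+k+1] * breaks[k]
summary(abs(right - left))          ## jumps are typically nonzero
```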
Splines fit local polynomials (usually degree 3) and ensure that at the breaks between polynomials, known as knots, the function remains smooth (usually twice continuously differentiable).
The popular spline basis, known as the B-spline basis, is constructed so that all eigenvalues of $X^TX$ remain reasonably large and fitting is fast. However, the B-spline basis is more complex to construct than the bases we built by hand above.
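Two useful properties of the B-spline basis: with degree $d$, $K$ interior knots, and an intercept it has $K + d + 1$ columns, and the basis functions sum to one at every point. A small sketch on a grid:

```r
library(splines)
xx <- seq(0,1,length.out=101)
B <- bs(xx,knots=c(0.25,0.5,0.75),degree=3,intercept=TRUE)
dim(B)            ## 101 x 7: K + degree + 1 = 3 + 3 + 1 columns
range(rowSums(B)) ## basis functions form a partition of unity (all row sums are 1)
```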
## regression splines
N <- 30
X <- bs(x,knots=(1:N)/(N+1),intercept=TRUE)
coeffs <- matrix(lm.fit(X,y)$coefficients,ncol=1)
preds <- X%*%coeffs
dim(X)
## matrix is well conditioned
e <- eigen(t(X)%*%X)
e$values
## should look much better than polynomial
par(mar=c(5,5,1,1))
plot(x,y,ylab="y",xlab="x")
segments(x,y + sig,x,y - sig,col='grey')
points(x,preds,col='red',type='l',lwd=2)
The least squares fit above is not robust to outliers; least absolute deviations (LAD) is. We can fit the same spline basis expansion with an L1 solver, giving a robust nonlinear regression.
## add a large amount of error to p fraction of the observations
p <- 0.3
errs <- rbinom(length(x),prob=p,size=1)*rt(length(x),df=2)*0.3
ye <- y + errs
par(mar=c(5,5,1,1))
plot(x,ye)
## regression splines
N <- 30
X <- bs(x,knots=(1:N)/(N+1),intercept=TRUE)
coeffsLS <- matrix(lm.fit(X,ye)$coefficients,ncol=1)
predsLS <- X%*%coeffsLS
coeffsLAD <- l1fit(X,ye, intercept = FALSE)$coefficients
predsLAD <- X%*%coeffsLAD
par(mar=c(5,5,1,1))
plot(x,ye)
points(x,predsLS,col='blue',type='l',lwd=3)
points(x,predsLAD,col='orange',type='l',lwd=3)
legend("topleft",c("LS","LAD"),col=c("blue","orange"),lwd=2)
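To quantify the difference, we can compare each fit to the uncontaminated $y$ (a quick check using the objects defined above; the LAD fit should typically be closer):

```r
## mean absolute error of each fit against the clean signal y
c(LS  = mean(abs(predsLS  - y)),
  LAD = mean(abs(predsLAD - y)))
```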