#!/usr/bin/env python
# coding: utf-8

# ---
# title: Fitting a Simple Additive Model in Python
# tags: Statistics, Python
# ---

# Recently, I have been learning about [(generalized) additive models](https://en.wikipedia.org/wiki/Generalized_additive_model) by working through [Simon Wood's](http://people.bath.ac.uk/sw283/) [book](http://www.amazon.com/Generalized-Additive-Models-Introduction-Statistical/dp/1584884746). I have previously posted an IPython [notebook](http://nbviewer.ipython.org/gist/AustinRochford/bb19863446d46dbcdc83) implementing the models from Chapter 3 of the book. In this post, I will show how to fit a simple additive model in Python in a bit more detail.
# 
# We will use a [LIDAR](https://en.wikipedia.org/wiki/Lidar) dataset that is available on the [website](http://www.stat.cmu.edu/~larry/all-of-nonpar/data.html) for Larry Wasserman's book [_All of Nonparametric Statistics_](http://www.stat.cmu.edu/~larry/all-of-nonpar/).

# In[1]:

get_ipython().run_line_magic('matplotlib', 'inline')


# In[2]:

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import patsy
import scipy as sp
import seaborn as sns
from statsmodels import api as sm


# In[3]:

df = pd.read_csv('http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/lidar.dat',
                 sep=' *', engine='python')
df['std_range'] = (df.range - df.range.min()) / df.range.ptp()

n = df.shape[0]


# In[4]:

df.head()


# This data set is well-suited to additive modeling because the relationship between the variables is highly non-linear.

# In[5]:

fig, ax = plt.subplots(figsize=(8, 6))

blue = sns.color_palette()[0]
ax.scatter(df.std_range, df.logratio, c=blue, alpha=0.5);

ax.set_xlim(-0.01, 1.01);
ax.set_xlabel('Scaled range');
ax.set_ylabel('Log ratio');


# An additive model represents the relationship between explanatory variables $\mathbf{x}$ and a response variable $y$ as a sum of smooth functions of the explanatory variables
# 
# $$y = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_k(x_k) + \varepsilon.$$
# 
# The smooth functions $f_i$ can be estimated using a variety of nonparametric techniques. Following Chapter 3 of Wood's book, we will fit our additive model using penalized regression splines.
# 
# Since our LIDAR data set has only one explanatory variable, our additive model takes the form
# 
# $$y = \beta_0 + f(x) + \varepsilon.$$
# 
# We fit this model by minimizing the penalized residual sum of squares
# 
# $$PRSS = \sum_{i = 1}^n \left(y_i - \beta_0 - f(x_i)\right)^2 + \lambda \int_0^1 \left(f''(x)\right)^2\ dx.$$
# 
# The penalty term
# 
# $$\int_0^1 \left(f''(x)\right)^2\ dx$$
# 
# means that we only choose a less smooth function when it fits the data much better. The smoothing parameter $\lambda$ controls the rate at which decreased smoothness is traded for a better fit.
# 
# In the penalized regression splines model, we must also choose basis functions $\varphi_1, \varphi_2, \ldots, \varphi_k$, which we then use to express the smooth function $f$ as
# 
# $$f(x) = \beta_1 \varphi_1(x) + \beta_2 \varphi_2(x) + \cdots + \beta_k \varphi_k(x).$$
# 
# With these basis functions in place, if we define $\mathbf{x}_i = [1\ x_i\ \varphi_2(x_i)\ \cdots\ \varphi_k(x_i)]$ and
# 
# $$\mathbf{X} = \begin{bmatrix}
# \mathbf{x}_1 \\
# \vdots \\
# \mathbf{x}_n
# \end{bmatrix},$$
# 
# the model $y_i = \beta_0 + f(x_i) + \varepsilon_i$ can be rewritten as $\mathbf{y} = \mathbf{X} \beta + \varepsilon$.
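# To make the matrix form concrete, here is a minimal sketch (not used in the rest of the post) of how a basis expansion turns the smooth model into an ordinary linear model. The cubic-monomial basis functions below are illustrative stand-ins, not the penalized regression spline basis we construct later.

# In[ ]:

toy_x = np.linspace(0, 1, 11)
# illustrative basis functions only; the post's actual basis is defined below
toy_basis = [lambda x: x, lambda x: x**2, lambda x: x**3]

# each row of the design matrix is [1, phi_1(x_i), ..., phi_k(x_i)]
toy_X = np.column_stack([np.ones_like(toy_x)] + [phi(toy_x) for phi in toy_basis])
toy_X.shape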
# It is tedious but not difficult to show that when $f$ is expressed as a linear combination of basis functions, there is always a [positive semidefinite](https://en.wikipedia.org/wiki/Positive-definite_matrix#Negative-definite.2C_semidefinite_and_indefinite_matrices) matrix $\mathbf{S}$ such that
# 
# $$\int_0^1 \left(f''(x)\right)^2\ dx = \beta^{\intercal} \mathbf{S} \beta.$$
# 
# Since $\mathbf{S}$ is positive semidefinite, it has a square root $\mathbf{B}$ such that $\mathbf{B}^{\intercal} \mathbf{B} = \mathbf{S}$. The penalized residual sum of squares objective function can then be written as
# 
# $$
# \begin{align*}
# PRSS
#     & = (\mathbf{y} - \mathbf{X} \beta)^{\intercal} (\mathbf{y} - \mathbf{X} \beta) + \lambda \beta^{\intercal} \mathbf{B}^{\intercal} \mathbf{B} \beta
#       = (\mathbf{\tilde{y}} - \mathbf{\tilde{X}} \beta)^{\intercal} (\mathbf{\tilde{y}} - \mathbf{\tilde{X}} \beta),
# \end{align*}
# $$
# 
# where
# 
# $$\mathbf{\tilde{y}} = \begin{bmatrix}
# \mathbf{y} \\
# \mathbf{0}_{k + 1}
# \end{bmatrix}$$
# 
# and
# 
# $$\mathbf{\tilde{X}} = \begin{bmatrix}
# \mathbf{X} \\
# \sqrt{\lambda}\ \mathbf{B}
# \end{bmatrix}.$$
# 
# Therefore the augmented data matrices $\mathbf{\tilde{y}}$ and $\mathbf{\tilde{X}}$ allow us to express the penalized residual sum of squares for the original model as the residual sum of squares of the OLS model $\mathbf{\tilde{y}} = \mathbf{\tilde{X}} \beta + \tilde{\varepsilon}$. This augmented model allows us to use widely available machinery for fitting OLS models to fit the additive model as well.
# 
# The last step before we can fit the model in Python is to choose the basis functions $\varphi_i$. Again, following Chapter 3 of Wood's book, we let
# 
# $$R(x, z) = \frac{1}{4} \left(\left(z - \frac{1}{2}\right)^2 - \frac{1}{12}\right) \left(\left(x - \frac{1}{2}\right)^2 - \frac{1}{12}\right) - \frac{1}{24} \left(\left(\left|x - z\right| - \frac{1}{2}\right)^4 - \frac{1}{2} \left(\left|x - z\right| - \frac{1}{2}\right)^2 + \frac{7}{240}\right).$$

# In[6]:

def R(x, z):
    return (((z - 0.5)**2 - 1 / 12) * ((x - 0.5)**2 - 1 / 12) / 4
            - ((np.abs(x - z) - 0.5)**4
               - 0.5 * (np.abs(x - z) - 0.5)**2
               + 7 / 240) / 24)

R = np.frompyfunc(R, 2, 1)

def R_(x):
    return R.outer(x, knots).astype(np.float64)


# Though this function is quite complicated, we will see that it has some very convenient properties. We must also choose a set of knots $z_i$ in $[0, 1]$, $i = 1, 2, \ldots, q$.

# In[7]:

q = 20
knots = df.std_range.quantile(np.linspace(0, 1, q))


# Here we have used twenty knots situated at percentiles of `std_range`.
# 
# Now we define our basis functions as $\varphi_1(x) = x$ and $\varphi_i(x) = R(x, z_{i - 1})$ for $i = 2, 3, \ldots, q + 1$.
# 
# Our model matrices $\mathbf{y}$ and $\mathbf{X}$ are therefore

# In[8]:

y, X = patsy.dmatrices('logratio ~ std_range + R_(std_range)', data=df)


# Note that, by default, `patsy` always includes an intercept column in `X`.
# 
# The advantage of the function $R$ is that the penalty matrix $\mathbf{S}$ has the form
# 
# $$\mathbf{S} = \begin{bmatrix}
# \mathbf{0}_{2 \times 2} & \mathbf{0}_{2 \times q} \\
# \mathbf{0}_{q \times 2} & \mathbf{\tilde{S}}
# \end{bmatrix},$$
# 
# where $\mathbf{\tilde{S}}_{ij} = R(z_i, z_j)$. We now calculate $\mathbf{S}$ and its square root $\mathbf{B}$.

# In[9]:

S = np.zeros((q + 2, q + 2))
S[2:, 2:] = R_(knots)


# In[10]:

B = np.zeros_like(S)
B[2:, 2:] = np.real_if_close(sp.linalg.sqrtm(S[2:, 2:]), tol=10**8)


# We now have all the ingredients necessary to fit some additive models to the LIDAR data set.
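# Before fitting, a quick sanity check (an extra step, not required for the analysis) verifies numerically that $\mathbf{B}$ is indeed a square root of $\mathbf{S}$; the largest entry of $\mathbf{B}^{\intercal} \mathbf{B} - \mathbf{S}$ should be on the order of floating-point error.

# In[ ]:

# largest absolute discrepancy between B^T B and S
np.abs(B.T @ B - S).max()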
# In[11]:

def fit(y, X, B, lambda_=1.0):
    # build the augmented matrices
    y_ = np.vstack((y, np.zeros((q + 2, 1))))
    X_ = np.vstack((X, np.sqrt(lambda_) * B))

    return sm.OLS(y_, X_).fit()


# We have not yet discussed how to choose the smoothing parameter $\lambda$, so we will fit several models with different values of $\lambda$ to see how it affects the results.

# In[12]:

fig, axes = plt.subplots(nrows=3, ncols=2, sharex=True, sharey=True,
                         squeeze=True, figsize=(12, 13.5))

plot_lambdas = np.array([1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001])

plot_x = np.linspace(0, 1, 100)
plot_X = patsy.dmatrix('std_range + R_(std_range)', {'std_range': plot_x})

for lambda_, ax in zip(plot_lambdas, np.ravel(axes)):
    ax.scatter(df.std_range, df.logratio, c=blue, alpha=0.5);

    results = fit(y, X, B, lambda_=lambda_)
    ax.plot(plot_x, results.predict(plot_X));

    ax.set_xlim(-0.01, 1.01);
    ax.set_xlabel('Scaled range');
    ax.set_ylabel('Log ratio');
    ax.set_title(r'$\lambda = {}$'.format(lambda_));

fig.tight_layout();


# We can see that as $\lambda$ decreases, the model becomes less smooth. Visually, it seems that the optimal value of $\lambda$ lies somewhere between $10^{-2}$ and $10^{-4}$. We need a rigorous way to choose the optimal value of $\lambda$, and as is often the case in such situations, we turn to [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Specifically, we will use [generalized cross-validation](http://www.stat.wisc.edu/~wahba/ftp1/oldie/golub.heath.wahba.pdf) (GCV). The GCV score is given by
# 
# $$\operatorname{GCV}(\lambda) = \frac{n \sum_{i = 1}^n \left(y_i - \hat{y}_i\right)^2}{\left(n - \operatorname{tr} \mathbf{H}\right)^2}.$$
# 
# Here, $\hat{y}_i$ is the $i$-th predicted value, and $\mathbf{H}$ is the upper left $n \times n$ submatrix of the [influence matrix](https://en.wikipedia.org/wiki/Hat_matrix) for the OLS model $\mathbf{\tilde{y}} = \mathbf{\tilde{X}} \beta + \tilde{\varepsilon}$.

# In[13]:

def gcv_score(results):
    X = results.model.exog[:-(q + 2), :]
    n = X.shape[0]
    y = results.model.endog[:n]
    y_hat = results.predict(X)
    hat_matrix_trace = results.get_influence().hat_matrix_diag[:n].sum()

    return n * np.power(y - y_hat, 2).sum() / np.power(n - hat_matrix_trace, 2)


# Now we evaluate the GCV score of the model over a range of $\lambda$ values.

# In[14]:

lambdas = np.logspace(0, 50, 100, base=1.5) * 1e-8


# In[15]:

gcv_scores = np.array([gcv_score(fit(y, X, B, lambda_=lambda_)) for lambda_ in lambdas])


# In[16]:

fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(lambdas, gcv_scores);

ax.set_xscale('log');
ax.set_xlabel(r'$\lambda$');
ax.set_ylabel(r'$\operatorname{GCV}(\lambda)$');


# The GCV-optimal value of $\lambda$ is therefore

# In[17]:

lambda_best = lambdas[gcv_scores.argmin()]

lambda_best


# This value of $\lambda$ produces a visually reasonable fit.

# In[18]:

fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(df.std_range, df.logratio, c=blue, alpha=0.5);

results = fit(y, X, B, lambda_=lambda_best)
ax.plot(plot_x, results.predict(plot_X), label=r'$\lambda = {}$'.format(lambda_best));

ax.set_xlim(-0.01, 1.01);
ax.legend();


# We have only scratched the surface of additive models here, fitting a simple model of one variable with penalized regression splines. In general, additive models are quite powerful and flexible, while remaining interpretable.
# 
# This post is available as an [IPython](http://ipython.org/) notebook [here](http://nbviewer.ipython.org/gist/AustinRochford/c9c6862c6f7f7a28870a).