Notebook

Model fitting¶

$\newcommand{\eg}{{\it e.g.}} \newcommand{\ie}{{\it i.e.}} \newcommand{\argmin}{\operatornamewithlimits{argmin}} \newcommand{\mc}{\mathcal} \newcommand{\mb}{\mathbb} \newcommand{\mf}{\mathbf} \newcommand{\minimize}{{\text{minimize}}} \newcommand{\diag}{{\text{diag}}} \newcommand{\cond}{{\text{cond}}} \newcommand{\rank}{{\text{rank }}} \newcommand{\range}{{\mathcal{R}}} \newcommand{\null}{{\mathcal{N}}} \newcommand{\tr}{{\text{trace}}} \newcommand{\dom}{{\text{dom}}} \newcommand{\dist}{{\text{dist}}} \newcommand{\R}{\mathbf{R}} \newcommand{\SM}{\mathbf{S}} \newcommand{\ball}{\mathcal{B}} \newcommand{\bmat}[1]{\begin{bmatrix}#1\end{bmatrix}} \newcommand{\loss}{\ell} \newcommand{\eloss}{\mc{L}} \newcommand{\abs}[1]{| #1 |} \newcommand{\norm}[1]{\| #1 \|} \newcommand{\tp}{T}$

ASE2030: Linear algebra and statistics, Inha University.

Jong-Han Kim (jonghank@inha.ac.kr)

Polynomial fit¶

The following downloads a csv file from a webpage. It contains raw data, $x$ , and raw data, $y$ , for which we'd like to build a prediction model that describes the functional relations from $x$ to $y$ .

In [1]:

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('https://jonghank.github.io/ee370/files/fit_data.csv', \
                  delimiter=',')
x, y = data[:,0], data[:,1]

plt.figure(figsize=(6,6), dpi=100)
plt.plot(x, y, 'o', alpha=0.5)
plt.grid()
plt.axis('square')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title('Raw data')
plt.show()

We will fit the data to a 5th order polynomial model,

$\begin{align*} \hat{y} &= \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5 \\ &= \bmat{1 & x & x^2 & x^3 & x^4 & x^5 }\bmat{\theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5} \end{align*}$

so we will find $\theta = \bmat{\theta_0 & \cdots & \theta_5}^T$ such that $\hat{y}$ is overall close enough to $y$ for all given data points. In other words we'd like to find $\theta$ that minimizes

$\begin{align*} \eloss\left(\theta\right) &= \sum_{i=1}^{n} \left(\hat{y}^{(i)}-y^{(i)}\right)^2 \\ &= \sum_{i=1}^{n} \left( \bmat{1 & x^{(i)} & (x^{(i)})^2 & (x^{(i)})^3 & (x^{(i)})^4 & (x^{(i)})^5} \bmat{\theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5} - y^{(i)} \right)^2 \\ &= \left\| \bmat{1 & x^{(1)} & (x^{(1)})^2 & (x^{(1)})^3 & (x^{(1)})^4 & (x^{(1)})^5 \\ 1 & x^{(2)} & (x^{(2)})^2 & (x^{(2)})^3 & (x^{(2)})^4 & (x^{(2)})^5 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x^{(n)} & (x^{(n)})^2 & (x^{(n)})^3 & (x^{(n)})^4 & (x^{(n)})^5 } \bmat{\theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5} - \bmat{y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)}} \right\|_2^2 \\ &= \left\| X \theta - y \right\|_2^2 \end{align*}$

where

$X = \bmat{1 & x^{(1)} & (x^{(1)})^2 & (x^{(1)})^3 & (x^{(1)})^4 & (x^{(1)})^5 \\ 1 & x^{(2)} & (x^{(2)})^2 & (x^{(2)})^3 & (x^{(2)})^4 & (x^{(2)})^5 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x^{(n)} & (x^{(n)})^2 & (x^{(n)})^3 & (x^{(n)})^4 & (x^{(n)})^5 } , \qquad y = \bmat{y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)}}$

In [2]:

n = len(x)
d = 6

X = np.zeros((n,d))
for i in range(d):
  X[:,i] = x**i


theta_opt = np.linalg.lstsq(X, y, rcond=None)[0]
residual = y - X@theta_opt

print(f'Optimal theta: {theta_opt}')
print(f'RMSE: {np.linalg.norm(residual)/np.sqrt(n)}')

vp = np.linspace(0, 1, 100)

X_vp = np.zeros((len(vp),d))
for i in range(d):
  X_vp[:,i] = vp**i;

plt.figure(figsize=(6,6), dpi=100)
plt.plot(x, y, 'o', alpha=0.5, label='Raw data')
plt.plot(vp, np.dot(X_vp, theta_opt), label='Predictor')
plt.grid()
plt.axis('square')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title("Polynomial predictor")
plt.legend()
plt.show()

Optimal theta: [-1.08775601e-01  1.51579304e+01 -9.49190641e+01  2.39806312e+02
 -2.64399727e+02  1.05545151e+02]
RMSE: 0.05310465265353179

Diabete progression¶

The following loads a list of medical records collected from 442 patients, where each row corresponds to a patient's medical record, with the first 10 columns standing for the 10 explanatory variables (age, bmi, bp, ... ), and the last column being the outcome $y$ , the measure of diabetes pregression over a year.

In [3]:

import pandas as pd

df = pd.read_csv('https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data', delimiter='\t')

df

Out[3]:

	AGE	SEX	BMI	BP	S1	S2	S3	S4	S5	S6	Y
0	59	2	32.1	101.00	157	93.2	38.0	4.00	4.8598	87	151
1	48	1	21.6	87.00	183	103.2	70.0	3.00	3.8918	69	75
2	72	2	30.5	93.00	156	93.6	41.0	4.00	4.6728	85	141
3	24	1	25.3	84.00	198	131.4	40.0	5.00	4.8903	89	206
4	50	1	23.0	101.00	192	125.4	52.0	4.00	4.2905	80	135
...	...	...	...	...	...	...	...	...	...	...	...
437	60	2	28.2	112.00	185	113.8	42.0	4.00	4.9836	93	178
438	47	2	24.9	75.00	225	166.0	42.0	5.00	4.4427	102	104
439	60	2	24.9	99.67	162	106.6	43.0	3.77	4.1271	95	132
440	36	1	30.0	95.00	201	125.2	42.0	4.79	5.1299	85	220
441	36	1	19.6	71.00	250	133.2	97.0	3.00	4.5951	92	57

442 rows × 11 columns

We'd like to fit a linear model that predicts the measure of diabete progression over a year, from those 10 features (plus a constant feature).

Diabete progression, $y$ :

$\begin{align*} &y\approx \theta_0 + \theta_1\text{Age} + \theta_2\text{Sex} + \theta_3\text{BMI} + \theta_4\text{BP} + \theta_5\text{S}_1 + \theta_6\text{S}_2 + \theta_7\text{S}_3 + \theta_8\text{S}_4 + \theta_9\text{S}_5 + \theta_{10}\text{S}_6 \end{align*}$

In [4]:

n, d = df.shape
X = np.hstack((np.ones((n,1)), df.values[:,:-1]))
y = df.values[:,-1]

theta_opt = np.linalg.lstsq(X, y, rcond=None)[0]

MSE = np.linalg.norm(X.dot(theta_opt)-y)**2/n

print(f'MSE: {MSE}')

MSE: 2859.6963475867506

In [5]:

theta_opt

Out[5]:

array([-3.34567139e+02, -3.63612242e-02, -2.28596481e+01,  5.60296209e+00,
        1.11680799e+00, -1.08999633e+00,  7.46450456e-01,  3.72004715e-01,
        6.53383194e+00,  6.84831250e+01,  2.80116989e-01])

The following shows the prediction performance where the scatter points on the black line ( $\hat{y}=y$ ) imply perfect prediction.

In [6]:

plt.figure(figsize=(6,6), dpi=100)
plt.plot(y, y, 'k')
plt.plot(y, X.dot(theta_opt), 'o', alpha=0.5)
plt.xlabel('y')
plt.ylabel(r'$\hat{y}$')
plt.axis('square')
plt.grid()

Now suppose we got a new medical record ( $X_\text{new}$ ) from a new patient. The predicted diabete progress for the patient over the next year can be easily assed by the following, $\hat{y}_\text{new}=X_\text{new}\theta^*$ . So you can get a rough idea on how much for this patient the diabetes will progress.

In [7]:

# note that each feature represents
# u1:  age
# u2:  sex
# u3:  bmi         body mass index
# u4:  map	       mean arterial pressure				
# u5:  s1 (tc) :   total cholesterol
# u6:  s2 (ldl):   low density lipoprotein
# u7:  s3 (hdl):   high density lipoprotein
# u8:  s4 (tch):
# u9:  s5 (ltg):
# u10: s6 (glu):
# features:      age  sex      bmi   map      tc      ldl     hdl   tch       ltg    glu
X_JHK = np.array([41,   1,    18.3,   90,    171,    80.0,   74.9,    2,     4.75,  90.0])
X_JHK = np.hstack((1, X_JHK))

y_JHK = X_JHK.dot(theta_opt)
print (y_JHK)
plt.figure(figsize=(6,6), dpi=100)
plt.plot(y, y, 'k')
plt.plot(y, X.dot(theta_opt), 'o', alpha=0.5)
plt.plot(y_JHK, y_JHK, 'ro', label='JHK')
plt.xlabel('y')
plt.ylabel(r'$\hat{y}$')
plt.legend()
plt.axis('square')
plt.grid()

108.89218184854774

In [7]: