The following downloads a csv
file from a webpage. It contains raw data, x, and raw data, y, for which we'd like to build a prediction model that describes the functional relations from x to y.
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('https://jonghank.github.io/ee370/files/fit_data.csv', \
delimiter=',')
x, y = data[:,0], data[:,1]
plt.figure(figsize=(6,6), dpi=100)
plt.plot(x, y, 'o', alpha=0.5)
plt.grid()
plt.axis('square')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title('Raw data')
plt.show()
We will fit the data to a 5th order polynomial model,
ˆy=θ0+θ1x+θ2x2+θ3x3+θ4x4+θ5x5=[1xx2x3x4x5][θ0θ1θ2θ3θ4θ5]so we will find θ=[θ0⋯θ5]T such that ˆy is overall close enough to y for all given data points. In other words we'd like to find θ that minimizes
L(θ)=n∑i=1(ˆy(i)−y(i))2=n∑i=1([1x(i)(x(i))2(x(i))3(x(i))4(x(i))5][θ0θ1θ2θ3θ4θ5]−y(i))2=‖[1x(1)(x(1))2(x(1))3(x(1))4(x(1))51x(2)(x(2))2(x(2))3(x(2))4(x(2))5⋮⋮⋮⋮⋮⋮1x(n)(x(n))2(x(n))3(x(n))4(x(n))5][θ0θ1θ2θ3θ4θ5]−[y(1)y(2)⋮y(n)]‖22=‖Xθ−y‖22where
X=[1x(1)(x(1))2(x(1))3(x(1))4(x(1))51x(2)(x(2))2(x(2))3(x(2))4(x(2))5⋮⋮⋮⋮⋮⋮1x(n)(x(n))2(x(n))3(x(n))4(x(n))5],y=[y(1)y(2)⋮y(n)]n = len(x)
d = 6
X = np.zeros((n,d))
for i in range(d):
X[:,i] = x**i
theta_opt = np.linalg.lstsq(X, y, rcond=None)[0]
residual = y - X@theta_opt
print(f'Optimal theta: {theta_opt}')
print(f'RMSE: {np.linalg.norm(residual)/np.sqrt(n)}')
vp = np.linspace(0, 1, 100)
X_vp = np.zeros((len(vp),d))
for i in range(d):
X_vp[:,i] = vp**i;
plt.figure(figsize=(6,6), dpi=100)
plt.plot(x, y, 'o', alpha=0.5, label='Raw data')
plt.plot(vp, np.dot(X_vp, theta_opt), label='Predictor')
plt.grid()
plt.axis('square')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel(r'$x$')
plt.ylabel(r'$y$')
plt.title("Polynomial predictor")
plt.legend()
plt.show()
Optimal theta: [-1.08775601e-01 1.51579304e+01 -9.49190641e+01 2.39806312e+02 -2.64399727e+02 1.05545151e+02] RMSE: 0.05310465265353179
The following loads a list of medical records collected from 442 patients, where each row corresponds to a patient's medical record, with the first 10 columns standing for the 10 explanatory variables (age, bmi, bp, ... ), and the last column being the outcome y, the measure of diabetes pregression over a year.
import pandas as pd
df = pd.read_csv('https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data', delimiter='\t')
df
AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 59 | 2 | 32.1 | 101.00 | 157 | 93.2 | 38.0 | 4.00 | 4.8598 | 87 | 151 |
1 | 48 | 1 | 21.6 | 87.00 | 183 | 103.2 | 70.0 | 3.00 | 3.8918 | 69 | 75 |
2 | 72 | 2 | 30.5 | 93.00 | 156 | 93.6 | 41.0 | 4.00 | 4.6728 | 85 | 141 |
3 | 24 | 1 | 25.3 | 84.00 | 198 | 131.4 | 40.0 | 5.00 | 4.8903 | 89 | 206 |
4 | 50 | 1 | 23.0 | 101.00 | 192 | 125.4 | 52.0 | 4.00 | 4.2905 | 80 | 135 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
437 | 60 | 2 | 28.2 | 112.00 | 185 | 113.8 | 42.0 | 4.00 | 4.9836 | 93 | 178 |
438 | 47 | 2 | 24.9 | 75.00 | 225 | 166.0 | 42.0 | 5.00 | 4.4427 | 102 | 104 |
439 | 60 | 2 | 24.9 | 99.67 | 162 | 106.6 | 43.0 | 3.77 | 4.1271 | 95 | 132 |
440 | 36 | 1 | 30.0 | 95.00 | 201 | 125.2 | 42.0 | 4.79 | 5.1299 | 85 | 220 |
441 | 36 | 1 | 19.6 | 71.00 | 250 | 133.2 | 97.0 | 3.00 | 4.5951 | 92 | 57 |
442 rows × 11 columns
We'd like to fit a linear model that predicts the measure of diabete progression over a year, from those 10 features (plus a constant feature).
n, d = df.shape
X = np.hstack((np.ones((n,1)), df.values[:,:-1]))
y = df.values[:,-1]
theta_opt = np.linalg.lstsq(X, y, rcond=None)[0]
MSE = np.linalg.norm(X.dot(theta_opt)-y)**2/n
print(f'MSE: {MSE}')
MSE: 2859.6963475867506
theta_opt
array([-3.34567139e+02, -3.63612242e-02, -2.28596481e+01, 5.60296209e+00, 1.11680799e+00, -1.08999633e+00, 7.46450456e-01, 3.72004715e-01, 6.53383194e+00, 6.84831250e+01, 2.80116989e-01])
The following shows the prediction performance where the scatter points on the black line (ˆy=y) imply perfect prediction.
plt.figure(figsize=(6,6), dpi=100)
plt.plot(y, y, 'k')
plt.plot(y, X.dot(theta_opt), 'o', alpha=0.5)
plt.xlabel('y')
plt.ylabel(r'$\hat{y}$')
plt.axis('square')
plt.grid()
Now suppose we got a new medical record (Xnew) from a new patient. The predicted diabete progress for the patient over the next year can be easily assed by the following, ˆynew=Xnewθ∗. So you can get a rough idea on how much for this patient the diabetes will progress.
# note that each feature represents
# u1: age
# u2: sex
# u3: bmi body mass index
# u4: map mean arterial pressure
# u5: s1 (tc) : total cholesterol
# u6: s2 (ldl): low density lipoprotein
# u7: s3 (hdl): high density lipoprotein
# u8: s4 (tch):
# u9: s5 (ltg):
# u10: s6 (glu):
# features: age sex bmi map tc ldl hdl tch ltg glu
X_JHK = np.array([41, 1, 18.3, 90, 171, 80.0, 74.9, 2, 4.75, 90.0])
X_JHK = np.hstack((1, X_JHK))
y_JHK = X_JHK.dot(theta_opt)
print (y_JHK)
plt.figure(figsize=(6,6), dpi=100)
plt.plot(y, y, 'k')
plt.plot(y, X.dot(theta_opt), 'o', alpha=0.5)
plt.plot(y_JHK, y_JHK, 'ro', label='JHK')
plt.xlabel('y')
plt.ylabel(r'$\hat{y}$')
plt.legend()
plt.axis('square')
plt.grid()
108.89218184854774