# Parametric Generalized Linear Regression¶

In this notebook generalized linear regression is implemented and applied for estimating a model that maps the speed of long distance runners to their heartrate. For training the model a set of 30 samples is applied, each containing the speed (in m/s) of a runner and the heartrate measured at this speed. The data can be downloaded from here

Required Modules:

In [14]:
import numpy as np
import math
import matplotlib.pyplot as plt


Read data from file. The first column contains an ID, the second column is the speed in m/s and the third column is the heartrate in beats/s.

In [15]:
customerArray=np.fromfile("./Res/HeartRate.txt",sep=' ').reshape(-1,3)
numdata=customerArray.shape[0]
print "Number of samples:  ",numdata
customerInd=range(numdata)
print "    ID    Speed   Heartrate"
print customerArray

Number of samples:   30
ID    Speed   Heartrate
[[   1.      4.5   155.15]
[   2.      5.    166.68]
[   3.      4.5   164.37]
[   4.      5.25  160.82]
[   5.      4.5   148.51]
[   6.      4.75  169.83]
[   7.      5.    188.01]
[   8.      5.5   187.9 ]
[   9.      4.5   157.96]
[  10.      5.25  178.29]
[  11.      5.25  179.55]
[  12.      5.5   203.81]
[  13.      4.    150.74]
[  14.      4.5   171.78]
[  15.      5.    172.52]
[  16.      4.25  148.92]
[  17.      4.25  160.02]
[  18.      5.    183.83]
[  19.      5.    156.67]
[  20.      5.    162.4 ]
[  21.      5.    171.39]
[  22.      4.    156.1 ]
[  23.      4.5   153.15]
[  24.      4.75  161.35]
[  25.      4.25  163.16]
[  26.      5.25  165.42]
[  27.      5.25  189.25]
[  28.      4.25  165.56]
[  29.      5.25  172.35]
[  30.      4.    158.07]]


Next the weights $$w=\left( D^T D\right)^{-1} D^T r$$ are calculated, where matrix $D$ is calculated from the input features. Actually the $i.th$ column of $D$ contains the $i.th$ power of the 1-dimensional input vector for $i \in \{0,\ldots,deg\}$. The vector $r$ contains the target values of the training data. If the degree $deg$ is chosen to be 1 a linear model is learned. For $deg>1$ a generalized linear model, i.e. a polynomial of degree $deg$ is learned.

In [16]:
deg=3
D=np.zeros((numdata,deg+1))
for p in customerInd:
for ex in range(deg+1):
D[p][ex]=math.pow(float(customerArray[p][1]),ex)
DT=np.transpose(D)
DTD=np.dot(DT,D)
r=customerArray[:,2]
# Option 1: numerical problems for large degrees and small number of training instances
#w=np.dot(np.dot(np.linalg.inv(DTD),DT),r)
#Option 2: Choose this solution since it is numerically more stable
y=np.dot(DT,r)
w=np.linalg.lstsq(DTD,y)[0]
print 'Calculated weights w= \n', w

Calculated weights w=
[-1374.82684788  1025.96427699  -230.63228071    17.43724377]


The learned model and the training samples are plotted below

In [17]:
plt.scatter(customerArray[:,1],customerArray[:,2],marker='o', color='red')

plt.title('heartrate vs. speed of long distance runners')
plt.xlabel('speed in m/s')
plt.ylabel('heartrate in bps')
plt.hold(True)
RES=0.05 # resolution of speed-axis
# plot calculated linear regression
minS=np.min(customerArray[:,1])
maxS=np.max(customerArray[:,1])
speedrange=np.arange(minS,maxS+RES,RES)
hrrange=np.zeros(speedrange.shape[0])
for si,s in enumerate(speedrange):
hrrange[si]=np.sum([w[d]*s**d for d in range(deg+1)])
plt.plot(speedrange,hrrange)

Out[17]:
[<matplotlib.lines.Line2D at 0xdf10590>]

Finally the mean absolute distance (MAD) and the Mean Square Error (MSE) are calculated.

In [18]:
pred=np.zeros(numdata)
for si,x in enumerate(customerArray[:,1]):
pred[si]=np.sum([w[d]*x**d for d in range(deg+1)])

mse=1.0/numdata*np.sum((pred-customerArray[:,2])**2)
print 'MSE = ',mse

6.9507131235
MSE =  72.8395306513


## Question:¶

Which of the generalized linear models fits best

• if you just regard MAD and MSE?
• if you take into account domain knowledge?

## Same solution, now using Scikit Learn¶

In [19]:
from sklearn import linear_model
speed=np.transpose(np.atleast_2d(customerArray[:,1]))
for d in range(1,deg):
newcol=np.transpose(np.atleast_2d(np.power(speed[:,0],d+1)))
speed=np.concatenate((speed,newcol),axis=1)
heartrate=customerArray[:,2]

# Train Linear Regression Model
reg=linear_model.LinearRegression()
reg.fit(speed,heartrate)
print reg

# Parameters of Trained Model
print "Degree = ",deg
print "Learned coefficients w0, w1, w2, ....:"
wlist=[reg.intercept_]
wlist.extend(reg.coef_)
w=np.array(wlist)
print w

# Plot training samples
plt.scatter(speed[:,0],heartrate,marker='o', color='red')
plt.title('heartrate vs. speed of long distance runners')
plt.xlabel('speed in m/s')
plt.ylabel('heartrate in bps')
plt.hold(True)
for si,s in enumerate(speedrange):
hrrange[si]=np.sum([w[d]*s**d for d in range(deg+1)])
plt.plot(speedrange,hrrange)

LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
Degree =  3
Learned coefficients w0, w1, w2, ....:
[-1374.8269195   1025.96432274  -230.63229039    17.43724445]

Out[19]:
[<matplotlib.lines.Line2D at 0xe3bd310>]