Parametric Generalized Linear Regression

In this notebook generalized linear regression is implemented and applied to estimate a model that maps the speed of long-distance runners to their heart rate. The model is trained on a set of 30 samples, each containing the speed (in m/s) of a runner and the heart rate measured at this speed. The data can be downloaded from here

Required Modules:

In [14]:
import numpy as np
import math
import matplotlib.pyplot as plt

Read the data from file. The first column contains an ID, the second column is the speed in m/s, and the third column is the heart rate in beats per minute (bpm).

In [15]:
customerArray=np.fromfile("./Res/HeartRate.txt",sep=' ').reshape(-1,3)
numdata=customerArray.shape[0]
print("Number of samples:  ",numdata)
customerInd=range(numdata)
print("    ID    Speed   Heartrate")
print(customerArray)
Number of samples:   30
    ID    Speed   Heartrate
[[   1.      4.5   155.15]
 [   2.      5.    166.68]
 [   3.      4.5   164.37]
 [   4.      5.25  160.82]
 [   5.      4.5   148.51]
 [   6.      4.75  169.83]
 [   7.      5.    188.01]
 [   8.      5.5   187.9 ]
 [   9.      4.5   157.96]
 [  10.      5.25  178.29]
 [  11.      5.25  179.55]
 [  12.      5.5   203.81]
 [  13.      4.    150.74]
 [  14.      4.5   171.78]
 [  15.      5.    172.52]
 [  16.      4.25  148.92]
 [  17.      4.25  160.02]
 [  18.      5.    183.83]
 [  19.      5.    156.67]
 [  20.      5.    162.4 ]
 [  21.      5.    171.39]
 [  22.      4.    156.1 ]
 [  23.      4.5   153.15]
 [  24.      4.75  161.35]
 [  25.      4.25  163.16]
 [  26.      5.25  165.42]
 [  27.      5.25  189.25]
 [  28.      4.25  165.56]
 [  29.      5.25  172.35]
 [  30.      4.    158.07]]

Next the weights $$ w=\left( D^T D\right)^{-1} D^T r $$ are calculated, where the matrix $D$ is built from the input features: the $i$-th column of $D$ contains the $i$-th power of the 1-dimensional input vector, for $i \in \{0,\ldots,deg\}$. The vector $r$ contains the target values of the training data. If the degree $deg$ is chosen to be 1, a linear model is learned. For $deg>1$ a generalized linear model, i.e. a polynomial of degree $deg$, is learned.
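As an aside, the design matrix $D$ can also be built without explicit loops using `np.vander`; with `increasing=True`, column $i$ holds the $i$-th power, matching the convention above. A minimal sketch on a hypothetical input vector (not the runner data):

```python
import numpy as np

# toy 1-D input vector, for illustration only
x = np.array([4.0, 4.5, 5.0])
deg = 3

# increasing=True puts x**0 in column 0, x**1 in column 1, ...
D = np.vander(x, N=deg + 1, increasing=True)
print(D)
```

This produces the same matrix as the loop in the next cell, one row per sample and one column per power of the input.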

In [16]:
deg=3
D=np.zeros((numdata,deg+1))
for p in customerInd:
    for ex in range(deg+1):
        D[p][ex]=math.pow(float(customerArray[p][1]),ex)
DT=np.transpose(D)
DTD=np.dot(DT,D)
r=customerArray[:,2]
# Option 1: numerical problems for large degrees and small number of training instances
#w=np.dot(np.dot(np.linalg.inv(DTD),DT),r)
# Option 2: choose this solution since it is numerically more stable
y=np.dot(DT,r)
w=np.linalg.lstsq(DTD,y,rcond=None)[0]
print('Calculated weights w= \n', w)
Calculated weights w= 
[-1374.82684788  1025.96427699  -230.63228071    17.43724377]
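As a sanity check (not part of the original derivation), the normal-equation solution can be compared against NumPy's built-in `np.polyfit`. Note that `np.polyfit` returns coefficients with the highest power first, so the result must be reversed to match $w$. A sketch on synthetic data standing in for the runner samples:

```python
import numpy as np

# synthetic 1-D data (hypothetical stand-in for the runner samples)
rng = np.random.default_rng(0)
x = rng.uniform(4.0, 5.5, size=30)
r = 50.0 + 25.0 * x + rng.normal(0.0, 3.0, size=30)

deg = 3
# normal-equation solution, as in the cell above
D = np.vander(x, N=deg + 1, increasing=True)
w_normal = np.linalg.lstsq(D.T @ D, D.T @ r, rcond=None)[0]

# np.polyfit returns the highest power first, so reverse the coefficients
w_polyfit = np.polyfit(x, r, deg)[::-1]

# the fitted curves should agree closely even if individual coefficients
# differ slightly, due to the conditioning of D^T D
print(np.max(np.abs(D @ w_normal - D @ w_polyfit)))
```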

The learned model and the training samples are plotted below.

In [17]:
plt.scatter(customerArray[:,1],customerArray[:,2],marker='o', color='red')

plt.title('heartrate vs. speed of long distance runners')
plt.xlabel('speed in m/s')
plt.ylabel('heartrate in bpm')
# repeated plot calls draw into the same axes by default (plt.hold is obsolete)
RES=0.05 # resolution of speed-axis
# plot calculated polynomial regression curve
minS=np.min(customerArray[:,1])
maxS=np.max(customerArray[:,1])
speedrange=np.arange(minS,maxS+RES,RES)
hrrange=np.zeros(speedrange.shape[0])
for si,s in enumerate(speedrange):
    hrrange[si]=np.sum([w[d]*s**d for d in range(deg+1)])
plt.plot(speedrange,hrrange)
Out[17]:
[<matplotlib.lines.Line2D at 0xdf10590>]

Finally, the Mean Absolute Distance (MAD) and the Mean Squared Error (MSE) of the model on the training data are calculated.
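For reference, with $\hat{r}_i$ the model prediction for the $i$-th training sample, $r_i$ the measured heart rate, and $N$ the number of samples, the two error measures are
$$ MAD=\frac{1}{N}\sum_{i=1}^{N}\left|\hat{r}_i-r_i\right|, \qquad MSE=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{r}_i-r_i\right)^2 . $$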

In [18]:
pred=np.zeros(numdata)
for si,x in enumerate(customerArray[:,1]):
    pred[si]=np.sum([w[d]*x**d for d in range(deg+1)])

mad=1.0/numdata*np.sum(np.abs(pred-customerArray[:,2]))
mse=1.0/numdata*np.sum((pred-customerArray[:,2])**2)
print('MAD = ',mad)
print('MSE = ',mse)
MAD =  6.9507131235
MSE =  72.8395306513

Question:

Which of the generalized linear models fits best

  • if you just regard MAD and MSE?
  • if you take into account domain knowledge?
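To explore the first question empirically, models of several degrees can be fitted and their training errors compared. A sketch on synthetic data (a hypothetical stand-in for the runner samples), using `np.polyfit` for brevity; note that the training error can only decrease as the degree grows, which is exactly why domain knowledge is needed for the second question:

```python
import numpy as np

# synthetic speed / heart-rate style data (hypothetical stand-in)
rng = np.random.default_rng(1)
speed = rng.uniform(4.0, 5.5, size=30)
hr = 60.0 + 22.0 * speed + rng.normal(0.0, 8.0, size=30)

for deg in range(1, 6):
    coef = np.polyfit(speed, hr, deg)      # least-squares polynomial fit
    pred = np.polyval(coef, speed)
    mad = np.mean(np.abs(pred - hr))
    mse = np.mean((pred - hr) ** 2)
    print(f"degree {deg}: MAD = {mad:.3f}, MSE = {mse:.3f}")

# the training MSE decreases monotonically with the degree, so a model
# should not be selected on training error alone (risk of overfitting)
```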

Same solution, now using Scikit Learn

In [19]:
from sklearn import linear_model
speed=np.transpose(np.atleast_2d(customerArray[:,1]))
for d in range(1,deg):
    newcol=np.transpose(np.atleast_2d(np.power(speed[:,0],d+1)))
    speed=np.concatenate((speed,newcol),axis=1)
heartrate=customerArray[:,2]

# Train Linear Regression Model
reg=linear_model.LinearRegression()
reg.fit(speed,heartrate)
print(reg)

# Parameters of Trained Model
print("Degree = ",deg)
print("Learned coefficients w0, w1, w2, ....:")
wlist=[reg.intercept_]
wlist.extend(reg.coef_)
w=np.array(wlist)
print(w)


# Plot training samples
plt.scatter(speed[:,0],heartrate,marker='o', color='red')
plt.title('heartrate vs. speed of long distance runners')
plt.xlabel('speed in m/s')
plt.ylabel('heartrate in bpm')
# repeated plot calls draw into the same axes by default (plt.hold is obsolete)
for si,s in enumerate(speedrange):
    hrrange[si]=np.sum([w[d]*s**d for d in range(deg+1)])
plt.plot(speedrange,hrrange)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
Degree =  3
Learned coefficients w0, w1, w2, ....:
[-1374.8269195   1025.96432274  -230.63229039    17.43724445]
Out[19]:
[<matplotlib.lines.Line2D at 0xe3bd310>]
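The polynomial feature columns built by hand above can also be generated with scikit-learn's `PolynomialFeatures`, combined with `LinearRegression` in a pipeline. A minimal sketch on synthetic data (a hypothetical stand-in for the speed/heartrate arrays):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic 1-D data (hypothetical stand-in for speed / heartrate)
rng = np.random.default_rng(2)
X = rng.uniform(4.0, 5.5, size=(30, 1))
y = 60.0 + 22.0 * X[:, 0] + rng.normal(0.0, 5.0, size=30)

# PolynomialFeatures adds the powers x^1..x^deg; include_bias=False
# because LinearRegression fits the intercept itself
deg = 3
model = make_pipeline(PolynomialFeatures(degree=deg, include_bias=False),
                      LinearRegression())
model.fit(X, y)
print(model.predict(X[:3]))
```

This avoids the manual `np.concatenate` loop and keeps feature generation and fitting in one object.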