**Author:**Johannes Maucher**Last Update:**10th October, 2016- List of all IPython Notebooks for this lecture

In the previous notebook Credit Worthiness the overall supervised machine learning process has been demonstrated for a linear **classification** example. Now, this process is demonstrated for a **regression** task. Moreover, the current notebook demonstrates how different types of models (generalized linear models) can be learned and compared.

Percentage of body fat for an individual can be estimated once *body density* has been determined. A function which can be applied to calculate body-fat ratio from the measured body density is defined here.
The problem is, that *body density* can hardly be measured. It would be easier to estimate the bodyfat ratio from other parameters, which can easily be measured, e.g. weight, height, chest-circumference,...In this case one must know how these *measurable parameters* correlate with the *hidden target variable bodyfat*. A usually non-deterministic function defines the correlations between measurable variables and the target variable. Sometimes experts can apply their domain knowledge to estimate such a function. This is the old approach! **The data-centric approach is to learn this function from example pairs of of measurable input variables and corresponding target value**. For this approach no domain-knowledge is required - but data and a suitable Machine Learning algorithm.

In this exercise the goal is to learn a function (more generally: a model) that maps the measurable variable *chest circumference* to the target variable *bodyfat*. For this we need a set of training data. Each training element must contain the chest circumference and the measured bodyfat ratio of a person. Since the bodyfat measurement is expensive we have only a small number of training elements.

Actually we apply the bodyfat dataset, which is provided and described at Data. For use in the following python program the same dataset without description can be downloaded from here. This dataset contains data of 252 persons. Each person is described in a single row by 15 features. In the following playground-experiment we use data of only **20 persons** and we do not apply all features, but only

- the single input feature
**chest circumference**, contained in the 7th column (column index 6) - the target parameter, which we like to predict:
**bodyfat ratio**, contained in the 2nd column (column index 1)

Import of required modules:

In [1]:

```
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from sklearn import linear_model
from sklearn.cross_validation import cross_val_score
from sklearn import metrics
from IPython.display import Image
```

The entire dataset - 252 persons, each described by 15 parameters - is imported into the program:

In [2]:

```
np.set_printoptions(precision=3,suppress=True)
infileName='./Res/bodyfat.txt'
inp=open(infileName,'r')
plist=[];
for row in inp.readlines():
k=row.split()
u=[float(a) for a in k]
plist.append(u)
numInstances=len(plist);
numTrain=20
print 'Number of available persons: ',numInstances
print 'Number of persons used for training: ',numTrain
```

Now the list *plist* consists of 252 rows and each row describes the parameters of a single person. Since the file *bodyfat.txt* does not contain a header, we define a list *columslist*, which contains the parameter names. I.e. the *i.th* element in this list is the parameter, which is contained in the i.th column of the input file.

In [3]:

```
columnslist=["Density","Body Fat","Age","Weight","Height","Neck circumference",
"Chest circumference","Abdomen circumference","Hip circumference",
"Thigh circumference","Knee circumference","Ankle circumference",
"Biceps circumference","Forearm circumference","Wrist circumference"]
```

As mentioned above we pretend to have only 20 training elements. The chest circumference and the corresponding bodyfat ratio of the 20 training elements are plotted below.

In [4]:

```
columnIdx=7 #Choose a feature index between 2 (Age) and 14 (Wrist circumference). Chest circumference is at column-Index 6
dataArray=np.array(plist)
trainIdx=np.arange(numTrain)
feature=dataArray[trainIdx,columnIdx]
targets=dataArray[trainIdx,1]
plt.plot(feature,targets,'sb',label='training data')
plt.xlabel(columnslist[columnIdx])
plt.ylabel('Ratio of bodyfat [%]')
plt.title('Data of the first '+str(numTrain)+' persons')
print columnslist[columnIdx][:8]+" | Bodyfat"
print np.transpose(np.array([feature,targets]))
#plt.hold(True)
#plt.draw()
```

Given the training data we now like to learn a linear function, that best fits to the data. For this we apply the *LinearRegression* class from *scikit learn*. The method *fit(features,targets)* takes the features (chest circumference) and the targets (bodyfat ratio) and trains a linear model. After the training we access the parameters of the linear model, which uniquely define the best line fitting the training data. The parameter *b=intercept_* keeps the bias and the parameter *c=coef_* keeps the gradient of the learned line. Our linear prediction model is defined by

where $bf$ denotes bodyfat ration and $cc$ denotes chest circumference.

In [5]:

```
features=np.transpose(np.atleast_2d(feature))
minF=np.min(features)
maxF=np.max(features)
regressor=linear_model.LinearRegression()
regressor.fit(features,targets)
c= regressor.coef_
b= regressor.intercept_
print "Learned parameters of the linear model: \nBias = %2.2f, \ngradient =%2.2f" %(b,c)
```

The learned *LinearRegression*-object can be applied for predicting new bodyfat ratios for arbitrary chest circumferences. This is realized by calling the object's method *predict(features)*. The learned line is plotted in the graph below.

In [6]:

```
plt.plot(feature,targets,'sb',label='training data')
linearpred=regressor.predict(features)
plt.plot(features,linearpred,'r',label='best linear fit')
plt.xlabel(columnslist[columnIdx])
plt.ylabel('Ratio of bodyfat [%]')
plt.title('Learned linear function')
plt.legend(loc=2,numpoints=1)
```

Out[6]:

The *LinearRegression()*-class can also be used for learning a quadratic function. For this we generate a second pseudo-feature by just calculating the square of the first feature. Thus the new feature matrix has two columns:

In [7]:

```
features1=np.transpose(np.atleast_2d(feature))
features2=np.power(np.transpose(np.atleast_2d(feature)),2)
featuresQuad=np.concatenate([features1,features2],axis=1)
print "Feature is: ",columnslist[columnIdx]
print " Feature | Feature^2 | bodyfat"
print np.concatenate([featuresQuad,np.transpose(np.atleast_2d(targets))],axis=1)
```

We apply this features together with the targets to learn a generalized linear model. The learned model now defines a quadratic function of type

\begin{equation} bf=b + c[0] \cdot cc + c[1] \cdot cc^2, \end{equation}where $bf$ denotes bodyfat ration and $cc$ denotes chest circumference.

In [8]:

```
regressor2=linear_model.LinearRegression()
regressor2.fit(featuresQuad,targets)
c= regressor2.coef_
b= regressor2.intercept_
print """Learned parameters of the quadratic model:
\nBias = %2.2f,
\nweight of linear term c[0] =%2.2f,
\nweight of quadratic term c[1] =%2.2f""" %(b,c[0],c[1])
```

The quadratic function that best fits the given training data is plotted in the graph below.

In [9]:

```
quadpred=regressor2.predict(featuresQuad)
plt.plot(feature,targets,'sb',label='training data')
predFat=[b+c[0]*x+c[1]*np.power(x,2) for x in np.arange(minF,maxF+1)]
plt.plot(np.arange(minF,maxF+1),predFat,'r',label='best quadratic fit')
plt.xlabel(columnslist[columnIdx])
plt.ylabel('Ratio of bodyfat [%]')
plt.title('Learned quadratic function')
plt.legend(loc=2,numpoints=1)
```

Out[9]:

In the same way as the best qudratic function has been learned by a *LinearRegression()*-model, the best cubic function can be learned. Now we have to introduce a third pseudo-feature, which is just the 3rd power of the chest circumference. The function to be learned is of type:

where $bf$ denotes bodyfat ration and $cc$ denotes chest circumference.

In [10]:

```
features3=np.power(np.transpose(np.atleast_2d(feature)),3)
featuresCubic=np.concatenate([featuresQuad,features3],axis=1)
print "Feature is: ",columnslist[columnIdx]
print " Feature | Feature^2 | Feature^3 | bodyfat"
print np.concatenate([featuresCubic,np.transpose(np.atleast_2d(targets))],axis=1)
```

In [11]:

```
regressor3=linear_model.LinearRegression()
regressor3.fit(featuresCubic,targets)
c= regressor3.coef_
b= regressor3.intercept_
print """Learned parameters of the cubic model:
\nBias = %2.2f,
\nweight of linear term c[0] =%2.2f,
\nweight of quadratic term c[1] =%2.2f
\nweight of cubic term c[2] =%2.2f""" %(b,c[0],c[1],c[2])
```

In [12]:

```
cubicpred=regressor3.predict(featuresCubic)
plt.plot(feature,targets,'sb',label='training data')
predFat=[b+c[0]*x+c[1]*np.power(x,2)+c[2]*np.power(x,3) for x in np.arange(minF,maxF+1)]
plt.plot(np.arange(minF,maxF+1),predFat,'r',label='best cubic fit')
plt.xlabel(columnslist[columnIdx])
plt.ylabel('Ratio of bodyfat [%]')
plt.title('Learned cubic function')
plt.legend(loc=2,numpoints=1)
```

Out[12]:

In order to determine the quality of the learned model we can just calculate the *mean square error (mse)* and/or the *mean absolute difference (mad)* between the actual targets of the the training data and the values predicted by the model:

In [13]:

```
mseLin=1.0/numTrain*np.sum(np.power((linearpred-targets),2))
madLin=1.0/numTrain*np.sum(np.abs(linearpred-targets))
print "Mean Square Error on training data: ",mseLin
print "Mean Absolute Difference on training data: ",madLin
```

In [14]:

```
mseQuad=1.0/numTrain*np.sum(np.power((quadpred-targets),2))
madQuad=1.0/numTrain*np.sum(np.abs(quadpred-targets))
print "Mean Square Error on training data: ",mseQuad
print "Mean Absolute Difference on training data: ",madQuad
```

In [15]:

```
mseCubic=1.0/numTrain*np.sum(np.power((cubicpred-targets),2))
madCubic=1.0/numTrain*np.sum(np.abs(cubicpred-targets))
print "Mean Square Error on training data: ",mseCubic
print "Mean Absolute Difference on training data: ",madCubic
```

- Which of the three models is the best?
- Why is this type of evaluating the performance of a model inadequate?
- How should it be done?
- What is overfitting?

From the fact that a model fits well to the training data one can not conclude that it works well for new datasamples that have not been used for training. A model can be adapted specifically to the training data but shows weak generalization performance. This is called **ooverfitting**. We like to determine the model which generalizes best. This can be achieved by testing on labeled datasets, which have not used for training. This implies that not all available labeled datasamples can be applied for training. The set of labeled data must be partitioned into a training set and a test set.

In our example we have lots of labeled data that can be applied for testing, since we used only a small set for training. We use all samples that have not been applied for training in order to test the three models. Note that the models are already learned. We just use other input data and calculate the model's predictions for this test data. Finally the MSE and MAD is caluclated by comparing the model's prediction with the actual targets.

In [16]:

```
testIdx=np.arange(numTrain,numInstances)
numTest=numInstances-numTrain
chestTest=dataArray[testIdx,columnIdx]
targetsTest=dataArray[testIdx,1]
featuresTest=np.transpose(np.atleast_2d(chestTest))
featuresTest2=np.power(featuresTest,2)
featuresTest3=np.power(featuresTest,3)
featuresTestQuad=np.concatenate([featuresTest,featuresTest2],axis=1)
featuresTestCubic=np.concatenate([featuresTest,featuresTest2,featuresTest3],axis=1)
linearpredTest=regressor.predict(featuresTest)
quadpredTest=regressor2.predict(featuresTestQuad)
cubicpredTest=regressor3.predict(featuresTestCubic)
```

In [17]:

```
mseLinTest=1.0/numTest*np.sum(np.power((linearpredTest-targetsTest),2))
madLinTest=1.0/numTest*np.sum(np.abs(linearpredTest-targetsTest))
print "Mean Square Error on test data: ",mseLinTest
print "Mean Absolute Difference on test data: ",madLinTest
```

In [18]:

```
mseCubicTest=1.0/numTest*np.sum(np.power((cubicpredTest-targetsTest),2))
madCubicTest=1.0/numTest*np.sum(np.abs(cubicpredTest-targetsTest))
print "Mean Square Error on test data: ",mseCubicTest
print "Mean Absolute Difference on test data: ",madCubicTest
```

In [19]:

```
mseQuadTest=1.0/numTest*np.sum(np.power((quadpredTest-targetsTest),2))
madQuadTest=1.0/numTest*np.sum(np.abs(quadpredTest-targetsTest))
print "Mean Square Error on test data: ",mseQuadTest
print "Mean Absolute Difference on test data: ",madQuadTest
```

**Questions:**

- Which model would you prefer now?
- Compare the learned parameters (coefficients of the learned function) of the good and the bad models. What do you conclude?

In the case that the number of labeled data is too low for partitioning it into a sufficiently large training and a sufficiently large test set, one usually applies x-fold cross validation. The integer $x$ is often selected to be $10$. For 10-fold cross the entire set of labeled data is partitioned into $x=10$ partitions of approximately the same size. The overall training and test is performed in $x=10$ iterations. In the i.th iteration partition $i$ is used for testing and all other partitions are applied for training the model. In each iteration a model is trained with the training partitions and tested with the test partition. The overall performance criteria is then the mean over all iteration-specific performance values (e.g. the mean over all mse).

In [20]:

```
i = Image(filename='./Res/CrossValidation.jpg')
i
```

Out[20]:

*Scikit-learn* provides the *cross_val_score*-method for x-fold crossvalidation. Arguments of this method are the set of all features, all targets and an object of the learner-model class. The function returns a *scores*-object, which contains the performance values of all iterations:

In [21]:

```
allchest=dataArray[:,columnIdx]
alltargets=dataArray[:,1]
allfeatures=np.transpose(np.atleast_2d(allchest))
allfeatures2=np.power(allfeatures,2)
allfeatures3=np.power(allfeatures,3)
allfeaturesQuad=np.concatenate([allfeatures,allfeatures2],axis=1)
allfeaturesCubic=np.concatenate([allfeatures,allfeatures2,allfeatures3],axis=1)
regressor1=linear_model.LinearRegression()
regressor2=linear_model.LinearRegression()
regressor3=linear_model.LinearRegression()
```

In [22]:

```
#scores = cross_val_score(regressor1,allfeatures,alltargets,cv=10,score_func=metrics.mean_squared_error)
scores = cross_val_score(regressor1,allfeatures,alltargets,cv=10,scoring="mean_squared_error")
print scores
print "Linear Model Cross Validation Score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() / 2)
```

In [23]:

```
scores = cross_val_score(regressor2,allfeaturesQuad,alltargets,cv=10,scoring="mean_squared_error")
print scores
print "Quadratic Model Cross Validation Score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() / 2)
```

In [24]:

```
scores = cross_val_score(regressor3,allfeaturesCubic,alltargets,cv=10,scoring="mean_squared_error")
print scores
print "Cubic Model Cross Validation Score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() / 2)
```

- We performed linear and generalized linear regression
- In this experiment only one feature (chest circumference) has been applied
- For regression MSE or MAD are typical performance measures
- A good model is one whose predictions are close to the the actual labels. But this comparison must be done on test datasamples, i.e. on samples that have not been used for training.
- A model is overfitted, if it performs well on the training data, but relatively bad on the test data
- X-fold crossvalidation is the standard method for evaluating a supervised learner, both for regression and classification.

**The MSE of cross validation is still large. How can it be improved?**

In all of the previous experiments we applied only one feature, the chest circumference, for predicting bodyfat ratio. We have seen that all of the learned models yielded a relatively large MSE. In general prediction is weak, if there are **latent** parameters, i.e. parameters that have not been used in the model but have a considerable influence on the target that shall be predicted. In our example it is obvious that bodyfat ratio depends not only on chest circumference but many other features, such as weight, height, age etc. Since our dataset contains much more features than only chest circumference, we can apply all of them for learning a linear model for bodyfat regression. This is implemented in the code snippet below. Here, we only apply a linear model and perform 10-fold cross-validation.

In [25]:

```
allfeatures=dataArray[:,2:] # now the applied features are columns 2 to 14
alltargets=dataArray[:,1]
regressor1=linear_model.LinearRegression()
scores = cross_val_score(regressor1,allfeatures,alltargets,cv=10,scoring="mean_squared_error")
print scores
print "Linear Model Cross Validation Score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() / 2)
```

The MSE is now much better than in the case of applying only a single feature.

- What is a good model?
- What is overfitting?
- Training and testing both require a sufficient amount of labeled data samples. What if only a relatively small number of labeled data is available?
- In these experiments the cross validation MSE is much lower than the MSE of the previous tests with the fixed training and test set. Why?
- Is it always better to apply as much as possible features?
- In this notebook first a single feature (chest circumference) then, in the last section, all features have been applied. Suggest a better method for feature analysis.
- Compare the conventional approach of expert-based modelling with the data-based modelling. Which type of knowledge is required?

By plotting the target value bodyfat-percentage over all the available measurement variables (columns 2 to 14) the most informative parameter can be determined. Which single parameter would be the best to determine bodyfat-percentage?

In [26]:

```
plt.figure(num=None, figsize=(18, 28), dpi=80, facecolor='w', edgecolor='k')
plt.subplot(7,2,1)
targets=dataArray[:,1]
for featIdx in range(2,14):
plt.subplot(7,2,featIdx-1)
feats=dataArray[:,featIdx]
plt.plot(feats,targets,'ob')
plt.title(columnslist[featIdx])
```

In [ ]:

```
```