Regression Example: Estimation of bodyfat from chest circumference

In the previous notebook, Credit Worthiness, the overall supervised machine learning process was demonstrated for a linear classification example. Now this process is demonstrated for a regression task. Moreover, the current notebook shows how different types of models (generalized linear models) can be learned and compared.

The percentage of body fat of an individual can be calculated once the body density has been determined. A function which can be applied to calculate the bodyfat ratio from the measured body density is defined here. The problem is that body density can hardly be measured. It would be easier to estimate the bodyfat ratio from other parameters which can easily be measured, e.g. weight, height, chest circumference, etc. In this case one must know how these measurable parameters correlate with the hidden target variable bodyfat. A usually non-deterministic function defines the correlation between the measurable variables and the target variable. Sometimes experts can apply their domain knowledge to estimate such a function. This is the old approach! The data-centric approach is to learn this function from example pairs of measurable input variables and corresponding target values. For this approach no domain knowledge is required - only data and a suitable Machine Learning algorithm.
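As an illustration of such a density-to-bodyfat function, the sketch below uses Siri's equation, as reported in the dataset's documentation; that the notebook's original helper used exactly this formula is an assumption:

def bodyfat_from_density(density):
    # Siri's equation: bodyfat percentage from body density [g/ccm]
    # (assumed here; see the dataset documentation)
    return 495.0/density - 450.0

print(bodyfat_from_density(1.0708))  # ~12.3% for the first person in the dataset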

In this exercise the goal is to learn a function (more generally: a model) that maps the measurable variable chest circumference to the target variable bodyfat. For this we need a set of training data. Each training element must contain the chest circumference and the measured bodyfat ratio of a person. Since the bodyfat measurement is expensive we have only a small number of training elements.

We apply the bodyfat dataset, which is provided and described at Data. For use in the following Python program, the same dataset without the description can be downloaded from here. This dataset contains data of 252 persons. Each person is described in a single row by 15 features. In the following playground experiment we use the data of only 20 persons, and we do not apply all features, but only

  • the single input feature chest circumference, contained in the 7th column (column index 6)
  • the target parameter, which we like to predict: bodyfat ratio, contained in the 2nd column (column index 1)

Import of required modules:

In [1]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from sklearn import linear_model
from sklearn.cross_validation import cross_val_score
from sklearn import metrics
from IPython.display import Image

Read data from textfile

The entire dataset - 252 persons, each described by 15 parameters - is imported into the program:

In [2]:
np.set_printoptions(precision=3,suppress=True)
infileName='./Res/bodyfat.txt'
inp=open(infileName,'r')
plist=[]
for row in inp.readlines():
    k=row.split()               # split row into whitespace-separated tokens
    u=[float(a) for a in k]     # convert each token to float
    plist.append(u)
inp.close()
numInstances=len(plist)
numTrain=20
print 'Number of available persons:         ',numInstances
print 'Number of persons used for training: ',numTrain
Number of available persons:          252
Number of persons used for training:  20

Now the list plist consists of 252 rows and each row describes the parameters of a single person. Since the file bodyfat.txt does not contain a header, we define a list columnslist, which contains the parameter names, i.e. the i-th element of this list is the name of the parameter contained in the i-th column of the input file.

In [3]:
columnslist=["Density","Body Fat","Age","Weight","Height","Neck circumference",
             "Chest circumference","Abdomen circumference","Hip circumference",
             "Thigh circumference","Knee circumference","Ankle circumference",
             "Biceps circumference","Forearm circumference","Wrist circumference"]

Training data

As mentioned above we pretend to have only 20 training elements. The chest circumference and the corresponding bodyfat ratio of the 20 training elements are plotted below.

In [4]:
columnIdx=7  #Choose a feature index between 2 (Age) and 14 (Wrist circumference). Note: index 6 is Chest circumference as described above; the value 7 used here selects Abdomen circumference (see the printed output below)
dataArray=np.array(plist)
trainIdx=np.arange(numTrain)
feature=dataArray[trainIdx,columnIdx]
targets=dataArray[trainIdx,1]
plt.plot(feature,targets,'sb',label='training data')
plt.xlabel(columnslist[columnIdx])
plt.ylabel('Ratio of bodyfat [%]')
plt.title('Data of the first '+str(numTrain)+' persons')
print columnslist[columnIdx][:8]+" | Bodyfat"
print np.transpose(np.array([feature,targets]))
Abdomen  | Bodyfat
[[  85.2   12.3]
 [  83.     6.1]
 [  87.9   25.3]
 [  86.4   10.4]
 [ 100.    28.7]
 [  94.4   20.9]
 [  90.7   19.2]
 [  88.5   12.4]
 [  82.5    4.1]
 [  88.6   11.7]
 [  83.6    7.1]
 [  90.9    7.8]
 [  91.6   20.8]
 [ 101.8   21.2]
 [  96.4   22.1]
 [  92.8   20.9]
 [  96.4   29. ]
 [  97.5   22.9]
 [  89.6   16. ]
 [ 100.5   16.5]]

Learning a linear function that maps chest circumference to bodyfat

Given the training data, we would now like to learn the linear function that best fits the data. For this we apply the LinearRegression class from scikit-learn. The method fit(features,targets) takes the features (chest circumference) and the targets (bodyfat ratio) and trains a linear model. After the training we can access the parameters of the linear model, which uniquely define the best line fitting the training data. The parameter b=intercept_ holds the bias and the parameter c=coef_ holds the slope of the learned line. Our linear prediction model is defined by

\begin{equation} bf = b + c[0] \cdot cc, \end{equation}

where $bf$ denotes the bodyfat ratio and $cc$ denotes the chest circumference.

In [5]:
features=np.transpose(np.atleast_2d(feature))
minF=np.min(features)
maxF=np.max(features)
regressor=linear_model.LinearRegression()
regressor.fit(features,targets)
c= regressor.coef_
b= regressor.intercept_
print "Learned parameters of the linear model: \nBias = %2.2f, \ngradient =%2.2f" %(b,c)
Learned parameters of the linear model: 
Bias = -69.09, 
gradient =0.94
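
As a quick plausibility check, the learned parameters can be applied manually and compared with the regressor's own prediction; the circumference value of 90 cm is just an arbitrary example:

cc=90.0  # example circumference in cm (arbitrary value)
print("Manual prediction:   %2.2f" % (b+c[0]*cc))
print("regressor.predict(): %2.2f" % regressor.predict([[cc]])[0])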

Applying the learned model for prediction

The learned LinearRegression object can be applied to predict bodyfat ratios for arbitrary chest circumferences. This is realized by calling the object's method predict(features). The learned line is plotted in the graph below.

In [6]:
plt.plot(feature,targets,'sb',label='training data')
linearpred=regressor.predict(features)
plt.plot(features,linearpred,'r',label='best linear fit')
plt.xlabel(columnslist[columnIdx])
plt.ylabel('Ratio of bodyfat [%]')
plt.title('Learned linear function')
plt.legend(loc=2,numpoints=1)
Out[6]:
<matplotlib.legend.Legend at 0x9027a20>

Generalized linear regression: Learning the best quadratic function

The LinearRegression()-class can also be used for learning a quadratic function. For this we generate a second pseudo-feature by just calculating the square of the first feature. Thus the new feature matrix has two columns:

In [7]:
features1=np.transpose(np.atleast_2d(feature))
features2=np.power(np.transpose(np.atleast_2d(feature)),2)
featuresQuad=np.concatenate([features1,features2],axis=1)
print "Feature is: ",columnslist[columnIdx]
print "   Feature  | Feature^2    | bodyfat"
print np.concatenate([featuresQuad,np.transpose(np.atleast_2d(targets))],axis=1)
Feature is:  Abdomen circumference
   Feature  | Feature^2    | bodyfat
[[    85.2    7259.04     12.3 ]
 [    83.     6889.        6.1 ]
 [    87.9    7726.41     25.3 ]
 [    86.4    7464.96     10.4 ]
 [   100.    10000.       28.7 ]
 [    94.4    8911.36     20.9 ]
 [    90.7    8226.49     19.2 ]
 [    88.5    7832.25     12.4 ]
 [    82.5    6806.25      4.1 ]
 [    88.6    7849.96     11.7 ]
 [    83.6    6988.96      7.1 ]
 [    90.9    8262.81      7.8 ]
 [    91.6    8390.56     20.8 ]
 [   101.8   10363.24     21.2 ]
 [    96.4    9292.96     22.1 ]
 [    92.8    8611.84     20.9 ]
 [    96.4    9292.96     29.  ]
 [    97.5    9506.25     22.9 ]
 [    89.6    8028.16     16.  ]
 [   100.5   10100.25     16.5 ]]
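
As a side note, such pseudo-features need not be built by hand: scikit-learn's PolynomialFeatures transformer generates them automatically. The following sketch assumes a scikit-learn version that already provides sklearn.preprocessing.PolynomialFeatures:

from sklearn.preprocessing import PolynomialFeatures

# degree=2 and include_bias=False yield the columns [cc, cc^2],
# i.e. the same matrix as featuresQuad above
poly=PolynomialFeatures(degree=2,include_bias=False)
featuresQuadAlt=poly.fit_transform(features1)
print(np.allclose(featuresQuad,featuresQuadAlt))  # should print True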

We apply these features together with the targets to learn a generalized linear model. The learned model now defines a quadratic function of the type

\begin{equation} bf=b + c[0] \cdot cc + c[1] \cdot cc^2, \end{equation}

where $bf$ denotes the bodyfat ratio and $cc$ denotes the chest circumference.

In [8]:
regressor2=linear_model.LinearRegression()
regressor2.fit(featuresQuad,targets)
c= regressor2.coef_
b= regressor2.intercept_
print """Learned parameters of the quadratic model: 
\nBias = %2.2f, 
\nweight of linear term c[0] =%2.2f, 
\nweight of quadratic term c[1] =%2.2f""" %(b,c[0],c[1])
Learned parameters of the quadratic model: 

Bias = -609.85, 

weight of linear term c[0] =12.74, 

weight of quadratic term c[1] =-0.06

The quadratic function that best fits the given training data is plotted in the graph below.

In [9]:
quadpred=regressor2.predict(featuresQuad)
plt.plot(feature,targets,'sb',label='training data')
predFat=[b+c[0]*x+c[1]*np.power(x,2) for x in np.arange(minF,maxF+1)]
plt.plot(np.arange(minF,maxF+1),predFat,'r',label='best quadratic fit')
plt.xlabel(columnslist[columnIdx])
plt.ylabel('Ratio of bodyfat [%]')
plt.title('Learned quadratic function')
plt.legend(loc=2,numpoints=1)
Out[9]:
<matplotlib.legend.Legend at 0xa834cf8>

Learning the best cubic function

In the same way as the best quadratic function has been learned by a LinearRegression() model, the best cubic function can be learned. Now we have to introduce a third pseudo-feature, which is just the 3rd power of the chest circumference. The function to be learned is of the type:

\begin{equation} bf = b + c[0]\cdot cc + c[1] \cdot cc^2 + c[2] \cdot cc^3, \end{equation}

where $bf$ denotes the bodyfat ratio and $cc$ denotes the chest circumference.

In [10]:
features3=np.power(np.transpose(np.atleast_2d(feature)),3)
featuresCubic=np.concatenate([featuresQuad,features3],axis=1)
print "Feature is: ",columnslist[columnIdx]
print "     Feature    |   Feature^2  |  Feature^3    |    bodyfat"
print np.concatenate([featuresCubic,np.transpose(np.atleast_2d(targets))],axis=1)
Feature is:  Abdomen circumference
     Feature    |   Feature^2  |  Feature^3    |    bodyfat
[[      85.2       7259.04    618470.208       12.3  ]
 [      83.        6889.      571787.           6.1  ]
 [      87.9       7726.41    679151.439       25.3  ]
 [      86.4       7464.96    644972.544       10.4  ]
 [     100.       10000.     1000000.          28.7  ]
 [      94.4       8911.36    841232.384       20.9  ]
 [      90.7       8226.49    746142.643       19.2  ]
 [      88.5       7832.25    693154.125       12.4  ]
 [      82.5       6806.25    561515.625        4.1  ]
 [      88.6       7849.96    695506.456       11.7  ]
 [      83.6       6988.96    584277.056        7.1  ]
 [      90.9       8262.81    751089.429        7.8  ]
 [      91.6       8390.56    768575.296       20.8  ]
 [     101.8      10363.24   1054977.832       21.2  ]
 [      96.4       9292.96    895841.344       22.1  ]
 [      92.8       8611.84    799178.752       20.9  ]
 [      96.4       9292.96    895841.344       29.   ]
 [      97.5       9506.25    926859.375       22.9  ]
 [      89.6       8028.16    719323.136       16.   ]
 [     100.5      10100.25   1015075.125       16.5  ]]
In [11]:
regressor3=linear_model.LinearRegression()
regressor3.fit(featuresCubic,targets)
c= regressor3.coef_
b= regressor3.intercept_
print """Learned parameters of the cubic model: 
\nBias = %2.2f, 
\nweight of linear term c[0] =%2.2f, 
\nweight of quadratic term c[1] =%2.2f
\nweight of cubic term c[2] =%2.2f""" %(b,c[0],c[1],c[2])
Learned parameters of the cubic model: 

Bias = 2378.44, 

weight of linear term c[0] =-85.35, 

weight of quadratic term c[1] =1.01

weight of cubic term c[2] =-0.00
In [12]:
cubicpred=regressor3.predict(featuresCubic)
plt.plot(feature,targets,'sb',label='training data')
predFat=[b+c[0]*x+c[1]*np.power(x,2)+c[2]*np.power(x,3) for x in np.arange(minF,maxF+1)]
plt.plot(np.arange(minF,maxF+1),predFat,'r',label='best cubic fit')
plt.xlabel(columnslist[columnIdx])
plt.ylabel('Ratio of bodyfat [%]')
plt.title('Learned cubic function')
plt.legend(loc=2,numpoints=1)
Out[12]:
<matplotlib.legend.Legend at 0x90352e8>

Determining the quality of the learned model: A first (but inadequate) approach

In order to determine the quality of the learned model we can just calculate the mean square error (MSE) and/or the mean absolute difference (MAD) between the actual targets of the training data and the values predicted by the model:
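
For $N$ samples with actual targets $y_i$ and predictions $\hat{y}_i$ these measures are defined as

\begin{equation} mse = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2, \qquad mad = \frac{1}{N}\sum_{i=1}^{N} |\hat{y}_i - y_i|. \end{equation}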

Quality of the linear model

In [13]:
mseLin=1.0/numTrain*np.sum(np.power((linearpred-targets),2))
madLin=1.0/numTrain*np.sum(np.abs(linearpred-targets))
print "Mean Square Error on training data:        ",mseLin
print "Mean Absolute Difference on training data: ",madLin
Mean Square Error on training data:         23.5096942936
Mean Absolute Difference on training data:  3.77246691819

Quality of the quadratic model

In [14]:
mseQuad=1.0/numTrain*np.sum(np.power((quadpred-targets),2))
madQuad=1.0/numTrain*np.sum(np.abs(quadpred-targets))
print "Mean Square Error on training data:        ",mseQuad
print "Mean Absolute Difference on training data: ",madQuad
Mean Square Error on training data:         19.2345133533
Mean Absolute Difference on training data:  2.92221991418

Quality of the cubic model

In [15]:
mseCubic=1.0/numTrain*np.sum(np.power((cubicpred-targets),2))
madCubic=1.0/numTrain*np.sum(np.abs(cubicpred-targets))
print "Mean Square Error on training data:        ",mseCubic
print "Mean Absolute Difference on training data: ",madCubic
Mean Square Error on training data:         18.8588406506
Mean Absolute Difference on training data:  2.8838021953
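
The same values can also be computed with the metrics module imported at the top; this assumes that metrics.mean_squared_error and metrics.mean_absolute_error behave as in current scikit-learn versions:

print("MSE (metrics): %.4f" % metrics.mean_squared_error(targets,cubicpred))
print("MAD (metrics): %.4f" % metrics.mean_absolute_error(targets,cubicpred))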

Questions

  • Which of the three models is the best?
  • Why is this type of evaluating the performance of a model inadequate?
  • How should it be done?
  • What is overfitting?

Determining the quality of the learned model: The correct way

From the fact that a model fits the training data well one cannot conclude that it works well for new data samples that have not been used for training. A model can be adapted specifically to the training data but show weak generalization performance. This is called overfitting. We would like to determine the model which generalizes best. This can be achieved by testing on labeled data that has not been used for training. This implies that not all available labeled samples can be applied for training. The set of labeled data must be partitioned into a training set and a test set.
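
Scikit-learn provides a convenience function for such a random partition; a minimal sketch, assuming train_test_split is available in the sklearn.cross_validation module used above (in this notebook, however, the split is performed manually below):

from sklearn.cross_validation import train_test_split

# hold out 30% of the labeled samples for testing (test_size is an example value)
Xtrain,Xtest,ytrain,ytest=train_test_split(
    dataArray[:,[columnIdx]],dataArray[:,1],test_size=0.3,random_state=0)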

In our example we have plenty of labeled data that can be applied for testing, since we used only a small set for training. We use all samples that have not been applied for training in order to test the three models. Note that the models have already been learned. We just feed the new input data into the models and calculate their predictions for this test data. Finally the MSE and MAD are calculated by comparing each model's predictions with the actual targets.

In [16]:
testIdx=np.arange(numTrain,numInstances)
numTest=numInstances-numTrain
chestTest=dataArray[testIdx,columnIdx]
targetsTest=dataArray[testIdx,1]
featuresTest=np.transpose(np.atleast_2d(chestTest))
featuresTest2=np.power(featuresTest,2)
featuresTest3=np.power(featuresTest,3)
featuresTestQuad=np.concatenate([featuresTest,featuresTest2],axis=1)
featuresTestCubic=np.concatenate([featuresTest,featuresTest2,featuresTest3],axis=1)
linearpredTest=regressor.predict(featuresTest)
quadpredTest=regressor2.predict(featuresTestQuad)
cubicpredTest=regressor3.predict(featuresTestCubic)
In [17]:
mseLinTest=1.0/numTest*np.sum(np.power((linearpredTest-targetsTest),2))
madLinTest=1.0/numTest*np.sum(np.abs(linearpredTest-targetsTest))
print "Mean Square Error on test data:        ",mseLinTest
print "Mean Absolute Difference on test data: ",madLinTest
Mean Square Error on test data:         37.396164004
Mean Absolute Difference on test data:  4.68804845865
In [18]:
mseCubicTest=1.0/numTest*np.sum(np.power((cubicpredTest-targetsTest),2))
madCubicTest=1.0/numTest*np.sum(np.abs(cubicpredTest-targetsTest))
print "Mean Square Error on test data:        ",mseCubicTest
print "Mean Absolute Difference on test data: ",madCubicTest
Mean Square Error on test data:         3599.22744243
Mean Absolute Difference on test data:  14.0123831714
In [19]:
mseQuadTest=1.0/numTest*np.sum(np.power((quadpredTest-targetsTest),2))
madQuadTest=1.0/numTest*np.sum(np.abs(quadpredTest-targetsTest))
print "Mean Square Error on test data:        ",mseQuadTest
print "Mean Absolute Difference on test data: ",madQuadTest
Mean Square Error on test data:         259.839338909
Mean Absolute Difference on test data:  8.85756835121

Questions:

  • Which model would you prefer now?
  • Compare the learned parameters (coefficients of the learned function) of the good and the bad models. What do you conclude?

Cross Validation

In the case that the number of labeled data samples is too low for partitioning it into a sufficiently large training and a sufficiently large test set, one usually applies x-fold cross validation. The integer $x$ is often selected to be $10$. For 10-fold cross validation the entire set of labeled data is partitioned into $x=10$ partitions of approximately the same size. The overall training and testing is performed in $x=10$ iterations: in the i-th iteration partition $i$ is used for testing and all other partitions are applied for training the model. The overall performance criterion is then the mean over all iteration-specific performance values (e.g. the mean over all MSEs).
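
The following sketch spells this procedure out for the linear model; it assumes the KFold iterator from the same sklearn.cross_validation module that is imported above:

from sklearn.cross_validation import KFold

def cv_mse(features,targets,n_folds=10):
    # plain x-fold cross validation: in each iteration train on x-1
    # partitions and compute the MSE on the remaining partition
    mses=[]
    for trainIdx,testIdx in KFold(len(targets),n_folds=n_folds):
        reg=linear_model.LinearRegression()
        reg.fit(features[trainIdx],targets[trainIdx])
        pred=reg.predict(features[testIdx])
        mses.append(np.mean(np.power(pred-targets[testIdx],2)))
    return np.mean(mses)  # mean over all iteration-specific MSEs

E.g. cv_mse(allfeatures,alltargets), with the arrays defined below, should agree with the cross_val_score results (up to sign).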

In [20]:
i = Image(filename='./Res/CrossValidation.jpg')
i
Out[20]:

Scikit-learn provides the cross_val_score method for x-fold cross validation. Arguments of this method are an object of the learner class, the set of all features and all targets. The function returns a scores array, which contains the performance values of all iterations. Note that scikit-learn scorers follow a greater-is-better convention, so with scoring="mean_squared_error" the negated MSE is returned; this is why the values below are negative:

In [21]:
allchest=dataArray[:,columnIdx]
alltargets=dataArray[:,1]
allfeatures=np.transpose(np.atleast_2d(allchest))
allfeatures2=np.power(allfeatures,2)
allfeatures3=np.power(allfeatures,3)
allfeaturesQuad=np.concatenate([allfeatures,allfeatures2],axis=1)
allfeaturesCubic=np.concatenate([allfeatures,allfeatures2,allfeatures3],axis=1)

regressor1=linear_model.LinearRegression()
regressor2=linear_model.LinearRegression()
regressor3=linear_model.LinearRegression()
In [22]:
scores = cross_val_score(regressor1,allfeatures,alltargets,cv=10,scoring="mean_squared_error")
print scores
print "Linear Model Cross Validation Score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() / 2)
[-31.989 -34.282 -21.123 -27.323 -15.23  -25.135 -17.578 -21.209 -34.904
 -13.259]
Linear Model Cross Validation Score: -24.203 (+/- 3.708)
In [23]:
scores = cross_val_score(regressor2,allfeaturesQuad,alltargets,cv=10,scoring="mean_squared_error")
print scores
print "Quadratic Model Cross Validation Score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() / 2)
[-34.326 -27.665 -18.931 -26.749 -13.565 -24.713 -19.019 -21.116 -38.27
 -12.74 ]
Quadratic Model Cross Validation Score: -23.709 (+/- 3.954)
In [24]:
scores = cross_val_score(regressor3,allfeaturesCubic,alltargets,cv=10,scoring="mean_squared_error")
print scores
print "Cubic Model Cross Validation Score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() / 2)
[ -33.177 -106.645  -18.833  -26.78   -15.522  -26.792  -18.362  -23.059
  -36.591  -12.35 ]
Cubic Model Cross Validation Score: -31.811 (+/- 12.986)

Summary and questions

Summary

  • We performed linear and generalized linear regression
  • In this experiment only one feature (chest circumference) has been applied
  • For regression MSE or MAD are typical performance measures
  • A good model is one whose predictions are close to the actual labels. But this comparison must be done on test data samples, i.e. on samples that have not been used for training.
  • A model is overfitted if it performs well on the training data, but relatively poorly on the test data
  • x-fold cross validation is the standard method for evaluating a supervised learner, both for regression and classification.

The MSE of cross validation is still large. How can it be improved?

Use all available features for bodyfat regression

In all of the previous experiments we applied only one feature, the chest circumference, for predicting the bodyfat ratio. We have seen that all of the learned models yielded a relatively large MSE. In general, prediction is weak if there are latent parameters, i.e. parameters that have not been used in the model but have a considerable influence on the target that shall be predicted. In our example it is obvious that the bodyfat ratio depends not only on chest circumference but on many other features, such as weight, height, age etc. Since our dataset contains many more features than only the chest circumference, we can apply all of them for learning a linear model for bodyfat regression. This is implemented in the code snippet below. Here, we apply only a linear model and perform 10-fold cross validation.

In [25]:
allfeatures=dataArray[:,2:] # now the applied features are columns 2 to 14
alltargets=dataArray[:,1]
regressor1=linear_model.LinearRegression()
scores = cross_val_score(regressor1,allfeatures,alltargets,cv=10,scoring="mean_squared_error")
print scores
print "Linear Model Cross Validation Score: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() / 2)
[-18.644 -29.135 -16.028 -23.773 -17.549 -21.088 -15.521 -15.717 -34.118
 -19.816]
Linear Model Cross Validation Score: -21.139 (+/- 2.947)

The MSE is now much better than in the case of applying only a single feature.

Questions

  1. What is a good model?
  2. What is overfitting?
  3. Training and testing both require a sufficient amount of labeled data samples. What if only a relatively small number of labeled data is available?
  4. In these experiments the cross validation MSE is much lower than the MSE of the previous tests with the fixed training and test set. Why?
  5. Is it always better to apply as many features as possible?
  6. In this notebook first a single feature (chest circumference) and then, in the last section, all features have been applied. Suggest a better method for feature analysis.
  7. Compare the conventional approach of expert-based modelling with the data-based modelling. Which type of knowledge is required?

Determining the most informative parameter

By plotting the target value (bodyfat percentage) over each of the available measurement variables (columns 2 to 14), the most informative parameter can be identified visually. Which single parameter would be best for determining the bodyfat percentage?

In [26]:
plt.figure(num=None, figsize=(18, 28), dpi=80, facecolor='w', edgecolor='k')
targets=dataArray[:,1]
for featIdx in range(2,15):   # all measurement columns 2 to 14
    plt.subplot(7,2,featIdx-1)
    feats=dataArray[:,featIdx]
    plt.plot(feats,targets,'ob')
    plt.title(columnslist[featIdx])
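
Instead of judging the scatter plots only by eye, the linear correlation between each measurement variable and the bodyfat percentage can be quantified with numpy's corrcoef; note that this correlation ranking is only a heuristic for "most informative":

for featIdx in range(2,15):
    # Pearson correlation coefficient between feature and target
    r=np.corrcoef(dataArray[:,featIdx],targets)[0,1]
    print("%-22s r = %+.3f" % (columnslist[featIdx],r))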