
Table of Contents • ← Chapter 2 - Classification • Chapter 2.02 - Naive Bayes →

Chapter 2.01 - Dummy Classifiers

To really understand how classifiers work, we're going to start with two very basic models called dummy classifiers.

Technically these aren't machine learning models, as they use simple rules defined by the user. As you will see, they provide a good baseline for performance and demonstrate the importance of using various performance measures to evaluate models. If you've ever witnessed someone 'wow' an audience by describing a predictive model with an impressively high accuracy (such as 95%), you will see why this may not be as impressive as it sounds. (Accuracy has a special definition when we are talking about classification.)

Mode

Mode is a statistical term for the most frequent value in a set of data. For example, in the set a, b, b, b, c, d, the value b occurs most frequently, so it is the mode. You can calculate the mode of a given dataset in Python with the statistics.mode function:

In [35]:

from statistics import mode

# our simple data set
data = ['a','b','b','b','c','d']

mode(data)

Out[35]:

'b'

In the above example there are four unique values. In classification, when there are just two unique values in the labels, this is called binary classification. For example, consider the set True, False, False, False, False. The mode of this set is False, and in binary classification this is also known as the majority class.

We can use the mode to create a very simple model for predicting a value (without requiring any input). In binary classification, this can be called a majority class classifier. Using the example above, a majority class classifier would always predict False, and given the example data it would achieve an accuracy of 80%! A great achievement for such a simple model.

This terminology is potentially problematic. When talking about classification, accuracy is a measure of the proportion of predictions that are predicted correctly. The colloquial meaning of accuracy could mislead others about the performance of your model. Consider a set of 100 True and False labels that flag whether a loan has defaulted or not. Perhaps in this set, only 5 of the loans defaulted. With a basic model such as this, we could trivially achieve 95% accuracy by always predicting False.
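To make the arithmetic concrete, here is a minimal sketch of a majority class classifier applied to the loan example. The `labels` list and the `majority_class_predict` helper are illustrative names made up for this sketch; they are not part of any library.

```python
from statistics import mode

# hypothetical loan labels: 5 defaults (True) and 95 non-defaults (False)
labels = [True] * 5 + [False] * 95

def majority_class_predict(training_labels, n_predictions):
    """Always predict the most frequent label seen in the training labels."""
    majority = mode(training_labels)
    return [majority] * n_predictions

predictions = majority_class_predict(labels, len(labels))

# accuracy is the proportion of predictions that match the actual labels
correct = sum(p == a for p, a in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)
```

Every prediction is False, so the 95 non-defaults are all "correct" and the 5 defaults are all missed.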

Additional performance measures

Using the table below (aptly called a confusion matrix), we can define some additional useful measures of performance:

	Predicted = True	Predicted = False
Actual = True	True Positive (TP)	False Negative (FN)
Actual = False	False Positive (FP)	True Negative (TN)

Accuracy measures how often the model is correct, calculated as:

$$Accuracy=\frac{TP + TN}{TP + TN + FP + FN}$$

True Positive Rate - TPR (also called Recall or Sensitivity) measures how often the model is correct when the actual value is true, calculated as:

$$True\ Positive\ Rate\ (TPR)=\frac{TP}{FN + TP}$$

False Positive Rate - FPR (also called Fallout) measures how often the model is incorrect when the actual value is false, calculated as:

$$False\ Positive\ Rate\ (FPR)=\frac{FP}{TN + FP}$$

Positive Predictive Value - PPV (also called Precision) measures the proportion of predictions that are correct when the predicted value is true, calculated as:

$$Positive\ Predictive\ Value\ (PPV)=\frac{TP}{FP + TP}$$

There are many other metrics that can be derived from a confusion matrix, but for now we will focus on just these.

Generally, the goal is to maximise accuracy, TPR and PPV, and minimise FPR.
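The four formulas above translate directly into code. The following is a small sketch, assuming we already have the four confusion matrix counts; the function name `confusion_metrics` and the example counts are hypothetical. A metric whose denominator is zero is reported as NaN here, which is one possible convention rather than the only one.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return accuracy, TPR, FPR and PPV from the four confusion matrix counts."""
    def ratio(numerator, denominator):
        # a zero denominator means the metric is undefined, reported here as NaN
        return numerator / denominator if denominator else float('nan')

    return {
        'accuracy': ratio(tp + tn, tp + tn + fp + fn),
        'TPR': ratio(tp, fn + tp),
        'FPR': ratio(fp, tn + fp),
        'PPV': ratio(tp, fp + tp),
    }

# hypothetical counts, chosen only for illustration
example = confusion_metrics(tp=40, fn=10, fp=5, tn=45)
print(example)
```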

Continuing with our loan defaults example above, let's calculate these three metrics (remembering our model always predicts False):

	Predicted = True	Predicted = False
Actual = True	0 (TP)	5 (FN)
Actual = False	0 (FP)	95 (TN)

We already know accuracy is 95%.
$TPR = \frac{TP}{FN + TP} = \frac{0}{5 + 0} = 0\%$
$FPR = \frac{FP}{TN + FP} = \frac{0}{95 + 0} = 0\%$
$PPV = \frac{TP}{FP + TP} = \frac{0}{0 + 0} = \text{NaN}$ (undefined, because the model never predicts True)

Considering these additional metrics, we can now see that while the model's accuracy is high, it is useless for actually predicting loan defaults. Where this model is strong, however, is as a baseline. Hopefully your real model will achieve higher accuracy (or TPR, or PPV, depending on what is most important).

Class probability

Before we continue to real models, let's consider one more dummy classifier - the stratified classifier. In statistics, stratified sampling takes samples from each group (class) in the population. The stratified classifier works in a similar spirit, assigning predictions at random according to the probability distribution of the underlying classes. This classifier is slightly more complex than the mode variant, and can also (by chance) achieve a higher accuracy.

Continuing with the example of loan defaults, in our set of 100, 5 default (True labels) and 95 do not (False labels). Notationally, this gives class probabilities of $p(default) = 0.05$ and $p(no\ default) = 0.95$. In other words, for every 100 predictions this model makes, we expect about 5 to be True and the rest False. To keep the arithmetic simple, the scenarios below assume exactly 5 True predictions.
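A stratified prediction rule can be sketched in a few lines of plain Python. This is an illustrative sketch, not the scikit-learn implementation: the helper name `stratified_predict` and its `seed` parameter are assumptions made for this example. Each prediction is drawn independently, so the number of True predictions is 5 per 100 only on average.

```python
import random

def stratified_predict(training_labels, n_predictions, seed=None):
    """Draw each prediction at random, weighted by the class frequencies."""
    rng = random.Random(seed)
    classes = sorted(set(training_labels))
    weights = [training_labels.count(c) for c in classes]
    return rng.choices(classes, weights=weights, k=n_predictions)

# hypothetical loan labels: 5 defaults (True) and 95 non-defaults (False)
labels = [True] * 5 + [False] * 95
predictions = stratified_predict(labels, n_predictions=100, seed=0)
print(sum(predictions), "of", len(predictions), "predictions are True")
```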

Best case, our stratified classifier makes 5 True predictions that align to the 5 actual True values by chance.

	Predicted = True	Predicted = False
Actual = True	5 (TP)	0 (FN)
Actual = False	0 (FP)	95 (TN)

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{5 + 95}{5 + 95 + 0 + 0} = 100\%$
$TPR = \frac{TP}{FN + TP} = \frac{5}{0 + 5} = 100\%$
$FPR = \frac{FP}{TN + FP} = \frac{0}{95 + 0} = 0\%$
$PPV = \frac{TP}{FP + TP} = \frac{5}{5 + 0} = 100\%$

Worst case, none of the 5 True predictions align with an actual True value.

	Predicted = True	Predicted = False
Actual = True	0 (TP)	5 (FN)
Actual = False	5 (FP)	90 (TN)

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 90}{0 + 90 + 5 + 5} = 90\%$
$TPR = \frac{TP}{FN + TP} = \frac{0}{5 + 0} = 0\%$
$FPR = \frac{FP}{TN + FP} = \frac{5}{90 + 5} = 5.26\%$
$PPV = \frac{TP}{FP + TP} = \frac{0}{5 + 0} = 0\%$

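In this example, we could achieve an accuracy of between 90% (all five wrong) and 100% (all five right). If we weight all the possible scenarios equally (i.e. 0 true positives, 1 true positive, ... , 5 true positives), the averages are an accuracy of 95% (same as the mode classifier), TPR of 50%, FPR of ~2.6% and PPV of 50%. These scenarios are not equally likely, however: when the 5 True predictions are placed at random, each has only a 5% chance of landing on an actual default, so the expected number of true positives is 0.25. The expected TPR, FPR and PPV are therefore each 5%, and the expected accuracy is 90.5%. In other words, the stratified classifier gives up a little accuracy compared to the mode dummy classifier, but on average it achieves a non-zero TPR and a defined PPV, which makes it a more informative baseline for our additional performance measures.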
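The scenarios from 0 to 5 true positives are not equally likely, so it can help to weight each one by its probability. The sketch below assumes exactly 5 True predictions are placed uniformly at random among the 100 loans (a simplification; the variable names are illustrative), which makes the number of true positives follow a hypergeometric distribution.

```python
from math import comb

n_total, n_actual_true, n_predicted_true = 100, 5, 5
n_actual_false = n_total - n_actual_true

expected = {'accuracy': 0.0, 'TPR': 0.0, 'FPR': 0.0, 'PPV': 0.0}
for tp in range(n_predicted_true + 1):
    # probability that exactly `tp` of the 5 True predictions land on actual defaults
    prob = (comb(n_actual_true, tp)
            * comb(n_actual_false, n_predicted_true - tp)
            / comb(n_total, n_predicted_true))
    fp = n_predicted_true - tp
    fn = n_actual_true - tp
    tn = n_actual_false - fp
    # add this scenario's metrics, weighted by how likely the scenario is
    expected['accuracy'] += prob * (tp + tn) / n_total
    expected['TPR'] += prob * tp / (tp + fn)
    expected['FPR'] += prob * fp / (fp + tn)
    expected['PPV'] += prob * tp / (tp + fp)

print(expected)
```

Under these assumptions the expected accuracy is 90.5%, and the expected TPR, FPR and PPV are each 5%.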

Predicting probabilities

As discussed in Chapter 2's introduction to classification, classifiers are typically able to output a probability or likelihood of the positive class occurring. This means that instead of predicting True or False (or 1 or 0), predictions will be output as values such as 0.23, 0.67 or 0.94, where each value represents the probability of a True or 1 value occurring.

Notice that in each confusion matrix above, the model predictions are discrete (meaning either True or False, and nothing in between).

Before we can begin measuring accuracy (or any of the other metrics) of a predicted probability, we need to define a threshold (called a discrimination threshold) below which we assume probabilities to be False or 0, and above which they become True or 1. For example, if this threshold is 0.5, our values 0.23, 0.67 and 0.94 would become 0, 1 and 1 respectively. If it were 0.75, they would become 0, 0 and 1 respectively.
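Applying a discrimination threshold is a one-line comparison. In this sketch the helper name `apply_threshold` is hypothetical, and treating a probability exactly equal to the threshold as positive is an assumption, since the text only describes values below and above it.

```python
def apply_threshold(probabilities, threshold):
    """Convert predicted probabilities to 0/1 labels at the given threshold."""
    # assumption: a probability exactly equal to the threshold counts as positive
    return [1 if p >= threshold else 0 for p in probabilities]

probabilities = [0.23, 0.67, 0.94]
print(apply_threshold(probabilities, 0.5))   # [0, 1, 1]
print(apply_threshold(probabilities, 0.75))  # [0, 0, 1]
```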

But how do we know where to set this threshold? One common method is to maximise what is called the F1 score. The F1 score is an alternative performance measure, defined as the harmonic mean of precision (PPV) and sensitivity (recall, or TPR), and is calculated as:

$$F_1\ score = 2 \times \frac{PPV \times TPR}{PPV + TPR}$$

The F1 score is calculated for each threshold from 0.01 to 0.99, and the threshold with the highest score is chosen as the one that best balances precision ($\frac{TP}{FP + TP}$) and recall ($\frac{TP}{FN + TP}$).
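The threshold sweep can be sketched as follows. The labels and predicted probabilities are made up for illustration, `f1_at_threshold` is a hypothetical helper, and returning an F1 of 0 when there are no true positives is a convention assumed here to avoid dividing by zero.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score after converting scores to 0/1 predictions at `threshold`."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, a in zip(preds, y_true) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(preds, y_true) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(preds, y_true) if p == 0 and a == 1)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined), so treat F1 as 0
    ppv = tp / (tp + fp)
    tpr = tp / (tp + fn)
    return 2 * ppv * tpr / (ppv + tpr)

# hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.62, 0.70, 0.80, 0.90]

thresholds = [t / 100 for t in range(1, 100)]  # 0.01, 0.02, ..., 0.99
best_threshold = max(thresholds, key=lambda t: f1_at_threshold(y_true, scores, t))
best_f1 = f1_at_threshold(y_true, scores, best_threshold)
print(best_threshold, best_f1)
```

When several thresholds tie for the highest F1 score, `max` returns the first (lowest) of them; any threshold in the tied range gives the same predictions on this data.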

There will potentially be times where your problem places different value on the occurrences of FP and FN. Wouldn't it be nice if there was a performance measure that is useful for all binary classification problems (that is, without the need to explicitly define a threshold)?

We can achieve this by calculating the sensitivity (true positive rate) and fallout (false positive rate) of our model for every threshold increment, and compare this to that of random predictions.

A random binary prediction has a 50% chance of being correct. Observe the TPR and FPR as we change the number of True predictions in a total set of 100 predictions:

0% True predictions: TPR = 0% and FPR = 0%
10% True predictions: TPR = 10% and FPR = 10%
20% True predictions: TPR = 20% and FPR = 20%
...
100% True predictions: TPR = 100% and FPR = 100%

We can plot TPR and FPR of our random predictions to form a straight line from (0,0) to (1,1). This line is our baseline: a model on this line does no better than random guessing. The line is called a receiver operating characteristic (ROC) curve, though it's not a curve at this point.
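A quick simulation illustrates why random predictions sit on the diagonal. This is a sketch under stated assumptions: the labels are synthetic (20,000 cases, roughly 30% True), each prediction is True with a fixed probability regardless of the actual label, and the helper name is hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# hypothetical actual labels: 20,000 cases, roughly 30% of them True
actual = [rng.random() < 0.3 for _ in range(20000)]

def rates_for_random_guessing(actual, share_true):
    """TPR and FPR when each prediction is True with probability `share_true`."""
    predicted = [rng.random() < share_true for _ in actual]
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    positives = sum(actual)
    negatives = len(actual) - positives
    return tp / positives, fp / negatives

for share in (0.1, 0.5, 0.9):
    tpr, fpr = rates_for_random_guessing(actual, share)
    print(f"{share:.0%} True predictions: TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```

Because the guesses ignore the actual labels, TPR and FPR both land close to the share of True predictions, up to sampling noise.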

As the discrimination threshold is varied and the resulting TPR and FPR are plotted, a 'good' model will produce a curve reaching up towards the top left corner. The closer the ROC curve is to the top left, the better the model. A perfect model will push this curve all the way to the top left corner, effectively making a right angle.

If the curve crosses over the random baseline, this indicates an error with the model. If the curve is completely below the random baseline, simply inverting the model (replacing all True predictions with False predictions and vice versa) is a trivial improvement.

The space below the curve is a performance metric called AUC (the area under the curve). By comparing the true positive rate with the false positive rate at each setting of the discrimination threshold, AUC provides an effective measure of how well a binary classifier can distinguish Trues from Falses, and allows us to measure model performance without explicitly defining a threshold.
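To show where the area comes from, here is a minimal pure-Python sketch that builds the ROC points by sweeping the threshold over each distinct score and then applies the trapezoidal rule. The function names and the sample data are hypothetical, and the sketch assumes the labels contain at least one positive and one negative case.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per candidate threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # lower the threshold step by step through each distinct score, highest first
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, a in zip(scores, y_true) if s >= threshold and a == 1)
        fp = sum(1 for s, a in zip(scores, y_true) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal rule over consecutive (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.62, 0.70, 0.80, 0.90]
auc = area_under_curve(roc_points(y_true, scores))
print(auc)
```

For this data the area is 0.75, which matches the fraction of (positive, negative) pairs in which the positive case receives the higher score (12 of 16).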

Implementing dummy classifiers

Having a baseline as described above is genuinely useful. Luckily, these models are straightforward to implement in Python using the scikit-learn machine learning package. Conveniently, the DummyClassifier class conforms to the same API (i.e. code style) as the rest of the algorithms, so what we learn from applying these baseline models to a dataset is largely transferable to real machine learning!

First, let's quickly generate some sample data to work with:

In [5]:

from sklearn.datasets import make_classification

# create a sample data set, where X are the features and y are the labels
X, y = make_classification(n_samples=100, n_classes=2)

Now we can build our first dummy classifier, using the mode method described above and compute the AUC metric:

In [7]:

from sklearn.dummy import DummyClassifier
from sklearn import metrics

# define and fit our dummy model
model = DummyClassifier(strategy='most_frequent')
model.fit(X, y)

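# generate predictions and calculate fallout (FPR) and sensitivity (TPR);
# note that roc_curve returns FPR first, then TPR
predictions = model.predict(X)
FPR, TPR, thresholds = metrics.roc_curve(y, predictions)

# auc expects the x-axis values (FPR) first, then the y-axis values (TPR)
metrics.auc(FPR, TPR)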

Out[7]:

0.5

And the second dummy classifier, using the probability distribution method:

In [4]:

# define and fit our dummy model
dummy_strat = DummyClassifier(strategy='stratified')
dummy_strat.fit(X, y)

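# generate predictions and calculate fallout (FPR) and sensitivity (TPR);
# note that roc_curve returns FPR first, then TPR
predictions = dummy_strat.predict(X)
FPR, TPR, thresholds = metrics.roc_curve(y, predictions)

# auc expects the x-axis values (FPR) first, then the y-axis values (TPR)
metrics.auc(FPR, TPR)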

Out[4]:

0.53041216486594633

Notice the AUC will increase and decrease if you re-run the stratified dummy classifier. This is because the predictions are assigned randomly according to the probability distribution, with a range of possible sensitivity and fallout measures as described above.
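If you want the stratified result to be repeatable between runs, one option is to fix the random seeds. The sketch below assumes the `random_state` parameters of `make_classification` and `DummyClassifier`; the helper `stratified_auc` and the seed values are made up for illustration.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

# fixing random_state makes both the sample data and the random predictions repeatable
X, y = make_classification(n_samples=100, n_classes=2, random_state=0)

def stratified_auc(seed):
    """Fit a stratified dummy classifier with the given seed and return its AUC."""
    model = DummyClassifier(strategy='stratified', random_state=seed)
    model.fit(X, y)
    fpr, tpr, _ = metrics.roc_curve(y, model.predict(X))
    return metrics.auc(fpr, tpr)

print(stratified_auc(seed=0), stratified_auc(seed=0))  # same seed, same AUC
print(stratified_auc(seed=1))                          # a different seed may differ
```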

Where to now?

Given the two dummy classifiers as a baseline, our goal with machine learning is to create a model capable of out-performing these. The rest of Chapter 2 is going to explain and demonstrate various different approaches (i.e. types of models) to help you achieve this for your given classification problem. The first one we look at takes the same idea behind modelling class probabilities in the stratified dummy classifier, and introduces inputs.
