In statistics, Bayes' theorem describes the probability of an occurrence based on input conditions. The theorem states: the probability of A given B is equal to the probability of B given A, multiplied by the probability of A and divided by the probability of B, or notationally:
$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$$P$ means "the probability of" and $\mid$ means "given" or "where".
In other words, $P(A \mid B)$ means "the probability of A occurring when B occurs".
A naive Bayes classifier applies this theorem naively, assuming that features (inputs into the model) are independent of (unrelated to) each other.
In the previous chapter, we looked at using class probabilities to build a dummy classifier, and considered an example where 95% of loans do not default. This probability is known as a prior probability - it is known before (prior to) considering any of the inputs.
We can prove Bayes' theorem by starting with the probability of two events, A and B, occurring together.
$$P(A\ and\ B) = P(A) \times P(B \mid A)\\ P(A\ and\ B) = P(B) \times P(A \mid B)$$Equating the right sides of each equation:
$$P(B) \times P(A \mid B) = P(A) \times P(B \mid A)$$Dividing both sides by $P(B)$ gives us Bayes' theorem:
$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$Let's expand our dummy classifier example with some input:
Employment | Default | Count |
---|---|---|
FT | N | 59 |
FT | Y | 1 |
PT | N | 36 |
PT | Y | 4 |
Probability of default given full-time employment:
$$ \begin{align} P(Default=Y \mid Emp=FT) & = \frac{P(Default=Y) \times P(Emp=FT \mid Default=Y)}{P(Emp=FT)}\\ & = \frac{0.05 \times 0.2}{0.6}\\ & = 0.0167... \end{align} $$Probability of default given part-time employment:
$$ \begin{align} P(Default=Y \mid Emp=PT) & = \frac{P(Default=Y) \times P(Emp=PT \mid Default=Y)}{P(Emp=PT)}\\ & = \frac{0.05 \times 0.8}{0.4}\\ & = 0.1 \end{align} $$Given just one input, for this example we can see that part-time employees are six times more likely to default than their full-time counterparts.
If we want to predict the class of a given employment type, we calculate the probability of all classes and take the maximum.
To extend the above: if the employment type is FT, we know the probability of default is 0.0167.
The probability of not defaulting is:
$$ \begin{align} P(Default=N \mid Emp=FT) & = \frac{P(Default=N) \times P(Emp=FT \mid Default=N)}{P(Emp=FT)}\\ & = \frac{0.95 \times (59/95)}{0.6}\\ & = 0.983... \end{align} $$Since there are only two classes of default (true or false), the two probabilities are complementary - they sum to 1! As you can see, a loan to a full-time worker is predicted to not default.
If we are not interested in the probability and only interested in the predicted class, we can take a shortcut and not calculate the divisor $P(Emp=FT)$ for both equations, as it is the same for both - it can only scale the results.
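These single-input calculations can be sketched directly from the table counts - a minimal check of the arithmetic above:

```python
# counts from the table above: (employment, default) -> count
counts = {("FT", "N"): 59, ("FT", "Y"): 1, ("PT", "N"): 36, ("PT", "Y"): 4}
total = sum(counts.values())  # 100 loans

def p_default_given_emp(default, emp):
    """P(Default=default | Emp=emp) via Bayes' theorem."""
    class_total = sum(c for (e, d), c in counts.items() if d == default)
    p_default = class_total / total  # prior, e.g. P(Default=Y) = 0.05
    p_emp = sum(c for (e, d), c in counts.items() if e == emp) / total
    p_emp_given_default = counts[(emp, default)] / class_total
    return p_default * p_emp_given_default / p_emp

print(p_default_given_emp("Y", "FT"))  # ≈ 0.0167
print(p_default_given_emp("Y", "PT"))  # ≈ 0.1
```

Because the divisor only rescales the result, comparing the numerators alone yields the same predicted class.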
We can expand Bayes theorem with even more inputs and try to improve our classifier! This is where the naive aspect comes into play. For each input, we will assume (naively) that it is unrelated to every other input. Consider the following:
Gender | Employment | Default | Count |
---|---|---|---|
M | FT | N | 30 |
M | FT | Y | 1 |
M | PT | N | 14 |
M | PT | Y | 3 |
F | FT | N | 29 |
F | FT | Y | 0 |
F | PT | N | 22 |
F | PT | Y | 1 |
While we won't go through the mathematical proof here, Bayes' theorem is generalised for multiple inputs as:
$$P(class \mid f_1,f_2,f_3,...) \propto P(class) \times P(f_1 \mid class) \times P(f_2 \mid class) \times P(f_3 \mid class) \times ...$$Note the proportionality ($\propto$): as with the shortcut above, we have dropped the divisor because it is the same for every class. Let's predict default for a full-time employed female:
$$ \begin{align} P(default=True \mid emp=FT,gen=F) & \propto P(default=True) \times P(emp=FT \mid default=True) \times P(gen=F \mid default=True)\\ & = 0.05 \times 0.2 \times 0.2\\ & = 0.002 \end{align} $$$$ \begin{align} P(default=False \mid emp=FT,gen=F) & \propto P(default=False) \times P(emp=FT \mid default=False) \times P(gen=F \mid default=False)\\ & = 0.95 \times \frac{59}{95} \times \frac{51}{95}\\ & = 0.3167 \end{align} $$Now we take the maximum of the two values, and assign the corresponding class as our prediction. That is, for a full-time employed female, we predict no default.
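The two-feature version can be sketched the same way from the counts above (the results are proportional scores rather than normalised probabilities, since we drop the divisor):

```python
# counts from the table above: (gender, employment, default) -> count
counts = {("M", "FT", "N"): 30, ("M", "FT", "Y"): 1, ("M", "PT", "N"): 14, ("M", "PT", "Y"): 3,
          ("F", "FT", "N"): 29, ("F", "FT", "Y"): 0, ("F", "PT", "N"): 22, ("F", "PT", "Y"): 1}

def naive_score(default, emp, gen):
    """P(class) * P(emp | class) * P(gen | class) - proportional to the posterior."""
    class_total = sum(c for (g, e, d), c in counts.items() if d == default)
    p_class = class_total / sum(counts.values())
    p_emp = sum(c for (g, e, d), c in counts.items() if d == default and e == emp) / class_total
    p_gen = sum(c for (g, e, d), c in counts.items() if d == default and g == gen) / class_total
    return p_class * p_emp * p_gen

score_y = naive_score("Y", "FT", "F")  # 0.05 * 0.2 * 0.2 = 0.002
score_n = naive_score("N", "FT", "F")  # 0.95 * 59/95 * 51/95 ≈ 0.3167
print("prediction:", "Y" if score_y > score_n else "N")  # prediction: N
```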
`BernoulliNB` requires binary feature inputs, but luckily it has a threshold parameter (aptly named `binarize`) to convert continuous inputs into binary inputs at a specified threshold. Conveniently, we can keep using `make_classification` for this implementation.
```python
# prepare sample data, similar to previous chapter
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_classes=2)

# fit Bernoulli naive Bayes (features below 0.0 become 0, above become 1)
from sklearn.naive_bayes import BernoulliNB
modelb = BernoulliNB(binarize=0.0)
modelb.fit(X, y)
predictions = modelb.predict(X)

# calculate AUC, should be much better than our previous dummy classifiers!
from sklearn import metrics
FPR, TPR, thresholds = metrics.roc_curve(y, predictions)
metrics.auc(FPR, TPR)
```

0.91999999999999993
Notice the improvement in AUC (it's now closer to 1).
An input with multiple discrete (categorical) values is called a multinomial input. (And if you were wondering, an input with only two values is called a Bernoulli or binary input).
Let's expand our dummy classifier example with some slightly different input, this time we will have three employment types (full time, part time and casual):
Employment | Default | Count |
---|---|---|
FT | N | 58 |
FT | Y | 1 |
PT | N | 34 |
PT | Y | 3 |
CA | N | 3 |
CA | Y | 1 |
Let's revise our calculations, adding one for the new employment type.
Probability of default given full-time employment:
$$ \begin{align} P(default=True \mid emp=FT) & = \frac{P(default=True) \times P(emp=FT \mid default=True)}{P(emp=FT)}\\ & = \frac{0.05 \times 0.2}{0.59}\\ & = 0.01695... \end{align} $$Probability of default given part-time employment:
$$ \begin{align} P(default=True \mid emp=PT) & = \frac{P(default=True) \times P(emp=PT \mid default=True)}{P(emp=PT)}\\ & = \frac{0.05 \times 0.6}{0.37}\\ & = 0.081... \end{align} $$Probability of default given casual employment:
$$ \begin{align} P(default=True \mid emp=CA) & = \frac{P(default=True) \times P(emp=CA \mid default=True)}{P(emp=CA)}\\ & = \frac{0.05 \times 0.2}{0.04}\\ & = 0.25... \end{align} $$For multinomial inputs, it is possible that not every combination of class and feature value occurs in the training data, meaning that $P(feature \mid class)$ would equal 0 and multiply the probability for the whole set of features out to 0. This is problematic, and is resolved by smoothing the result - introducing a small amount of new information (a form of regularisation).
This is not a problem for Bernoulli inputs: by definition a Bernoulli input must have a binary value (0 or 1, True or False), and if one value were to never occur in the training data, the input would only have a single value, providing no information for learning.
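Additive (Laplace) smoothing can be sketched as a small helper - a hypothetical function, where a pseudo-count `alpha` is added to every possible feature value:

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
    """P(feature=value | class) with additive (Laplace) smoothing.

    count       - times this value occurred with this class in training
    class_total - training examples of this class
    n_values    - distinct values the feature can take
    alpha       - pseudo-count added to every value (0 disables smoothing)
    """
    return (count + alpha) / (class_total + alpha * n_values)

# e.g. casual employment never seen with default=Y in some training sample:
print(smoothed_likelihood(0, 5, 3, alpha=0))  # 0.0 - zeroes out the whole product
print(smoothed_likelihood(0, 5, 3, alpha=1))  # 0.125 - small, but no longer zero
```

scikit-learn's `MultinomialNB` and `BernoulliNB` expose this pseudo-count as their `alpha` parameter (1.0 by default).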
`MultinomialNB` requires inputs to be discrete, non-negative counts - so unfortunately `make_classification` won't give us suitable inputs directly, and we need to perform a quick transformation to integers.
```python
# prepare sample data, similar to previous chapter
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_classes=2, shift=10, scale=10)
X = X.astype(int)

# fit Multinomial naive Bayes
from sklearn.naive_bayes import MultinomialNB
modelm = MultinomialNB()
modelm.fit(X, y)
predictions = modelm.predict(X)

# calculate AUC, should be much better than our previous dummy classifiers!
from sklearn import metrics
FPR, TPR, thresholds = metrics.roc_curve(y, predictions)
metrics.auc(FPR, TPR)
```

0.92897158863545426
The simplest way to handle (continuous) numerical inputs is to turn them into Bernoulli (binary) or multinomial (multiple discrete values) inputs. Both of these options turn them into discrete values, so this process is known as discretisation or binning.
For example, the table below discretises 'age' into both Bernoulli and multinomial values:
Age | Over 18 (Bernoulli) | Age Band (Multinomial) |
---|---|---|
12 | N | 10 to 19 |
17 | N | 10 to 19 |
30 | Y | 30 to 39 |
35 | Y | 30 to 39 |
43 | Y | 40 to 49 |
49 | Y | 40 to 49 |
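The table's binning can be sketched with numpy, where `np.digitize` assigns each value to a band:

```python
import numpy as np

ages = np.array([12, 17, 30, 35, 43, 49])

# Bernoulli: a single threshold at 18
over_18 = ages >= 18
print(over_18.tolist())  # [False, False, True, True, True, True]

# multinomial: decade bands
bands = np.digitize(ages, bins=[10, 20, 30, 40, 50])
labels = ["10 to 19", "20 to 29", "30 to 39", "40 to 49"]
print([labels[b - 1] for b in bands])  # matches the Age Band column above
```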
There are many different approaches to optimising bins, which we will not discuss here. Instead, we will focus on another method that is more robust against discretisation error (the error introduced through binning).
Given the rule above, where $P(class \mid f_1, f_2, f_3,...) \propto P(class) \times P(f_1 \mid class) \times P(f_2 \mid class) \times P(f_3 \mid class) \times ...$, we can substitute our calculation of $P(f_n \mid class)$ with a different approach that works for numerical (continuous) inputs.
A common method is to apply what is called a probability density function, and assume that the numerical inputs are normally distributed (this is called a Gaussian distribution). Without going into the maths, to calculate this all we need are the mean (average) and the standard deviation (a measure of spread around the mean) of the continuous values associated with each class.
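For the curious, the per-feature calculation can be sketched as follows (the income figures are made up for illustration):

```python
import math

def gaussian_likelihood(x, values):
    """P(x | class), assuming x is normally distributed with the class's mean and variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# hypothetical incomes (in thousands) of defaulting vs non-defaulting borrowers
defaulted = [22, 25, 31, 28, 24]
repaid = [55, 61, 48, 70, 66]

# an income of 30 is far more likely under the "defaulted" distribution
print(gaussian_likelihood(30, defaulted) > gaussian_likelihood(30, repaid))  # True
```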
Given the above, this should be quite straightforward.
```python
# prepare sample data, similar to previous chapter
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_classes=2)

# fit Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB
modelg = GaussianNB()
modelg.fit(X, y)
predictions = modelg.predict(X)

# calculate AUC, should be much better than our previous dummy classifiers!
from sklearn import metrics
FPR, TPR, thresholds = metrics.roc_curve(y, predictions)
metrics.auc(FPR, TPR)
```

0.97999999999999998
Remember that naive Bayes is, in fact, naive - meaning that we assume the features are independent of each other. This means that we can create models that mix inputs by simply multiplying their probabilities. In effect, we could create a model that mixes Bernoulli, multinomial and continuous inputs by creating 3 respective models for each type of input, and then multiplying the output probabilities to achieve a combined model. You could also create a single model which takes any input type and applies the correct probability calculation for each.
We don't lose any information about relationships between any two inputs in different models, because naive Bayes never considers them in the first place even when in the same model!
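As a sketch of this multiply-the-probabilities idea, we can fit one model per feature subset and combine their estimates by hand. Note that each model's estimate already includes the class prior, so a faithful combination divides one copy of the prior back out before renormalising:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB, GaussianNB

X, y = make_classification(n_samples=100, n_classes=2, random_state=0)
X1, X2 = X[:, :10], X[:, 10:]  # first 10 features -> Bernoulli, rest -> Gaussian

modelb = BernoulliNB(binarize=0.0).fit(X1, y)
modelg = GaussianNB().fit(X2, y)

# multiply the two posteriors, divide out the duplicated prior, renormalise
prior = np.bincount(y) / len(y)
proba = modelb.predict_proba(X1) * modelg.predict_proba(X2) / prior
proba /= proba.sum(axis=1, keepdims=True)
predictions = proba.argmax(axis=1)
```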
Unfortunately scikit-learn doesn't have direct support for mixed-input naive Bayes. To get around this, we're going to create a model that mixes Bernoulli and Gaussian naive Bayes, combining their class probability estimates with soft voting.
```python
# create some data and split 10 features for bernoulli and 10 for gaussian
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

X, y = make_classification(n_samples=100, n_classes=2)

# each pipeline selects its own columns, so the voting ensemble can fit both on X
pipeb = make_pipeline(FunctionTransformer(lambda X: X[:, :10]), BernoulliNB(binarize=0.0))
pipeg = make_pipeline(FunctionTransformer(lambda X: X[:, 10:]), GaussianNB())

# combine both of our models by averaging their predicted probabilities
models = VotingClassifier(estimators=[('bnb', pipeb), ('gnb', pipeg)], voting='soft')
models.fit(X, y)
predictions = models.predict(X)

# calculate AUC, should be much better than our previous dummy classifiers!
from sklearn import metrics
FPR, TPR, thresholds = metrics.roc_curve(y, predictions, drop_intermediate=False)
print(metrics.auc(FPR, TPR))
```
Just like naive Bayes could be expanded to consider inputs with multiple values, the same expansion works for multiple classes. If instead of default we have default status (one of never in default, currently in default, or previously in default), we simply apply all of the calculations above for a third class. To make a prediction, we again take the maximum probability of the three classes. You can test this yourself by reusing any of the above code with `make_classification(n_classes=2)` modified to the number of classes you have - you may also need to raise `n_informative`, as `make_classification` requires enough informative features to separate the class clusters. (Note: the AUC metric only applies to binary classification, so attempting to calculate it will fail.)
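For example, a minimal three-class sketch (swapping AUC for accuracy, and raising `n_informative` so three classes of clusters can be placed):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# 3 classes need more than the default 2 informative features
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)

modelg = GaussianNB()
modelg.fit(X, y)
predictions = modelg.predict(X)
acc = accuracy_score(y, predictions)
print(acc)
```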
We now know that Naive Bayes will work with any combination of binary, categorical or continuous inputs, but how do we know which inputs to use and how many?
To find out, we can divide our dataset into 2 subsets:

- a training set (`X_train` and `y_train`), and
- a test set (`X_test` and `y_test`)

It is important that the model is constructed on only the training data, so that the test data provides a realistic demonstration of the performance of the model (that is, how the model would perform on new data for which we don't know the output). The benefit of splitting our data like this is that we can quantify the test performance (using the measures described in Chapter 2.01, when we discussed dummy classifiers).
Now we're going to iteratively (in cycles) build models, with each iteration introducing one new feature. We'll then measure the performance of each iteration, observe what happens as the number of features grows, and plot the results.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics
import numpy as np

# create some data and split into training and test sets
X, y = make_classification(n_samples=1000, n_classes=2, n_features=500, n_informative=250)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# lists to capture results for plotting below
px = []
py = []
for n in range(1, 500):
    # pick n distinct features at random and get corresponding inputs
    sample_features = np.random.choice(500, size=n, replace=False)
    X_train_sample = X_train[:, sample_features]
    X_test_sample = X_test[:, sample_features]
    # fit bernoulli and make predictions
    modelb = BernoulliNB(binarize=0.0)
    modelb.fit(X_train_sample, y_train)
    predictions = modelb.predict_proba(X_test_sample)
    # calculate AUC, should be much better than our previous dummy classifiers!
    FPR, TPR, thresholds = metrics.roc_curve(y_test, predictions[:, 1])
    px.append(n)
    py.append(metrics.auc(FPR, TPR))

# plot our data
%matplotlib inline
import matplotlib.pyplot as plt
plt.xkcd()
plt.plot(px, py)
plt.title('Naive Bayes with increasing features')
plt.xlabel('Number of Features')
plt.ylabel('AUC');
```
We can see that even as large numbers of features are added, in this test the performance of naive Bayes increases only slowly.
As we discovered, naive Bayes is naive - it won't discover any relationships between inputs to improve the model, as it assumes each input to be independent of the others.
Now that we understand how inputs can impact probability independently, we're going to explore a model that can, in a simple way, start taking advantage of the relationships between those inputs.