Can Bayes' theorem help us to solve a classification problem, namely predicting the species of an iris?
We'll read the iris data into a DataFrame, and round each measurement up to the nearest integer (values that are already whole stay unchanged), so that every feature takes only a few discrete values:
import pandas as pd
import numpy as np
# read the iris data into a DataFrame
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
# apply the ceiling function to the numeric columns
iris.loc[:, 'sepal_length':'petal_width'] = iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil)
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
0 | 6 | 4 | 2 | 1 | Iris-setosa |
1 | 5 | 3 | 2 | 1 | Iris-setosa |
2 | 5 | 4 | 2 | 1 | Iris-setosa |
3 | 5 | 4 | 2 | 1 | Iris-setosa |
4 | 5 | 4 | 2 | 1 | Iris-setosa |
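To make the rounding rule concrete, here is a minimal sketch that applies Python's built-in `math.ceil` to a few values copied from the first table. It uses `math.ceil` rather than `np.ceil` only to stay dependency-free; `np.ceil` follows the same rule elementwise but returns floats. The list `raw` is an illustrative sample, not the full dataset.

```python
import math

# A few raw measurements copied from the first rows shown above
raw = [5.1, 3.5, 3.0, 1.4, 0.2]

# The ceiling rounds up to the nearest integer; whole values such as 3.0 are unchanged
rounded = [math.ceil(x) for x in raw]
print(rounded)
```

Note that 3.0 stays at 3 while 3.5 becomes 4, which is why the sepal widths in the rounded table are a mix of 3s and 4s.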
Let's say that I have an out-of-sample iris with the following measurements: 7, 3, 5, 2. How might I predict the species?
# show all observations with features: 7, 3, 5, 2
iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)]
| | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
54 | 7 | 3 | 5 | 2 | Iris-versicolor |
58 | 7 | 3 | 5 | 2 | Iris-versicolor |
63 | 7 | 3 | 5 | 2 | Iris-versicolor |
68 | 7 | 3 | 5 | 2 | Iris-versicolor |
72 | 7 | 3 | 5 | 2 | Iris-versicolor |
73 | 7 | 3 | 5 | 2 | Iris-versicolor |
74 | 7 | 3 | 5 | 2 | Iris-versicolor |
75 | 7 | 3 | 5 | 2 | Iris-versicolor |
76 | 7 | 3 | 5 | 2 | Iris-versicolor |
77 | 7 | 3 | 5 | 2 | Iris-versicolor |
87 | 7 | 3 | 5 | 2 | Iris-versicolor |
91 | 7 | 3 | 5 | 2 | Iris-versicolor |
97 | 7 | 3 | 5 | 2 | Iris-versicolor |
123 | 7 | 3 | 5 | 2 | Iris-virginica |
126 | 7 | 3 | 5 | 2 | Iris-virginica |
127 | 7 | 3 | 5 | 2 | Iris-virginica |
146 | 7 | 3 | 5 | 2 | Iris-virginica |
# count the species for these observations
iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)].species.value_counts()
Iris-versicolor    13
Iris-virginica      4
dtype: int64
# count the species for all observations
iris.species.value_counts()
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64
Let's frame this as a conditional probability problem: What is the probability of some particular species, given the measurements 7, 3, 5, and 2?
$$P(species \ | \ 7352)$$
We could calculate the conditional probability for each of the three species, and then predict the species with the highest probability:
$$P(setosa \ | \ 7352)$$
$$P(versicolor \ | \ 7352)$$
$$P(virginica \ | \ 7352)$$
Bayes' theorem gives us a way to calculate these conditional probabilities.
Let's start with versicolor:
$$P(versicolor \ | \ 7352) = \frac {P(7352 \ | \ versicolor) \times P(versicolor)} {P(7352)}$$
We can calculate each of the terms on the right side of the equation from the counts above:
$$P(7352 \ | \ versicolor) = \frac {13} {50} = 0.26$$
$$P(versicolor) = \frac {50} {150} \approx 0.33$$
$$P(7352) = \frac {17} {150} \approx 0.11$$
Therefore, Bayes' theorem says the probability of versicolor given these measurements is:
$$P(versicolor \ | \ 7352) = \frac {\frac {13} {50} \times \frac {50} {150}} {\frac {17} {150}} = \frac {13} {17} \approx 0.76$$
Let's repeat this process for virginica and setosa. Of the 50 virginicas, 4 have these measurements, and none of the 50 setosas do:
$$P(7352 \ | \ virginica) = \frac {4} {50} = 0.08$$
$$P(7352 \ | \ setosa) = \frac {0} {50} = 0$$
$$P(virginica \ | \ 7352) = \frac {\frac {4} {50} \times \frac {50} {150}} {\frac {17} {150}} = \frac {4} {17} \approx 0.24$$
$$P(setosa \ | \ 7352) = \frac {0 \times \frac {50} {150}} {\frac {17} {150}} = 0$$
We predict that the iris is a versicolor, since that species has the highest conditional probability.
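The same arithmetic can be sketched in a few lines of plain Python. This is only an illustration: the counts are typed in by hand from the `value_counts()` output above rather than recomputed from the DataFrame, and the variable names (`matches`, `totals`, `posterior`) are assumptions made for this example.

```python
# Counts read off the outputs above (not recomputed from the raw file)
matches = {'setosa': 0, 'versicolor': 13, 'virginica': 4}   # rows equal to 7, 3, 5, 2
totals = {'setosa': 50, 'versicolor': 50, 'virginica': 50}  # rows per species

n = sum(totals.values())               # 150 observations in total
evidence = sum(matches.values()) / n   # P(7352) = 17/150

posterior = {}
for species in totals:
    likelihood = matches[species] / totals[species]  # P(7352 | species)
    prior = totals[species] / n                      # P(species)
    posterior[species] = likelihood * prior / evidence

# Predict the species with the highest posterior probability
prediction = max(posterior, key=posterior.get)

for species, p in posterior.items():
    print(species, round(p, 2))
print('predicted:', prediction)
```

Because the code works with unrounded fractions, the three posteriors sum to 1, which is a handy sanity check on the hand calculation.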
Let's make some hypothetical adjustments to the data, to demonstrate how Bayes' theorem makes intuitive sense:
Pretend that more of the existing versicolors had measurements of 7352:
Pretend that most of the existing irises were versicolor:
Pretend that 17 of the setosas had measurements of 7352:
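The last adjustment can be checked numerically. The sketch below wraps the Bayes calculation in a hypothetical helper, `posteriors` (a name introduced only for this example), and compares the observed counts with the scenario in which 17 setosas also measured 7, 3, 5, 2. The text gives no specific counts for the first two adjustments, so they are left for the reader to try with numbers of their own.

```python
def posteriors(matches, totals):
    """Return P(species | 7352) for each species, computed from raw counts."""
    n = sum(totals.values())
    evidence = sum(matches.values()) / n  # P(7352)
    return {
        s: (matches[s] / totals[s]) * (totals[s] / n) / evidence
        for s in totals
    }

totals = {'setosa': 50, 'versicolor': 50, 'virginica': 50}

# Observed counts of irises measuring 7, 3, 5, 2
observed = posteriors({'setosa': 0, 'versicolor': 13, 'virginica': 4}, totals)

# Hypothetical: pretend that 17 of the setosas had these measurements too
adjusted = posteriors({'setosa': 17, 'versicolor': 13, 'virginica': 4}, totals)

print({s: round(p, 2) for s, p in observed.items()})
print({s: round(p, 2) for s, p in adjusted.items()})
```

In the adjusted scenario the evidence term doubles to 34/150, so the versicolor posterior falls from 13/17 to 13/34 and setosa becomes the predicted species, even though the versicolor counts themselves did not change.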