#!/usr/bin/env python
# coding: utf-8

# # In Depth: Naive Bayes Classification

# The previous four chapters have given a general overview of the concepts of machine learning.
# In this chapter and the ones that follow, we will be taking a closer look first at four algorithms for supervised learning, and then at four algorithms for unsupervised learning.
# We start here with our first supervised method, naive Bayes classification.
#
# Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable for very high-dimensional datasets.
# Because they are so fast and have so few tunable parameters, they end up being useful as a quick-and-dirty baseline for a classification problem.
# This chapter will provide an intuitive explanation of how naive Bayes classifiers work, followed by a few examples of them in action on some datasets.

# ## Bayesian Classification
#
# Naive Bayes classifiers are built on Bayesian classification methods.
# These rely on Bayes's theorem, which is an equation describing the relationship of conditional probabilities of statistical quantities.
# In Bayesian classification, we're interested in finding the probability of a label $L$ given some observed features, which we can write as $P(L~|~{\rm features})$.
# Bayes's theorem tells us how to express this in terms of quantities we can compute more directly:
#
# $$
# P(L~|~{\rm features}) = \frac{P({\rm features}~|~L)P(L)}{P({\rm features})}
# $$
#
# If we are trying to decide between two labels—let's call them $L_1$ and $L_2$—then one way to make this decision is to compute the ratio of the posterior probabilities for each label:
#
# $$
# \frac{P(L_1~|~{\rm features})}{P(L_2~|~{\rm features})} = \frac{P({\rm features}~|~L_1)}{P({\rm features}~|~L_2)}\frac{P(L_1)}{P(L_2)}
# $$
#
# All we need now is some model by which we can compute $P({\rm features}~|~L_i)$ for each label.
# Such a model is called a *generative model* because it specifies the hypothetical random process that generates the data.
# Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier.
# The general version of such a training step is a very difficult task, but we can make it simpler through the use of some simplifying assumptions about the form of this model.
#
# This is where the "naive" in "naive Bayes" comes in: if we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification.
# Different types of naive Bayes classifiers rest on different naive assumptions about the data, and we will examine a few of these in the following sections.
#
# We begin with the standard imports:

# In[1]:

get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')


# ## Gaussian Naive Bayes
#
# Perhaps the easiest naive Bayes classifier to understand is Gaussian naive Bayes.
# With this classifier, the assumption is that *data from each label is drawn from a simple Gaussian distribution*.
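# Before turning to data, it may help to see the posterior-ratio arithmetic above worked out once by hand.
# The following is a minimal sketch (not part of the original example) for a single 1D feature, assuming
# each label generates that feature from its own Gaussian; the means, widths, and priors are made up
# purely for illustration:

from scipy import stats

# hypothetical generative model for each label (illustrative numbers only)
mu = {'L1': 0.0, 'L2': 3.0}       # class means
sigma = {'L1': 1.0, 'L2': 1.0}    # class standard deviations
prior = {'L1': 0.5, 'L2': 0.5}    # P(L1), P(L2)

x = 1.2  # a single observed feature value

# likelihoods P(x | L_i) under each class's Gaussian
like = {L: stats.norm(mu[L], sigma[L]).pdf(x) for L in mu}

# posterior ratio P(L1 | x) / P(L2 | x); values greater than 1 favor L1
ratio = (like['L1'] * prior['L1']) / (like['L2'] * prior['L2'])
print(ratio)

# Gaussian naive Bayes automates this same computation across many features and classes at once.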
# Imagine that we have the following data, shown in Figure 41-1:

# In[2]:

from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');


# The simplest Gaussian model is to assume that the data is described by a Gaussian distribution with no covariance between dimensions.
# This model can be fit by computing the mean and standard deviation of the points within each label, which is all we need to define such a distribution.
# The result of this naive Gaussian assumption is shown in the following figure:

# ![(run code in Appendix to generate image)](images/05.05-gaussian-NB.png)
# [figure source in Appendix](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/06.00-Figure-Code.ipynb#Gaussian-Naive-Bayes)

# The ellipses here represent the Gaussian generative model for each label, with larger probability toward the center of the ellipses.
# With this generative model in place for each class, we have a simple recipe to compute the likelihood $P({\rm features}~|~L_i)$ for any data point, and thus we can quickly compute the posterior ratio and determine which label is the most probable for a given point.
#
# This procedure is implemented in Scikit-Learn's `sklearn.naive_bayes.GaussianNB` estimator:

# In[3]:

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y);


# Let's generate some new data and predict the label:

# In[4]:

rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model.predict(Xnew)


# Now we can plot this new data to get an idea of where the decision boundary is (see the following figure):

# In[5]:

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);


# We see a slightly curved boundary in the classifications—in general, the boundary produced by a Gaussian naive Bayes model will be quadratic.
#
# A nice aspect of this Bayesian formalism is that it naturally allows for probabilistic classification, which we can compute using the `predict_proba` method:

# In[6]:

yprob = model.predict_proba(Xnew)
yprob[-8:].round(2)


# The columns give the posterior probabilities of the first and second labels, respectively.
# If you are looking for estimates of uncertainty in your classification, Bayesian approaches like this can be a good place to start.
#
# Of course, the final classification will only be as good as the model assumptions that lead to it, which is why Gaussian naive Bayes often does not produce very good results.
# Still, in many cases—especially as the number of features becomes large—this assumption is not detrimental enough to prevent Gaussian naive Bayes from being a reliable method.

# ## Multinomial Naive Bayes
#
# The Gaussian assumption just described is by no means the only simple assumption that could be used to specify the generative distribution for each label.
# Another useful example is multinomial naive Bayes, where the features are assumed to be generated from a simple multinomial distribution.
# The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates.
#
# The idea is precisely the same as before, except that instead of modeling the data distribution with the best-fit Gaussian, we model it with a best-fit multinomial distribution.
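# As a quick illustration of the idea (a sketch added here, not part of the original example),
# `MultinomialNB` can be fit directly to a small matrix of nonnegative counts; the "documents",
# token counts, and labels below are invented purely for demonstration:

from sklearn.naive_bayes import MultinomialNB
import numpy as np

# rows are toy "documents", columns are counts of three hypothetical tokens
X_counts = np.array([[5, 1, 0],
                     [4, 2, 1],
                     [0, 1, 6],
                     [1, 0, 5]])
y_toy = np.array([0, 0, 1, 1])

clf = MultinomialNB()
clf.fit(X_counts, y_toy)

# a new count vector heavy in the third token should be assigned to class 1
print(clf.predict([[0, 1, 4]]))
print(clf.predict_proba([[0, 1, 4]]).round(3))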
# ### Example: Classifying Text
#
# One place where multinomial naive Bayes is often used is in text classification, where the features are related to word counts or frequencies within the documents to be classified.
# We discussed the extraction of such features from text in [Feature Engineering](05.04-Feature-Engineering.ipynb); here we will use the sparse word count features from the 20 Newsgroups corpus made available through Scikit-Learn to show how we might classify these short documents into categories.
#
# Let's download the data and take a look at the target names:

# In[7]:

from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names


# For simplicity here, we will select just a few of these categories and download the training and testing sets:

# In[8]:

categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)


# Here is a representative entry from the data:

# In[9]:

print(train.data[5][48:])


# In order to use this data for machine learning, we need to be able to convert the content of each string into a vector of numbers.
# For this we will use the TF-IDF vectorizer (introduced in [Feature Engineering](05.04-Feature-Engineering.ipynb)), and create a pipeline that attaches it to a multinomial naive Bayes classifier:

# In[10]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())


# With this pipeline, we can apply the model to the training data and predict labels for the test data:

# In[11]:

model.fit(train.data, train.target)
labels = model.predict(test.data)


# Now that we have predicted the labels for the test data, we can evaluate them to learn about the performance of the estimator.
# For example, let's take a look at the confusion matrix between the true and predicted labels for the test data (see the following figure):

# In[12]:

from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names,
            cmap='Blues')
plt.xlabel('true label')
plt.ylabel('predicted label');


# Evidently, even this very simple classifier can successfully separate space discussions from computer discussions, but it gets confused between discussions about religion and discussions about Christianity.
# This is perhaps to be expected!
#
# The cool thing here is that we now have the tools to determine the category for *any* string, using the `predict` method of this pipeline.
# Here's a utility function that will return the prediction for a single string:

# In[13]:

def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]


# Let's try it out:

# In[14]:

predict_category('sending a payload to the ISS')


# In[15]:

predict_category('discussing the existence of God')


# In[16]:

predict_category('determining the screen resolution')


# Remember that this is nothing more sophisticated than a simple probability model for the (weighted) frequency of each word in the string; nevertheless, the result is striking.
# Even a very naive algorithm, when used carefully and trained on a large set of high-dimensional data, can be surprisingly effective.
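# If you want a single summary number for this pipeline rather than a full confusion matrix, one
# possible follow-up (a sketch added here, not part of the original text) is to compute the overall
# test-set accuracy, assuming the `test` and `labels` variables defined above are still in scope:

from sklearn.metrics import accuracy_score

# fraction of test documents whose predicted category matches the true one
print(accuracy_score(test.target, labels))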
# ## When to Use Naive Bayes
#
# Because naive Bayes classifiers make such stringent assumptions about data, they will generally not perform as well as more complicated models.
# That said, they have several advantages:
#
# - They are fast for both training and prediction.
# - They provide straightforward probabilistic prediction.
# - They are often easily interpretable.
# - They have few (if any) tunable parameters.
#
# These advantages mean a naive Bayes classifier is often a good choice for an initial baseline classification.
# If it performs suitably, then congratulations: you have a very fast, very interpretable classifier for your problem.
# If it does not perform well, then you can begin exploring more sophisticated models, with some baseline knowledge of how well they should perform.
#
# Naive Bayes classifiers tend to perform especially well in the following situations:
#
# - When the naive assumptions actually match the data (very rare in practice)
# - For very well-separated categories, when model complexity is less important
# - For very high-dimensional data, when model complexity is less important
#
# The last two points seem distinct, but they actually are related: as the dimensionality of a dataset grows, it is much less likely for any two points to be found close together (after all, they must be close in *every single dimension* to be close overall).
# This means that clusters in high dimensions tend to be more separated, on average, than clusters in low dimensions, assuming the new dimensions actually add information.
# For this reason, simplistic classifiers like the ones discussed here tend to work as well as or better than more complicated classifiers as the dimensionality grows: once you have enough data, even a simple model can be very powerful.
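# To see the high-dimensionality point in action, here is an illustrative sketch (not from the
# original text) comparing cross-validated Gaussian naive Bayes accuracy on synthetic data as the
# number of informative features grows; the exact numbers depend on the randomly generated data:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

for n_features in [2, 10, 50]:
    # every feature carries class information, so added dimensions add separation
    X_demo, y_demo = make_classification(n_samples=500, n_features=n_features,
                                         n_informative=n_features, n_redundant=0,
                                         random_state=42)
    scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5)
    print(n_features, scores.mean().round(3))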