This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

8.4. Learning from text: Naive Bayes for Natural Language Processing

In this recipe, we show how to handle text data with scikit-learn. Working with text requires careful preprocessing and feature extraction. It is also quite common to deal with highly sparse matrices.

We will learn to recognize whether a comment posted during a public discussion is considered insulting to one of the participants. We will use a labeled dataset from Impermium, released during a Kaggle competition.

You need to download the troll dataset from the book's website (https://ipython-books.github.io).

  1. Let's import our libraries.
In [ ]:
import numpy as np
import pandas as pd
import sklearn
import sklearn.cross_validation as cv
import sklearn.grid_search as gs
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
%matplotlib inline
  2. Let's open the CSV file with pandas.
In [ ]:
df = pd.read_csv("data/troll.csv")
  3. Each row is a comment. There are three columns: whether the comment is insulting (1) or not (0), the date, and the Unicode-encoded contents of the comment.
In [ ]:
df[['Insult', 'Comment']].tail()
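Before building a model, it is worth checking how balanced the two classes are; this will matter later when we interpret the accuracy. A minimal sketch (the exact counts depend on the version of the dataset you downloaded):
In [ ]:
# Count how many comments are labeled insulting (1) vs. not insulting (0).
df['Insult'].value_counts()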
  4. Now, we are going to define the feature matrix $\mathbf{X}$ and the labels $\mathbf{y}$.
In [ ]:
y = df['Insult']

Obtaining the feature matrix from the text is not trivial. Scikit-learn can only work with numerical matrices. How can we convert text into a matrix of numbers? A classical solution is to first extract a vocabulary: the list of all words used throughout the corpus. Then, for each sample, we count the frequency of each word. We end up with a sparse matrix: a huge matrix containing mostly zeros. Here, we do this in two lines; the TfidfVectorizer used below additionally reweights the raw counts by inverse document frequency, so that very common words get lower weights. We will give more explanations in the How it works... section.

In [ ]:
tf = text.TfidfVectorizer()
X = tf.fit_transform(df['Comment'])
print(X.shape)
  5. There are 3947 comments and 16469 different words. Let's estimate the sparsity of this feature matrix.
In [ ]:
print("Each sample has ~{0:.2f}% non-zero features.".format(
          100 * X.nnz / float(X.shape[0] * X.shape[1])))
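To see concretely what the vectorizer does, here is a minimal sketch on a tiny made-up corpus (the sentences and variable names are illustrative only). CountVectorizer builds the vocabulary and counts word occurrences; TfidfVectorizer works the same way, except that it reweights the counts by inverse document frequency and normalizes each row.
In [ ]:
# A toy corpus to illustrate vocabulary extraction and sparse counts.
corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs"]
vec = text.CountVectorizer()
X_toy = vec.fit_transform(corpus)   # sparse matrix of word counts
print(vec.get_feature_names())      # the extracted vocabulary
print(X_toy.toarray())              # dense view: mostly zeros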
  6. Now, we are going to train a classifier as usual. We first split the data into a training set and a test set.
In [ ]:
(X_train, X_test,
 y_train, y_test) = cv.train_test_split(X, y,
                                        test_size=.2)
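Since the two classes are not equally frequent, it can be useful to check that the random split preserved roughly the same proportion of insults in both sets; a quick sketch:
In [ ]:
# Fraction of insulting comments in the training and test sets.
print(y_train.mean(), y_test.mean())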
  7. We use a Bernoulli Naive Bayes classifier with a grid search on the smoothing parameter $\alpha$.
In [ ]:
bnb = gs.GridSearchCV(nb.BernoulliNB(),
                      param_grid={'alpha': np.logspace(-2., 2., 50)})
bnb.fit(X_train, y_train);
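Once the grid search has finished, we can inspect which value of $\alpha$ was selected and its mean cross-validated score; a minimal sketch using GridSearchCV's standard attributes:
In [ ]:
# Best smoothing parameter found by the grid search,
# and the corresponding mean cross-validation accuracy.
print(bnb.best_params_)
print(bnb.best_score_)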
  8. What is the performance of this classifier on the test dataset?
In [ ]:
bnb.score(X_test, y_test)
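The score() method returns the accuracy, which can be optimistic when one class dominates. As a complement, here is a hedged sketch that reports per-class precision and recall with scikit-learn's classification_report:
In [ ]:
import sklearn.metrics as metrics
# Per-class precision, recall, and F1-score on the held-out test set.
y_pred = bnb.predict(X_test)
print(metrics.classification_report(y_test, y_pred))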
  9. Let's take a look at the words corresponding to the largest coefficients (the words we find frequently in insulting comments).
In [ ]:
# We first get the words corresponding to each feature.
names = np.asarray(tf.get_feature_names())
# Next, we display the 50 words with the largest
# coefficients.
print(','.join(names[np.argsort(
    bnb.best_estimator_.coef_[0,:])[::-1][:50]]))
  10. Finally, let's test our estimator on a few test sentences.
In [ ]:
print(bnb.predict(tf.transform([
    "I totally agree with you.",
    "You are so stupid.",
    "I love you."
    ])))
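Beyond hard 0/1 predictions, the classifier also exposes class probabilities via predict_proba(), which GridSearchCV forwards to the best estimator; a short sketch on the same sentences:
In [ ]:
# Probabilities of (not insulting, insulting) for each sentence.
print(bnb.predict_proba(tf.transform([
    "I totally agree with you.",
    "You are so stupid.",
    "I love you."
    ])))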

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).