This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

8.4. Learning from text: Naive Bayes for Natural Language Processing

In this recipe, we show how to handle text data with scikit-learn. Working with text requires careful preprocessing and feature extraction. It is also quite common to deal with highly sparse matrices.

We will learn to recognize whether a comment posted during a public discussion is considered insulting to one of the participants. We will use a labeled dataset from Impermium, released during a Kaggle competition.

You need to download the troll dataset from the book's website.

  1. Let's import our libraries.
In [ ]:
import numpy as np
import pandas as pd
import sklearn
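import sklearn.model_selection as cv  # sklearn.cross_validation was removed in scikit-learn 0.20
import sklearn.model_selection as gs  # sklearn.grid_search was removed in scikit-learn 0.20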
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
%matplotlib inline
  1. Let's open the csv file with Pandas.
In [ ]:
df = pd.read_csv("data/troll.csv")
  1. Each row is a comment. There are three columns: whether the comment is insulting (1) or not (0), the date, and the unicode-encoded contents of the comment.
In [ ]:
df[['Insult', 'Comment']].tail()
  1. Now, we are going to define the feature matrix $\mathbf{X}$ and the labels $\mathbf{y}$.
In [ ]:
y = df['Insult']

Obtaining the feature matrix from the text is not trivial. Scikit-learn can only work with numerical matrices, so how do we convert text into a matrix of numbers? A classical solution is to first extract a vocabulary: a list of the words used throughout the corpus. Then, for each sample, we count the frequency of each word (the TfidfVectorizer used here additionally reweights these counts by inverse document frequency). We end up with a sparse matrix: a huge matrix containing mostly zeros. Here, we do this in two lines. We will give more explanations in the "How it works..." section.

In [ ]:
tf = text.TfidfVectorizer()
X = tf.fit_transform(df['Comment'])
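To make the vocabulary-and-counts idea concrete, here is a minimal hand-rolled sketch using only the standard library. The three-sentence corpus and the `tokenize` helper are hypothetical, introduced for illustration only. The sketch produces raw counts, not the tf-idf weights that `TfidfVectorizer` computes.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, for illustration only.
corpus = [
    "you are so stupid",
    "I totally agree with you",
    "you are you",
]

def tokenize(doc):
    # Lowercase and keep runs of two or more word characters
    # (similar in spirit to scikit-learn's default token pattern,
    # which also drops one-letter tokens such as "I").
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: sorted list of the distinct tokens found in the corpus.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})
index = {word: j for j, word in enumerate(vocabulary)}

# Count matrix: one row per document, one column per vocabulary word.
counts = [[0] * len(vocabulary) for _ in corpus]
for i, doc in enumerate(corpus):
    for word, n in Counter(tokenize(doc)).items():
        counts[i][index[word]] = n

print(vocabulary)
for row in counts:
    print(row)
```

Even on this tiny corpus most entries are zero, which is why the real feature matrix is stored in a sparse format.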
  1. There are 3947 comments and 16469 different words. Let's estimate the sparsity of this feature matrix.
In [ ]:
print("Each sample has ~{0:.2f}% non-zero features.".format(
          100 * X.nnz / float(X.shape[0] * X.shape[1])))
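The same sparsity computation can be tried on a small matrix where the answer is easy to check by eye. This is a sketch on a made-up 3x4 array, not on the troll data: `nnz` is the number of explicitly stored non-zero entries of a SciPy sparse matrix.

```python
import numpy as np
import scipy.sparse as sp

# Made-up dense array with only two non-zero entries out of twelve.
dense = np.array([[0., 0., 1.5, 0.],
                  [0., 2.0, 0., 0.],
                  [0., 0., 0., 0.]])

# Compressed sparse row format stores only the non-zero values.
M = sp.csr_matrix(dense)

# Fraction of non-zero entries over the whole matrix.
density = M.nnz / (M.shape[0] * M.shape[1])
print("{0:.2f}% of the entries are non-zero.".format(100 * density))
```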
  1. Now, we are going to train a classifier as usual. We first split the data into a train and test set.
In [ ]:
(X_train, X_test,
 y_train, y_test) = cv.train_test_split(X, y,
                                        test_size=.2)  # assumed: the test fraction is cut off in the source
  1. We use a Bernoulli Naive Bayes classifier with a grid search on the parameter $\alpha$.
In [ ]:
bnb = gs.GridSearchCV(nb.BernoulliNB(), param_grid={'alpha': np.logspace(-2., 2., 50)})
bnb.fit(X_train, y_train);
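To see what the parameter $\alpha$ controls, the following sketch compares scikit-learn's fitted probabilities with the smoothed estimate $(N_{cj} + \alpha) / (N_c + 2\alpha)$, where $N_{cj}$ is the number of class-$c$ samples containing feature $j$ and $N_c$ is the number of class-$c$ samples. The 4x3 binary matrix `X_toy` and labels `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up binary features (rows = samples) and labels.
X_toy = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
y_toy = np.array([1, 1, 0, 0])

alpha = 1.0
clf = BernoulliNB(alpha=alpha).fit(X_toy, y_toy)

# Manual smoothed estimate of P(feature present | class 1).
rows = X_toy[y_toy == 1]
manual = (rows.sum(axis=0) + alpha) / (rows.shape[0] + 2 * alpha)

# feature_log_prob_[1] holds the log-probabilities for class 1.
fitted = np.exp(clf.feature_log_prob_[1])
print(manual, fitted)
```

A larger $\alpha$ pulls every estimated probability toward 0.5, while a smaller one trusts the observed counts more. The grid search picks the value that gives the best cross-validated score.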
  1. What is the performance of this classifier on the test dataset?
In [ ]:
bnb.score(X_test, y_test)
  1. Let's take a look at the words corresponding to the largest coefficients (the words we find frequently in insulting comments).
In [ ]:
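# We first get the words corresponding to each feature
# (use tf.get_feature_names() on scikit-learn older than 1.0).
names = np.asarray(tf.get_feature_names_out())
# Next, we display the 50 words with the largest
# coefficients (log-probabilities given the "insult" class).
top = np.argsort(bnb.best_estimator_.feature_log_prob_[1])[::-1][:50]
print(','.join(names[top]))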
  1. Finally, let's test our estimator on a few test sentences.
In [ ]:
print(bnb.predict(tf.transform([
    "I totally agree with you.",
    "You are so stupid.",
    "I love you.",
])))

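New sentences must go through the same fitted vectorizer before they reach the classifier. One way to make that hard to forget is to bundle both steps in a scikit-learn pipeline. This is a minimal sketch on a made-up four-sentence corpus, not the recipe's own code and not the troll dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus and labels (1 = insulting, 0 = not).
train_docs = [
    "you are stupid",
    "you are an idiot",
    "I agree with you",
    "thanks for the help",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text, then feeds the sparse matrix
# to the classifier (BernoulliNB binarizes the tf-idf values).
model = make_pipeline(TfidfVectorizer(), BernoulliNB(alpha=1.0))
model.fit(train_docs, train_labels)

# Raw strings can now be passed directly to predict().
predictions = model.predict(["stupid idiot", "thanks, I agree"])
print(predictions)
```

With so little training data the predictions only reflect word overlap with each class. The point of the sketch is the workflow, not the accuracy.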
You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).