#!/usr/bin/env python
# coding: utf-8

# > This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.

# # 8.4. Learning from text: Naive Bayes for Natural Language Processing

# In this recipe, we show how to handle text data with scikit-learn. Working with text requires careful preprocessing and feature extraction. It is also quite common to deal with highly sparse matrices.
#
# We will learn to recognize whether a comment posted during a public discussion is considered insulting to one of the participants. We will use a labeled dataset from [Impermium](https://impermium.com), released during a [Kaggle competition](https://www.kaggle.com/c/detecting-insults-in-social-commentary).

# You need to download the *troll* dataset from the book's website (https://ipython-books.github.io).

# 1. Let's import our libraries.

# In[ ]:

import numpy as np
import pandas as pd
import sklearn
import sklearn.model_selection as ms
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# 2. Let's open the CSV file with pandas.

# In[ ]:

df = pd.read_csv("data/troll.csv")

# 3. Each row is a comment. There are three columns: whether the comment is insulting (1) or not (0), the date, and the Unicode-encoded contents of the comment.

# In[ ]:

df[['Insult', 'Comment']].tail()

# 4. Now, we are going to define the feature matrix $\mathbf{X}$ and the labels $\mathbf{y}$.

# In[ ]:

y = df['Insult']

# Obtaining the feature matrix from the text is not trivial: scikit-learn can only work with numerical matrices, so we need a way to convert text into a matrix of numbers. A classical solution is to first extract a **vocabulary**, the list of all words used throughout the corpus, and then count, for each sample, the frequency of each word. We end up with a **sparse matrix**: a huge matrix containing mostly zeros. Here, we do this in two lines (a toy-corpus sketch at the end of this recipe also illustrates the idea). We will give more explanations in *How it works...*.

# In[ ]:

tf = text.TfidfVectorizer()
X = tf.fit_transform(df['Comment'])
print(X.shape)

# 5. There are 3947 comments and 16469 different words. Let's estimate the sparsity of this feature matrix.

# In[ ]:

print("Each sample has ~{0:.2f}% non-zero features.".format(
    100 * X.nnz / float(X.shape[0] * X.shape[1])))

# 6. Now, we are going to train a classifier as usual. We first split the data into a training and a test set.

# In[ ]:

(X_train, X_test,
 y_train, y_test) = ms.train_test_split(X, y, test_size=.2)

# 7. We use a **Bernoulli Naive Bayes classifier** with a grid search on the smoothing parameter $\alpha$.

# In[ ]:

bnb = ms.GridSearchCV(
    nb.BernoulliNB(),
    param_grid={'alpha': np.logspace(-2., 2., 50)})
bnb.fit(X_train, y_train)

# 8. What is the performance of this classifier on the test dataset?

# In[ ]:

bnb.score(X_test, y_test)

# 9. Let's take a look at the words corresponding to the largest coefficients (the words we find frequently in insulting comments).

# In[ ]:

# We first get the words corresponding to each feature.
# get_feature_names_out() replaces the get_feature_names()
# method removed in recent scikit-learn versions.
names = np.asarray(tf.get_feature_names_out())
# Next, we display the 50 words with the largest coefficients.
# feature_log_prob_[1] contains the log probability of each word
# in the insulting class; it is equivalent to the coef_ attribute
# removed from the naive Bayes classifiers in recent scikit-learn.
print(','.join(names[np.argsort(
    bnb.best_estimator_.feature_log_prob_[1, :])[::-1][:50]]))

# 10. Finally, let's test our estimator on a few test sentences.

# In[ ]:

print(bnb.predict(tf.transform([
    "I totally agree with you.",
    "You are so stupid.",
    "I love you.",
])))
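# To make the vectorization step more concrete, here is a minimal sketch on a made-up three-document corpus. The `toy_corpus` and `vec` names are illustrative and are not part of the troll dataset; the sketch only shows how a `TfidfVectorizer` extracts a vocabulary and returns a sparse matrix.

# In[ ]:

toy_corpus = ["the cat sat",
              "the cat sat on the mat",
              "the dog barked"]
vec = text.TfidfVectorizer()
X_toy = vec.fit_transform(toy_corpus)
# The vocabulary maps every word in the corpus to a column index.
print(vec.vocabulary_)
# X_toy is a SciPy sparse matrix: only the non-zero entries are stored.
print(X_toy.shape, X_toy.nnz)
# Densifying is fine for a toy example, but it would not scale to
# thousands of comments and tens of thousands of words.
print(X_toy.toarray().round(2))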
# > You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

# > [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).