Working with text data
Naive Bayes classification
From the scikit-learn documentation:
Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.
We will use CountVectorizer to "convert text into a matrix of token counts":
from sklearn.feature_extraction.text import CountVectorizer
# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(simple_train)
vect.get_feature_names()
[u'cab', u'call', u'me', u'please', u'tonight', u'you']
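The fitted vectorizer also exposes the learned token-to-column mapping directly; a quick look (vocabulary_ is a fitted attribute of CountVectorizer):
# the fitted vocabulary maps each token to its column index (cab=0, call=1, ..., you=5)
vect.vocabulary_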
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
<3x6 sparse matrix of type '<type 'numpy.int64'>' with 9 stored elements in Compressed Sparse Row format>
# print the sparse matrix
print simple_train_dtm
  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
 | cab | call | me | please | tonight | you |
---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 | 1 |
1 | 1 | 1 | 1 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 2 | 0 | 0 |
From the scikit-learn documentation:
In this scheme, features and samples are defined as follows:
- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
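As an aside, the "Bag of n-grams" variant mentioned above only requires changing one parameter; a minimal sketch using CountVectorizer's ngram_range option (not used in the rest of this section):
# count word pairs (bigrams) in addition to single words
vect_ngram = CountVectorizer(ngram_range=(1, 2))
vect_ngram.fit(simple_train)
vect_ngram.get_feature_names()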
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
array([[0, 1, 1, 1, 0, 0]], dtype=int64)
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
 | cab | call | me | please | tonight | you |
---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 1 | 0 | 0 |
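Note that "don't" contributed nothing to the test vector: the default tokenizer extracts "don" from it (single characters like "t" are skipped by the default token pattern), and since "don" is not in the fitted vocabulary, it is simply ignored.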
Summary:
- vect.fit(train) learns the vocabulary of the training data
- vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
- vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
# read tab-separated file
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
col_names = ['label', 'message']
sms = pd.read_table(url, sep='\t', header=None, names=col_names)
print sms.shape
(5572, 2)
sms.head(20)
 | label | message |
---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
6 | ham | Even my brother is not like to speak with me. ... |
7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
8 | spam | WINNER!! As a valued network customer you have... |
9 | spam | Had your mobile 11 months or more? U R entitle... |
10 | ham | I'm gonna be home soon and i don't want to tal... |
11 | spam | SIX chances to win CASH! From 100 to 20,000 po... |
12 | spam | URGENT! You have won a 1 week FREE membership ... |
13 | ham | I've been searching for the right words to tha... |
14 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
15 | spam | XXXMobileMovieClub: To use your credit, click ... |
16 | ham | Oh k...i'm watching here:) |
17 | ham | Eh u remember how 2 spell his name... Yes i di... |
18 | ham | Fine if thats the way u feel. Thats the way ... |
19 | spam | England v Macedonia - dont miss the goals/team... |
sms.label.value_counts()
ham     4825
spam     747
dtype: int64
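Note the class imbalance: only 747 of the 5572 messages (about 13%) are spam, which is worth keeping in mind when judging accuracy later.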
# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})
# define X and y
X = sms.message
y = sms.label
# split into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print X_train.shape
print X_test.shape
(4179L,)
(1393L,)
# instantiate the vectorizer
vect = CountVectorizer()
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm
<4179x7456 sparse matrix of type '<type 'numpy.int64'>' with 55209 stored elements in Compressed Sparse Row format>
# alternative: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm
<4179x7456 sparse matrix of type '<type 'numpy.int64'>' with 55209 stored elements in Compressed Sparse Row format>
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
<1393x7456 sparse matrix of type '<type 'numpy.int64'>' with 17604 stored elements in Compressed Sparse Row format>
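Note that X_test_dtm has the same 7456 columns as X_train_dtm: the test messages are mapped onto the vocabulary fitted on the training data, which is exactly what allows a model trained on X_train_dtm to score them.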
# store token names
X_train_tokens = vect.get_feature_names()
# first 50 tokens
print X_train_tokens[:50]
[u'00', u'000', u'008704050406', u'0121', u'01223585236', u'01223585334', u'0125698789', u'02', u'0207', u'02072069400', u'02073162414', u'02085076972', u'021', u'03', u'04', u'0430', u'05', u'050703', u'0578', u'06', u'07', u'07008009200', u'07090201529', u'07090298926', u'07123456789', u'07732584351', u'07734396839', u'07742676969', u'0776xxxxxxx', u'07781482378', u'07786200117', u'078', u'07801543489', u'07808', u'07808247860', u'07808726822', u'07815296484', u'07821230901', u'07880867867', u'0789xxxxxxx', u'07946746291', u'0796xxxxxx', u'07973788240', u'07xxxxxxxxx', u'08', u'0800', u'08000407165', u'08000776320', u'08000839402', u'08000930705']
# last 50 tokens
print X_train_tokens[-50:]
[u'yer', u'yes', u'yest', u'yesterday', u'yet', u'yetunde', u'yijue', u'ym', u'ymca', u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'youdoing', u'youi', u'youphone', u'your', u'youre', u'yourjob', u'yours', u'yourself', u'youwanna', u'yowifes', u'yoyyooo', u'yr', u'yrs', u'ything', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou', u'yup', u'zac', u'zaher', u'zealand', u'zebra', u'zed', u'zeros', u'zhong', u'zindgi', u'zoe', u'zoom', u'zouk', u'zyada', u'\xe8n', u'\u3028ud']
# view X_train_dtm as a dense matrix
X_train_dtm.toarray()
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts
array([ 5, 23, 2, ..., 1, 1, 1], dtype=int64)
X_train_counts.shape
(7456L,)
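As an aside, calling toarray() densifies the whole matrix; the same column sums can be computed directly on the sparse matrix, which scales better for larger corpora. A minimal sketch:
# equivalent computation without densifying: sum each column of the sparse matrix
X_train_counts_sparse = np.asarray(X_train_dtm.sum(axis=0)).flatten()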
# create a DataFrame of tokens with their counts
pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts}).sort('count')
 | count | token |
---|---|---|
3727 | 1 | jules |
4172 | 1 | mallika |
4169 | 1 | malarky |
4165 | 1 | makiing |
4161 | 1 | maintaining |
4158 | 1 | mails |
4157 | 1 | mailed |
4151 | 1 | magicalsongs |
4150 | 1 | maggi |
4149 | 1 | magazine |
4146 | 1 | madodu |
4143 | 1 | mad2 |
4142 | 1 | mad1 |
4140 | 1 | macs |
4139 | 1 | macleran |
4138 | 1 | mack |
4174 | 1 | manage |
4175 | 1 | manageable |
4178 | 1 | manchester |
4179 | 1 | manda |
4201 | 1 | marking |
4200 | 1 | marketing |
4197 | 1 | marine |
4196 | 1 | margin |
4193 | 1 | marandratha |
4192 | 1 | maraikara |
4191 | 1 | maps |
4136 | 1 | machi |
4190 | 1 | mapquest |
4187 | 1 | manual |
... | ... | ... |
2290 | 292 | do |
7257 | 293 | with |
7120 | 293 | we |
6904 | 297 | ur |
1081 | 298 | at |
2995 | 299 | get |
3465 | 302 | if |
4778 | 306 | or |
1522 | 332 | but |
4647 | 338 | not |
6017 | 344 | so |
1574 | 349 | can |
1016 | 358 | are |
4662 | 361 | now |
4743 | 390 | on |
3235 | 416 | have |
1552 | 443 | call |
6539 | 453 | that |
4704 | 460 | of |
7424 | 508 | your |
2821 | 518 | for |
4489 | 550 | my |
3623 | 568 | it |
4238 | 601 | me |
3612 | 679 | is |
3502 | 683 | in |
929 | 717 | and |
6542 | 1004 | the |
7420 | 1660 | you |
6656 | 1670 | to |
7456 rows × 2 columns
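Unsurprisingly, the most frequent tokens are common English function words. If those are unwanted, CountVectorizer can filter them out; a minimal sketch using its built-in stop word list (not applied in the rest of this section):
# optional: ignore common English stop words like 'the', 'to', and 'you'
vect_stop = CountVectorizer(stop_words='english')
vect_stop.fit_transform(X_train)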
# create separate DataFrames for ham and spam
sms_ham = sms[sms.label==0]
sms_spam = sms[sms.label==1]
# learn the vocabulary of ALL messages and save it
vect.fit(sms.message)
all_tokens = vect.get_feature_names()
# create document-term matrices for ham and spam
ham_dtm = vect.transform(sms_ham.message)
spam_dtm = vect.transform(sms_spam.message)
# count how many times EACH token appears across ALL ham messages
ham_counts = np.sum(ham_dtm.toarray(), axis=0)
# count how many times EACH token appears across ALL spam messages
spam_counts = np.sum(spam_dtm.toarray(), axis=0)
# create a DataFrame of tokens with their separate ham and spam counts
token_counts = pd.DataFrame({'token':all_tokens, 'ham':ham_counts, 'spam':spam_counts})
# add one to ham and spam counts to avoid dividing by zero (in the step that follows)
token_counts['ham'] = token_counts.ham + 1
token_counts['spam'] = token_counts.spam + 1
# calculate ratio of spam-to-ham for each token
token_counts['spam_ratio'] = token_counts.spam / token_counts.ham
token_counts.sort('spam_ratio')
 | ham | spam | token | spam_ratio |
---|---|---|---|---|
3684 | 319 | 1 | gt | 0.003135 |
4793 | 317 | 1 | lt | 0.003155 |
3805 | 232 | 1 | he | 0.004310 |
6843 | 168 | 1 | she | 0.005952 |
4747 | 163 | 1 | lor | 0.006135 |
2428 | 151 | 1 | da | 0.006623 |
4550 | 136 | 1 | later | 0.007353 |
1247 | 90 | 1 | ask | 0.011111 |
6626 | 90 | 1 | said | 0.011111 |
2714 | 89 | 1 | doing | 0.011236 |
1084 | 89 | 1 | amp | 0.011236 |
5167 | 80 | 1 | morning | 0.012500 |
2163 | 231 | 3 | come | 0.012987 |
1142 | 77 | 1 | anything | 0.012987 |
2289 | 77 | 1 | cos | 0.012987 |
4724 | 75 | 1 | lol | 0.013333 |
7463 | 72 | 1 | sure | 0.013889 |
7099 | 70 | 1 | something | 0.014286 |
3690 | 68 | 1 | gud | 0.014706 |
3171 | 63 | 1 | feel | 0.015873 |
8394 | 63 | 1 | went | 0.015873 |
5371 | 63 | 1 | nice | 0.015873 |
3595 | 59 | 1 | gonna | 0.016949 |
7001 | 59 | 1 | sleep | 0.016949 |
1064 | 59 | 1 | always | 0.016949 |
5254 | 755 | 13 | my | 0.017219 |
5217 | 116 | 2 | much | 0.017241 |
5533 | 115 | 2 | oh | 0.017391 |
2815 | 56 | 1 | dun | 0.017857 |
3925 | 166 | 3 | home | 0.018072 |
... | ... | ... | ... | ... |
1623 | 1 | 22 | bonus | 22.000000 |
3994 | 1 | 22 | http | 22.000000 |
6619 | 1 | 22 | sae | 22.000000 |
735 | 1 | 22 | 8007 | 22.000000 |
732 | 1 | 23 | 800 | 23.000000 |
5297 | 1 | 23 | national | 23.000000 |
8375 | 1 | 25 | weekly | 25.000000 |
8153 | 1 | 25 | valid | 25.000000 |
309 | 1 | 25 | 10p | 25.000000 |
618 | 1 | 26 | 5000 | 26.000000 |
5117 | 1 | 26 | mob | 26.000000 |
364 | 2 | 54 | 16 | 27.000000 |
2150 | 1 | 27 | collection | 27.000000 |
7838 | 1 | 27 | tones | 27.000000 |
2963 | 1 | 27 | entry | 27.000000 |
1 | 1 | 30 | 000 | 30.000000 |
8596 | 3 | 99 | www | 33.000000 |
6525 | 1 | 33 | ringtone | 33.000000 |
356 | 1 | 35 | 150ppm | 35.000000 |
8016 | 2 | 75 | uk | 37.500000 |
1333 | 1 | 39 | awarded | 39.000000 |
299 | 1 | 42 | 1000 | 42.000000 |
617 | 1 | 45 | 500 | 45.000000 |
2371 | 1 | 45 | cs | 45.000000 |
3688 | 1 | 51 | guaranteed | 51.000000 |
369 | 1 | 52 | 18 | 52.000000 |
7837 | 1 | 61 | tone | 61.000000 |
352 | 1 | 72 | 150p | 72.000000 |
6113 | 1 | 94 | prize | 94.000000 |
2067 | 1 | 114 | claim | 114.000000 |
8713 rows × 4 columns
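To inspect a single token's ratio, filter the DataFrame; for example, 'claim' from the bottom of the table above:
# look up the spam-to-ham ratio for one token
token_counts[token_counts.token == 'claim']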
We will use Multinomial Naive Bayes:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
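As an aside, the multinomial Naive Bayes decision rule is linear in the token counts; a minimal sketch that reproduces the class predictions from the model's fitted attributes (class_log_prior_ and feature_log_prob_):
# log P(class | message) is proportional to log P(class) plus, for each token,
# its count in the message times log P(token | class)
jll = X_test_dtm.dot(nb.feature_log_prob_.T) + nb.class_log_prior_
manual_pred_class = nb.classes_[np.argmax(jll, axis=1)]
# manual_pred_class matches nb.predict(X_test_dtm)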
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred_class)
0.988513998564
# confusion matrix
print metrics.confusion_matrix(y_test, y_pred_class)
[[1203    5]
 [  11  174]]
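For context, compare against the null accuracy: the test set contains 1208 ham and 185 spam messages (the row sums of the confusion matrix), so a model that always predicts ham would already score about 86.7%. A quick check:
# null accuracy: the proportion of the majority class (ham) in the test set
y_test.value_counts().head(1) / len(y_test)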
# predict (poorly calibrated) probabilities
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([  2.87744864e-03,   1.83488846e-05,   2.07301295e-03, ...,
         1.09026171e-06,   1.00000000e+00,   3.98279868e-09])
# calculate AUC
print metrics.roc_auc_score(y_test, y_pred_prob)
0.986643100054
# print message text for the false positives
X_test[y_test < y_pred_class]
574                      Waiting for your call.
3375                   Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object
# print message text for the false negatives
X_test[y_test > y_pred_class]
3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object
# what do you notice about the false negatives?
X_test[3132]
"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."
# import/instantiate/fit (a large C makes the L2 regularization negligible, for a fairer comparison)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)
# class predictions and predicted probabilities
y_pred_class = logreg.predict(X_test_dtm)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
# calculate accuracy and AUC
print metrics.accuracy_score(y_test, y_pred_class)
print metrics.roc_auc_score(y_test, y_pred_prob)
0.989231873654
0.994144889923
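On this dataset, logistic regression edges out Naive Bayes, with slightly higher accuracy (0.9892 vs. 0.9885) and a noticeably higher AUC (0.9941 vs. 0.9866).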