Working with text data
Naive Bayes classification
From the scikit-learn documentation:
Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.
We will use CountVectorizer to "convert text into a matrix of token counts":
from sklearn.feature_extraction.text import CountVectorizer
# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(simple_train)
vect.get_feature_names()
[u'cab', u'call', u'me', u'please', u'tonight', u'you']
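The fitted vectorizer also exposes the learned token-to-column mapping directly; a quick look (vocabulary_ is a fitted attribute of CountVectorizer):
# the fitted vocabulary maps each token to its column index (cab=0, call=1, ..., you=5)
vect.vocabulary_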
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
<3x6 sparse matrix of type '<type 'numpy.int64'>' with 9 stored elements in Compressed Sparse Row format>
# print the sparse matrix
print simple_train_dtm
  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
 | cab | call | me | please | tonight | you |
---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 | 1 |
1 | 1 | 1 | 1 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 2 | 0 | 0 |
From the scikit-learn documentation:
In this scheme, features and samples are defined as follows:
- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
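As an aside, the "Bag of n-grams" variant mentioned above only requires changing one parameter; a minimal sketch using CountVectorizer's ngram_range option (not used in the rest of this section):
# count word pairs (bigrams) in addition to single words
vect_ngram = CountVectorizer(ngram_range=(1, 2))
vect_ngram.fit(simple_train)
vect_ngram.get_feature_names()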
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
array([[0, 1, 1, 1, 0, 0]], dtype=int64)
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
 | cab | call | me | please | tonight | you |
---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | 1 | 0 | 0 |
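Note that "don't" contributed nothing to the test vector: the default tokenizer extracts "don" from it (single characters like "t" are skipped by the default token pattern), and since "don" is not in the fitted vocabulary, it is simply ignored.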
Summary:
- vect.fit(train) learns the vocabulary of the training data
- vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
- vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
# read tab-separated file
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
col_names = ['label', 'message']
sms = pd.read_table(url, sep='\t', header=None, names=col_names)
print sms.shape
(5572, 2)
sms.head(20)
 | label | message |
---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
6 | ham | Even my brother is not like to speak with me. ... |
7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
8 | spam | WINNER!! As a valued network customer you have... |
9 | spam | Had your mobile 11 months or more? U R entitle... |
10 | ham | I'm gonna be home soon and i don't want to tal... |
11 | spam | SIX chances to win CASH! From 100 to 20,000 po... |
12 | spam | URGENT! You have won a 1 week FREE membership ... |
13 | ham | I've been searching for the right words to tha... |
14 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
15 | spam | XXXMobileMovieClub: To use your credit, click ... |
16 | ham | Oh k...i'm watching here:) |
17 | ham | Eh u remember how 2 spell his name... Yes i di... |
18 | ham | Fine if thats the way u feel. Thats the way ... |
19 | spam | England v Macedonia - dont miss the goals/team... |
sms.label.value_counts()
ham     4825
spam     747
dtype: int64
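Note the class imbalance: only 747 of the 5572 messages (about 13%) are spam, which is worth keeping in mind when judging accuracy later.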
# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})
# define X and y
X = sms.message
y = sms.label
# split into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print X_train.shape
print X_test.shape
(4179L,)
(1393L,)
# instantiate the vectorizer
vect = CountVectorizer()
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm
<4179x7456 sparse matrix of type '<type 'numpy.int64'>' with 55209 stored elements in Compressed Sparse Row format>
# alternative: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm
<4179x7456 sparse matrix of type '<type 'numpy.int64'>' with 55209 stored elements in Compressed Sparse Row format>
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
<1393x7456 sparse matrix of type '<type 'numpy.int64'>' with 17604 stored elements in Compressed Sparse Row format>
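Note that X_test_dtm has the same 7456 columns as X_train_dtm: the test messages are mapped onto the vocabulary fitted on the training data, which is exactly what allows a model trained on X_train_dtm to score them.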
# store token names
X_train_tokens = vect.get_feature_names()
# first 50 tokens
print X_train_tokens[:50]
[u'00', u'000', u'008704050406', u'0121', u'01223585236', u'01223585334', u'0125698789', u'02', u'0207', u'02072069400', u'02073162414', u'02085076972', u'021', u'03', u'04', u'0430', u'05', u'050703', u'0578', u'06', u'07', u'07008009200', u'07090201529', u'07090298926', u'07123456789', u'07732584351', u'07734396839', u'07742676969', u'0776xxxxxxx', u'07781482378', u'07786200117', u'078', u'07801543489', u'07808', u'07808247860', u'07808726822', u'07815296484', u'07821230901', u'07880867867', u'0789xxxxxxx', u'07946746291', u'0796xxxxxx', u'07973788240', u'07xxxxxxxxx', u'08', u'0800', u'08000407165', u'08000776320', u'08000839402', u'08000930705']
# last 50 tokens
print X_train_tokens[-50:]
[u'yer', u'yes', u'yest', u'yesterday', u'yet', u'yetunde', u'yijue', u'ym', u'ymca', u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'youdoing', u'youi', u'youphone', u'your', u'youre', u'yourjob', u'yours', u'yourself', u'youwanna', u'yowifes', u'yoyyooo', u'yr', u'yrs', u'ything', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou', u'yup', u'zac', u'zaher', u'zealand', u'zebra', u'zed', u'zeros', u'zhong', u'zindgi', u'zoe', u'zoom', u'zouk', u'zyada', u'\xe8n', u'\u3028ud']
# view X_train_dtm as a dense matrix
X_train_dtm.toarray()
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np
X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)
X_train_counts
array([ 5, 23, 2, ..., 1, 1, 1], dtype=int64)
X_train_counts.shape
(7456L,)
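As an aside, calling toarray() densifies the whole matrix; the same column sums can be computed directly on the sparse matrix, which scales better for larger corpora. A minimal sketch:
# equivalent computation without densifying: sum each column of the sparse matrix
X_train_counts_sparse = np.asarray(X_train_dtm.sum(axis=0)).flatten()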
# create a DataFrame of tokens with their counts
pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts}).sort('count')
 | count | token |
---|---|---|
3727 | 1 | jules |
4172 | 1 | mallika |
4169 | 1 | malarky |
4165 | 1 | makiing |
4161 | 1 | maintaining |
4158 | 1 | mails |
4157 | 1 | mailed |
4151 | 1 | magicalsongs |
4150 | 1 | maggi |
4149 | 1 | magazine |
4146 | 1 | madodu |
4143 | 1 | mad2 |
4142 | 1 | mad1 |
4140 | 1 | macs |
4139 | 1 | macleran |
4138 | 1 | mack |
4174 | 1 | manage |
4175 | 1 | manageable |
4178 | 1 | manchester |
4179 | 1 | manda |
4201 | 1 | marking |
4200 | 1 | marketing |
4197 | 1 | marine |
4196 | 1 | margin |
4193 | 1 | marandratha |
4192 | 1 | maraikara |
4191 | 1 | maps |
4136 | 1 | machi |
4190 | 1 | mapquest |
4187 | 1 | manual |
... | ... | ... |
2290 | 292 | do |
7257 | 293 | with |
7120 | 293 | we |
6904 | 297 | ur |
1081 | 298 | at |
2995 | 299 | get |
3465 | 302 | if |
4778 | 306 | or |
1522 | 332 | but |
4647 | 338 | not |
6017 | 344 | so |
1574 | 349 | can |
1016 | 358 | are |
4662 | 361 | now |
4743 | 390 | on |
3235 | 416 | have |
1552 | 443 | call |
6539 | 453 | that |
4704 | 460 | of |
7424 | 508 | your |
2821 | 518 | for |
4489 | 550 | my |
3623 | 568 | it |
4238 | 601 | me |
3612 | 679 | is |
3502 | 683 | in |
929 | 717 | and |
6542 | 1004 | the |
7420 | 1660 | you |
6656 | 1670 | to |
7456 rows × 2 columns
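Unsurprisingly, the most frequent tokens are common English function words. If those are unwanted, CountVectorizer can filter them out; a minimal sketch using its built-in stop word list (not applied in the rest of this section):
# optional: ignore common English stop words like 'the', 'to', and 'you'
vect_stop = CountVectorizer(stop_words='english')
vect_stop.fit_transform(X_train)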
# create separate DataFrames for ham and spam
sms_ham = sms[sms.label==0]
sms_spam = sms[sms.label==1]
# learn the vocabulary of ALL messages and save it
vect.fit(sms.message)
all_tokens = vect.get_feature_names()
# create document-term matrices for ham and spam
ham_dtm = vect.transform(sms_ham.message)
spam_dtm = vect.transform(sms_spam.message)
# count how many times EACH token appears across ALL ham messages
ham_counts = np.sum(ham_dtm.toarray(), axis=0)
# count how many times EACH token appears across ALL spam messages
spam_counts = np.sum(spam_dtm.toarray(), axis=0)
# create a DataFrame of tokens with their separate ham and spam counts
token_counts = pd.DataFrame({'token':all_tokens, 'ham':ham_counts, 'spam':spam_counts})
# add one to ham and spam counts to avoid dividing by zero (in the step that follows)
token_counts['ham'] = token_counts.ham + 1
token_counts['spam'] = token_counts.spam + 1
# calculate ratio of spam-to-ham for each token
token_counts['spam_ratio'] = token_counts.spam / token_counts.ham
token_counts.sort('spam_ratio')
 | ham | spam | token | spam_ratio |
---|---|---|---|---|
3684 | 319 | 1 | gt | 0.003135 |
4793 | 317 | 1 | lt | 0.003155 |
3805 | 232 | 1 | he | 0.004310 |
6843 | 168 | 1 | she | 0.005952 |
4747 | 163 | 1 | lor | 0.006135 |
2428 | 151 | 1 | da | 0.006623 |
4550 | 136 | 1 | later | 0.007353 |
1247 | 90 | 1 | ask | 0.011111 |
6626 | 90 | 1 | said | 0.011111 |
2714 | 89 | 1 | doing | 0.011236 |
1084 | 89 | 1 | amp | 0.011236 |
5167 | 80 | 1 | morning | 0.012500 |
2163 | 231 | 3 | come | 0.012987 |
1142 | 77 | 1 | anything | 0.012987 |
2289 | 77 | 1 | cos | 0.012987 |
4724 | 75 | 1 | lol | 0.013333 |
7463 | 72 | 1 | sure | 0.013889 |
7099 | 70 | 1 | something | 0.014286 |
3690 | 68 | 1 | gud | 0.014706 |
3171 | 63 | 1 | feel | 0.015873 |
8394 | 63 | 1 | went | 0.015873 |
5371 | 63 | 1 | nice | 0.015873 |
3595 | 59 | 1 | gonna | 0.016949 |
7001 | 59 | 1 | sleep | 0.016949 |
1064 | 59 | 1 | always | 0.016949 |
5254 | 755 | 13 | my | 0.017219 |
5217 | 116 | 2 | much | 0.017241 |
5533 | 115 | 2 | oh | 0.017391 |
2815 | 56 | 1 | dun | 0.017857 |
3925 | 166 | 3 | home | 0.018072 |
... | ... | ... | ... | ... |
1623 | 1 | 22 | bonus | 22.000000 |
3994 | 1 | 22 | http | 22.000000 |
6619 | 1 | 22 | sae | 22.000000 |
735 | 1 | 22 | 8007 | 22.000000 |
732 | 1 | 23 | 800 | 23.000000 |
5297 | 1 | 23 | national | 23.000000 |
8375 | 1 | 25 | weekly | 25.000000 |
8153 | 1 | 25 | valid | 25.000000 |
309 | 1 | 25 | 10p | 25.000000 |
618 | 1 | 26 | 5000 | 26.000000 |
5117 | 1 | 26 | mob | 26.000000 |
364 | 2 | 54 | 16 | 27.000000 |
2150 | 1 | 27 | collection | 27.000000 |
7838 | 1 | 27 | tones | 27.000000 |
2963 | 1 | 27 | entry | 27.000000 |
1 | 1 | 30 | 000 | 30.000000 |
8596 | 3 | 99 | www | 33.000000 |
6525 | 1 | 33 | ringtone | 33.000000 |
356 | 1 | 35 | 150ppm | 35.000000 |
8016 | 2 | 75 | uk | 37.500000 |
1333 | 1 | 39 | awarded | 39.000000 |
299 | 1 | 42 | 1000 | 42.000000 |
617 | 1 | 45 | 500 | 45.000000 |
2371 | 1 | 45 | cs | 45.000000 |
3688 | 1 | 51 | guaranteed | 51.000000 |
369 | 1 | 52 | 18 | 52.000000 |
7837 | 1 | 61 | tone | 61.000000 |
352 | 1 | 72 | 150p | 72.000000 |
6113 | 1 | 94 | prize | 94.000000 |
2067 | 1 | 114 | claim | 114.000000 |
8713 rows × 4 columns
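To inspect a single token's ratio, filter the DataFrame; for example, 'claim' from the bottom of the table above:
# look up the spam-to-ham ratio for one token
token_counts[token_counts.token == 'claim']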
We will use Multinomial Naive Bayes:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
# train a Naive Bayes model using X_train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
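As an aside, the multinomial Naive Bayes decision rule is linear in the token counts; a minimal sketch that reproduces the class predictions from the model's fitted attributes (class_log_prior_ and feature_log_prob_):
# log P(class | message) is proportional to log P(class) plus, for each token,
# its count in the message times log P(token | class)
jll = X_test_dtm.dot(nb.feature_log_prob_.T) + nb.class_log_prior_
manual_pred_class = nb.classes_[np.argmax(jll, axis=1)]
# manual_pred_class matches nb.predict(X_test_dtm)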
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
# calculate accuracy of class predictions
from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred_class)
0.988513998564
# confusion matrix
print metrics.confusion_matrix(y_test, y_pred_class)
[[1203    5]
 [  11  174]]
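For context, compare against the null accuracy: the test set contains 1208 ham and 185 spam messages (the row sums of the confusion matrix), so a model that always predicts ham would already score about 86.7%. A quick check:
# null accuracy: the proportion of the majority class (ham) in the test set
y_test.value_counts().head(1) / len(y_test)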
# predict (poorly calibrated) probabilities
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([  2.87744864e-03,   1.83488846e-05,   2.07301295e-03, ...,
         1.09026171e-06,   1.00000000e+00,   3.98279868e-09])
# calculate AUC
print metrics.roc_auc_score(y_test, y_pred_prob)
0.986643100054
# print message text for the false positives
X_test[y_test < y_pred_class]
574                      Waiting for your call.
3375                   Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object
# print message text for the false negatives
X_test[y_test > y_pred_class]
3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object
# what do you notice about the false negatives?
X_test[3132]
"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."
# import/instantiate/fit (a large C makes the L2 regularization negligible, for a fairer comparison)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)
# class predictions and predicted probabilities
y_pred_class = logreg.predict(X_test_dtm)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
# calculate accuracy and AUC
print metrics.accuracy_score(y_test, y_pred_class)
print metrics.roc_auc_score(y_test, y_pred_prob)
0.989231873654
0.994144889923
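On this dataset, logistic regression edges out Naive Bayes, with slightly higher accuracy (0.9892 vs. 0.9885) and a noticeably higher AUC (0.9941 vs. 0.9866).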