Working with Text Data in scikit-learn

Naive Bayes and logistic regression are both classification algorithms.


  1. Model building in scikit-learn (refresher)
  2. Representing text as numerical data
  3. Reading the SMS data
  4. Vectorizing the SMS data
  5. Building a Naive Bayes model
  6. Comparing Naive Bayes with logistic regression
  7. Calculating the "spamminess" of each token
  8. Creating a DataFrame from individual text files

Part 1 : Model building in scikit-learn (refresher)

In [1]:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
type(iris)
<class 'sklearn.utils.Bunch'>
In [2]:
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

"Features" are also known as predictors, inputs, or attributes. The "response" is also known as the target, label, or output.

In [3]:
# check the shapes of X and y
print(X.shape)
print(y.shape)
(150, 4)
(150,)

"Observations" are also known as samples, instances, or records.

In [4]:
# examine the first 5 rows of X and its type
print(X[0:5])
print(type(X))
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
<class 'numpy.ndarray'>

Note that X and y are not DataFrames; they are NumPy ndarrays. The iris object itself is not even an ndarray: it is a Bunch object. In the next cell, we convert X to a pandas DataFrame for display.

In [5]:
# examine the first 5 rows of X (including the feature names)
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

In order to build a model, the features must be numeric, and every observation must have the same features in the same order.

The fitting step is where the model learns the relationship between the features and the response.

In [6]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (with the default parameters)
knn = KNeighborsClassifier()

# fit the model with data (occurs in-place)
knn.fit(X, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.

In [7]:
# predict the response for a new observation
knn.predict([[3, 5, 4, 2]])

Part 2 : Representing text as numerical data

In [8]:
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

From the scikit-learn documentation:

Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.

We will use CountVectorizer to convert text data into numerical data. CountVectorizer follows the same pattern as every scikit-learn estimator: in the same way that we import, instantiate, and fit a model, we import, instantiate, and fit CountVectorizer. We should not get confused, though: CountVectorizer is not a model. It is a transformer that happens to share the estimator API, including the fit method.

We will use CountVectorizer to "convert text into a matrix of token counts" :

In [9]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

Note that the module is called feature_extraction. We will see shortly that each distinct word in the simple_train list becomes a feature, so the module name makes perfect sense. Also note that, in contrast with the model we fitted on the iris dataset, fitting a vectorizer means learning the vocabulary of the training documents (i.e. of all the SMS messages taken together).

In [10]:
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

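Once a vectorizer has been fit, it exposes a method called get_feature_names, which returns the fitted vocabulary, i.e. the vocabulary that the vectorizer learned. (In scikit-learn 1.0 and later, the equivalent method is get_feature_names_out; get_feature_names was removed in version 1.2.)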

In [11]:
# examine the fitted vocabulary
vect.get_feature_names()
['cab', 'call', 'me', 'please', 'tonight', 'you']

Note that it removed the word 'a' from the SMS message 'Call me a cab' because the default token_pattern has a regular expression that specifies to drop words that are not at least two characters long. Finally, to be clear, vectorizers don't try to understand language. Vectorizers are there simply to convert text into a matrix of token counts.
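
To make the effect of the default token_pattern concrete, here is a minimal sketch that applies the same regular expression by hand with Python's re module. The helper name tokenize is ours, not part of scikit-learn, and lowercasing is applied first because lowercase=True is the default shown in the repr above.

```python
import re

# default CountVectorizer token pattern: runs of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Hypothetical helper: lowercase the text, then extract the tokens."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens_cab = tokenize("Call me a cab")
tokens_please = tokenize("please call me... PLEASE!")

print(tokens_cab)     # the one-character word 'a' is not matched
print(tokens_please)  # punctuation is skipped, case is ignored
```

This only imitates the tokenization step; the vectorizer additionally deduplicates and sorts the tokens to build its vocabulary.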

In [12]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

Three input samples and six feature names or, equivalently, three documents and six vocabulary words (also known as tokens). The result is a sparse matrix: "In numerical analysis and computer science, a sparse matrix or sparse array is a matrix in which most of the elements are zero. By contrast, if most of the elements are nonzero, then the matrix is considered dense." (Wikipedia)

In [13]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]])
In [14]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
   cab  call  me  please  tonight  you
0    0     1   0       0        1    1
1    1     1   1       0        0    0
2    0     1   1       2        0    0

Initially we had text data, which was non-numeric and of variable length. Now the text is represented as a feature matrix with a fixed number of columns. Again, note that each distinct word/token has become a feature.

From the scikit-learn documentation :

In this scheme, features and samples are defined as follows:

  • Each individual token occurrence frequency (normalized or not) is treated as a feature.
  • The vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
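
The bag-of-words idea can be sketched without scikit-learn at all. The following is an illustration under our own assumptions (a hand-written tokenize helper using the default token pattern, and a plain list of lists instead of a sparse matrix); it is not how CountVectorizer is implemented internally.

```python
import re
from collections import Counter

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

def tokenize(text):
    """Hypothetical helper mimicking the default lowercase + token_pattern."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# vocabulary: every distinct token in the corpus, in alphabetical order
vocabulary = sorted({tok for doc in simple_train for tok in tokenize(doc)})

# document-term matrix: one row per document, one column per vocabulary token
dtm = []
for doc in simple_train:
    counts = Counter(tokenize(doc))          # word order is discarded here
    dtm.append([counts[tok] for tok in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Because only counts are kept, 'call me please' and 'please call me' would produce identical rows, which is exactly the loss of position information described above.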

In [15]:
# print the sparse matrix
print(simple_train_dtm)
  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2

From the scikit-learn documentation:

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
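
As a small illustration of the sparse representation, the sketch below builds a scipy.sparse CSR matrix from the dense 3x6 array shown earlier and counts how many values are actually stored. The variable names are ours.

```python
import numpy as np
from scipy import sparse

# the dense document-term matrix of the three example messages
dense = np.array([[0, 1, 0, 0, 1, 1],
                  [1, 1, 1, 0, 0, 0],
                  [0, 1, 1, 2, 0, 0]])

csr = sparse.csr_matrix(dense)   # Compressed Sparse Row format

stored = csr.nnz                                   # number of stored (nonzero) entries
density = stored / (dense.shape[0] * dense.shape[1])
roundtrip = csr.toarray()                          # back to dense, for inspection only

print(stored, density)
```

On this toy matrix half of the entries are nonzero, so little is saved; the benefit appears with realistic vocabularies, where the vast majority of entries are zero.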

In [16]:
# example text for model testing
simple_test = ["please don't call me"]

In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.

What does the previous sentence mean for text data? It means that a new observation must have six columns, and those six columns must have the same meaning as the columns of the training document-term matrix, i.e. it needs to have exactly the same features.

In [17]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
array([[0, 1, 1, 1, 0, 0]])
In [18]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
   cab  call  me  please  tonight  you
0    0     1   1       1        0    0

Note that the word 'don't' is missing. Why? Because it is not part of the vocabulary that our vectorizer learned when it was fit on the training data. (Strictly speaking, the default tokenizer splits "don't" into 'don' and 't'; 't' is too short to count as a token, and 'don' was never seen during training.) In other words, it is not a feature of our original data, so it simply gets dropped.

We can argue that it is actually fine to drop 'don't'. Why? The model was trained on documents that never contained this token, so it has learned nothing about it and could not use the new information anyway.


Summary:

  • vect.fit(train) learns the vocabulary of the training data
  • vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
  • vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)

Again, we did not run fit on the testing data. If we had, the vectorizer would have learned a new vocabulary of four words that would not match the six-word vocabulary learned from the training data. Finally, note that we only convert sparse matrices to dense matrices in order to examine them. In practice, if we are not going to look at the matrix, we leave it sparse.
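
The fit/transform distinction can be checked in a few lines. This is a minimal sketch using the two example lists from this part; it reads the learned vocabulary from the vocabulary_ attribute (a dict mapping each token to its column index) so that it does not depend on which get_feature_names variant is available.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
simple_test = ["please don't call me"]

# correct workflow: fit on the training data only, then transform the test data
vect = CountVectorizer()
vect.fit(simple_train)
train_vocab = sorted(vect.vocabulary_)
test_row = vect.transform(simple_test).toarray().tolist()

# what we should NOT do: fitting on the test data learns a different vocabulary
refit_vocab = sorted(CountVectorizer().fit(simple_test).vocabulary_)

print(train_vocab)   # six training tokens
print(test_row)      # test message expressed in those six columns
print(refit_vocab)   # four tokens, incompatible with the training columns
```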

Part 3 : Reading the SMS data

In [19]:
# read file into Pandas using a relative path
path = '../data/sms.tsv'
sms = pd.read_table(path, header = None, names = ['label', 'message'])
In [20]:
# examine the shape
sms.shape
(5572, 2)
In [21]:
# examine the first 10 rows
sms.head(10)
   label  message
0  ham    Go until jurong point, crazy.. Available only ...
1  ham    Ok lar... Joking wif u oni...
2  spam   Free entry in 2 a wkly comp to win FA Cup fina...
3  ham    U dun say so early hor... U c already then say...
4  ham    Nah I don't think he goes to usf, he lives aro...
5  spam   FreeMsg Hey there darling it's been 3 week's n...
6  ham    Even my brother is not like to speak with me. ...
7  ham    As per your request 'Melle Melle (Oru Minnamin...
8  spam   WINNER!! As a valued network customer you have...
9  spam   Had your mobile 11 months or more? U R entitle...
In [22]:
# examine the class distribution
sms.label.value_counts()
ham     4825
spam     747
Name: label, dtype: int64
In [23]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham' : 0, 'spam' : 1})
In [24]:
# check that the conversion worked
sms.head(10)
   label  message                                            label_num
0  ham    Go until jurong point, crazy.. Available only ...  0
1  ham    Ok lar... Joking wif u oni...                      0
2  spam   Free entry in 2 a wkly comp to win FA Cup fina...  1
3  ham    U dun say so early hor... U c already then say...  0
4  ham    Nah I don't think he goes to usf, he lives aro...  0
5  spam   FreeMsg Hey there darling it's been 3 week's n...  1
6  ham    Even my brother is not like to speak with me. ...  0
7  ham    As per your request 'Melle Melle (Oru Minnamin...  0
8  spam   WINNER!! As a valued network customer you have...  1
9  spam   Had your mobile 11 months or more? U R entitle...  1
In [25]:
# usual way to define X and y (from the iris data) for use with a MODEL
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)
(150, 4)
(150,)
In [26]:
# required way to define X and y for use with COUNTVECTORIZER
X = sms.message
y = sms.label_num

Recall that with the iris dataset, we defined X as two-dimensional and y as one-dimensional. When working with text data, both X and y start out as one-dimensional Series. Why do we define X as a one-dimensional object here? Because CountVectorizer will convert it into a two-dimensional document-term matrix. CountVectorizer expects a one-dimensional sequence of documents; if we pass it a two-dimensional object, it will not know what to do with it.

In [27]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Why do we apply train_test_split before vectorization? The whole point of splitting is that the test set simulates the future. Future documents will contain words our model has never seen, and the model will be handicapped by the fact that it only learned from the past. If we vectorize before splitting, the vocabulary is learned from every document, so the test set cannot contain any word the vectorizer has not already seen. In other words, vectorizing first and splitting second makes the evaluation unrealistically easy for the model.
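
The difference between the two orders can be seen on a toy corpus. In this sketch the documents and the manual split are hypothetical (we split by hand rather than with train_test_split so that the result is reproducible); it only compares the vocabularies the vectorizer would learn in each case.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy corpus, already split by hand
train_docs = ['call you tonight', 'Call me a cab']
test_docs = ['win a free prize tonight']

# split first, then vectorize: vocabulary comes from the training documents only
vect_split_first = CountVectorizer().fit(train_docs)

# vectorize first (leaky): vocabulary is learned from every document
vect_leaky = CountVectorizer().fit(train_docs + test_docs)

# tokens the leaky vectorizer knows only because it peeked at the test set
unseen = sorted(set(vect_leaky.vocabulary_) - set(vect_split_first.vocabulary_))

# with the correct order, unseen test tokens are simply dropped
test_dtm = vect_split_first.transform(test_docs)
kept_token_count = int(test_dtm.sum())

print(unseen)
print(test_dtm.shape, kept_token_count)
```

With the correct order, only 'tonight' survives in the test row, which is the realistic situation the model will face on future messages.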

Part 4 : Vectorizing the SMS data

In [28]:
# instantiate the vectorizer
vect = CountVectorizer()
In [29]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

The next cell is more computationally efficient than the previous one and does exactly the same thing in a single line of code.

In [30]:
# alternative : combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
In [31]:
# examine the document-term matrix
X_train_dtm
<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

There are 4179 training documents (SMS messages) and 7456 unique tokens (individual words). Recall that a word becomes a single column, i.e. a single feature, no matter how many times it appears within a document or across documents.

In [32]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>

Recall that transform uses the fitted vocabulary, which contains 7456 tokens, so the test matrix has 7456 columns. We also knew it would have 1393 rows because that is the number of test documents.

At this point, our data is in a suitable form for model building. Note that it has the same structure as the iris example we saw at the beginning: a two-dimensional feature matrix and a one-dimensional response.

Part 5 : Building a Naive Bayes model

We will use multinomial Naive Bayes:

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

In [33]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
In [34]:
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)
CPU times: user 7.05 ms, sys: 4.62 ms, total: 11.7 ms
Wall time: 18.1 ms
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [35]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
In [36]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
In [37]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
array([[1203,    5],
       [  11,  174]])

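There are 5 false positives ($a_{12} = 5$). A false positive is a HAM message that was incorrectly classified as SPAM. (In scikit-learn's confusion matrix, rows correspond to actual classes and columns to predicted classes.)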
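
To connect the confusion matrix with the usual metrics, here is a small sketch that recomputes them from the four cells printed above. The variable names are ours; the layout (rows are actual classes, columns are predicted classes, class 0 first) is the one scikit-learn uses.

```python
import numpy as np

# confusion matrix printed above: rows = actual (ham, spam), columns = predicted
cm = np.array([[1203,    5],
               [  11,  174]])

# for a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all messages classified correctly
sensitivity = tp / (tp + fn)         # share of actual spam that was caught
specificity = tn / (tn + fp)         # share of actual ham that was left alone

print(accuracy, sensitivity, specificity)
```

Sensitivity is noticeably lower than specificity here: the classifier misses 11 of the 185 spam messages but wrongly flags only 5 of the 1208 ham messages.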

In [38]:
# print message text for the false positives (meaning they were incorrectly classified as spam)
X_test[y_test < y_pred_class]
574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

The previous cell uses a boolean condition to return the $5$ false positives corresponding to the confusion matrix element $a_{12}$. The condition is True whenever the prediction is $1$ and the actual value is $0$.

The resulting Series of True and False values is then passed to X_test to select rows. These are the $5$ messages that were hand-labeled as HAM by whoever built the dataset and that the classifier incorrectly predicted as SPAM (which is the definition of a false positive). Formally, we are comparing vectors element by element. Let

$$ a = \begin{bmatrix}1\\0\\1\\1 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix}1\\1\\1\\1 \end{bmatrix}, \quad \text{then} \quad (a < b) = \begin{bmatrix}\text{False}\\\text{True}\\\text{False}\\\text{False} \end{bmatrix}. $$
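
The same comparison trick can be tried on a handful of made-up messages. Everything in this sketch (messages, labels, predictions) is hypothetical; it only shows how the boolean mask selects false positives and false negatives.

```python
import numpy as np
import pandas as pd

# hypothetical messages with actual labels (0 = ham, 1 = spam)
messages = pd.Series(['see you at 5', 'WIN cash now', 'are you free tonight', 'claim your prize'])
y_true = pd.Series([0, 1, 0, 1])

# hypothetical class predictions, as a NumPy array like the output of predict
y_pred = np.array([0, 0, 1, 1])

# actual 0 < predicted 1  ->  false positive (ham flagged as spam)
false_positives = messages[y_true < y_pred]

# actual 1 > predicted 0  ->  false negative (spam that slipped through)
false_negatives = messages[y_true > y_pred]

print(false_positives.tolist())
print(false_negatives.tolist())
```
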
In [39]:
# print message text for the false negatives (meaning they were incorrectly classified as ham)
X_test[y_test > y_pred_class]
3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298 lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - “It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

A false negative is a SPAM message that was incorrectly classified as HAM, i.e. the message should normally belong to the positive class.

In [40]:
# what do you notice about the false negatives?
X_test[3132]
"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

Interestingly, the false negatives tend to be a lot longer. We only have 16 misclassified messages (5 false positives and 11 false negatives) to study, so drawing conclusions from such a small sample is not very rigorous. We would tentatively conclude that, because these messages are so long, Naive Bayes gets lost among all their very "normal", hammy words (such as 'Thanks').

In [41]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,
       1.09026171e-06, 1.00000000e+00, 3.98279868e-09])

Predicted probabilities answer the question: "What is the predicted probability of class membership?" Put differently: "For each observation, how likely does the model think it is that the message is SPAM?" In the code above, column 1 holds the predicted probability of class 1 (recall that class 1 is SPAM). For the first message in the test set, the model estimates the probability of SPAM at about $0.0029$. For the very last message, the estimate is almost zero, and for the penultimate message the model is essentially $100\%$ sure it is SPAM.

These results reveal a weakness of Naive Bayes compared to logistic regression: it produces poorly calibrated predicted probabilities, i.e. very extreme numbers that should not be interpreted literally as likelihoods.
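
The notebook does not say why the probabilities are so extreme. A commonly given explanation is that Naive Bayes treats every token as independent evidence and multiplies the contributions together, so a long message quickly pushes the result towards 0 or 1. The sketch below illustrates this with entirely hypothetical numbers (prior odds of 0.15 and a likelihood ratio of 3 per token); it is not the computation scikit-learn performs internally.

```python
import math

def posterior_spam_probability(prior_odds, token_ratios):
    """Hypothetical helper: multiply the prior odds by one ratio per token."""
    odds = prior_odds * math.prod(token_ratios)
    return odds / (1 + odds)

prior_odds = 0.15  # assumed spam-to-ham odds before looking at any token

# two mildly spammy tokens: the model is still fairly unsure
p_short = posterior_spam_probability(prior_odds, [3.0] * 2)

# fifteen mildly spammy tokens: the same weak evidence, compounded
p_long = posterior_spam_probability(prior_odds, [3.0] * 15)

print(p_short, p_long)
```

If the tokens are in fact correlated, the evidence is counted several times over, which is one way to understand why the resulting probabilities are overconfident.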

In [42]:
nb.predict_proba(X_test_dtm)
array([[9.97122551e-01, 2.87744864e-03],
       [9.99981651e-01, 1.83488846e-05],
       [9.97926987e-01, 2.07301295e-03],
       ...,
       [9.99998910e-01, 1.09026171e-06],
       [1.86697467e-10, 1.00000000e+00],
       [9.99999996e-01, 3.98279868e-09]])

We should not be confused by the two previous cells. We initially fit the model using nb.fit(X_train_dtm, y_train), which allows us to make predictions. The multinomial Naive Bayes model provides two kinds of predictions: class predictions using nb.predict(X_test_dtm), and predicted probabilities of class membership using nb.predict_proba(X_test_dtm), which returns one column per class.

In [43]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

Why are predicted probabilities useful? For one thing, AUC requires them, and so does log loss. On a more concrete level, we may not actually care about class predictions. In the case of credit card fraud ("Is this transaction fraudulent or not?"), we might not care about accuracy, i.e. whether the predicted probability crossed $50\%$. We might instead decide to flag and block any purchase with more than a $10\%$ likelihood of being fraudulent. In that case, what we really care about is having predicted probabilities that are as finely tuned as possible, because the difference between a $2\%$ and a $12\%$ likelihood of fraud changes the decision.
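
The fraud scenario amounts to replacing the default $50\%$ cut-off with a custom threshold on the predicted probabilities. Here is a minimal sketch with made-up probabilities; the $10\%$ threshold is the one from the example above.

```python
import numpy as np

# hypothetical predicted probabilities of fraud for five transactions
y_pred_prob = np.array([0.02, 0.12, 0.48, 0.51, 0.97])

# default behaviour of class prediction: flag when the probability exceeds 50%
flag_default = (y_pred_prob > 0.5).astype(int)

# cautious policy: flag anything with more than a 10% likelihood of fraud
flag_cautious = (y_pred_prob > 0.1).astype(int)

print(flag_default)
print(flag_cautious)
```

The transactions at $2\%$ and $12\%$ receive the same class prediction under the default cut-off but different decisions under the cautious policy, which is why calibration matters in this setting.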

Part 6 : Comparing Naive Bayes with logistic regression

We will compare multinomial Naive Bayes with logistic regression:

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

In [44]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
In [45]:
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)
CPU times: user 91.6 ms, sys: 6.73 ms, total: 98.4 ms
Wall time: 28.1 ms
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We notice that, compared to Naive Bayes, logistic regression takes longer to train. In this run the total CPU time is roughly eight times larger (98.4 ms versus 11.7 ms), although the wall time differs less (28.1 ms versus 18.1 ms). Training speed is one of the advantages of the Naive Bayes algorithm.

In [46]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)
In [47]:
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
array([0.01269556, 0.00347183, 0.00616517, ..., 0.03354907, 0.99725053,

Note that the predicted probabilities from logistic regression are better calibrated than those from Naive Bayes, i.e. they are much more likely to be interpretable as actual likelihoods. We do not take the Naive Bayes predicted probabilities very seriously, whereas we can usually rely more on those from logistic regression.

In [48]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
In [49]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

Part 7 : Calculating the "spamminess" of each token

Naive Bayes allows us to gain a bit more insight into our model. Here are the questions we would like to answer: "Why did certain messages get flagged as SPAM rather than HAM?" and "Which individual words does the model consider spammy?" Let's explore that a little.

In [50]:
# store the vocabulary of X_train
X_train_tokens = vect.get_feature_names()
In [51]:
# examine the first 50 tokens
print(X_train_tokens[0:50])
['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705']
In [52]:
# examine the last 50 tokens
print(X_train_tokens[-50:])
['yer', 'yes', 'yest', 'yesterday', 'yet', 'yetunde', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'youphone', 'your', 'youre', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yoyyooo', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'èn', '〨ud']

It turns out that when running a fit, Naive Bayes counts the number of times that each token appears in each class (HAM or SPAM).

In [53]:
# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_
array([[ 0.,  0.,  0., ...,  1.,  1.,  1.],
       [ 5., 23.,  2., ...,  0.,  0.,  0.]])

The 2 rows above represent the two classes: class 0 for HAM and class 1 for SPAM. The 7456 columns represent the features. For example, the token '00' was found zero times in HAM messages and five times in SPAM messages, whereas the very last token, '〨ud', appeared once in HAM messages and never in SPAM messages. We can now use this data to decide which words are considered "hammy" and which are considered "spammy".

In [54]:
# rows represent classes, columns represent tokens
nb.feature_count_.shape
(2, 7456)

As a side note, how do we know which row is HAM and which row is SPAM? Recall that at the very beginning we assigned HAM to class 0 and SPAM to class 1, and the rows are ordered by class. The same holds for the confusion matrix, which is sorted with the numerically lowest class closest to the upper-left corner.
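
The row ordering can be verified on a tiny example. The count matrix and labels below are hypothetical (three documents, two tokens), and toy_nb is our own variable name; the attributes classes_, feature_count_, and class_count_ are the ones exposed by a fitted MultinomialNB.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# hypothetical document-term matrix: 3 documents, 2 tokens
X_toy = np.array([[2, 0],
                  [1, 1],
                  [0, 3]])
y_toy = np.array([0, 0, 1])   # first two documents are ham (0), the last is spam (1)

toy_nb = MultinomialNB().fit(X_toy, y_toy)

print(toy_nb.classes_)         # class labels, in the order used for the rows below
print(toy_nb.feature_count_)   # per-class token counts: row 0 = class 0, row 1 = class 1
print(toy_nb.class_count_)     # number of training documents in each class
```

Row 0 of feature_count_ is the column-wise sum of the two ham documents, and row 1 is the single spam document, confirming that rows follow the sorted class labels.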

In [55]:
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]
ham_token_count
array([0., 0., 0., ..., 1., 1., 1.])

Just a quick recap. We have a corpus of documents which, in our case, is a collection of 5572 SMS messages, of which 4179 form the training set. The training documents contain a total of 7456 distinct words (or tokens). Each word appears in HAM messages, in SPAM messages, or in both. We would like to know the number of times each word appears in each class.

In [56]:
# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
spam_token_count
array([ 5., 23.,  2., ...,  0.,  0.,  0.])
In [57]:
# create a DataFrame of tokens with their separate ham and spam counts
tokens = pd.DataFrame(
    {'token' : X_train_tokens,
     'ham' : ham_token_count,
     'spam' : spam_token_count}
).set_index('token')

DataFrames have a method called sample, which returns randomly selected rows; passing random_state makes the selection reproducible.

In [58]:
# examine 5 random DataFrame rows
tokens.sample(5, random_state = 6)
               ham  spam
token
very          64.0   2.0
nasty          1.0   1.0
villa          0.0   1.0
beloved        1.0   0.0
textoperator   0.0   2.0

Tricky question: based on the table above, if we saw another text message containing the word 'nasty', would we say that the word is equally predictive of HAM and SPAM, more predictive of HAM, or more predictive of SPAM? Answer: it is more predictive of SPAM. There are far fewer SPAM messages than HAM messages, so one occurrence represents a greater proportion of SPAM (we refer to this as class imbalance).

Recall that our overall dataset contains approximately $87\%$ HAM messages and $13\%$ SPAM messages (4825 and 747 out of 5572), and that train_test_split approximately preserved those proportions. For this reason, we need to express each count relative to the size of its class to account for the class imbalance. We also need to avoid dividing by zero before we can calculate the "spamminess" of each token.

In [59]:
# add 1 to ham and spam counts to avoid dividing by 0
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state = 6)
               ham  spam
token
very          65.0   3.0
nasty          2.0   2.0
villa          1.0   2.0
beloved        2.0   1.0
textoperator   1.0   3.0

Next, we need to normalize the numbers in each column so that we can get frequencies rather than counts. For the HAM Series, we need to divide each element by the total number of HAM messages. Similarly for the SPAM Series, we need to divide each element by the total number of SPAM messages.

In [60]:
# Naive Bayes counts the number of observations in each class
nb.class_count_
array([3617.,  562.])
In [61]:
# convert the ham and spam counts into frequencies
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state = 6)
                   ham      spam
token
very          0.017971  0.005338
nasty         0.000553  0.003559
villa         0.000276  0.003559
beloved       0.000553  0.001779
textoperator  0.000276  0.005338

This is a much better measure than the raw counts because it adjusts for the class imbalance. Indeed, the token 'nasty' is now less prevalent in HAM messages than in SPAM messages. Still, these frequencies are not exact, since we added one to each count to avoid division by zero.

In [62]:
# calculate the ratio of spam-to-ham for each token
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state = 6)
                   ham      spam  spam_ratio
token
very          0.017971  0.005338    0.297044
nasty         0.000553  0.003559    6.435943
villa         0.000276  0.003559   12.871886
beloved       0.000553  0.001779    3.217972
textoperator  0.000276  0.005338   19.307829

Among the 5 sampled words, the most "spammy" is 'textoperator', followed by 'villa', 'nasty', 'beloved', and finally 'very'. This is, in essence, what the Naive Bayes algorithm does: it learns something like the 'spam_ratio' column and uses it to make predictions.
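
The whole "spamminess" calculation can be reproduced by hand for a few tokens. This sketch uses only numbers that appear in the cells above (the raw counts of three sampled tokens and the class sizes 3617 and 562); the helper name spam_ratio is ours.

```python
# number of ham and spam messages in the training set (nb.class_count_)
ham_docs, spam_docs = 3617, 562

# raw (ham, spam) counts of three sampled tokens, before adding 1
raw_counts = {'very': (64, 2), 'nasty': (1, 1), 'textoperator': (0, 2)}

def spam_ratio(ham_count, spam_count):
    """Hypothetical helper: add 1 to each count, normalize by class size, divide."""
    ham_freq = (ham_count + 1) / ham_docs
    spam_freq = (spam_count + 1) / spam_docs
    return spam_freq / ham_freq

ratios = {token: spam_ratio(*counts) for token, counts in raw_counts.items()}

for token, ratio in ratios.items():
    print(token, round(ratio, 6))
```

For 'nasty' the two adjusted counts are equal, so the ratio reduces to 3617 / 562: the class imbalance alone makes the word about 6.4 times more typical of spam.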

In [63]:
# examine the DataFrame sorted by spam_ratio 
# words with the highest spam ratio
tokens.sort_values('spam_ratio', ascending = False).head(10)
                 ham      spam  spam_ratio
token
claim       0.000276  0.158363  572.798932
prize       0.000276  0.135231  489.131673
150p        0.000276  0.087189  315.361210
tone        0.000276  0.085409  308.925267
guaranteed  0.000276  0.076512  276.745552
18          0.000276  0.069395  251.001779
cs          0.000276  0.065836  238.129893
www         0.000553  0.129893  234.911922
1000        0.000276  0.056940  205.950178
awarded     0.000276  0.053381  193.078292
In [64]:
# examine the DataFrame sorted by spam_ratio
# words with the lowest spam ratio
tokens.sort_values('spam_ratio',  ascending = False).tail(10)
              ham      spam  spam_ratio
token
already  0.019630  0.001779    0.090647
too      0.021841  0.001779    0.081468
come     0.048936  0.003559    0.072723
later    0.030688  0.001779    0.057981
lor      0.032900  0.001779    0.054084
da       0.032900  0.001779    0.054084
she      0.035665  0.001779    0.049891
he       0.047000  0.001779    0.037858
lt       0.064142  0.001779    0.027741
gt       0.064971  0.001779    0.027387
In [65]:
# look up the spam_ratio for a given token
tokens.loc['dating', 'spam_ratio']

Part 8 : Creating a DataFrame from individual text files

What is the motivation for this section? By now we are probably thinking about our own text data, which is usually stored in a collection of separate documents rather than in a pre-built DataFrame. Let's see how we can deal with this.

In [66]:
# use glob to create a list of ham filenames
import glob
ham_filenames = glob.glob('../data/ham_files/*.txt')

The job of the glob module is to come up with a list of filenames so that we can later iterate through them (glob does not read the files).
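
To try the glob-then-read pattern without the course data, the sketch below first writes a few throwaway files into a temporary directory. The file names and contents are made up, and sorted() is our addition to make the order deterministic, since glob itself does not guarantee any particular order.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # create three hypothetical files; only two of them end in .txt
    for name, body in [('a.txt', 'first email\n'),
                       ('b.txt', 'second email\n'),
                       ('notes.md', 'ignored\n')]:
        with open(os.path.join(tmpdir, name), 'w') as f:
            f.write(body)

    # glob only returns the matching paths; it does not open the files
    filenames = sorted(glob.glob(os.path.join(tmpdir, '*.txt')))
    basenames = [os.path.basename(path) for path in filenames]

    # reading is a separate step: one list element per file
    texts = []
    for filename in filenames:
        with open(filename) as f:
            texts.append(f.read())

print(basenames)
print(texts)
```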

In [67]:
# read the contents of the ham files into a list (each list element is one email)
ham_text = []
for filename in ham_filenames:
    with open(filename) as f:
        ham_text.append(f.read())
ham_text
['This is a ham email.\nIt has 2 lines.\n',
 'This is another ham email.\n',
 'This is yet another ham email.\n']
In [68]:
# repeat this process for the spam files
spam_filenames = glob.glob('../data/spam_files/*.txt')
spam_text = []
for filename in spam_filenames:
    with open(filename) as f:
        spam_text.append(f.read())
spam_text
['This is a spam email.\n', 'This is another spam email.\n']
In [69]:
# combine the ham and spam lists
all_text = ham_text + spam_text
all_text
['This is a ham email.\nIt has 2 lines.\n',
 'This is another ham email.\n',
 'This is yet another ham email.\n',
 'This is a spam email.\n',
 'This is another spam email.\n']

Python trick: if we multiply a one-element list, e.g. [0], by an integer, e.g. len(ham_text), we get a new list of that length in which the element is repeated.

In [70]:
# create a list of labels (ham = 0, spam = 1)
all_labels = [0]*len(ham_text) + [1]*len(spam_text)
all_labels
[0, 0, 0, 1, 1]
In [71]:
# convert the lists into a DataFrame
pd.DataFrame({
    'label' : all_labels,
    'message' : all_text
})
   label  message
0  0      This is a ham email.\nIt has 2 lines.\n
1  0      This is another ham email.\n
2  0      This is yet another ham email.\n
3  1      This is a spam email.\n
4  1      This is another spam email.\n

CountVectorizer treats a newline character like any other whitespace: it only separates tokens and never becomes part of one, so it is effectively ignored.
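
This can be checked directly with build_analyzer, which returns the function a vectorizer uses to turn one document into tokens. The sketch below applies it, with default settings, to the two-line ham email from the example above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# the analyzer combines preprocessing (lowercasing) and tokenization
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer('This is a ham email.\nIt has 2 lines.\n')
print(tokens)
```

The newline leaves no trace in the token list; the only things dropped besides punctuation are the one-character tokens 'a' and '2', for the same reason 'a' was dropped earlier.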