**Naive Bayes and logistic regression are both classification algorithms.**

- Model building in scikit-learn (refresher)
- Representing text as numerical data
- Reading the SMS data
- Vectorizing the SMS data
- Building a Naive Bayes model
- Comparing Naive Bayes with logistic regression
- Calculating the "spamminess" of each token
- Creating a DataFrame from individual text files

In [1]:

```
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
print(type(iris))
```

In [2]:

```
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
```

**"Features"** are also known as predictors, inputs, or attributes. The **"response"** is also known as the target, label, or output.

In [3]:

```
# check the shapes of X and y
print(X.shape)
print(y.shape)
```

**"Observations"** are also known as samples, instances, or records.

In [4]:

```
# examine the first 5 rows of X and its type
print(X[:5])
print(type(X))
```

Note that `X` and `y` are not DataFrames: they are NumPy ndarrays. The `iris` object itself is not even an ndarray; it is a `Bunch` object. In the next cell, we convert `X` to a pandas DataFrame.

In [5]:

```
# examine the first 5 rows of X (including the feature names)
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()
```

Out[5]:

In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**.

**The fitting step is where the model is learning the relationship between the features and the response.**

In [6]:

```
# import the class
from sklearn.neighbors import KNeighborsClassifier
# instantiate the model (with the default parameters)
knn = KNeighborsClassifier()
# fit the model with data (occurs in-place)
knn.fit(X, y)
```

Out[6]:

In order to make a **prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [7]:

```
# predict the response for a new observation
knn.predict([[3, 5, 4, 2]])
```

Out[7]:

In [8]:

```
# example text for model training (SMS messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
```

From the scikit-learn documentation:

Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than **raw text documents with variable length**.

We will use `CountVectorizer` to take text data and convert it to numerical data. `CountVectorizer` follows the same pattern as all scikit-learn estimators: in the same way that we import, instantiate, and fit a model, we import, instantiate, and fit `CountVectorizer`. So even though `CountVectorizer` is not a model, it exposes the same API. We should not get confused: `CountVectorizer` **is not** a model, but it has a `fit` method.

We will use CountVectorizer to "convert text into a matrix of token counts" :

In [9]:

```
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
```

Note that the module is called `feature_extraction`. We will see shortly that each word in the `simple_train` list will become a feature, so the module name makes perfect sense. Also note that, in contrast with the model we instantiated on the iris dataset, the `fit` method now corresponds to the vectorizer learning the vocabulary of the documents (i.e. the SMS messages).

In [10]:

```
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)
```

Out[10]:

Once a vectorizer has been fit, it exposes a **method** called `get_feature_names` which returns the fitted vocabulary, i.e. the vocabulary that the vectorizer learned.

In [11]:

```
# examine the fitted vocabulary
vect.get_feature_names()
```

Out[11]:

Note that it removed the word `'a'` from the SMS message `'Call me a cab'` because the default `token_pattern` uses a regular expression that drops tokens shorter than two characters. Finally, to be clear, vectorizers do not try to understand language; they simply convert text into a matrix of token counts.
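This behavior is easy to verify by loosening the pattern. A minimal sketch (assuming scikit-learn >= 1.0, where the fitted vocabulary is exposed via `vocabulary_` and `get_feature_names_out`): replacing the default `token_pattern` with one that accepts single-character tokens keeps `'a'` in the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# the default pattern r"(?u)\b\w\w+\b" requires 2+ word characters;
# r"(?u)\b\w+\b" accepts single characters as well
vect_loose = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vect_loose.fit(docs)
vocab = sorted(vect_loose.vocabulary_)
print(vocab)  # ['a', 'cab', 'call', 'me', 'please', 'tonight', 'you']
```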

In [12]:

```
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
```

Out[12]:

Three input samples and 6 feature names, or equivalently, three documents and 6 vocabulary words (also known as tokens). "In numerical analysis and computer science, a sparse matrix (*matrice creuse*) or sparse array is a matrix in which most of the elements are zero. By contrast, if most of the elements are nonzero, then the matrix is considered dense." *(Wikipedia)*
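We can measure how sparse our toy matrix is directly: scipy sparse matrices expose `nnz`, the number of stored (nonzero) entries, which we can compare against the total number of cells.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
dtm = CountVectorizer().fit_transform(simple_train)

# nnz counts the stored (nonzero) entries; the rest are implicit zeros
density = dtm.nnz / (dtm.shape[0] * dtm.shape[1])
print(dtm.shape, dtm.nnz, density)  # (3, 6) 9 0.5
```

Our toy matrix is only 50% zeros; real document-term matrices are far sparser.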

In [13]:

```
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
```

Out[13]:

In [14]:

```
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
```

Out[14]:

Initially we had text data which was non-numeric with variable length. Now, the text is represented as a feature matrix with a fixed number of columns. Again, note that the words/tokens have become finite features.

From the scikit-learn documentation :

In this scheme, features and samples are defined as follows:

- Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
- The vector of all the token frequencies for a given document is considered a multivariate **sample**.

A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
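The "ignoring position" point is worth seeing concretely: two documents containing the same words in a different order produce identical count vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two documents with the same words in a different order
dtm = CountVectorizer().fit_transform(
    ['the cat chased the dog', 'the dog chased the cat'])
rows = dtm.toarray()
print(rows)  # both rows are identical: bag of words ignores word order
```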

In [15]:

```
# print the sparse matrix
print(simple_train_dtm)
```

From the scikit-learn documentation:

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse`
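The memory saving is easy to quantify. A sketch with a synthetic matrix (made-up shape and density, chosen only for illustration): a dense float64 array stores every cell, while the CSR format stores only the nonzero values plus their indices.

```python
import numpy as np
from scipy.sparse import csr_matrix

# a 1000 x 10000 matrix with at most 10000 nonzero cells (~0.1% density)
rng = np.random.default_rng(0)
dense = np.zeros((1000, 10000))
dense[rng.integers(0, 1000, 10000), rng.integers(0, 10000, 10000)] = 1.0
sparse = csr_matrix(dense)

dense_bytes = dense.nbytes  # 10 million float64 cells = 80 MB
sparse_bytes = (sparse.data.nbytes + sparse.indices.nbytes
                + sparse.indptr.nbytes)  # only nonzeros + indices
print(dense_bytes, sparse_bytes)
```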

In [16]:

```
# example text for model testing
simple_test = ["please don't call me"]
```

In order to make a **prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

What does the previous sentence mean for text data? It means that a new observation must have six columns, and those six columns must have the same meaning as the columns in the training document-term matrix, i.e. exactly the same features.

In [17]:

```
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
```

Out[17]:

In [18]:

```
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
```

Out[18]:

Note that the word `'don't'` is missing. Why? Because it is not part of the vocabulary that our vectorizer learned during the fitting process on the training data. In other words, `'don't'` is not a token seen in our training data, so it simply gets dropped.

We can argue that it is actually fine to drop `'don't'`. Why? Since it is not present in the original corpus, it cannot assist in modelling future documents: the model was trained on a corpus that never contained it, so it has no way to use this new information.

**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)

Again, we didn't run `vect.fit()` on the testing data. Had we fit on the testing data, we would have learned a new vocabulary of four words that would not match the six-word vocabulary learned from the training data. Finally, note that we only convert sparse matrices to dense matrices for the purpose of examining them. In practice (i.e. when we are not going to look at the matrix), we keep the sparse representation; the conversion above was for display purposes only.

In [19]:

```
# read file into Pandas using a relative path
path = '../data/sms.tsv'
sms = pd.read_table(path, header = None, names = ['label', 'message'])
```

In [20]:

```
# examine the shape
sms.shape
```

Out[20]:

In [21]:

```
# examine the first 10 rows
sms.head(10)
```

Out[21]:

In [22]:

```
# examine the class distribution
sms.label.value_counts()
```

Out[22]:

In [23]:

```
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham' : 0, 'spam' : 1})
```

In [24]:

```
# check that the conversion worked
sms.head(10)
```

Out[24]:

In [25]:

```
# usual way to define X and y (from the iris data) for use with a MODEL
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)
```

In [26]:

```
# required way to define X and y for use with COUNTVECTORIZER
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)
```

Recall that with the iris dataset we defined `X` as two-dimensional and `y` as one-dimensional. For `CountVectorizer` (i.e. when working with text data), both `X` and `y` are one-dimensional Series. So why do we now define `X` as a one-dimensional object in the text context? Because `CountVectorizer` will convert it into a two-dimensional object. If we pass a two-dimensional object to `CountVectorizer`, it won't know what to do with it: `CountVectorizer` takes one-dimensional objects and turns them into two-dimensional objects.

In [27]:

```
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
```

Why are we applying train-test split before vectorization? The whole point of the split is that the test set simulates the future. Future documents will contain words our model has never seen before, and the model will be handicapped by the fact that it only learned from the past. If we vectorize before splitting, our testing set won't contain any words the vectorizer hasn't already learned. In other words, we make things too easy for the model if we vectorize first and split afterwards.
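One way to make "split first, vectorize second" automatic is scikit-learn's `Pipeline`: the vectorizer is then fit only on whatever data is passed to `fit`. A sketch with toy stand-in messages (assumed data, not the real SMS corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy stand-in messages and labels (assumed data for illustration)
X = ['win cash now', 'call you tonight', 'free prize claim',
     'see you at lunch', 'urgent prize waiting', 'lunch again tomorrow']
y = [1, 0, 1, 0, 1, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# inside pipe.fit, CountVectorizer is fit only on the training fold,
# so the test set can still contain unseen words
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
print(score)
```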

In [28]:

```
# instantiate the vectorizer
vect = CountVectorizer()
```

In [29]:

```
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
```

The next cell is more computationally efficient than the previous one and does exactly the same thing in one line of code.

In [30]:

```
# alternative : combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)
```

In [31]:

```
# examine the document-term matrix
X_train_dtm
```

Out[31]:

4179 training documents (SMS messages) and 7456 **unique** tokens (individual words). Recall that even if a word is used twice in a document (or in different documents), it becomes a single column, i.e. one unique feature.

In [32]:

```
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
```

Out[32]:

Recall that `transform` uses the **fitted** vocabulary, which contains 7456 tokens. We also knew the output would have 1393 rows, because 1393 is the number of test documents.

At this point, our data is in the right form for model building. Note that it has a structure similar to the iris example we saw at the beginning.

We will use multinomial Naive Bayes:

The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

In [33]:

```
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
```

In [34]:

```
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)
```

Out[34]:

In [35]:

```
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
```

In [36]:

```
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)
```

Out[36]:

In [37]:

```
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
```

Out[37]:

There are 5 false positive messages ($a_{12} = 5$). A false positive is a HAM message that was incorrectly classified as SPAM.

In [38]:

```
# print message text for the false positives (meaning they were incorrectly classified as spam)
X_test[y_test < y_pred_class]
```

Out[38]:

The previous cell uses a boolean condition to return the $5$ false positive messages corresponding to the confusion matrix element $a_{12}$. For any case where the prediction is $1$ and the actual value is $0$, the comparison returns `True`.

The resulting Series of trues and falses is then passed to `X_test` to select those rows. In other words, these are the $5$ messages that were hand-labeled as HAM (meaning the person who built the training set said they are HAM) and that the classifier incorrectly predicted as SPAM; that is the definition of a false positive. Formally, we are comparing the vectors `y_test` and `y_pred_class` element-wise.
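The masking logic can be isolated on tiny made-up label vectors (assumed values, not the real test set):

```python
import numpy as np

# assumed toy labels for illustration
y_true = np.array([0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0])

fp_mask = y_true < y_pred   # actual 0, predicted 1 -> false positive
fn_mask = y_true > y_pred   # actual 1, predicted 0 -> false negative
print(fp_mask)  # [ True False False False False]
print(fn_mask)  # [False False False  True False]
```

Indexing `X_test` with such a mask keeps exactly the rows where the mask is `True`.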

In [39]:

```
# print message text for the false negatives (meaning they were incorrectly classified as ham)
X_test[y_test > y_pred_class]
```

Out[39]:

A false negative is a SPAM message that was incorrectly classified as HAM, i.e. the message should normally belong to the positive class.

In [40]:

```
# what do you notice about the false negatives ?
X_test[3132]
```

Out[40]:

Interestingly, we note that the false negatives are a lot longer. With only 16 instances at our disposal, drawing conclusions from such a small set of data is not very rigorous. We would tentatively conclude (or at least suspect) that since these messages are so long, Naive Bayes is getting lost in all the very normal, "hammy" words (like `'Thank you'`).

In [41]:

```
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
```

Out[41]:

Predicted probabilities answer the question: "What is the predicted probability of class membership?" Put differently: "For each observation, what is the model's estimate of the likelihood (*vraisemblance*) that it is SPAM or HAM?" In the above code, column 1 holds the predicted probability of class 1 (recall that class 1 is SPAM). So for the first message in the test set, the model thinks the likelihood that it is SPAM is $0.0028$. Similarly, for the very last message, it thinks the likelihood is almost zero. Note that for the penultimate (*l'avant-dernier*) message it is $100\%$ sure it is SPAM. From these results, we can spot a weakness of Naive Bayes (compared to logistic regression): Naive Bayes outputs poorly calibrated predicted probabilities, i.e. very extreme numbers that are not useful when interpreted as likelihoods.

In [42]:

```
nb.predict_proba(X_test_dtm)
```

Out[42]:

We should not be confused by the two previous cells. We initially ran `nb.fit(X_train_dtm, y_train)`, which allowed us to make predictions. The multinomial Naive Bayes model provides two kinds of predictions: class predictions via `nb.predict(X_test_dtm)` and predicted probabilities of class membership via `nb.predict_proba(X_test_dtm)`. That's it.

In [43]:

```
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
```

Out[43]:

Why are predicted probabilities useful (why do we care about them)? For one thing, **AUC requires predicted probabilities**, and so does log loss. On a more concrete level, maybe we don't actually care about class predictions. In the case of credit card fraud ("Is this transaction fraudulent or not?"), we might not care about accuracy, i.e. "Did I get it right or wrong?", "Did the predicted probability break $50\%$ or not?". We might say that all we care about is this: if something has more than a $10\%$ likelihood of being fraud, we flag it and disallow the purchase. In that case, what we really care about is having finely tuned predicted probabilities, because it matters more whether the model predicts a $2\%$ or a $12\%$ likelihood of fraud.
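The fraud scenario boils down to applying a custom threshold to the probabilities instead of the default $50\%$ cutoff. A sketch with hypothetical probabilities (assumed values for six transactions):

```python
import numpy as np

# hypothetical fraud probabilities for six transactions (assumed values)
y_pred_prob = np.array([0.02, 0.12, 0.55, 0.08, 0.91, 0.30])

flag_default = (y_pred_prob > 0.50).astype(int)  # standard 50% cutoff
flag_strict = (y_pred_prob > 0.10).astype(int)   # business-driven 10% cutoff
print(flag_default)  # [0 0 1 0 1 0]
print(flag_strict)   # [0 1 1 0 1 1]
```

With the stricter threshold, the $12\%$ and $30\%$ transactions are now flagged; this only works if the probabilities themselves are trustworthy.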

We will compare multinomial Naive Bayes with logistic regression:

Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

In [44]:

```
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
```

In [45]:

```
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)
```

Out[45]:

We notice that, compared to Naive Bayes, the wall time is much larger for logistic regression: on the order of eight to ten times slower. Speed is one of the advantages of the Naive Bayes algorithm.

In [46]:

```
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)
```

In [47]:

```
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob
```

Out[47]:

Note that the predicted probabilities from logistic regression are well calibrated (compared to Naive Bayes), i.e. they are much more likely to be interpretable as actual likelihoods. We don't take Naive Bayes predicted probabilities very seriously, whereas logistic regression predicted probabilities tend to be better calibrated.

In [48]:

```
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
```

Out[48]:

In [49]:

```
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)
```

Out[49]:

Naive Bayes allows us to get more insight into our model. Here are questions we would like to answer: "Why did certain messages get flagged as HAM versus SPAM?", "Which individual words did the model consider spammy?". Let's explore that a little.

In [50]:

```
# store the vocabulary of X_train
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)
```

Out[50]:

In [51]:

```
# examine the first 50 tokens
print(X_train_tokens[0:50])
```

In [52]:

```
# examine the last 50 tokens
print(X_train_tokens[-50:])
```

It turns out that when running a fit, Naive Bayes counts the number of times that each token appears **in each class** (HAM or SPAM).

In [53]:

```
# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_
```

Out[53]:

The 2 rows above represent the two classes: class 0 for HAM and class 1 for SPAM. The 7456 columns represent the features. For example, the token `'00'` was found zero times in HAM messages and five times in SPAM messages, whereas the very last token, `'〨ud'`, appeared once in a HAM message and zero times in SPAM messages. We can now use this data to decide which words are considered "hammy" and which are considered "spammy".

In [54]:

```
# rows represent classes, columns represent tokens
nb.feature_count_.shape
```

Out[54]:

As a side note, how do we know which row is HAM and which row is SPAM? Recall that at the very beginning we assigned HAM to class 0 and SPAM to class 1. The same ordering applies to the confusion matrix: it is sorted with the numerically lowest class closest to the upper-left corner.
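We can confirm the row ordering on a tiny fit with made-up counts (assumed data): the rows of `feature_count_` follow `classes_`, which scikit-learn sorts numerically.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# made-up count features for four observations in two classes
X = np.array([[1, 0], [0, 2], [3, 1], [0, 1]])
y = np.array([0, 1, 0, 1])
nb_demo = MultinomialNB().fit(X, y)

print(nb_demo.classes_)        # [0 1] -> row 0 is class 0, row 1 is class 1
print(nb_demo.feature_count_)  # per-class column sums: [[4. 1.] [0. 3.]]
```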

In [55]:

```
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]
ham_token_count
```

Out[55]:

Just a quick recap. We have a corpus of documents which, in our case, is a collection of 5572 SMS messages. This corpus contains a total of 7456 distinct words (or tokens). Each word may appear in HAM messages, in SPAM messages, or in both classes. We would like to know the number of times each word appears in a specific class.

In [56]:

```
# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
spam_token_count
```

Out[56]:

In [57]:

```
# create a DataFrame of tokens with their separate ham and spam counts
tokens = pd.DataFrame({
    'token': X_train_tokens,
    'ham': ham_token_count,
    'spam': spam_token_count
}).set_index('token')
```

DataFrames have a method called `sample` which returns random rows.

In [58]:

```
# examine 5 random DataFrame rows
tokens.sample(5, random_state = 6)
```

Out[58]:

**Tricky question:** Based on the above table, if we saw another text message containing the word `'nasty'`, would we say that it is equally predictive of HAM and SPAM, more predictive of HAM, or more predictive of SPAM? **Answer:** It is more predictive of SPAM: since there are fewer SPAM messages overall, the same raw count represents a greater proportion of SPAM (we refer to this as **class imbalance**).
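The arithmetic behind this answer, with hypothetical counts (the token appearing twice in each class, and assumed class sizes reflecting the ~80/20 imbalance):

```python
# hypothetical counts: the token appears twice in each class, but the
# assumed class sizes make its spam frequency four times higher
ham_count = spam_count = 2
n_ham, n_spam = 4000, 1000

ham_freq = ham_count / n_ham     # 0.0005
spam_freq = spam_count / n_spam  # 0.002
print(spam_freq / ham_freq)      # 4.0
```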

Recall that our overall dataset has approximately $80\%$ HAM messages and $20\%$ SPAM messages. When we applied `train_test_split`, those proportions were approximately preserved. For this reason, raw SPAM counts must be given higher overall weight to account for the **class imbalance**. We also need to avoid dividing by zero before we can calculate the "spamminess" of each token.

In [59]:

```
# add 1 to ham and spam counts to avoid dividing by 0
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state = 6)
```

Out[59]:

Next, we need to normalize the numbers in each column so that we can get frequencies rather than counts. For the HAM Series, we need to divide each element by the total number of HAM messages. Similarly for the SPAM Series, we need to divide each element by the total number of SPAM messages.

In [60]:

```
# Naive Bayes counts the number of observations in each class
nb.class_count_
```

Out[60]:

In [61]:

```
# convert the ham and spam counts into frequencies
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state = 6)
```

Out[61]:

This is a much better measure than the previous raw counts because it is adjusted for the **class imbalance**. Indeed, we notice that the token `'nasty'` is now less prevalent in HAM messages than in SPAM messages. Still, the above ratios are not exactly right, since we added one to each count to avoid division by zero.

In [62]:

```
# calculate the ratio of spam-to-ham for each token
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state = 6)
```

Out[62]:

The most "spammy" word (out of the 5 sampled ones) is `'textoperator'`

, followed by `'villa'`

, then followed by `'nasty'`

, then followed by `'beloved'`

, and finally followed by `'very'`

. This is the essence of the Naive Bayes algorithm : it learns the `'spam_ratio'`

column and uses it to make predictions.

In [63]:

```
# examine the DataFrame sorted by spam_ratio
# words with the highest spam ratio
tokens.sort_values('spam_ratio', ascending = False).head(10)
```

Out[63]:

In [64]:

```
# examine the DataFrame sorted by spam_ratio
# words with the lowest spam ratio
tokens.sort_values('spam_ratio', ascending = False).tail(10)
```

Out[64]:

In [65]:

```
# look up the spam_ratio for a given token
tokens.loc['dating', 'spam_ratio']
```

Out[65]:

What is the motivation for this section? By now we are probably thinking about our own text data, which is usually stored in a bunch of separate files rather than in a pre-built DataFrame. Let's see how to deal with this.

In [66]:

```
# use glob to create a list of ham filenames
import glob
ham_filenames = glob.glob('../data/ham_files/*.txt')
ham_filenames
```

Out[66]:

The job of the `glob` module is to come up with a **list** of filenames so that we can later iterate through them (`glob` does not read the files).

In [67]:

```
# read the contents of the ham files into a list (each list element is one email)
ham_text = []
for filename in ham_filenames:
    with open(filename) as f:
        ham_text.append(f.read())
ham_text
```

Out[67]:

In [68]:

```
# repeat this process for the spam files
spam_filenames = glob.glob('../data/spam_files/*.txt')
spam_text = []
for filename in spam_filenames:
    with open(filename) as f:
        spam_text.append(f.read())
spam_text
```

Out[68]:

In [69]:

```
# combine the ham and spam lists
all_text = ham_text + spam_text
all_text
```

Out[69]:

**Python trick:** If we multiply a one-element list, e.g. `[0]`, by a number, e.g. `len(ham_text)`, Python returns a new list of that length with the element repeated.
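The trick in isolation, with assumed list lengths standing in for `len(ham_text)` and `len(spam_text)`:

```python
# assumed list lengths for illustration
n_ham, n_spam = 3, 2

labels = [0] * n_ham + [1] * n_spam
print(labels)  # [0, 0, 0, 1, 1]
```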

In [70]:

```
# create a list of labels (ham = 0, spam = 1)
all_labels = [0]*len(ham_text) + [1]*len(spam_text)
all_labels
```

Out[70]:

In [71]:

```
# convert the lists into a DataFrame
pd.DataFrame({
    'label': all_labels,
    'message': all_text
})
```

Out[71]:

`CountVectorizer` knows what a newline character is and will simply ignore it (treating it like any other whitespace).
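A quick check: a document containing a newline is tokenized exactly as if it contained a space.

```python
from sklearn.feature_extraction.text import CountVectorizer

# the newline acts like any other whitespace during tokenization
vect_nl = CountVectorizer().fit(['hello\nworld'])
print(sorted(vect_nl.vocabulary_))  # ['hello', 'world']
```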