Spam Filter using Naive Bayes

Building a spam filter from scratch using the Naive Bayes algorithm

In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event based on prior knowledge of conditions that might be related to the event.
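
As a small worked illustration of Bayes' theorem, the sketch below computes the probability that a message is spam given that it contains the word "free". All numbers are made up for the example and do not come from the dataset, and the helper name posterior is hypothetical.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only for illustration.
p_spam = 0.13
p_ham = 1 - p_spam
p_free_given_spam = 0.30
p_free_given_ham = 0.02

# Law of total probability: P(free) = P(free|spam)P(spam) + P(free|ham)P(ham)
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 3))  # 0.691
```

With these assumed numbers, seeing a single word raises the probability of spam from the prior of 13% to about 69%.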

The aim of this project is to build a spam filter using the Naive Bayes algorithm from scratch. Leveraging Bayes' theorem, the model is built to predict the probability that a given text is spam or non-spam, based on the training data exposed to the model.

The dataset consists of text messages and a target variable specifying whether the text message was spam or not. The dataset has been picked up from the UCI Machine Learning Repository.

The columns in the dataset are not named, hence the names given here - Label and SMS.

-> Label - target variable spam and ham (non-spam)
-> SMS - the message

The dataset has to be explored before building the model, so as to remove any discrepancies in the text.

In [184]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from itertools import chain
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from wordcloud import WordCloud,STOPWORDS,ImageColorGenerator
from PIL import Image

In [143]:

df = pd.read_csv('SMSSpamCollection',sep='\t',header=None)
df.columns = ['Label','SMS']
print(df.shape)
df.head(5)

(5572, 2)

Out[143]:

	Label	SMS
0	ham	Go until jurong point, crazy.. Available only ...
1	ham	Ok lar... Joking wif u oni...
2	spam	Free entry in 2 a wkly comp to win FA Cup fina...
3	ham	U dun say so early hor... U c already then say...
4	ham	Nah I don't think he goes to usf, he lives aro...

Whenever building a classifier, it is important to identify the proportions of the target classes and to know beforehand whether the classifier is going to be exposed to imbalanced data. In this case, that is the proportion of spam to ham (non-spam) messages. A countplot captures this statistic well.

In [144]:

%matplotlib inline
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='Label',data=df)
plt.ylabel('')
plt.yticks([])
df.Label.value_counts(normalize=True)

Out[144]:

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

The dataset is highly imbalanced i.e. the classes of the target variable are not present in equal proportions. There are multiple ways to address this. Most commonly, the dataset can be Under Sampled i.e. random samples of the majority class are drawn until it matches the minority class in size, or Over Sampled i.e. the minority class is increased in number, for example by interpolating new points, until it can compete with the majority class. For this project, neither of the two techniques is used: the classifier is a probabilistic model that accounts for the class proportions explicitly through the priors P(spam) and P(ham), and it estimates the word probabilities separately within each class.
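
Although neither technique is used in this project, random under sampling can be sketched in a few lines. This is only an illustration on made-up rows; the function name undersample and its parameters are hypothetical and not part of the notebook.

```python
import random

def undersample(rows, label_of, seed=1):
    """Randomly keep, for every class, as many rows as the smallest class has."""
    rng = random.Random(seed)

    # Group the rows by their label.
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)

    # Draw a random sample of the minority-class size from every group.
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))

    rng.shuffle(balanced)
    return balanced

# Toy data: 8 ham rows and 2 spam rows (made up for the example).
rows = [("ham", f"message {i}") for i in range(8)]
rows += [("spam", "win cash"), ("spam", "free tone")]

balanced = undersample(rows, label_of=lambda row: row[0])
counts = {label: sum(1 for r in balanced if r[0] == label) for label in ("ham", "spam")}
print(counts)  # {'ham': 2, 'spam': 2}
```

The obvious cost is that most of the majority class is thrown away, which matters for a dataset as small as this one.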

It is very important to test the model on data it has not seen, to validate it and get feedback on its performance. The dataset here is split 80:20: the train data is 80% and the test data 20% of the entire population, all randomly selected. The sampling is done using the sample() function of pandas. An easier method would be to use the inbuilt train_test_split() function of the sklearn.model_selection module.

In [145]:

randomized = df.sample(frac=1,random_state=1)
train_size = round(len(randomized) * 0.8)
train = randomized[:train_size].reset_index(drop=True)
test = randomized[train_size:].reset_index(drop=True)
print(train.shape)
print(test.shape)

(4458, 2)
(1114, 2)

A sample of a population has to be representative of the population, otherwise the results obtained can be faulty or skewed. It is therefore important to check this criterion before moving forward with the project.

In [72]:

train.Label.value_counts(normalize=True)

Out[72]:

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [73]:

test.Label.value_counts(normalize=True)

Out[73]:

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

Comparing the splits, the class proportions of the two splits and of the entire population (found above) are very close to each other. This is a good enough indication that the split samples are good representatives of the population for this dataset.

The general idea for this classifier is that, for every word in the input, the probability of that word appearing in spam messages and in non-spam messages is calculated. These probabilities are then compared. If the probabilities cumulatively indicate that this word or collection of words is more strongly associated with spam (non-spam) messages, then the input is classified as spam (non-spam).

This logic calls for cleaning the text so as to keep only the useful terms and remove stopwords, punctuation and anything extra in the texts.

In [74]:

train.SMS.head(10)

Out[74]:

0                         Yep, by the pretty sculpture
1        Yes, princess. Are you going to make me moan?
2                           Welp apparently he retired
3                                              Havent.
4    I forgot 2 ask ü all smth.. There's a card on ...
5    Ok i thk i got it. Then u wan me 2 come now or...
6    I want kfc its Tuesday. Only buy 2 meals ONLY ...
7                           No dear i was sleeping :-P
8                            Ok pa. Nothing problem:-)
9                      Ill be there on  &lt;#&gt;  ok.
Name: SMS, dtype: object

In [146]:

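# regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
train.SMS = train['SMS'].str.replace(r'\W',' ',regex=True)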
train.SMS = train['SMS'].str.lower()
train.SMS.head(10)

Out[146]:

0                         yep  by the pretty sculpture
1        yes  princess  are you going to make me moan 
2                           welp apparently he retired
3                                              havent 
4    i forgot 2 ask ü all smth   there s a card on ...
5    ok i thk i got it  then u wan me 2 come now or...
6    i want kfc its tuesday  only buy 2 meals only ...
7                           no dear i was sleeping   p
8                            ok pa  nothing problem   
9                      ill be there on   lt   gt   ok 
Name: SMS, dtype: object

The normalized messages are used to create the Term Document Matrix. This is a mapping of every word to its message and the number of times the word occurs in the message. Think of it as a matrix with one row per message and one column per word in the vocabulary. A value 'n' (> 0) at the index -> (row,col) means the word 'col' appears n times in the text 'row'.

The first step to building the Term Document Matrix is to create the vocabulary of the texts. The vocabulary is a set of all unique words in all the texts.

In [104]:

messages = train.SMS.str.split()
words = list(chain(*messages))
vocabulary = pd.Series(words).unique()
len(vocabulary)

Out[104]:

7783

In the train set, there are 7783 unique words in the vocabulary, all retrieved from the messages.

Using the vocabulary, the Term Document Matrix is derived. For ease of use, a dictionary representing the same concept is built.

In [110]:

no_of_messages = len(train.SMS)
word_counts = {word: [0] * no_of_messages for word in vocabulary}

for index,mssge in enumerate(messages):
    for word in mssge:
        word_counts[word][index] += 1

The above method builds a dictionary with every term in the vocabulary as a key and, for each term, a vector recording in which messages the term appears and how many times, analogous to the Term Document Matrix.
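
To make the structure of this dictionary concrete, here is the same idea on three made-up messages. The names toy_messages, toy_vocabulary and toy_counts are hypothetical and chosen so that they do not clash with the notebook's variables.

```python
from collections import Counter

# Three made-up, already cleaned messages.
toy_messages = [
    "free entry free prize",
    "see you at home",
    "call now for free prize",
]
tokenized = [message.split() for message in toy_messages]

# Vocabulary: unique words, in order of first appearance.
toy_vocabulary = list(dict.fromkeys(word for tokens in tokenized for word in tokens))

# word -> list of counts, one entry per message.
toy_counts = {word: [0] * len(tokenized) for word in toy_vocabulary}
for index, tokens in enumerate(tokenized):
    for word, count in Counter(tokens).items():
        toy_counts[word][index] = count

print(toy_counts["free"])  # [2, 0, 1]
print(toy_counts["home"])  # [0, 1, 0]
```

Each list has one position per message, so "free" occurring twice in the first message and once in the third shows up as [2, 0, 1].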

The method above is a way of doing this from scratch that does not take stopwords into account. The same can be achieved with the CountVectorizer class of the sklearn.feature_extraction.text module, which can also remove stopwords.

In [147]:

token = RegexpTokenizer(r'[a-z0-9A-Z]+')
vectorizer = CountVectorizer(
    lowercase=True,
    stop_words='english',
    ngram_range=(1,1),
    tokenizer = token.tokenize
)
counts = vectorizer.fit_transform(train['SMS'])
# get_feature_names_out() (scikit-learn >= 1.0) replaces get_feature_names(), removed in 1.2
vocabulary = list(vectorizer.get_feature_names_out())
print(len(vocabulary))

word_counts = dict(zip(vocabulary,np.transpose(counts.toarray())))

The vocabulary contains 7508 terms after removing the stopwords. Since the vocabulary is this long, it is impractical to draw inferences from it directly. One solution is to create a WordCloud to get a gist of the data. Separate WordClouds are created for the spam messages and the ham (non-spam) messages, which gives a sense of the kind of words present in each type of message.

The WordCloud class is part of the wordcloud package, which can be installed with either of -

pip install wordcloud
conda install -c conda-forge wordcloud

The WordCloud class creates an image object, which is displayed using the imshow() function of matplotlib. The width, height and other parameters are set when constructing the WordCloud object, which also has a provision for removing stopwords: the stopwords parameter accepts a collection of words to ignore. The wordcloud package ships its own STOPWORDS set; the cells below pass the English stopword list from nltk.corpus instead.

The first WordCloud is of the text messages, labeled as spam.

In [133]:

text = ""

for message in train[train.Label == 'spam'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "
    
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)

plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')

Out[133]:

(-0.5, 1199.5, 799.5, -0.5)

The WordCloud shows that words like free, call, now, txt, tone, reply, mobile and text appear frequently in the spam messages of the training set. This is intuitive, as these are the sort of words that generally appear in spam messages.

The next word cloud is for the ham (non-spam) label.

In [112]:

text = ""

for message in train[train.Label == 'ham'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "
    
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)

plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')

Out[112]:

(-0.5, 1199.5, 799.5, -0.5)

There are distinct differences in the variety and kind of words appearing in the two classes. The words appearing in ham (non-spam) messages are generally common, everyday words, as the WordCloud shows - love, come, know, got, home, need, dear etc. The ones in spam messages denote very direct actions or objects, like the words mentioned above.

For model purposes, the dictionary is converted to a DataFrame and appended to the initial dataset. This way the representation for each message becomes easy. Every row is an individual text and every column after the SMS and Label columns represents a word from the vocabulary.

In [148]:

vector_space = pd.DataFrame(word_counts)
train = pd.concat([train,vector_space],axis=1)
train.head(3)

Out[148]:

	Label	SMS	...
0	ham	yep by the pretty sculpture	...
1	ham	yes princess are you going to make me moan	...
2	ham	welp apparently he retired	...

3 rows × 7510 columns

Stopwords need no separate removal from the SMS column before model building: CountVectorizer has already excluded them from the vocabulary, so they have no column in the table above and are ignored when classifying. Stopwords are usually present in all messages and are not relevant when distinguishing between spam and ham (non-spam) messages.

The train dataset is ready for model building. The Naive Bayes formula requires the following terms. Precalculating them once saves time later :-

P(spam) - probability of spam
P(ham) - probability of non-spam
N_spam - number of words in spam messages
N_ham - number of words in non-spam messages
N_vocab - number of words in vocabulary
Alpha - Laplace smoothing constant

Alpha is the Laplace smoothing constant. It is required because some words of the input text may appear only in spam messages and never in non-spam messages (or vice versa). For such a word, the unsmoothed estimate of P(word|non-spam) is 0. Since the inference is based on the formula :-

y_pred = argmax_y (P(y)P(w₁|y)P(w₂|y)…P(w_n|y))

a single 0 in the product gives the entire message a probability of 0 for that class. This is a shortcoming of the limited vocabulary and the limited number of texts in the dataset. To overcome this problem, the Laplace smoothing constant is added to every word count, which keeps the 0 from entering the equation.
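
The effect of the smoothing constant can be seen in a minimal sketch. The counts below are made up for the example, and the helper name word_probability is hypothetical; the formula is the additive smoothing estimate (count + alpha) / (total words + alpha * vocabulary size).

```python
def word_probability(word_count, total_words, vocab_size, alpha=1):
    """P(word | class) with additive (Laplace) smoothing."""
    return (word_count + alpha) / (total_words + alpha * vocab_size)

# Made-up totals: 5000 words in ham messages, 800 words in the vocabulary.
n_ham, n_vocab = 5000, 800

# A word that never occurs in ham messages (count 0).
unsmoothed = word_probability(0, n_ham, n_vocab, alpha=0)
smoothed = word_probability(0, n_ham, n_vocab, alpha=1)

print(unsmoothed)  # 0.0
print(smoothed)    # 1 / 5800, small but not zero
```

With alpha = 0 the unseen word would zero out the whole product for the ham class; with alpha = 1 it only lowers it.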

In [149]:

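alpha = 1  # Laplace smoothing constant
# Class priors: share of spam and ham messages in the train set
P_spam = train.Label.value_counts(normalize=True)['spam']
P_ham = train.Label.value_counts(normalize=True)['ham']
# Total number of words per class; `messages` holds the whitespace-split train
# messages, so stopwords are still included in these totals
N_spam = len(list(chain(*messages[train.Label == 'spam'])))
N_ham = len(list(chain(*messages[train.Label == 'ham'])))
N_vocab = len(vocabulary)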

The remaining parameters are

P(w_i|spam)
P(w_i|ham)

where w_i stands for each word in the vocabulary. These give the probability of a word appearing in a text given the message label. Like the terms calculated previously, they are static, so calculating them once avoids repeating the same calculations every time a message is classified.

Two separate dictionaries, for spam and ham (non-spam), are maintained, each mapping a word to its probability of appearing in the respective label.

In [150]:

P_word_given_spam = {word: 0 for word in vocabulary}
P_word_given_ham = {word: 0 for word in vocabulary}

Spam_messages = train[train.Label == 'spam']
Ham_messages = train[train.Label == 'ham']

for word in vocabulary:
    
    N_word_given_spam = Spam_messages[word].sum()
    N_word_given_ham = Ham_messages[word].sum()
    
    P_word_given_spam[word] = (
        (N_word_given_spam + alpha) / (N_spam + (alpha * N_vocab))
    )
    
    P_word_given_ham[word] = (
        (N_word_given_ham + alpha) / (N_ham + (alpha * N_vocab))
    )

All static components required to build the model are now calculated, which saves time later.
The next phase is to build the model. The model takes an input message and calculates the probabilities :-

P(spam|w_1...w_n)
P(ham|w_1...w_n)

Comparing the two, if the first is greater than the second, the model classifies the message as spam; if the second is greater, the message is classified as ham (non-spam); if they are equal, the model requires human help to classify the message.

In [151]:

def classify(message,verbose):
    """Label a message as 'spam', 'ham' or 'Not classified' (on a tie)."""
    
    # Clean the message the same way as the training messages
    message = re.sub(r'\W',' ',message)
    message = message.lower()
    message = message.split()
    
    spam = 1
    ham = 1
    
    # Multiply the per-word probabilities; words outside the vocabulary are skipped
    for word in message:
        if word in P_word_given_spam:
            spam *= P_word_given_spam[word]
        if word in P_word_given_ham:
            ham *= P_word_given_ham[word]
        
    # Unnormalized scores: P(class) times the product of P(word|class)
    P_spam_given_message = P_spam * spam
    P_ham_given_message = P_ham * ham
    
    if verbose:
        print("P(spam|message) = ",P_spam_given_message)
        print("P(ham|message) = ",P_ham_given_message)
    
    if P_spam_given_message > P_ham_given_message:
        if verbose:
            print('Label: spam')
        return 'spam'
    elif P_spam_given_message < P_ham_given_message:
        if verbose:
            print('Label: ham')    
        return 'ham'
    else:
        if verbose:
            print('Human assistance needed, equal probabilities')
        return 'Not classified'

As described above, the function accepts an input message and classifies it. When the verbose parameter is set, the function also prints the two computed probabilities and the resulting label, which is useful when debugging or when trying to understand how the function works.
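
One known weakness of multiplying many small probabilities is numerical underflow: for a very long message both products can round to 0.0 and the comparison ends in a tie. A common variant, sketched below, sums logarithms instead. This is not the method used in this notebook; the function name classify_log, its parameters and the toy probabilities are all assumptions made for the illustration.

```python
import math
import re

def classify_log(message, p_spam, p_ham, p_word_given_spam, p_word_given_ham):
    """Same decision rule as a product of probabilities, computed in log space."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped.
        if word in p_word_given_spam:
            log_spam += math.log(p_word_given_spam[word])
        if word in p_word_given_ham:
            log_ham += math.log(p_word_given_ham[word])

    if log_spam > log_ham:
        return 'spam'
    if log_spam < log_ham:
        return 'ham'
    return 'Not classified'

# Toy probabilities, made up for the example.
toy_spam = {'free': 0.05, 'prize': 0.03, 'home': 0.001}
toy_ham = {'free': 0.002, 'prize': 0.001, 'home': 0.02}

print(classify_log('FREE prize!!', 0.13, 0.87, toy_spam, toy_ham))  # spam
print(classify_log('at home', 0.13, 0.87, toy_spam, toy_ham))       # ham
```

Because the logarithm is monotonic, the comparison gives the same answer as the product whenever the product does not underflow. With the toy values above, a message repeating "free" 400 times would make both plain products evaluate to 0.0, while the log sums still separate the two classes.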

The function is tested on the following inputs, as a preliminary test :-

Spam - 'WINNER!! This is the secret code to unlock the money: C3421.'
Ham (non-spam) - "Sounds good, Tom, then see u there"

In [154]:

classify('WINNER!! This is the secret code to unlock the money: C3421.',verbose=1)

P(spam|message) =  6.814909426971065e-15
P(ham|message) =  1.772963190712378e-17
Label: spam

Out[154]:

'spam'

In [155]:

classify('Sounds good, Tom, then see u there',verbose=1)

P(spam|message) =  1.5820325455468539e-15
P(ham|message) =  2.7059850698247674e-13
Label: ham

Out[155]:

'ham'

The model looks good enough from the preliminary tests. As the final part of the project, the model is evaluated on the test set and inferences are drawn from the predictions.
The predictions are stored in a new column, predicted, to compare them with the actual Label column.

In [158]:

test['predicted'] = test.SMS.apply(classify,verbose=0) #set verbose=1 to print the probabilities and label for every message
test.head(5)

Out[158]:

	Label	SMS	predicted
0	ham	Later i guess. I needa do mcat study too.	ham
1	ham	But i haf enuff space got like 4 mb...	ham
2	spam	Had your mobile 10 mths? Update to latest Oran...	spam
3	ham	All sounds good. Fingers . Makes it difficult ...	ham
4	ham	All done, all handed in. Don't know if mega sh...	ham

In [159]:

test.predicted.value_counts(normalize=True)

Out[159]:

ham     0.857271
spam    0.142729
Name: predicted, dtype: float64

Out of the entire test sample, the model classified about 86% as ham and about 14% as spam. These proportions are close to the sample's class proportions, which is encouraging, but they say nothing about which individual messages were misclassified.

To get a clearer picture of the effectiveness of the model, accuracy is a good metric: it describes the proportion of labels classified correctly by the model. The accuracy is simply the number of rows where the Label and predicted columns match, divided by the total number of rows in the test sample.

In [160]:

accuracy = sum(test.Label == test.predicted)/len(test)
accuracy

Out[160]:

0.9802513464991023

The model achieves an accuracy of about 98% on the test set. This may seem too good to be true, but it is likely a result of the small dataset and the limited variety in its vocabulary. Note also that, with about 87% of the messages being ham, a classifier that always predicts ham would already reach about 87% accuracy.

The accuracy can be calculated on the train set as well.

In [162]:

train['predicted'] = train.SMS.apply(classify,verbose=0)
accuracy = sum(train.Label == train.predicted)/len(train)
accuracy

Out[162]:

0.9894571556751907

In [163]:

train.predicted.value_counts(normalize=True)

Out[163]:

ham     0.856662
spam    0.143338
Name: predicted, dtype: float64

Accuracies give a general idea about the model, but not the entire picture. A confusion matrix gives information about the true positives and true negatives (correct predictions) and the false positives and false negatives (incorrect predictions) of the model. From these, the important metrics to report are Precision, Recall and the F1 score.
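
How these metrics follow from the four confusion matrix counts can be sketched without any library. The labels below are made up for the example, spam is treated as the positive class, and the function name binary_metrics is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive='spam'):
    """Confusion-matrix counts and derived metrics for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of true positives that the model found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
            'precision': precision, 'recall': recall, 'f1': f1}

# Toy labels, made up for the example.
toy_true = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham']
toy_pred = ['spam', 'spam', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham']

result = binary_metrics(toy_true, toy_pred)
print(result['tp'], result['fp'], result['fn'], result['tn'])  # 2 1 1 4
```

In this toy case one spam message is missed and one ham message is flagged, so precision and recall for spam are both 2/3 even though the accuracy is 6/8.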

The sklearn.metrics module provides the inbuilt function classification_report, which builds the report from the true labels (first argument) and the predicted labels (second argument).

In [186]:

print(classification_report(test.Label,test.predicted))

              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       967
        spam       0.89      0.97      0.93       147

    accuracy                           0.98      1114
   macro avg       0.94      0.97      0.96      1114
weighted avg       0.98      0.98      0.98      1114

The model reaches about 99% accuracy on the train set and about 98% on the test set, with comparably good precision and recall.
Since the accuracy is not 100% for either set, certain messages were misclassified. It is good practice to review the misclassified data, as the errors could be due to erroneous data or, in this case, uncleaned text.

Since the test messages were cleaned inside the classify() function, the SMS column of the test set still holds the raw text. To review the misclassified messages, they have to be cleaned separately.

In [183]:

def clean(message):
    
    message = re.sub(r'\W',' ',message)
    message = message.lower()
    
    return message

test.SMS = test.SMS.apply(clean)
missclassified = test[test.Label != test.predicted]
missclassified

Out[183]:

	Label	SMS	predicted
9	ham	i liked the new mobile	spam
18	ham	and picking them up from various points	spam
114	spam	not heard from u4 a while call me now am here...	ham
130	ham	how was txting and driving	spam
131	ham	they have a thread on the wishlist section of ...	spam
152	ham	unlimited texts limited minutes	spam
159	ham	26th of july	spam
284	ham	nokia phone is lovly	spam
302	ham	no calls messages missed calls	spam
313	ham	speaking of does he have any cash yet	spam
319	ham	we have sent jd for customer service cum accou...	spam
363	spam	email alertfrom jeri stewartsize 2kbsubject ...	ham
398	ham	hasn t that been the pattern recently crap wee...	spam
492	ham	madam regret disturbance might receive a refer...	spam
660	ham	any chance you might have had with me evaporat...	spam
754	ham	i am getting threats from your sales executive...	spam
824	ham	these won t do have to move on to morphine	spam
876	spam	rct thnq adrian for u text rgds vatian	ham
885	spam	2 2 146tf150p	ham
923	ham	how much would it cost to hire a hitman	spam
953	spam	hello we need some posh birds and chaps to us...	ham
1073	ham	i career tel have added u as a contact on in...	spam

An easy way to visualize texts is a WordCloud. Below are WordClouds for both classes, generated only from the texts that were misclassified.

The first one is for the ham (non-spam) messages that were classified as spam.

In [177]:

text = ""

for message in missclassified[missclassified.Label == 'ham'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "
    
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)

plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')

Out[177]:

(-0.5, 1199.5, 799.5, -0.5)

The WordCloud shows the most frequent words in the ham (non-spam) messages that were misclassified. Words such as call, phone, connected, messages, executive and mobile are similar to words usually found in spam messages. In short, the model tended to misclassify ham (non-spam) messages that had a more formal tone or otherwise mimicked spam messages.

The next plot is for the misclassified spam messages.

In [179]:

text = ""

for message in missclassified[missclassified.Label == 'spam'].SMS:
    words = message.split()
    text = text + " ".join(words) + " "
    
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)

plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')

Out[179]:

(-0.5, 1199.5, 799.5, -0.5)

The plot shows a lot of words that are not part of the vocabulary, such as u4, luv and knickers. These messages are essentially spam, but the words that could have identified them as such are missing from the vocabulary and are therefore ignored. The remaining words, which are in the vocabulary, are more typical of ham, so the ham probability outweighed the spam probability and the messages were classified as ham (non-spam).

The misclassified messages expose the problem of this model. The model is oblivious to the context hidden in the messages. It considers each word individually and checks how strongly that word is associated with spam, whereas individual words often do not convey the whole story.

In conclusion, a working spam filter was created using the Naive Bayes algorithm. The model assumes that the words in a message are independent of one another given the label, and is therefore oblivious to the context of the message.
Despite this assumption, the model performed very well on this dataset, with the following accuracies (approx) :-

Training accuracy - 98.94%
Testing accuracy - 98.02%

The model achieves good precision and recall scores. However, it is important to note that the spam class scores lower than the ham class (F1 score of 0.93 against 0.99). This can partly be attributed to the imbalance in the classes of the target variable.

Overall the model has learnt the training data well and has performed well on the test set for the text messages in the dataset used.