In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes theorem with strong (but naive) independence assumptions between the features. In probability theory and statistics, Bayes' theorem (alternatively Bayes law or Bayes rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
The aim of this project is to build a spam filter using Naive Bayes algorithm from scratch. Leveraging the Bayes theorem the model is built to predict the probability that a given text is spam or non-spam based on the training data exposed to the model.
The dataset consists of text messages and a target variable specifying whether the text message was a spam or not. The dataset has been picked up from the UCI Machine Learning Repository
The columns in the dataset are not named, hence the names given - Label and SMS.
-> Label - target variable spam and ham (non-spam)
-> SMS - the message
It is required to explore the dataset before building the model, so as to remove any discrepancies in the text.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from itertools import chain
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from wordcloud import WordCloud,STOPWORDS,ImageColorGenerator
from PIL import Image
df = pd.read_csv('SMSSpamCollection',sep='\t',header=None)
df.columns = ['Label','SMS']
print(df.shape)
df.head(5)
(5572, 2)
Label | SMS | |
---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
Whenever building a classifier, it is important to identify the proportion of the targets and know before hand whether the classifier is going to be exposed to imbalanced data or not. In this case, the proportion of spam to proportion of ham (non-spam) is important for the classifier. A countplot
perfectly captures this statistic.
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
plt.style.use('fivethirtyeight')
sns.countplot(df.Label)
plt.ylabel('')
plt.yticks([])
df.Label.value_counts(normalize=True)
ham 0.865937 spam 0.134063 Name: Label, dtype: float64
The dataset is highly imbalanced i.e. the target variable doesnot have equal proportions of its classes. There are multiple ways to solve this problem. Most commonly, the dataset can be Under Sampled i.e. collect random samples of the majority class equal to minority class or the dataset can be Over Sampled i.e. using interpolation, the minority class points are increased in number to compete with the majority class. For this project, none of the two techiniques are used since the classifier is a probabilistic model, probability is relative and hence will make up for the imbalance.
It is very important to test the model to gain validation and feedback on the model. The dataset here is split into 80:20. The train data will be 80% and test data 20% of the entire population. All randomly selected. The sampling is done using the sample
function of pandas
. An easier method would be to use the inbuilt train_test_split()
function of the sklearn.model_selection
module.
randomized = df.sample(frac=1,random_state=1)
train_size = round(len(randomized) * 0.8)
train = randomized[:train_size].reset_index(drop=True)
test = randomized[train_size:].reset_index(drop=True)
print(train.shape)
print(test.shape)
(4458, 2) (1114, 2)
A sample of a population has to be a representative of the population, otherwise the results obtained can be faulty or skewed. Thus it becomes very important to check for this criterion before moving forward with the project.
train.Label.value_counts(normalize=True)
ham 0.86541 spam 0.13459 Name: Label, dtype: float64
test.Label.value_counts(normalize=True)
ham 0.868043 spam 0.131957 Name: Label, dtype: float64
Comparing the splits, the proportion of classes for the two splits as well as the entire population (found above) are very near to eachother. This is a good enough measure to prove that the split samples are good representatives of the population for this dataset.
The general idea for this classifier is that, for every word in the input the probability is calculated for that word appearing in either spam messages or non-spam messages. These probabilities are then compared. If the probabilities cumulatively indicate that having this word or a collection of words is strongly associated with spam (non-spam) messages, then that input is classified as spam (non-spam).
The above logic pushes for cleaning the text to only recover the terms that are useful and remove stopwords, punctuations and anything extra in the texts.
train.SMS.head(10)
0 Yep, by the pretty sculpture 1 Yes, princess. Are you going to make me moan? 2 Welp apparently he retired 3 Havent. 4 I forgot 2 ask ü all smth.. There's a card on ... 5 Ok i thk i got it. Then u wan me 2 come now or... 6 I want kfc its Tuesday. Only buy 2 meals ONLY ... 7 No dear i was sleeping :-P 8 Ok pa. Nothing problem:-) 9 Ill be there on <#> ok. Name: SMS, dtype: object
train.SMS = train['SMS'].str.replace('\W',' ')
train.SMS = train['SMS'].str.lower()
train.SMS.head(10)
0 yep by the pretty sculpture 1 yes princess are you going to make me moan 2 welp apparently he retired 3 havent 4 i forgot 2 ask ü all smth there s a card on ... 5 ok i thk i got it then u wan me 2 come now or... 6 i want kfc its tuesday only buy 2 meals only ... 7 no dear i was sleeping p 8 ok pa nothing problem 9 ill be there on lt gt ok Name: SMS, dtype: object
The normalized messages are used to create the Term Document Matrix
. This is a mapping of every word to its message and the number of times the word occurs in the message. Think of it as a matrix with vocabulary of all the texts as row indices and every text id as column indices. A value 'n' (> 0) at the index -> (row,col) means the word 'col' appear n times in the text 'row'.
The first step to building the Term Document Matrix
is to create the vocabulary of the texts. The vocabulary is a set of all unique words in all the texts.
messages = train.SMS.str.split()
words = list(chain(*messages))
vocabulary = pd.Series(words).unique()
len(vocabulary)
7783
In the train set, there are 7783 unique words for our vocabulary. These have been retrieved from the messages.
Using the vocabulary, the Term Document Matrix
is derived. For the ease of usage, a dictionary representing the same concept is made.
no_of_messages = len(train.SMS)
word_counts = {word: [0] * no_of_messages for word in vocabulary}
for index,mssge in enumerate(messages):
for word in mssge:
word_counts[word][index] += 1
The above method made a dictionary with every term in the vocabulary as key and for each term a vector identifying in which messages the term appears in and how many times, analogous to the Term Document Matrix
The method afore-mentioned is a way of doing this from scratch, that doesnt take into account the stop words. This can also be achieved via the CountVectorizer
function in the sklearn.feature_extraction.text
module. Here the removal of stopwords is also considered.
token = RegexpTokenizer(r'[a-z0-9A-Z]+')
vectorizer = CountVectorizer(
lowercase=True,
stop_words='english',
ngram_range=(1,1),
tokenizer = token.tokenize
)
counts = vectorizer.fit_transform(train['SMS'])
print(len(vectorizer.get_feature_names()))
vocabulary = vectorizer.get_feature_names()
word_counts = dict(zip(vectorizer.get_feature_names(),np.transpose(counts.toarray())))
7508
The vocabulary contains 7508 terms after removing the stopwords as well. Since the vocabulary is quite long, it makes it impossible to draw inferences from. One solution is to create a WordCloud
to get a gist of the data. The WordCloud
s will be different for the spam
messages and the ham
(non-spam) messages. This would give a sense of the kind of words present in both the type of messages.
The WordCloud
is present in the wordcloud package, downloadable -
pip install wordcloud
conda install wordcloud
The WordCloud
function creates an image object, which is viewed using the imshow()
function of matplotlib
.
The width, height and other parameters are set in the wordcloud object. There is a provision for removing stopwords as well in the object. The stopwords paramter has to be supplied with the STOPWORDS object from the nltk.corpus
.
The first WordCloud
is of the text messages, labeled as spam.
text = ""
for message in train[train.Label == 'spam'].SMS:
words = message.split()
text = text + " ".join(words) + " "
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)
plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')
(-0.5, 1199.5, 799.5, -0.5)
From the WordCloud
the words like - free, call, now, txt, tone, reply, mobile, text are frequently appearing words in the spam messages of the training set. This is very intuitive as infact these sort of words do appear in spam messages recieved generally.
The next word cloud is for the ham (non-spam) label.
text = ""
for message in train[train.Label == 'ham'].SMS:
words = message.split()
text = text + " ".join(words) + " "
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)
plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')
(-0.5, 1199.5, 799.5, -0.5)
There are distinct differences in the variety and kind of words appearing for both the classes/labels.
The words appearing in ham(non-spam) messages are generally common words as the WordCloud
shows - love, come, know, got, home, need, dear etc. The ones in spam messages convey state very direct actions/objects like the words mentioned.
For model purposes, the dictionary is converted to a DataFrame and appended to the initial dataset. This way the representation for each message becomes easy. Every row is an individual text and every column after the SMS and Label columns represents a word from the vocabulary.
vector_space = pd.DataFrame(word_counts)
train = pd.concat([train,vector_space],axis=1)
train.head(3)
Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zealand | zebra | zed | zeros | zhong | zindgi | zoe | zogtorius | zouk | zyada | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ham | yep by the pretty sculpture | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | ham | yes princess are you going to make me moan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | ham | welp apparently he retired | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 7510 columns
Before the model building starts, the stopwords have to be removed from the SMS columns permanently. The stopwords are usually present in all messages and are not relevant when distinguishing between spam and ham (non-spam) messages.
The train dataset is ready for model building. The Naive Bayes algorithm's formula requires the following terms. Precalculating them once saves time during model training, :-
The Alpha is the Laplace smoothing constant. This is required because there can be certain words of the input text that only appear in spam messages and not in non-spam messages. When calculating - P(word|non-spam), it gives a 0. Since the inference is based on the given formula :-
ypred = argmaxy (P(y)P(w1|y)P(w2|y)…P(wn|y))
One 0 in the equation and the entire message is given a 0 probability. This actually is a shortcoming of the limited vocabulary and limited texts in the dataset. To overcome this problem, the Laplace smoothing constant is applied to avoid the 0 entering the equation.
alpha = 1
P_spam = train.Label.value_counts(normalize=True)['spam']
P_ham = train.Label.value_counts(normalize=True)['ham']
N_spam = len(list(chain(*messages[train.Label == 'spam'])))
N_ham = len(list(chain(*messages[train.Label == 'ham'])))
N_vocab = len(vocabulary)
The paramters
where wi stands for each word in the vocabulary. These formulas calculate the probability of a word appearing in the text given the message label. These calculations are static like the ones done previously. Hence calculating them once will avoid meaningless calculations during training phaze.
Two separate dictionaries, for spam and ham (non-spam) are maintained, mapping word to its probability of appearing in the respective label.
P_word_given_spam = {word: 0 for word in vocabulary}
P_word_given_ham = {word: 0 for word in vocabulary}
Spam_messages = train[train.Label == 'spam']
Ham_messages = train[train.Label == 'ham']
for word in vocabulary:
N_word_given_spam = Spam_messages[word].sum()
N_word_given_ham = Ham_messages[word].sum()
P_word_given_spam[word] = (
(N_word_given_spam + alpha) / (N_spam + (alpha * N_vocab))
)
P_word_given_ham[word] = (
(N_word_given_ham + alpha) / (N_ham + (alpha * N_vocab))
)
All static components required to build the model are calculated, this saves time during training.
Next phaze is to build the model. The model will take an input message, calculate the probabilites :-
Comparing the two, if the first is greater than the latter, the models classifies the message as spam
, if they are equal then the model requires human help to classify them better, else the message is classified as ham
(non-spam).
def classify(message,verbose):
message = re.sub('\W',' ',message)
message = message.lower()
message = message.split()
spam = 1
ham = 1
for word in message:
if word in P_word_given_spam.keys():
spam *= P_word_given_spam[word]
if word in P_word_given_ham.keys():
ham *= P_word_given_ham[word]
P_spam_given_message = P_spam * spam
P_ham_given_message = P_ham * ham
if verbose:
print("P(spam|message) = ",P_spam_given_message)
print("P(ham|message) = ",P_ham_given_message)
if P_spam_given_message > P_ham_given_message:
if verbose:
print('Label: spam')
return 'spam'
elif P_spam_given_message < P_ham_given_message:
if verbose:
print('Label: ham')
return 'ham'
else:
if verbose:
print('Human assistance needed, equal probabilities')
return 'Not classified'
The function as specified before accepts an input message and classifies it. The verbose parameter is to get printed output at every step of the function. It is useful when debugging or understanding the working.
The function is tested on the following inputs, as a part of preliminary test :-
Spam - 'WINNER!! This is the secret code to unlock the money: C3421.'
Ham (non-spam) - "Sounds good, Tom, then see u there"
classify('WINNER!! This is the secret code to unlock the money: C3421.',verbose=1)
P(spam|message) = 6.814909426971065e-15 P(ham|message) = 1.772963190712378e-17 Label: spam
'spam'
classify('Sounds good, Tom, then see u there',verbose=1)
P(spam|message) = 1.5820325455468539e-15 P(ham|message) = 2.7059850698247674e-13 Label: ham
'ham'
The model looks good enough from the preliminary tests. As the final part of the project, the model has to be tested on the test set and inference from the predictions have to be made.
The predictions are stored in a new column predicted, to compare it with the actual Label column.
test['predicted'] = test.SMS.apply(classify,verbose=0) #try putting verbose >1 and see the output of the model
test.head(5)
Label | SMS | predicted | |
---|---|---|---|
0 | ham | Later i guess. I needa do mcat study too. | ham |
1 | ham | But i haf enuff space got like 4 mb... | ham |
2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
4 | ham | All done, all handed in. Don't know if mega sh... | ham |
test.predicted.value_counts(normalize=True)
ham 0.857271 spam 0.142729 Name: predicted, dtype: float64
Out of the entire test sample, the model classified about 85% as ham and about 14% as spam. The model seems to have done pretty well as these proportions are analogous to the sample's proportion of classes. But this doesnt speak for missclassified labels.
To be clearer with respect to the effectiveness of the model, accuracy is a good metric that describes the proportion of correctly classified labels by the model. The accuracy is simply comparision between the Label
and the predicted
columns divided by the total units in the test sample.
accuracy = sum(test.Label == test.predicted)/len(test)
accuracy
0.9802513464991023
The model has achieved a phenomenal accuracy of about 98% on the test set. It obviously seems to good to be true, but this is a result of less data points for training, limited variance in the vocabulary of the population.
The accuracy can be calculated on the train set as well.
train['predicted'] = train.SMS.apply(classify,verbose=0)
accuracy = sum(train.Label == train.predicted)/len(train)
accuracy
0.9894571556751907
train.predicted.value_counts(normalize=True)
ham 0.856662 spam 0.143338 Name: predicted, dtype: float64
Accuracies give a general idea about the model, but not the entire picture. A confusion matrix gives information about the True positives and the False negatives (correct predictions) and the True negative and False positive (incorrect predictions) of the model. With these, important metrics to report are the Precision, Recall and the F1 score.
The sklearn.metrics
module provides the inbuilt function classification_report
which is generated based on the true labels and the predicted labels.
print(classification_report(test.predicted,test.Label))
precision recall f1-score support ham 0.98 0.99 0.99 955 spam 0.97 0.89 0.93 159 accuracy 0.98 1114 macro avg 0.97 0.94 0.96 1114 weighted avg 0.98 0.98 0.98 1114
The model manages to get a 99% accuracy on the train set. And an equally good precision and recall.
Since it is not a 100% for either test or train set, certain messages were missclassified. Its good practice to review the missclassified data, as it could be due to erroneous data or uncleaned text in this case.
Since the test messages were cleaned inside the classify()
function. To review the missclassified messages, they will have to be cleaned externally.
def clean(message):
message = re.sub('\W',' ',message)
message = message.lower()
return message
test.SMS = test.SMS.apply(clean)
missclassified = test[test.Label != test.predicted]
missclassified
Label | SMS | predicted | |
---|---|---|---|
9 | ham | i liked the new mobile | spam |
18 | ham | and picking them up from various points | spam |
114 | spam | not heard from u4 a while call me now am here... | ham |
130 | ham | how was txting and driving | spam |
131 | ham | they have a thread on the wishlist section of ... | spam |
152 | ham | unlimited texts limited minutes | spam |
159 | ham | 26th of july | spam |
284 | ham | nokia phone is lovly | spam |
302 | ham | no calls messages missed calls | spam |
313 | ham | speaking of does he have any cash yet | spam |
319 | ham | we have sent jd for customer service cum accou... | spam |
363 | spam | email alertfrom jeri stewartsize 2kbsubject ... | ham |
398 | ham | hasn t that been the pattern recently crap wee... | spam |
492 | ham | madam regret disturbance might receive a refer... | spam |
660 | ham | any chance you might have had with me evaporat... | spam |
754 | ham | i am getting threats from your sales executive... | spam |
824 | ham | these won t do have to move on to morphine | spam |
876 | spam | rct thnq adrian for u text rgds vatian | ham |
885 | spam | 2 2 146tf150p | ham |
923 | ham | how much would it cost to hire a hitman | spam |
953 | spam | hello we need some posh birds and chaps to us... | ham |
1073 | ham | i career tel have added u as a contact on in... | spam |
The easier method to visualize texts is a WordCloud
. Below are generated WordCloud
s for both the classes, only for the texts that were missclassified.
The first one is for the ham (non-spam) messages that were classified as spam.
text = ""
for message in missclassified[missclassified.Label == 'ham'].SMS:
words = message.split()
text = text + " ".join(words) + " "
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)
plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')
(-0.5, 1199.5, 799.5, -0.5)
The WordCloud
shows the most frequent words for the ham (non-spam) missclassified messages. The words - call, phone, connected, messages, executive, mobile etc are similar to words found usually in spam messages. In short, the model missclassified all those ham (non-spam) messages that had a more formal tone to it or sort of mimicked the spam messages.
The next plot is for the missclassified spam label.
text = ""
for message in missclassified[missclassified.Label == 'spam'].SMS:
words = message.split()
text = text + " ".join(words) + " "
# stopwords = set(STOPWORDS)
wordcloud = WordCloud(width=1200,height=800,stopwords=stopwords.words('english'),background_color='white').generate(text)
plt.figure(figsize=(12,8))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis('off')
(-0.5, 1199.5, 799.5, -0.5)
The plot shows alot of words that are not part of the vocabulary. Words such as - u4, luv, knickers etc. Due to these words not appearing in the vocabulary, the ham words outweighed the spam words, meaning - The message is essentially a spam, since the words that could actually contribute to classify this as one are not present in the vocabulary, the words present in the message and were a part of the vocabulary classified this as a ham (non-spam) message.
The missclassified messages expose the problem of this model. The model is oblivious to context hidden in the messages. It only considers the word individually and checks if that word can be spam or not. Whereas it is noticed that the words individually do not convey the whole story.
In conclusion to the project, a successful model for spam filter was created using the Naive Bayes Algorithm. The model assumes independence between the messages and the words in the messages, oblivious to the context of the message.
Just based on this assumption, the model performed very well for this dataset with the following accuracies (approx) :-
The model achieves a good precision and recall scores. However it is important to note that spam messages have a lesser recall. This can be accounted to the fact that there is an imbalance in the classes of the target variable.
Overall the model has learnt the training data well and has performed well on the test set for the text messages in the dataset used.