Our first task is to "teach" the computer how to classify SMS messages (spam or non-spam). To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.
The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository.
The data collection process is described in more detail on this page, where you can also find some of the authors' papers.
For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%, i.e. we expect more than 80% of new messages to be classified correctly as spam or ham (non-spam). A second goal is to compare the performance of our manually written algorithm with scikit-learn's MultinomialNB classifier.
As a result, our algorithm identifies spam messages with 98.74% accuracy, which is just a bit lower than the result of scikit-learn's MultinomialNB classifier (99.01%).
import numpy as np
import pandas as pd
import re
SMS = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
SMS
Label | SMS | |
---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
... | ... | ... |
5567 | spam | This is the 2nd time we have tried 2 contact u... |
5568 | ham | Will ü b going to esplanade fr home? |
5569 | ham | Pity, * was in mood for that. So...any other s... |
5570 | ham | The guy did some bitching but I acted like i'd... |
5571 | ham | Rofl. Its true to its name |
5572 rows × 2 columns
SMS.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5572 entries, 0 to 5571 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Label 5572 non-null object 1 SMS 5572 non-null object dtypes: object(2) memory usage: 87.2+ KB
# How many spam and non-spam messages are in the dataset
SMS['Label'].value_counts()
ham 4825 spam 747 Name: Label, dtype: int64
# In percentages
(SMS['Label'].value_counts(normalize=True)*100).round(2)
ham 86.59 spam 13.41 Name: Label, dtype: float64
Once our spam filter is done, we'll need to test how well it classifies new messages. To do that, we first split our dataset into a training set and a test set.
We keep 80% of the dataset for training and 20% for testing (we want to train the algorithm on as much data as possible, but we also need enough data left over to test on).
# Randomize the dataset
data_randomized = SMS.sample(frac=1, random_state=1) # The whole sample is randomized
# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)
# Training/Test split
SMS_train = data_randomized[:training_test_index].reset_index(drop=True)
SMS_test = data_randomized[training_test_index:].reset_index(drop=True)
print(f"Training data size : {SMS_train.shape}")
print(f"Testing data size : {SMS_test.shape}")
Training data size : (4458, 2) Testing data size : (1114, 2)
(SMS_train['Label'].value_counts(normalize=True)*100).round(2)
ham 86.54 spam 13.46 Name: Label, dtype: float64
(SMS_test['Label'].value_counts(normalize=True)*100).round(2)
ham 86.8 spam 13.2 Name: Label, dtype: float64
The ham/spam ratio of the original dataset is preserved in both the training and the test sets.
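Here the proportions survive the shuffle-and-slice split well, but there is no built-in guarantee. If an exact guarantee is wanted, scikit-learn's `train_test_split` can stratify on the label column. A minimal sketch on a toy frame (the toy data and sizes below are illustrative, not the SMS dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the SMS data: 80% ham, 20% spam
df = pd.DataFrame({'Label': ['ham'] * 80 + ['spam'] * 20,
                   'SMS': ['msg'] * 100})

# stratify=df['Label'] keeps the ham/spam ratio identical in both parts
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=1, stratify=df['Label'])

print((train_df['Label'] == 'ham').mean())  # 0.8 exactly
print((test_df['Label'] == 'ham').mean())   # 0.8 exactly
```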
Our end goal with the data cleaning process is to transform the training set into a table where, apart from the "Label" column, every column represents a unique word in our vocabulary and each cell holds the frequency of that word in a given message. This format makes the calculations easier for our algorithm.
# Remove all the punctuation from SMS column and make all the strings lowercase
SMS_train['SMS'] = SMS_train['SMS'].str.replace(r'[^\w\s]', ' ', regex=True).str.lower()
SMS_train.head()
Label | SMS | |
---|---|---|
0 | ham | yep by the pretty sculpture |
1 | ham | yes princess are you going to make me moan |
2 | ham | welp apparently he retired |
3 | ham | havent |
4 | ham | i forgot 2 ask ü all smth there s a card on ... |
# Extract all the words into vocabulary
SMS_train['SMS'] = SMS_train['SMS'].str.split()
vocabulary = []
for sms in SMS_train['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary[:5]
['yep', 'by', 'the', 'pretty', 'sculpture']
# Delete the duplicates
vocabulary_final = list(set(vocabulary))
print(f"Vocabulary original length: {len(vocabulary)}")
print(f"Vocabulary w/o duplicates length: {len(vocabulary_final)}")
Vocabulary original length: 72427 Vocabulary w/o duplicates length: 7783
# Create a dictionary counting how many times each vocabulary word appears in each SMS
word_counts_per_sms = {unique_word: [0] * len(SMS_train['SMS']) for unique_word in vocabulary_final}
for index, sms in enumerate(SMS_train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
words = pd.DataFrame(word_counts_per_sms)
words.shape
(4458, 7783)
# Print just first 15 columns due to the huge size of the dataset
words.iloc[:,:15].head()
castor | smth | vibrator | copied | 02073162414 | wrkin | freak | stylish | rose | salesman | 08719839835 | neck | cold | stop2stop | junna | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
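The nested-loop construction above can be cross-checked with `collections.Counter`, which yields the same word-count matrix. A minimal sketch assuming an already tokenized list of messages (the toy messages below are illustrative):

```python
from collections import Counter
import pandas as pd

# Two toy tokenized messages standing in for SMS_train['SMS']
tokenized = [['free', 'prize', 'free'], ['hi', 'there']]
vocab = sorted({w for sms in tokenized for w in sms})

# One Counter per message; reindex over the full vocabulary,
# filling words absent from a message with 0
rows = [Counter(sms) for sms in tokenized]
counts = pd.DataFrame(rows, columns=vocab).fillna(0).astype(int)
print(counts)  # row 0: free=2, prize=1; row 1: hi=1, there=1
```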
# Create a final dataset for training
SMS_train_final = pd.concat([SMS_train['Label'], words], axis = 1)
SMS_train_final.iloc[:,:15].head(5)
Label | castor | smth | vibrator | copied | 02073162414 | wrkin | freak | stylish | rose | salesman | 08719839835 | neck | cold | stop2stop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ham | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | ham | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | ham | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | ham | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | ham | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
To classify new messages, the multinomial Naive Bayes algorithm needs the probability values of the two equations below:

$$P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)$$

$$P(Ham \mid w_1, \dots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)$$

Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, multinomial Naive Bayes classification uses these equations:

$$P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

Some of the terms in the four equations above have the same value for every new message. As a start, let's first calculate P(Spam) and P(Ham), along with N_Spam, N_Ham, and N_Vocabulary.
We'll also use Laplace smoothing and set α = 1.
alpha = 1
p_no_spam = SMS_train_final['Label'].value_counts(normalize=True)['ham']
p_spam = SMS_train_final['Label'].value_counts(normalize=True)['spam']
N_vocab = len(vocabulary_final)
# Calculate number of words for each SMS message
SMS_train_final['word_num'] = SMS_train_final.iloc[:,1:].sum(axis=1)
N_no_spam = SMS_train_final[SMS_train_final['Label'] == 'ham']['word_num'].sum()
N_spam = SMS_train_final[SMS_train_final['Label'] == 'spam']['word_num'].sum()
# Check
if SMS_train_final['word_num'].sum() - N_no_spam - N_spam == 0:
    print("The check is passed")
else:
    print("There is a mistake in the calculations. Check your code.")
The check is passed
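As a sanity check on the Laplace-smoothed formula, here is a toy example small enough to verify by hand (the corpus and counts below are made up for illustration, they are not from the SMS data):

```python
# Toy spam corpus: 2 messages flattened into 5 word tokens
spam_tokens = ['win', 'prize', 'win', 'now', 'call']
vocab_size = 7          # assume 7 unique words across the whole toy corpus
alpha = 1               # Laplace smoothing

n_spam = len(spam_tokens)                 # 5 tokens in spam messages
n_win = spam_tokens.count('win')          # 'win' occurs twice
p_win_given_spam = (n_win + alpha) / (n_spam + alpha * vocab_size)
print(p_win_given_spam)                   # (2 + 1) / (5 + 7) = 0.25

# An unseen word still gets a non-zero probability thanks to smoothing
p_unseen = (0 + alpha) / (n_spam + alpha * vocab_size)
print(p_unseen)                           # 1 / 12 ≈ 0.0833
```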
P(wi|Spam) and P(wi|Ham) will vary depending on the individual words. For instance, P("Hi"|Spam) will have a certain probability value, while P("cousin"|Spam) or P("prize"|Spam) will most likely have other values.
Although both P(wi|Spam) and P(wi|Ham) vary depending on the word, the probability for each individual word is constant for every new message.
# Isolate spam and not spam dataframes
no_spam_df = SMS_train_final[SMS_train_final['Label'] == 'ham']
spam_df = SMS_train_final[SMS_train_final['Label'] == 'spam']
# Initialize spam and not spam count dictionaries
no_spam_words = {word:0 for word in vocabulary_final}
spam_words = {word:0 for word in vocabulary_final}
# Calculate probabilities of each word|spam
for word in spam_words:
    N_word_given_spam = spam_df[word].sum()
    spam_words[word] = (N_word_given_spam + alpha) / (N_spam + alpha * N_vocab)
for word in no_spam_words:
    N_word_given_no_spam = no_spam_df[word].sum()
    no_spam_words[word] = (N_word_given_no_spam + alpha) / (N_no_spam + alpha * N_vocab)
Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:
def classify(message):
    message = re.sub(r'\W', ' ', message).lower().split()  # String cleaning
    p_spam_given_message = p_spam
    p_no_spam_given_message = p_no_spam
    for word in message:
        if word in spam_words:
            p_spam_given_message *= spam_words[word]
        if word in no_spam_words:
            p_no_spam_given_message *= no_spam_words[word]
    print('P(spam|message):', p_spam_given_message)
    print('P(no_spam|message):', p_no_spam_given_message)
    if p_spam_given_message > p_no_spam_given_message:
        return "This is spam"
    elif p_spam_given_message < p_no_spam_given_message:
        return "This is not spam"
    else:
        return "We need a human to classify that"
# Trial check
print(classify('WINNER!! This is the secret code to unlock the money: C3421.'))
print(classify("Sounds good, Tom, then see u there"))
P(spam|message): 1.3481290211300841e-25 P(no_spam|message): 1.9368049028589875e-27 This is spam P(spam|message): 2.4372375665888117e-25 P(no_spam|message): 3.687530435009238e-21 This is not spam
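The posterior values printed above are already on the order of 1e-25 for short messages; for long messages the repeated multiplication can underflow to 0.0. A common remedy, shown here as a sketch with made-up per-word probabilities (not the trained spam_words/no_spam_words dictionaries), is to compare sums of log-probabilities instead:

```python
import math

# Toy per-word likelihoods standing in for spam_words / no_spam_words
spam_probs = {'win': 0.05, 'prize': 0.04, 'hi': 0.01}
ham_probs = {'win': 0.005, 'prize': 0.002, 'hi': 0.05}
p_spam, p_ham = 0.13, 0.87

def classify_log(tokens):
    # Summing logs is monotone in the product, so the argmax is unchanged
    log_spam = math.log(p_spam) + sum(
        math.log(spam_probs[w]) for w in tokens if w in spam_probs)
    log_ham = math.log(p_ham) + sum(
        math.log(ham_probs[w]) for w in tokens if w in ham_probs)
    return 'spam' if log_spam > log_ham else 'ham'

print(classify_log(['win', 'prize']))  # spam
print(classify_log(['hi']))            # ham
```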
# Make a function applicable to the dataset
def classify_test_set(message):
    message = re.sub(r'\W', ' ', message).lower().split()  # String cleaning
    p_spam_given_message = p_spam
    p_no_spam_given_message = p_no_spam
    for word in message:
        if word in spam_words:
            p_spam_given_message *= spam_words[word]
        if word in no_spam_words:
            p_no_spam_given_message *= no_spam_words[word]
    if p_spam_given_message > p_no_spam_given_message:
        return "spam"
    elif p_spam_given_message < p_no_spam_given_message:
        return "ham"
    else:
        return "We need a human to classify that"
# Test the function
SMS_test['prediction'] = SMS_test['SMS'].apply(classify_test_set)
SMS_test.sample(10)
Label | SMS | prediction | |
---|---|---|---|
856 | ham | Ok. | ham |
721 | ham | Thanks and ! Or bomb and date as my phone wan... | ham |
280 | ham | Yes..he is really great..bhaji told kallis bes... | ham |
197 | ham | <#> ISH MINUTES WAS 5 MINUTES AGO. WTF. | ham |
188 | spam | 4mths half price Orange line rental & latest c... | spam |
428 | ham | Sorry i din lock my keypad. | ham |
867 | ham | Aiyo... Her lesson so early... I'm still sleep... | ham |
240 | ham | .Please charge my mobile when you get up in mo... | ham |
872 | ham | I thk 50 shd be ok he said plus minus 10.. Did... | ham |
500 | spam | Ur balance is now £600. Next question: Complet... | spam |
# Measure the accuracy
correct = 0
total = SMS_test.shape[0]
for _, row in SMS_test.iterrows():
    if row['Label'] == row['prediction']:
        correct += 1
print(f"Correctly predicted: {correct}")
print(f"Incorrectly predicted: {total - correct}")
print(f"The accuracy of the manually written Naive Bayes: {correct/total}")
Correctly predicted: 1100 Incorrectly predicted: 14 The accuracy of the manually written Naive Bayes: 0.9874326750448833
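Accuracy alone doesn't show whether the errors are spam messages slipping through or ham messages being blocked; a confusion matrix separates the two. A sketch with toy labels (`pd.crosstab` applies the same way to `SMS_test['Label']` and `SMS_test['prediction']`):

```python
import pandas as pd

# Toy actual labels and predictions standing in for the test-set columns
labels = pd.Series(['ham', 'ham', 'spam', 'spam', 'ham'])
preds = pd.Series(['ham', 'spam', 'spam', 'spam', 'ham'])

# Rows: actual class; columns: predicted class
cm = pd.crosstab(labels, preds, rownames=['actual'], colnames=['predicted'])
print(cm)
```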
The scikit-learn library includes a MultinomialNB module for multinomial Naive Bayes classification. Let's apply it and compare its accuracy with our hand-written algorithm. We'll perform the following operations:

- encode the labels ('ham' → 0, 'spam' → 1);
- split the dataset into training and test sets;
- vectorize the strings from the SMS column;
- train the model and evaluate its accuracy.

# Applying scikit learn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
SMS_sk = SMS.copy()
# Encode labels
SMS_sk['Label'] = SMS_sk['Label'].map({'ham': 0, 'spam': 1})
# Divide the dataset
train, test = train_test_split(SMS_sk, test_size=0.2, random_state=1)
# Prepare data
y_train = train['Label']
y_test = test['Label']
# Vectorize strings from 'SMS' column
cv = CountVectorizer(binary=False, max_df=0.95)
cv.fit(train['SMS']) # To get a dictionary
train_feature_set = cv.transform(train['SMS'])
test_feature_set = cv.transform(test['SMS'])
# Train the model
clf = MultinomialNB()
clf.fit(train_feature_set, y_train)
# Predict the result
pred = clf.predict(test_feature_set)
# Evaluate the accuracy
accuracy = accuracy_score(y_test, pred)
print(f"The accuracy of scikit-learn's Naive Bayes: {accuracy}")
The accuracy of scikit-learn's Naive Bayes: 0.9901345291479821
In this project, we built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter reached 98.74% accuracy on the test set, well above our initial goal of 80%. Moreover, its accuracy was only slightly lower than that of scikit-learn's MultinomialNB classifier (99.01%).