Our first task is to "teach" the computer how to classify SMS messages (spam or non-spam). To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.
The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository.
The data collection process is described in more detail on this page, where you can also find some of the authors' papers.
For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%, i.e. we expect more than 80% of new messages to be classified correctly as spam or ham (non-spam). A second goal is to compare the performance of our manually written algorithm with scikit-learn's MultinomialNB classifier.
As a result, our algorithm identifies spam messages with 98.74% accuracy, which is just a bit lower than the result of scikit-learn's MultinomialNB classifier (99.01%).
import numpy as np
import pandas as pd
import re
SMS = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
SMS
Label | SMS | |
---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
... | ... | ... |
5567 | spam | This is the 2nd time we have tried 2 contact u... |
5568 | ham | Will ü b going to esplanade fr home? |
5569 | ham | Pity, * was in mood for that. So...any other s... |
5570 | ham | The guy did some bitching but I acted like i'd... |
5571 | ham | Rofl. Its true to its name |
5572 rows × 2 columns
SMS.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5572 entries, 0 to 5571 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Label 5572 non-null object 1 SMS 5572 non-null object dtypes: object(2) memory usage: 87.2+ KB
# How many spam and non-spam messages are in the dataset
SMS['Label'].value_counts()
ham 4825 spam 747 Name: Label, dtype: int64
# In percentages
(SMS['Label'].value_counts(normalize=True)*100).round(2)
ham 86.59 spam 13.41 Name: Label, dtype: float64
Once our spam filter is done, we'll need to test how well it classifies new messages. To do that, we first split our dataset into a training set and a test set.
We keep 80% of the dataset for training and 20% for testing (we want to train the algorithm on as much data as possible, but we also need enough data left over to test on).
# Randomize the dataset
data_randomized = SMS.sample(frac=1, random_state=1) # The whole sample is randomized
# Calculate index for split
training_test_index = round(len(data_randomized) * 0.8)
# Training/Test split
SMS_train = data_randomized[:training_test_index].reset_index(drop=True)
SMS_test = data_randomized[training_test_index:].reset_index(drop=True)
print(f"Training data size : {SMS_train.shape}")
print(f"Testing data size : {SMS_test.shape}")
Training data size : (4458, 2) Testing data size : (1114, 2)
(SMS_train['Label'].value_counts(normalize=True)*100).round(2)
ham 86.54 spam 13.46 Name: Label, dtype: float64
(SMS_test['Label'].value_counts(normalize=True)*100).round(2)
ham 86.8 spam 13.2 Name: Label, dtype: float64
The ham/spam ratio of the original dataset is preserved in both the training and the test sets.
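Here the proportions survive the shuffle-and-slice split well, but there is no built-in guarantee. If an exact guarantee is wanted, scikit-learn's `train_test_split` can stratify on the label column. A minimal sketch on a toy frame (the toy data and sizes below are illustrative, not the SMS dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the SMS data: 80% ham, 20% spam
df = pd.DataFrame({'Label': ['ham'] * 80 + ['spam'] * 20,
                   'SMS': ['msg'] * 100})

# stratify=df['Label'] keeps the ham/spam ratio identical in both parts
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=1, stratify=df['Label'])

print((train_df['Label'] == 'ham').mean())  # 0.8 exactly
print((test_df['Label'] == 'ham').mean())   # 0.8 exactly
```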
Our end goal with the data cleaning process is to transform the training set into a table where, apart from the "Label" column, every column represents a unique word in our vocabulary and each cell holds the frequency of that word in a given message. This format makes the calculations easier for our algorithm.
# Remove all the punctuation from SMS column and make all the strings lowercase
SMS_train['SMS'] = SMS_train['SMS'].str.replace(r'[^\w\s]', ' ', regex=True).str.lower()
SMS_train.head()
Label | SMS | |
---|---|---|
0 | ham | yep by the pretty sculpture |
1 | ham | yes princess are you going to make me moan |
2 | ham | welp apparently he retired |
3 | ham | havent |
4 | ham | i forgot 2 ask ü all smth there s a card on ... |
# Extract all the words into vocabulary
SMS_train['SMS'] = SMS_train['SMS'].str.split()
vocabulary = []
for sms in SMS_train['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary[:5]
['yep', 'by', 'the', 'pretty', 'sculpture']
# Delete the duplicates
vocabulary_final = list(set(vocabulary))
print(f"Vocabulary original length: {len(vocabulary)}")
print(f"Vocabulary w/o duplicates length: {len(vocabulary_final)}")
Vocabulary original length: 72427 Vocabulary w/o duplicates length: 7783
# Create a dictionary counting how many times each vocabulary word appears in each SMS
word_counts_per_sms = {unique_word: [0] * len(SMS_train['SMS']) for unique_word in vocabulary_final}
for index, sms in enumerate(SMS_train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
words = pd.DataFrame(word_counts_per_sms)
words.shape
(4458, 7783)
# Print just first 15 columns due to the huge size of the dataset
words.iloc[:,:15].head()
castor | smth | vibrator | copied | 02073162414 | wrkin | freak | stylish | rose | salesman | 08719839835 | neck | cold | stop2stop | junna | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
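The nested-loop construction above can be cross-checked with `collections.Counter`, which yields the same word-count matrix. A minimal sketch assuming an already tokenized list of messages (the toy messages below are illustrative):

```python
from collections import Counter
import pandas as pd

# Two toy tokenized messages standing in for SMS_train['SMS']
tokenized = [['free', 'prize', 'free'], ['hi', 'there']]
vocab = sorted({w for sms in tokenized for w in sms})

# One Counter per message; reindex over the full vocabulary,
# filling words absent from a message with 0
rows = [Counter(sms) for sms in tokenized]
counts = pd.DataFrame(rows, columns=vocab).fillna(0).astype(int)
print(counts)  # row 0: free=2, prize=1; row 1: hi=1, there=1
```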
# Create a final dataset for training
SMS_train_final = pd.concat([SMS_train['Label'], words], axis = 1)
SMS_train_final.iloc[:,:15].head(5)
Label | castor | smth | vibrator | copied | 02073162414 | wrkin | freak | stylish | rose | salesman | 08719839835 | neck | cold | stop2stop | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ham | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | ham | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | ham | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | ham | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | ham | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
To classify new messages, the multinomial Naive Bayes algorithm needs the probability values of the two equations below:

$$P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)$$

$$P(Ham \mid w_1, \dots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)$$

Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, multinomial Naive Bayes classification uses these equations:

$$P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

Some of the terms in the four equations above have the same value for every new message. As a start, let's first calculate P(Spam) and P(Ham), along with N_Spam, N_Ham, and N_Vocabulary.
We'll also use Laplace smoothing and set α = 1.
alpha = 1
p_no_spam = SMS_train_final['Label'].value_counts(normalize=True)['ham']
p_spam = SMS_train_final['Label'].value_counts(normalize=True)['spam']
N_vocab = len(vocabulary_final)
# Calculate number of words for each SMS message
SMS_train_final['word_num'] = SMS_train_final.iloc[:,1:].sum(axis=1)
N_no_spam = SMS_train_final[SMS_train_final['Label'] == 'ham']['word_num'].sum()
N_spam = SMS_train_final[SMS_train_final['Label'] == 'spam']['word_num'].sum()
# Check
if SMS_train_final['word_num'].sum() - N_no_spam - N_spam == 0:
    print("The check is passed")
else:
    print("There is a mistake in the calculations. Check your code.")
The check is passed
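As a sanity check on the Laplace-smoothed formula, here is a toy example small enough to verify by hand (the corpus and counts below are made up for illustration, they are not from the SMS data):

```python
# Toy spam corpus: 2 messages flattened into 5 word tokens
spam_tokens = ['win', 'prize', 'win', 'now', 'call']
vocab_size = 7          # assume 7 unique words across the whole toy corpus
alpha = 1               # Laplace smoothing

n_spam = len(spam_tokens)                 # 5 tokens in spam messages
n_win = spam_tokens.count('win')          # 'win' occurs twice
p_win_given_spam = (n_win + alpha) / (n_spam + alpha * vocab_size)
print(p_win_given_spam)                   # (2 + 1) / (5 + 7) = 0.25

# An unseen word still gets a non-zero probability thanks to smoothing
p_unseen = (0 + alpha) / (n_spam + alpha * vocab_size)
print(p_unseen)                           # 1 / 12 ≈ 0.0833
```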
P(wi|Spam) and P(wi|Ham) will vary depending on the individual words. For instance, P("Hi"|Spam) will have a certain probability value, while P("cousin"|Spam) or P("prize"|Spam) will most likely have other values.
Although both P(wi|Spam) and P(wi|Ham) vary depending on the word, the probability for each individual word is constant for every new message.
# Isolate spam and not spam dataframes
no_spam_df = SMS_train_final[SMS_train_final['Label'] == 'ham']
spam_df = SMS_train_final[SMS_train_final['Label'] == 'spam']
# Initialize spam and not spam count dictionaries
no_spam_words = {word:0 for word in vocabulary_final}
spam_words = {word:0 for word in vocabulary_final}
# Calculate probabilities of each word|spam
for word in spam_words:
    N_word_given_spam = spam_df[word].sum()
    spam_words[word] = (N_word_given_spam + alpha) / (N_spam + alpha * N_vocab)
for word in no_spam_words:
    N_word_given_no_spam = no_spam_df[word].sum()
    no_spam_words[word] = (N_word_given_no_spam + alpha) / (N_no_spam + alpha * N_vocab)
Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:
def classify(message):
    message = re.sub(r'\W', ' ', message).lower().split()  # String cleaning
    p_spam_given_message = p_spam
    p_no_spam_given_message = p_no_spam
    for word in message:
        if word in spam_words:
            p_spam_given_message *= spam_words[word]
        if word in no_spam_words:
            p_no_spam_given_message *= no_spam_words[word]
    print('P(spam|message):', p_spam_given_message)
    print('P(no_spam|message):', p_no_spam_given_message)
    if p_spam_given_message > p_no_spam_given_message:
        return "This is spam"
    elif p_spam_given_message < p_no_spam_given_message:
        return "This is not spam"
    else:
        return "We need a human to classify that"
# Trial check
print(classify('WINNER!! This is the secret code to unlock the money: C3421.'))
print(classify("Sounds good, Tom, then see u there"))
P(spam|message): 1.3481290211300841e-25 P(no_spam|message): 1.9368049028589875e-27 This is spam P(spam|message): 2.4372375665888117e-25 P(no_spam|message): 3.687530435009238e-21 This is not spam
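The posterior values printed above are already on the order of 1e-25 for short messages; for long messages the repeated multiplication can underflow to 0.0. A common remedy, shown here as a sketch with made-up per-word probabilities (not the trained spam_words/no_spam_words dictionaries), is to compare sums of log-probabilities instead:

```python
import math

# Toy per-word likelihoods standing in for spam_words / no_spam_words
spam_probs = {'win': 0.05, 'prize': 0.04, 'hi': 0.01}
ham_probs = {'win': 0.005, 'prize': 0.002, 'hi': 0.05}
p_spam, p_ham = 0.13, 0.87

def classify_log(tokens):
    # Summing logs is monotone in the product, so the argmax is unchanged
    log_spam = math.log(p_spam) + sum(
        math.log(spam_probs[w]) for w in tokens if w in spam_probs)
    log_ham = math.log(p_ham) + sum(
        math.log(ham_probs[w]) for w in tokens if w in ham_probs)
    return 'spam' if log_spam > log_ham else 'ham'

print(classify_log(['win', 'prize']))  # spam
print(classify_log(['hi']))            # ham
```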
# Make a function applicable to the dataset
def classify_test_set(message):
    message = re.sub(r'\W', ' ', message).lower().split()  # String cleaning
    p_spam_given_message = p_spam
    p_no_spam_given_message = p_no_spam
    for word in message:
        if word in spam_words:
            p_spam_given_message *= spam_words[word]
        if word in no_spam_words:
            p_no_spam_given_message *= no_spam_words[word]
    if p_spam_given_message > p_no_spam_given_message:
        return "spam"
    elif p_spam_given_message < p_no_spam_given_message:
        return "ham"
    else:
        return "We need a human to classify that"
# Test the function
SMS_test['prediction'] = SMS_test['SMS'].apply(classify_test_set)
SMS_test.sample(10)
Label | SMS | prediction | |
---|---|---|---|
856 | ham | Ok. | ham |
721 | ham | Thanks and ! Or bomb and date as my phone wan... | ham |
280 | ham | Yes..he is really great..bhaji told kallis bes... | ham |
197 | ham | <#> ISH MINUTES WAS 5 MINUTES AGO. WTF. | ham |
188 | spam | 4mths half price Orange line rental & latest c... | spam |
428 | ham | Sorry i din lock my keypad. | ham |
867 | ham | Aiyo... Her lesson so early... I'm still sleep... | ham |
240 | ham | .Please charge my mobile when you get up in mo... | ham |
872 | ham | I thk 50 shd be ok he said plus minus 10.. Did... | ham |
500 | spam | Ur balance is now £600. Next question: Complet... | spam |
# Measure the accuracy
correct = 0
total = SMS_test.shape[0]
for _, row in SMS_test.iterrows():
    if row['Label'] == row['prediction']:
        correct += 1
print(f"Correctly predicted: {correct}")
print(f"Incorrectly predicted: {total - correct}")
print(f"The accuracy of the manually written Naive Bayes: {correct/total}")
Correctly predicted: 1100 Incorrectly predicted: 14 The accuracy of the manually written Naive Bayes: 0.9874326750448833
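Accuracy alone doesn't show whether the errors are spam messages slipping through or ham messages being blocked; a confusion matrix separates the two. A sketch with toy labels (`pd.crosstab` applies the same way to `SMS_test['Label']` and `SMS_test['prediction']`):

```python
import pandas as pd

# Toy actual labels and predictions standing in for the test-set columns
labels = pd.Series(['ham', 'ham', 'spam', 'spam', 'ham'])
preds = pd.Series(['ham', 'spam', 'spam', 'spam', 'ham'])

# Rows: actual class; columns: predicted class
cm = pd.crosstab(labels, preds, rownames=['actual'], colnames=['predicted'])
print(cm)
```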
The scikit-learn library includes a MultinomialNB module for multinomial Naive Bayes classification. Let's apply it and compare its accuracy with our hand-written algorithm. We'll perform the following operations:

- encode the labels ('ham' → 0, 'spam' → 1);
- split the dataset into training and test sets;
- vectorize the strings from the SMS column;
- train the model and evaluate its accuracy.

# Applying scikit learn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
SMS_sk = SMS.copy()
# Encode labels
SMS_sk['Label'] = SMS_sk['Label'].map({'ham': 0, 'spam': 1})
# Divide the dataset
train, test = train_test_split(SMS_sk, test_size=0.2, random_state=1)
# Prepare data
y_train = train['Label']
y_test = test['Label']
# Vectorize strings from 'SMS' column
cv = CountVectorizer(binary=False, max_df=0.95)
cv.fit(train['SMS']) # To get a dictionary
train_feature_set = cv.transform(train['SMS'])
test_feature_set = cv.transform(test['SMS'])
# Train the model
clf = MultinomialNB()
clf.fit(train_feature_set, y_train)
# Predict the result
pred = clf.predict(test_feature_set)
# Evaluate the accuracy
accuracy = accuracy_score(y_test, pred)
print(f"The accuracy of scikit-learn's Naive Bayes: {accuracy}")
The accuracy of scikit-learn's Naive Bayes: 0.9901345291479821
In this project, we built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter reached 98.74% accuracy on the test set, well above our initial goal of 80%. Moreover, its accuracy was only slightly lower than that of scikit-learn's MultinomialNB classifier (99.01%).