So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.
The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository here. You can also download the dataset directly from this link. The data collection process is described in more detail on this page, where you can also find some of the authors' papers.
For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).
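Concretely, the multinomial Naive Bayes classifier compares two quantities for each message and picks the larger. With Laplace smoothing (the form used later in this project), the quantities are:

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
```

The same formulas hold with Ham in place of Spam; since we only compare the two scores, the shared denominator from Bayes' theorem can be dropped.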
# Import the libraries required for this project
import pandas as pd
from IPython.display import display, Markdown
messages_df = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
messages_df.columns=['Label', 'SMS']
print(messages_df.head(), '\n')
print(messages_df.info(), '\n')
print(messages_df['Label'].value_counts(), '\n')
print('The total number of messages is:', len(messages_df), '\n')
ham = messages_df[messages_df['Label'] == 'ham']
spam = messages_df[messages_df['Label'] == 'spam']
percent_ham = (100*len(ham))/len(messages_df)
percent_spam = 100 - percent_ham
print('The percentage of messages classified as ham is', "{:.2f}%".format(percent_ham), '\n')
print('The percentage of messages classified as spam is', "{:.2f}%".format(percent_spam))
  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
None

ham     4825
spam     747
Name: Label, dtype: int64

The total number of messages is: 5572
The percentage of messages classified as ham is 86.59%
The percentage of messages classified as spam is 13.41%
messages_df = messages_df.sample(frac=1, random_state=1)
print(messages_df.head())
cutoff = round(len(messages_df) * 0.8)
training_df = messages_df.iloc[:cutoff]
test_df = messages_df.iloc[cutoff:]
print(len(training_df), '\n')
print(len(test_df), '\n')
display(training_df.head())
display(test_df.head())
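As a quick sanity check, here is the same shuffle-then-slice 80/20 split on a tiny toy frame (the data below is invented purely for illustration):

```python
import pandas as pd

# Toy frame standing in for messages_df (invented data)
toy = pd.DataFrame({'Label': ['ham', 'spam'] * 5,
                    'SMS': ['message %d' % i for i in range(10)]})

# Shuffle, then slice: first 80% for training, the remaining 20% for testing
toy = toy.sample(frac=1, random_state=1)
cutoff = round(len(toy) * 0.8)
toy_train = toy.iloc[:cutoff]
toy_test = toy.iloc[cutoff:]
print(len(toy_train), len(toy_test))  # 8 2
```

Shuffling before slicing is what keeps the ham/spam proportions in both halves close to those of the full dataset.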
hamtr = training_df[training_df['Label'] == 'ham']
spamtr = training_df[training_df['Label'] == 'spam']
p_training_ham = (100*len(hamtr))/len(training_df)
p_training_spam = 100 - p_training_ham
display(Markdown('<h2>Training Data</h2>'))
print("{:.2f}%".format(p_training_ham), 'ham ', "{:.2f}%".format(p_training_spam), 'spam', '\n')
hamt = test_df[test_df['Label'] == 'ham']
spamt = test_df[test_df['Label'] == 'spam']
p_test_ham = (100*len(hamt))/len(test_df)
p_test_spam = 100 - p_test_ham
display(Markdown('<h2>Test Data</h2>'))
print("{:.2f}%".format(p_test_ham), 'ham ', "{:.2f}%".format(p_test_spam), 'spam', '\n')
training_df = training_df.reset_index()
display(training_df.head())
test_df = test_df.reset_index()
display(test_df.head())
     Label                                                SMS
1078   ham                       Yep, by the pretty sculpture
4028   ham      Yes, princess. Are you going to make me moan?
958    ham                         Welp apparently he retired
4642   ham                                            Havent.
4674   ham  I forgot 2 ask ü all smth.. There's a card on ...

4458

1114
| | Label | SMS |
|---|---|---|
| 1078 | ham | Yep, by the pretty sculpture |
| 4028 | ham | Yes, princess. Are you going to make me moan? |
| 958 | ham | Welp apparently he retired |
| 4642 | ham | Havent. |
| 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |
| | Label | SMS |
|---|---|---|
| 2131 | ham | Later i guess. I needa do mcat study too. |
| 3418 | ham | But i haf enuff space got like 4 mb... |
| 3424 | spam | Had your mobile 10 mths? Update to latest Oran... |
| 1538 | ham | All sounds good. Fingers . Makes it difficult ... |
| 5393 | ham | All done, all handed in. Don't know if mega sh... |
Training Data: 86.54% ham, 13.46% spam

Test Data: 86.80% ham, 13.20% spam
| | index | Label | SMS |
|---|---|---|---|
| 0 | 1078 | ham | Yep, by the pretty sculpture |
| 1 | 4028 | ham | Yes, princess. Are you going to make me moan? |
| 2 | 958 | ham | Welp apparently he retired |
| 3 | 4642 | ham | Havent. |
| 4 | 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |
| | index | Label | SMS |
|---|---|---|---|
| 0 | 2131 | ham | Later i guess. I needa do mcat study too. |
| 1 | 3418 | ham | But i haf enuff space got like 4 mb... |
| 2 | 3424 | spam | Had your mobile 10 mths? Update to latest Oran... |
| 3 | 1538 | ham | All sounds good. Fingers . Makes it difficult ... |
| 4 | 5393 | ham | All done, all handed in. Don't know if mega sh... |
import string
def remove_punctuations(text):
for punctuation in string.punctuation:
text = text.replace(punctuation, '')
return text
# Apply the punctuation-removal function to the SMS column of both DataFrames
training_df['SMS_Cleaned'] = training_df['SMS'].apply(remove_punctuations)
test_df['SMS_Cleaned'] = test_df['SMS'].apply(remove_punctuations)
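The replace loop above scans the string once per punctuation character. For reference, a one-pass alternative (a sketch using only the standard library, not what the project code uses) is `str.translate` with a deletion table; both approaches leave letters, digits, and whitespace untouched:

```python
import string

def remove_punctuations_translate(text):
    # str.translate with a deletion table strips all punctuation in a single pass
    return text.translate(str.maketrans('', '', string.punctuation))

print(remove_punctuations_translate("Free entry in 2! Win: FA Cup final..."))
# Free entry in 2 Win FA Cup final
```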
display(Markdown('<h2>Training Data</h2>'))
display(training_df.head())
display(Markdown('<h2>Test Data</h2>'))
display(test_df.head())
display(Markdown('<h2>Training Data</h2>'))
training_df['SMS_Cleaned'] = training_df['SMS_Cleaned'].str.lower()
display(training_df.head())
test_df['SMS_Cleaned'] = test_df['SMS_Cleaned'].str.lower()
display(Markdown('<h2>Test Data</h2>'))
display(test_df.head())
| | index | Label | SMS | SMS_Cleaned |
|---|---|---|---|---|
| 0 | 1078 | ham | Yep, by the pretty sculpture | Yep by the pretty sculpture |
| 1 | 4028 | ham | Yes, princess. Are you going to make me moan? | Yes princess Are you going to make me moan |
| 2 | 958 | ham | Welp apparently he retired | Welp apparently he retired |
| 3 | 4642 | ham | Havent. | Havent |
| 4 | 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... | I forgot 2 ask ü all smth Theres a card on da ... |
| | index | Label | SMS | SMS_Cleaned |
|---|---|---|---|---|
| 0 | 2131 | ham | Later i guess. I needa do mcat study too. | Later i guess I needa do mcat study too |
| 1 | 3418 | ham | But i haf enuff space got like 4 mb... | But i haf enuff space got like 4 mb |
| 2 | 3424 | spam | Had your mobile 10 mths? Update to latest Oran... | Had your mobile 10 mths Update to latest Orang... |
| 3 | 1538 | ham | All sounds good. Fingers . Makes it difficult ... | All sounds good Fingers Makes it difficult to... |
| 4 | 5393 | ham | All done, all handed in. Don't know if mega sh... | All done all handed in Dont know if mega shop ... |
| | index | Label | SMS | SMS_Cleaned |
|---|---|---|---|---|
| 0 | 1078 | ham | Yep, by the pretty sculpture | yep by the pretty sculpture |
| 1 | 4028 | ham | Yes, princess. Are you going to make me moan? | yes princess are you going to make me moan |
| 2 | 958 | ham | Welp apparently he retired | welp apparently he retired |
| 3 | 4642 | ham | Havent. | havent |
| 4 | 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... | i forgot 2 ask ü all smth theres a card on da ... |
| | index | Label | SMS | SMS_Cleaned |
|---|---|---|---|---|
| 0 | 2131 | ham | Later i guess. I needa do mcat study too. | later i guess i needa do mcat study too |
| 1 | 3418 | ham | But i haf enuff space got like 4 mb... | but i haf enuff space got like 4 mb |
| 2 | 3424 | spam | Had your mobile 10 mths? Update to latest Oran... | had your mobile 10 mths update to latest orang... |
| 3 | 1538 | ham | All sounds good. Fingers . Makes it difficult ... | all sounds good fingers makes it difficult to... |
| 4 | 5393 | ham | All done, all handed in. Don't know if mega sh... | all done all handed in dont know if mega shop ... |
# Create an empty list to collect every word from every message, then de-duplicate
vocabulary = []
training_df['SMS_Cleaned'] = training_df['SMS_Cleaned'].str.split()
display(training_df.head())
for sms in training_df['SMS_Cleaned']:
for word in sms:
vocabulary.append(word)
vocabulary = set(vocabulary)
print(len(vocabulary))
vocabulary = list(vocabulary)
print(len(vocabulary))
| | index | Label | SMS | SMS_Cleaned |
|---|---|---|---|---|
| 0 | 1078 | ham | Yep, by the pretty sculpture | [yep, by, the, pretty, sculpture] |
| 1 | 4028 | ham | Yes, princess. Are you going to make me moan? | [yes, princess, are, you, going, to, make, me,... |
| 2 | 958 | ham | Welp apparently he retired | [welp, apparently, he, retired] |
| 3 | 4642 | ham | Havent. | [havent] |
| 4 | 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... | [i, forgot, 2, ask, ü, all, smth, theres, a, c... |
8515

8515
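Boiled down, the vocabulary-building step above is a set de-duplication. A minimal sketch on invented toy data:

```python
# Toy tokenized messages standing in for training_df['SMS_Cleaned'] after .str.split()
toy_messages = [['win', 'cash', 'now'], ['see', 'you', 'now']]

# A set keeps exactly one copy of each word; the list form allows indexing later
toy_vocabulary = list({word for sms in toy_messages for word in sms})
print(len(toy_vocabulary))  # 5 unique words ('now' appears in both messages)
```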
word_counts_per_sms = {unique_word: [0] * len(training_df['SMS_Cleaned']) for unique_word in vocabulary}
for index, sms in enumerate(training_df['SMS_Cleaned']):
for word in sms:
word_counts_per_sms[word][index] += 1
word_counts = pd.DataFrame(word_counts_per_sms)
print(len(word_counts))
pd.set_option("display.max_columns", 10)
display(word_counts.head())
4458
| | totes | rs | opener | jaykwon | father | ... | but | wishes | gigolo | hurried | httpwwwetlpcoukexpressoffer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
5 rows × 8515 columns
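On toy data, the dictionary-of-zero-lists construction used above looks like this (the `toy_*` names and data are illustrative only):

```python
import pandas as pd

# Toy tokenized messages standing in for training_df['SMS_Cleaned']
toy_messages = [['win', 'cash', 'now', 'now'], ['see', 'you', 'now']]
toy_vocabulary = sorted({w for sms in toy_messages for w in sms})

# One zero-filled list per vocabulary word, one slot per message
counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for i, sms in enumerate(toy_messages):
    for word in sms:
        counts[word][i] += 1

toy_counts = pd.DataFrame(counts)
print(toy_counts)
```

Each row of the resulting frame is a message, each column a vocabulary word, and each cell the number of times that word occurs in that message.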
training_df_clean = pd.concat([training_df, word_counts], axis=1)
training_df_clean.head()
| | index | Label | SMS | SMS_Cleaned | totes | ... | but | wishes | gigolo | hurried | httpwwwetlpcoukexpressoffer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1078 | ham | Yep, by the pretty sculpture | [yep, by, the, pretty, sculpture] | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 1 | 4028 | ham | Yes, princess. Are you going to make me moan? | [yes, princess, are, you, going, to, make, me,... | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 2 | 958 | ham | Welp apparently he retired | [welp, apparently, he, retired] | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 3 | 4642 | ham | Havent. | [havent] | 0 | ... | 0 | 0 | 0 | 0 | 0 |
| 4 | 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... | [i, forgot, 2, ask, ü, all, smth, theres, a, c... | 0 | ... | 0 | 0 | 0 | 0 | 0 |
5 rows × 8519 columns
# Isolating spam and ham messages first
spam_messages = training_df_clean[training_df_clean['Label'] == 'spam']
ham_messages = training_df_clean[training_df_clean['Label'] == 'ham']
# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_df_clean)
p_ham = len(ham_messages) / len(training_df_clean)
# N_Spam: total number of words across all spam messages
# ('SMS_Cleaned' holds the tokenized lists, so len() counts words, not characters)
n_words_per_spam_message = spam_messages['SMS_Cleaned'].apply(len)
n_spam = n_words_per_spam_message.sum()
# N_Ham: total number of words across all ham messages
n_words_per_ham_message = ham_messages['SMS_Cleaned'].apply(len)
n_ham = n_words_per_ham_message.sum()
# N_Vocabulary
n_vocabulary = len(vocabulary)
# Laplace smoothing
alpha = 1
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}
# Calculate parameters
for word in vocabulary:
n_word_given_spam = spam_messages[word].sum() # spam_messages already defined in a cell above
p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
parameters_spam[word] = p_word_given_spam
n_word_given_ham = ham_messages[word].sum() # ham_messages already defined in a cell above
p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
parameters_ham[word] = p_word_given_ham
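To see the Laplace-smoothed estimate in action, here is a hand-checkable example with made-up counts (all values below are invented for illustration):

```python
# Hand-checkable example of the Laplace-smoothed estimate (counts are invented)
alpha = 1
n_word_given_spam = 3  # times the word occurs across all spam messages
n_spam = 10            # total words across all spam messages
n_vocabulary = 7       # distinct words in the vocabulary

p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
print(p_word_given_spam)  # (3 + 1) / (10 + 1 * 7) = 4/17 ≈ 0.235
```

The smoothing guarantees every vocabulary word gets a nonzero probability, so a single unseen word can never zero out an entire message's score.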
import re
def classify(message):
'''
message: a string
'''
    message = re.sub(r'\W', ' ', message)
message = message.lower().split()
p_spam_given_message = p_spam
p_ham_given_message = p_ham
for word in message:
if word in parameters_spam:
p_spam_given_message *= parameters_spam[word]
if word in parameters_ham:
p_ham_given_message *= parameters_ham[word]
print('P(Spam|message):', p_spam_given_message)
print('P(Ham|message):', p_ham_given_message)
if p_ham_given_message > p_spam_given_message:
print('Label: Ham')
elif p_ham_given_message < p_spam_given_message:
print('Label: Spam')
else:
        print('Equal probabilities, have a human classify this!')
classify('You\'ve been chosen a winner. Click on the number below to access your money.')
P(Spam|message): 3.670601065218682e-50
P(Ham|message): 1.0688398108249164e-53
Label: Spam
classify('Hey Bill, let\'s celebrate your lottery win')
P(Spam|message): 3.560359302507492e-30
P(Ham|message): 8.160585293334815e-29
Label: Ham
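Note that each per-word probability is tiny, so multiplying hundreds of them can underflow to 0.0 for very long messages. A standard safeguard (not applied above, but easy to adopt) is to add log-probabilities instead; because the logarithm is monotonic, the spam-vs-ham comparison is unchanged:

```python
import math

# Hypothetical per-word probabilities and prior for one short message
p_words = [0.01, 0.003, 0.0005]
p_prior = 0.13

product_score = p_prior * math.prod(p_words)                        # can underflow
log_score = math.log(p_prior) + sum(math.log(p) for p in p_words)   # never underflows

print(math.isclose(math.exp(log_score), product_score))  # True
```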
def classify_test_df(message):
'''
message: a string
'''
    message = re.sub(r'\W', ' ', message)
message = message.lower().split()
p_spam_given_message = p_spam
p_ham_given_message = p_ham
for word in message:
if word in parameters_spam:
p_spam_given_message *= parameters_spam[word]
if word in parameters_ham:
p_ham_given_message *= parameters_ham[word]
if p_ham_given_message > p_spam_given_message:
return 'ham'
elif p_spam_given_message > p_ham_given_message:
return 'spam'
else:
return 'needs human classification'
pd.reset_option("display.max_columns")
test_df['predicted'] = test_df['SMS_Cleaned'].apply(classify_test_df)
test_df.head()
| | index | Label | SMS | SMS_Cleaned | predicted |
|---|---|---|---|---|---|
| 0 | 2131 | ham | Later i guess. I needa do mcat study too. | later i guess i needa do mcat study too | ham |
| 1 | 3418 | ham | But i haf enuff space got like 4 mb... | but i haf enuff space got like 4 mb | ham |
| 2 | 3424 | spam | Had your mobile 10 mths? Update to latest Oran... | had your mobile 10 mths update to latest orang... | spam |
| 3 | 1538 | ham | All sounds good. Fingers . Makes it difficult ... | all sounds good fingers makes it difficult to... | ham |
| 4 | 5393 | ham | All done, all handed in. Don't know if mega sh... | all done all handed in dont know if mega shop ... | ham |
correct = 0
total = test_df.shape[0]
for _, row in test_df.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)
Correct: 1094
Incorrect: 20
Accuracy: 0.9820466786355476
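The counting loop above can also be written as a single vectorized comparison; a sketch on a toy frame (the data below is invented):

```python
import pandas as pd

# Toy frame with the same two columns the counting loop compares
toy = pd.DataFrame({'Label':     ['ham', 'spam', 'ham', 'ham'],
                    'predicted': ['ham', 'spam', 'spam', 'ham']})

# Comparing the columns yields a boolean Series; its mean is the accuracy
toy_accuracy = (toy['Label'] == toy['predicted']).mean()
print(toy_accuracy)  # 3 of 4 correct -> 0.75
```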
The spam filter built here with multinomial Naive Bayes is very reliable: it classified 98.2% of the previously unseen test messages correctly, comfortably exceeding the 80% accuracy goal set at the start of the project.