Here in this guided project, we are going to build a spam filter using the practical side of naive bayes algorithm. The dataset for this project is put together by Tiago A. Almeida and José María Gómez Hidalgo and is available fof download from here.
The computer does the following to classify messages as spam or not spam:
import pandas as pd
import matplotlib.pyplot as plt
import re
import operator
from wordcloud import WordCloud, STOPWORDS
# Open the SMSSpamCollection file using the read_csv() function from the pandas package.
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
# Explore the dataset a little.
print(f'Number of SMS messages: {sms.shape[0]:,}')
print(f'Number of missing values in the dataframe: {sms.isnull().sum().sum()}\n')
def pretty_print_table(df, substring):
print(f'Spam vs. ham {substring}, %')
spam_ham_pct = round(df['Label'].value_counts(normalize=True)*100, 0)
print(spam_ham_pct.to_markdown(tablefmt='pretty', headers=['Label', '%']))
pretty_print_table(df=sms, substring='(non-spam)')
Number of SMS messages: 5,572 Number of missing values in the dataframe: 0 Spam vs. ham (non-spam), % +-------+------+ | Label | % | +-------+------+ | ham | 87.0 | | spam | 13.0 | +-------+------+
By applying the pretty_print_table() function we are able to get the probabilities of the spam and ham in a tabulated format. Here it 87% ham and 13% spam.
# Start by randomizing the entire dataset by using the DataFrame.sample() method.
sms_randomized = sms.sample(frac=1, random_state=1)
# Split the randomized dataset into a training and a test set.
training_data = sms_randomized[:4458].reset_index(drop=True)
test_data = sms_randomized[4458:].reset_index(drop=True)
# Find the percentage of spam and ham in both the training and the test set.
pretty_print_table(df=training_data, substring=('in the training set'))
print('\n')
pretty_print_table(df=test_data, substring=('in the test set'))
Spam vs. ham in the training set, % +-------+------+ | Label | % | +-------+------+ | ham | 87.0 | | spam | 13.0 | +-------+------+ Spam vs. ham in the test set, % +-------+------+ | Label | % | +-------+------+ | ham | 87.0 | | spam | 13.0 | +-------+------+
We are drawing a random sample of the spam and ham messages and segregating the dataset in train and test data with 75% of train_data and 25% of test_data.
# Remove all the punctuation from the SMS column. You can use the regex '\W' to detect any character that is not from a-z, A-Z or 0-9.
training_data['SMS'] = training_data['SMS'].str.replace('\W', ' ').str.lower()
training_data.head()
# Create a vocabulary for the messages in the training set.
training_data['SMS'] = training_data['SMS'].str.split()
training_data.head()
Label | SMS | |
---|---|---|
0 | ham | [yep, by, the, pretty, sculpture] |
1 | ham | [yes, princess, are, you, going, to, make, me,... |
2 | ham | [welp, apparently, he, retired] |
3 | ham | [havent] |
4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... |
Using the series.str.replace() function in conjunction with the \W expression we are able to remove the punctuations in the training_data and subsequently we use the series.str.lower() function and series.str.split() function to convert to lowercase and split the sentences.
# Initiate an empty list named vocabulary
vocabulary = []
for sms in training_data['SMS']:
for word in sms:
vocabulary.append(word)
vocabulary = list(set(vocabulary))
print(f'Number of unique words in the vocabulary of the training set: {len(vocabulary):,}')
# Run the code you see above to get the word_counts_per_sms dictionary
word_counts_per_sms = {unique_word: [0] * len(training_data['SMS']) for unique_word in vocabulary}
for index, sms in enumerate(training_data['SMS']):
for word in sms:
word_counts_per_sms[word][index] += 1
# Transform word_counts_per_sms into a DataFrame using pd.DataFrame()
word_counts_per_sms = pd.DataFrame(word_counts_per_sms)
word_counts_per_sms.head(3)
# Concatenate the DataFrame we just built above with the DataFrame containing the training set
training_set_final = pd.concat([training_data, word_counts_per_sms], axis=1)
training_set_final.head(3)
Number of unique words in the vocabulary of the training set: 7,783
Label | SMS | made | 08081263000 | knock | eat | off | 08002986030 | misss | continue | ... | picsfree1 | kent | welcomes | magical | steyn | an | sounds | alert | breath | edition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 7785 columns
# Calculate P(Spam) and P(Ham).
training_spam = training_set_final[training_set_final['Label']=='spam']
training_ham = training_set_final[training_set_final['Label']=='ham']
# Creating a dictionary of words from all spam messages with their frequencies
spam_dict = {}
for sms in training_spam['SMS']:
for word in sms:
if word not in spam_dict:
spam_dict[word]=0
spam_dict[word]+=1
# Sorting the dictionary in descending order of word frequencies
sorted_spam_dict = dict(sorted(spam_dict.items(), key=operator.itemgetter(1), reverse=True))
To build a wordcloud we are:
selected = ['call', 'free', 'stop', 'mobile', 'text', 'claim', 'www',
'prize', 'send', 'cash', 'nokia', 'win', 'urgent', 'service',
'contact', 'com', 'msg', 'chat', 'guaranteed', 'customer',
'awarded', 'sms', 'ringtone', 'video', 'rate', 'latest',
'award', 'code', 'camera', 'chance', 'apply', 'valid', 'selected',
'offer', 'tones', 'collection', 'mob', 'network', 'attempt',
'bonus', 'delivery', 'weekly', 'club', 'http', 'help', 'dating',
'vouchers', 'poly', 'auction', 'ltd', 'pounds', 'special',
'services', 'games', 'await', 'double', 'unsubscribe', 'hot',
'price', 'sexy', 'camcorder', 'content', 'top', 'calls',
'account', 'private', 'winner', 'savamob', 'offers', 'pobox',
'gift', 'net', 'quiz', 'expires', 'freemsg', 'play', 'ipod',
'last', 'order', 'anytime', 'congratulations', 'caller', 'points',
'identifier', 'voucher', 'statement', 'operator', 'real',
'mobiles', 'important', 'join', 'rental', 'valued', 'congrats',
'final', 'enjoy', 'unlimited', 'tv', 'charged', 'sex']
# Extracting only the 100 most frequent spam words with their frequencies
filtered_sorted_spam_dict = {}
for word in selected:
if word in sorted_spam_dict:
filtered_sorted_spam_dict[word]=sorted_spam_dict[word]
print(f'The number of the most popular spam words selected: {len(filtered_sorted_spam_dict)}')
The number of the most popular spam words selected: 100
We select the first 100 words in the order of their frequencies and append them to the dictionary filtered_sorted_spam.
# Creating a word cloud
fig = plt.subplots(figsize=(12,10))
wordcloud = WordCloud(width=1000, height=700,
background_color='white',
random_state=1).generate_from_frequencies(filtered_sorted_spam_dict)
plt.title('The most frequent words in spam messages\n', fontsize=29)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
The 100 most frequent words of spam messages reveal the following patterns :
We make the classification using Naive Bayes algorithm based on the probabilities of two equations.
To calculate the probabilities of Spam and Ham above:
where:
# Calculate P(Spam) and P(Ham)
p_spam = training_set_final['Label'].value_counts()['spam']/len(training_set_final)
p_ham = training_set_final['Label'].value_counts()['ham']/len(training_set_final)
n_spam = 0
n_ham = 0
for i in range(len(training_set_final)):
row = list(training_set_final.iloc[i].values)
for j in range(2,len(row)):
if row[0]=='spam':
n_spam+=row[j]
else:
n_ham+=row[j]
n_vocabulary = len(vocabulary)
alpha = 1
print(f'p_spam: {p_spam:.2f}\n'
f'p_ham: {p_ham:.2f}\n'
f'n_spam: {n_spam:,}\n'
f'n_ham: {n_ham:,}\n'
f'n_vocabulary: {n_vocabulary:,}\n'
f'alpha: {alpha}')
p_spam: 0.13 p_ham: 0.87 n_spam: 15,190 n_ham: 57,237 n_vocabulary: 7,783 alpha: 1
For the 7,783 words in the vocabulary a total of 15,566 probabilities, we calculate the P(wi|Spam) and P(wi|Ham) for each word.
# Initialize two dictionaries, where each key-value pair is a unique word (from our vocabulary) represented as a string, and the value is 0
p_wi_spam = {}
p_wi_ham = {}
for word in vocabulary:
p_wi_spam[word] = (training_spam[word].sum()+alpha)/(n_spam+alpha*n_vocabulary)
p_wi_ham[word] = (training_ham[word].sum()+alpha)/(n_ham+alpha*n_vocabulary)
We apply the smoothing parameter alpha to this dictionary and append the respective probabilities to the same using a for loop.
Create a spam filter which:
Compares both values and:
# Copy the classify() function you see above and write the code needed for calculating
def classify(message):
message = re.sub('\W', ' ', message)
message = message.lower()
message = message.split()
p_spam_given_message = p_spam
p_ham_given_message = p_ham
for word in message:
if word in p_wi_spam:
p_spam_given_message*=p_wi_spam[word]
if word in p_wi_ham:
p_ham_given_message*=p_wi_ham[word]
if p_ham_given_message > p_spam_given_message:
return 'ham'
elif p_ham_given_message < p_spam_given_message:
return 'spam'
else:
return 'Equal proabilities, have a human classify this!'
# Use the classify() function to classify two new messages.
print(classify('WINNER!! This is the secret code to unlock the money: C3421.'))
print(classify('Sounds good, Tom, then see u there'))
spam ham
Testing the spam filter on the test set of 1114 messages.
test_data['Predicted'] = test_data['SMS'].apply(classify)
test_data.head()
# Measure the accuracy of the spam filter.
correct = 0
total = len(test_data)
for row in test_data.iterrows():
if row[1]['Predicted']==row[1]['Label']:
correct+=1
accuracy = correct/total*100
print(f'The accuracy of the spam filter: {accuracy:.2f}%')
The accuracy of the spam filter: 98.74%
We have successfully used our spam filter to get more than the expected 80% accuracy using the multinomial naive bayes algorithm.
Here is the list of the wrongly classified messages.
false_spam = test_data[(test_data['Predicted']=='spam')&(test_data['Label']=='ham')].reset_index(drop=True)
false_ham = test_data[(test_data['Predicted']=='ham')&(test_data['Label']=='spam')].reset_index(drop=True)
unclear = test_data[test_data['Predicted']=='needs human classification'].reset_index(drop=True)
print('Total number of wrongly classified messages: ', len(false_spam)+len(false_ham)+len(unclear))
print('_________________________________________________________________________\n')
print('FALSE SPAM MESSAGES:')
for row in false_spam.iterrows():
print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________\n')
print('FALSE HAM MESSAGES:')
for row in false_ham.iterrows():
print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________\n')
print('UNCLEAR MESSAGES:')
for row in unclear.iterrows():
print(f'{row[0]+1}. ', row[1]['SMS'])
print('_________________________________________________________________________')
Total number of wrongly classified messages: 13 _________________________________________________________________________ FALSE SPAM MESSAGES: 1. Unlimited texts. Limited minutes. 2. 26th OF JULY 3. Nokia phone is lovly.. 4. No calls..messages..missed calls 5. We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us _________________________________________________________________________ FALSE HAM MESSAGES: 1. Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net 2. More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB 3. Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50 4. Hi babe its Chloe, how r u? I was smashed on saturday night, it was great! How was your weekend? U been missing me? SP visionsms.com Text stop to stop 150p/text 5. 0A$NETWORKS allow companies to bill for SMS, so they are responsible for their "suppliers", just as a shop has to give a guarantee on what they sell. B. G. 6. RCT' THNQ Adrian for U text. Rgds Vatian 7. 2/2 146tf150p 8. Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r _________________________________________________________________________ UNCLEAR MESSAGES: _________________________________________________________________________
From the total number of 13 wrongly classified messages, we can see that.
The slgorithm however had worked perfectly with other messages giving it an overal accuracy of 98.74%
By making the following adjustments we can try tuning the spam filter accuracy the following ways.
training_set_exp = sms_randomized[:4458].reset_index(drop=True)
test_set_exp = sms_randomized[4458:].reset_index(drop=True)
training_set_exp['SMS'] = training_set_exp['SMS'].str.replace('\W', ' ')
vocabulary_exp = []
for sms in training_set_exp['SMS']:
for word in sms:
vocabulary_exp.append(word)
vocabulary_exp = list(set(vocabulary_exp))
word_counts_per_sms_exp = {unique_word: [0] * len(training_set_exp['SMS']) for unique_word in vocabulary_exp}
for index, sms in enumerate(training_set_exp['SMS']):
for word in sms:
word_counts_per_sms_exp[word][index]+=1
word_counts_exp = pd.DataFrame(word_counts_per_sms_exp)
training_set_final_exp = pd.concat([training_set_exp, word_counts_exp], axis=1)
spam_sms_exp = training_set_final_exp[training_set_final_exp['Label']=='spam']
ham_sms_exp = training_set_final_exp[training_set_final_exp['Label']=='ham']
p_spam_exp = training_set_final_exp['Label'].value_counts()['spam']/len(training_set_final_exp)
p_ham_exp = training_set_final_exp['Label'].value_counts()['ham']/len(training_set_final_exp)
n_spam_exp = 0
n_ham_exp = 0
for i in range(len(training_set_final_exp)):
row = list(training_set_final_exp.iloc[i].values)
for j in range(2,len(row)):
if row[0]=='spam':
n_spam_exp+=row[j]
else:
n_ham_exp+=row[j]
n_vocabulary_exp = len(vocabulary_exp)
alpha = 1
p_wi_spam_exp = {}
p_wi_ham_exp = {}
for word in vocabulary_exp:
p_wi_spam_exp[word] = (spam_sms_exp[word].sum()+alpha)/(n_spam_exp+alpha*n_vocabulary_exp)
p_wi_ham_exp[word] = (ham_sms_exp[word].sum()+alpha)/(n_ham_exp+alpha*n_vocabulary_exp)
def classify_exp(message):
message = re.sub('\W', ' ', message)
message = message.lower()
message = message.split()
p_spam_given_message_exp = p_spam_exp
p_ham_given_message_exp = p_ham_exp
for word in message:
if word in p_wi_spam_exp:
p_spam_given_message_exp*=p_wi_spam_exp[word]
if word in p_wi_ham_exp:
p_ham_given_message_exp*=p_wi_ham_exp[word]
if p_ham_given_message_exp > p_spam_given_message_exp:
return 'ham'
elif p_ham_given_message_exp < p_spam_given_message_exp:
return 'spam'
else:
return 'Equal proabilities, have a human classify this!'
test_set_exp['Predicted'] = test_set_exp['SMS'].apply(classify_exp)
correct_exp = 0
total_exp = len(test_set_exp)
for row in test_set_exp.iterrows():
if row[1]['Predicted']==row[1]['Label']:
correct_exp+=1
accuracy_exp = correct_exp/total_exp*100
print(f'The accuracy of the spam filter: {accuracy_exp:.2f}%')
The accuracy of the spam filter: 85.10%
Through this project, we have managed to achieve in creating a spam filter using multinomial naive bayes algorithm.
Apart from that we also have achieved learning:
Also, the effieiency achieved using Naive Bayes algorithm in this project is much better than the expected efficiency.