(C) 2017-2024 by Damir Cavar
Download: This and various other Jupyter notebooks are available from my GitHub repo.
Version: 1.5, January 2024
This is a tutorial related to the discussion of a Bayesian classifier in the textbook Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach.
This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the Computational Linguistics Program of the Department of Linguistics at Indiana University.
Prerequisites:
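!pip install -U nltk
The tokenizer used below also needs NLTK's Punkt data, which can be fetched once with nltk.download (the resource is named punkt or punkt_tab, depending on the NLTK version).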
Assume that we have a set of e-mails that are annotated as spam or ham, as described in the textbook.
There are $4$ e-mails labeled ham and $1$ e-mail labeled spam, that is, we have a total of $5$ texts in our corpus.
If we randomly picked an e-mail from the collection, the probability of picking a spam e-mail would be $1 / 5$.
Spam e-mails might differ from ham e-mails only in some words. Here is a sample e-mail constructed with typical keywords:
spam = [ """Our medicine cures baldness. No diagnostics needed.
We guarantee Fast Viagra delivery.
We can provide Human growth hormone. The cheapest Life
Insurance with us. You can Lose weight with this treatment.
Our Medicine now and No medical exams necessary.
Our Online pharmacy is the best. This cream Removes
wrinkles and Reverses aging.
One treatment and you will Stop snoring. We sell Valium
and Viagra.
Our Vicodin will help with Weight loss. Cheap Xanax.""" ]
The data structure above is a list that contains only one string. Triple double quotes delimit a multi-line string. We can output the size of the variable spam this way:
print(len(spam))
1
We can create a list of ham mails in a similar way:
ham = [ """Hi Hans, hope to see you soon at our family party.
When will you arrive.
All the best to the family.
Sue""",
"""Dear Ata,
did you receive my last email related to the car insurance
offer? I would be happy to discuss the details with you.
Please give me a call, if you have any questions.
John Smith
Super Car Insurance""",
"""Hi everyone:
This is just a gentle reminder of today's first 2017 SLS
Colloquium, from 2.30 to 4.00 pm, in Ballantine 103.
Rodica Frimu will present a job talk entitled "What is
so tricky in subject-verb agreement?". The text of the
abstract is below.
If you would like to present something during the Spring,
please let me know.
The current online schedule with updated title
information and abstracts is available under:
http://www.iub.edu/~psyling/SLSColloquium/Spring2017.html
See you soon,
Peter""",
"""Dear Friends,
As our first event of 2017, the Polish Studies Center
presents an evening with artist and filmmaker Wojtek Sawa.
Please join us on JANUARY 26, 2017 from 5:30 p.m. to
7:30 p.m. in the Global and International Studies
Building room 1100 for a presentation by Wojtek Sawa
on his interactive installation art piece The Wall
Speaks–Voices of the Unheard. A reception will follow
the event where you will have a chance to meet the artist
and discuss his work.
Best,"""]
The ham-mail list contains $4$ e-mails:
print(len(ham))
4
We can access a particular e-mail via index from either spam or ham:
print(spam[0])
Our medicine cures baldness. No diagnostics needed. We guarantee Fast Viagra delivery. We can provide Human growth hormone. The cheapest Life Insurance with us. You can Lose weight with this treatment. Our Medicine now and No medical exams necessary. Our Online pharmacy is the best. This cream Removes wrinkles and Reverses aging. One treatment and you will Stop snoring. We sell Valium and Viagra. Our Vicodin will help with Weight loss. Cheap Xanax.
print(ham[3])
Dear Friends, As our first event of 2017, the Polish Studies Center presents an evening with artist and filmmaker Wojtek Sawa. Please join us on JANUARY 26, 2017 from 5:30 p.m. to 7:30 p.m. in the Global and International Studies Building room 1100 for a presentation by Wojtek Sawa on his interactive installation art piece The Wall Speaks–Voices of the Unheard. A reception will follow the event where you will have a chance to meet the artist and discuss his work. Best,
We can lower-case the email using the string lower function:
print(ham[3].lower())
dear friends, as our first event of 2017, the polish studies center presents an evening with artist and filmmaker wojtek sawa. please join us on january 26, 2017 from 5:30 p.m. to 7:30 p.m. in the global and international studies building room 1100 for a presentation by wojtek sawa on his interactive installation art piece the wall speaks–voices of the unheard. a reception will follow the event where you will have a chance to meet the artist and discuss his work. best,
We can loop over all e-mails in spam or ham and lower-case the content:
for text in ham:
    print(text.lower())
hi hans, hope to see you soon at our family party. when will you arrive. all the best to the family. sue dear ata, did you receive my last email related to the car insurance offer? i would be happy to discuss the details with you. please give me a call, if you have any questions. john smith super car insurance hi everyone: this is just a gentle reminder of today's first 2017 sls colloquium, from 2.30 to 4.00 pm, in ballantine 103. rodica frimu will present a job talk entitled "what is so tricky in subject-verb agreement?". the text of the abstract is below. if you would like to present something during the spring, please let me know. the current online schedule with updated title information and abstracts is available under: http://www.iub.edu/~psyling/slscolloquium/spring2017.html see you soon, peter dear friends, as our first event of 2017, the polish studies center presents an evening with artist and filmmaker wojtek sawa. please join us on january 26, 2017 from 5:30 p.m. to 7:30 p.m. in the global and international studies building room 1100 for a presentation by wojtek sawa on his interactive installation art piece the wall speaks–voices of the unheard. a reception will follow the event where you will have a chance to meet the artist and discuss his work. best,
We can use the tokenizer from NLTK to tokenize the lower-cased text into single tokens (words and punctuation marks):
from nltk import word_tokenize
print(word_tokenize(ham[0].lower()))
['hi', 'hans', ',', 'hope', 'to', 'see', 'you', 'soon', 'at', 'our', 'family', 'party', '.', 'when', 'will', 'you', 'arrive', '.', 'all', 'the', 'best', 'to', 'the', 'family', '.', 'sue']
We can count the number of tokens and types in the lower-cased text:
from collections import Counter
myCounts = Counter(word_tokenize("This is a test. Will this test teach us how to count tokens?".lower()))
print(myCounts)
print("number of types:", len(myCounts))
print("number of tokens:", sum(myCounts.values()))
Counter({'this': 2, 'test': 2, 'is': 1, 'a': 1, '.': 1, 'will': 1, 'teach': 1, 'us': 1, 'how': 1, 'to': 1, 'count': 1, 'tokens': 1, '?': 1})
number of types: 13
number of tokens: 15
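If NLTK is not at hand, the same type and token counts can be sketched with a regular expression. The function simple_tokenize below is a hypothetical stand-in, not NLTK's word_tokenize, and it will split some inputs differently (for example abbreviations such as p.m.):

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize:
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

counts = Counter(simple_tokenize("This is a test. Will this test teach us how to count tokens?"))
n_types = len(counts)            # distinct tokens
n_tokens = sum(counts.values())  # all tokens
print("number of types:", n_types)
print("number of tokens:", n_tokens)
```

For this sentence the sketch yields the same counts as the NLTK tokenizer, 13 types and 15 tokens.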
Now we can create a frequency profile of ham and spam words given the two text collections:
hamFP = Counter()
spamFP = Counter()
for text in spam:
    spamFP.update(word_tokenize(text.lower()))
for text in ham:
    hamFP.update(word_tokenize(text.lower()))
print("Ham:\n", hamFP)
print("-" * 30)
print("Spam:\n", spamFP)
Ham: Counter({'the': 14, ',': 11, '.': 11, 'to': 8, 'you': 8, 'a': 6, 'will': 4, 'is': 4, 'of': 4, 'and': 4, 'with': 3, 'please': 3, ':': 3, '2017': 3, 'in': 3, 'hi': 2, 'see': 2, 'soon': 2, 'our': 2, 'family': 2, 'best': 2, 'dear': 2, 'car': 2, 'insurance': 2, '?': 2, 'would': 2, 'discuss': 2, 'me': 2, 'if': 2, 'have': 2, 'first': 2, 'from': 2, 'present': 2, 'event': 2, 'studies': 2, 'artist': 2, 'wojtek': 2, 'sawa': 2, 'on': 2, 'p.m.': 2, 'his': 2, 'hans': 1, 'hope': 1, 'at': 1, 'party': 1, 'when': 1, 'arrive': 1, 'all': 1, 'sue': 1, 'ata': 1, 'did': 1, 'receive': 1, 'my': 1, 'last': 1, 'email': 1, 'related': 1, 'offer': 1, 'i': 1, 'be': 1, 'happy': 1, 'details': 1, 'give': 1, 'call': 1, 'any': 1, 'questions': 1, 'john': 1, 'smith': 1, 'super': 1, 'everyone': 1, 'this': 1, 'just': 1, 'gentle': 1, 'reminder': 1, 'today': 1, "'s": 1, 'sls': 1, 'colloquium': 1, '2.30': 1, '4.00': 1, 'pm': 1, 'ballantine': 1, '103.': 1, 'rodica': 1, 'frimu': 1, 'job': 1, 'talk': 1, 'entitled': 1, '``': 1, 'what': 1, 'so': 1, 'tricky': 1, 'subject-verb': 1, 'agreement': 1, "''": 1, 'text': 1, 'abstract': 1, 'below': 1, 'like': 1, 'something': 1, 'during': 1, 'spring': 1, 'let': 1, 'know': 1, 'current': 1, 'online': 1, 'schedule': 1, 'updated': 1, 'title': 1, 'information': 1, 'abstracts': 1, 'available': 1, 'under': 1, 'http': 1, '//www.iub.edu/~psyling/slscolloquium/spring2017.html': 1, 'peter': 1, 'friends': 1, 'as': 1, 'polish': 1, 'center': 1, 'presents': 1, 'an': 1, 'evening': 1, 'filmmaker': 1, 'join': 1, 'us': 1, 'january': 1, '26': 1, '5:30': 1, '7:30': 1, 'global': 1, 'international': 1, 'building': 1, 'room': 1, '1100': 1, 'for': 1, 'presentation': 1, 'by': 1, 'interactive': 1, 'installation': 1, 'art': 1, 'piece': 1, 'wall': 1, 'speaks–voices': 1, 'unheard': 1, 'reception': 1, 'follow': 1, 'where': 1, 'chance': 1, 'meet': 1, 'work': 1}) ------------------------------ Spam: Counter({'.': 13, 'our': 4, 'and': 4, 'we': 3, 'with': 3, 'medicine': 2, 'no': 2, 'viagra': 2, 'can': 
2, 'the': 2, 'you': 2, 'weight': 2, 'this': 2, 'treatment': 2, 'will': 2, 'cures': 1, 'baldness': 1, 'diagnostics': 1, 'needed': 1, 'guarantee': 1, 'fast': 1, 'delivery': 1, 'provide': 1, 'human': 1, 'growth': 1, 'hormone': 1, 'cheapest': 1, 'life': 1, 'insurance': 1, 'us': 1, 'lose': 1, 'now': 1, 'medical': 1, 'exams': 1, 'necessary': 1, 'online': 1, 'pharmacy': 1, 'is': 1, 'best': 1, 'cream': 1, 'removes': 1, 'wrinkles': 1, 'reverses': 1, 'aging': 1, 'one': 1, 'stop': 1, 'snoring': 1, 'sell': 1, 'valium': 1, 'vicodin': 1, 'help': 1, 'loss': 1, 'cheap': 1, 'xanax': 1})
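We can also weight each token of the spam e-mail by its tf-idf score, that is, its frequency in that e-mail multiplied by the base-2 logarithm of the number of e-mails divided by the number of e-mails that contain the token. Tokens that occur in every e-mail get a score of 0:
from math import log
tokenlist = []    # one set of token types per e-mail
frqprofiles = []  # one frequency profile per e-mail
for x in spam + ham:
    tokens = word_tokenize(x.lower())
    frqprofiles.append(Counter(tokens))
    tokenlist.append(set(tokens))
# score the tokens of the first e-mail, which is the spam e-mail
for x in frqprofiles[0]:
    frq = frqprofiles[0][x]
    # document frequency: number of e-mails that contain token x
    counter = 0
    for y in tokenlist:
        if x in y:
            counter += 1
    print(x, frq * log(len(tokenlist) / counter, 2))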
our 2.947862376664825
medicine 4.643856189774724
cures 2.321928094887362
baldness 2.321928094887362
. 0.0
no 4.643856189774724
diagnostics 2.321928094887362
needed 2.321928094887362
we 6.965784284662087
guarantee 2.321928094887362
fast 2.321928094887362
viagra 4.643856189774724
delivery 2.321928094887362
can 4.643856189774724
provide 2.321928094887362
human 2.321928094887362
growth 2.321928094887362
hormone 2.321928094887362
the 0.0
cheapest 2.321928094887362
life 2.321928094887362
insurance 1.3219280948873624
with 0.965784284662087
us 1.3219280948873624
you 0.0
lose 2.321928094887362
weight 4.643856189774724
this 2.643856189774725
treatment 4.643856189774724
now 2.321928094887362
and 2.947862376664825
medical 2.321928094887362
exams 2.321928094887362
necessary 2.321928094887362
online 1.3219280948873624
pharmacy 2.321928094887362
is 1.3219280948873624
best 0.7369655941662062
cream 2.321928094887362
removes 2.321928094887362
wrinkles 2.321928094887362
reverses 2.321928094887362
aging 2.321928094887362
one 2.321928094887362
will 0.6438561897747247
stop 2.321928094887362
snoring 2.321928094887362
sell 2.321928094887362
valium 2.321928094887362
vicodin 2.321928094887362
help 2.321928094887362
loss 2.321928094887362
cheap 2.321928094887362
xanax 2.321928094887362
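The weighting printed above is term frequency multiplied by $\log_2(N / df)$, where $N$ is the number of documents and $df$ the number of documents that contain the token. As a minimal sketch, the same computation can be wrapped in a function. The name tf_idf and the toy documents are made up for illustration and are not part of the corpus above:

```python
from collections import Counter
from math import log2

def tf_idf(doc_index, docs):
    """Return {token: tf * log2(N / df)} for docs[doc_index].

    docs is a list of token lists (already tokenized and lower-cased).
    """
    tf = Counter(docs[doc_index])
    type_sets = [set(d) for d in docs]
    n_docs = len(docs)
    scores = {}
    for token, freq in tf.items():
        # document frequency: number of documents containing the token
        df = sum(1 for types in type_sets if token in types)
        scores[token] = freq * log2(n_docs / df)
    return scores

# Toy documents, invented for this example.
toy_docs = [
    "cheap viagra cheap .".split(),
    "see you soon .".split(),
    "cheap flights see you .".split(),
]
scores = tf_idf(0, toy_docs)
for token, value in scores.items():
    print(token, value)
```

A token that occurs in every document, like the period here, gets a score of 0, while a token that is frequent in one document and rare elsewhere gets a high score.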
The probability that a randomly picked e-mail is spam or ham can be computed as the number of e-mails in the respective class divided by the total number of e-mails:
total = len(spam) + len(ham)
spamP = len(spam) / total
hamP = len(ham) / total
print("probability to pick spam:", spamP)
print("probability to pick ham:", hamP)
probability to pick spam: 0.2
probability to pick ham: 0.8
We will need the total token count to calculate the relative frequency of the tokens, that is, to generate likelihood estimates. As a brute-force solution, we add one to each total to reserve some probability mass for unknown tokens.
totalSpam = sum(spamFP.values()) + 1
totalHam = sum(hamFP.values()) + 1
print("total spam counts + 1:", totalSpam)
print("total ham counts + 1:", totalHam)
total spam counts + 1: 87
total ham counts + 1: 251
We can relativize the counts in the frequency profiles now:
hamFP = Counter({token: frequency / totalHam for token, frequency in hamFP.items()})
spamFP = Counter({token: frequency / totalSpam for token, frequency in spamFP.items()})
print(hamFP)
print("-" * 30)
print(spamFP)
Counter({'the': 0.055776892430278883, ',': 0.043824701195219126, '.': 0.043824701195219126, 'to': 0.03187250996015936, 'you': 0.03187250996015936, 'a': 0.02390438247011952, 'will': 0.01593625498007968, 'is': 0.01593625498007968, 'of': 0.01593625498007968, 'and': 0.01593625498007968, 'with': 0.01195219123505976, 'please': 0.01195219123505976, ':': 0.01195219123505976, '2017': 0.01195219123505976, 'in': 0.01195219123505976, 'hi': 0.00796812749003984, 'see': 0.00796812749003984, 'soon': 0.00796812749003984, 'our': 0.00796812749003984, 'family': 0.00796812749003984, 'best': 0.00796812749003984, 'dear': 0.00796812749003984, 'car': 0.00796812749003984, 'insurance': 0.00796812749003984, '?': 0.00796812749003984, 'would': 0.00796812749003984, 'discuss': 0.00796812749003984, 'me': 0.00796812749003984, 'if': 0.00796812749003984, 'have': 0.00796812749003984, 'first': 0.00796812749003984, 'from': 0.00796812749003984, 'present': 0.00796812749003984, 'event': 0.00796812749003984, 'studies': 0.00796812749003984, 'artist': 0.00796812749003984, 'wojtek': 0.00796812749003984, 'sawa': 0.00796812749003984, 'on': 0.00796812749003984, 'p.m.': 0.00796812749003984, 'his': 0.00796812749003984, 'hans': 0.00398406374501992, 'hope': 0.00398406374501992, 'at': 0.00398406374501992, 'party': 0.00398406374501992, 'when': 0.00398406374501992, 'arrive': 0.00398406374501992, 'all': 0.00398406374501992, 'sue': 0.00398406374501992, 'ata': 0.00398406374501992, 'did': 0.00398406374501992, 'receive': 0.00398406374501992, 'my': 0.00398406374501992, 'last': 0.00398406374501992, 'email': 0.00398406374501992, 'related': 0.00398406374501992, 'offer': 0.00398406374501992, 'i': 0.00398406374501992, 'be': 0.00398406374501992, 'happy': 0.00398406374501992, 'details': 0.00398406374501992, 'give': 0.00398406374501992, 'call': 0.00398406374501992, 'any': 0.00398406374501992, 'questions': 0.00398406374501992, 'john': 0.00398406374501992, 'smith': 0.00398406374501992, 'super': 0.00398406374501992, 'everyone': 
0.00398406374501992, 'this': 0.00398406374501992, 'just': 0.00398406374501992, 'gentle': 0.00398406374501992, 'reminder': 0.00398406374501992, 'today': 0.00398406374501992, "'s": 0.00398406374501992, 'sls': 0.00398406374501992, 'colloquium': 0.00398406374501992, '2.30': 0.00398406374501992, '4.00': 0.00398406374501992, 'pm': 0.00398406374501992, 'ballantine': 0.00398406374501992, '103.': 0.00398406374501992, 'rodica': 0.00398406374501992, 'frimu': 0.00398406374501992, 'job': 0.00398406374501992, 'talk': 0.00398406374501992, 'entitled': 0.00398406374501992, '``': 0.00398406374501992, 'what': 0.00398406374501992, 'so': 0.00398406374501992, 'tricky': 0.00398406374501992, 'subject-verb': 0.00398406374501992, 'agreement': 0.00398406374501992, "''": 0.00398406374501992, 'text': 0.00398406374501992, 'abstract': 0.00398406374501992, 'below': 0.00398406374501992, 'like': 0.00398406374501992, 'something': 0.00398406374501992, 'during': 0.00398406374501992, 'spring': 0.00398406374501992, 'let': 0.00398406374501992, 'know': 0.00398406374501992, 'current': 0.00398406374501992, 'online': 0.00398406374501992, 'schedule': 0.00398406374501992, 'updated': 0.00398406374501992, 'title': 0.00398406374501992, 'information': 0.00398406374501992, 'abstracts': 0.00398406374501992, 'available': 0.00398406374501992, 'under': 0.00398406374501992, 'http': 0.00398406374501992, '//www.iub.edu/~psyling/slscolloquium/spring2017.html': 0.00398406374501992, 'peter': 0.00398406374501992, 'friends': 0.00398406374501992, 'as': 0.00398406374501992, 'polish': 0.00398406374501992, 'center': 0.00398406374501992, 'presents': 0.00398406374501992, 'an': 0.00398406374501992, 'evening': 0.00398406374501992, 'filmmaker': 0.00398406374501992, 'join': 0.00398406374501992, 'us': 0.00398406374501992, 'january': 0.00398406374501992, '26': 0.00398406374501992, '5:30': 0.00398406374501992, '7:30': 0.00398406374501992, 'global': 0.00398406374501992, 'international': 0.00398406374501992, 'building': 0.00398406374501992, 
'room': 0.00398406374501992, '1100': 0.00398406374501992, 'for': 0.00398406374501992, 'presentation': 0.00398406374501992, 'by': 0.00398406374501992, 'interactive': 0.00398406374501992, 'installation': 0.00398406374501992, 'art': 0.00398406374501992, 'piece': 0.00398406374501992, 'wall': 0.00398406374501992, 'speaks–voices': 0.00398406374501992, 'unheard': 0.00398406374501992, 'reception': 0.00398406374501992, 'follow': 0.00398406374501992, 'where': 0.00398406374501992, 'chance': 0.00398406374501992, 'meet': 0.00398406374501992, 'work': 0.00398406374501992}) ------------------------------ Counter({'.': 0.14942528735632185, 'our': 0.04597701149425287, 'and': 0.04597701149425287, 'we': 0.034482758620689655, 'with': 0.034482758620689655, 'medicine': 0.022988505747126436, 'no': 0.022988505747126436, 'viagra': 0.022988505747126436, 'can': 0.022988505747126436, 'the': 0.022988505747126436, 'you': 0.022988505747126436, 'weight': 0.022988505747126436, 'this': 0.022988505747126436, 'treatment': 0.022988505747126436, 'will': 0.022988505747126436, 'cures': 0.011494252873563218, 'baldness': 0.011494252873563218, 'diagnostics': 0.011494252873563218, 'needed': 0.011494252873563218, 'guarantee': 0.011494252873563218, 'fast': 0.011494252873563218, 'delivery': 0.011494252873563218, 'provide': 0.011494252873563218, 'human': 0.011494252873563218, 'growth': 0.011494252873563218, 'hormone': 0.011494252873563218, 'cheapest': 0.011494252873563218, 'life': 0.011494252873563218, 'insurance': 0.011494252873563218, 'us': 0.011494252873563218, 'lose': 0.011494252873563218, 'now': 0.011494252873563218, 'medical': 0.011494252873563218, 'exams': 0.011494252873563218, 'necessary': 0.011494252873563218, 'online': 0.011494252873563218, 'pharmacy': 0.011494252873563218, 'is': 0.011494252873563218, 'best': 0.011494252873563218, 'cream': 0.011494252873563218, 'removes': 0.011494252873563218, 'wrinkles': 0.011494252873563218, 'reverses': 0.011494252873563218, 'aging': 0.011494252873563218, 'one': 
0.011494252873563218, 'stop': 0.011494252873563218, 'snoring': 0.011494252873563218, 'sell': 0.011494252873563218, 'valium': 0.011494252873563218, 'vicodin': 0.011494252873563218, 'help': 0.011494252873563218, 'loss': 0.011494252873563218, 'cheap': 0.011494252873563218, 'xanax': 0.011494252873563218})
We can now compute the default probability that we want to assign to unknown words as $1 / totalSpam$ or $1 / totalHam$ respectively. Whenever we encounter an unknown token that is not in our frequency profile, we will assign the default probability to it.
defaultSpam = 1 / totalSpam
defaultHam = 1 / totalHam
print("default spam probability:", defaultSpam)
print("default ham probability:", defaultHam)
default spam probability: 0.011494252873563218
default ham probability: 0.00398406374501992
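The effect of adding one to the token total can be checked with exact fractions. The following sketch uses a made-up five-token text, not the corpus above, and shows that the relative frequencies of the known tokens and the single default probability together sum to 1:

```python
from collections import Counter
from fractions import Fraction

# Made-up token list for illustration only.
counts = Counter("buy cheap pills buy now".split())
total = sum(counts.values()) + 1   # 5 tokens + 1 reserved slot

# Relative frequencies with the enlarged denominator.
probs = {tok: Fraction(freq, total) for tok, freq in counts.items()}
default = Fraction(1, total)       # probability for an unknown token

seen_mass = sum(probs.values())
print("mass of known tokens:", seen_mass)
print("default probability:", default)
print("sum:", seen_mass + default)
```

The reserved mass covers exactly one unknown token type. If a text contains several different unknown tokens, each of them receives the default probability, so this is a pragmatic shortcut rather than a proper distribution over all possible tokens.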
We can test an unknown document by calculating how likely it is to have been generated by the hamFP distribution or the spamFP distribution. We have to tokenize the lower-cased unknown document and compute the product of the likelihoods of all tokens in the text. We should scale this likelihood by the probability of randomly picking a ham or a spam e-mail. Let us calculate the likelihood that the unknown e-mail is spam:
unknownEmail = """Dear ,
we sell the cheapest and best Viagra on the planet. Our delivery is guaranteed confident and cheap.
"""
#unknownEmail = """Dear Hans,
#I have not seen you for so long. When will we go out for a coffee again.
#"""
tokens = word_tokenize(unknownEmail.lower())
result = 1.0
for token in tokens:
    result *= spamFP.get(token, defaultSpam)
print(result * spamP)
9.669490943645368e-37
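Products of many small probabilities can fall below the smallest number that a floating-point variable can represent. The following sketch uses an invented per-token probability of 0.01 and 200 tokens to show the product collapsing to zero while the sum of logarithms stays usable:

```python
from math import log2

p = 0.01   # invented per-token probability
n = 200    # invented number of tokens

product = 1.0
log_sum = 0.0
for _ in range(n):
    product *= p        # underflows once the value drops below about 1e-323
    log_sum += log2(p)  # stays a moderate negative number

print("product:", product)
print("sum of log2 probabilities:", log_sum)
```

Here the product is printed as 0.0, so two classes could no longer be compared. For the short e-mail above the product is still representable, but it is already close to $10^{-36}$.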
Since this number is very small, a better strategy might be to sum up the log-likelihoods:
from math import log
resultSpam = 0.0
for token in tokens:
    resultSpam += log(spamFP.get(token, defaultSpam), 2)
# use base 2 for the prior as well, so that all terms share one base
resultSpam += log(spamP, 2)
print(round(resultSpam, 4))
-119.6379
resultHam = 0.0
for token in tokens:
    resultHam += log(hamFP.get(token, defaultHam), 2)
# use base 2 for the prior as well, so that all terms share one base
resultHam += log(hamP, 2)
print(round(resultHam, 4))
-139.7313
The log-likelihood for spam is larger (less negative) than for ham, so our simple classifier guesses that this e-mail is spam.
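The gap between two such scores can also be read as a probability. Assuming that both scores are base-2 logarithms of prior times likelihood, with $s$ for spam and $h$ for ham, the posterior is $P(spam \mid d) = 2^{s} / (2^{s} + 2^{h}) = 1 / (1 + 2^{h - s})$. The function name spam_posterior and the example scores below are made up for illustration:

```python
def spam_posterior(log2_spam, log2_ham):
    """Posterior probability of spam from two base-2 log scores."""
    diff = log2_ham - log2_spam
    if diff > 1000:
        # 2.0 ** diff would overflow a float; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + 2.0 ** diff)

# Invented scores: spam is 2 bits more likely than ham.
p = spam_posterior(-10.0, -12.0)
print(p)
```

With a gap of 2 bits the posterior for spam is 0.8. Only the difference between the scores matters, so the tiny absolute values never have to be exponentiated.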
# choose the class with the larger log-likelihood
if resultHam >= resultSpam:
    print("e-mail is ham")
else:
    print("e-mail is spam")
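The individual steps of this tutorial can be collected into two small functions. This is only a sketch under stated assumptions: the names train and score are hypothetical, the toy e-mails are invented, and whitespace splitting (str.split) stands in for NLTK's word_tokenize so that the example runs without extra packages:

```python
from collections import Counter
from math import log2

def train(docs_by_label, tokenize=str.split):
    """Return {label: (token probabilities, default probability, prior)}."""
    n_docs = sum(len(docs) for docs in docs_by_label.values())
    model = {}
    for label, docs in docs_by_label.items():
        counts = Counter()
        for text in docs:
            counts.update(tokenize(text.lower()))
        total = sum(counts.values()) + 1  # reserve one slot for unknown tokens
        probs = {tok: freq / total for tok, freq in counts.items()}
        model[label] = (probs, 1 / total, len(docs) / n_docs)
    return model

def score(model, text, tokenize=str.split):
    """Return {label: log2(prior) + sum of log2 token likelihoods}."""
    tokens = tokenize(text.lower())
    scores = {}
    for label, (probs, default, prior) in model.items():
        scores[label] = log2(prior) + sum(
            log2(probs.get(tok, default)) for tok in tokens
        )
    return scores

# Invented toy corpus, much smaller than the one in this tutorial.
toy = {
    "spam": ["cheap pills cheap viagra"],
    "ham": ["see you at the party", "call me about the car"],
}
model = train(toy)
scores = score(model, "cheap viagra now")
best = max(scores, key=scores.get)
print(scores)
print("e-mail is", best)
```

Passing word_tokenize as the tokenize argument should reproduce the tokenization used in the rest of this tutorial.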
There are numerous ways to improve the algorithm and the tutorial. Please send me your suggestions.
(C) 2017-2024 by Damir Cavar - Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)