
Bayesian Classification for Machine Learning for Computational Linguistics¶

Using token probabilities for classification¶

Download: This and various other Jupyter notebooks are available from my GitHub repo.

Version: 1.5, January 2024

License: Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)

This is a tutorial related to the discussion of a Bayesian classifier in the textbook Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach.

This tutorial was developed as part of the material for my course Machine Learning for Computational Linguistics in the Computational Linguistics Program of the Department of Linguistics at Indiana University.

Prerequisites:

In [ ]:

!pip install -U nltk

Creating a Training Corpus¶

Assume that we have a set of e-mails that are annotated as spam or ham, as described in the textbook.

There are $4$ e-mails labeled ham and $1$ e-mail labeled spam, that is, we have a total of $5$ texts in our corpus.

If we randomly picked an e-mail from the collection, the probability of picking a spam e-mail would be $1 / 5$.

Spam e-mails might differ from ham e-mails only in some words. Here is a sample spam e-mail constructed from typical keywords:

In [1]:

spam = [ """Our medicine cures baldness. No diagnostics needed.
            We guarantee Fast Viagra delivery.
            We can provide Human growth hormone. The cheapest Life
            Insurance with us. You can Lose weight with this treatment.
            Our Medicine now and No medical exams necessary.
            Our Online pharmacy is the best.  This cream Removes
            wrinkles and Reverses aging.
            One treatment and you will Stop snoring.  We sell Valium
            and Viagra.
            Our Vicodin will help with Weight loss. Cheap Xanax.""" ]

The data structure above is a list that contains a single string. Triple double quotes delimit a multi-line string. We can output the length of the list spam this way:

In [2]:

print(len(spam))

We can create a list of ham mails in a similar way:

In [3]:

ham = [ """Hi Hans, hope to see you soon at our family party.
           When will you arrive.
           All the best to the family.
           Sue""",
      """Dear Ata,
         did you receive my last email related to the car insurance
         offer? I would be happy to discuss the details with you.
         Please give me a call, if you have any questions.
         John Smith
         Super Car Insurance""",
      """Hi everyone:
         This is just a gentle reminder of today's first 2017 SLS
         Colloquium, from 2.30 to 4.00 pm, in Ballantine 103.
         Rodica Frimu will present a job talk entitled "What is
         so tricky in subject-verb agreement?". The text of the
         abstract is below.
         If you would like to present something during the Spring,
         please let me know.
         The current online schedule with updated title
         information and abstracts is available under:
         http://www.iub.edu/~psyling/SLSColloquium/Spring2017.html
         See you soon,
         Peter""",
      """Dear Friends,
         As our first event of 2017, the Polish Studies Center
         presents an evening with artist and filmmaker Wojtek Sawa.
         Please join us on JANUARY 26, 2017 from 5:30 p.m. to
         7:30 p.m. in the Global and International Studies
         Building room 1100 for a presentation by Wojtek Sawa
         on his interactive  installation art piece The Wall
         Speaks–Voices of the Unheard. A reception will follow
         the event where you will have a chance to meet the artist
         and discuss his work.
         Best,"""]

The ham-mail list contains $4$ e-mails:

In [4]:

print(len(ham))

We can access a particular e-mail via index from either spam or ham:

In [5]:

print(spam[0])

Our medicine cures baldness. No diagnostics needed.
            We guarantee Fast Viagra delivery.
            We can provide Human growth hormone. The cheapest Life
            Insurance with us. You can Lose weight with this treatment.
            Our Medicine now and No medical exams necessary.
            Our Online pharmacy is the best.  This cream Removes
            wrinkles and Reverses aging.
            One treatment and you will Stop snoring.  We sell Valium
            and Viagra.
            Our Vicodin will help with Weight loss. Cheap Xanax.

In [6]:

print(ham[3])

Dear Friends,
         As our first event of 2017, the Polish Studies Center
         presents an evening with artist and filmmaker Wojtek Sawa.
         Please join us on JANUARY 26, 2017 from 5:30 p.m. to
         7:30 p.m. in the Global and International Studies
         Building room 1100 for a presentation by Wojtek Sawa
         on his interactive  installation art piece The Wall
         Speaks–Voices of the Unheard. A reception will follow
         the event where you will have a chance to meet the artist
         and discuss his work.
         Best,

We can lower-case the e-mail using the string method lower():

In [7]:

print(ham[3].lower())

dear friends,
         as our first event of 2017, the polish studies center
         presents an evening with artist and filmmaker wojtek sawa.
         please join us on january 26, 2017 from 5:30 p.m. to
         7:30 p.m. in the global and international studies
         building room 1100 for a presentation by wojtek sawa
         on his interactive  installation art piece the wall
         speaks–voices of the unheard. a reception will follow
         the event where you will have a chance to meet the artist
         and discuss his work.
         best,

We can loop over all e-mails in spam or ham and lower-case the content:

In [8]:

for text in ham:
    print(text.lower())

hi hans, hope to see you soon at our family party.
           when will you arrive.
           all the best to the family.
           sue
dear ata,
         did you receive my last email related to the car insurance
         offer? i would be happy to discuss the details with you.
         please give me a call, if you have any questions.
         john smith
         super car insurance
hi everyone:
         this is just a gentle reminder of today's first 2017 sls
         colloquium, from 2.30 to 4.00 pm, in ballantine 103.
         rodica frimu will present a job talk entitled "what is
         so tricky in subject-verb agreement?". the text of the
         abstract is below.
         if you would like to present something during the spring,
         please let me know.
         the current online schedule with updated title
         information and abstracts is available under:
         http://www.iub.edu/~psyling/slscolloquium/spring2017.html
         see you soon,
         peter
dear friends,
         as our first event of 2017, the polish studies center
         presents an evening with artist and filmmaker wojtek sawa.
         please join us on january 26, 2017 from 5:30 p.m. to
         7:30 p.m. in the global and international studies
         building room 1100 for a presentation by wojtek sawa
         on his interactive  installation art piece the wall
         speaks–voices of the unheard. a reception will follow
         the event where you will have a chance to meet the artist
         and discuss his work.
         best,

We can use the tokenizer from NLTK to tokenize the lower-cased text into single tokens (words and punctuation marks):

In [9]:

from nltk import word_tokenize

print(word_tokenize(ham[0].lower()))

['hi', 'hans', ',', 'hope', 'to', 'see', 'you', 'soon', 'at', 'our', 'family', 'party', '.', 'when', 'will', 'you', 'arrive', '.', 'all', 'the', 'best', 'to', 'the', 'family', '.', 'sue']
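word_tokenize depends on tokenizer data that NLTK installs separately from the package itself. As a rough stand-in for experimenting without that data, the sketch below splits a lower-cased text into word and punctuation tokens with a regular expression from the standard library. The function name simple_tokenize and the pattern are assumptions made for this illustration; they are not part of NLTK, and the results differ from word_tokenize in some cases (for example, "today's" becomes three tokens here).

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper, not the NLTK tokenizer used in this tutorial.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())

sample_tokens = simple_tokenize("Hi Hans, hope to see you soon.")
print(sample_tokens)
```

All counts and probabilities in this tutorial were produced with word_tokenize, so swapping in another tokenizer would change the numbers shown in the outputs.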

We can count the number of tokens and types in a lower-cased text:

In [11]:

from collections import Counter

myCounts = Counter(word_tokenize("This is a test. Will this test teach us how to count tokens?".lower()))

print(myCounts)
print("number of  types:", len(myCounts))
print("number of tokens:", sum(myCounts.values()))

Counter({'this': 2, 'test': 2, 'is': 1, 'a': 1, '.': 1, 'will': 1, 'teach': 1, 'us': 1, 'how': 1, 'to': 1, 'count': 1, 'tokens': 1, '?': 1})
number of  types: 13
number of tokens: 15

Now we can create a frequency profile of ham and spam words given the two text collections:

In [12]:

hamFP = Counter()
spamFP = Counter()

for text in spam:
    spamFP.update(word_tokenize(text.lower()))

for text in ham:
    hamFP.update(word_tokenize(text.lower()))

print("Ham:\n",  hamFP)
print("-" * 30)
print("Spam:\n", spamFP)

Ham:
 Counter({'the': 14, ',': 11, '.': 11, 'to': 8, 'you': 8, 'a': 6, 'will': 4, 'is': 4, 'of': 4, 'and': 4, 'with': 3, 'please': 3, ':': 3, '2017': 3, 'in': 3, 'hi': 2, 'see': 2, 'soon': 2, 'our': 2, 'family': 2, 'best': 2, 'dear': 2, 'car': 2, 'insurance': 2, '?': 2, 'would': 2, 'discuss': 2, 'me': 2, 'if': 2, 'have': 2, 'first': 2, 'from': 2, 'present': 2, 'event': 2, 'studies': 2, 'artist': 2, 'wojtek': 2, 'sawa': 2, 'on': 2, 'p.m.': 2, 'his': 2, 'hans': 1, 'hope': 1, 'at': 1, 'party': 1, 'when': 1, 'arrive': 1, 'all': 1, 'sue': 1, 'ata': 1, 'did': 1, 'receive': 1, 'my': 1, 'last': 1, 'email': 1, 'related': 1, 'offer': 1, 'i': 1, 'be': 1, 'happy': 1, 'details': 1, 'give': 1, 'call': 1, 'any': 1, 'questions': 1, 'john': 1, 'smith': 1, 'super': 1, 'everyone': 1, 'this': 1, 'just': 1, 'gentle': 1, 'reminder': 1, 'today': 1, "'s": 1, 'sls': 1, 'colloquium': 1, '2.30': 1, '4.00': 1, 'pm': 1, 'ballantine': 1, '103.': 1, 'rodica': 1, 'frimu': 1, 'job': 1, 'talk': 1, 'entitled': 1, '``': 1, 'what': 1, 'so': 1, 'tricky': 1, 'subject-verb': 1, 'agreement': 1, "''": 1, 'text': 1, 'abstract': 1, 'below': 1, 'like': 1, 'something': 1, 'during': 1, 'spring': 1, 'let': 1, 'know': 1, 'current': 1, 'online': 1, 'schedule': 1, 'updated': 1, 'title': 1, 'information': 1, 'abstracts': 1, 'available': 1, 'under': 1, 'http': 1, '//www.iub.edu/~psyling/slscolloquium/spring2017.html': 1, 'peter': 1, 'friends': 1, 'as': 1, 'polish': 1, 'center': 1, 'presents': 1, 'an': 1, 'evening': 1, 'filmmaker': 1, 'join': 1, 'us': 1, 'january': 1, '26': 1, '5:30': 1, '7:30': 1, 'global': 1, 'international': 1, 'building': 1, 'room': 1, '1100': 1, 'for': 1, 'presentation': 1, 'by': 1, 'interactive': 1, 'installation': 1, 'art': 1, 'piece': 1, 'wall': 1, 'speaks–voices': 1, 'unheard': 1, 'reception': 1, 'follow': 1, 'where': 1, 'chance': 1, 'meet': 1, 'work': 1})
------------------------------
Spam:
 Counter({'.': 13, 'our': 4, 'and': 4, 'we': 3, 'with': 3, 'medicine': 2, 'no': 2, 'viagra': 2, 'can': 2, 'the': 2, 'you': 2, 'weight': 2, 'this': 2, 'treatment': 2, 'will': 2, 'cures': 1, 'baldness': 1, 'diagnostics': 1, 'needed': 1, 'guarantee': 1, 'fast': 1, 'delivery': 1, 'provide': 1, 'human': 1, 'growth': 1, 'hormone': 1, 'cheapest': 1, 'life': 1, 'insurance': 1, 'us': 1, 'lose': 1, 'now': 1, 'medical': 1, 'exams': 1, 'necessary': 1, 'online': 1, 'pharmacy': 1, 'is': 1, 'best': 1, 'cream': 1, 'removes': 1, 'wrinkles': 1, 'reverses': 1, 'aging': 1, 'one': 1, 'stop': 1, 'snoring': 1, 'sell': 1, 'valium': 1, 'vicodin': 1, 'help': 1, 'loss': 1, 'cheap': 1, 'xanax': 1})

In [13]:

from math import log

# Build one frequency profile (Counter) and one set of token types per e-mail.
# The spam e-mail comes first, so index 0 refers to it.
tokenlist = []
frqprofiles = []
for x in spam + ham:
    tokens = word_tokenize(x.lower())
    frqprofiles.append(Counter(tokens))
    tokenlist.append(set(tokens))

# TF-IDF weight for every token in the spam e-mail: its frequency times
# log2(number of e-mails / number of e-mails that contain the token).
for x in frqprofiles[0]:
    frq = frqprofiles[0][x]
    counter = 0
    for y in tokenlist:
        if x in y:
            counter += 1
    print(x, frq * log(len(tokenlist)/counter, 2))

our 2.947862376664825
medicine 4.643856189774724
cures 2.321928094887362
baldness 2.321928094887362
. 0.0
no 4.643856189774724
diagnostics 2.321928094887362
needed 2.321928094887362
we 6.965784284662087
guarantee 2.321928094887362
fast 2.321928094887362
viagra 4.643856189774724
delivery 2.321928094887362
can 4.643856189774724
provide 2.321928094887362
human 2.321928094887362
growth 2.321928094887362
hormone 2.321928094887362
the 0.0
cheapest 2.321928094887362
life 2.321928094887362
insurance 1.3219280948873624
with 0.965784284662087
us 1.3219280948873624
you 0.0
lose 2.321928094887362
weight 4.643856189774724
this 2.643856189774725
treatment 4.643856189774724
now 2.321928094887362
and 2.947862376664825
medical 2.321928094887362
exams 2.321928094887362
necessary 2.321928094887362
online 1.3219280948873624
pharmacy 2.321928094887362
is 1.3219280948873624
best 0.7369655941662062
cream 2.321928094887362
removes 2.321928094887362
wrinkles 2.321928094887362
reverses 2.321928094887362
aging 2.321928094887362
one 2.321928094887362
will 0.6438561897747247
stop 2.321928094887362
snoring 2.321928094887362
sell 2.321928094887362
valium 2.321928094887362
vicodin 2.321928094887362
help 2.321928094887362
loss 2.321928094887362
cheap 2.321928094887362
xanax 2.321928094887362
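The scores printed above are TF-IDF weights for the tokens of the spam e-mail: the frequency of a token in that e-mail, multiplied by the base-2 logarithm of the number of e-mails divided by the number of e-mails that contain the token. Tokens that occur in all five e-mails, such as "." or "the", receive a weight of $0$. The same computation can be sketched as a small function on toy data; the name tf_idf and the two toy documents are assumptions made for this illustration.

```python
from collections import Counter
from math import log2

def tf_idf(documents, index):
    """Return {token: tf * log2(N / df)} for documents[index].

    documents is a list of token lists (hypothetical helper).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    term_freq = Counter(documents[index])
    return {t: f * log2(n_docs / doc_freq[t]) for t, f in term_freq.items()}

toy_docs = [["cheap", "viagra", "cheap", "."],
            ["see", "you", "soon", "."]]
weights = tf_idf(toy_docs, 0)
print(weights)
```

In the toy data, "cheap" occurs twice in the first document and nowhere else, so it gets the highest weight, while "." occurs in both documents and is weighted $0$.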

The probability that a randomly picked e-mail is spam or ham can be computed as the number of e-mails in that class divided by the total number of e-mails:

In [14]:

total = len(spam) + len(ham)

spamP = len(spam) / total
hamP  = len(ham) / total

print("probability to pick spam:", spamP)
print("probability to pick  ham:", hamP)

probability to pick spam: 0.2
probability to pick  ham: 0.8

We will need the total token count to calculate the relative frequency of the tokens, that is, to generate likelihood estimates. As a brute-force form of smoothing, we add one to each total, which reserves some probability mass for unknown tokens.

In [15]:

totalSpam = sum(spamFP.values()) + 1
totalHam  = sum(hamFP.values()) + 1

print("total spam counts + 1:", totalSpam)
print("total  ham counts + 1:", totalHam)

total spam counts + 1: 87
total  ham counts + 1: 251

We can relativize the counts in the frequency profiles now:

In [16]:

hamFP  = Counter({token: frequency / totalHam  for token, frequency in hamFP.items()})
spamFP = Counter({token: frequency / totalSpam for token, frequency in spamFP.items()})

print(hamFP)
print("-" * 30)
print(spamFP)

Counter({'the': 0.055776892430278883, ',': 0.043824701195219126, '.': 0.043824701195219126, 'to': 0.03187250996015936, 'you': 0.03187250996015936, 'a': 0.02390438247011952, 'will': 0.01593625498007968, 'is': 0.01593625498007968, 'of': 0.01593625498007968, 'and': 0.01593625498007968, 'with': 0.01195219123505976, 'please': 0.01195219123505976, ':': 0.01195219123505976, '2017': 0.01195219123505976, 'in': 0.01195219123505976, 'hi': 0.00796812749003984, 'see': 0.00796812749003984, 'soon': 0.00796812749003984, 'our': 0.00796812749003984, 'family': 0.00796812749003984, 'best': 0.00796812749003984, 'dear': 0.00796812749003984, 'car': 0.00796812749003984, 'insurance': 0.00796812749003984, '?': 0.00796812749003984, 'would': 0.00796812749003984, 'discuss': 0.00796812749003984, 'me': 0.00796812749003984, 'if': 0.00796812749003984, 'have': 0.00796812749003984, 'first': 0.00796812749003984, 'from': 0.00796812749003984, 'present': 0.00796812749003984, 'event': 0.00796812749003984, 'studies': 0.00796812749003984, 'artist': 0.00796812749003984, 'wojtek': 0.00796812749003984, 'sawa': 0.00796812749003984, 'on': 0.00796812749003984, 'p.m.': 0.00796812749003984, 'his': 0.00796812749003984, 'hans': 0.00398406374501992, 'hope': 0.00398406374501992, 'at': 0.00398406374501992, 'party': 0.00398406374501992, 'when': 0.00398406374501992, 'arrive': 0.00398406374501992, 'all': 0.00398406374501992, 'sue': 0.00398406374501992, 'ata': 0.00398406374501992, 'did': 0.00398406374501992, 'receive': 0.00398406374501992, 'my': 0.00398406374501992, 'last': 0.00398406374501992, 'email': 0.00398406374501992, 'related': 0.00398406374501992, 'offer': 0.00398406374501992, 'i': 0.00398406374501992, 'be': 0.00398406374501992, 'happy': 0.00398406374501992, 'details': 0.00398406374501992, 'give': 0.00398406374501992, 'call': 0.00398406374501992, 'any': 0.00398406374501992, 'questions': 0.00398406374501992, 'john': 0.00398406374501992, 'smith': 0.00398406374501992, 'super': 0.00398406374501992, 'everyone': 
0.00398406374501992, 'this': 0.00398406374501992, 'just': 0.00398406374501992, 'gentle': 0.00398406374501992, 'reminder': 0.00398406374501992, 'today': 0.00398406374501992, "'s": 0.00398406374501992, 'sls': 0.00398406374501992, 'colloquium': 0.00398406374501992, '2.30': 0.00398406374501992, '4.00': 0.00398406374501992, 'pm': 0.00398406374501992, 'ballantine': 0.00398406374501992, '103.': 0.00398406374501992, 'rodica': 0.00398406374501992, 'frimu': 0.00398406374501992, 'job': 0.00398406374501992, 'talk': 0.00398406374501992, 'entitled': 0.00398406374501992, '``': 0.00398406374501992, 'what': 0.00398406374501992, 'so': 0.00398406374501992, 'tricky': 0.00398406374501992, 'subject-verb': 0.00398406374501992, 'agreement': 0.00398406374501992, "''": 0.00398406374501992, 'text': 0.00398406374501992, 'abstract': 0.00398406374501992, 'below': 0.00398406374501992, 'like': 0.00398406374501992, 'something': 0.00398406374501992, 'during': 0.00398406374501992, 'spring': 0.00398406374501992, 'let': 0.00398406374501992, 'know': 0.00398406374501992, 'current': 0.00398406374501992, 'online': 0.00398406374501992, 'schedule': 0.00398406374501992, 'updated': 0.00398406374501992, 'title': 0.00398406374501992, 'information': 0.00398406374501992, 'abstracts': 0.00398406374501992, 'available': 0.00398406374501992, 'under': 0.00398406374501992, 'http': 0.00398406374501992, '//www.iub.edu/~psyling/slscolloquium/spring2017.html': 0.00398406374501992, 'peter': 0.00398406374501992, 'friends': 0.00398406374501992, 'as': 0.00398406374501992, 'polish': 0.00398406374501992, 'center': 0.00398406374501992, 'presents': 0.00398406374501992, 'an': 0.00398406374501992, 'evening': 0.00398406374501992, 'filmmaker': 0.00398406374501992, 'join': 0.00398406374501992, 'us': 0.00398406374501992, 'january': 0.00398406374501992, '26': 0.00398406374501992, '5:30': 0.00398406374501992, '7:30': 0.00398406374501992, 'global': 0.00398406374501992, 'international': 0.00398406374501992, 'building': 0.00398406374501992, 
'room': 0.00398406374501992, '1100': 0.00398406374501992, 'for': 0.00398406374501992, 'presentation': 0.00398406374501992, 'by': 0.00398406374501992, 'interactive': 0.00398406374501992, 'installation': 0.00398406374501992, 'art': 0.00398406374501992, 'piece': 0.00398406374501992, 'wall': 0.00398406374501992, 'speaks–voices': 0.00398406374501992, 'unheard': 0.00398406374501992, 'reception': 0.00398406374501992, 'follow': 0.00398406374501992, 'where': 0.00398406374501992, 'chance': 0.00398406374501992, 'meet': 0.00398406374501992, 'work': 0.00398406374501992})
------------------------------
Counter({'.': 0.14942528735632185, 'our': 0.04597701149425287, 'and': 0.04597701149425287, 'we': 0.034482758620689655, 'with': 0.034482758620689655, 'medicine': 0.022988505747126436, 'no': 0.022988505747126436, 'viagra': 0.022988505747126436, 'can': 0.022988505747126436, 'the': 0.022988505747126436, 'you': 0.022988505747126436, 'weight': 0.022988505747126436, 'this': 0.022988505747126436, 'treatment': 0.022988505747126436, 'will': 0.022988505747126436, 'cures': 0.011494252873563218, 'baldness': 0.011494252873563218, 'diagnostics': 0.011494252873563218, 'needed': 0.011494252873563218, 'guarantee': 0.011494252873563218, 'fast': 0.011494252873563218, 'delivery': 0.011494252873563218, 'provide': 0.011494252873563218, 'human': 0.011494252873563218, 'growth': 0.011494252873563218, 'hormone': 0.011494252873563218, 'cheapest': 0.011494252873563218, 'life': 0.011494252873563218, 'insurance': 0.011494252873563218, 'us': 0.011494252873563218, 'lose': 0.011494252873563218, 'now': 0.011494252873563218, 'medical': 0.011494252873563218, 'exams': 0.011494252873563218, 'necessary': 0.011494252873563218, 'online': 0.011494252873563218, 'pharmacy': 0.011494252873563218, 'is': 0.011494252873563218, 'best': 0.011494252873563218, 'cream': 0.011494252873563218, 'removes': 0.011494252873563218, 'wrinkles': 0.011494252873563218, 'reverses': 0.011494252873563218, 'aging': 0.011494252873563218, 'one': 0.011494252873563218, 'stop': 0.011494252873563218, 'snoring': 0.011494252873563218, 'sell': 0.011494252873563218, 'valium': 0.011494252873563218, 'vicodin': 0.011494252873563218, 'help': 0.011494252873563218, 'loss': 0.011494252873563218, 'cheap': 0.011494252873563218, 'xanax': 0.011494252873563218})

We can now compute the default probability that we want to assign to unknown words as $1 / totalSpam$ or $1 / totalHam$ respectively. Whenever we encounter an unknown token that is not in our frequency profile, we will assign the default probability to it.

In [17]:

defaultSpam = 1 / totalSpam
defaultHam  = 1 / totalHam

print("default spam probability:", defaultSpam)
print("default  ham probability:", defaultHam)

default spam probability: 0.011494252873563218
default  ham probability: 0.00398406374501992
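Adding one to the total reserves probability mass for exactly one unknown token type. A commonly used alternative, which is not the method of this tutorial, is add-one (Laplace) smoothing: add a constant to every count, including the zero counts of unseen tokens, and enlarge the denominator accordingly. The sketch below assumes a fixed vocabulary; the function name laplace_probability, the toy counts, and the toy vocabulary are made up for illustration.

```python
from collections import Counter

def laplace_probability(token, counts, vocabulary_size, alpha=1.0):
    """Smoothed relative frequency: (count + alpha) / (total + alpha * |V|)."""
    total = sum(counts.values())
    return (counts.get(token, 0) + alpha) / (total + alpha * vocabulary_size)

toy_counts = Counter({"viagra": 2, "cheap": 1, "our": 1})
toy_vocabulary = {"viagra", "cheap", "our", "family"}  # "family" is unseen

p_seen = laplace_probability("viagra", toy_counts, len(toy_vocabulary))
p_unseen = laplace_probability("family", toy_counts, len(toy_vocabulary))
p_total = sum(laplace_probability(t, toy_counts, len(toy_vocabulary))
              for t in toy_vocabulary)
print(p_seen, p_unseen, p_total)
```

With this scheme the probabilities over the whole vocabulary sum to one, at the cost of having to decide in advance what the vocabulary is.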

We can classify an unknown document by calculating how likely it is to have been generated by the hamFP distribution or the spamFP distribution. We tokenize the lower-cased document and compute the product of the likelihoods of all tokens in the text. We then scale this likelihood by the prior probability of randomly picking a ham or a spam e-mail. Let us calculate this score for the case that the unknown e-mail is spam:

In [19]:

unknownEmail = """Dear ,
we sell the cheapest and best Viagra on the planet. Our delivery is guaranteed confident and cheap.
"""
#unknownEmail = """Dear Hans,
#I have not seen you for so long. When will we go out for a coffee again.
#"""

tokens = word_tokenize(unknownEmail.lower())

result = 1.0
for token in tokens:
    result *= spamFP.get(token, defaultSpam)

print(result * spamP)

9.669490943645368e-37
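The product above comes from a message of only 21 tokens. For longer messages, a product of probabilities can become smaller than the smallest number a floating-point value can represent and collapse to exactly $0.0$, at which point the classes can no longer be compared. The sketch below illustrates this with made-up numbers (200 tokens, each with probability 0.01); the variable names are assumptions for this example only.

```python
from math import log2

token_probability = 0.01   # made-up likelihood for every token
n_tokens = 200             # made-up message length

product = 1.0
log_sum = 0.0
for _ in range(n_tokens):
    product *= token_probability        # shrinks until it underflows
    log_sum += log2(token_probability)  # stays a moderate negative number

print("product:", product)
print("log sum:", log_sum)
```

The true product, $10^{-400}$, lies far below the range of double-precision floats, whereas the sum of logarithms remains easy to represent and to compare.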

Since a product of many small probabilities quickly becomes very small, a better strategy is to sum the log-likelihoods instead:

In [20]:

from math import log

# Use base 2 for the prior as well, so that all terms share the same base.
resultSpam = 0.0
for token in tokens:
    resultSpam += log(spamFP.get(token, defaultSpam), 2)
resultSpam += log(spamP, 2)

print(round(resultSpam, 4))

-119.6379

In [21]:

resultHam = 0.0
for token in tokens:
    resultHam += log(hamFP.get(token, defaultHam), 2)
resultHam += log(hamP, 2)

print(round(resultHam, 4))

-139.7313

The log score for spam is larger (less negative) than the one for ham, so our simple classifier labels this e-mail as spam.

In [ ]:

if max(resultHam, resultSpam) == resultHam:
    print("e-mail is ham")
else:
    print("e-mail is spam")
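The steps of this tutorial can be collected into one function that scores a token list against every class and returns the best label. This is a minimal sketch under stated assumptions, not a reference implementation: the function name classify, its parameters, and the toy profiles, priors, and defaults are hypothetical. In the notebook, the profiles would be hamFP and spamFP, the priors hamP and spamP, and the defaults defaultHam and defaultSpam.

```python
from math import log2

def classify(tokens, profiles, priors, defaults):
    """Return (best_label, scores) using summed base-2 log probabilities."""
    scores = {}
    for label, profile in profiles.items():
        score = log2(priors[label])  # log prior of the class
        for token in tokens:
            # unknown tokens fall back to the class default probability
            score += log2(profile.get(token, defaults[label]))
        scores[label] = score
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Toy values chosen as powers of two so the scores are easy to check by hand.
toy_profiles = {"spam": {"viagra": 0.5, "cheap": 0.25},
                "ham":  {"family": 0.5, "cheap": 0.25}}
toy_priors   = {"spam": 0.5, "ham": 0.5}
toy_defaults = {"spam": 0.125, "ham": 0.125}

label, scores = classify(["cheap", "viagra"], toy_profiles, toy_priors, toy_defaults)
print(label, scores)
```

Keeping the classes in dictionaries means that a third class could be added without changing the function.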

There are numerous ways to improve the algorithm and the tutorial. Please send me your suggestions.