4. Doing Naive Bayes Classification

Lynn Cherny, 2/10/15

Full repo here: https://github.com/arnicas/NLP-in-Python

This is an example of going from labeled text to machine classification, first with NLTK and then the Python machine learning library scikit-learn. Examples updated from my OpenVis Conf talk here, which is more entertaining: https://www.youtube.com/watch?v=f41U936WqPM and slides: http://www.slideshare.net/arnicas/the-bones-of-a-bestseller

Warning: Rated NC-17. Using text samples from "Fifty Shades of Grey"! (Because spam is boring.)

Inspired by this image from the Economist (originally http://www.economist.com/blogs/graphicdetail/2012/11/fifty-shades-data-visualisations):

I wondered if I could identify the sex scenes automatically, based on training examples. That's "classification."

Because the book was too hard to read, I farmed out (badly formatted) short chunks of "Fifty Shades of Grey" to Mechanical Turkers to rate for sexiness on a ratings scale: 0 is "not a sex scene," 1 and 2 are increasing in steaminess, and 3 is definitely a sex scene. (In later scoring, I reduced the options to just 3.)

Assuming that a score of >= 2.5 is a sex scene, I put the scores and text into a usable file with that labeling.
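
A rough sketch of that labeling step, assuming each chunk collected several Turker ratings that were averaged. The function name and the sample ratings are made up for illustration, and this only covers the yes / not-yes cut, not how "maybe" was assigned:

```python
def is_sex_scene(ratings, threshold=2.5):
    """Average a chunk's ratings and compare the mean against the cutoff."""
    mean_score = sum(ratings) / float(len(ratings))
    return mean_score >= threshold

# Hypothetical ratings on the 0-3 scale, not real Turker data.
steamy = is_sex_scene([3, 3, 2])   # mean is about 2.67
tame = is_sex_scene([0, 1, 0])     # mean is about 0.33
```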

In [3]:
labelsfile = 'data/csv/fiftyshades_labeled.txt'
In [4]:
def get_documents_csv(filename):
    """ Read in the labeled chunks and classifications.
    Assume label is cell 1, doc text is cell 2, classification is cell 3.
    """
    
    labels = []
    documents = []
    classif = []
    for line in open(filename):
        fields = line.split("\t")
        if fields[0].strip() != 'label':    # skip the header row, if there is one
            documents.append(fields[1].strip())
            labels.append(fields[0].strip())
            if fields[2]:
                classif.append(fields[2].strip("\n"))
    print "Got", len(documents), "chunks"
    return (documents, labels, classif)
In [5]:
docs, labels, classes = get_documents_csv(labelsfile)
Got 382 chunks
In [6]:
# Hmm, in my classes I did categorize "maybes."  These can either be used as "no" or as a third class.
classes[25:40]
Out[6]:
['maybe\r',
 'yes\r',
 'maybe\r',
 'no\r',
 'maybe\r',
 'yes\r',
 'maybe\r',
 'maybe\r',
 'maybe\r',
 'no\r',
 'maybe\r',
 'maybe\r',
 'no\r',
 'yes\r',
 'maybe\r']
In [7]:
labels[25:40]  # these are the original file chunks for reference
Out[7]:
['fifty_500_166',
 'fifty_500_199',
 'fifty_500_247',
 'fifty_500_365',
 'fifty_500_138',
 'fifty_500_338',
 'fifty_500_348',
 'fifty_500_364',
 'fifty_500_84',
 'fifty_500_366',
 'fifty_500_85',
 'fifty_500_368',
 'fifty_500_56',
 'fifty_500_92',
 'fifty_500_276']

I did do some hierarchical cluster analysis on it, by the way (code elsewhere), with labels colored green = yes, red = no (apologies to the color blind). The clustering suggests I probably can build a classifier!

So let's build a classifier. Here's a schematic from Perkins for the machine learning workflow:

Some references:

In [8]:
# this text can't be used as is in the classifier -- and notice this was a "maybe":

docs[25]
Out[8]:
'"any time , Anastasia. I won * t stop you. If you go , however * that * s it. Just so you know. * * Okay , * I answer softly. If I go , that * s it. The thought is surprisingly painful . The waiter arrives with our first course. How can I possibly eat ? Holy Moses * he * s ordered oysters on a bed of ice . * I hope you like oysters. * Christian * s voice is soft . * I * ve never had one. * Ever . * Really ? Well. * He reaches for one. * All you do is tip and swallow. I think you can manage that. * He gazes at me , and I know what he * s referring to. I blush scarlet. He grins at me , squirts some lemon juice onto his oyster , and then tips it into his mouth . * Hmm , delicious. Tastes of the sea. * He grins at me. * Go on , * he encourages . * So , I don * t chew it ? * * No , Anastasia , you don * t. * His eyes are alight with humor. He looks so young like this . I bite my lip and his expression changes instantly. He looks sternly at me. I reach across and pick up my first-ever oyster. Okay * here goes nothing. I squirt some lemon juice on it and tip it up. It slips down my throat , all sea water , salt , the sharp tang of citrus , and fleshiness * ooh. I lick my lips , and he * s watching me intently , his eyes hooded . * Well ? * * I * ll have another , * I say dryly . * Good girl , * he says proudly . * Did you choose these deliberately ? Aren * t they known for their aphrodisiac qualities ? * * No , they are the first item on the menu. I don * t need an aphrodisiac near you. I think you know that , and I think you react the same way near me , * he says simply. * So where were we ? * He glances at my e-mail as I reach for another oyster . He reacts the same way. I affect him * wow . * Obey me in all things. Yes , I want you to do that. I need you to do that. Think of it as role-play , Anastasia. * * But I * m worried you * ll hurt me. * * Hurt you how ? * * Physically. * And emotionally . * Do you really think I would do that ? 
Go beyond any limit you can * t take ? * * You * ve said you * ve hurt someone before. * * Yes , I have. It was a long"'

I don't remember why my original text extracts had so much horrible * formatting. It had something to do with getting it into Excel for Mechanical Turk as fast as possible. I'm showing it to you as the raters saw it, for honesty's sake, and maybe it won't get me in trouble for sharing if it's so awful to read like that!

Let's clean it up - here we are using a regex tokenizer to clean the garbage out. Also, making a class to store a bunch of stuff in!

In [9]:
import re
from nltk import corpus

# Build the stopword set once; set membership tests are fast.
STOPWORDS = set(corpus.stopwords.words('english'))

class Document:
    """Container for one text chunk and everything we derive from it."""
    def __init__(self):
        # Instance attributes, so each Document keeps its own data.
        self.words = []
        self.original = ""
        self.clean = ""
        self.label = ""
        self.classif = ""

def clean_doc(doc):
    """Lowercase the text, tokenize it, and drop stopwords."""
    new = Document()
    new.original = doc
    sentence = doc.lower()
    # \w+ grabs runs of letters, digits and underscores, which throws away the stray *
    # and punctuation; sklearn's default tokenizer differs (it keeps only tokens of 2+ characters)
    words = re.findall(r'\w+', sentence, flags=re.UNICODE)
    new.clean = " ".join(words)
    new.words = [word for word in words if word not in STOPWORDS]
    return new
In [10]:
# example output:

clean_doc(docs[25]).clean
Out[10]:
'any time anastasia i won t stop you if you go however that s it just so you know okay i answer softly if i go that s it the thought is surprisingly painful the waiter arrives with our first course how can i possibly eat holy moses he s ordered oysters on a bed of ice i hope you like oysters christian s voice is soft i ve never had one ever really well he reaches for one all you do is tip and swallow i think you can manage that he gazes at me and i know what he s referring to i blush scarlet he grins at me squirts some lemon juice onto his oyster and then tips it into his mouth hmm delicious tastes of the sea he grins at me go on he encourages so i don t chew it no anastasia you don t his eyes are alight with humor he looks so young like this i bite my lip and his expression changes instantly he looks sternly at me i reach across and pick up my first ever oyster okay here goes nothing i squirt some lemon juice on it and tip it up it slips down my throat all sea water salt the sharp tang of citrus and fleshiness ooh i lick my lips and he s watching me intently his eyes hooded well i ll have another i say dryly good girl he says proudly did you choose these deliberately aren t they known for their aphrodisiac qualities no they are the first item on the menu i don t need an aphrodisiac near you i think you know that and i think you react the same way near me he says simply so where were we he glances at my e mail as i reach for another oyster he reacts the same way i affect him wow obey me in all things yes i want you to do that i need you to do that think of it as role play anastasia but i m worried you ll hurt me hurt you how physically and emotionally do you really think i would do that go beyond any limit you can t take you ve said you ve hurt someone before yes i have it was a long'
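
To see the cleaning step on something shorter, here is a minimal sketch of the same regex tokenization on the first couple of sentences of that chunk. The tiny stopword set is a stand-in chosen for this example; the notebook uses NLTK's full English list.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch);
# the notebook uses NLTK's English stopword list instead.
STOPWORDS = {"i", "won", "t", "s", "you", "if", "that", "it"}

def tokenize(text):
    """Lowercase the text and return runs of word characters, dropping stray punctuation."""
    return re.findall(r"\w+", text.lower())

sample = "I won * t stop you. If you go , however * that * s it."
tokens = tokenize(sample)
content_words = [w for w in tokens if w not in STOPWORDS]
```

Note that the contraction pieces ("won", "t", "s") survive tokenization as separate tokens, which is why they show up in the cleaned text above; the stopword filter is what removes them from the word list.
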
In [11]:
# clean them all...

clean_docs = [clean_doc(x) for x in docs]
In [12]:
# Fix up with more info on each object:

def add_ids_classes(doc_objs, labels, classes):
    # Go thru the objects we just made and add the corresponding class and label
    for i,x in enumerate(doc_objs):
        x.label = labels[i]
        x.id = i
        x.classif = classes[i].strip("\r")  # may be necessary to strip, was for me
    return doc_objs
In [13]:
clean_docs = add_ids_classes(clean_docs, labels, classes)
In [14]:
clean_docs[0]
clean_docs[0].classif
Out[14]:
'maybe'
In [15]:
# We will consider the "maybe" as no, for now:

neg_docs = [doc for doc in clean_docs if doc.classif == 'no' or doc.classif == 'maybe']
pos_docs = [doc for doc in clean_docs if doc.classif == 'yes']
In [16]:
print len(neg_docs), len(pos_docs)
327 55
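
The two options for the "maybes" can be sketched as a small mapping. The helper name and its `maybe_as` parameter are hypothetical, and the sample labels are just the first few values shown in `classes[25:40]`:

```python
from collections import Counter

def collapse_label(classif, maybe_as="neg"):
    """Map a hand label onto a classifier class; 'maybe' goes wherever maybe_as says."""
    mapping = {"yes": "pos", "no": "neg", "maybe": maybe_as}
    return mapping[classif]

hand_labels = ["maybe", "yes", "maybe", "no", "maybe", "yes"]

# Option 1: fold "maybe" into the negative class (what this notebook does).
two_class = Counter(collapse_label(c) for c in hand_labels)

# Option 2: keep "maybe" as its own third class.
three_class = Counter(collapse_label(c, maybe_as="maybe") for c in hand_labels)
```

Folding "maybe" into "no" is what produces the lopsided 327 vs. 55 split, which matters when reading the accuracy numbers later.
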
In [18]:
# Bag of Words - just a True for each word's presence in a document.  Later we'll use TF-IDF weights.

def word_feats(words):
    return dict((word, True) for word in words)
In [19]:
neg_words = [(word_feats(doc.words),'neg') for doc in neg_docs]
pos_words = [(word_feats(doc.words),'pos') for doc in pos_docs]
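
A quick sketch of what this feature function does to a made-up word list: repeated words collapse to a single `True`, so the classifier only sees presence, not counts.

```python
def word_feats(words):
    """Presence-only bag of words: each distinct word maps to True."""
    return dict((word, True) for word in words)

# "holy" appears twice in the input but only once in the features.
feats = word_feats(["holy", "moses", "holy", "oysters"])
labeled = (feats, "neg")   # the (features, label) pair shape NLTK trains on
```
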
In [24]:
# These are lists of (feature dictionary, label) pairs, one for each text.  Here's the first "no" text:
neg_words[0]
Out[24]:
({'accomplishments': True,
  'admiring': True,
  'ana': True,
  'apogee': True,
  'arms': True,
  'around': True,
  'ask': True,
  'asks': True,
  'baby': True,
  'back': True,
  'bed': True,
  'blue': True,
  'breathes': True,
  'care': True,
  'cares': True,
  'christian': True,
  'come': True,
  'comprehending': True,
  'concern': True,
  'confusion': True,
  'conquest': True,
  'course': True,
  'covered': True,
  'cries': True,
  'crushes': True,
  'd': True,
  'depths': True,
  'devours': True,
  'didn': True,
  'doubt': True,
  'effect': True,
  'energized': True,
  'everywhere': True,
  'exactly': True,
  'exciting': True,
  'eyes': True,
  'face': True,
  'far': True,
  'favorite': True,
  'feel': True,
  'fifteen': True,
  'film': True,
  'find': True,
  'finds': True,
  'flown': True,
  'fronts': True,
  'frowns': True,
  'fulfilling': True,
  'full': True,
  'funny': True,
  'gape': True,
  'good': True,
  'gray': True,
  'greatest': True,
  'grey': True,
  'grin': True,
  'grinning': True,
  'grins': True,
  'happening': True,
  'head': True,
  'holy': True,
  'hugging': True,
  'idiot': True,
  'incredulity': True,
  'infectious': True,
  'inside': True,
  'invocation': True,
  'king': True,
  'lie': True,
  'like': True,
  'lips': True,
  'looking': True,
  'love': True,
  'm': True,
  'man': True,
  'many': True,
  'meant': True,
  'mine': True,
  'mirroring': True,
  'miss': True,
  'mr': True,
  'naked': True,
  'number': True,
  'obvious': True,
  'oh': True,
  'one': True,
  'orgasm': True,
  'passion': True,
  'passionate': True,
  'piano': True,
  'pillows': True,
  'play': True,
  'playroom': True,
  'quirk': True,
  'referring': True,
  'release': True,
  'right': True,
  'ripping': True,
  'sad': True,
  'said': True,
  'score': True,
  'see': True,
  'seventeen': True,
  'sex': True,
  'shakes': True,
  'sheet': True,
  'shining': True,
  'shit': True,
  'silly': True,
  'sleep': True,
  'sloshing': True,
  'smiles': True,
  'soft': True,
  'soul': True,
  'staring': True,
  'steele': True,
  'still': True,
  'stirring': True,
  'stop': True,
  'strangely': True,
  'stuff': True,
  'suddenly': True,
  'super': True,
  'talk': True,
  'thought': True,
  'tired': True,
  'today': True,
  'touching': True,
  'turbulent': True,
  'um': True,
  'unexpected': True,
  'vanilla': True,
  've': True,
  'voice': True,
  'want': True,
  'whole': True,
  'wild': True,
  'women': True,
  'wrapped': True},
 'neg')
In [25]:
# Let's make a cut point at 3/4 of each list, so we can do separate test and training runs.

# Integer division (//) keeps the cut points usable as slice indexes.
negcutoff = len(neg_words) * 3 // 4
poscutoff = len(pos_words) * 3 // 4
In [19]:
# Now split up the lists into training and testing.

import random

random.shuffle(neg_words)
random.shuffle(pos_words)

train_fic = neg_words[:negcutoff] + pos_words[:poscutoff]
test_fic = neg_words[negcutoff:] + pos_words[poscutoff:]

print 'train on %d docs, test on %d docs' % (len(train_fic), len(test_fic))
train on 286 docs, test on 96 docs
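
The arithmetic behind those numbers, as a self-contained sketch with placeholder documents. The function name and the fixed seed are additions for illustration (the notebook shuffles without a seed, so its splits differ from run to run):

```python
import random

def split_by_class(neg, pos, seed=0):
    """Shuffle each class separately, then cut each at 3/4 for train vs. test."""
    rng = random.Random(seed)        # seeded only so this sketch is repeatable
    neg, pos = list(neg), list(pos)
    rng.shuffle(neg)
    rng.shuffle(pos)
    neg_cut = len(neg) * 3 // 4      # 327 * 3 // 4 == 245
    pos_cut = len(pos) * 3 // 4      # 55 * 3 // 4 == 41
    train = neg[:neg_cut] + pos[:pos_cut]
    test = neg[neg_cut:] + pos[pos_cut:]
    return train, test

# Placeholder (features, label) pairs with the same class sizes as the notebook.
neg = [({}, "neg") for _ in range(327)]
pos = [({}, "pos") for _ in range(55)]
train, test = split_by_class(neg, pos)
test_pos = sum(1 for _, label in test if label == "pos")
```

Cutting each class separately keeps the yes/no proportions roughly the same in both sets: 245 + 41 = 286 for training, 82 + 14 = 96 for testing.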

Naive Bayes

Naive Bayes classifiers (they are a family) are pretty good on text. The Bayes in "Naive Bayes" refers to Bayes' Theorem:

    P(A | B) = P(B | A) * P(A) / P(B)

In English, we say:

    posterior = likelihood * prior / evidence

For texts, this means that the probability of a document being in a class (like "sex scene") is proportional to the probability of that class in the training set (our prior) times the combined probabilities of the words in the new document appearing in that class. The "winning" class is the one with the highest score for the prior times the word-category probabilities. For a longer explanation of how this works, read YHat's blog post.

It is "naive" because it assumes the features (words) are independent of one another given the class, which is rarely true of real text. It needs relatively little data to train, which is good for smaller data problems, and it also scales well to a large vocabulary.

The results are reasonably good...

In [20]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_fic)


print 'accuracy:', nltk.classify.util.accuracy(classifier, test_fic)
classifier.show_most_informative_features(15)
accuracy: 0.8125
Most Informative Features
                   navel = True              pos : neg    =     25.4 : 1.0
                 breasts = True              pos : neg    =     22.3 : 1.0
                clitoris = True              pos : neg    =     21.5 : 1.0
                   groan = True              pos : neg    =     20.3 : 1.0
                     beg = True              pos : neg    =     17.6 : 1.0
                   eases = True              pos : neg    =     17.6 : 1.0
                  voices = True              pos : neg    =     17.6 : 1.0
                   upper = True              pos : neg    =     17.6 : 1.0
                   peels = True              pos : neg    =     17.6 : 1.0
                stilling = True              pos : neg    =     17.6 : 1.0
                 washing = True              pos : neg    =     17.6 : 1.0
                swirling = True              pos : neg    =     17.6 : 1.0
              stretching = True              pos : neg    =     17.6 : 1.0
                 assault = True              pos : neg    =     17.6 : 1.0
                   blows = True              pos : neg    =     17.6 : 1.0
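
The ratios in that table can be reproduced by hand. A sketch, assuming NLTK's default expected-likelihood smoothing (add 0.5 to each count) and the 41 positive / 245 negative training docs implied by the cut points; the per-word document counts are inferred to match the printed ratios, not read from the data:

```python
def ele_prob(count, n_docs, bins=2, gamma=0.5):
    """Smoothed P(word present | class); bins=2 covers 'present' and 'absent'."""
    return (count + gamma) / (n_docs + gamma * bins)

def pos_neg_ratio(pos_count, neg_count, n_pos=41, n_neg=245):
    """How many times more likely the word is in a 'pos' doc than a 'neg' doc."""
    return ele_prob(pos_count, n_pos) / ele_prob(neg_count, n_neg)

# A word seen in 1 positive training doc and 0 negative ones (assumed counts):
ratio_single = pos_neg_ratio(1, 0)
# A word seen in 6 positive docs and 1 negative doc (assumed counts):
ratio_navel = pos_neg_ratio(6, 1)
```

These come out at roughly 17.6 and 25.4, matching the printed table. If that reading is right, every feature listed at 17.6 : 1 appeared in just one positive training chunk and no negative ones, so those particular words ("washing", "voices") say more about the small sample than about sex scenes.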

An interface to sklearn (scikit-learn) from NLTK is available (see Perkins's Python 3 and NLTK book from Packt).

In [21]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from nltk.classify.util import accuracy

sk_classifier = SklearnClassifier(MultinomialNB())
sk_classifier.train(train_fic)
accuracy(sk_classifier, test_fic)
Out[21]:
0.9583333333333334

Perkins discusses some of the differences in performance in his book. Basically, for machine learning problems, sklearn is highly optimized.
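
Because the classes are so unbalanced, it helps to compare both accuracy numbers against a do-nothing baseline. A small sketch, using the test-set makeup implied by the 3/4 cut points (82 "neg" and 14 "pos"); the function name is made up:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a 'classifier' that always predicts the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / float(len(labels))

# 327 - 245 = 82 negative and 55 - 41 = 14 positive test docs.
test_labels = ["neg"] * 82 + ["pos"] * 14
baseline = majority_baseline_accuracy(test_labels)   # 82 / 96, about 0.854
```

By that yardstick the NLTK run above (0.8125) lands slightly under the always-"neg" baseline, while the sklearn run (0.958) is well above it, so per-class measures such as precision and recall on "pos" are worth checking alongside accuracy.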

Let's look at a visualization of the accuracy of one of the runs, from my OpenVis Conf talk:

http://www.ghostweather.com/essays/talks/openvisconf/text_scores/rollover.html

Now, optionally, you can look at notebook 5 on using sklearn to do the same things!
