Today's session shows how to build classifiers that respond to the gist or purpose of an utterance. Our techniques illustrate what's called supervised machine learning: finding generalizations from labeled examples that let us respond correctly to new, unlabeled examples.
This session draws heavily on the NLTK toolkit, which bundles a number of convenient functions for doing supervised machine learning. There are plenty of other good Python libraries for machine learning, especially scikit-learn, which gives access to a really wide variety of methods; NLTK's offerings are more of a greatest-hits compilation by comparison. Here we'll use only the simplest possible learning method: naive Bayes classifiers. NLTK also comes with a bunch of bundled training data: collections of language that have been marked up by hand to indicate the answers to important questions. Supervised machine learning depends on having that kind of labeled data.
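Before leaning on NLTK's implementation, it may help to see the mechanics of naive Bayes in miniature. The sketch below uses made-up counts and is not NLTK's code: it scores each label by its log prior plus the log likelihood of each observed feature, with add-one smoothing so unseen features don't zero out a label.

```python
import math

# Toy word counts per label (hypothetical numbers, for illustration only).
counts = {
    "pos": {"outstanding": 9, "ludicrous": 1},
    "neg": {"outstanding": 1, "ludicrous": 9},
}
label_totals = dict((label, sum(c.values())) for label, c in counts.items())

def classify(features):
    scores = {}
    for label in counts:
        score = math.log(0.5)  # uniform prior over the two labels
        for f in features:
            # add-one smoothing: unseen features get a small probability
            score += math.log((counts[label].get(f, 0) + 1.0)
                              / (label_totals[label] + 2))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify({"outstanding": True}))  # 'pos'
```

NLTK's NaiveBayesClassifier automates exactly this kind of bookkeeping, estimating the per-label feature probabilities from the training data.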
We're going to focus on two problems that are particularly relevant for a chatbot: sentiment analysis and dialogue act tagging.
We start with the usual invocations of NLTK.
import nltk, re
from nltk.corpus import movie_reviews
We'll start with sentiment analysis. This code draws on a few sources. The first is Chapter 6 of the NLTK Book, which gives a good overview of supervised learning and the interfaces that NLTK offers for dealing with the data. The second is a series of blog posts by Jacob Perkins. (I've given you the link to the first post in the series, but they're all linked together.) Perkins is a much better source about how to represent documents, select features and evaluate the results. I've also consulted Chris Potts's tutorial about sentiment analysis, which is a nice mix of academic and practical.
My goal here, by the way, is to show you some techniques that are known to have good results and illustrate the process of using machine learning to make decisions in a program. This is not the whole picture, and there's a lot more to say about how you go about using machine learning carefully and creatively for a new problem. Some of the shortcuts that I'm using here (like training on all the available data, without saving anything for development and testing) could lead to very bad results if we had to tinker to put the program together rather than just reusing techniques that other people have already worked out.
For this classification problem, we represent a text snippet as a collection of features. The features in a document list the informative words that occur in the document. There are a couple choices here that are not obvious but are important.
I have packaged up the reasoning in a function called compute_best_features
, which we'll use both to build our sentiment analyzer and our dialogue act tagger.
def compute_best_features(labels, feature_generator, n) :
feature_fd = nltk.FreqDist()
label_feature_fd = nltk.ConditionalFreqDist()
for label in labels:
for feature in feature_generator(label) :
feature_fd[feature] += 1
label_feature_fd[label][feature] += 1
counts = dict()
for label in labels:
counts[label] = label_feature_fd[label].N()
total_count = sum(counts[label] for label in labels)
feature_scores = {}
for feature, freq in feature_fd.iteritems():
feature_scores[feature] = 0.
for label in labels :
feature_scores[feature] += \
nltk.BigramAssocMeasures.chi_sq(label_feature_fd[label][feature],
(freq, counts[label]),
total_count)
best = sorted(feature_scores.iteritems(), key=lambda (f,s): s, reverse=True)[:n]
return set([f for f, s in best])
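The scoring step above leans on nltk.BigramAssocMeasures.chi_sq. As a sketch of what that measures (my own re-derivation, not NLTK's code), here is Pearson's chi-square for the implied 2x2 contingency table: a word spread evenly across the labels scores zero, while a word concentrated in one label scores high and so counts as informative.

```python
def chi_square(n_ii, n_ix_xi, n_xx):
    """Pearson chi-square for a 2x2 contingency table, following the
    argument convention of nltk.BigramAssocMeasures.chi_sq:
    n_ii = count of the feature under this label,
    n_ix_xi = (total feature count, total count under this label),
    n_xx = grand total count."""
    n_ix, n_xi = n_ix_xi
    n_io = n_ix - n_ii                    # feature occurrences under other labels
    n_oi = n_xi - n_ii                    # other features under this label
    n_oo = n_xx - n_ii - n_io - n_oi      # everything else
    num = n_xx * (n_ii * n_oo - n_io * n_oi) ** 2
    den = (n_ii + n_io) * (n_ii + n_oi) * (n_io + n_oo) * (n_oi + n_oo)
    return num / float(den)

# A word seen 50 times, 45 of them in the 1000-word 'pos' half of a
# 2000-word corpus, scores far higher than a word split evenly.
skewed = chi_square(45, (50, 1000), 2000)
balanced = chi_square(25, (50, 1000), 2000)
print(skewed, balanced)
```

Summing this score over the labels, as compute_best_features does, ranks features by how unevenly they are distributed across the categories.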
This block of code computes the features and defines a function to extract the features corresponding to a list of words. You won't want to execute only part of this block - it's a coherent unit of code - so it's commented inline. It takes a little while to run because it's going through the whole corpus, but it's not so slow for right now that it's worth pickling the best_word_list and loading it in later.
stop_word_file = "stop-word-list.txt"
with open(stop_word_file) as f :
stop_words = set(line.strip() for line in f)
def candidate_feature_word(w) :
return w not in stop_words and re.match(r"^[a-z](?:'?[a-z])*$", w) != None
def movie_review_feature_generator(category) :
return (word
for word in movie_reviews.words(categories=[category])
if candidate_feature_word(word))
best_sentiment_words = compute_best_features(['pos', 'neg'], movie_review_feature_generator, 2000)
def best_sentiment_word_feats(words):
return dict([(word, True) for word in words if word in best_sentiment_words])
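The feature representation is deliberately simple: a dictionary marking which informative words are present. As a toy illustration with a made-up vocabulary (a hypothetical stand-in for best_sentiment_words):

```python
toy_vocab = {"outstanding", "ludicrous", "seamless"}  # hypothetical stand-in

def word_feats(words, vocab):
    # Same shape as best_sentiment_word_feats: present informative words
    # map to True; absent words simply don't appear in the dictionary.
    return dict((word, True) for word in words if word in vocab)

print(word_feats(["the", "film", "was", "outstanding"], toy_vocab))
# {'outstanding': True}
```

Words outside the vocabulary contribute nothing, which is what keeps the feature space down to the informative items selected by compute_best_features.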
We're going to explore a few ways of doing the classification, so we'll put some infrastructure in place. First, we load in all the data as a training_corpus
of (word_list, category)
pairs. Then, we create a dummy Python class called Experiment
that will let us package together comparable values made using different instantiations of the features and learning algorithms and play with the results.
training_corpus = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
class Experiment(object) :
pass
Our first experiment uses the best_word_feats
that we've just computed - it understands the sentiment in the text based on the most informative words that occur.
Here's the basic strategy for building and using the classifier: turn each training document into a (feature dictionary, category label) pair, train a naive Bayes model on those pairs, and preprocess any new text into the same feature representation before classifying it.
This also takes a moment to run as it scans through the corpus, makes the features, aggregates them into counts, and uses the counts to build a statistical model. Again, if it bugs you, you could pickle the classifier.
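If the training delay does bug you, caching with pickle is straightforward. Here's a hedged sketch of the pattern (the filename and the cached helper are my own, not part of NLTK):

```python
import os
import pickle

def cached(path, build):
    """Load a pickled object from path if it exists; otherwise call
    build(), pickle the result to path, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj

# e.g. classifier = cached("opinion_classifier.pickle",
#                          lambda: nltk.NaiveBayesClassifier.train(feature_data))
```

The first run pays the training cost; later runs just deserialize the saved model.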
expt1 = Experiment()
expt1.feature_data = [(best_sentiment_word_feats(d), c) for (d,c) in training_corpus]
expt1.opinion_classifier = nltk.NaiveBayesClassifier.train(expt1.feature_data)
expt1.preprocess = lambda text : best_sentiment_word_feats([w.lower() for w in re.findall(r"\w(?:'?\w)*", text)])
expt1.classify = lambda text : expt1.opinion_classifier.classify(expt1.preprocess(text))
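The preprocess step tokenizes raw text with the regular expression \w(?:'?\w)*, which allows internal apostrophes so that contractions survive as single tokens. A quick check of that behavior:

```python
import re

# Same pattern as expt1.preprocess: word characters, with optional
# internal apostrophes, lowercased.
tokens = [w.lower() for w in re.findall(r"\w(?:'?\w)*", "Don't miss this film!")]
print(tokens)
# ["don't", 'miss', 'this', 'film']
```

Punctuation other than internal apostrophes is silently dropped, which is fine for the movie-review features but (as we'll see) too aggressive for chat text.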
NLTK's show_most_informative_features
method allows you to see what the classifier has learned. You can see that a big effect comes from adjectives and a few verbs that do express really strong opinions one way or the other.
expt1.opinion_classifier.show_most_informative_features(20)
Most Informative Features
    avoids = True        pos : neg = 13.0 : 1.0
    astounding = True    pos : neg = 12.3 : 1.0
    slip = True          pos : neg = 11.7 : 1.0
    outstanding = True   pos : neg = 11.5 : 1.0
    ludicrous = True     neg : pos = 11.0 : 1.0
    fascination = True   pos : neg = 11.0 : 1.0
    insulting = True     neg : pos = 11.0 : 1.0
    sucks = True         neg : pos = 10.6 : 1.0
    hudson = True        neg : pos = 10.3 : 1.0
    hatred = True        pos : neg = 10.3 : 1.0
    seamless = True      pos : neg = 10.3 : 1.0
    thematic = True      pos : neg = 10.3 : 1.0
    dread = True         pos : neg = 9.7 : 1.0
    conveys = True       pos : neg = 9.7 : 1.0
    incoherent = True    neg : pos = 9.7 : 1.0
    addresses = True     pos : neg = 9.7 : 1.0
    annual = True        pos : neg = 9.7 : 1.0
    accessible = True    pos : neg = 9.7 : 1.0
    stupidity = True     neg : pos = 9.0 : 1.0
    illogical = True     neg : pos = 9.0 : 1.0
Here's an example of sentiment detection in action.
expt1.classify("The dinner was outstanding.")
u'pos'
Jacob Perkins recommends including particularly important bigrams in the feature representation of each document. Bigram is a fancy word for two words that occur successively in a document. NLTK's collocation finder selects bigrams that occur much more frequently than you would expect by chance - this is an indication that the two words together make an idiomatic expression for conveying a single concept that is important to the document.
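In case the terminology is unfamiliar: extracting the bigrams of a token list is just pairing each word with its successor; the collocation finder then ranks those pairs by how surprising their joint frequency is. A minimal sketch of the extraction step (not NLTK's implementation):

```python
def toy_bigrams(words):
    # Pair each token with the token that immediately follows it.
    return list(zip(words, words[1:]))

print(toy_bigrams(["not", "good", "at", "all"]))
# [('not', 'good'), ('good', 'at'), ('at', 'all')]
```

NLTK's BigramCollocationFinder does this pairing internally and then applies a scoring function such as chi_sq to pick out the pairs that co-occur more than chance would predict.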
Here we repeat the usual pipeline, adding up to 200 useful bigram features for each document.
def best_bigram_word_feats(words, score_fn=nltk.BigramAssocMeasures.chi_sq, n=200):
bigram_finder = nltk.BigramCollocationFinder.from_words(words)
bigrams = bigram_finder.nbest(score_fn, n)
d = dict([(bigram, True) for bigram in bigrams])
d.update(best_sentiment_word_feats(words))
return d
expt2 = Experiment()
expt2.feature_data = [(best_bigram_word_feats(d), c) for (d,c) in training_corpus]
expt2.opinion_classifier = nltk.NaiveBayesClassifier.train(expt2.feature_data)
expt2.preprocess = lambda text : best_bigram_word_feats([w.lower() for w in re.findall(r"\w(?:'?\w)*", text)])
expt2.classify = lambda text : expt2.opinion_classifier.classify(expt2.preprocess(text))
Perkins reports that the bigram features do lead to a measurable performance improvement. In particular, adding bigrams improves the recall of negative classification: the classifier catches many more of the reviews that are truly negative once it can include some of these complex expressions. (Conversely, it also improves the precision with which it recognizes positive reviews: the things it classifies as positive are more likely to actually be positive.) Probably this is because the classifier can now recognize that "not good" expresses a negative opinion...
There are a bunch of interesting bigrams as informative features in the new classifier.
expt2.opinion_classifier.show_most_informative_features(50)
Most Informative Features
    (u'give', u'us') = True           neg : pos = 14.3 : 1.0
    avoids = True                     pos : neg = 13.0 : 1.0
    (u'quite', u'frankly') = True     neg : pos = 12.3 : 1.0
    astounding = True                 pos : neg = 12.3 : 1.0
    (u'does', u'so') = True           pos : neg = 12.3 : 1.0
    (u'fairy', u'tale') = True        pos : neg = 11.7 : 1.0
    slip = True                       pos : neg = 11.7 : 1.0
    (u'&', u'robin') = True           neg : pos = 11.7 : 1.0
    outstanding = True                pos : neg = 11.5 : 1.0
    ludicrous = True                  neg : pos = 11.0 : 1.0
    insulting = True                  neg : pos = 11.0 : 1.0
    (u'batman', u'&') = True          neg : pos = 11.0 : 1.0
    fascination = True                pos : neg = 11.0 : 1.0
    (u'well', u'worth') = True        pos : neg = 11.0 : 1.0
    sucks = True                      neg : pos = 10.6 : 1.0
    (u'was', u'made') = True          neg : pos = 10.3 : 1.0
    hatred = True                     pos : neg = 10.3 : 1.0
    thematic = True                   pos : neg = 10.3 : 1.0
    hudson = True                     neg : pos = 10.3 : 1.0
    seamless = True                   pos : neg = 10.3 : 1.0
    (u'your', u'time') = True         neg : pos = 9.7 : 1.0
    addresses = True                  pos : neg = 9.7 : 1.0
    incoherent = True                 neg : pos = 9.7 : 1.0
    conveys = True                    pos : neg = 9.7 : 1.0
    dread = True                      pos : neg = 9.7 : 1.0
    (u'red', u'planet') = True        neg : pos = 9.7 : 1.0
    accessible = True                 pos : neg = 9.7 : 1.0
    (u'dealing', u'with') = True      neg : pos = 9.7 : 1.0
    annual = True                     pos : neg = 9.7 : 1.0
    stupidity = True                  neg : pos = 9.0 : 1.0
    (u'that', u'will') = True         pos : neg = 9.0 : 1.0
    reminder = True                   pos : neg = 9.0 : 1.0
    mulan = True                      pos : neg = 9.0 : 1.0
    (u'about', u'an') = True          neg : pos = 9.0 : 1.0
    excruciatingly = True             neg : pos = 9.0 : 1.0
    fairness = True                   neg : pos = 9.0 : 1.0
    illogical = True                  neg : pos = 9.0 : 1.0
    (u'about', u'it') = True          neg : pos = 9.0 : 1.0
    gump = True                       pos : neg = 9.0 : 1.0
    (u'best', u'supporting') = True   pos : neg = 9.0 : 1.0
    frances = True                    pos : neg = 9.0 : 1.0
    sans = True                       neg : pos = 9.0 : 1.0
    deft = True                       pos : neg = 9.0 : 1.0
    (u'ed', u'harris') = True         pos : neg = 9.0 : 1.0
    (u'be', u'funny') = True          neg : pos = 9.0 : 1.0
    winslet = True                    pos : neg = 9.0 : 1.0
    (u'saving', u'grace') = True      neg : pos = 8.6 : 1.0
    scum = True                       pos : neg = 8.3 : 1.0
    (u'makes', u'no') = True          neg : pos = 8.3 : 1.0
    predator = True                   neg : pos = 8.3 : 1.0
Now we turn to dialogue act tagging. NLTK comes with a collection of text chat utterances that were collected by Craig Martell and colleagues at the Naval Postgraduate School. These items have been hand annotated with a number of categories indicating the different roles the utterances play in a conversation. The list of tags appears here. The best way to understand what the tags mean is to see an example utterance from each class, so running this code also prints out some examples. The examples also show what the corpus is like -- including the way user names have been anonymized...
chat_utterances = nltk.corpus.nps_chat.xml_posts()
dialogue_acts = ['Accept',
'Bye',
'Clarify',
'Continuer',
'Emotion',
'Emphasis',
'Greet',
'nAnswer',
'Other',
'Reject',
'Statement',
'System',
'whQuestion',
'yAnswer',
'ynQuestion']
for a in dialogue_acts :
for u in chat_utterances :
if u.get('class') == a:
print "Example of {}: {}".format(a, u.text)
break
Example of Accept: 10-19-20sUser7 is a gay name.
Example of Bye: brb
Example of Clarify: sho*
Example of Continuer: and i dont even know what that means.
Example of Emotion: :P
Example of Emphasis: i thought of that!
Example of Greet: hey everyone
Example of nAnswer: no
Example of Other: 0
Example of Reject: don't golf clap me.
Example of Statement: now im left with this gay name
Example of System: PART
Example of whQuestion: whats everyone up to?
Example of yAnswer: yes 10-19-20sUser30
Example of ynQuestion: any ladis wanna chat? 29 m
This kind of language is pretty different from the edited writing that many NLP tools assume. The classifier itself doesn't much care what form its input takes, but it does pay to be smarter about dividing the text up into its tokens (the words or other meaningful elements). So we'll load in the tokenizer that Chris Potts wrote to analyze Twitter feeds, which does a nice job with things like emoticons, usernames, and creative spelling.
from happyfuntokenizing import Tokenizer
chat_tokenize = Tokenizer(preserve_case=False).tokenize
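happyfuntokenizing isn't bundled with NLTK, so here is a much-simplified stand-in (my own sketch, far less thorough than Potts's real tokenizer) just to show the kind of behavior we want: emoticons survive as single tokens instead of being shredded into punctuation.

```python
import re

# Hypothetical, simplified patterns: a rough emoticon shape and a word
# shape with optional internal apostrophes or hyphens.
EMOTICON = r"[<>]?[:;=8][\-o\*']?[\)\]\(\[dDpP/:\}\{@\|\\]"
WORD = r"\w(?:['\-]?\w)*"
TOKEN_RE = re.compile("({0})|({1})".format(EMOTICON, WORD))

def toy_tokenize(text):
    # Lowercase everything; Potts's tokenizer is smarter about
    # preserving case inside emoticons.
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]

print(toy_tokenize("hey everyone :P"))
# ['hey', 'everyone', ':p']
```

A whitespace-and-punctuation tokenizer would turn ":P" into noise; keeping it whole is exactly what makes it usable as a dialogue-act feature below.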
Now we set up the features for this data set. The code is closely analogous to what we did with the sentiment classifier earlier. The big differences are the tokenization and the fact that we skip stopword elimination. Content-free words and weird punctuation bits like what
and :)
are going to be very important for understanding what dialogue act somebody is performing so we need to keep those features around!
def chat_feature_generator(category) :
return (word
for post in chat_utterances
if post.get('class') == category
for word in chat_tokenize(post.text))
best_act_words = compute_best_features(dialogue_acts, chat_feature_generator, 2000)
def best_act_word_feats(words):
return dict([(word, True) for word in words if word in best_act_words])
def best_act_words_post(post) :
return best_act_word_feats(chat_tokenize(post.text))
Here again is the setup to build the classifier and apply it to novel text. No surprises here.
expt3 = Experiment()
expt3.feature_data = [(best_act_words_post(p), p.get('class')) for p in chat_utterances]
expt3.act_classifier = nltk.NaiveBayesClassifier.train(expt3.feature_data)
expt3.preprocess = lambda text : best_act_word_feats(chat_tokenize(text))
expt3.classify = lambda text : expt3.act_classifier.classify(expt3.preprocess(text))
Here's a little glimpse into what this classifier is paying attention to.
expt3.act_classifier.show_most_informative_features(20)
Most Informative Features
    hi = True       Greet : System  = 486.1 : 1.0
    bye = True      Bye : Statem    = 460.6 : 1.0
    part = True     System : Statem = 351.4 : 1.0
    brb = True      Bye : Statem    = 341.4 : 1.0
    no = True       nAnswe : System = 318.1 : 1.0
    yes = True      yAnswe : Emotio = 287.8 : 1.0
    nope = True     nAnswe : Statem = 276.4 : 1.0
    are = True      whQues : System = 228.5 : 1.0
    wanna = True    ynQues : System = 192.7 : 1.0
    > = True        Other : System  = 170.7 : 1.0
    u = True        whQues : System = 162.7 : 1.0
    what = True     whQues : Emotio = 158.2 : 1.0
    nite = True     Bye : Statem    = 157.1 : 1.0
    tc = True       Bye : Statem    = 157.1 : 1.0
    lol = True      Emotio : System = 156.5 : 1.0
    right = True    Accept : System = 146.3 : 1.0
    whats = True    whQues : Statem = 145.2 : 1.0
    any = True      ynQues : Greet  = 144.4 : 1.0
    0 = True        Other : Statem  = 139.1 : 1.0
    and = True      Contin : Emotio = 137.6 : 1.0
This demonstration wouldn't be complete without an illustration of how to use the classifiers we've created for an actual chatbot. We've already seen a whole bunch of ways to produce the content of the response -- I won't repeat that here. What's interesting here is to show how you can use the classification results in coherent ways to shape the course of the conversation.
The strategy I illustrate is to have a different response generator for each of the dialogue act types. Each response generator gets the input text (that's not used here, but you'd need it to make a pattern-matching response or an information-retrieval response like we've seen earlier). It also gets the recognized sentiment of the input text as an argument, so it can potentially do something different depending on whether the input is recognized as expressing a positive or a negative opinion.
I store the response generators in a dictionary -- Python doesn't have a switch
statement like C or Java, but it does have first-class functions. That makes a dictionary of functions the easiest way to choose among a range of behaviors conditioned on a value from a small set of possibilities (like the set of dialogue acts). So the basic pattern of a response is to classify the act and sentiment of the input, then call the response generator for the recognized act with the original text and the recognized sentiment.
Obviously, this is just an invitation to take this further....
def respond_question(text, valence) :
if valence == 'pos' :
return "I wish I knew."
else :
return "That's a tough question."
def respond_other(text, valence) :
return ":P Well, what next?"
def respond_statement(text, valence) :
if valence == 'pos' :
return "Great! Tell me more."
else :
return "Ugh. Is anything good happening?"
def respond_bye(text, valence) :
return "I guess it's time for me to go then."
def respond_greet(text, valence) :
return "Hey there!"
def respond_reject(text, valence) :
if valence == 'pos' :
return "Well, if you insist!"
else :
return "I still think you should reconsider."
def respond_emphasis(text, valence) :
if valence == 'pos' :
return '!!!'
else :
return ":("
responses = {'Accept': respond_other,
'Bye': respond_bye,
'Clarify': respond_other,
'Continuer': respond_other,
'Emotion': respond_other,
'Emphasis': respond_emphasis,
'Greet': respond_greet,
'nAnswer': respond_other,
'Other': respond_other,
'Reject': respond_reject,
'Statement': respond_statement,
'System': respond_other,
'whQuestion': respond_question,
'yAnswer': respond_other,
'ynQuestion': respond_question}
def respond(text) :
act = expt3.classify(text)
valence = expt1.classify(text)
return responses[act](text, valence)
respond("Everything sucks")
'Ugh. Is anything good happening?'
respond("I've got fantastic news!")
'!!!'
respond("A hot cup of tea always makes me happy.")
'Great! Tell me more.'
respond("Did you hear what happened to me?")
"That's a tough question."
respond("brb")
"I guess it's time for me to go then."