This file shows how you can use a corpus of utterances and information retrieval techniques to try to make a chatbot respond to a user in a relevant way.
The main idea here is the vector space model of documents and queries, along with the term frequency-inverse document frequency (TFIDF) representation for modeling document topics and the cosine metric for measuring document similarity. These techniques are old - the vector space model goes back to the '60s and TFIDF weighting was established in the '70s. There have been many improvements to information retrieval models since then, including Google's PageRank model to weight documents based on their importance and a variety of improved statistical models of document topics. However, these techniques still provide an excellent starting point for accessing items from text collections that are relevant to a topic, and they're very easy to implement and use!
This particular chatbot draws on a book of Miscellaneous Aphorisms that Oscar Wilde published in 1911. Most of the work comes in preprocessing this collection to index its candidate utterances using the TFIDF model so we can easily find the utterance that's most similar to what the user has just said.
import re, collections, math, random
The file of responses is formatted as a sequence of paragraphs, separated by blank lines. It's convenient to represent the utterances this way because many of them are quite long, and this way you can use the newlines in the file to have the utterances preformatted for the screen. This particular paragraphs detector uses Python's generator constructs to save its state as it successively returns chunks of data from the file. The code originates with Alex Martelli and Magnus Lie Hetland's recipe in the Python Cookbook.
def paragraphs(file, separator=None):
    if not callable(separator):
        if separator is not None:
            raise TypeError("separator must be callable")
        # By default, a paragraph break is a blank line.
        def separator(line): return line == '\n'
    paragraph = []
    for line in file:
        if separator(line):
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:
        yield ''.join(paragraph)
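To see the generator at work without a real file, you can feed it any file-like object; here io.StringIO stands in for an open file, and the sample text is made up just for illustration.

import io

sample = "First paragraph,\nspanning two lines.\n\nSecond paragraph.\n"
for p in paragraphs(io.StringIO(sample)):
    print(repr(p))
# 'First paragraph,\nspanning two lines.\n'
# 'Second paragraph.\n'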
Read in the things the chatbot will be able to say as paragraphs from the specified file. strip() eliminates any final newlines.
response_file = "oscar-wilde-quotes.txt"
with open(response_file) as f:
    responses = [p.strip() for p in paragraphs(f)]
Information retrieval models of similarity focus on the important words. There are many words in English text that occur frequently and that don't carry a lot of information about the topic of the text. Such words are known as stop words in information retrieval, and I've downloaded a common collection of stop words from a kind of random-looking website for use here.
We just create a set of all the stop words to exclude them from our vocabulary indexing later. In particular, we write a helper function content_words that takes a string, breaks it into the words that make it up in a simple way (we can ignore punctuation for indexing, since punctuation certainly doesn't say anything about the content of a document!), and makes a list of all the word tokens that are not in the stop word set.
stop_word_file = "stop-word-list.txt"
with open(stop_word_file) as f:
    stop_words = set(line.strip() for line in f)

def content_words(text):
    return [w.lower() for w in re.findall(r"\w(?:'?\w)*", text)
            if w.lower() not in stop_words]
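For instance, assuming the downloaded list includes ordinary function words like "the", "is", "and", and "never" (the exact output depends on the stop-word list you use):

content_words("The truth is rarely pure and never simple.")
# => something like ['truth', 'rarely', 'pure', 'simple']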
Vector space models treat training documents differently from test documents. They have to, because language has lots of rare words. You will almost always see words in your test documents (your queries) that you haven't seen anywhere in your corpus of possible responses. You have no choice but to ignore these new words. (Or do you?)
The situation is different for the training documents: here you want to record each new word that you see so that if it comes up again (in training or test) you can associate the word with the documents where it occurs.
So we now define two different procedures that will count up the words in a document, to make the famous bag of words representation. In the case of a training document, we keep all the content words that occur in the document, and add any new ones to a list of the vocabulary of our collection. In the case of a test document, we only keep the words that already occur in the vocabulary, discarding counts for new words.
def count_train(text, vocabulary):
    words = content_words(text)
    vocabulary.update(words)
    return collections.Counter(words)

def count_test(text, vocabulary):
    words = [w for w in content_words(text) if w in vocabulary]
    return collections.Counter(words)
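A toy example, with invented strings, makes the difference concrete: training grows the vocabulary, while testing silently drops anything the corpus never taught us.

toy_vocab = set()
count_train("genius is born, not paid", toy_vocab)
count_test("genius will be taxed", toy_vocab)
# => Counter({'genius': 1}), assuming 'taxed' never appeared in training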
We can now define the key quantities that the information retrieval model uses to weight words. The term frequency (TF) of a word is its count in a document, normalized by the document's total word count; the inverse document frequency (IDF) of a word is the log of the ratio between the number of documents in the collection and the number of documents the word occurs in, so a word that shows up everywhere gets a weight of zero while a distinctive word gets a large one. Each word's score in a document is the product of the two. Finally, because we're interested in cosine measures of similarity, which we explain more below, we normalize the scores in the representation of each document, so each document is associated with a vector (in TFIDF space) with length 1.
def mk_idf(vocabulary, counts):
    ndocs = float(len(counts))
    result = dict()
    for w in vocabulary:
        # The sum here is the number of documents in which w occurs at least once.
        result[w] = math.log(ndocs / sum(1 if count[w] > 0 else 0 for count in counts))
    return result
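One way to see what the weighting does: a word that occurs in every document gets an IDF of log(1) = 0, so it's effectively ignored, while a word in only one of two documents gets log(2) ≈ 0.69. A toy corpus of bag-of-words counts, invented for illustration:

toy_idf = mk_idf({'wit', 'truth'},
                 [collections.Counter({'wit': 2, 'truth': 1}),
                  collections.Counter({'wit': 1})])
# => {'wit': 0.0, 'truth': 0.6931...}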
def mk_tf_idf(count, idf):
    total = sum(count[w] for w in count)
    result = dict((w, idf[w] * count[w] / total) for w in count)
    # Normalize to a unit vector so dot products are cosines.
    length = math.sqrt(sum(result[w] * result[w] for w in result))
    for w in result:
        result[w] = result[w] / length
    return result
This code does all the work of setting up the vector space representation of our text collection. We accumulate all the words that occur in our corpus in vocabulary, and accumulate in counts a bag of words dictionary for each of our texts. Then we can compile the inverse document frequencies for all our vocabulary items as idf_dict, and finally create a normalized, weighted TFIDF vector representation for each document in scores.
vocabulary = set()
counts = [count_train(utt, vocabulary) for utt in responses]
idf_dict = mk_idf(vocabulary, counts)
scores = [mk_tf_idf(count, idf_dict) for count in counts]
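A quick sanity check, if you want one: the exact sizes depend on the quote file, but every document vector should come out with length 1 up to floating point.

print(len(vocabulary), len(responses))
print(sum(v * v for v in scores[0].values()))   # very close to 1.0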
This function defines the cosine similarity measure. In our model, each document is interpreted as a vector. Think of it as an arrow pointing from the origin of the space off in a particular direction (representing its topic). When two arrows point in close to the same direction, they are similar to each other. The natural distance between two arrows is the angle between them, and you can measure the degree to which the arrows are aligned by the cosine of this angle: a number that's 1 when the arrows point exactly the same way, so the angle is 0, and 0 when the arrows are perpendicular to one another, so the angle is 90 degrees. Because we normalized each document vector to length 1, the cosine is exactly the dot product, which is the sum of the products of corresponding components in the vectors. So this code is very simple, but it has a lot of intuition behind it:
def similarity(d1, d2):
    # Dot product over the shared words; absent words contribute zero.
    return sum(d1[w] * d2[w] for w in d1 if w in d2)
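Since every vector has length 1, a document is always perfectly similar to itself, and the sum only visits words the two documents share, so the computation is cheap.

print(similarity(scores[0], scores[0]))   # 1.0, up to floating point
print(similarity(scores[0], scores[1]))   # somewhere between 0 and 1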
The strategy for our chatbot will now be to respond by retrieval: to make an utterance that seems to be as related as possible to the utterance provided by the user. So we measure the similarity of all our response data to what the user just said, and sort the responses by similarity.
def sort_responses(tf_idf_utt, scores, utts):
    options = [(utts[i], similarity(tf_idf_utt, scores[i]))
               for i in range(len(utts))]
    # Sort by similarity score, most relevant first.
    return sorted(options, key=lambda option: option[1], reverse=True)
Rather than simply saying the most relevant thing all the time, we will randomize our responses. We prefer the most relevant ones - more precisely, we say each response with a probability that's proportional to the relevance we have estimated.
def weighted_random_item(options):
    total = sum(w for (u, w) in options)
    if total == 0:
        # Guard added: if the input shares no words with any response,
        # every weight is 0, so fall back to a uniform random choice.
        return random.choice(options)[0]
    r = random.uniform(0, total)
    i = 0
    while True:
        u, w = options[i]
        if r < w or i == len(options) - 1:
            return u
        else:
            r = r - w
            i = i + 1
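A quick illustration with made-up weights: over many draws, each option should come up roughly in proportion to its weight.

draws = collections.Counter(weighted_random_item([('a', 3.0), ('b', 1.0)])
                            for _ in range(10000))
print(draws)   # roughly Counter({'a': 7500, 'b': 2500})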
Here's the main response engine: take input text, process it to compute its vector representation, and then say something relevant in response.
def respond(text):
    content = count_test(text, vocabulary)
    tf_idf_utt = mk_tf_idf(content, idf_dict)
    return weighted_random_item(sort_responses(tf_idf_utt, scores, responses))
Some example responses. You get a sense of Oscar Wilde's personality and interests from this. Probably you could hope for a better collection of responses - maybe something with a mix of facts and opinions? And maybe there is a better way to measure continuity of topic - perhaps by prioritizing words at the beginning of documents? Or using a part of speech tagger to focus on particular categories of words? It's all worth exploring.
print(respond("love is splendid."))
The only way to behave to a woman, is to make love to her if she is pretty and to someone else if she is plain.
print(respond("a fine romance with no kisses."))
What are the virtues? Nature, Renan tells us, cares little about chastity, and it may be that it is to the shame of the Magdalen, and not to their own purity, that the Lucretias of modern life owe their freedom from stain. Charity, as even those of whose religion it makes a formal part have been compelled to acknowledge, creates a multitude of evils. The mere existence of conscience, that faculty of which people prate so much nowadays, and are so ignorantly proud, is a sign of our imperfect development. It must be merged in instinct before we become fine. Self-denial is simply a method by which man arrests his progress, and self-sacrifice a survival of the mutilation of the savage, part of that old worship of pain which is so terrible a factor in the history of the world, and which even now makes its victims day by day and has its altars in the land. Virtues! Who knows what the virtues are? Not you. Not I. Not anyone. It is well for our vanity that we slay the criminal, for if we suffered him to live he might show us what we had gained by his crime. It is well for his peace that the saint goes to his martyrdom. He is spared the sight of the horror of his harvest.
print(respond("i'm not feeling so well today doctor. can you help me?"))
In this world there are only two tragedies. One is not getting what one wants, and the other is getting it. The last is much the worst -- the last is a real tragedy.
print(respond("i will take that as a no."))
Men of the noblest possible moral character are extremely susceptible to the influence of the physical charms of others. Modern, no less than ancient, history supplies us with many most painful examples of what I refer to. If it were not so, indeed, history would be quite unreadable.
print(respond("slots are my favorite animal."))
One is tempted to define man as a rational animal who always loses his temper when he is called upon to act in accordance with the dictates of reason.
A final reminder -- here too we can use our usual code to create a chatbot interaction by invoking this script as the argument to python in a terminal.
if __name__ == '__main__':
    print("""
Therapist
---------
Talk to the program by typing in plain English, using normal upper-
and lower-case letters and punctuation. Enter "quit" when done.""")
    print('=' * 72)
    print("Hello. What have you been thinking about lately?")
    s = ""
    while s != "quit":
        s = input(">")
        # Strip trailing sentence-final punctuation before matching.
        while s and s[-1] in "!.":
            s = s[:-1]
        if s != "quit":
            print(respond(s))