Using Python + natural language processing for topic modeling: an unsupervised technique for document classification.
Imagine you have a huge collection of documents, each one about a specific subject. A document could be a movie description, part of a book, a message, a tweet, etc.
If you take the time to read each document, you will learn that it talks about science, politics, medicine, sports, and so on, but very often documents don't come with a label specifying the subject.
Now it's your job to group them by subject. Will you read and label them one by one? What if your collection contains 1 billion documents?
This is where topic modeling comes in handy: it is a very useful technique for document classification through unsupervised learning. It learns from the collection of documents as a whole and then suggests groups (clusters) of documents by similarities, such as word frequencies or word probabilities in each document.
After the documents are split into the suggested groups, we can then look at each group (through samples of them) and choose a proper label for it.
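To build some intuition for "grouping documents by word similarity", here is a crude sketch (not part of any topic-modeling library) that scores two documents by raw word overlap; real topic models work with probabilistic word distributions instead, but the idea of "more shared words, more similar" is the same:

```python
def word_overlap(doc_a, doc_b):
    """Jaccard similarity between the word sets of two documents."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

# documents about the same subject tend to share more words
word_overlap("the team won the match", "the team lost the match")      # 0.6
word_overlap("the team won the match", "parliament voted the new law") # 0.125
```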
This clustering of documents by subject can be achieved with different techniques, such as LSA/LSI, PCA combined with regular clustering, and LDA.
In this notebook we will discuss the LDA method.
LDA is a special case of latent semantic analysis in which the prior distribution over topics is assumed to be the multivariate generalization of the beta distribution, a.k.a. the Dirichlet distribution.
The main advantage of LDA over LSI, PCA or regular clustering is that LDA is capable of detecting intermediate topics between the ones the others would detect, since those methods work on principal components and find only orthogonal topics. Thus, LDA tends to reduce overfitting and increase accuracy.
On the other hand, LDA demands more computation time.
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
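To see what a draw from a Dirichlet prior over topics looks like, we can simulate one with the standard library alone (normalizing independent gamma samples is one standard way to sample a Dirichlet; this is only an illustration, not part of the pipeline below):

```python
import random

def dirichlet_sample(alpha, k):
    """One draw from a symmetric Dirichlet(alpha) over k topics."""
    gammas = [random.gammavariate(alpha, 1) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

random.seed(42)
# a small alpha favors sparse mixtures: each document concentrates on few topics
topic_proportions = dirichlet_sample(0.1, 6)
# proportions are non-negative and sum to 1, like any probability vector
```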
Imagine we are a big online retailer and we want to provide a new communication channel for our customers.
We used to have a real-time online chat where a human attendant answered our customers. We have stored all of the customers' messages in our database, and now we want to create an algorithm capable of answering customers just like a human attendant would.
But how do we answer our customers in a natural way, simulating the behavior of a human attendant?
To achieve that goal, we are going to use natural language processing and topic modeling. We will combine different techniques for analyzing the customers' queries and questions in order to elaborate an appropriate, natural answer.
The techniques are: topic modeling (LDA) to identify the subject of each message, keyword extraction (order numbers, product names), and template-based answers for each subject.
E.g.: if the subject is a complaint, the algorithm should compose an answer taking this information into account; if the message is a salutation, then salute back.
In order to "train" our algorithm, we must retrieve stored data from our databases. Note that this is not the typical training process because we don't have any targets or labels ready at this point. This is, in fact, an unsupervised training process.
If the database in question is a NoSQL database like MongoDB, we could run queries like:
db.mensagens.find(
    {
        $and: [
            {"user": {$in: ["user_1", "user_2", "attd_1", "sell_1"]}},
            {"message": {$ne: null}}
        ]
    }
).pretty()
on MongoDB Compass.
If the messages are stored in an XML format, we could use the Beautiful Soup library to scrape the data:
db = BeautifulSoup(open('db.xml').read(), "lxml")
messages = db.findAll('message')
If the messages are stored in a txt file, we could scrape them using something as simple as:
messages = []
with open("db.txt", "r") as incoming:
    for line in incoming:
        if line.startswith('user'):
            messages.append(line)
For this particular notebook, we are going to use a very small set of messages containing some conversations between customers and attendants. The customers are identified by the id "user_1".
conversations = [
# small talk
[
{'user': 'user_1', 'message': 'Hi, how are you?', 'status': ''},
{'user': 'user_2', 'message': 'fine! and you? ', 'status': ''},
{'user': 'user_1', 'message': ' I\'m ok!!!', 'status': ''},
{'user': 'user_1', 'message': 'got any sale today?', 'status': ''},
{'user': 'user_2', 'message': 'today we have a 50\"tv" \n ', 'status': ''},
],
# customer service
[
{'user': 'user_1', 'message': 'Where is my iphone?!' , 'status': 'payment_approved'},
{'user': 'attd_1', 'message': 'Hello, your payment has been approved' , 'status': 'payment_approved'},
{'user': 'attd_1', 'message': 'Is the product on delivery route' , 'status': 'payment_approved'},
{'user': 'user_1', 'message': 'But it\'s 5 days already!' , 'status': 'payment_approved'},
{'user': 'attd_1', 'message': 'our delivery should take five working days ' , 'status': 'payment_approved'},
{'user': 'user_1', 'message': 'ow it\'s true' , 'status': 'payment_approved'},
],
# sale
[
{'user': 'user_1', 'message': 'Where is the iphone 10?' , 'status': 'shopping'},
{'user': 'sell_1', 'message': 'Hello! the iphone X it is out of stock;' , 'status': 'shopping'},
{'user': 'user_1', 'message': 'Huum, what do you have available?' , 'status': 'shopping'},
{'user': 'sell_1', 'message': 'We have the iphone X plus and the samsung s8', 'status': 'shopping'},
{'user': 'user_1', 'message': 'Is the samsung better than the iphone?' , 'status': 'shopping'},
{'user': 'sell_1', 'message': 'They are different, but they are the best' , 'status': 'shopping'},
],
]
# let's retrieve only the customers' messages
user_messages = []
for chat in conversations:
    user_messages.append([message["message"] for message in chat if message["user"] == "user_1"])
user_messages = sum(user_messages, [])
user_messages
['Hi, how are you?', " I'm ok!!!", 'got any sale today?', 'Where is my iphone?!', "But it's 5 days already!", "ow it's true", 'Where is the iphone 10?', 'Huum, what do you have available?', 'Is the samsung better than the iphone?']
# let's add some common customer messages to the list
add_msg = ["Good morning!",
"Good night!",
"Good evening",
"I'd like to make a complaint",
"I'd like to make a exchange",
"I'd like to make a refund",
"What's the best mobile?",
"my smart phone is broken!",
"I can't finish my purchase",
"the site is not working!",
"When it is going to arrive?",
"What's is the delivery time?",
"this device is really bad",
"I need some help choosing a mobile",
"do you accept credit cards?",
"the tv arrived already broken",
"the device arrived already broken",
"the device doesn't work"
]
all_msg = user_messages+add_msg
all_msg
['Hi, how are you?', " I'm ok!!!", 'got any sale today?', 'Where is my iphone?!', "But it's 5 days already!", "ow it's true", 'Where is the iphone 10?', 'Huum, what do you have available?', 'Is the samsung better than the iphone?', 'Good morning!', 'Good night!', 'Good evening', "I'd like to make a complaint", "I'd like to make a exchange", "I'd like to make a refund", "What's the best mobile?", 'my smart phone is broken!', "I can't finish my purchase", 'the site is not working!', 'When it is going to arrive?', "What's is the delivery time?", 'this device is really bad', 'I need some help choosing a mobile', 'do you accept credit cards?', 'the tv arrived already broken', 'the device arrived already broken', "the device doesn't work"]
In order to get the messages classified by topics, we must perform some pre-processing on them.
LDA topic modeling sees each document as a bag of words (BOW), so we need to start by transforming each message that way.
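A bag of words simply maps each token to its count in the document, discarding word order entirely. A minimal sketch with the standard library (gensim's doc2bow, used later, yields the same information as (token_id, count) pairs):

```python
from collections import Counter

message = "the device arrived already broken broken"
bow = Counter(message.split())
# word order is gone; only the counts remain:
# Counter({'broken': 2, 'the': 1, 'device': 1, 'arrived': 1, 'already': 1})
```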
As the first step to get our BOW, we must build a token generator that provides: lowercase tokens only, no punctuation or numbers, a minimum token length, and corrected spelling.
Those restrictions are applied in order to avoid unnecessary complexity: there is no evident gain otherwise.
So, for example, we are assuming that variants such as "Samsung", "samsung" and "samjung" will be seen as the same word by our model.
The chosen spell checker is the enchant project: https://github.com/AbiWord/enchant
It depends on a local dictionary, so we must install MySpell:
# sudo apt-get install myspell-en-us
Besides, let's add some words to the dictionary, such as manufacturers and products:
terms = ["samsung", "motorola", "apple", "iphone", "pixel", "google"]
file = "terms.txt"
with open(file, "w") as text_file:
    for term in terms:
        print(term, file=text_file)
import enchant
d = enchant.DictWithPWL("en_US",file)
d.check("Samsung") # if True, the spell checker knows the word
True
The suggest() method provides a list of candidates for fixing the spelling. Usually the first option is the best, but it doesn't always work as expected:
d.suggest("Samjung")
['Jung', 'Smugging', 'Samarkand', 'Samsung']
Instead, we should choose a proper fix through a similarity comparison.
Thus, we will use the methods:
difflib.SequenceMatcher()
difflib.SequenceMatcher().ratio()
The SequenceMatcher class compares pairs of sequences in a human-friendly way, and its ratio() method evaluates the similarity of the pair. Values above 0.6 indicate a match.
import difflib
difflib.SequenceMatcher(None, "samsung", "sony").ratio()
0.36363636363636365
difflib.SequenceMatcher(None, "samsung", "sansumg").ratio()
0.7142857142857143
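For this kind of cutoff, difflib also offers get_close_matches, whose default cutoff is exactly 0.6; it could be applied directly over a list of candidates:

```python
import difflib

candidates = ["samsung", "sony", "motorola"]
# only "samsung" scores above the 0.6 cutoff, as the ratios above show
difflib.get_close_matches("sansumg", candidates, n=1, cutoff=0.6)
# ['samsung']
```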
def spell_checker(word):
    best_fix = ""
    best_ratio = 0  # starting with similarity 0
    suggestions = set(d.suggest(word))
    for suggestion in suggestions:
        tmp = difflib.SequenceMatcher(None, word, suggestion).ratio()
        if tmp > best_ratio:
            best_fix = suggestion
            best_ratio = tmp  # raises the bar for the next comparisons
    return best_fix
spell_checker("samsungui")
'samsung'
spell_checker("samjung")
'samsung'
The token generator will transform the messages into BOW.
It must: lowercase the message, tokenize it, filter tokens by size, fix spelling when needed, and lemmatize (or stem) the tokens.
Stemming is faster, while lemmatizing is more precise but takes more computing time.
From Wikipedia:
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
http://en.wikipedia.org/wiki/Lemmatisation
The choice between the former and the latter really depends on the application. Sometimes just stemming will be enough.
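To make the difference concrete, here is a toy sketch (a naive suffix stripper, not NLTK's Porter stemmer, and a tiny hand-made lemma table, not WordNet): the stemmer works blindly on suffixes, while a lemmatizer can map an irregular form to its dictionary headword:

```python
def toy_stem(word):
    # blind suffix stripping: no knowledge of context or irregular forms
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# a real lemmatizer consults a dictionary (e.g. WordNet) instead of this toy table
LEMMAS = {"better": "good", "ran": "run", "feet": "foot"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

toy_stem("studies")      # "stud"  - crude but fast
toy_lemmatize("better")  # "good"  - requires dictionary knowledge a stemmer lacks
toy_stem("better")       # "better" - the stemmer cannot see the irregular form
```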
import nltk
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
import re # for regular expressions
def tokenizator(message, size=1, fix=1, lemmatize=1):
    message = message.lower()  # lowercase
    # get tokens, keeping accented letters, excluding punctuation and numbers
    #tokens = re.findall("[-'a-zA-ZÀ-ÖØ-öø-ÿ]+", message)
    tokens = nltk.tokenize.word_tokenize(message)
    # filter words by size
    tokens = [token for token in tokens if len(token) > size]
    # spell check and correct only if needed and only if len > 2
    if fix:
        #tokens = [spell_checker(token) if not d.check(token) else token for token in tokens]
        tokens = [spell_checker(token) if (not d.check(token) and len(token) > 2) else token for token in tokens]
    # lemmatizing words
    if lemmatize:
        tokens = [lemma.lemmatize(t) for t in tokens]
    # let's keep stop words for now because the documents are already too small
    #tokens = [t for t in tokens if t not in stopWords]  # remove stopwords
    return tokens
Let's now transform the messages from our database into tokens, as this is required to later obtain our bag of words.
tokens = map(lambda x: tokenizator(x), all_msg)
msg_pro = list(tokens)
import pandas as pd
compare = pd.DataFrame({"origin": all_msg, "tokenized": msg_pro})
compare
origin | tokenized | |
---|---|---|
0 | Hi, how are you? | [hi, how, are, you] |
1 | I'm ok!!! | ['m, ok] |
2 | got any sale today? | [got, any, sale, today] |
3 | Where is my iphone?! | [where, is, my, iphone] |
4 | But it's 5 days already! | [but, it, 's, day, already] |
5 | ow it's true | [ow, it, 's, true] |
6 | Where is the iphone 10? | [where, is, the, iphone, 10] |
7 | Huum, what do you have available? | [hum, what, do, you, have, available] |
8 | Is the samsung better than the iphone? | [is, the, samsung, better, than, the, iphone] |
9 | Good morning! | [good, morning] |
10 | Good night! | [good, night] |
11 | Good evening | [good, evening] |
12 | I'd like to make a complaint | ['d, like, to, make, complaint] |
13 | I'd like to make a exchange | ['d, like, to, make, exchange] |
14 | I'd like to make a refund | ['d, like, to, make, refund] |
15 | What's the best mobile? | [what, 's, the, best, mobile] |
16 | my smart phone is broken! | [my, smart, phone, is, broken] |
17 | I can't finish my purchase | [ca, don't, finish, my, purchase] |
18 | the site is not working! | [the, site, is, not, working] |
19 | When it is going to arrive? | [when, it, is, going, to, arrive] |
20 | What's is the delivery time? | [what, 's, is, the, delivery, time] |
21 | this device is really bad | [this, device, is, really, bad] |
22 | I need some help choosing a mobile | [need, some, help, choosing, mobile] |
23 | do you accept credit cards? | [do, you, accept, credit, card] |
24 | the tv arrived already broken | [the, tv, arrived, already, broken] |
25 | the device arrived already broken | [the, device, arrived, already, broken] |
26 | the device doesn't work | [the, device, doe, don't, work] |
The gensim package brings the tools needed to implement an LDA analysis in Python.
from gensim import corpora, models
dictionary = corpora.Dictionary(msg_pro) # getting a dictionary from our collection
body = [dictionary.doc2bow(msg) for msg in msg_pro] # term matrix from our collection
We must choose a starting number of topics according to our knowledge about the collection, just as we would do when performing k-means clustering, where we have to choose a starting number of clusters. Later we will analyze the proper metrics and decide whether to increase or decrease the number of clusters/topics.
Setting the parameters for the LDA training:
Let's also record the time it takes to finish the process.
import time
import random
start = time.time()
random.seed(95276)
model = models.ldamodel.LdaModel(body, num_topics=6, id2word=dictionary, passes=100,
                                 # alpha=.1, eta=5, random_state=95276)
                                 alpha=.05, eta=5, random_state=95276)
# alpha (document/topic relationship) and eta (topic-words relationship)
# could be set to learn from data - "auto" setting
print("\n --- %s seconds ---" % round((time.time() - start),4))
--- 1.892 seconds ---
model.print_topics(num_topics=6, num_words=3)
[(0, '0.032*"the" + 0.030*"is" + 0.018*"iphone"'), (1, '0.016*"today" + 0.016*"got" + 0.016*"sale"'), (2, '0.020*"to" + 0.020*"\'d" + 0.020*"make"'), (3, '0.016*"mobile" + 0.016*"choosing" + 0.016*"some"'), (4, '0.016*"my" + 0.016*"don\'t" + 0.016*"finish"'), (5, '0.020*"good" + 0.020*"you" + 0.018*"do"')]
So we don't have to train the model again, let's save it to disk.
import pickle
# writes to disk
pickle.dump(model, open("lda.model", 'wb'))
pickle.dump(dictionary, open("dictionary.model", 'wb'))
# loads back
model = pickle.load(open("lda.model", 'rb'))
dictionary = pickle.load(open("dictionary.model", 'rb'))
model.print_topics(num_topics=6, num_words=3)
[(0, '0.032*"the" + 0.030*"is" + 0.018*"iphone"'), (1, '0.016*"today" + 0.016*"got" + 0.016*"sale"'), (2, '0.020*"to" + 0.020*"\'d" + 0.020*"make"'), (3, '0.016*"mobile" + 0.016*"choosing" + 0.016*"some"'), (4, '0.016*"my" + 0.016*"don\'t" + 0.016*"finish"'), (5, '0.020*"good" + 0.020*"you" + 0.018*"do"')]
The bigger the distinction between groups, the better. We can improve this distinction by adjusting the model parameters when training it.
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(model, body, dictionary)
pyLDAvis.enable_notebook()
vis
def get_topic(message):
    tokens = tokenizator(message)
    bow = dictionary.doc2bow(tokens)
    # model[bow] returns all the possible topics with a score that tells
    # the probability of the message belonging to each topic.
    #
    # we sort it to get the topic with the highest probability.
    # Also, if the probability is lower than 60%, we choose to say that we just
    # don't know what the message is about (category 6)
    #guess = sorted(model[bow][0], key=lambda y: y[1], reverse=True)[0]  # returns the most likely topic and its score
    guess = sorted(model[bow], key=lambda y: y[1], reverse=True)[0]  # returns the most likely topic and its score
    return 6 if guess[1] < .6 else guess[0]
df = pd.DataFrame({"messages": all_msg})
df["tokens"] = df.messages.apply(lambda x: tokenizator(x))
df["topic"] = df.messages.apply(lambda x: get_topic(x))
df.sort_values(by="topic")
messages | tokens | topic | |
---|---|---|---|
19 | When it is going to arrive? | [when, it, is, going, to, arrive] | 0 |
20 | What's is the delivery time? | [what, 's, is, the, delivery, time] | 0 |
16 | my smart phone is broken! | [my, smart, phone, is, broken] | 0 |
3 | Where is my iphone?! | [where, is, my, iphone] | 0 |
15 | What's the best mobile? | [what, 's, the, best, mobile] | 0 |
21 | this device is really bad | [this, device, is, really, bad] | 0 |
6 | Where is the iphone 10? | [where, is, the, iphone, 10] | 0 |
8 | Is the samsung better than the iphone? | [is, the, samsung, better, than, the, iphone] | 0 |
18 | the site is not working! | [the, site, is, not, working] | 0 |
11 | Good evening | [good, evening] | 1 |
9 | Good morning! | [good, morning] | 1 |
2 | got any sale today? | [got, any, sale, today] | 1 |
1 | I'm ok!!! | ['m, ok] | 1 |
10 | Good night! | [good, night] | 1 |
12 | I'd like to make a complaint | ['d, like, to, make, complaint] | 2 |
14 | I'd like to make a refund | ['d, like, to, make, refund] | 2 |
13 | I'd like to make a exchange | ['d, like, to, make, exchange] | 2 |
22 | I need some help choosing a mobile | [need, some, help, choosing, mobile] | 3 |
5 | ow it's true | [ow, it, 's, true] | 3 |
24 | the tv arrived already broken | [the, tv, arrived, already, broken] | 4 |
26 | the device doesn't work | [the, device, doe, don't, work] | 4 |
25 | the device arrived already broken | [the, device, arrived, already, broken] | 4 |
4 | But it's 5 days already! | [but, it, 's, day, already] | 4 |
17 | I can't finish my purchase | [ca, don't, finish, my, purchase] | 4 |
7 | Huum, what do you have available? | [hum, what, do, you, have, available] | 5 |
23 | do you accept credit cards? | [do, you, accept, credit, card] | 5 |
0 | Hi, how are you? | [hi, how, are, you] | 5 |
After applying the model, we can now look at a small sample of each group so we can add a friendly label to it.
#showing a sample of each group, to aid labeling each one
df.groupby('topic').apply(lambda x: x.sample(frac=.8))
messages | tokens | topic | ||
---|---|---|---|---|
topic | ||||
0 | 6 | Where is the iphone 10? | [where, is, the, iphone, 10] | 0 |
8 | Is the samsung better than the iphone? | [is, the, samsung, better, than, the, iphone] | 0 | |
3 | Where is my iphone?! | [where, is, my, iphone] | 0 | |
16 | my smart phone is broken! | [my, smart, phone, is, broken] | 0 | |
15 | What's the best mobile? | [what, 's, the, best, mobile] | 0 | |
18 | the site is not working! | [the, site, is, not, working] | 0 | |
20 | What's is the delivery time? | [what, 's, is, the, delivery, time] | 0 | |
1 | 2 | got any sale today? | [got, any, sale, today] | 1 |
1 | I'm ok!!! | ['m, ok] | 1 | |
11 | Good evening | [good, evening] | 1 | |
9 | Good morning! | [good, morning] | 1 | |
2 | 14 | I'd like to make a refund | ['d, like, to, make, refund] | 2 |
13 | I'd like to make a exchange | ['d, like, to, make, exchange] | 2 | |
3 | 5 | ow it's true | [ow, it, 's, true] | 3 |
22 | I need some help choosing a mobile | [need, some, help, choosing, mobile] | 3 | |
4 | 26 | the device doesn't work | [the, device, doe, don't, work] | 4 |
17 | I can't finish my purchase | [ca, don't, finish, my, purchase] | 4 | |
24 | the tv arrived already broken | [the, tv, arrived, already, broken] | 4 | |
4 | But it's 5 days already! | [but, it, 's, day, already] | 4 | |
5 | 23 | do you accept credit cards? | [do, you, accept, credit, card] | 5 |
7 | Huum, what do you have available? | [hum, what, do, you, have, available] | 5 |
# a dictionary of topics
labels = ["delivery", "salutation", "demand", "comparisson", "problem", "indication", "unknown"]
labels = dict(zip(range(7), labels))
labels
{0: 'delivery', 1: 'salutation', 2: 'demand', 3: 'comparisson', 4: 'problem', 5: 'indication', 6: 'unknown'}
df["label"] = df.topic.replace(labels)
df
messages | tokens | topic | label | |
---|---|---|---|---|
0 | Hi, how are you? | [hi, how, are, you] | 5 | indication |
1 | I'm ok!!! | ['m, ok] | 1 | salutation |
2 | got any sale today? | [got, any, sale, today] | 1 | salutation |
3 | Where is my iphone?! | [where, is, my, iphone] | 0 | delivery |
4 | But it's 5 days already! | [but, it, 's, day, already] | 4 | problem |
5 | ow it's true | [ow, it, 's, true] | 3 | comparisson |
6 | Where is the iphone 10? | [where, is, the, iphone, 10] | 0 | delivery |
7 | Huum, what do you have available? | [hum, what, do, you, have, available] | 5 | indication |
8 | Is the samsung better than the iphone? | [is, the, samsung, better, than, the, iphone] | 0 | delivery |
9 | Good morning! | [good, morning] | 1 | salutation |
10 | Good night! | [good, night] | 1 | salutation |
11 | Good evening | [good, evening] | 1 | salutation |
12 | I'd like to make a complaint | ['d, like, to, make, complaint] | 2 | demand |
13 | I'd like to make a exchange | ['d, like, to, make, exchange] | 2 | demand |
14 | I'd like to make a refund | ['d, like, to, make, refund] | 2 | demand |
15 | What's the best mobile? | [what, 's, the, best, mobile] | 0 | delivery |
16 | my smart phone is broken! | [my, smart, phone, is, broken] | 0 | delivery |
17 | I can't finish my purchase | [ca, don't, finish, my, purchase] | 4 | problem |
18 | the site is not working! | [the, site, is, not, working] | 0 | delivery |
19 | When it is going to arrive? | [when, it, is, going, to, arrive] | 0 | delivery |
20 | What's is the delivery time? | [what, 's, is, the, delivery, time] | 0 | delivery |
21 | this device is really bad | [this, device, is, really, bad] | 0 | delivery |
22 | I need some help choosing a mobile | [need, some, help, choosing, mobile] | 3 | comparisson |
23 | do you accept credit cards? | [do, you, accept, credit, card] | 5 | indication |
24 | the tv arrived already broken | [the, tv, arrived, already, broken] | 4 | problem |
25 | the device arrived already broken | [the, device, arrived, already, broken] | 4 | problem |
26 | the device doesn't work | [the, device, doe, don't, work] | 4 | problem |
message = "Hello! Good evening"
labels[get_topic(message)]
'salutation'
message = "ah!"
labels[get_topic(message)]
'unknown'
def get_order(msg):
    tokens = re.findall("[0-9]+", msg)
    order = [token for token in tokens if len(token) == 10]  # order numbers have 10 digits
    return order
get_order("just bought a iphone, the order is 1234567890")
['1234567890']
def get_device(msg):
    tokens = tokenizator(msg, size=1, lemmatize=0, fix=0)
    # a list of objects of interest to the customer - feed it with the products on sale
    # we could use NLP POS (part-of-speech) tagging to recognize the object, but we ran out of time here =D
    possibilities = ["tv", "mobile", "television", "microwave", "site", "freezer", "tire", "pants"]
    device = [token for token in tokens if token in possibilities]
    return device
get_device("I need info on a new mobile and freezer")
['mobile', 'freezer']
The final answer to the user message will be composed according to: the topic detected for the message, and the presence of keywords such as an order number or a device name.
We will make use of the following auxiliary functions:
import datetime
def salute_back(tipo="salute"):
    agora = datetime.datetime.now()
    hora = agora.hour
    if tipo == "salute":
        saudacao = "Good night"
        if hora in range(12, 19):
            saudacao = "Good afternoon"
        if hora in range(6, 13):
            saudacao = "Good morning"
        mensagem = saudacao + ", dear costumer!"
    else:
        mensagem = "It's {} hours".format(hora)
    return mensagem
salute_back()
'Good night, dear costumer!'
salute_back("hora")
"It's 20 hours"
def order_status(number):
    return " Status for order {}, according to db: status".format(number)
order_status(128312983)
' Status for order 128312983, according to db: status'
unknown_counter = 0
def unknown_msg():
    global unknown_counter
    messages = ["It's all ok here on Earth. How can I help you?",
                "Please be more specific...",
                "I'm just a tired robot. Please explain, slowly...",
                "is this a new kind of joke?"
                ]
    if not unknown_counter:
        message = "{} {} and {}".format(salute_back(), salute_back("hora"), messages[0])
    else:
        message = messages[unknown_counter]
    unknown_counter += 1
    if unknown_counter == 4:
        unknown_counter = 1
    return message
unknown_msg()
"Good night, dear costumer! It's 20 hours and It's all ok here on Earth. How can I help you?"
Let's now create a dictionary containing answers for each topic. The first answer in each class assumes there are no keywords in the message; the second assumes keywords are present.
m_delivery = ["Our delivery time is 5 working days. Please inform the order number so I can fetch more information on it",
"Our delivery time is 5 working days. Lets check the order status {}."]
m_request = ["Sure thing! What's the order number?",
"Ok. Lets check the order status {}."]
m_indication = ["You want indications for what type of device?",
"Here are the best deals for {}."]
m_comparisson = ["We can help you to choose the best options. What kind of device are you looking for?",
"These are the best options for {}."]
m_problem= ["Easy! Everything can be fixed. What's going on?",
"Easy! Lets fix this issue with {} the best we can."]
all_answers = {
0: m_delivery,
1: salute_back(),
2: m_request,
3: m_comparisson,
4: m_problem,
5: m_indication,
6: unknown_msg()
}
all_answers[1]
'Good night, dear costumer!'
Now that all the auxiliary functions are ready, we can compose our answer:
labels
{0: 'delivery', 1: 'salutation', 2: 'demand', 3: 'comparisson', 4: 'problem', 5: 'indication', 6: 'unknown'}
def sub_resp(var, topic):  # ok, just one more auxiliary function
    if var:
        if var[0].isdigit():
            answer = all_answers[topic][1].format(var[0]) + order_status(var[0])
        else:
            answer = all_answers[topic][1].format(var[0])
    else:
        answer = all_answers[topic][0]
    return answer
def answer_costumer(message):  # now for real
    topic = get_topic(message)
    n_order = get_order(message)
    device = get_device(message)
    if topic in [0, 2]:
        return sub_resp(n_order, topic)
    if topic in [3, 4, 5]:
        return sub_resp(device, topic)
    return all_answers[topic]  # salutation (1) or unknown (6)
unknown_counter = 0
message = "I want a new mobile"
answer_costumer(message)
'These are the best options for mobile.'
answer_costumer("good morning!")
'Good night, dear costumer!'
answer_costumer("whacka whacka whacka!")
'Please be more specific...'
answer_costumer("whats the eta for order 1234567890?")
'Our delivery time is 5 working days. Lets check the order status 1234567890. Status for order 1234567890, according to db: status'
In order to optimize processing time, the gensim package offers alternative ways to train an LDA model.
Among them are: batch mode, which makes several passes over the whole collection, and online mode, which updates the model incrementally in chunks.
Next, we will compare both modes in terms of computation time.
To do so, let's simulate a collection 1000 times larger than our collection of messages.
import time
body2 = body*1000
start = time.time()
model_bath = models.ldamodel.LdaModel(body2, num_topics=6, id2word = dictionary, passes=10)
print("\n --- %s seconds ---" % round((time.time() - start),4))
--- 64.2224 seconds ---
It took about 64 seconds to train the model in batch mode on this machine (Core i5, 4 GB of RAM).
start = time.time()
model_online = models.ldamodel.LdaModel(body2, num_topics=6, id2word = dictionary, update_every=1, chunksize=100, passes=1)
print("\n --- %s seconds ---" % round((time.time() - start),4))
--- 9.1031 seconds ---
It took about 9 seconds to train the model in online mode on this machine (Core i5, 4 GB of RAM).
model_bath.print_topics(num_words=3)
[(0, '0.114*"is" + 0.086*"the" + 0.086*"my"'), (1, '0.136*"to" + 0.136*"make" + 0.136*"\'d"'), (2, '0.111*"good" + 0.111*"is" + 0.111*"device"'), (3, '0.166*"mobile" + 0.083*"good" + 0.083*"the"'), (4, '0.120*"\'s" + 0.120*"it" + 0.080*"is"'), (5, '0.095*"broken" + 0.095*"the" + 0.095*"already"')]
model_online.print_topics(num_words=3)
[(0, '0.182*"\'s" + 0.182*"it" + 0.091*"night"'), (1, '0.143*"is" + 0.048*"mobile" + 0.048*"help"'), (2, '0.143*"the" + 0.095*"device" + 0.095*"broken"'), (3, '0.154*"good" + 0.154*"is" + 0.154*"iphone"'), (4, '0.115*"you" + 0.077*"do" + 0.077*"the"'), (5, '0.094*"the" + 0.094*"to" + 0.094*"like"')]
The model could benefit from more fine-tuning. Also, feeding it a proper collection of documents would help to improve the results.
I've used such a small database because this was originally an actual challenge proposed in a job interview. The original notebook was in Portuguese, so this is a translation of it.
The same task could be done with other methods, such as TF-IDF, PCA, LSA, etc., but LDA should provide better accuracy, as I mentioned before.
My skills:
- R & Python (Pandas, SciPy, scikit-learn, dplyr, caret)
- BI Tableau, Google Vis, ggPlot, Shiny Dashboards
- MySQL / Teradata, NoSQL, MongoDB, Apache Cassandra
- Hadoop, Amazon Web Services – AWS EC2
- Machine Learning & Regression models, Decision Trees, etc
- Natural Language Processing – NLP
- Clustering / Association rules
- Inferential statistics, A/B testing
- Exploratory data analysis
- Git / Github
- Rmarkdown Reproducible Research / Jupyter Notebooks
- Physicist
Erick Gomes Anastácio