Using Python + natural language processing for topic modeling: an unsupervised technique for document classification.
Imagine you have a huge collection of documents, each one about a specific subject. A document could be a movie description, part of a book, a message, a tweet, etc.
If you take the time to read each document, you will learn that it talks about science, politics, medicine, sports, and so on, but very often documents don't come with a label specifying the subject.
Now it's your job to group them by subject. Will you read and label them one by one? What if your collection contains 1 billion documents?
This is where topic modeling comes in handy: it is a very useful technique for document classification through unsupervised learning. It learns from the collection of documents as a whole and then suggests groups (clusters) of documents by similarities, such as word frequencies or word probabilities in each document.
After the documents are split into the suggested groups, we can then look at each group (through samples of them) and choose a proper label for it.
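To build some intuition for "grouping documents by word similarity", here is a crude sketch (not part of any topic-modeling library) that scores two documents by raw word overlap; real topic models work with probabilistic word distributions instead, but the idea of "more shared words, more similar" is the same:

```python
def word_overlap(doc_a, doc_b):
    """Jaccard similarity between the word sets of two documents."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

# documents about the same subject tend to share more words
word_overlap("the team won the match", "the team lost the match")      # 0.6
word_overlap("the team won the match", "parliament voted the new law") # 0.125
```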
This clustering of documents by subject can be achieved with different techniques, such as LSA/LSI, PCA combined with regular clustering, and LDA.
In this notebook we will discuss the LDA method.
LDA is a special case of latent semantic analysis in which the prior distribution over topics is assumed to be the multivariate generalization of the beta distribution, a.k.a. the Dirichlet distribution.
The main advantage of LDA over LSI, PCA or regular clustering is that LDA is capable of detecting intermediate topics between the ones the others would detect, since those methods work on principal components and find only orthogonal topics. Thus, LDA tends to reduce overfitting and increase accuracy.
On the other hand, LDA demands more computation time.
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
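To see what a draw from a Dirichlet prior over topics looks like, we can simulate one with the standard library alone (normalizing independent gamma samples is one standard way to sample a Dirichlet; this is only an illustration, not part of the pipeline below):

```python
import random

def dirichlet_sample(alpha, k):
    """One draw from a symmetric Dirichlet(alpha) over k topics."""
    gammas = [random.gammavariate(alpha, 1) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

random.seed(42)
# a small alpha favors sparse mixtures: each document concentrates on few topics
topic_proportions = dirichlet_sample(0.1, 6)
# proportions are non-negative and sum to 1, like any probability vector
```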
Imagine we are a big online retailer and we want to provide a new communication channel for our customers.
We used to have a real-time online chat where a human attendant answered our customers. We have stored all of the customers' messages in our database, and now we want to create an algorithm capable of answering customers just like a human attendant would.
But how do we answer our customers in a natural way, simulating the behavior of a human attendant?
To achieve that goal, we are going to use natural language processing and topic modeling. We will combine different techniques for analyzing the customers' queries and questions in order to elaborate an appropriate, natural answer.
The techniques are: topic modeling (LDA) to identify the subject of each message, keyword extraction (order numbers, product names), and template-based answers for each subject.
E.g.: if the subject is a complaint, the algorithm should compose an answer taking this information into account; if the message is a salutation, then salute back.
In order to "train" our algorithm, we must retrieve stored data from our databases. Note that this is not the typical training process because we don't have any targets or labels ready at this point. This is, in fact, an unsupervised training process.
If the database in question is a NoSQL database like MongoDB, we could run queries like:
db.mensagens.find(
    {
        $and: [
            {"user": {$in: ["user_1", "user_2", "attd_1", "sell_1"]}},
            {"message": {$ne: null}}
        ]
    }
).pretty()
on MongoDB Compass.
If the messages are stored in an XML format, we could use the Beautiful Soup library to scrape the data:
db = BeautifulSoup(open('db.xml').read(), "lxml")
messages = db.findAll('message')
If the messages are stored in a txt file, we could scrape them using something as simple as:
messages = []
with open("db.txt", "r") as incoming:
    for line in incoming:
        if line.startswith('user'):
            messages.append(line)
For this particular notebook, we are going to use a very small set of messages containing some conversations between customers and attendants. The customers are identified by the id "user_1".
conversations = [
# small talk
[
{'user': 'user_1', 'message': 'Hi, how are you?', 'status': ''},
{'user': 'user_2', 'message': 'fine! and you? ', 'status': ''},
{'user': 'user_1', 'message': ' I\'m ok!!!', 'status': ''},
{'user': 'user_1', 'message': 'got any sale today?', 'status': ''},
{'user': 'user_2', 'message': 'today we have a 50\"tv" \n ', 'status': ''},
],
# customer service
[
{'user': 'user_1', 'message': 'Where is my iphone?!' , 'status': 'payment_approved'},
{'user': 'attd_1', 'message': 'Hello, your payment has been approved' , 'status': 'payment_approved'},
{'user': 'attd_1', 'message': 'Is the product on delivery route' , 'status': 'payment_approved'},
{'user': 'user_1', 'message': 'But it\'s 5 days already!' , 'status': 'payment_approved'},
{'user': 'attd_1', 'message': 'our delivery should take five working days ' , 'status': 'payment_approved'},
{'user': 'user_1', 'message': 'ow it\'s true' , 'status': 'payment_approved'},
],
# sale
[
{'user': 'user_1', 'message': 'Where is the iphone 10?' , 'status': 'shopping'},
{'user': 'sell_1', 'message': 'Hello! the iphone X it is out of stock;' , 'status': 'shopping'},
{'user': 'user_1', 'message': 'Huum, what do you have available?' , 'status': 'shopping'},
{'user': 'sell_1', 'message': 'We have the iphone X plus and the samsung s8', 'status': 'shopping'},
{'user': 'user_1', 'message': 'Is the samsung better than the iphone?' , 'status': 'shopping'},
{'user': 'sell_1', 'message': 'They are different, but they are the best' , 'status': 'shopping'},
],
]
# let's retrieve only the customers' messages
user_messages = []
for chat in conversations:
    user_messages.append([message["message"] for message in chat if message["user"] == "user_1"])
user_messages = sum(user_messages, [])
user_messages
['Hi, how are you?', " I'm ok!!!", 'got any sale today?', 'Where is my iphone?!', "But it's 5 days already!", "ow it's true", 'Where is the iphone 10?', 'Huum, what do you have available?', 'Is the samsung better than the iphone?']
# let's add some common customer messages to the list
add_msg = ["Good morning!",
"Good night!",
"Good evening",
"I'd like to make a complaint",
"I'd like to make a exchange",
"I'd like to make a refund",
"What's the best mobile?",
"my smart phone is broken!",
"I can't finish my purchase",
"the site is not working!",
"When it is going to arrive?",
"What's is the delivery time?",
"this device is really bad",
"I need some help choosing a mobile",
"do you accept credit cards?",
"the tv arrived already broken",
"the device arrived already broken",
"the device doesn't work"
]
all_msg = user_messages+add_msg
all_msg
['Hi, how are you?', " I'm ok!!!", 'got any sale today?', 'Where is my iphone?!', "But it's 5 days already!", "ow it's true", 'Where is the iphone 10?', 'Huum, what do you have available?', 'Is the samsung better than the iphone?', 'Good morning!', 'Good night!', 'Good evening', "I'd like to make a complaint", "I'd like to make a exchange", "I'd like to make a refund", "What's the best mobile?", 'my smart phone is broken!', "I can't finish my purchase", 'the site is not working!', 'When it is going to arrive?', "What's is the delivery time?", 'this device is really bad', 'I need some help choosing a mobile', 'do you accept credit cards?', 'the tv arrived already broken', 'the device arrived already broken', "the device doesn't work"]
In order to get the messages classified by topics, we must perform some pre-processing on them.
LDA topic modeling sees each document as a bag of words (BOW), so we need to start by transforming each message that way.
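A bag of words simply maps each token to its count in the document, discarding word order entirely. A minimal sketch with the standard library (gensim's doc2bow, used later, yields the same information as (token_id, count) pairs):

```python
from collections import Counter

message = "the device arrived already broken broken"
bow = Counter(message.split())
# word order is gone; only the counts remain:
# Counter({'broken': 2, 'the': 1, 'device': 1, 'arrived': 1, 'already': 1})
```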
As the first step to get our BOW, we must build a token generator that provides: lowercase tokens only, no punctuation or numbers, a minimum token length, and corrected spelling.
Those restrictions are applied in order to avoid unnecessary complexity: there is no evident gain otherwise.
So, for example, we are assuming that variants such as "Samsung", "samsung" and "samjung" will be seen as the same word by our model.
The chosen spell checker is the enchant project: https://github.com/AbiWord/enchant
It depends on a local dictionary, so we must install MySpell:
# sudo apt-get install myspell-en-us
Besides, let's add some words to the dictionary, such as manufacturers and products:
terms = ["samsung", "motorola", "apple", "iphone", "pixel", "google"]
file = "terms.txt"
with open(file, "w") as text_file:
    for term in terms:
        print(term, file=text_file)
import enchant
d = enchant.DictWithPWL("en_US",file)
d.check("Samsung") # if True, the spell checker knows the word
True
The suggest() method provides a list of candidates for fixing the spelling. Usually the first option is the best, but it doesn't always work as expected:
d.suggest("Samjung")
['Jung', 'Smugging', 'Samarkand', 'Samsung']
Instead, we should choose a proper fix through a similarity comparison.
Thus, we will use the methods:
difflib.SequenceMatcher()
difflib.SequenceMatcher().ratio()
The SequenceMatcher class compares pairs of sequences in a human-friendly way, and its ratio() method evaluates the similarity of the pair. Values above 0.6 indicate a match.
import difflib
difflib.SequenceMatcher(None, "samsung", "sony").ratio()
0.36363636363636365
difflib.SequenceMatcher(None, "samsung", "sansumg").ratio()
0.7142857142857143
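For this kind of cutoff, difflib also offers get_close_matches, whose default cutoff is exactly 0.6; it could be applied directly over a list of candidates:

```python
import difflib

candidates = ["samsung", "sony", "motorola"]
# only "samsung" scores above the 0.6 cutoff, as the ratios above show
difflib.get_close_matches("sansumg", candidates, n=1, cutoff=0.6)
# ['samsung']
```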
def spell_checker(word):
    best_fix = ""
    best_ratio = 0  # starting with similarity 0
    suggestions = set(d.suggest(word))
    for suggestion in suggestions:
        tmp = difflib.SequenceMatcher(None, word, suggestion).ratio()
        if tmp > best_ratio:
            best_fix = suggestion
            best_ratio = tmp  # raises the bar for the next comparisons
    return best_fix
spell_checker("samsungui")
'samsung'
spell_checker("samjung")
'samsung'
The token generator will transform the messages into BOW.
It must: lowercase the message, tokenize it, filter tokens by size, fix spelling when needed, and lemmatize (or stem) the tokens.
Stemming is faster, while lemmatizing is more precise but takes more computing time.
From Wikipedia:
Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
http://en.wikipedia.org/wiki/Lemmatisation
The choice between the former and the latter really depends on the application. Sometimes just stemming will be enough.
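To make the difference concrete, here is a toy sketch (a naive suffix stripper, not NLTK's Porter stemmer, and a tiny hand-made lemma table, not WordNet): the stemmer works blindly on suffixes, while a lemmatizer can map an irregular form to its dictionary headword:

```python
def toy_stem(word):
    # blind suffix stripping: no knowledge of context or irregular forms
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# a real lemmatizer consults a dictionary (e.g. WordNet) instead of this toy table
LEMMAS = {"better": "good", "ran": "run", "feet": "foot"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

toy_stem("studies")      # "stud"  - crude but fast
toy_lemmatize("better")  # "good"  - requires dictionary knowledge a stemmer lacks
toy_stem("better")       # "better" - the stemmer cannot see the irregular form
```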
import nltk
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
import re # for regular expressions
def tokenizator(message, size=1, fix=1, lemmatize=1):
    message = message.lower()  # lowercase
    # get tokens, keeping accented letters, excluding punctuation and numbers
    #tokens = re.findall("[-'a-zA-ZÀ-ÖØ-öø-ÿ]+", message)
    tokens = nltk.tokenize.word_tokenize(message)
    # filter words by size
    tokens = [token for token in tokens if len(token) > size]
    # spell check and correct only if needed and only if len > 2
    if fix:
        #tokens = [spell_checker(token) if not d.check(token) else token for token in tokens]
        tokens = [spell_checker(token) if (not d.check(token) and len(token) > 2) else token for token in tokens]
    # lemmatizing words
    if lemmatize:
        tokens = [lemma.lemmatize(t) for t in tokens]
    # let's keep stop words for now because the documents are already too small
    #tokens = [t for t in tokens if t not in stopWords]  # remove stopwords
    return tokens
Let's now transform the messages from our database into tokens, as this is required to later obtain our bag of words.
tokens = map(lambda x: tokenizator(x), all_msg)
msg_pro = list(tokens)
import pandas as pd
compare = pd.DataFrame({"origin": all_msg, "tokenized": msg_pro})
compare
origin | tokenized | |
---|---|---|
0 | Hi, how are you? | [hi, how, are, you] |
1 | I'm ok!!! | ['m, ok] |
2 | got any sale today? | [got, any, sale, today] |
3 | Where is my iphone?! | [where, is, my, iphone] |
4 | But it's 5 days already! | [but, it, 's, day, already] |
5 | ow it's true | [ow, it, 's, true] |
6 | Where is the iphone 10? | [where, is, the, iphone, 10] |
7 | Huum, what do you have available? | [hum, what, do, you, have, available] |
8 | Is the samsung better than the iphone? | [is, the, samsung, better, than, the, iphone] |
9 | Good morning! | [good, morning] |
10 | Good night! | [good, night] |
11 | Good evening | [good, evening] |
12 | I'd like to make a complaint | ['d, like, to, make, complaint] |
13 | I'd like to make a exchange | ['d, like, to, make, exchange] |
14 | I'd like to make a refund | ['d, like, to, make, refund] |
15 | What's the best mobile? | [what, 's, the, best, mobile] |
16 | my smart phone is broken! | [my, smart, phone, is, broken] |
17 | I can't finish my purchase | [ca, don't, finish, my, purchase] |
18 | the site is not working! | [the, site, is, not, working] |
19 | When it is going to arrive? | [when, it, is, going, to, arrive] |
20 | What's is the delivery time? | [what, 's, is, the, delivery, time] |
21 | this device is really bad | [this, device, is, really, bad] |
22 | I need some help choosing a mobile | [need, some, help, choosing, mobile] |
23 | do you accept credit cards? | [do, you, accept, credit, card] |
24 | the tv arrived already broken | [the, tv, arrived, already, broken] |
25 | the device arrived already broken | [the, device, arrived, already, broken] |
26 | the device doesn't work | [the, device, doe, don't, work] |
The gensim package brings the tools needed to implement an LDA analysis in Python.
from gensim import corpora, models
dictionary = corpora.Dictionary(msg_pro) # getting a dictionary from our collection
body = [dictionary.doc2bow(msg) for msg in msg_pro] # term matrix from our collection
We must choose a starting number of topics according to our knowledge about the collection, just as we would do when performing k-means clustering, where we have to choose a starting number of clusters. Later we will analyze the proper metrics and decide whether to increase or decrease the number of clusters/topics.
Setting the parameters for the LDA training:
Let's also record the time it takes to finish the process.
import time
import random
start = time.time()
random.seed(95276)
model = models.ldamodel.LdaModel(body, num_topics=6, id2word=dictionary, passes=100,
                                 # alpha=.1, eta=5, random_state=95276)
                                 alpha=.05, eta=5, random_state=95276)
# alpha (document/topic relationship) and eta (topic-words relationship)
# could be set to learn from data - "auto" setting
print("\n --- %s seconds ---" % round((time.time() - start),4))
--- 1.892 seconds ---
model.print_topics(num_topics=6, num_words=3)
[(0, '0.032*"the" + 0.030*"is" + 0.018*"iphone"'), (1, '0.016*"today" + 0.016*"got" + 0.016*"sale"'), (2, '0.020*"to" + 0.020*"\'d" + 0.020*"make"'), (3, '0.016*"mobile" + 0.016*"choosing" + 0.016*"some"'), (4, '0.016*"my" + 0.016*"don\'t" + 0.016*"finish"'), (5, '0.020*"good" + 0.020*"you" + 0.018*"do"')]
So we don't have to train the model again, let's save it to disk.
import pickle
# writes to disk
pickle.dump(model, open("lda.model", 'wb'))
pickle.dump(dictionary, open("dictionary.model", 'wb'))
# loads back
model = pickle.load(open("lda.model", 'rb'))
dictionary = pickle.load(open("dictionary.model", 'rb'))
model.print_topics(num_topics=6, num_words=3)
[(0, '0.032*"the" + 0.030*"is" + 0.018*"iphone"'), (1, '0.016*"today" + 0.016*"got" + 0.016*"sale"'), (2, '0.020*"to" + 0.020*"\'d" + 0.020*"make"'), (3, '0.016*"mobile" + 0.016*"choosing" + 0.016*"some"'), (4, '0.016*"my" + 0.016*"don\'t" + 0.016*"finish"'), (5, '0.020*"good" + 0.020*"you" + 0.018*"do"')]
The bigger the distinction between groups, the better. We can improve this distinction by adjusting the model parameters when training it.
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(model, body, dictionary)
pyLDAvis.enable_notebook()
vis
def get_topic(message):
    tokens = tokenizator(message)
    bow = dictionary.doc2bow(tokens)
    # model[bow] returns all the possible topics with a score that tells
    # the probability of the message belonging to each topic.
    #
    # we sort it to get the topic with the highest probability.
    # Also, if the probability is lower than 60%, we choose to say that we just
    # don't know what the message is about (category 6)
    #guess = sorted(model[bow][0], key=lambda y: y[1], reverse=True)[0]  # returns the most likely topic and its score
    guess = sorted(model[bow], key=lambda y: y[1], reverse=True)[0]  # returns the most likely topic and its score
    return 6 if guess[1] < .6 else guess[0]
df = pd.DataFrame({"messages": all_msg})
df["tokens"] = df.messages.apply(lambda x: tokenizator(x))
df["topic"] = df.messages.apply(lambda x: get_topic(x))
df.sort_values(by="topic")
messages | tokens | topic | |
---|---|---|---|
19 | When it is going to arrive? | [when, it, is, going, to, arrive] | 0 |
20 | What's is the delivery time? | [what, 's, is, the, delivery, time] | 0 |
16 | my smart phone is broken! | [my, smart, phone, is, broken] | 0 |
3 | Where is my iphone?! | [where, is, my, iphone] | 0 |
15 | What's the best mobile? | [what, 's, the, best, mobile] | 0 |
21 | this device is really bad | [this, device, is, really, bad] | 0 |
6 | Where is the iphone 10? | [where, is, the, iphone, 10] | 0 |
8 | Is the samsung better than the iphone? | [is, the, samsung, better, than, the, iphone] | 0 |
18 | the site is not working! | [the, site, is, not, working] | 0 |
11 | Good evening | [good, evening] | 1 |
9 | Good morning! | [good, morning] | 1 |
2 | got any sale today? | [got, any, sale, today] | 1 |
1 | I'm ok!!! | ['m, ok] | 1 |
10 | Good night! | [good, night] | 1 |
12 | I'd like to make a complaint | ['d, like, to, make, complaint] | 2 |
14 | I'd like to make a refund | ['d, like, to, make, refund] | 2 |
13 | I'd like to make a exchange | ['d, like, to, make, exchange] | 2 |
22 | I need some help choosing a mobile | [need, some, help, choosing, mobile] | 3 |
5 | ow it's true | [ow, it, 's, true] | 3 |
24 | the tv arrived already broken | [the, tv, arrived, already, broken] | 4 |
26 | the device doesn't work | [the, device, doe, don't, work] | 4 |
25 | the device arrived already broken | [the, device, arrived, already, broken] | 4 |
4 | But it's 5 days already! | [but, it, 's, day, already] | 4 |
17 | I can't finish my purchase | [ca, don't, finish, my, purchase] | 4 |
7 | Huum, what do you have available? | [hum, what, do, you, have, available] | 5 |
23 | do you accept credit cards? | [do, you, accept, credit, card] | 5 |
0 | Hi, how are you? | [hi, how, are, you] | 5 |
After applying the model, we can now look at a small sample of each group so we can add a friendly label to it.
#showing a sample of each group, to aid labeling each one
df.groupby('topic').apply(lambda x: x.sample(frac=.8))
messages | tokens | topic | ||
---|---|---|---|---|
topic | ||||
0 | 6 | Where is the iphone 10? | [where, is, the, iphone, 10] | 0 |
8 | Is the samsung better than the iphone? | [is, the, samsung, better, than, the, iphone] | 0 | |
3 | Where is my iphone?! | [where, is, my, iphone] | 0 | |
16 | my smart phone is broken! | [my, smart, phone, is, broken] | 0 | |
15 | What's the best mobile? | [what, 's, the, best, mobile] | 0 | |
18 | the site is not working! | [the, site, is, not, working] | 0 | |
20 | What's is the delivery time? | [what, 's, is, the, delivery, time] | 0 | |
1 | 2 | got any sale today? | [got, any, sale, today] | 1 |
1 | I'm ok!!! | ['m, ok] | 1 | |
11 | Good evening | [good, evening] | 1 | |
9 | Good morning! | [good, morning] | 1 | |
2 | 14 | I'd like to make a refund | ['d, like, to, make, refund] | 2 |
13 | I'd like to make a exchange | ['d, like, to, make, exchange] | 2 | |
3 | 5 | ow it's true | [ow, it, 's, true] | 3 |
22 | I need some help choosing a mobile | [need, some, help, choosing, mobile] | 3 | |
4 | 26 | the device doesn't work | [the, device, doe, don't, work] | 4 |
17 | I can't finish my purchase | [ca, don't, finish, my, purchase] | 4 | |
24 | the tv arrived already broken | [the, tv, arrived, already, broken] | 4 | |
4 | But it's 5 days already! | [but, it, 's, day, already] | 4 | |
5 | 23 | do you accept credit cards? | [do, you, accept, credit, card] | 5 |
7 | Huum, what do you have available? | [hum, what, do, you, have, available] | 5 |
# a dictionary of topics
labels = ["delivery", "salutation", "demand", "comparisson", "problem", "indication", "unknown"]
labels = dict(zip(range(7), labels))
labels
{0: 'delivery', 1: 'salutation', 2: 'demand', 3: 'comparisson', 4: 'problem', 5: 'indication', 6: 'unknown'}
df["label"] = df.topic.replace(labels)
df
messages | tokens | topic | label | |
---|---|---|---|---|
0 | Hi, how are you? | [hi, how, are, you] | 5 | indication |
1 | I'm ok!!! | ['m, ok] | 1 | salutation |
2 | got any sale today? | [got, any, sale, today] | 1 | salutation |
3 | Where is my iphone?! | [where, is, my, iphone] | 0 | delivery |
4 | But it's 5 days already! | [but, it, 's, day, already] | 4 | problem |
5 | ow it's true | [ow, it, 's, true] | 3 | comparisson |
6 | Where is the iphone 10? | [where, is, the, iphone, 10] | 0 | delivery |
7 | Huum, what do you have available? | [hum, what, do, you, have, available] | 5 | indication |
8 | Is the samsung better than the iphone? | [is, the, samsung, better, than, the, iphone] | 0 | delivery |
9 | Good morning! | [good, morning] | 1 | salutation |
10 | Good night! | [good, night] | 1 | salutation |
11 | Good evening | [good, evening] | 1 | salutation |
12 | I'd like to make a complaint | ['d, like, to, make, complaint] | 2 | demand |
13 | I'd like to make a exchange | ['d, like, to, make, exchange] | 2 | demand |
14 | I'd like to make a refund | ['d, like, to, make, refund] | 2 | demand |
15 | What's the best mobile? | [what, 's, the, best, mobile] | 0 | delivery |
16 | my smart phone is broken! | [my, smart, phone, is, broken] | 0 | delivery |
17 | I can't finish my purchase | [ca, don't, finish, my, purchase] | 4 | problem |
18 | the site is not working! | [the, site, is, not, working] | 0 | delivery |
19 | When it is going to arrive? | [when, it, is, going, to, arrive] | 0 | delivery |
20 | What's is the delivery time? | [what, 's, is, the, delivery, time] | 0 | delivery |
21 | this device is really bad | [this, device, is, really, bad] | 0 | delivery |
22 | I need some help choosing a mobile | [need, some, help, choosing, mobile] | 3 | comparisson |
23 | do you accept credit cards? | [do, you, accept, credit, card] | 5 | indication |
24 | the tv arrived already broken | [the, tv, arrived, already, broken] | 4 | problem |
25 | the device arrived already broken | [the, device, arrived, already, broken] | 4 | problem |
26 | the device doesn't work | [the, device, doe, don't, work] | 4 | problem |
message = "Hello! Good evening"
labels[get_topic(message)]
'salutation'
message = "ah!"
labels[get_topic(message)]
'unknown'
def get_order(msg):
    tokens = re.findall("[0-9]+", msg)
    order = [token for token in tokens if len(token) == 10]  # order numbers have 10 digits
    return order
get_order("just bought a iphone, the order is 1234567890")
['1234567890']
def get_device(msg):
    tokens = tokenizator(msg, size=1, lemmatize=0, fix=0)
    # a list of objects of interest to the customer - feed it with the products on sale
    # we could use NLP POS (part-of-speech) tagging to recognize the object, but we ran out of time here =D
    possibilities = ["tv", "mobile", "television", "microwave", "site", "freezer", "tire", "pants"]
    device = [token for token in tokens if token in possibilities]
    return device
get_device("I need info on a new mobile and freezer")
['mobile', 'freezer']
The final answer to the user message will be composed according to: the topic detected for the message, and the presence of keywords such as an order number or a device name.
We will make use of the following auxiliary functions:
import datetime
def salute_back(tipo="salute"):
    agora = datetime.datetime.now()
    hora = agora.hour
    if tipo == "salute":
        saudacao = "Good night"
        if hora in range(12, 19):
            saudacao = "Good afternoon"
        if hora in range(6, 13):
            saudacao = "Good morning"
        mensagem = saudacao + ", dear costumer!"
    else:
        mensagem = "It's {} hours".format(hora)
    return mensagem
salute_back()
'Good night, dear costumer!'
salute_back("hora")
"It's 20 hours"
def order_status(number):
    return " Status for order {}, according to db: status".format(number)
order_status(128312983)
' Status for order 128312983, according to db: status'
unknown_counter = 0
def unknown_msg():
    global unknown_counter
    messages = ["It's all ok here on Earth. How can I help you?",
                "Please be more specific...",
                "I'm just a tired robot. Please explain, slowly...",
                "is this a new kind of joke?"
                ]
    if not unknown_counter:
        message = "{} {} and {}".format(salute_back(), salute_back("hora"), messages[0])
    else:
        message = messages[unknown_counter]
    unknown_counter += 1
    if unknown_counter == 4:
        unknown_counter = 1
    return message
unknown_msg()
"Good night, dear costumer! It's 20 hours and It's all ok here on Earth. How can I help you?"
Let's now create a dictionary containing answers for each topic. The first answer in each class assumes there are no keywords in the message; the second assumes keywords are present.
m_delivery = ["Our delivery time is 5 working days. Please inform the order number so I can fetch more information on it",
"Our delivery time is 5 working days. Lets check the order status {}."]
m_request = ["Sure thing! What's the order number?",
"Ok. Lets check the order status {}."]
m_indication = ["You want indications for what type of device?",
"Here are the best deals for {}."]
m_comparisson = ["We can help you to choose the best options. What kind of device are you looking for?",
"These are the best options for {}."]
m_problem= ["Easy! Everything can be fixed. What's going on?",
"Easy! Lets fix this issue with {} the best we can."]
all_answers = {
0: m_delivery,
1: salute_back(),
2: m_request,
3: m_comparisson,
4: m_problem,
5: m_indication,
6: unknown_msg()
}
all_answers[1]
'Good night, dear costumer!'
Now that all the auxiliary functions are ready, we can compose our answer:
labels
{0: 'delivery', 1: 'salutation', 2: 'demand', 3: 'comparisson', 4: 'problem', 5: 'indication', 6: 'unknown'}
def sub_resp(var, topic):  # ok, just one more auxiliary function
    if var:
        if var[0].isdigit():
            answer = all_answers[topic][1].format(var[0]) + order_status(var[0])
        else:
            answer = all_answers[topic][1].format(var[0])
    else:
        answer = all_answers[topic][0]
    return answer
def answer_costumer(message):  # now for real
    topic = get_topic(message)
    n_order = get_order(message)
    device = get_device(message)
    if topic in [0, 2]:
        return sub_resp(n_order, topic)
    if topic in [3, 4, 5]:
        return sub_resp(device, topic)
    return all_answers[topic]  # salutation (1) or unknown (6)
unknown_counter = 0
message = "I want a new mobile"
answer_costumer(message)
'These are the best options for mobile.'
answer_costumer("good morning!")
'Good night, dear costumer!'
answer_costumer("whacka whacka whacka!")
'Please be more specific...'
answer_costumer("whats the eta for order 1234567890?")
'Our delivery time is 5 working days. Lets check the order status 1234567890. Status for order 1234567890, according to db: status'
In order to optimize processing time, the gensim package offers alternative ways to train an LDA model.
Among them are: batch mode, which makes several passes over the whole collection, and online mode, which updates the model incrementally in chunks.
Next, we will compare both modes in terms of computation time.
To do so, let's simulate a collection 1000 times larger than our collection of messages.
import time
body2 = body*1000
start = time.time()
model_bath = models.ldamodel.LdaModel(body2, num_topics=6, id2word = dictionary, passes=10)
print("\n --- %s seconds ---" % round((time.time() - start),4))
--- 64.2224 seconds ---
It took about 64 seconds to train the model in batch mode on this machine (Core i5, 4 GB of RAM).
start = time.time()
model_online = models.ldamodel.LdaModel(body2, num_topics=6, id2word = dictionary, update_every=1, chunksize=100, passes=1)
print("\n --- %s seconds ---" % round((time.time() - start),4))
--- 9.1031 seconds ---
It took about 9 seconds to train the model in online mode on this machine (Core i5, 4 GB of RAM).
model_bath.print_topics(num_words=3)
[(0, '0.114*"is" + 0.086*"the" + 0.086*"my"'), (1, '0.136*"to" + 0.136*"make" + 0.136*"\'d"'), (2, '0.111*"good" + 0.111*"is" + 0.111*"device"'), (3, '0.166*"mobile" + 0.083*"good" + 0.083*"the"'), (4, '0.120*"\'s" + 0.120*"it" + 0.080*"is"'), (5, '0.095*"broken" + 0.095*"the" + 0.095*"already"')]
model_online.print_topics(num_words=3)
[(0, '0.182*"\'s" + 0.182*"it" + 0.091*"night"'), (1, '0.143*"is" + 0.048*"mobile" + 0.048*"help"'), (2, '0.143*"the" + 0.095*"device" + 0.095*"broken"'), (3, '0.154*"good" + 0.154*"is" + 0.154*"iphone"'), (4, '0.115*"you" + 0.077*"do" + 0.077*"the"'), (5, '0.094*"the" + 0.094*"to" + 0.094*"like"')]
The model could benefit from more fine-tuning. Also, feeding it a proper collection of documents would help to improve the results.
I've used such a small database because this was originally an actual challenge proposed in a job interview. The original notebook was in Portuguese, so this is a translation of it.
The same task could be done with other methods, such as TF-IDF, PCA, LSA, etc., but LDA should provide better accuracy, as I mentioned before.
My skills:
- R & Python (Pandas, SciPy, scikit-learn, dplyr, caret)
- BI Tableau, Google Vis, ggPlot, Shiny Dashboards
- MySQL / Teradata, NoSQL, MongoDB, Apache Cassandra
- Hadoop, Amazon Web Services – AWS EC2
- Machine Learning & Regression models, Decision Trees, etc
- Natural Language Processing – NLP
- Clustering / Association rules
- Inferential statistics, A/B testing
- Exploratory data analysis
- Git / Github
- Rmarkdown Reproducible Research / Jupyter Notebooks
- Physicist
Erick Gomes Anastácio