#!/usr/bin/env python
# coding: utf-8
# # Practical Topic Finding for Short-Sentence Texts
#
# ## 1. Introduction
#
# Many analysis applications involve finding topics in corpora of short sentences, such as tweets, short messages, logs and comments. On one hand, the direct insights provided in these topics will serve as the basis of many further analysis, e.g., sentiment scoring or document classificaition. On the other hand, those short texts have some unique characterics that deserve more attentions when applying the traditional topic-finding algorithms to them. Some challenges are
# - Short texts usually have higher language variability, where the same meaning can be phrased in various ways [[1]](http://www.aclweb.org/anthology/S12-1005). For example, "dollars", "\$", "$$", "fee", "charges" may all have similar meanings; but this is harder to capture in shorter sentences due to the limited information of surrounding contexts of each word.
# - Unlike longer articles such as wiki pages, short comments or tweets may each have a single topic. At first glance it may look like a simplification. But practically it poses challenges when the algorithms to be applied assume a mixture of topics in each document. Forcing a sparse representation usually comes with a price, either computational or performance-related.
#
# To make it even more complicated, topic-finding is a multiple-step process, involving preprocessing of texts, vectorization, topic-mining and finally topic representations in keywords. Each step has multiple choices in practice and the different combinations may generate very different results.
#
# This article explores the pros and cons of differnet algorithmic decisions in topic-finding, by considering the natures of short texts discussed above. Instead of providing a bird's-eye view of theoritical comparisons, I want to highlight how a practical decision should be made based on the structure of your data and the structure of your topics. A good theoritical review can be found in [[2]](http://anthology.aclweb.org/D/D12/D12-1087.pdf).
#
# In the following I use "toy-like" artificial data to make those "blackbox" models transparent. All the models compared below are implemented in Python [scikit-learn](http://scikit-learn.org/stable/) package.
# ## 2. Topic Finding Models
#
# I look into three topic-finding models, namely Latent Dirichlet Allocation ([LDA](https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf)), Non-Negative Matrix Factorization ([NMF](http://epubs.siam.org/doi/pdf/10.1137/1.9781611972740.45)), and Truncated Singular Value Decomposition ([SVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html)). There are many extensions of these traditional models and different implementations. I pick those from [scikit-learn](http://scikit-learn.org/stable/) package.
#
# Besides the three traditional topic models, another related approach in finding text structures is document clustering. One of its implementation [KMeans Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) has also been included in this comparison although it is usually not considered as a topic model approach. You can find interesting discussions on the differences of the models [online](https://www.quora.com/search?q=nmf+vs+lda). The code comments below also provide addtional information on why an implementation decision is made in that way.
# In[1]:
get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation
from sklearn.cluster import KMeans
# Before diving into models, let's prepare some sample texts below. Here I generated four artificial corpora for topic-finding.
# - `clearcut topics`: texts clearly with 2 topics - "berger-lovers" and "sandwich-haters". It shouldn't be a problem for most methods.
# - `unbalanced topics`: it has the same 2 topics as above, but the topic distributions are skewed. A real scenario would be finding *outlier* messages or comments from a haystack of normal ones.
# - `semantic topics`: the corpus has four topics, each for both berger/sandwich lovers and haters. However, in addition to structuring the texts this way, there is another potential dimension that can group "berger" vs "sandwich" as "foods topic" and "hate" v.s. "love" as "feelings topic". Is there any setup that can find topics in this new perspective?
# - `noisy topics`: as discussed above, short texts may have language variability due to different terms for the same meaning, or even typos. This corpus simulates texts with different typos for two topics. The number of texts in this corpus is smaller than others, so that we can test how the models deal with these ambigurities.
# In[2]:
def generate_clearcut_topics():
## for demostration purpose, don't take it personally : )
return np.repeat(["we love bergers", "we hate sandwiches"], [1000, 1000])
def generate_unbalanced_topics():
return np.repeat(["we love bergers", "we hate sandwiches"], [10, 1000])
def generate_semantic_context_topics():
return np.repeat(["we love bergers"
, "we hate bergers"
, "we love sandwiches"
, "we hate sandwiches"], 1000)
def generate_noisy_topics():
def _random_typos(word, n):
typo_index = np.random.randint(0, len(word), n)
return [word[:i]+"X"+word[i+1:] for i in typo_index]
t1 = ["we love %s" % w for w in _random_typos("bergers", 15)]
t2 = ["we hate %s" % w for w in _random_typos("sandwiches", 15)]
return np.r_[t1, t2]
sample_texts = {
"clearcut topics": generate_clearcut_topics()
, "unbalanced topics": generate_unbalanced_topics()
, "semantic topics": generate_semantic_context_topics()
, "noisy topics": generate_noisy_topics()
}
# In[3]:
from collections import Counter
for desc, texts in sample_texts.items():
print desc
print Counter(texts).most_common()
print ""
# Let's first take a step back and consider what makes a "good" topic modelling. Although the standard should depend on the nature of the analysis, there are usually some common understanding. For many cases, the keywords in each topic should be
# - frequent enough to appear in more than a few documents/sentences. Topics covering only a few examples may not be interesting, unless outlier detection is the purpose.
# - representative enough to distinguish one topic from another. This sometimes implies *orthogonality* or *independence* among the topics.
#
# Some research [[2]](http://anthology.aclweb.org/D/D12/D12-1087.pdf) also propose other criteria such as
# - co-occurrence of keywords in the same topic should be high, which implies they are coming from the same contexts.
# - semantic meanings of keywords in a topic should be close, e.g., "apples" and "oranges" in "fruits topic", "love" and "hate" in "emotions topic".
#
# Now let's take a look at the implementation of the models to be compared. The four models, namely `NMF`, `SVD`, `LDA` and `KMEANS` are implemented with a single interface `find_topic` below. Each topic model can be combined with two `vectorization` methods, i.e., term-frequence (TF) and term-frequence-inverse-document-frequence ([TFIDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)). In general, you should choose TFIDF over TF if you have a lot of common words shared by many texts. Those common words are considered as "noise" (or stop-words) that may impair the expressness of real important words in topics. However, this difference between TF and TFIDF is not significant for applications on short sentences, because there is less chance for a word to become "dominant" out of many short sentences. Finding other possible vector representations of documents is an active research area. For example, vectorizations based on [word embedding](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/) models, e.g. [word2vec](https://en.wikipedia.org/wiki/Word2vec) and [doc2vec](http://eng.kifi.com/from-word2vec-to-doc2vec-an-approach-driven-by-chinese-restaurant-process/) have become popular.
#
# The following implementation chooses the keywords of topics as the most frequent words in a ***topic-word distribution***, which is usually generated by the topic models or clustering algorithms. However for some models such as SVD or KMEANS clustering, the topic-word matrix could have both positive and negative values, which makes it difficult to be explained as a "distribution" and thus choosing the keywords for topics is more ambiguous. For demostration, I choose to pick the keywords as those with significant absolute values and keep the signs with these keywords - the negative words would be prefixed with a ***"^"***, such as ***"^bergers"***.
# In[23]:
def find_topic(texts, topic_model, n_topics, vec_model="tf", thr=1e-2, **kwargs):
"""Return a list of topics from texts by topic models - for demostration of simple data
texts: array-like strings
topic_model: {"nmf", "svd", "lda", "kmeans"} for LSA_NMF, LSA_SVD, LDA, KMEANS (not actually a topic model)
n_topics: # of topics in texts
vec_model: {"tf", "tfidf"} for term_freq, term_freq_inverse_doc_freq
thr: threshold for finding keywords in a topic model
"""
## 1. vectorization
vectorizer = CountVectorizer() if vec_model == "tf" else TfidfVectorizer()
text_vec = vectorizer.fit_transform(texts)
words = np.array(vectorizer.get_feature_names())
## 2. topic finding
topic_models = {"nmf": NMF, "svd": TruncatedSVD, "lda": LatentDirichletAllocation, "kmeans": KMeans}
topicfinder = topic_models[topic_model](n_topics, **kwargs).fit(text_vec)
topic_dists = topicfinder.components_ if topic_model is not "kmeans" else topicfinder.cluster_centers_
topic_dists /= topic_dists.max(axis = 1).reshape((-1, 1))
## 3. keywords for topics
## Unlike other models, LSA_SVD will generate both positive and negative values in topic_word distribution,
## which makes it more ambiguous to choose keywords for topics. The sign of the weights are kept with the
## words for a demostration here
def _topic_keywords(topic_dist):
keywords_index = np.abs(topic_dist) >= thr
keywords_prefix = np.where(np.sign(topic_dist) > 0, "", "^")[keywords_index]
keywords = " | ".join(map(lambda x: "".join(x), zip(keywords_prefix, words[keywords_index])))
return keywords
topic_keywords = map(_topic_keywords, topic_dists)
return "\n".join("Topic %i: %s" % (i, t) for i, t in enumerate(topic_keywords))
# ### 2.1 SVD: orthogonal decomposition of text variances
# The [truncated SVD implementation in sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) is intuitively similiar to PCA algorithm, which tries to find ***orthogonal*** directions that explains the largest ***variances*** in the texts.
#
# When applying SVD with TF and TFIDF on the clearcut-topic texts, we got the results below. As discussed, one unique signature of SVD's results is that the words in topics can be both positive and negative. For simple cases, they can be understood as **including** and **excluding** the corresponding word in the topic.
#
# For example `"Topic 1: bergers | ^hate | love | ^sandwiches"` can be "intuitively" explained as the texts that include "love bergers" and exclude "hate sandwiches".
#
# Depending on the random state, your topic results may be different. In the results below, we don't see clear indications of the two topics "love bergers" and "hate sandwiches". However, it does have topics such as `Topic 3: ^bergers | love`, which means "love" but NOT "bergers".
#
# Interestingly, we may also generate topics such as `Topic 3: bergers | ^hate | ^love | sandwiches`, which captures "bergers" and "sandwiches" as a "food" topic.
# In[34]:
print(find_topic(sample_texts["clearcut topics"], "svd", 4, vec_model="tf"))
# In[36]:
print(find_topic(sample_texts["clearcut topics"], "svd", 4, vec_model="tfidf"))
# In the above examples, we set a larger number of topics than expected on purpose, because most of time you don't have the prior knowledge of how many topics there are in your texts. If we explicitly set the topic number = 2. We get the following result.
# In[46]:
print(find_topic(sample_texts["clearcut topics"], "svd", 2, vec_model="tfidf"))
# When explaining the result of SVD, it's important to contrast each topic with previous ones, instead of looking at them separately. So the result above can be explained as *that the major difference in the texts are (1) including "love bergers" and (2) excluding "hate sandwiches".*
# Let's try SVD on the unbalanced topic texts, to see how it performs on detecting minor groups - it performs quite well on this.
# In[47]:
print(find_topic(sample_texts["unbalanced topics"], "svd", 3, vec_model="tf"))
# However, it did poorly on texts with noises - SVD treats each form of the same meaning differently and fails to capture any semantic connections.
# In[6]:
print(find_topic(sample_texts["noisy topics"], "svd", 2, vec_model="tf"))
# In summary,
#
# - SVD finds "topics" using the words that explain most of the text variances.
# - The topics may not be the most explainable based on human standards, because it doesn't enforce a probability. And the keywords within the same topics don't necessarily have high co-occurrences.
# - However, the topics are "complementary" to each other as there is no duplicate and they capture most information in texts when put together.
# - This usually makes its result very useful as a representation of documents for other analysis purposes, e.g., document classification.
# - SVD is also capable of finding topics in unbalanced distributions.
# - There is an upper limit of the # of topics generated by SVD due to its computation algorithm - using other vectorization method, such as tf/idf for n-grams or word embeddings may help.
# - ***SVD may have problems if you have texts that are mostly similiar to each other but their slight differences actually determine their topics. ***
# - This can be observed in SVD's result on "noisy topics" texts above, where the difference between "love bergers" and "hate sandwiches" are blurred by the more dominant variaty of different spellings.
# ### 2.2 LDA: gluing similar words based on their co-occurrences
#
# [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is one of the most mentioned topic-finding models, due to its good performances on many different types of texts, and its intuitive interpretation as a "generative" process.
#
# Intuitively, LDA finds topics as a group of words that have high co-occurrences among different documents. On the other side, documents from the similiar mixture of topics should also be similiar, such that they can be described by these topics in a "compact" way. So ideally the similiarity in the latent ***topic space*** would imply the the similiarity in both the observed ***word space*** as well as the ***document space*** - this is where the word "latent" in the name come from.
#
# The LDA algorithms has two main parameters controlling
# 1. how sparse the topics are in terms of the distribution of keywords in each topic
# 2. how sparse the documents are in terms of the distribution of topics in each document
#
# Later we will see how these parameters help with fining minor topics in a skewed distribution. Finding the right parameter values is mostly based on experimental experiences.
#
# (***TODO: make it clearer on this part***)
# Compared to SVD, the topics found by LDA is much more human understandable. This is shown as the results on the clearcut-topic texts below. Within each topic, there is a clear indication of the keywords' connections based on their co-occurrence. This is different from what we observed in the results of SVD, specially,
# - the topics from LDA can be duplicated.
# - different topics can share keywords if they co-appear frequently enough with other keywords in different topics. e.g., the word "we" have been repeated in all the topics below simply because it co-occurs with all the other words in each sentence.
#
# There is also a difference in combining it with different vectorizations - `tfidf` if you don't want to see too many common words in the topics.
# In[16]:
print(find_topic(sample_texts["clearcut topics"], "lda", 4, vec_model="tf"))
# In[20]:
print(find_topic(sample_texts["clearcut topics"], "lda", 4, vec_model="tfidf"))
# I have introduced how to tune the topic-skewness parameter to deal with unbalanced topic modelling. In the sklearn implementation, this parameter is `topic_word_prior`. (and the other one is `doc_topic_prior` that controls the sparseness of topics in each doc).
#
# The default value of `topic_word_prior` is $\frac{1}{n\_topics} $. which assumes an even distribution of topics. A smaller value will make it more "uneven". This is illustrated in the results below.
# ***The minor topic `we love bergers` have been "glued" to a bigger one if the topic distribution is assumed to be symmetric.***
# In[53]:
print(find_topic(sample_texts["unbalanced topics"], "lda", 4, vec_model="tf"))
# ***Using a smaller `topic_word_prior` value will help capture the minor topics, because now the topics are forced to be more sparse in choosing keywords.***
# In[50]:
print(find_topic(sample_texts["unbalanced topics"], "lda", 4, vec_model="tf", topic_word_prior=1e-5))
# Noisy texts is also a challenge for LDA. From below we can see LDA's result on noisy-topics is not clear because there is no clear connections between the different typos of the same words.
# In[24]:
print find_topic(sample_texts["noisy topics"],"lda",3, vec_model = "tfidf")
# In summary,
#
# - Topics generated by LDA is close to human understanding, in terms of grouping co-occuring words together.
# - However, these topics may not necessarily be the ones that distinguish different groups of documents - sometimes enforcing the documents to be sparse and specific in topics may help.
# - For example, in a suboptimal parameter setting, the minor topics could be "obsorbed" into a major one, if they happen to have shared keywords.
# - This is different from SVD's results, where the topics are generated to be far away (orthogonal) from each other as much as possible.
# - As a result, topics generated by LDA may not necessarily optimal for representing the documents for other purposes, such as document classifications.
# - A good understanding of your text data is the key to good performance of LDA. But usually you don't have this knowledge at the beginning. LDA is usually expensive to run. There are other ways to help understand the structure of data before using LDA.
# ### 2.3 NMF: an LDA-like decomposition
#
# NMF has been discussed as a [special case of LDA](https://www.quora.com/What-are-the-pros-and-cons-of-LDA-and-NMF-in-topic-modeling#). The theory behind their link might be complicated to understand. But in practice, NMF can be mostly seen as a LDA of which the parameters have been fixed to enforce a sparse solution. So it may not be as flexible as LDA if you want to find multiple topics in single documents, e.g., from long articles. But it could work very well out of box for corpora of short texts. This makes NMF attractive for short text analysis because its computation is usually much cheaper than LDA.
#
# On the other hand, the most discussed weakness of NMF is the inconsistency of its results - when you set the number of topics to be too high than the reality in texts, NMF might generate some rubbish out of nowhere. LDA is more robust to a big variety of different topic numbers.
# Let's first see an example of NMF being inconsistent. For clearcut topics texts, when we set topic number = 5, which is close to reality (= 2), the generated topics are of good quality.
# In[35]:
print(find_topic(sample_texts["clearcut topics"], "nmf", 5, vec_model="tf"))
# However, when we increase the number of topics to 25 (much larger than 2), some werid topics have started to jump out
# In[32]:
print(find_topic(sample_texts["clearcut topics"], "nmf", 25, vec_model="tf"))
# Running the same experiment on LDA, the results is much more consistent.
# In[36]:
print(find_topic(sample_texts["clearcut topics"], "lda", 25, vec_model="tf"))
# Set with an appropriate # of topics, NMF is also good at finding unbalanced topic distributions.
# In[39]:
print(find_topic(sample_texts["unbalanced topics"], "nmf", 5, vec_model="tfidf"))
# Impressively, NMF seems to be the only topic-finding model that can deal with "noisy texts" without a lot of fine-tuning. This is very useful for the first round of exploration of your data.
# In[45]:
print find_topic(sample_texts["noisy topics"],"nmf",5, vec_model = "tfidf",)
# In summary,
# - NMF seems to work very well with short texts out-of-box.
# - One possible explaination is that its assumption has a good match with the nature of short texts that we discussed above.
# - NMF is usually cheaper in computation compared to LDA.
# - The main cons of NMF is its gradual inconsistency of results when keep increasing number of topics.
# ### 2.4 KMeans: cheap and powerful
#
# Clustering method such as KMeans can group documents based on their vector representations (or even directly based on their distance matrices). However it is not usually seen as a topic-finding method because it is hard to explain its results as groups of keywords.
#
# However, when used together with tf/tfidf, the centers of the clusters can be interpreted as a probability over words in the same way as in LDA and NMF.
# In[47]:
print find_topic(sample_texts["clearcut topics"],"kmeans",10, vec_model = "tf",)
# In[43]:
print find_topic(sample_texts["unbalanced topics"],"kmeans",10, vec_model = "tf",)
# In[42]:
print find_topic(sample_texts["noisy topics"],"kmeans",10, vec_model = "tf",)
# In summary,
#
# Just like NMF, KMeans also performs well on different types of short texts, including finding unbalanced topic distributions and dealing with noisy data. Even better, its results seem more consistent than NMF's to the setting of number of topics.
#
# Furthermore, its computation is usually cheap and there are [implementations that can scale up to very large datasets](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html). Unlike LDA, the integration of clustering with other document-vectoization methods is much easier to implement. For example, if external large corpus is available to train a word-embedding model, topic-finding via clustering can be easily extended by using the word vectors that convey more semantic meanings.
# ### 2.5 Finding topics with high semantic coherence
#
# Lastly, I want to briefly discuss another perspective of topic finding. In most cases we are interested in grouping documents according to their topic distributions and finding describing keywords for each topic. Another way to look at the topics is to see whether they can group "semantically connected" words into the same groups.
#
# Most researchers agree that the "semantic" of a word is defined by its contexts, i.e., other words surrounding it. For example, "love" and "hate" can be seen as semantically connected because both words can be used in the same context "I _ apples." In fact, one of the most important focus of word embedding research is to find vector representations of the words, phrases, or even documents such that their semantic closeness is retained in the vector space.
#
# Finding topics grouping "semantically similiar" words may not necessarily be the same as grouping words that co-occur frequently. From the results below, we can see that most methods discussed are generating co-occurance oriented topics instead of semantic ones. Only SVD shed some lights on it - both "bergers vs sandwiches" and "love vs hate" groups are generated.
#
# But please bear in mind that these results are only based on very simple toy-like texts. I include them here only to further highlight the differences of models from another perspective.
# In[48]:
print(find_topic(sample_texts["semantic topics"], "svd", 4, vec_model="tfidf"))
# In[49]:
print(find_topic(sample_texts["semantic topics"], "nmf", 5, vec_model="tfidf"))
# In[50]:
print(find_topic(sample_texts["semantic topics"], "lda", 5, vec_model="tfidf"))
# In[51]:
print(find_topic(sample_texts["semantic topics"], "kmeans", 5, vec_model="tfidf"))
# ## 3. Summary
#
# - Corpora of short texts have some unique characteristics that need to be considered when doing topic finding.
# - Choice of a method depends on the definition of "topics" (high co-occurance, semantic-similarity) and the purpose of topic finding (representation of docs, summarization, outlier detection and etc.)
# - It's usually a good idea to start with ***KMeans*** or ***NMF***, and to quickly get a better understaning of the structures of texts, including but not limited to,
# - sparseness of words in topics
# - sparseness of topics in documents
# - number of topics
# - number of words in each topic
# - what does co-occurance imply in your data
# - ***LDA*** is flexible for different types of tasks. But its parameter tunning should be based on a good understanding of the data. So if you want to try LDA, keep at least another model such as KMeans or NMF as a baseline.
# - ***SVD*** is mostly useful to capture the variances in the texts. For example, if your data is semi-structured, e.g., forms of a template, screenshots, html tables, SVD might be useful in analyzing them when used together with regular expressions.
#
# ***Thank you for reading. I am sure there will be mistakes, inaccuracies. Please feel free to PR at my [github](https://github.com/dolaameng/tutorials.git).***
#
# In[ ]: