Many analysis applications involve finding topics in corpora of short texts, such as tweets, short messages, logs and comments. On one hand, the topics found provide direct insights that serve as the basis of further analysis, e.g., sentiment scoring or document classification. On the other hand, short texts have some unique characteristics that deserve extra attention when traditional topic-finding algorithms are applied to them. Some challenges are
To make it even more complicated, topic-finding is a multi-step process, involving text preprocessing, vectorization, topic mining and finally topic representation in keywords. Each step has multiple choices in practice, and different combinations may generate very different results.
This article explores the pros and cons of different algorithmic decisions in topic-finding, considering the nature of short texts discussed above. Instead of providing a bird's-eye view of theoretical comparisons, I want to highlight how a practical decision should be made based on the structure of your data and the structure of your topics. A good theoretical review can be found in [2].
In the following I use "toy-like" artificial data to make those "black box" models transparent. All the models compared below are implemented in the Python scikit-learn package.
I look into three topic-finding models, namely Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Truncated Singular Value Decomposition (SVD). There are many extensions and implementations of these traditional models; I pick the ones from the scikit-learn package.
Besides the three traditional topic models, another related approach to finding text structure is document clustering. One of its implementations, KMeans clustering, is also included in this comparison, although it is usually not considered a topic-model approach. You can find interesting discussions on the differences between the models online. The code comments below also provide additional information on why particular implementation decisions were made.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation
from sklearn.cluster import KMeans
Before diving into the models, let's prepare some sample texts. Here I generate four artificial corpora for topic-finding.
clearcut topics
: texts clearly with 2 topics - "berger-lovers" and "sandwich-haters". It shouldn't be a problem for most methods.

unbalanced topics
: the same 2 topics as above, but the topic distribution is skewed. A real scenario would be finding outlier messages or comments in a haystack of normal ones.

semantic topics
: the corpus has four topics, one each for berger/sandwich lovers and haters. However, in addition to structuring the texts this way, there is another potential dimension that groups "berger" vs. "sandwich" as a "foods" topic and "hate" vs. "love" as a "feelings" topic. Is there any setup that can find topics from this perspective?

noisy topics
: as discussed above, short texts may have language variability due to different terms for the same meaning, or even typos. This corpus simulates texts with different typos for two topics. The number of texts in this corpus is smaller than in the others, so that we can test how the models deal with these ambiguities.

def generate_clearcut_topics():
    ## for demonstration purposes, don't take it personally : )
    return np.repeat(["we love bergers", "we hate sandwiches"], [1000, 1000])

def generate_unbalanced_topics():
    return np.repeat(["we love bergers", "we hate sandwiches"], [10, 1000])

def generate_semantic_context_topics():
    return np.repeat(["we love bergers"
                      , "we hate bergers"
                      , "we love sandwiches"
                      , "we hate sandwiches"], 1000)

def generate_noisy_topics():
    def _random_typos(word, n):
        ## replace one random character with "X" to simulate typos
        typo_index = np.random.randint(0, len(word), n)
        return [word[:i] + "X" + word[i+1:] for i in typo_index]
    t1 = ["we love %s" % w for w in _random_typos("bergers", 15)]
    t2 = ["we hate %s" % w for w in _random_typos("sandwiches", 15)]
    return np.r_[t1, t2]
sample_texts = {
"clearcut topics": generate_clearcut_topics()
, "unbalanced topics": generate_unbalanced_topics()
, "semantic topics": generate_semantic_context_topics()
, "noisy topics": generate_noisy_topics()
}
from collections import Counter
for desc, texts in sample_texts.items():
    print(desc)
    print(Counter(texts).most_common())
    print("")
noisy topics
[('we love bergXrs', 5), ('we love bergerX', 3), ('we hate sandXiches', 3), ('we love Xergers', 2), ('we hate sanXwiches', 2), ('we hate sandwiXhes', 2), ('we hate sandwicXes', 2), ('we hate Xandwiches', 2), ('we love bergeXs', 2), ('we love berXers', 1), ('we hate sandwicheX', 1), ('we hate sandwXches', 1), ('we love bXrgers', 1), ('we hate saXdwiches', 1), ('we love beXgers', 1), ('we hate sXndwiches', 1)]

clearcut topics
[('we love bergers', 1000), ('we hate sandwiches', 1000)]

unbalanced topics
[('we hate sandwiches', 1000), ('we love bergers', 10)]

semantic topics
[('we love bergers', 1000), ('we love sandwiches', 1000), ('we hate sandwiches', 1000), ('we hate bergers', 1000)]
Let's first take a step back and consider what makes a "good" topic model. Although the standard depends on the nature of the analysis, there is usually some common understanding. For many cases, the keywords in each topic should be
Some research [2] also proposes other criteria, such as
Now let's take a look at the implementation of the models to be compared. The four models, namely NMF, SVD, LDA and KMEANS, are implemented behind a single interface find_topic below. Each topic model can be combined with two vectorization methods, i.e., term frequency (TF) and term frequency-inverse document frequency (TFIDF). In general, you should choose TFIDF over TF if you have a lot of common words shared by many texts. Those common words act as "noise" (or stop-words) that may drown out the really important words in topics. However, the difference between TF and TFIDF is less significant for applications on short sentences, because there is less chance for a word to become "dominant" across many short sentences. Finding other vector representations of documents is an active research area. For example, vectorizations based on word-embedding models, e.g. word2vec and doc2vec, have become popular.
The following implementation chooses the keywords of a topic as the most frequent words in a *topic-word distribution*, which is usually generated by the topic models or clustering algorithms. However, for some models such as SVD or KMEANS clustering, the topic-word matrix can have both positive and negative values, which makes it difficult to interpret as a "distribution", and choosing keywords for topics is thus more ambiguous. For demonstration, I pick the keywords as those with significant absolute values and keep their signs - *negative words are prefixed with a "^", such as "^bergers"*.
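The sign-keeping convention can be demonstrated in isolation on a hand-made weight vector (the numbers below are made up purely for illustration):

```python
import numpy as np

# A toy topic-word weight vector, as SVD might produce: mixed signs.
words = np.array(["bergers", "hate", "love", "sandwiches", "we"])
topic_dist = np.array([0.9, -0.8, 0.85, -0.7, 0.001])

thr = 1e-2
keep = np.abs(topic_dist) >= thr                  # drop near-zero weights
prefix = np.where(topic_dist > 0, "", "^")[keep]  # mark negative weights
keywords = " | ".join(p + w for p, w in zip(prefix, words[keep]))
print(keywords)  # bergers | ^hate | love | ^sandwiches
```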
def find_topic(texts, topic_model, n_topics, vec_model="tf", thr=1e-2, **kwargs):
    """Return a list of topics from texts by topic models - for demonstration on simple data
    texts: array-like strings
    topic_model: {"nmf", "svd", "lda", "kmeans"} for LSA_NMF, LSA_SVD, LDA, KMEANS (not actually a topic model)
    n_topics: # of topics in texts
    vec_model: {"tf", "tfidf"} for term_freq, term_freq_inverse_doc_freq
    thr: threshold for finding keywords in a topic model
    """
    ## 1. vectorization
    vectorizer = CountVectorizer() if vec_model == "tf" else TfidfVectorizer()
    text_vec = vectorizer.fit_transform(texts)
    words = np.array(vectorizer.get_feature_names_out())
    ## 2. topic finding
    ## KMeans takes n_clusters and exposes cluster_centers_; the decomposition
    ## models take n_components and expose components_
    if topic_model == "kmeans":
        topicfinder = KMeans(n_clusters=n_topics, **kwargs).fit(text_vec)
        topic_dists = topicfinder.cluster_centers_
    else:
        models = {"nmf": NMF, "svd": TruncatedSVD, "lda": LatentDirichletAllocation}
        topicfinder = models[topic_model](n_components=n_topics, **kwargs).fit(text_vec)
        topic_dists = topicfinder.components_
    topic_dists /= topic_dists.max(axis=1).reshape((-1, 1))
    ## 3. keywords for topics
    ## Unlike the other models, LSA_SVD generates both positive and negative values
    ## in the topic-word distribution, which makes it more ambiguous to choose
    ## keywords for topics. The signs of the weights are kept with the words here.
    def _topic_keywords(topic_dist):
        keywords_index = np.abs(topic_dist) >= thr
        keywords_prefix = np.where(np.sign(topic_dist) > 0, "", "^")[keywords_index]
        return " | ".join("".join(x) for x in zip(keywords_prefix, words[keywords_index]))
    topic_keywords = map(_topic_keywords, topic_dists)
    return "\n".join("Topic %i: %s" % (i, t) for i, t in enumerate(topic_keywords))
The truncated SVD implementation in sklearn is intuitively similar to PCA, which tries to find *orthogonal* directions that explain the largest *variance* in the texts.
When applying SVD with TF and TFIDF to the clearcut-topic texts, we get the results below. As discussed, one unique signature of SVD's results is that the words in a topic can carry both positive and negative weights. For simple cases, they can be understood as including and excluding the corresponding word in the topic.
For example "Topic 1: bergers | ^hate | love | ^sandwiches"
can be "intuitively" explained as the texts that include "love bergers" and exclude "hate sandwiches".
Depending on the random state, your topic results may differ. In the results below, we don't see clear indications of the two topics "love bergers" and "hate sandwiches". However, there are topics such as Topic 3: ^bergers | love, which means "love" but NOT "bergers". Interestingly, we may also get topics such as Topic 3: bergers | ^hate | ^love | sandwiches, which captures "bergers" and "sandwiches" as a "food" topic.
print(find_topic(sample_texts["clearcut topics"], "svd", 4, vec_model="tf"))
Topic 0: bergers | hate | love | sandwiches | we
Topic 1: bergers | ^hate | love | ^sandwiches
Topic 2: bergers | hate | love | sandwiches | ^we
Topic 3: ^bergers | love
print(find_topic(sample_texts["clearcut topics"], "svd", 4, vec_model="tfidf"))
Topic 0: bergers | hate | love | sandwiches | we
Topic 1: bergers | ^hate | love | ^sandwiches
Topic 2: bergers | hate | love | sandwiches | ^we
Topic 3: bergers | ^hate | ^love | sandwiches
In the examples above, we deliberately set a larger number of topics than expected, because most of the time you don't have prior knowledge of how many topics there are in your texts. If we explicitly set the topic number to 2, we get the following result.
print(find_topic(sample_texts["clearcut topics"], "svd", 2, vec_model="tfidf"))
Topic 0: bergers | hate | love | sandwiches | we
Topic 1: bergers | ^hate | love | ^sandwiches
When interpreting the result of SVD, it's important to contrast each topic with the previous ones, instead of looking at them separately. So the result above can be read as: the major differences in the texts are (1) including "love bergers" and (2) excluding "hate sandwiches".
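SVD also places the documents themselves in the topic space via fit_transform; a quick sketch on the clearcut corpus (regenerated inline here) shows that component 0 captures what all texts share, while component 1 is the contrast axis separating the two sentence types:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

texts = np.repeat(["we love bergers", "we hate sandwiches"], [1000, 1000])
doc_vec = CountVectorizer().fit_transform(texts)
doc_coords = TruncatedSVD(n_components=2).fit_transform(doc_vec)

# All documents get a similar coordinate on component 0 (the shared direction),
# while "love bergers" and "hate sandwiches" documents sit on opposite sides
# of component 1 (the contrast direction).
print(doc_coords[0], doc_coords[-1])
```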
Let's try SVD on the unbalanced-topic texts to see how it performs at detecting minor groups - it does quite well here.
print(find_topic(sample_texts["unbalanced topics"], "svd", 3, vec_model="tf"))
Topic 0: hate | sandwiches | we
Topic 1: bergers | ^hate | love | ^sandwiches | we
Topic 2: bergers | hate | love | sandwiches | ^we
However, it does poorly on noisy texts - SVD treats each typo variant of the same word as a distinct term and fails to capture any semantic connections between them.
print(find_topic(sample_texts["noisy topics"], "svd", 2, vec_model="tf"))
Topic 0: bergerx | bergexs | bergxrs | bexgers | bxrgers | hate | love | sandwicxes | sandwixhes | sandwxches | sandxiches | sanxwiches | saxdwiches | sxndwiches | we | xandwiches | xergers
Topic 1: ^bergerx | ^bergexs | ^bergxrs | ^bexgers | ^bxrgers | hate | ^love | sandwicxes | sandwixhes | sandwxches | sandxiches | sanxwiches | saxdwiches | sxndwiches | we | xandwiches | ^xergers
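One mitigation worth knowing about (not part of the comparison above) is to vectorize on character n-grams instead of whole words, so that typo variants still share most of their features; a minimal sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# analyzer="char_wb" builds n-grams inside word boundaries, so "bergers" and
# its typo "bergXrs" still share fragments such as "ber" and "rs ".
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vec.fit_transform(["we love bergers", "we love bergXrs"])

# Rows are l2-normalized by default, so the dot product is cosine similarity;
# it stays high because most character trigrams are unchanged by the typo.
sim = (X[0] @ X[1].T).toarray()[0, 0]
print(sim)
```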
In summary,
LDA is one of the most mentioned topic-finding models, due to its good performances on many different types of texts, and its intuitive interpretation as a "generative" process.
Intuitively, LDA finds topics as groups of words that have high co-occurrence across different documents. On the other side, documents drawn from similar mixtures of topics should also be similar, so that they can be described by these topics in a "compact" way. Ideally, similarity in the latent *topic space* implies similarity in both the observed *word space* and the *document space* - this is where the word "latent" in the name comes from.
The LDA algorithm has two main parameters controlling
Later we will see how these parameters help with finding minor topics in a skewed distribution. Finding the right parameter values is mostly a matter of experimentation.
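One hedged heuristic for that experimentation (assuming held-out perplexity is an acceptable proxy for topic quality, which is debatable) is to grid over the prior and keep the value with the lowest perplexity on held-out texts:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

texts = np.repeat(["we love bergers", "we hate sandwiches"], [10, 1000])
X = CountVectorizer().fit_transform(texts)
X_train, X_test = train_test_split(X, random_state=0)

best = None
for prior in (1e-1, 1e-3, 1e-5):
    lda = LatentDirichletAllocation(n_components=4, topic_word_prior=prior,
                                    random_state=0).fit(X_train)
    score = lda.perplexity(X_test)  # lower is better
    if best is None or score < best[1]:
        best = (prior, score)
print(best)
```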
(*TODO: make it clearer on this part*)
Compared to SVD, the topics found by LDA are much more human-understandable, as shown by the results on the clearcut-topic texts below. Within each topic there is a clear indication of the keywords' connections based on their co-occurrence. This is different from what we observed in the results of SVD; specifically,
There is also a difference when combining LDA with different vectorizations - prefer tfidf if you don't want to see too many common words in the topics.
print(find_topic(sample_texts["clearcut topics"], "lda", 4, vec_model="tf"))
Topic 0: bergers | love | we
Topic 1: bergers | love | we
Topic 2: love | we
Topic 3: hate | sandwiches | we
print(find_topic(sample_texts["clearcut topics"], "lda", 4, vec_model="tfidf"))
Topic 0: bergers | love | we
Topic 1: bergers | love | we
Topic 2: hate | sandwiches | we
Topic 3: bergers | love | we
Earlier I mentioned tuning a topic-skewness parameter to deal with unbalanced topics. In the sklearn implementation, this parameter is topic_word_prior (the other one is doc_topic_prior, which controls the sparseness of topics within each document).
The default value of topic_word_prior is $\frac{1}{n\_topics}$, which assumes an even distribution of topics. A smaller value makes the distribution more "uneven". This is illustrated in the results below.
*The minor topic we love bergers has been "glued" to a bigger one because the topic distribution is assumed to be symmetric.*
print(find_topic(sample_texts["unbalanced topics"], "lda", 4, vec_model="tf"))
Topic 0: hate | sandwiches | we
Topic 1: bergers | hate | love | sandwiches | we
Topic 2: bergers | hate | love | sandwiches | we
Topic 3: hate | sandwiches | we
*Using a smaller topic_word_prior value helps capture the minor topic, because the topics are now forced to be sparser in choosing keywords.*
print(find_topic(sample_texts["unbalanced topics"], "lda", 4, vec_model="tf", topic_word_prior=1e-5))
Topic 0: hate | sandwiches | we
Topic 1: bergers | love | we
Topic 2: hate | sandwiches | we
Topic 3: hate | sandwiches | we
Noisy texts are also a challenge for LDA. Below we can see that LDA's result on the noisy topics is unclear, because it finds no connection between the different typos of the same word.
print(find_topic(sample_texts["noisy topics"], "lda", 3, vec_model="tfidf"))
Topic 0: bergerx | bergexs | bergxrs | berxers | bexgers | bxrgers | hate | love | sandwichex | sandwicxes | sandwixhes | sandwxches | sandxiches | sanxwiches | saxdwiches | sxndwiches | we | xandwiches | xergers
Topic 1: bergerx | bergexs | bergxrs | berxers | bexgers | bxrgers | hate | love | sandwichex | sandwicxes | sandwixhes | sandwxches | sandxiches | sanxwiches | saxdwiches | sxndwiches | we | xandwiches | xergers
Topic 2: bergerx | bergexs | bergxrs | berxers | bexgers | bxrgers | hate | love | sandwichex | sandwicxes | sandwixhes | sandwxches | sandxiches | sanxwiches | saxdwiches | sxndwiches | we | xandwiches | xergers
In summary,
NMF has been discussed as a special case of LDA. The theory behind this link may be complicated, but in practice NMF can mostly be seen as an LDA whose parameters have been fixed to enforce a sparse solution. So it may not be as flexible as LDA if you want to find multiple topics within single documents, e.g., long articles, but it can work very well out of the box for corpora of short texts. This makes NMF attractive for short-text analysis, especially since its computation is usually much cheaper than LDA's.
On the other hand, the most discussed weakness of NMF is the inconsistency of its results - when you set the number of topics much higher than the true number in the texts, NMF may generate meaningless topics out of nowhere. LDA is more robust across a wide range of topic numbers.
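One rough way to guard against this (a heuristic of mine, not from the discussion above) is to watch NMF's reconstruction error as the number of topics grows; on the clearcut corpus the error collapses at the true number (2), and extra components add little:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

texts = np.repeat(["we love bergers", "we hate sandwiches"], [1000, 1000])
X = CountVectorizer().fit_transform(texts)

# reconstruction_err_ is the Frobenius-norm error of the factorization;
# look for the "elbow" where adding components stops paying off.
errors = {k: NMF(n_components=k, random_state=0).fit(X).reconstruction_err_
          for k in (1, 2, 3)}
print(errors)
```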
Let's first see an example of NMF being inconsistent. For the clearcut-topic texts, when we set the topic number to 5, reasonably close to the true number (2), the generated topics are of good quality.
print(find_topic(sample_texts["clearcut topics"], "nmf", 5, vec_model="tf"))
Topic 0: hate | sandwiches | we
Topic 1: hate | sandwiches | we
Topic 2: bergers | love | we
Topic 3: hate | sandwiches | we
Topic 4: hate | sandwiches | we
However, when we increase the number of topics to 25 (much larger than 2), some weird topics start to appear.
print(find_topic(sample_texts["clearcut topics"], "nmf", 25, vec_model="tf"))
Topic 0: hate | sandwiches | we
Topic 1: hate | sandwiches | we
Topic 2: bergers | love | we
Topic 3: we
Topic 4: hate | sandwiches | we
Topic 5: sandwiches
Topic 6: bergers | love | we
Topic 7: hate
Topic 8: love | we
Topic 9: we
Topic 10: bergers
Topic 11: hate | sandwiches | we
Topic 12: hate | sandwiches | we
Topic 13: hate
Topic 14: bergers | love | we
Topic 15: hate | sandwiches | we
Topic 16: love | we
Topic 17: hate | sandwiches | we
Topic 18: bergers | love | we
Topic 19: hate | sandwiches | we
Topic 20: sandwiches | we
Topic 21: hate | sandwiches
Topic 22: we
Topic 23: bergers | love | we
Topic 24: hate | sandwiches | we
Running the same experiment with LDA, the results are much more consistent.
print(find_topic(sample_texts["clearcut topics"], "lda", 25, vec_model="tf"))
Topic 0: bergers | love | we
Topic 1: bergers | love | we
Topic 2: bergers | love | we
Topic 3: hate | sandwiches | we
Topic 4: hate | sandwiches | we
Topic 5: bergers | hate | love | sandwiches | we
Topic 6: bergers | love | we
Topic 7: hate | sandwiches | we
Topic 8: bergers | love | we
Topic 9: bergers | love | we
Topic 10: bergers | love | we
Topic 11: bergers | love | we
Topic 12: bergers | love | we
Topic 13: bergers | hate | love | sandwiches | we
Topic 14: bergers | hate | love | sandwiches | we
Topic 15: bergers | love | we
Topic 16: bergers | hate | love | sandwiches | we
Topic 17: bergers | love | we
Topic 18: bergers | love | we
Topic 19: bergers | love | we
Topic 20: bergers | love | we
Topic 21: bergers | love | we
Topic 22: bergers | love | we
Topic 23: hate | sandwiches | we
Topic 24: bergers | love | we
Given an appropriate number of topics, NMF is also good at finding unbalanced topic distributions.
print(find_topic(sample_texts["unbalanced topics"], "nmf", 5, vec_model="tfidf"))
Topic 0: hate | sandwiches | we
Topic 1: hate | sandwiches | we
Topic 2: bergers | love | we
Topic 3: hate | sandwiches | we
Topic 4: hate | sandwiches | we
Impressively, NMF seems to be the only topic-finding model here that deals with "noisy texts" without a lot of fine-tuning. This is very useful for a first round of data exploration.
print(find_topic(sample_texts["noisy topics"], "nmf", 5, vec_model="tfidf"))
Topic 0: bergxrs | berxers | bexgers | bxrgers | love | we
Topic 1: hate | sandwichex | sandwicxes | sandwixhes | sandwxches | sanxwiches | saxdwiches | sxndwiches | we | xandwiches
Topic 2: bergerx | berxers | bexgers | bxrgers | love | we
Topic 3: hate | sandwichex | sandwxches | sandxiches | saxdwiches | sxndwiches | we
Topic 4: bergexs | berxers | bexgers | bxrgers | love | we | xergers
In summary,
Clustering methods such as KMeans group documents based on their vector representations (or even directly on their distance matrices). However, clustering is not usually seen as a topic-finding method, because it is hard to explain its results as groups of keywords.
However, when used together with tf/tfidf vectors, the centers of the clusters can be interpreted as weights over words, analogous to the topic-word distributions of LDA and NMF.
print(find_topic(sample_texts["clearcut topics"], "kmeans", 10, vec_model="tf"))
Topic 0: hate | sandwiches | we
Topic 1: bergers | love | we
Topic 2: hate | sandwiches | we
Topic 3: bergers | love | we
Topic 4: bergers | love | we
Topic 5: bergers | love | we
Topic 6: bergers | love | we
Topic 7: bergers | love | we
Topic 8: bergers | love | we
Topic 9: bergers | love | we
print(find_topic(sample_texts["unbalanced topics"], "kmeans", 10, vec_model="tf"))
Topic 0: hate | sandwiches | we
Topic 1: bergers | love | we
Topic 2: hate | sandwiches | we
Topic 3: hate | sandwiches | we
Topic 4: hate | sandwiches | we
Topic 5: hate | sandwiches | we
Topic 6: hate | sandwiches | we
Topic 7: hate | sandwiches | we
Topic 8: hate | sandwiches | we
Topic 9: hate | sandwiches | we
print(find_topic(sample_texts["noisy topics"], "kmeans", 10, vec_model="tf"))
Topic 0: bergerx | berxers | bexgers | bxrgers | love | we
Topic 1: hate | sandwichex | sandwxches | saxdwiches | sxndwiches | we
Topic 2: bergxrs | love | we
Topic 3: hate | sandwicxes | we
Topic 4: bergexs | love | we
Topic 5: hate | sandxiches | we
Topic 6: hate | sanxwiches | we
Topic 7: love | we | xergers
Topic 8: hate | sandwixhes | we
Topic 9: hate | we | xandwiches
In summary,
Just like NMF, KMeans performs well on different types of short texts, including unbalanced topic distributions and noisy data. Even better, its results seem more robust than NMF's to the choice of the number of topics.
Furthermore, its computation is usually cheap, and there are implementations that scale to very large datasets. Unlike with LDA, integrating clustering with other document-vectorization methods is easy. For example, if a large external corpus is available to train a word-embedding model, topic-finding via clustering can be extended by using word vectors that carry more semantic meaning.
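As a hypothetical sketch of that extension (the embeddings below are random stand-ins; in practice they would come from a pretrained word2vec/GloVe model), average each document's word vectors and cluster the averages:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in embeddings, purely illustrative: real ones would come from a
# pretrained word-embedding model trained on a large external corpus.
embedding = {w: rng.normal(size=50)
             for w in ["we", "love", "hate", "bergers", "sandwiches"]}

def doc_vector(text):
    # Average the word vectors of the tokens - a common, crude doc encoding.
    return np.mean([embedding[w] for w in text.split()], axis=0)

texts = ["we love bergers", "we hate sandwiches"] * 10
doc_vecs = np.array([doc_vector(t) for t in texts])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(doc_vecs)
print(labels[:4])
```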
Lastly, I want to briefly discuss another perspective on topic-finding. In most cases we are interested in grouping documents according to their topic distributions and finding descriptive keywords for each topic. Another way to look at topics is to ask whether they group "semantically connected" words together.
Most researchers agree that the "semantics" of a word are defined by its contexts, i.e., the other words surrounding it. For example, "love" and "hate" can be seen as semantically connected because both can be used in the same context, "I _ apples". In fact, one of the main goals of word-embedding research is to find vector representations of words, phrases, or even documents such that their semantic closeness is retained in the vector space.
Finding topics that group "semantically similar" words is not necessarily the same as grouping words that frequently co-occur. From the results below, we can see that most of the methods discussed generate co-occurrence-oriented topics rather than semantic ones. Only SVD sheds some light on the semantic view - both the "bergers vs. sandwiches" and the "love vs. hate" groupings appear.
But please bear in mind that these results are based on very simple toy-like texts. I include them here only to highlight the differences between the models from another perspective.
print(find_topic(sample_texts["semantic topics"], "svd", 4, vec_model="tfidf"))
Topic 0: bergers | hate | love | sandwiches | we
Topic 1: bergers | ^sandwiches
Topic 2: ^hate | love
Topic 3: ^bergers | ^hate | ^love | ^sandwiches | we
print(find_topic(sample_texts["semantic topics"], "nmf", 5, vec_model="tfidf"))
Topic 0: hate | sandwiches | we
Topic 1: bergers | love | we
Topic 2: bergers | we
Topic 3: love | sandwiches | we
Topic 4: hate | we
print(find_topic(sample_texts["semantic topics"], "lda", 5, vec_model="tfidf"))
Topic 0: bergers | love | we
Topic 1: bergers | hate | we
Topic 2: love | sandwiches | we
Topic 3: bergers | love | we
Topic 4: bergers | love | we
print(find_topic(sample_texts["semantic topics"], "kmeans", 5, vec_model="tfidf"))
Topic 0: hate | sandwiches | we
Topic 1: love | sandwiches | we
Topic 2: bergers | hate | we
Topic 3: bergers | love | we
Topic 4: hate | sandwiches | we
*Thank you for reading. I am sure there will be mistakes and inaccuracies; please feel free to open a PR on my GitHub.*