pyLDAvis

pyLDAvis is a Python library for interactive topic model visualization. It is a port of the fabulous R package by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization. pyLDAvis makes it easy to use the visualization from Python and, in particular, IPython notebooks. To learn more about the method behind the visualization I suggest reading the original paper explaining it.

This notebook provides a quick overview of how to use pyLDAvis. Refer to the documentation for details.

BYOM - Bring your own model

pyLDAvis is agnostic to how your model was trained. To visualize it you need to provide the topic-term distributions, the document-topic distributions, and basic information about the corpus on which the model was trained. The main function is the prepare function, which transforms your data into the format needed for the visualization.
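
To make those inputs concrete, here is a minimal sketch of what a consistent set of them looks like for a toy model with 2 topics, 4 documents, and 3 terms. The keyword names match the ones passed to prepare later in this notebook; the helper `check_prepare_inputs` and the toy numbers are made up for illustration and are not part of pyLDAvis.

```python
import math

def check_prepare_inputs(topic_term_dists, doc_topic_dists, doc_lengths,
                         vocab, term_frequency):
    """Hypothetical sanity check (not part of pyLDAvis).

    Returns (n_topics, n_docs, n_terms) if the five inputs are consistent,
    otherwise raises ValueError.
    """
    n_topics = len(topic_term_dists)
    n_docs = len(doc_lengths)
    n_terms = len(vocab)

    if any(len(row) != n_terms for row in topic_term_dists):
        raise ValueError('each topic needs one probability per vocabulary term')
    if len(doc_topic_dists) != n_docs:
        raise ValueError('doc_topic_dists needs one row per document')
    if any(len(row) != n_topics for row in doc_topic_dists):
        raise ValueError('each document needs one probability per topic')
    if len(term_frequency) != n_terms:
        raise ValueError('term_frequency needs one count per vocabulary term')

    # Every row is a probability distribution, so it should sum to 1.
    for row in list(topic_term_dists) + list(doc_topic_dists):
        if not math.isclose(sum(row), 1.0, abs_tol=1e-6):
            raise ValueError('every distribution must sum to 1')

    return n_topics, n_docs, n_terms

# Toy data, invented for this example.
toy_model_data = {
    'topic_term_dists': [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]],               # topics x terms
    'doc_topic_dists': [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.5, 0.5]],  # docs x topics
    'doc_lengths': [12, 7, 30, 5],                                        # tokens per document
    'vocab': ['film', 'plot', 'actor'],
    'term_frequency': [20, 14, 20],                                       # corpus-wide counts
}

print(check_prepare_inputs(**toy_model_data))  # prints (2, 4, 3)
```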

Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by Pang and Lee (ACL, 2004), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup.

In [10]:
import json
import numpy as np

def load_R_model(filename):
    with open(filename, 'r') as j:
        data_input = json.load(j)
    data = {'topic_term_dists': data_input['phi'], 
            'doc_topic_dists': data_input['theta'],
            'doc_lengths': data_input['doc.length'],
            'vocab': data_input['vocab'],
            'term_frequency': data_input['term.frequency']}
    return data

movies_model_data = load_R_model('data/movie_reviews_input.json')

print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))
Topic-Term shape: (20, 14567)
Doc-Topic shape: (2000, 20)

Now that we have the data loaded we use the prepare function:

In [11]:
import pyLDAvis
movies_vis_data = pyLDAvis.prepare(**movies_model_data)

Once you have the visualization data prepared you can do a number of things with it. You can save the vis to a stand-alone HTML file, serve it, or display it in the notebook. Let's go ahead and display it:

In [12]:
pyLDAvis.display(movies_vis_data)
Out[12]:

Pretty, huh?! Again, you should be thanking the original LDAvis people for that. You may thank me for the IPython integration though. :) Aside from being aesthetically pleasing, this visualization more importantly represents a lot of information about the topic model that is hard to take in all at once with ad-hoc queries. To learn more about the visual elements and how they help you explore your model, see this documentation from the original R project and this presentation (slides, video).

To see other models visualized check out this notebook.
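
Displaying inline is only one of the options mentioned earlier. Below is a minimal sketch of the saving route, using `pyLDAvis.save_html`. The wrapper name `export_visualization` and the default path `movies_vis.html` are made up for illustration, and pyLDAvis is imported inside the function only so that the definition can be read on its own.

```python
def export_visualization(vis_data, html_path='movies_vis.html'):
    """Hypothetical helper: write prepared visualization data to a stand-alone HTML file."""
    import pyLDAvis  # assumed to be installed

    # save_html accepts either a filename or an open file object.
    pyLDAvis.save_html(vis_data, html_path)
    return html_path

# Usage, given the prepared data from the cells above:
#   export_visualization(movies_vis_data)
# To serve the visualization from a local web server instead:
#   pyLDAvis.show(movies_vis_data)
```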

ProTip: To avoid tediously typing in display all the time use:

In [13]:
pyLDAvis.enable_notebook()

By default the topics are projected to the 2D plane using PCoA on a distance matrix created using the Jensen-Shannon divergence on the topic-term distributions. You can pass in a different multidimensional scaling function via the mds parameter. In addition to pcoa, the other provided options are tsne and mmds, which operate on the same JS-divergence distance matrix. Both tsne and mmds require that you have sklearn installed. Here is tsne in action:

In [14]:
pyLDAvis.prepare(mds='tsne', **movies_model_data)
Out[14]:

Here is mmds in action:

In [15]:
pyLDAvis.prepare(mds='mmds', **movies_model_data)
Out[15]:
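
The distance matrix that all three projections start from can be sketched in plain Python. This is an illustration of the Jensen-Shannon divergence between topic-term distributions, not pyLDAvis's own code: the function names, the toy topics, and the choice of the natural logarithm are assumptions made for this example.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def js_distance_matrix(topic_term_dists):
    """Pairwise Jensen-Shannon divergence between all topics (rows)."""
    n = len(topic_term_dists)
    return [[jensen_shannon(topic_term_dists[i], topic_term_dists[j])
             for j in range(n)]
            for i in range(n)]

# Toy topic-term distributions, invented for this example.
# Topics 0 and 2 are identical, so their divergence is zero.
toy_topics = [
    [0.7, 0.2, 0.1, 0.0],
    [0.1, 0.1, 0.4, 0.4],
    [0.7, 0.2, 0.1, 0.0],
]

dist = js_distance_matrix(toy_topics)
for row in dist:
    print(['%.3f' % d for d in row])
```

The matrix is symmetric with zeros on the diagonal, and with the natural logarithm every entry is at most ln 2. A scaling method such as PCoA then looks for 2D coordinates whose pairwise distances approximate these values.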

Making the common case easy - Gensim and others!

Built on top of the generic prepare function are helper functions for gensim, scikit-learn, and GraphLab Create. To demonstrate, below I load a trained gensim model and the corresponding dictionary and corpus (see this notebook for how these were created):

In [16]:
import gensim

dictionary = gensim.corpora.Dictionary.load('newsgroups.dict')
corpus = gensim.corpora.MmCorpus('newsgroups.mm')
lda = gensim.models.ldamodel.LdaModel.load('newsgroups_50.model')

In the dark ages, all we had for inspecting our topics was show_topics and friends:

In [17]:
lda.show_topics()
Out[17]:
['0.029*pat + 0.014*resurrection + 0.010*threw + 0.010*black + 0.009*temple + 0.009*article + 0.009*aaron + 0.008*front + 0.008*weight + 0.008*back',
 '0.016*palestinians + 0.012*win + 0.011*soldiers + 0.011*japanese + 0.011*republic + 0.010*dale + 0.010*libertarian + 0.010*democratic + 0.010*trade + 0.009*cultural',
 '0.050*year + 0.016*percent + 0.013*young + 0.012*neutral + 0.012*media + 0.010*record + 0.010*last + 0.008*league + 0.008*playoffs + 0.008*boston',
 '0.032*posting + 0.031*host + 0.028*nntp + 0.025*article + 0.022*edu + 0.022*university + 0.021*western + 0.018*occupied + 0.018*case + 0.016*usa',
 '0.025*israeli + 0.020*file + 0.011*windows + 0.009*program + 0.009*use + 0.008*ftp + 0.008*available + 0.008*files + 0.008*version + 0.007*software',
 '0.025*coverage + 0.015*good + 0.014*mit + 0.012*morris + 0.012*cover + 0.010*tie + 0.010*new + 0.009*hallam + 0.009*rangers + 0.008*xlib',
 '0.022*government + 0.020*gun + 0.016*article + 0.016*people + 0.015*guns + 0.014*clipper + 0.013*crime + 0.012*drugs + 0.009*country + 0.008*bill',
 '0.075*turkey + 0.022*rochester + 0.021*cyprus + 0.018*planes + 0.016*libertarians + 0.011*josh + 0.010*personnel + 0.009*train + 0.009*randy + 0.009*weaver',
 '0.013*card + 0.008*use + 0.008*video + 0.007*msg + 0.007*get + 0.007*one + 0.007*problem + 0.006*apple + 0.006*computer + 0.006*com',
 "0.017*would + 0.013*don't + 0.010*one + 0.010*think + 0.010*like + 0.009*people + 0.008*it's + 0.008*make + 0.007*much + 0.007*get"]

Thankfully, in addition to these still helpful functions, we can get a feel for all of the topics with this one-liner:

In [18]:
import pyLDAvis.gensim

pyLDAvis.gensim.prepare(lda, corpus, dictionary)
Out[18]: