# pyLDAvis

pyLDAvis is a Python library for interactive topic model visualization. It is a port of the fabulous R package by Carson Sievert and Kenny Shirley. They did the hard work of crafting an effective visualization; pyLDAvis makes it easy to use the visualization from Python and, in particular, Jupyter notebooks. To learn more about the method behind the visualization, I suggest reading the original paper explaining it.

This notebook provides a quick overview of how to use pyLDAvis. Refer to the documentation for details.

## BYOM - Bring your own model

pyLDAvis is agnostic to how your model was trained. To visualize it, you need to provide the topic-term distributions, the document-topic distributions, and basic information about the corpus the model was trained on. The main entry point is the prepare function, which transforms your data into the format needed for the visualization.

Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by Pang and Lee (ACL, 2004), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup.

In [10]:
import json
import numpy as np

def load_R_model(filename):
    # Read the JSON file exported from R and map its keys onto the
    # argument names that pyLDAvis.prepare expects.
    with open(filename, 'r') as j:
        data_input = json.load(j)
    data = {'topic_term_dists': data_input['phi'],
            'doc_topic_dists': data_input['theta'],
            'doc_lengths': data_input['doc.length'],
            'vocab': data_input['vocab'],
            'term_frequency': data_input['term.frequency']}
    return data

# Path to the JSON file shipped alongside this notebook.
movies_model_data = load_R_model('data/movie_reviews_input.json')

print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))

Topic-Term shape: (20, 14567)
Doc-Topic shape: (2000, 20)
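Before calling prepare, it can help to sanity-check the inputs. Here is a minimal sketch with made-up numbers (the variable names mirror the keys loaded above) showing the invariants prepare relies on: each row of both matrices is a probability distribution, and the dimensions must line up with the vocabulary and document lengths.

```python
import numpy as np

# Tiny synthetic example of the five inputs (values are made up).
topic_term_dists = np.array([[0.6, 0.3, 0.1],   # topic 0 over a 3-word vocab
                             [0.2, 0.2, 0.6]])  # topic 1
doc_topic_dists = np.array([[0.9, 0.1],         # doc 0 over 2 topics
                            [0.4, 0.6]])        # doc 1
doc_lengths = [8, 5]
vocab = ['film', 'actor', 'plot']
term_frequency = [7, 3, 3]

# Rows of both matrices are probability distributions, so they sum to 1,
# and the shapes must agree with vocab and doc_lengths.
assert np.allclose(topic_term_dists.sum(axis=1), 1)
assert np.allclose(doc_topic_dists.sum(axis=1), 1)
assert topic_term_dists.shape[1] == len(vocab) == len(term_frequency)
assert doc_topic_dists.shape[0] == len(doc_lengths)
```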


Now that we have the data loaded we use the prepare function:

In [11]:
import pyLDAvis
movies_vis_data = pyLDAvis.prepare(**movies_model_data)


Once you have the visualization data prepared, you can do a number of things with it. You can save the visualization to a stand-alone HTML file, serve it, or display it in the notebook. Let's go ahead and display it:

In [12]:
pyLDAvis.display(movies_vis_data)

Out[12]:
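Besides displaying inline, the prepared data can be written to a stand-alone HTML page or served from a local web server. A quick sketch using pyLDAvis's save_html and show helpers (the output filename here is just an example):

```python
import pyLDAvis

# Write a self-contained HTML page you can open or host anywhere.
pyLDAvis.save_html(movies_vis_data, 'movies_vis.html')

# Or start a local web server and open the visualization in a browser
# (blocks until you stop it):
# pyLDAvis.show(movies_vis_data)
```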

Pretty, huh?! Again, you should be thanking the original LDAvis people for that. You may thank me for the Jupyter integration, though. :) Aside from being aesthetically pleasing, this visualization, more importantly, conveys a lot of information about the topic model that is hard to take in all at once with ad-hoc queries. To learn more about the visual elements and how they help you explore your model, see this documentation from the original R project and this presentation (slides, video).

To see other models visualized check out this notebook.

ProTip: To avoid tediously typing display all the time, use:

In [13]:
pyLDAvis.enable_notebook()


By default, the topics are projected to the 2D plane using PCoA on a distance matrix created by applying the Jensen-Shannon divergence to the topic-term distributions. You can pass in a different multidimensional scaling function via the mds parameter. In addition to pcoa, the other provided options are tsne and mmds, which operate on the same JS-divergence distance matrix. Both tsne and mmds require that you have sklearn installed. Here is tsne in action:

In [14]:
pyLDAvis.prepare(mds='tsne', **movies_model_data)

Out[14]:

Here is mmds in action:

In [15]:
pyLDAvis.prepare(mds='mmds', **movies_model_data)

Out[15]:

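Under the hood, all three projections start from the same pairwise distance matrix over the topic-term rows. As a rough sketch (pyLDAvis's internal implementation may differ in details), the Jensen-Shannon divergence can be computed like this, with a made-up 3-topic example:

```python
import numpy as np

def jensen_shannon(p, q):
    # Symmetric, bounded divergence between two probability distributions:
    # the average KL divergence of p and q from their midpoint m.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0 * log(0) is taken as 0 by convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Pairwise distance matrix, as fed to PCoA / MMDS / t-SNE
# (illustrative 3-topic example; the real input above is 20 x 14567).
topics = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.3, 0.4, 0.3]])
dist = np.array([[jensen_shannon(p, q) for q in topics] for p in topics])
```

The resulting matrix is symmetric with a zero diagonal, and with base-2 logarithms every entry lies in [0, 1].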
## Making the common case easy - Gensim and others!

Built on top of the generic prepare function are helper functions for gensim, scikit-learn, and GraphLab Create. To demonstrate below, I am loading up a trained gensim model and corresponding dictionary and corpus (see this notebook for how these were created):

In [16]:
import gensim

# Load the dictionary, bag-of-words corpus, and trained LDA model
# (filenames assumed to match the companion notebook that created them).
dictionary = gensim.corpora.Dictionary.load('newsgroups.dict')
corpus = gensim.corpora.MmCorpus('newsgroups.mm')
lda = gensim.models.ldamodel.LdaModel.load('newsgroups_50.model')


In the dark ages, in order to inspect our topics, all we had was show_topics and friends:

In [17]:
lda.show_topics()

Out[17]:
['0.029*pat + 0.014*resurrection + 0.010*threw + 0.010*black + 0.009*temple + 0.009*article + 0.009*aaron + 0.008*front + 0.008*weight + 0.008*back',
'0.016*palestinians + 0.012*win + 0.011*soldiers + 0.011*japanese + 0.011*republic + 0.010*dale + 0.010*libertarian + 0.010*democratic + 0.010*trade + 0.009*cultural',
'0.050*year + 0.016*percent + 0.013*young + 0.012*neutral + 0.012*media + 0.010*record + 0.010*last + 0.008*league + 0.008*playoffs + 0.008*boston',
'0.032*posting + 0.031*host + 0.028*nntp + 0.025*article + 0.022*edu + 0.022*university + 0.021*western + 0.018*occupied + 0.018*case + 0.016*usa',
'0.025*israeli + 0.020*file + 0.011*windows + 0.009*program + 0.009*use + 0.008*ftp + 0.008*available + 0.008*files + 0.008*version + 0.007*software',
'0.025*coverage + 0.015*good + 0.014*mit + 0.012*morris + 0.012*cover + 0.010*tie + 0.010*new + 0.009*hallam + 0.009*rangers + 0.008*xlib',
'0.022*government + 0.020*gun + 0.016*article + 0.016*people + 0.015*guns + 0.014*clipper + 0.013*crime + 0.012*drugs + 0.009*country + 0.008*bill',
'0.075*turkey + 0.022*rochester + 0.021*cyprus + 0.018*planes + 0.016*libertarians + 0.011*josh + 0.010*personnel + 0.009*train + 0.009*randy + 0.009*weaver',
'0.013*card + 0.008*use + 0.008*video + 0.007*msg + 0.007*get + 0.007*one + 0.007*problem + 0.006*apple + 0.006*computer + 0.006*com',
"0.017*would + 0.013*don't + 0.010*one + 0.010*think + 0.010*like + 0.009*people + 0.008*it's + 0.008*make + 0.007*much + 0.007*get"]

Thankfully, in addition to these still helpful functions, we can get a feel for all of the topics with this one-liner:

In [18]:
import pyLDAvis.gensim

pyLDAvis.gensim.prepare(lda, corpus, dictionary)

Out[18]: