#!/usr/bin/env python
# coding: utf-8

# # pyLDAvis
#
# [`pyLDAvis`](https://github.com/bmabey/pyLDAvis) is a Python library for interactive topic model visualization. It is a port of the fabulous [R package](https://github.com/cpsievert/LDAvis) by Carson Sievert and Kenny Shirley, who did the hard work of crafting an effective visualization. `pyLDAvis` makes it easy to use the visualization from Python and, in particular, Jupyter notebooks. To learn more about the method behind the visualization, I suggest reading the [original paper](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) explaining it.
#
# This notebook provides a quick overview of how to use `pyLDAvis`. Refer to the [documentation](https://pyldavis.readthedocs.org/en/latest/) for details.

# ## BYOM - Bring your own model
#
# `pyLDAvis` is agnostic to how your model was trained. To visualize it you need to provide the topic-term distributions, the document-topic distributions, and basic information about the corpus the model was trained on. The main entry point is the [`prepare`](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.prepare) function, which transforms your data into the format the visualization needs.
#
# Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by [Pang and Lee (ACL, 2004)](http://www.cs.cornell.edu/people/pabo/movie-review-data/), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup.
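# Before touching the real model, here is a minimal sketch of the five inputs `prepare` expects. Every name and number below is made up for illustration; what matters are the shapes and that each row of the two distribution matrices sums to one:

```python
import numpy as np

# Hypothetical sizes: 3 topics, a 5-word vocabulary, 4 documents.
n_topics, n_terms, n_docs = 3, 5, 4
rng = np.random.default_rng(0)

# Each row is a probability distribution, so normalize rows to sum to 1.
topic_term_dists = rng.random((n_topics, n_terms))
topic_term_dists /= topic_term_dists.sum(axis=1, keepdims=True)

doc_topic_dists = rng.random((n_docs, n_topics))
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)

model_data = {
    'topic_term_dists': topic_term_dists,   # shape (n_topics, n_terms)
    'doc_topic_dists': doc_topic_dists,     # shape (n_docs, n_topics)
    'doc_lengths': [120, 80, 95, 60],       # token count of each document
    'vocab': ['film', 'plot', 'actor', 'scene', 'score'],
    'term_frequency': [40, 22, 18, 15, 9],  # corpus-wide count of each term
}

assert np.allclose(topic_term_dists.sum(axis=1), 1.0)
assert np.allclose(doc_topic_dists.sum(axis=1), 1.0)
```

# Any model that can produce arrays like these, whatever its origin, can be visualized with `pyLDAvis.prepare(**model_data)`.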
# In[10]:

import json
import numpy as np

def load_R_model(filename):
    with open(filename, 'r') as j:
        data_input = json.load(j)
    data = {'topic_term_dists': data_input['phi'],
            'doc_topic_dists': data_input['theta'],
            'doc_lengths': data_input['doc.length'],
            'vocab': data_input['vocab'],
            'term_frequency': data_input['term.frequency']}
    return data

movies_model_data = load_R_model('data/movie_reviews_input.json')

print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))

# Now that we have the data loaded, we use the `prepare` function:

# In[11]:

import pyLDAvis

movies_vis_data = pyLDAvis.prepare(**movies_model_data)

# Once you have the visualization data prepared, you can do a number of things with it. You can [save the vis](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.save_html) to a stand-alone HTML file, [serve it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.show), or [display it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.display) in the notebook. Let's go ahead and display it:

# In[12]:

pyLDAvis.display(movies_vis_data)

# Pretty, huh?! Again, you should be thanking the original [LDAvis people](https://github.com/cpsievert/LDAvis) for that. You may thank me for the Jupyter integration, though. :) Aside from being aesthetically pleasing, this visualization more importantly represents a lot of information about the topic model that is hard to take in all at once with ad-hoc queries. To learn more about the visual elements and how they help you explore your model, see [this documentation](http://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf) from the original R project and this presentation ([slides](https://speakerdeck.com/bmabey/visualizing-topic-models), [video](https://www.youtube.com/watch?v=IksL96ls4o0)).
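# For contrast, the kind of ad-hoc query the visualization saves you from might look like the sketch below: listing the top terms per topic straight from a topic-term matrix. The data here is a tiny synthetic stand-in, not the movie-review model:

```python
import numpy as np

# Synthetic topic-term matrix: 2 topics over a 4-word vocabulary.
vocab = ['film', 'plot', 'actor', 'scene']
topic_term_dists = np.array([[0.5, 0.3, 0.1, 0.1],
                             [0.1, 0.1, 0.4, 0.4]])

def top_terms(topic_term_dists, vocab, n=2):
    """Return the n highest-probability terms for each topic."""
    order = np.argsort(topic_term_dists, axis=1)[:, ::-1]
    return [[vocab[i] for i in row[:n]] for row in order]

print(top_terms(topic_term_dists, vocab))
# [['film', 'plot'], ['scene', 'actor']]
```

# Queries like this show one slice of the model at a time; the visualization lays out topic prevalence, inter-topic distances, and term relevance all at once.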
# To see other models visualized, check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Movie%20Reviews,%20AP%20News,%20and%20Jeopardy.ipynb).
#
# *ProTip:* To avoid tediously typing in `display` all the time, use:

# In[13]:

pyLDAvis.enable_notebook()

# By default the topics are projected onto the 2D plane using [PCoA](https://en.wikipedia.org/wiki/PCoA) on a distance matrix created by applying the [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen–Shannon_divergence) to the topic-term distributions. You can pass in a different multidimensional scaling function via the `mds` parameter. In addition to `pcoa`, the other provided options are `tsne` and `mmds`, which operate on the same JS-divergence distance matrix. Both `tsne` and `mmds` require that you have sklearn installed. Here is `tsne` in action:

# In[14]:

pyLDAvis.prepare(mds='tsne', **movies_model_data)

# Here is `mmds` in action:

# In[15]:

pyLDAvis.prepare(mds='mmds', **movies_model_data)

# ## Making the common case easy - Gensim and others!
#
# Built on top of the generic `prepare` function are helper functions for [gensim](https://radimrehurek.com/gensim/), [scikit-learn](http://scikit-learn.org/stable/), and [GraphLab Create](https://dato.com/products/create/).
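# These helpers read everything `prepare` needs directly off the library's model, corpus, and dictionary objects. For reference, the bag-of-words corpus format gensim uses is simple; this stdlib-only sketch roughly mirrors what gensim's `Dictionary` and `doc2bow` produce (the tokens and the resulting ids are illustrative):

```python
from collections import Counter

docs = [['film', 'plot', 'film'],
        ['actor', 'scene', 'actor', 'film']]

# Assign an integer id to each unique token, in order of first appearance
# (what gensim's Dictionary does).
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Each document becomes a list of (token_id, count) pairs (like doc2bow).
corpus = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]
print(corpus)
# [[(0, 2), (1, 1)], [(0, 1), (2, 2), (3, 1)]]
```

# With gensim itself you would never build this by hand; the point is only that a "corpus" here is just per-document token counts keyed by dictionary ids.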
# To demonstrate, below I am loading up a trained gensim model and its corresponding dictionary and corpus (see [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb) for how these were created):

# In[16]:

import gensim

dictionary = gensim.corpora.Dictionary.load('newsgroups.dict')
corpus = gensim.corpora.MmCorpus('newsgroups.mm')
lda = gensim.models.ldamodel.LdaModel.load('newsgroups_50.model')

# In the dark ages, in order to inspect our topics, all we had was `show_topics` and friends:

# In[17]:

lda.show_topics()

# Thankfully, in addition to these *still helpful functions*, we can get a feel for all of the topics with this one-liner:

# In[18]:

import pyLDAvis.gensim_models as gensimvis

gensimvis.prepare(lda, corpus, dictionary)

# ## sklearn
#
# For examples of how to use scikit-learn's topic models with pyLDAvis, see [this notebook](http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/LDA%20model.ipynb).

# ## GraphLab
#
# For GraphLab integration, check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/GraphLab.ipynb#topic=7&lambda=0.41&term=).

# ## Go forth and visualize!
#
# What are you waiting for? Go ahead and `pip install pyldavis`.