#!/usr/bin/env python
# coding: utf-8

# # pyLDAvis
#
# [`pyLDAvis`](https://github.com/bmabey/pyLDAvis) is a Python library for interactive topic model visualization. It is a port of the fabulous [R package](https://github.com/cpsievert/LDAvis) by Carson Sievert and Kenny Shirley, who did the hard work of crafting an effective visualization. `pyLDAvis` makes it easy to use the visualization from Python and, in particular, Jupyter notebooks. To learn more about the method behind the visualization, I suggest reading the [original paper](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf) explaining it.
#
# This notebook provides a quick overview of how to use `pyLDAvis`. Refer to the [documentation](https://pyldavis.readthedocs.org/en/latest/) for details.

# ## BYOM - Bring your own model
#
# `pyLDAvis` is agnostic to how your model was trained. To visualize it you need to provide the topic-term distributions, the document-topic distributions, and basic information about the corpus the model was trained on. The main entry point is the [`prepare`](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.prepare) function, which transforms your data into the format the visualization needs.
#
# Below we load a model trained in R and then visualize it. The model was trained on a corpus of 2000 movie reviews parsed by [Pang and Lee (ACL, 2004)](http://www.cs.cornell.edu/people/pabo/movie-review-data/), originally gathered from the IMDB archive of the rec.arts.movies.reviews newsgroup.
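# Before touching the real model, here is a minimal sketch of the five inputs `prepare` expects. Every name and number below is made up for illustration; what matters are the shapes and that each row of the two distribution matrices sums to one:

```python
import numpy as np

# Hypothetical sizes: 3 topics, a 5-word vocabulary, 4 documents.
n_topics, n_terms, n_docs = 3, 5, 4
rng = np.random.default_rng(0)

# Each row is a probability distribution, so normalize rows to sum to 1.
topic_term_dists = rng.random((n_topics, n_terms))
topic_term_dists /= topic_term_dists.sum(axis=1, keepdims=True)

doc_topic_dists = rng.random((n_docs, n_topics))
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)

model_data = {
    'topic_term_dists': topic_term_dists,   # shape (n_topics, n_terms)
    'doc_topic_dists': doc_topic_dists,     # shape (n_docs, n_topics)
    'doc_lengths': [120, 80, 95, 60],       # token count of each document
    'vocab': ['film', 'plot', 'actor', 'scene', 'score'],
    'term_frequency': [40, 22, 18, 15, 9],  # corpus-wide count of each term
}

assert np.allclose(topic_term_dists.sum(axis=1), 1.0)
assert np.allclose(doc_topic_dists.sum(axis=1), 1.0)
```

# Any model that can produce arrays like these, whatever its origin, can be visualized with `pyLDAvis.prepare(**model_data)`.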
# In[10]:

import json
import numpy as np

def load_R_model(filename):
    with open(filename, 'r') as j:
        data_input = json.load(j)
    data = {'topic_term_dists': data_input['phi'],
            'doc_topic_dists': data_input['theta'],
            'doc_lengths': data_input['doc.length'],
            'vocab': data_input['vocab'],
            'term_frequency': data_input['term.frequency']}
    return data

movies_model_data = load_R_model('data/movie_reviews_input.json')

print('Topic-Term shape: %s' % str(np.array(movies_model_data['topic_term_dists']).shape))
print('Doc-Topic shape: %s' % str(np.array(movies_model_data['doc_topic_dists']).shape))

# Now that we have the data loaded, we use the `prepare` function:

# In[11]:

import pyLDAvis

movies_vis_data = pyLDAvis.prepare(**movies_model_data)

# Once you have the visualization data prepared, you can do a number of things with it. You can [save the vis](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.save_html) to a stand-alone HTML file, [serve it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.show), or [display it](https://pyldavis.readthedocs.org/en/latest/modules/API.html#pyLDAvis.display) in the notebook. Let's go ahead and display it:

# In[12]:

pyLDAvis.display(movies_vis_data)

# Pretty, huh?! Again, you should be thanking the original [LDAvis people](https://github.com/cpsievert/LDAvis) for that. You may thank me for the Jupyter integration, though. :) Aside from being aesthetically pleasing, this visualization more importantly represents a lot of information about the topic model that is hard to take in all at once with ad-hoc queries. To learn more about the visual elements and how they help you explore your model, see [this documentation](http://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf) from the original R project and this presentation ([slides](https://speakerdeck.com/bmabey/visualizing-topic-models), [video](https://www.youtube.com/watch?v=IksL96ls4o0)).
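# For contrast, the kind of ad-hoc query the visualization saves you from might look like the sketch below: listing the top terms per topic straight from a topic-term matrix. The data here is a tiny synthetic stand-in, not the movie-review model:

```python
import numpy as np

# Synthetic topic-term matrix: 2 topics over a 4-word vocabulary.
vocab = ['film', 'plot', 'actor', 'scene']
topic_term_dists = np.array([[0.5, 0.3, 0.1, 0.1],
                             [0.1, 0.1, 0.4, 0.4]])

def top_terms(topic_term_dists, vocab, n=2):
    """Return the n highest-probability terms for each topic."""
    order = np.argsort(topic_term_dists, axis=1)[:, ::-1]
    return [[vocab[i] for i in row[:n]] for row in order]

print(top_terms(topic_term_dists, vocab))
# [['film', 'plot'], ['scene', 'actor']]
```

# Queries like this show one slice of the model at a time; the visualization lays out topic prevalence, inter-topic distances, and term relevance all at once.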
# To see other models visualized, check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Movie%20Reviews,%20AP%20News,%20and%20Jeopardy.ipynb).
#
# *ProTip:* To avoid tediously typing in `display` all the time, use:

# In[13]:

pyLDAvis.enable_notebook()

# By default the topics are projected onto the 2D plane using [PCoA](https://en.wikipedia.org/wiki/PCoA) on a distance matrix created by applying the [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen–Shannon_divergence) to the topic-term distributions. You can pass in a different multidimensional scaling function via the `mds` parameter. In addition to `pcoa`, the other provided options are `tsne` and `mmds`, which operate on the same JS-divergence distance matrix. Both `tsne` and `mmds` require that you have sklearn installed. Here is `tsne` in action:

# In[14]:

pyLDAvis.prepare(mds='tsne', **movies_model_data)

# Here is `mmds` in action:

# In[15]:

pyLDAvis.prepare(mds='mmds', **movies_model_data)

# ## Making the common case easy - Gensim and others!
#
# Built on top of the generic `prepare` function are helper functions for [gensim](https://radimrehurek.com/gensim/), [scikit-learn](http://scikit-learn.org/stable/), and [GraphLab Create](https://dato.com/products/create/).
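# These helpers read everything `prepare` needs directly off the library's model, corpus, and dictionary objects. For reference, the bag-of-words corpus format gensim uses is simple; this stdlib-only sketch roughly mirrors what gensim's `Dictionary` and `doc2bow` produce (the tokens and the resulting ids are illustrative):

```python
from collections import Counter

docs = [['film', 'plot', 'film'],
        ['actor', 'scene', 'actor', 'film']]

# Assign an integer id to each unique token, in order of first appearance
# (what gensim's Dictionary does).
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Each document becomes a list of (token_id, count) pairs (like doc2bow).
corpus = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]
print(corpus)
# [[(0, 2), (1, 1)], [(0, 1), (2, 2), (3, 1)]]
```

# With gensim itself you would never build this by hand; the point is only that a "corpus" here is just per-document token counts keyed by dictionary ids.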
# To demonstrate, below I am loading up a trained gensim model and its corresponding dictionary and corpus (see [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb) for how these were created):

# In[16]:

import gensim

dictionary = gensim.corpora.Dictionary.load('newsgroups.dict')
corpus = gensim.corpora.MmCorpus('newsgroups.mm')
lda = gensim.models.ldamodel.LdaModel.load('newsgroups_50.model')

# In the dark ages, in order to inspect our topics, all we had was `show_topics` and friends:

# In[17]:

lda.show_topics()

# Thankfully, in addition to these *still helpful functions*, we can get a feel for all of the topics with this one-liner:

# In[18]:

import pyLDAvis.gensim_models as gensimvis

gensimvis.prepare(lda, corpus, dictionary)

# ## sklearn
#
# For examples of how to use scikit-learn's topic models with pyLDAvis, see [this notebook](http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/LDA%20model.ipynb).

# ## GraphLab
#
# For GraphLab integration, check out [this notebook](http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/GraphLab.ipynb#topic=7&lambda=0.41&term=).

# ## Go forth and visualize!
#
# What are you waiting for? Go ahead and `pip install pyldavis`.