In [1]:
from lda2vec import preprocess, Corpus
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline

You must be using a very recent version of pyLDAvis to use the lda2vec outputs. As of this writing, anything past Jan 6, 2016 (commit 14e7b5f60d8360eb84969ff08a1b77b365a5878e or later) should work. You can get it quickly by installing directly from master like so:

In [2]:
# pip install -U git+https://github.com/bmabey/pyLDAvis.git#egg=pyLDAvis
In [2]:
import pyLDAvis

Reading in the saved model story topics

After running the script in the examples/hacker_news/lda2vec directory, a file topics.story.pyldavis.npz will be created that contains the topic-to-word probabilities and frequencies. What's left is to visualize and label each topic from its prevalent words.

In [3]:
npz = np.load(open('topics.story.pyldavis.npz', 'rb'))
dat = {k: v for (k, v) in npz.iteritems()}
dat['vocab'] = dat['vocab'].tolist()
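The later cells assume the .npz archive holds five arrays: `topic_term_dists`, `doc_topic_dists`, `doc_lengths`, `vocab`, and `term_frequency`. A minimal sketch with synthetic stand-in arrays (the shapes and values here are illustrative, not taken from the real model) showing the structure pyLDAvis expects, in particular that each row of the two distribution matrices is a probability distribution:

```python
import numpy as np

# Synthetic stand-ins for the arrays stored in topics.story.pyldavis.npz
# (illustrative shapes: 3 topics, 5 vocabulary terms, 4 documents).
n_topics, n_terms, n_docs = 3, 5, 4
rng = np.random.RandomState(0)

topic_term_dists = rng.rand(n_topics, n_terms)
topic_term_dists /= topic_term_dists.sum(axis=1, keepdims=True)  # rows sum to 1

doc_topic_dists = rng.rand(n_docs, n_topics)
doc_topic_dists /= doc_topic_dists.sum(axis=1, keepdims=True)

doc_lengths = rng.randint(50, 500, size=n_docs)
term_frequency = rng.randint(1, 100, size=n_terms)
vocab = np.array(['alpha', 'beta', 'gamma', 'delta', 'epsilon'])

# pyLDAvis.prepare validates that each row of these matrices sums to 1.
assert np.allclose(topic_term_dists.sum(axis=1), 1.0)
assert np.allclose(doc_topic_dists.sum(axis=1), 1.0)
```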
In [4]:
top_n = 10
topic_to_topwords = {}
for j, topic_to_word in enumerate(dat['topic_term_dists']):
    top = np.argsort(topic_to_word)[::-1][:top_n]
    msg = 'Topic %i ' % (j + 1)
    top_words = [dat['vocab'][i].strip()[:35] for i in top]
    msg += ' '.join(top_words)
    print msg
    topic_to_topwords[j] = top_words
Topic 1 rent control gentrification basic income more housing home ownership housing affordable housing gentrifying housing prices rents
Topic 2 trackpoint xmonad mbp. macports thinkpad mbp sizeup out_of_vocabulary crashplan mechanical keyboard
Topic 3 algebra calculus ebonics adhd reading speed math education meditations new words common core math classes
Topic 4 cree top gear charging stations model s b&n 1gbps mattresses at&t broder starz
Topic 5 google+ bing default search engine ddg g+ igoogle !g g+. google+. google reader
Topic 6 cyclists f-35 tesla's hyperloop cyclist electric cars nest protect pedestrians autonomous cars fuel costs
Topic 7 ender dawkins asperger ramanujan atheists savages gladwell isaacson alan turing psychopathy
Topic 8 bitcoins bitcoin btc bitcoin price mtgox bitcoin economy btc. index funds liquidity bitcoin exchanges
Topic 9 college education mba program idea guys business degree college dropouts gpa graduates higher education rock star grad schools
Topic 10 morning person melatonin cardio naps adderall sleep schedule caffeine pullups weight training little sleep
Topic 11 first language sicp. sicp ror. object orientation cormen category theory the good parts htdp learn you a
Topic 12 current salary hiring managers hiring manager technical interviews performance reviews 60+ hours interviewing interviewer interviewers recruiter
Topic 13 helmet cardio carbs fasting diet lasik biking soylent vitamin d veggies
Topic 14 horvath ortiz eich eich's swartz adria adria richards whistleblower kerr edward snowden
Topic 15 2fa gpg fastmail factor authentication abp lastpass factor auth https encrypt pgp
Topic 16 tau quantum effects neutrinos qm asimov particles galaxies consciousness particle cosine
Topic 17 asian parents grades ap courses gpa grade inflation college experience good grades khan majoring hs
Topic 18 factor authentication fbi icann search warrant tor encrypting passwords privacy rights encrypt us jurisdiction
Topic 19 apple pay apple music whatsapp at&t ad blockers moto g patreon fire phone google play music prime video
Topic 20 slicehost yes<p>willing seeking freelancer - remote request<p>email work - remote remote<p>i yes<p>technologies remote<p>i'm no<p>technologies work - remote<p>i
Topic 21 chargify padmapper spreedly godaddy merchant account namecheap recurly paypal free users cc details
Topic 22 monotouch wp7 .net. bizspark .net stack .net webos microsoft stack 3.3.1 tizen
Topic 23 <SKIP> nim go&#x27;s raii kotlin callbacks haskell&#x27;s generics nimrod ints
Topic 24 rel="nofollow">http:&#x2f;&#x2f;tur swiftstack relocation great communication skills esoterics small engineering team tight-knit team top-floor office backend engineers ca. full
Topic 25 current salary hourly rates equity elance hourly rate odesk freelancing living expenses invoice exploding offer
Topic 26 verdana semicolons rubular textarea whitespace selectors indent .vimrc indentation bookmarklet
Topic 27 holacracy apprenticeships common core phd&#x27;s &quot;cultural fit&quot vc&#x27;s basic income &quot;culture&quot open offices &quot;women
Topic 28 morning person shyness depression little sleep therapist naps work/life balance burnout introverts workaholism
Topic 29 prismatic new page karma high karma magcloud submit button karma system clickpass cornify average karma
Topic 30 start menu xfce kde 11.04 unity gnome 8.1 direct3d dual boot wayland
Topic 31 checked exceptions try/catch list comprehensions <SKIP> callbacks orm dependency injection static typing stm function pointers
Topic 32 great communication skills top-floor office swiftstack small engineering team backend engineers rel="nofollow">http:&#x2f;&#x2f;tur own trading strategies ca. full offer:<p><pre><code rel="nofollow">http:&#x2f;&#x2f;www
Topic 33 nuclear power thorium fossil fuels nuclear plants uranium gdp nuclear waste economic growth fiat currency climate change
Topic 34 <SKIP> es6 react&#x27;s ast react components callbacks javascript&#x27;s mixins go&#x27;s browserify
Topic 35 offer:<p><pre><code ca. full swiftstack ca<p>==========<p>aclima backend engineers laundry delivery service team.<p>we great communication skills top-floor office small engineering team
Topic 36 rim elop plurk zynga pincus patent system crunchpad software patents nortel patents htc
Topic 37 apple watch the surface pro 16:9 hdmi mac pro winamp good battery life upgradable big iphone steam box
Topic 38 snowden real terrorists nsa&#x27;s terrorism whistleblower edward snowden assange terrorists 9&#x2f;11 &quot;war
Topic 39 consolas st2 inconsolata .vimrc vim zsh vim bindings iterm2 arrow keys svg
Topic 40 cloudfront docker dockerfile docker container graphql gitlab docker containers coreos dokku gogs

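The `np.argsort(topic_to_word)[::-1][:top_n]` idiom in the cell above selects the indices of the `top_n` largest probabilities: argsort gives ascending order, `[::-1]` reverses it, and the slice keeps the head. A quick sanity check on a toy vector:

```python
import numpy as np

probs = np.array([0.1, 0.5, 0.05, 0.3, 0.05])
top = np.argsort(probs)[::-1][:3]  # indices sorted by descending probability
print(top.tolist())  # [1, 3, 0]
```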
Visualize story topics

In [5]:
import warnings
prepared_data_story = pyLDAvis.prepare(dat['topic_term_dists'], dat['doc_topic_dists'], 
                                       dat['doc_lengths'] * 1.0, dat['vocab'], dat['term_frequency'] * 1.0, sort_topics=False)
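The `* 1.0` multiplications above cast the integer count arrays to floats before they reach `pyLDAvis.prepare`, which works on floating-point inputs. A minimal illustration of the cast (the sample values are made up):

```python
import numpy as np

doc_lengths = np.array([120, 85, 240])  # integer counts, like those in the .npz
print(doc_lengths.dtype)                # a platform-dependent integer dtype
print((doc_lengths * 1.0).dtype)        # float64 after multiplying by 1.0
```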
In [6]: