Note: This is best viewed on NBViewer. It is part of a series on Dirichlet Processes and Nonparametric Bayes.

Nonparametric Latent Dirichlet Allocation

Analysis of the topics of Econtalk

In 2003, a groundbreaking statistical model called "Latent Dirichlet Allocation" was presented by David Blei, Andrew Ng, and Michael Jordan.

LDA provides a method for summarizing the topics discussed in a document. LDA defines topics to be discrete probability distributions over words. For an introduction to LDA, see Edwin Chen's post.
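Concretely, a topic is nothing more than an assignment of probabilities to vocabulary words. Here is a toy sketch (the words and weights are invented for illustration, not taken from any fitted model):

```python
# A toy "topic" as a discrete probability distribution over words.
# These words and probabilities are made up purely for illustration.
monetary_topic = {
    "fed": 0.30,
    "inflation": 0.25,
    "banks": 0.20,
    "money": 0.15,
    "rates": 0.10,
}

# the weights of a probability distribution must sum to one
assert abs(sum(monetary_topic.values()) - 1.0) < 1e-9

# the most probable word under this topic
print(max(monetary_topic, key=monetary_topic.get))  # -> fed
```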

The original LDA model requires the number of topics in the document to be specified as a known parameter of the model. In 2005, Yee Whye Teh and others published a "nonparametric" version of this model that doesn't require the number of topics to be specified. This model uses a prior distribution over the topics called a hierarchical Dirichlet process. I wrote an introduction to this HDP-LDA model earlier this year.

For the last six months, I have been developing a Python-based Gibbs sampler for the HDP-LDA model. This is part of a larger library of "robust, validated Bayesian nonparametric models for discovering structure in data" known as Data Microscopes.

This notebook demonstrates the functionality of this implementation.

The Data Microscopes library is available for Linux and OS X. microscopes-lda can be installed with:

$ conda install -c datamicroscopes -c distributions microscopes-lda 
In [1]:
%matplotlib inline
import pyLDAvis
import json
import sys
import cPickle

from microscopes.common.rng import rng
from microscopes.lda.definition import model_definition
from microscopes.lda.model import initialize
from microscopes.lda import utils
from microscopes.lda import model, runner

from numpy import genfromtxt 
from numpy import linalg
from numpy import array

dtm.csv contains a document-term matrix representation of the words used in Econtalk transcripts. The columns of the matrix correspond to the words in vocab.txt. The rows in the matrix correspond to the show urls in urls.txt.

Our LDA implementation takes input data as a list of lists of hashable objects (typically words). We can use a utility function to convert the document-term matrix to the list of tokenized documents.
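The conversion itself is straightforward: each row of the matrix expands into a bag of tokens, with each word repeated according to its count. Here's a minimal sketch of the transformation (this `docs_from_dtm` helper is my own illustration, not the library's implementation):

```python
def docs_from_dtm(dtm, vocab):
    """Expand a document-term matrix into lists of tokens:
    word j appears dtm[i][j] times in document i."""
    docs = []
    for row in dtm:
        doc = []
        for word, count in zip(vocab, row):
            doc.extend([word] * count)
        docs.append(doc)
    return docs

# tiny example: 2 documents over a 3-word vocabulary
vocab = ["econ", "talk", "market"]
dtm = [[2, 0, 1],
       [0, 1, 1]]
print(docs_from_dtm(dtm, vocab))
# -> [['econ', 'econ', 'market'], ['talk', 'market']]
```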

In [2]:
vocab = genfromtxt('./econtalk-data/vocab.txt', delimiter=",", dtype='str').tolist()
dtm = genfromtxt('./econtalk-data/dtm.csv', delimiter=",", dtype=int)
docs = utils.docs_from_document_term_matrix(dtm, vocab=vocab)
urls = [s.strip() for s in open('./econtalk-data/urls.txt').readlines()]
In [3]:
dtm.shape[1] == len(vocab)
In [4]:
dtm.shape[0] == len(urls)

Here's a utility method to get the title of a webpage that we'll use later.

In [5]:
def get_title(url):
    """Scrape webpage title."""
    import lxml.html
    t = lxml.html.parse(url)
    return t.find(".//title").text.split("|")[0].strip()

Let's set up our model. First, we create a model definition describing the basic structure of our data. Next, we initialize an MCMC state object using the model definition, documents, random number generator, and hyperparameters.

In [8]:
N, V = len(docs), len(vocab)
defn = model_definition(N, V)
prng = rng(12345)
state = initialize(defn, docs, prng,
                        dish_hps={"alpha": .6, "gamma": 2})
r = runner.runner(defn, docs, state)

When we first create a state object, the words are randomly assigned to topics. Thus, our perplexity (model score) is quite high. After we start to run the MCMC, the score will drop quickly.
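For reference, perplexity is conventionally defined as the exponential of the negative average per-word log-likelihood, so lower is better. (I'm assuming `state.perplexity()` follows this convention; the helper below is the textbook formula, not the library's code.) A sanity check: a model that assigns every word uniform probability 1/V has perplexity exactly V, which is why the randomly initialized perplexity reported below is close to the vocabulary size.

```python
import math

def perplexity(total_log_likelihood, n_words):
    """Textbook perplexity: the exponential of the negative
    average per-word log-likelihood (lower is better)."""
    return math.exp(-total_log_likelihood / float(n_words))

# A uniform model over a V-word vocabulary has perplexity V.
V = 16445
n_words = 100
uniform_loglik = n_words * math.log(1.0 / V)
print(perplexity(uniform_loglik, n_words))  # approximately 16445
```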

In [9]:
print "randomly initialized model:"
print " number of documents", defn.n
print " vocabulary size", defn.v
print " perplexity:", state.perplexity(), "num topics:", state.ntopics()
randomly initialized model:
 number of documents 454
 vocabulary size 16445
 perplexity: 16523.1820356 num topics: 9

Run one iteration of the MCMC to make sure everything is working.

In [17]:
%%time
r.run(prng, 1)
CPU times: user 12.3 s, sys: 128 ms, total: 12.4 s
Wall time: 12.6 s

Now let's run 1000 generations of the MCMC.

Unfortunately, MCMC is slow going.

In [15]:
%%time
r.run(prng, 500)
CPU times: user 2h 9min 8s, sys: 26.8 s, total: 2h 9min 35s
Wall time: 2h 10min 12s
In [16]:
with open('./econtalk-data/2015-10-07-state.pkl', 'w') as f:
    cPickle.dump(state, f)
In [17]:
%%time
r.run(prng, 500)
CPU times: user 2h 17min 35s, sys: 30 s, total: 2h 18min 6s
Wall time: 2h 18min 50s
In [18]:
with open('./econtalk-data/2015-10-07-state.pkl', 'w') as f:
    cPickle.dump(state, f)

Now that we've run the MCMC, the perplexity has dropped significantly.

In [19]:
print "after 1000 iterations:"
print " perplexity:", state.perplexity(), "num topics:", state.ntopics()
after 1000 iterations:
 perplexity: 2363.65138771 num topics: 11

pyLDAvis projects the topics into two dimensions using techniques described by Carson Sievert.

In [21]:
vis = pyLDAvis.prepare(**state.pyldavis_data())

We can extract the term relevance (shown in the right hand side of the visualization) right from our state object. Here are the 10 most relevant words for each topic:

In [23]:
relevance = state.term_relevance_by_topic()
for i, topic in enumerate(relevance):
    print "topic", i, ":",
    for term, _ in topic[:10]:
        print term, 
topic 0 : banks fed bank money financial monetary debt inflation crisis rates
topic 1 : party republicans constitution vote democrats republican tax election president stalin
topic 2 : fat science diet eat insulin disease immune replication scientific eating
topic 3 : growth trade water cities china city development climate inequality oil
topic 4 : people think don just going like say things lot way
topic 5 : smith hayek moral economics society adam liberty coase theory rules
topic 6 : bitcoin internet software google technology store bitcoins computer machines company
topic 7 : prison health drug care drugs medicaid medical patients patient women
topic 8 : schools teachers school kids teacher education students parents teaching sports
topic 9 : bees honey pollination colony ants bee queen cheung ant colonies
topic 10 : museum museums art gallery galleries monet seating trustees admission director

We could assign titles to each of these topics. For example, Topic 5 appears to be about the foundations of classical liberalism, Topic 6 is obviously Bitcoin and software, and Topic 0 is the financial system and monetary policy. Topic 4 seems to be generic words used in most episodes; unfortunately, the prevalence of "don" is an artifact of my preprocessing, which splits up the contraction "don't".

We can also get the topic distributions for each document.

In [24]:
topic_distributions = state.topic_distribution_by_document()
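Each entry is a vector of mixture weights, one per topic, that sums to one; a document's dominant topic is just the argmax. A sketch with a made-up distribution (the weights below are illustrative, not actual model output):

```python
# A made-up per-document topic distribution for illustration
# (11 topics, weights summing to one).
dist = [0.02, 0.01, 0.01, 0.03, 0.40, 0.35, 0.05, 0.04, 0.03, 0.03, 0.03]
assert abs(sum(dist) - 1.0) < 1e-9

# the dominant topic is the index of the largest weight
dominant = max(range(len(dist)), key=lambda i: dist[i])
print(dominant)  # -> 4
```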

Topic 5 appears to be about the theory of classical liberalism. Let's find the 20 episodes which have the highest proportion of words from that topic.

In [25]:
austrian_topic = 5
foundations_episodes = sorted([(dist[austrian_topic], url) for url, dist in zip(urls, topic_distributions)], reverse=True)
for url in [url for _, url in foundations_episodes][:20]:
    print get_title(url), url
The Economics of Organ Donations
Klein on The Theory of Moral Sentiments, Episode 2--A Discussion of Part I
Boudreaux on Law and Legislation
Klein on The Theory of Moral Sentiments, Episode 4--A Discussion of Part III
Klein on The Theory of Moral Sentiments, Episode 5--A Discussion of Parts III  (cont.), IV, and V
Klein on The Theory of Moral Sentiments, Episode 3--A Discussion of Part II
Klein on The Theory of Moral Sentiments, Episode 1--An Overview
Klein on The Theory of Moral Sentiments, Episode 6--A Discussion of Parts VI and VII, and Summary
Wolfe on Liberalism
Boettke on Katrina and the Economics of Disaster
Richard Thaler on Libertarian Paternalism
Marglin on Markets and Community
Dan Klein on Coordination and Cooperation
Leeson on Pirates and the Invisible Hook
Vernon Smith on Markets and Experimental Economics
Boettke on Elinor Ostrom, Vincent Ostrom, and the Bloomington School
Postrel on Style
The Economics of Religion
Boettke on Mises
The Economics of Medical Malpractice

We could also find the episodes that have notable discussion of both politics AND the financial system.

In [29]:
topic_a = 0
topic_b = 1
joint_episodes = [url for url, dist in zip(urls, topic_distributions) if dist[topic_a] > 0.18 and dist[topic_b] > 0.18]
for url in joint_episodes:
    print get_title(url), url
Benn Steil on the Battle of Bretton Woods
Epstein on the Rule of Law
Shlaes on the Great Depression
Rabushka on the Flat Tax
Greg Mankiw on Gasoline Taxes, Keynes and Macroeconomics

We can view the topic distributions as projections of the documents into a much lower-dimensional space (one dimension per topic). We can try to find similar shows by comparing the topic distributions of the documents.

In [30]:
def find_similar(url, max_distance=0.2):
    """Find episodes most similar to the input url."""
    index = urls.index(url)
    for td, u in zip(topic_distributions, urls):
        if linalg.norm(array(topic_distributions[index]) - array(td)) < max_distance:
            print get_title(u), u

Which Econtalk episodes are most similar, in content, to "Mike Munger on the Division of Labor"?

In [31]:
Brink Lindsey on the Age of Abundance
Richard Epstein on Happiness, Inequality, and Envy
Sowell on Economic Facts and Fallacies
Yandle on the Tragedy of the Commons and the Implications for Environmental Regulation
McCraw on Schumpeter, Innovation, and Creative Destruction
Boudreaux on Market Failure, Government Failure and the Economics of Antitrust Regulation
Mike Munger on the Division of Labor

How about episodes similar to "Kling on Freddie and Fannie and the Recent History of the U.S. Housing Market"?

In [32]:
Irwin on the Great Depression and the Gold Standard
Rustici on Smoot-Hawley and the Great Depression
Reinhart on Financial Crises
Posner on the Financial Crisis
Sumner on Monetary Policy
Calomiris on the Financial Crisis
Gary Stern on Too Big to Fail
Cohan on the Life and Death of Bear Stearns
John Taylor on the Financial Crisis
Meltzer on Inflation
Boettke on the Austrian Perspective on Business Cycles and Monetary Policy
Selgin on Free Banking
Kling on Credit Default Swaps, Counterparty Risk, and the Political Economy of Financial Regulation
Kling on Freddie and Fannie and the Recent History of the U.S. Housing Market
John Taylor on Monetary Policy
Gene Epstein on Gold, the Fed, and Money
Meltzer on the Fed, Money, and Gold
Cowen on Monetary Policy

The model also gives us distributions over words for each topic.

In [33]:
word_dists = state.word_distribution_by_topic()

We can use this to find the topics a word is most likely to occur in.

In [34]:
def bars(x, scale_factor=10000):
    return int(x * scale_factor) * "="

def topics_related_to_word(word, n=10):
    for wd, rel in zip(word_dists, relevance):
        score = wd[word]
        rel_words = ' '.join([w for w, _ in rel][:n]) 
        if bars(score):
            print bars(score), rel_words

Which topics are most likely to contain the word "Munger" (as in Mike Munger)? The number of equal signs indicates the probability that the word is generated by the topic. If a topic isn't shown, it is extremely unlikely to generate the word.

In [35]:
==== growth trade water cities china city development climate inequality oil
================== smith hayek moral economics society adam liberty coase theory rules
=== bitcoin internet software google technology store bitcoins computer machines company

Where does Munger come up? In discussing the moral foundations of classical liberalism and microeconomics!

How about the word "lovely"? Russ Roberts uses it often when talking about the Theory of Moral Sentiments. It looks like it also comes up when talking about schools.

In [36]:
= smith hayek moral economics society adam liberty coase theory rules

If you have feedback on this implementation of HDP-LDA, you can reach me on Twitter or open an issue on GitHub.