Topic Modeling of Twitter Followers

This notebook accompanies this article on my blog.

We use pyLDAvis to visualize several LDA models of the followers of the @alexip account.

The different LDA models were trained with the following parameters:

  • 10 topics, 10 passes, alpha = 0.001
  • 50 topics, 50 passes, alpha = 0.01
  • 40 topics, 100 passes, alpha = 0.001

Extraction of the data from Twitter was done via this Python 2 script, and the dictionary and corpus were created via this one.

To see the best results, set lambda in the [0.5, 0.6] range. Lowering lambda gives more weight to words that are discriminative for the selected topic, i.e. the words that best define it.
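The lambda slider implements the relevance measure of Sievert & Shirley (2014), which pyLDAvis uses to rank terms within a topic. A minimal sketch of the formula (the probabilities below are made up for illustration):

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a word for a topic (Sievert & Shirley, 2014).

    lam = 1 ranks by raw topic probability p(w|t); lam = 0 ranks by lift
    p(w|t) / p(w), which favours words specific to the topic.
    """
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))

# Illustrative probabilities: a globally common word vs. a topic-specific one.
common = dict(p_word_given_topic=0.05, p_word=0.04)     # frequent everywhere
specific = dict(p_word_given_topic=0.02, p_word=0.001)  # rare outside the topic

# At lambda = 1 the common word ranks higher...
print(relevance(**common, lam=1.0) > relevance(**specific, lam=1.0))   # True
# ...but lowering lambda promotes the discriminative word.
print(relevance(**specific, lam=0.2) > relevance(**common, lam=0.2))   # True
```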

You can skip the first two models and jump straight to the last one, which gives the best results (40 topics).

A working version of this notebook is available on nbviewer.

In [1]:
# Load the corpus and dictionary
from gensim import corpora, models
import pyLDAvis.gensim

corpus = corpora.MmCorpus('data/alexip_followers_py27.mm')
dictionary = corpora.Dictionary.load('data/alexip_followers_py27.dict')
In [2]:
# First LDA model with 10 topics, 10 passes, alpha = 0.001
lda = models.LdaModel.load('data/alexip_followers_py27_t10_p10_a001_b01.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)
Out[2]:

With K=10 topics, nearly all the topics overlap and are difficult to distinguish. Even the isolated topics (4 and 9) are not very cohesive: topic 4, for instance, mixes bitcoin, ruby/rails, and London. In the following model, we set K to 50 topics and increase the number of passes from 10 to 50.

In [3]:
lda = models.LdaModel.load('data/alexip_followers_py27_t50_p50_a001.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)
Out[3]: