
Topic Modeling of Twitter Followers

This Python 2 notebook is a companion to the blog post Segmentation of Twitter Timelines via Topic Modeling. In that post we explore a corpus of Twitter timelines belonging to the followers of the @alexip account, and compare the results obtained through Latent Semantic Analysis (LSA) with those obtained through Latent Dirichlet Allocation (LDA). Below are the results for LDA on a set of 245 timelines.

Some of the best topics are:

T1 Software Development
T2 Data Science
T3 Conference in London (open for interpretation)
T4 Fantasy Football (mixed with international events)
T6 RSS feeds
T8 PMP and Project Management
T19 Martha's Vineyard
T31 Fenway
T33 Addiction and drugs

etc.

In [6]:

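from gensim import corpora, models
import pyLDAvis.gensim

# Load the bag-of-words corpus (Matrix Market format) and its dictionary
corpus = corpora.MmCorpus('data/alexip_followers_v3.mm')
dictionary = corpora.Dictionary.load('data/alexip_followers_v3.dict')

# Load the previously trained LDA model
lda = models.LdaModel.load('data/alexip_followers_v3_t40_p200_a001.lda')

# Prepare the visualization data and render it inline in the notebook
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)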

0.025

Out[6]:

For best results, set the $\lambda$ parameter between 0.5 and 0.6. Lowering $\lambda$ increases the relative importance of words that are specific to a given topic, as opposed to words that are simply frequent across the whole corpus.
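
To make the effect of $\lambda$ more concrete, here is a minimal sketch of the relevance score defined in the LDAvis paper (Sievert and Shirley, 2014): relevance = $\lambda \log p(w|t) + (1-\lambda) \log \frac{p(w|t)}{p(w)}$. The probabilities below are hypothetical values chosen for illustration; they are not taken from the model loaded above, and the helper names `relevance` and `rank` are ours, not part of pyLDAvis.

```python
import math


def relevance(p_w_given_t, p_w, lam):
    """Relevance of a word for a topic (Sievert and Shirley, 2014).

    lam = 1 ranks words by their probability within the topic;
    lam = 0 ranks them by lift, p(w | t) / p(w).
    """
    return lam * math.log(p_w_given_t) + (1.0 - lam) * math.log(p_w_given_t / p_w)


# Hypothetical probabilities, for illustration only.
# word: (p(w | topic), p(w) over the whole corpus)
toy_words = {
    "the": (0.050, 0.0500),      # frequent everywhere
    "python": (0.020, 0.0020),   # fairly frequent in the topic, rare overall
    "gensim": (0.002, 0.00002),  # rare, but almost exclusive to the topic
}


def rank(words, lam):
    """Return the words sorted from most to least relevant for a given lambda."""
    return sorted(
        words,
        key=lambda w: relevance(words[w][0], words[w][1], lam),
        reverse=True,
    )


for lam in (1.0, 0.6, 0.0):
    print("lambda=%.1f -> %s" % (lam, rank(toy_words, lam)))
```

With these toy numbers, $\lambda = 1$ puts the generic word first, $\lambda = 0$ puts the rarest topic-exclusive word first, and $\lambda = 0.6$ favors the word that is both reasonably frequent in the topic and uncommon elsewhere, which is the kind of balance the recommended range aims for.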

We use the amazing LDAvis package (through its Python port, pyLDAvis) for this visualization. LDA was carried out with the Gensim package. The data is available as a 3 MB gzipped JSON file.