This Python 2 notebook is a companion to the blog post "Segmentation of Twitter Timelines via Topic Modeling", in which we explore a corpus of Twitter timelines belonging to the followers of the @alexip account and compare the results obtained with Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Below are the LDA results for a set of 245 timelines.
Some of the best topics are:
etc ...
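# Gensim handles the corpus and the model; pyLDAvis builds the visualization
from gensim import corpora, models
import pyLDAvis.gensim
# Load the bag-of-words corpus, the dictionary, and the trained LDA model
corpus = corpora.MmCorpus('data/alexip_followers_v3.mm')
dictionary = corpora.Dictionary.load('data/alexip_followers_v3.dict')
lda = models.LdaModel.load('data/alexip_followers_v3_t40_p200_a001.lda')
# Prepare the visualization data and render it inline in the notebook
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)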
0.025
For best results, set the $\lambda$ parameter between 0.5 and 0.6. Lowering $\lambda$ gives more weight to words that are specific to a given topic rather than frequent across the whole corpus.
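To make the effect of $\lambda$ concrete, here is a minimal sketch of the relevance score that LDAvis uses to rank the terms of a topic: $\lambda \log p(w|t) + (1 - \lambda) \log \frac{p(w|t)}{p(w)}$. The two terms and their probabilities below are made-up toy values, not taken from the follower corpus, and the helper names `relevance` and `rank_terms` are hypothetical.

```python
import math


def relevance(p_w_given_t, p_w, lam):
    """LDAvis-style relevance of a term to a topic for a given lambda."""
    lift = p_w_given_t / p_w
    return lam * math.log(p_w_given_t) + (1.0 - lam) * math.log(lift)


# Toy values (assumed for illustration): term -> (p(w | topic), p(w) in corpus)
toy_terms = {
    "data": (0.05, 0.04),     # frequent in the topic, but frequent everywhere
    "gensim": (0.02, 0.002),  # rarer in the topic, but very specific to it
}


def rank_terms(lam):
    """Return the toy terms sorted from most to least relevant."""
    def score(term):
        p_w_given_t, p_w = toy_terms[term]
        return relevance(p_w_given_t, p_w, lam)
    return sorted(toy_terms, key=score, reverse=True)


for lam in (1.0, 0.6):
    print("lambda = %.1f: %s" % (lam, rank_terms(lam)))
```

With $\lambda = 1$ the ranking depends only on how frequent a term is within the topic, so the generic term comes first. At $\lambda = 0.6$ the topic-specific term overtakes it.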
We use the amazing LDAvis package for this visualization. LDA was carried out with the Gensim package. The data is available as a 3 MB gzipped JSON file.
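A gzipped JSON file can be read with the standard library alone. The sketch below is only an illustration: the file name `timelines_demo.json.gz` and the record layout (`user`, `tweets`) are assumptions, not the actual name or schema of the published data file, so the example writes a tiny file of its own and reads it back.

```python
import gzip
import json
import os
import tempfile


def load_gzipped_json(path):
    """Read a gzip-compressed JSON file and return the decoded object."""
    with gzip.open(path, "rb") as handle:
        return json.loads(handle.read().decode("utf-8"))


# Hypothetical records; the real file may be structured differently.
sample = [{"user": "follower_1", "tweets": ["first tweet", "second tweet"]}]

# Write a small demo file so the example runs without the real data.
demo_path = os.path.join(tempfile.mkdtemp(), "timelines_demo.json.gz")
with gzip.open(demo_path, "wb") as handle:
    handle.write(json.dumps(sample).encode("utf-8"))

loaded = load_gzipped_json(demo_path)
print(len(loaded))
```

To read the real file, pass its local path to `load_gzipped_json` and inspect the returned object before relying on any field names.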