Topic Modeling of Twitter Followers

This notebook accompanies this article on my blog.

We use pyLDAvis to visualize several LDA models of the followers of the @alexip account.

The different LDA models were trained with the following parameters:

  • 10 topics, 10 passes, alpha = 0.001
  • 50 topics, 50 passes, alpha = 0.01
  • 40 topics, 100 passes, alpha = 0.001

Extraction of the data from Twitter was done via this Python 2 script, and the dictionary and corpus were created via this one.

To see the best results, set lambda in the [0.5, 0.6] range. Lowering lambda gives more weight to words that are discriminative for the selected topic, i.e. the words that best define it.
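The lambda slider implements the relevance measure of Sievert & Shirley (2014), which pyLDAvis uses to rank terms within a topic. A minimal sketch of the formula (the probabilities below are made up for illustration):

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a word for a topic (Sievert & Shirley, 2014).

    lam = 1 ranks by raw topic probability p(w|t); lam = 0 ranks by lift
    p(w|t) / p(w), which favours words specific to the topic.
    """
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))

# Illustrative probabilities: a globally common word vs. a topic-specific one.
common = dict(p_word_given_topic=0.05, p_word=0.04)     # frequent everywhere
specific = dict(p_word_given_topic=0.02, p_word=0.001)  # rare outside the topic

# At lambda = 1 the common word ranks higher...
print(relevance(**common, lam=1.0) > relevance(**specific, lam=1.0))   # True
# ...but lowering lambda promotes the discriminative word.
print(relevance(**specific, lam=0.2) > relevance(**common, lam=0.2))   # True
```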

You can skip the first two models and jump straight to the last one, which gives the best results (40 topics).

A working version of this notebook is available on nbviewer.

In [1]:
# Load the corpus and dictionary
from gensim import corpora, models
import pyLDAvis.gensim

corpus = corpora.MmCorpus('data/alexip_followers_py27.mm')
dictionary = corpora.Dictionary.load('data/alexip_followers_py27.dict')
In [2]:
# First LDA model with 10 topics, 10 passes, alpha = 0.001
lda = models.LdaModel.load('data/alexip_followers_py27_t10_p10_a001_b01.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)
Out[2]:

With K=10 topics, nearly all the topics overlap and are difficult to distinguish. Even the isolated topics (4 and 9) are not very cohesive: topic 4, for instance, mixes bitcoin, ruby/rails, and London. In the following model, we set K to 50 topics and increase the number of passes from 10 to 50.

In [3]:
lda = models.LdaModel.load('data/alexip_followers_py27_t50_p50_a001.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)
Out[3]: