This notebook is associated with this article on my blog.
We use pyLDAvis to visualize several LDA models of the followers of the @alexip account.
The different LDA models were trained with the parameters indicated in each section below (number of topics, number of passes, alpha).
Extraction of the data from Twitter was done via this Python 2 script, and the dictionary and corpus were created via this one.
To see the best results, set lambda in the [0.5, 0.6] range. Lowering lambda gives more weight to words that are discriminative for the active topic, i.e. the words that best define it.
You can skip the first two models and jump straight to the last one, which is the best (40 topics).
A working version of this notebook is available on nbviewer
# Load the corpus and dictionary
from gensim import corpora, models
import pyLDAvis.gensim
corpus = corpora.MmCorpus('data/alexip_followers_py27.mm')
dictionary = corpora.Dictionary.load('data/alexip_followers_py27.dict')
# First LDA model with 10 topics, 10 passes, alpha = 0.001
lda = models.LdaModel.load('data/alexip_followers_py27_t10_p10_a001_b01.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)
With K=10 topics, nearly all the topics overlap and are difficult to distinguish, and even the singled-out topics (4 and 9) are not very cohesive: topic 4, for instance, mixes bitcoin, Ruby/Rails, and London together. In the following example, we set K to 50 topics and increase the number of passes from 10 to 50.
lda = models.LdaModel.load('data/alexip_followers_py27_t50_p50_a001.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)
The topic spread is much better, and the topics are mostly coherent.
The final model has K=40 topics and was allowed to converge further, with 100 passes.
lda = models.LdaModel.load('data/alexip_followers_py27_t40_p100_a001.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)
- Topic 1 is about working life in general: work, love, well, time, people, email
- Topic 2 is about the social tech world in Boston, with discriminative words such as sdlc, edcampldr, ... and the big social-network company names
- Topic 3 is more about brands and shopping: personal, things, brand, shop, love, ...
- Topic 4 is about social media marketing: marketing, hubspot, customers, ... (with some leakage from project management: pmp, exam)
- Topic 5 is a mess and shows French words that leaked through our language filtering
- Topic 6 is about bitcoin
- Topic 7 is about data science: rstats, bigdata, machinelearning, python, dataviz, ...
- Topic 8 is about Ruby on Rails and hosting; NewtonMA is also relevant, as the Boston Ruby meetup is held in Newton, MA
- Topic 9 is about casinos and games
- Topic 10 could be about learning analytics (a less cohesive topic)
- Topic 13 is about python, pydata, conda, ... (with a bit of Angular mixed in)
- etc.
It appears that this last model, with K=40 topics and 100 passes, is the best so far: the top 10 topics are relevant and cohesive.
lda.show_topics()
[u'0.055*app + 0.045*team + 0.043*contact + 0.043*idea + 0.029*quote + 0.022*free + 0.020*development + 0.019*looking + 0.017*startup + 0.017*build',
 u'0.033*socialmedia + 0.022*python + 0.015*collaborative + 0.014*economy + 0.010*apple + 0.007*conda + 0.007*pydata + 0.007*talk + 0.007*check + 0.006*anaconda',
 u'0.053*week + 0.041*followers + 0.033*community + 0.030*insight + 0.010*follow + 0.007*world + 0.007*stats + 0.007*sharing + 0.006*unfollowers + 0.006*blog',
 u'0.014*thx + 0.010*event + 0.008*app + 0.007*travel + 0.006*social + 0.006*check + 0.006*marketing + 0.005*follow + 0.005*also + 0.005*time',
 u'0.044*docker + 0.036*prodmgmt + 0.029*product + 0.018*productmanagement + 0.017*programming + 0.012*tipoftheday + 0.010*security + 0.009*javascript + 0.009*manager + 0.009*containers',
 u'0.089*love + 0.035*john + 0.026*update + 0.022*heart + 0.015*peace + 0.014*beautiful + 0.012*beauty + 0.010*life + 0.010*shanti + 0.009*stories',
 u'0.033*geek + 0.009*architecture + 0.007*code + 0.007*products + 0.007*parts + 0.007*charts + 0.007*software + 0.006*cryptrader + 0.006*moombo + 0.006*book',
 u'0.049*stories + 0.046*network + 0.044*virginia + 0.044*entrepreneur + 0.039*etmchat + 0.025*etmooc + 0.021*etm + 0.015*join + 0.014*deis + 0.010*today',
 u'0.056*slots + 0.053*bonus + 0.052*fsiug + 0.039*casino + 0.031*slot + 0.024*online + 0.014*free + 0.013*hootchat + 0.010*win + 0.009*bonuses',
 u'0.056*video + 0.043*add + 0.042*message + 0.032*blog + 0.027*posts + 0.027*media + 0.025*training + 0.017*check + 0.013*gotta + 0.010*insider']
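The raw `show_topics()` strings above are hard to scan. A small helper can split each one into (weight, word) pairs, assuming the `weight*word + weight*word + ...` format shown above (newer gensim versions quote the words, which would need a slightly different split):

```python
def parse_topic(topic_string):
    """Split gensim's 'weight*word + weight*word + ...' string
    into a list of (weight, word) pairs."""
    pairs = []
    for term in topic_string.split(' + '):
        weight, word = term.split('*')
        pairs.append((float(weight), word))
    return pairs

parse_topic('0.055*app + 0.045*team + 0.043*contact')
# -> [(0.055, 'app'), (0.045, 'team'), (0.043, 'contact')]
```

Applied to each element of the list above, this makes it easy to tabulate or plot the top words per topic outside of pyLDAvis.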