#!/usr/bin/env python
# coding: utf-8

# # Topic Modeling of Twitter Followers
#
# This notebook accompanies [this article on my blog](http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html).
#
# We use pyLDAvis to visualize several LDA models of the followers of the [@alexip](https://twitter.com/alexip) account.
#
# The different LDA models were trained with the following parameters:
#
# * 10 topics, 10 passes, alpha = 0.001
# * 50 topics, 50 passes, alpha = 0.01
# * 40 topics, 100 passes, alpha = 0.001
#
# The data was extracted from Twitter via [this Python 2 script](https://github.com/alexperrier/datatalks/tree/master/twitter),
# and the dictionary and corpus were created via [this one](https://github.com/alexperrier/datatalks/tree/master/twitter).
#
# To see the best results, set lambda around [0.5, 0.6]. Lowering lambda gives more weight to words that are discriminative for the active topic, i.e. the words that best define the topic.
#
# You can skip the first two models and jump to the last one, which is the best so far (40 topics).
#
# A working version of this notebook is available on [nbviewer](http://nbviewer.ipython.org/github/alexperrier/datatalks/blob/master/twitter/LDAvis.ipynb).

# In[1]:

# Load the corpus and dictionary
from gensim import corpora, models
import pyLDAvis.gensim

corpus = corpora.MmCorpus('data/alexip_followers_py27.mm')
dictionary = corpora.Dictionary.load('data/alexip_followers_py27.dict')

# In[2]:

# First LDA model: 10 topics, 10 passes, alpha = 0.001
lda = models.LdaModel.load('data/alexip_followers_py27_t10_p10_a001_b01.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)

# With K=10 topics, nearly all the topics are clumped together and difficult to distinguish,
# and even the singled-out topics [4, 9] are not very cohesive. Topic 4, for instance, mixes
# bitcoin, ruby/rails, and London.
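# pyLDAvis ranks the terms displayed for a topic by the relevance score relevance(w, t) = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), which is what the lambda slider controls. As a minimal sketch of that computation (the probabilities below are made up for illustration, not taken from the actual model):

```python
import math

def relevance(p_w_given_t, p_w, lam):
    # lam = 1.0 ranks terms purely by their in-topic probability p(w|t);
    # lower lam boosts terms that are rare overall but frequent in the topic.
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# a generic term: frequent in the topic but also frequent overall
common = relevance(0.030, 0.025, 0.5)
# a discriminative term: rarer overall, concentrated in this topic
niche = relevance(0.020, 0.002, 0.5)
print(niche > common)  # at lam = 0.5 the discriminative term ranks higher
```

# This is why lowering lambda toward 0.5 surfaces the words that best define a topic rather than the globally frequent ones.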
# In the following model, we set K to 50 topics and increase the number of passes from 10 to 50.

# In[3]:

lda = models.LdaModel.load('data/alexip_followers_py27_t50_p50_a001.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)

# The topic spread is much better, and the topics are mostly coherent:
#
# * 1 is about the social web
# * 2 is more colloquial, about daily life, with words like happy, lol, guys, sure, ever, day, ...
# * 12 is all about data science, with words like dataviz, spark, drivendata, ipython, machinelearning, ...
# * 5 is about bitcoin and cryptocurrencies
# * 17 most probably comes from French followers that slipped through our filtering of non-English accounts
# * 29 is about jobs in the UK
#
# The final model has more topics (40) and was allowed to converge further, with 100 passes.

# In[4]:

lda = models.LdaModel.load('data/alexip_followers_py27_t40_p100_a001.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)

# * 1 is about working life in general: work, love, well, time, people, email
# * 2 is about the social tech world in Boston, with discriminative words like sdlc, edcampldr, ... and the big social-network company names
# * 3 is more about brands and shopping: personal, things, brand, shop, love
# * 4 is about social media marketing: marketing, hubspot, customers, ... (with some leakage from project management: pmp, exam)
# * 5 is a mess and shows French words that leaked through our language filtering
# * 6 is about bitcoin
# * 7 is about data science: rstats, bigdata, machinelearning, python, dataviz, ...
# * 8 is about rails, ruby, and hosting. NewtonMA is also relevant, as the Boston Ruby meetup is in Newton, MA
# * 9 is about casinos and games
# * 10 could be about learning analytics (a less cohesive topic)
#
# * 13 is about python, pydata, conda, ... (with a bit of angular mixed in)
#
# etc.
# It appears that the last model, with K=40 topics and 100 passes, is the best so far.
# The top 10 topics are relevant and cohesive.

# In[6]:

lda.show_topics()

# In[ ]: