#!/usr/bin/env python
# coding: utf-8

# # Topic Modeling of Twitter Followers
#
# This notebook accompanies [this article on my blog](http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html).
#
# We use pyLDAvis to visualize several LDA models of the followers of the [@alexip](https://twitter.com/alexip) account.
#
# The different LDA models were trained with the following parameters:
#
# * 10 topics, 10 passes, alpha = 0.001
# * 50 topics, 50 passes, alpha = 0.01
# * 40 topics, 100 passes, alpha = 0.001
#
# The data was extracted from Twitter via [this Python 2 script](https://github.com/alexperrier/datatalks/tree/master/twitter),
# and the dictionary and corpus were created via [this one](https://github.com/alexperrier/datatalks/tree/master/twitter).
#
# To see the best results, set lambda around [0.5, 0.6]. Lowering lambda gives more weight to words that are discriminative for the active topic, i.e. the words that best define the topic.
#
# You can skip the first two models and jump to the last one, which is the best so far (40 topics).
#
# A working version of this notebook is available on [nbviewer](http://nbviewer.ipython.org/github/alexperrier/datatalks/blob/master/twitter/LDAvis.ipynb).

# In[1]:

# Load the corpus and dictionary
from gensim import corpora, models
import pyLDAvis.gensim

corpus = corpora.MmCorpus('data/alexip_followers_py27.mm')
dictionary = corpora.Dictionary.load('data/alexip_followers_py27.dict')

# In[2]:

# First LDA model: 10 topics, 10 passes, alpha = 0.001
lda = models.LdaModel.load('data/alexip_followers_py27_t10_p10_a001_b01.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)

# With K=10 topics, nearly all the topics are clumped together and difficult to distinguish,
# and even the singled-out topics [4, 9] are not very cohesive. Topic 4, for instance, mixes
# bitcoin, ruby/rails, and London.
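# pyLDAvis ranks the terms displayed for a topic by the relevance score relevance(w, t) = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), which is what the lambda slider controls. As a minimal sketch of that computation (the probabilities below are made up for illustration, not taken from the actual model):

```python
import math

def relevance(p_w_given_t, p_w, lam):
    # lam = 1.0 ranks terms purely by their in-topic probability p(w|t);
    # lower lam boosts terms that are rare overall but frequent in the topic.
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# a generic term: frequent in the topic but also frequent overall
common = relevance(0.030, 0.025, 0.5)
# a discriminative term: rarer overall, concentrated in this topic
niche = relevance(0.020, 0.002, 0.5)
print(niche > common)  # at lam = 0.5 the discriminative term ranks higher
```

# This is why lowering lambda toward 0.5 surfaces the words that best define a topic rather than the globally frequent ones.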
# In the following model, we set K to 50 topics and increase the number of passes from 10 to 50.

# In[3]:

lda = models.LdaModel.load('data/alexip_followers_py27_t50_p50_a001.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)

# The topic spread is much better, and the topics are mostly coherent:
#
# * 1 is about the social web
# * 2 is more colloquial, about daily life, with words like happy, lol, guys, sure, ever, day, ...
# * 12 is all about data science, with words like dataviz, spark, drivendata, ipython, machinelearning, ...
# * 5 is about bitcoin and cryptocurrencies
# * 17 most probably comes from French followers that slipped through our filtering of non-English accounts
# * 29 is about jobs in the UK
#
# The final model has more topics (40) and was allowed to converge further, with 100 passes.

# In[4]:

lda = models.LdaModel.load('data/alexip_followers_py27_t40_p100_a001.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(followers_data)

# * 1 is about working life in general: work, love, well, time, people, email
# * 2 is about the social tech world in Boston, with discriminative words like sdlc, edcampldr, ... and the big social-network company names
# * 3 is more about brands and shopping: personal, things, brand, shop, love
# * 4 is about social media marketing: marketing, hubspot, customers, ... (with some leakage from project management: pmp, exam)
# * 5 is a mess and shows French words that leaked through our language filtering
# * 6 is about bitcoin
# * 7 is about data science: rstats, bigdata, machinelearning, python, dataviz, ...
# * 8 is about rails, ruby, and hosting. NewtonMA is also relevant, as the Boston Ruby meetup is in Newton, MA
# * 9 is about casinos and games
# * 10 could be about learning analytics (a less cohesive topic)
#
# * 13 is about python, pydata, conda, ... (with a bit of angular mixed in)
#
# etc.
# It appears that the last model, with K=40 topics and 100 passes, is the best so far.
# The top 10 topics are relevant and cohesive.

# In[6]:

lda.show_topics()

# In[ ]: