Data science on twitter

Twitter is an indispensable resource for data scientists as well as for the broader data science community. With the right connections, you can use twitter to learn data science, discover new technologies, computational tools and methodologies, and you can contribute to and build a community of data scientists working for the social good. This type of value is generally only available to attendees of top data science conferences on disruptive data science, open data science and data science for good. Indeed, with a good twitter list, you can bring much of this content directly to your twitter feed!

Data science is a highly diverse and interdisciplinary field, but does data science twitter chatter reflect its interdisciplinary nature? Are there distinct communities of data scientists that interact with and cater to distinct sub-fields? To begin answering these questions, we will walk you through a simple analysis of a week's worth of data science related tweets.

A data science twitter network

Tweets were collected using a tweepy listener (see here [1] for a tutorial on building a twitter listener) and stored in a text file named "data_science_twitter.txt". Let's first load the tweets and extract user mentions to take a quick look at the volume of data science tweets from this week.

In [3]:
import os
import sys
import json

def tweets_n_edges(tweet_file):
    """Load one JSON tweet per line and build (author, mentioned user) edges."""
    tweets=[]
    edges=[]

    with open(tweet_file,"r") as f:
        for line in f:
            if line=="\n":
                continue # skip blank lines between tweets
            try:
                tweet = json.JSONDecoder().raw_decode(line)[0]
                author = tweet['user']['screen_name']
                for mention in tweet['entities']['user_mentions']:
                    if author != mention['screen_name']: # ignore self-mentions
                        edges.append((author, mention['screen_name']))
                tweets.append(tweet)
            except (ValueError, KeyError, TypeError): # malformed JSON or a non-tweet message
                continue

    return (tweets,edges)


tweets,edges = tweets_n_edges("data_science_twitter.txt")
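
To make the edge extraction concrete, here is a minimal sketch on a single hand-written record. The tweet below is invented for illustration and carries only the fields that `tweets_n_edges` reads; real tweets contain many more. The name `toy_edges` is likewise just a placeholder.

```python
import json

# Hypothetical one-line record, shaped like the JSON stored in the tweet file.
line = json.dumps({
    "user": {"screen_name": "alice"},
    "text": "@bob @alice have you seen this? cc @carol",
    "entities": {"user_mentions": [
        {"screen_name": "bob"},
        {"screen_name": "alice"},  # self-mention, should be dropped
        {"screen_name": "carol"},
    ]},
})

tweet = json.loads(line)
author = tweet["user"]["screen_name"]

# One directed edge per mention: (author, mentioned user), skipping self-mentions.
toy_edges = [(author, m["screen_name"])
             for m in tweet["entities"]["user_mentions"]
             if m["screen_name"] != author]

print(toy_edges)
```

Each edge points from the author of the tweet to the user being mentioned, which is the direction the directed graph relies on later.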

Tweets and network edges (links between twitter users) were gathered based on user mentions. How many tweets and user mentions were there?

In [4]:
print "There are %s tweets about data science this week, and %s user mentions!" % ( len(tweets), len(edges) )
There are 159600 tweets about data science this week, and 162070 user mentions!

The data science twitter community is incredibly active: we saw almost 160,000 tweets within a single week! There is just as much interaction within the community, with slightly more user mentions than tweets (about 162,000), not counting self-mentions.
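
A simple way to get a first feel for these interactions is to count how often each handle is mentioned. The sketch below runs on a small invented edge list so that it is self-contained; with the real data, the `edges` list built earlier would take the place of the hypothetical `toy_edges`.

```python
from collections import Counter

# Invented (author, mentioned user) pairs standing in for the real edge list.
toy_edges = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("dave", "bob"),
    ("bob", "carol"),
    ("alice", "carol"),
    ("dave", "alice"),
]

# Count incoming mentions per handle (the in-degree of each node).
mention_counts = Counter(target for _, target in toy_edges)

# Every handle that appears in the network, as author or as mentioned user.
users = set(u for edge in toy_edges for u in edge)

print("%d mentions among %d users" % (len(toy_edges), len(users)))
for handle, n in mention_counts.most_common(3):
    print("%s was mentioned %d times" % (handle, n))
```

Raw mention counts treat every mention equally. Eigenvector centrality, used in the rest of this analysis, also takes into account who is doing the mentioning.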

But what does the network actually look like? To build a network and find the most influential data science twitter users, we will use the NetworkX [2] package to create a directed graph and to calculate eigenvector centrality (a measure of network influence) among the nodes (twitter users). The resulting network is plotted using Gephi [3].

In [5]:
import networkx as nx

G=nx.DiGraph() # initiate a directed graph
G.add_edges_from(edges) # add edges to the graph from user mentions
ev_cent=nx.eigenvector_centrality(G,max_iter=10000) # compute eigenvector centrality

# rank users by centrality, highest first, and keep the top 10 network influencers
ranked = sorted(ev_cent.items(), key=lambda x: x[1], reverse=True)
list(enumerate(ranked[:10], 1))
Out[5]:
[(1, (u'GilPress', 0.38942565243403915)),
 (2, (u'KirkDBorne', 0.30906334335611996)),
 (3, (u'Forbes', 0.23035596746895132)),
 (4, (u'BernardMarr', 0.21142119479688257)),
 (5, (u'bobehayes', 0.2072355059058224)),
 (6, (u'kdnuggets', 0.15597621686762647)),
 (7, (u'Ronald_vanLoon', 0.15518713444196847)),
 (8, (u'LinkedIn', 0.12561861905035457)),
 (9, (u'DataScienceCtrl', 0.11756733241544594)),
 (10, (u'BoozAllen', 0.11138358070618962))]
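
The intuition behind eigenvector centrality is that a user is influential when they are mentioned by other influential users. The following is a simplified, pure-Python power iteration on an invented mention network. It is meant to illustrate the idea only and is not NetworkX's implementation; the function name and the toy edges are assumptions, and the sketch assumes every user can be reached from every other user so that the iteration settles.

```python
import math

def eigenvector_centrality_sketch(edges, n_iter=200, tol=1e-10):
    """Power iteration: a node's score is the summed score of the nodes that mention it."""
    nodes = sorted(set(u for e in edges for u in e))
    score = dict((n, 1.0 / len(nodes)) for n in nodes)
    for _ in range(n_iter):
        new = dict((n, 0.0) for n in nodes)
        for source, target in edges:
            new[target] += score[source]  # influence flows along the mention
        norm = math.sqrt(sum(v * v for v in new.values()))
        if norm == 0:
            raise ValueError("scores vanished; the toy assumption does not hold")
        new = dict((n, v / norm) for n, v in new.items())
        converged = max(abs(new[n] - score[n]) for n in nodes) < tol
        score = new
        if converged:
            break
    return score

# Invented (author, mentioned user) pairs; every user both mentions and is mentioned.
toy_edges = [
    ("alice", "bob"), ("carol", "bob"), ("dave", "bob"),
    ("bob", "carol"), ("bob", "alice"), ("carol", "alice"),
    ("alice", "dave"),
]

scores = eigenvector_centrality_sketch(toy_edges)
for handle in sorted(scores, key=scores.get, reverse=True):
    print("%s %.3f" % (handle, scores[handle]))
```

In this toy network the handle mentioned by everyone else ends up on top, and the handles it mentions in turn inherit part of that influence.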

Nodes represent twitter handles and the edges between the nodes represent user mentions. The size and color of each node correspond to its eigenvector centrality, which, again, is one measure of network influence. Let's take a quick peek at the top 10 influencers (who are also plotted above):

  1. GilPress
  2. KirkDBorne
  3. Forbes
  4. BernardMarr
  5. bobehayes
  6. kdnuggets
  7. Ronald_vanLoon
  8. LinkedIn
  9. DataScienceCtrl
  10. BoozAllen

The top 10 influencers include some of the most respected individuals and organizations in data science, and so their influence among data scientists on twitter is not at all surprising.

However, data science is a highly interdisciplinary field. Different communities may have different topic foci and different community influencers. For data scientists working in different sub-fields or in different spheres of data science, it is important to know who the most influential figures in the various sub-domains are, as these will be the people/handles to follow for the most up-to-date news, analyses, methods and tools. To find distinct data science communities, we will use Community [4], a community detection package built to work on top of NetworkX. It implements the Louvain method [5] for community detection.

In [6]:
import community

def get_communities(tweets, edges):
    G_un=nx.Graph() # community detection runs on an undirected graph
    G_un.add_edges_from(edges)
    parts = community.best_partition(G_un) # maps screen name -> community number

    communities = {}

    for i in tweets:
        screen_name = i['user']['screen_name'].encode("ascii","ignore")
        raw_text = i['text'].encode("ascii","ignore")
        if screen_name in parts and i.get('lang') in ('en','und'): # keep English (or undetermined) tweets
            members = communities.setdefault(parts[screen_name], {})
            if screen_name in members:
                # known user: count the tweet and append its text
                members[screen_name]['n_tweets'] += 1
                members[screen_name]['raw_text'] = ' '.join([members[screen_name]['raw_text'], raw_text])
            else:
                members[screen_name] = {
                    'raw_text' : raw_text,
                    'n_tweets' : 1
                }

    return communities

communities = get_communities(tweets,edges)

In [7]:
community_size = []
for i in communities.keys():
    community_size.append((i,len(communities[i].keys())))

print "%s distinct communities were detected \n" % len(communities.keys())

print "Here are the top 10 most populous communities:\n"
for i,j in sorted(community_size,key=lambda x: x[1])[::-1][:10]:
    print "Community %s has %s members" % (i,j)
1234 distinct communities were detected 

Here are the top 10 most populous communities:

Community 25 has 2883 members
Community 3 has 2841 members
Community 11 has 2564 members
Community 22 has 2027 members
Community 13 has 1629 members
Community 17 has 1619 members
Community 39 has 1442 members
Community 38 has 822 members
Community 19 has 785 members
Community 45 has 776 members
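
The Louvain method searches for a partition of the network that scores highly on modularity, a measure that compares the number of edges inside each community with the number expected if edges were placed at random. As a rough illustration, the sketch below computes modularity by hand for an invented graph of two triangles joined by a single edge. The function name and the toy graph are assumptions, and the code assumes an undirected, unweighted graph without duplicate edges or self-loops.

```python
from collections import defaultdict

def modularity(edges, partition):
    """Modularity Q of an undirected, unweighted graph for a node -> community mapping."""
    m = float(len(edges))
    intra = defaultdict(int)   # edges with both ends inside the community
    degree = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        degree[partition[u]] += 1
        degree[partition[v]] += 1
        if partition[u] == partition[v]:
            intra[partition[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]

two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = dict((node, 0) for node in two_groups)

q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
print("two communities: Q = %.3f" % q_two)
print("one community:   Q = %.3f" % q_one)
```

Splitting the toy graph along the bridge scores higher than lumping everything together, which is the kind of improvement the Louvain method looks for as it merges nodes into communities.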

Chatter among data science communities

We see that there are a number of highly populous communities detected in the larger network, and many more communities that are smaller in size. Let's take a quick look at a few of the most populous communities. We will look to see who the most influential users are among each of the interrogated communities, and try to find popular topics that the community focuses on using topic modeling. Our analyses will focus on communities 11, 13 and 38.

Let's start by visually inspecting the sub-network associated with community 11:

We see a number of influential handles in this subnetwork, but the top 5 are:

  1. BernardMarr
  2. DataScienceCtrl
  3. EvanSinar
  4. Datafloq
  5. kdnuggets

But what is this data science community talking about? To take a quick look at the types of topics that this community is interested in, we will use the topic modeling package Topik [6] from Continuum. Topik gives a high-level interface to wildly popular topic modeling libraries in Python.

First we want to set up a directory structure for Topik to read from. We make each twitter user in the community a document that Topik can read:

In [8]:
import re

def make_dir_struc(communities):
    os.makedirs("communities")

    for i in communities.keys():
        os.makedirs("./communities/"+str(i))
        for ii in communities[i].keys():
            if communities[i][ii]['n_tweets']>2:
                raw_text = communities[i][ii]['raw_text']

                # try to get rid of links
                raw_text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', raw_text)
                raw_text = ' '.join([iii for iii in raw_text.split() if iii[:4] !="http"])

                # try to get rid of hashtags and user mentions 
                raw_text = ' '.join([iii for iii in raw_text.split() if "#" not in iii])
                raw_text = ' '.join([iii for iii in raw_text.split() if "@" not in iii])

                # clean up
                raw_text = raw_text.encode("ascii","ignore").replace('\n', ' ')
                if len(raw_text.split()) > 100:
                    with open("./communities/"+str(i)+"/"+ii,"w") as comm_user:
                        comm_user.write(raw_text)
                    
make_dir_struc(communities)
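
To see what the cleaning steps do to a single tweet, here is a small standalone sketch. It uses a simpler link pattern than the one in `make_dir_struc`, and both the helper name `clean_tweet_text` and the sample tweet are made up for illustration.

```python
import re

def clean_tweet_text(text):
    """Drop links, hashtags and user mentions, then collapse whitespace."""
    text = re.sub(r'https?://\S+', '', text)  # simplified link pattern (assumption)
    tokens = [t for t in text.split() if "#" not in t and "@" not in t]
    return ' '.join(tokens)

# Invented tweet text with a link, a hashtag, two mentions and a newline.
sample = ("RT @example_user: Top 10 #MachineLearning tutorials for beginners "
          "http://example.com/post\nvia @another_user")

cleaned = clean_tweet_text(sample)
print(cleaned)
```

Tokens such as "RT" and "via" survive this step, which is why they appear in the stop word list used for topic modeling.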

Let's now build a topic model for community 11 and visualize the result. Topik enables us to do so in a very streamlined way. We will simply tokenize the data, input a list of stop words and the number of topics to search for, then build the model and visualize the results using Topik.

In [2]:
import nltk
from topik import read_input, tokenize, vectorize, run_model, visualize
from topik.visualizers.termite_plot import termite
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook

output_notebook()

def topic_model(directory, stopwords, ntopics):
    raw_data = read_input(directory)
    content_field = "text"
    raw_data = ((hash(item[content_field]), item[content_field]) for item in raw_data)
    tokenized_corpus = tokenize(raw_data,stopwords=stopwords)
    vectorized_corpus = vectorize(tokenized_corpus)
    model = run_model(vectorized_corpus, ntopics=ntopics)
    return model
Loading BokehJS ...
In [23]:
stopwords=['amp','get','got','hey','hmm','hoo','hop','iep','let','ooo','par',
            'pdt','pln','pst','wha','yep','yer','aest','didn','nzdt','via',
            'one','com','new','like','great','make','top','awesome','best',
            'good','wow','yes','say','yay','would','thanks','thank','going','ht',
            'new','use','should','could','best','really','see','want','nice', 'rt',
            'while','know','big','data','bigdatablogs']

stopwords=set(stopwords+nltk.corpus.stopwords.words("english"))
ntopics = 40

directory = "./communities/11/" # start with community 11
model = topic_model(directory, stopwords, ntopics)
show(termite(model))
Out[23]:

<Bokeh Notebook handle for In[23]>

The termite plot [7] is a nice way to visualize topic modeling results. The x axis lists the topic numbers and the y axis lists frequent topic words. The size of each circle corresponds to the frequency of that word within a topic. The termite plot for community 11 seems to show that the twitter chatter for this community includes broad data science topics like machine intelligence, analytics and data mining, but also a substantial amount of chatter about data science related blogs, blog posts or stories, as well as data science conferences (such as ODSC Boston), tutorials, online classes and careers.

This community seems to reflect the general data science community, but it also includes twitter handles who are influential community builders and who routinely tweet about data science blogging, reporting and other community-focused topics, such as training, conferences and careers. It is no surprise, then, that DataScienceCtrl and kdnuggets are among the most influential handles in this network. Not only are kdnuggets and DataScienceCtrl regularly among the most active and respected sources of data science news and blog postings, but BernardMarr, EvanSinar and Datafloq are all highly respected and influential in the broader data science community.

If we look at the next community, community 13 (above), we see that the most influential handles include:

  1. BoozAllen
  2. LaurenNealPhD
  3. wendykan
  4. petrguerra
  5. kaggle

In [28]:
directory = "./communities/13/"
model = topic_model(directory, stopwords, ntopics)
show(termite(model))