Exploring graph properties of the Twitter stream

An interactive narrative built with twython, NetworkX and IPython

Author: Fernando Perez, @fperez_org.

Location: this example is available as a public Gist that contains both this notebook and the accompanying Python text_utils.py module.

In this example, we use the IPython notebook to mine data from Twitter with the twython library. Once we have fetched the raw stream for a specific query, we will first do some basic word frequency analysis on the results using Python's built-in dictionaries, and then we will use the excellent NetworkX library, developed at Los Alamos National Laboratory, to look at the results as a network and understand some of its properties.

Using NetworkX, we aim to answer the following questions: for a given query, which words tend to appear together in tweets, and what global pattern of relationships between these words emerges from the entire set of results?

Obviously the analysis of text corpora of this kind is a complex topic at the intersection of natural language processing, graph theory and statistics, and we do not pretend to cover it exhaustively here. Rather, we want to show you how, with a small amount of easy-to-write code, it is possible to do a few non-trivial things based on real-time data from the Twitter stream. Hopefully this will serve as a good starting point; for further reading you can find in-depth discussions of analyzing social network data in Python in the book Mining the Social Web.

Note: for this you'll need to have NetworkX as well as twython installed. You can do so in Ubuntu with the command

sudo apt-get install python-networkx
sudo pip install twython

or in other systems with

sudo pip install networkx
sudo pip install twython

where you should simply omit the sudo command if you are using Windows.

Initialization and libraries

We start by loading the pylab plot support as well as importing NetworkX and the free Twython library to query Twitter's stream:

In [22]:
%pylab inline
import networkx as nx
from twython import Twython
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Now, we load a local library with some analysis utilities whose code is a bit long to display inline. If you downloaded the complete gist you should already have it.

In [2]:
import text_utils as tu  # shorthand for convenience

And we create the main Twitter object we'll use later for all queries:

In [3]:
twitter = Twython()

Query declaration

Here we define which query we want to perform, as well as which words we want to filter out from our analysis because they appear very commonly and we're not interested in them.

Typically you want to run the query once, and after seeing what comes out, fine-tune the removal list, as the definition of which words are considered 'noise' is fairly query-specific (and also changes over time, depending on what's happening out there on Twitter):

In [12]:
query = "Big Data"
words_to_remove = """with what some your just have from it's /via &
that they your there this"""
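Though text_utils isn't shown inline, the removal set it builds can be sketched as follows. This is an assumption about what `tu.removal_set` does, not its actual code: the key idea is that the query terms themselves must also be removed, since by construction they appear in every result and carry no information.

```python
def removal_set(words, query):
    """Build a set of lowercase words to strip from the tweets.

    NOTE: a sketch of what text_utils.removal_set presumably does;
    the real implementation may differ.
    """
    common = set(words.lower().split())
    query_terms = set(query.lower().split())
    return common | query_terms

remove = removal_set("with what some your just", "Big Data")
# 'big' and 'data' are in the set even though they weren't listed explicitly
```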

Perform query to Twitter servers

This is the cell that actually fetches data from Twitter. We limit the search to at most the first 30 pages of results (though typically Twitter stops returning results before that).

In [6]:
n_pages = 30

results = []
retweets = []
for page in range(1, n_pages+1):
    search = twitter.search(q=query+' lang:en', page=str(page))
    res = search['results']
    if not res:
        print 'No more results returned, stopping at page:', page
        break
    for t in res:
        if t['text'].startswith('RT '):
            retweets.append(t)
        else:
            results.append(t)

tweets = [t['text'] for t in results]

# Quick summary
print 'Query:   ', query
print 'Results: ', len(results)
print 'Retweets:', len(retweets)
print 'Variable `tweets` has a list of all the tweet texts'
No more results returned, stopping at page: 23
Query:    Big Data
Results:  219
Retweets: 111
Variable `tweets` has a list of all the tweet texts

Text statistics

Let's see what the first 10 tweets look like:

In [13]:
tweets[:10]
[u'If enough people played the game, they could really muck up BIG DATA, with wrong information. Yes!',
 u'Most girls with big bums are facially challenged, there is statistical data to prove this.',
 u'The Real Reason #Hadoop Is Such a Big Deal in #BigData http://t.co/H3kAyFl3mV via @TheTechScribe, @RWW',
 u'@Mark_Goldberg Hey Mark, what are your top 3 concerns for Big Data? Take the poll.\nhttp://t.co/XtUmoPXVev',
 u"Weird! RT A Big Day Out at... the Guardian's New 'Data-Driven' Coffee Shop! | VICE United Kingdom http://t.co/C3VyMNuSXH via @VICEUK",
 u'CW500: Forget the fancy graphics: how to make big data work for ordinary people \xa0 http://t.co/ge5Xhi3KDN #b2b',
 u'Big data needs thick data - http://t.co/7q3HZRsOkC',
 u'Big Data And Simulations Are Transforming Marketing http://t.co/f8LpZrlDlm"',
 u'Without Analytics, Big Data is Just Noise @briansolis - http://t.co/wW1WFKOnKF',
 u'Toyota announces upcoming launch in Japan of the Big Data Traffic Information Service http://t.co/N9j0xSojnk']

Now we do some cleanup of the common words above, so that we can then compute some basic statistics:

In [14]:
remove = tu.removal_set(words_to_remove, query)
lines = tu.lines_cleanup([tweet['text'].encode('utf-8') for tweet in results], remove=remove)
words = '\n'.join(lines).split()

Compute frequency histogram:

In [15]:
wf = tu.word_freq(words)
sorted_wf = tu.sort_freqs(wf)
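As mentioned in the introduction, this frequency analysis needs nothing more than Python's built-in dictionaries. Since text_utils isn't displayed inline, here is a minimal sketch of what `word_freq` and `sort_freqs` amount to (an assumption about the module's behavior, consistent with the `sorted_wf[-n_nodes:]` usage below, which expects the most frequent words at the end):

```python
def word_freq(words):
    """Count occurrences of each word using a plain dict."""
    freqs = {}
    for word in words:
        freqs[word] = freqs.get(word, 0) + 1
    return freqs

def sort_freqs(freqs):
    """Return (word, count) pairs sorted from least to most frequent."""
    return sorted(freqs.items(), key=lambda wc: wc[1])

wf = word_freq("big data is big".split())
sorted_wf = sort_freqs(wf)
```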

Let's look at a summary of the word frequencies from this dataset:

In [16]:
Number of unique words: 1138

10 least frequent words:
 http://t.co/o68p8nbtnl -> 1
                   shop -> 1
                managed -> 1
 http://t.co/obrc49sxle -> 1
              @rnrworks -> 1
https://t.co/5wosulam2t -> 1
                 #aapor -> 1
                   four -> 1
              avalanche -> 1
                 @klout -> 1

10 most frequent words:
       real -> 13
       like -> 13
     tracks -> 14
   ventures -> 14
  discovery -> 14
    obama's -> 14
      funds -> 14
#humanswarm -> 16
  analytics -> 17
     google -> 17

Now we can plot the histogram of the n_words most frequent words:

In [17]:
n_words = 15
tu.plot_word_histogram(sorted_wf, n_words, "Frequencies for %s most frequent words" % n_words);

Above we trimmed the histogram to show only n_words because the distribution is very sharply peaked; this is what the histogram for the whole word list looks like:

In [18]:
tu.plot_word_histogram(sorted_wf, 1.0, "Frequencies for entire word list");

Co-occurrence graph

An interesting question to ask is: which pairs of words co-occur in the same tweets? We can find these relations and use them to construct a graph, which we can then analyze with NetworkX and plot with Matplotlib.

We limit the graph to have at most n_nodes (for the most frequent words) just to keep the visualization easier to read.

In [19]:
n_nodes = 10
popular = sorted_wf[-n_nodes:]
pop_words = [wc[0] for wc in popular]
co_occur = tu.co_occurrences(lines, pop_words)
wgraph = tu.co_occurrences_graph(popular, co_occur, cutoff=1)
wgraph = nx.connected_component_subgraphs(wgraph)[0]
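The co-occurrence counting hidden inside `tu.co_occurrences` can be sketched with `itertools.combinations`: for each tweet, find which of the tracked words it contains and increment a counter for every pair among them. This is an illustration of the idea under assumed data structures (pairs keyed as sorted tuples); the actual return format of the text_utils function may differ.

```python
from itertools import combinations

def co_occurrences(lines, words):
    """Count how many lines each pair of tracked words appears in together.

    Returns a dict mapping a tuple (w1, w2), with w1 < w2, to a count.
    A sketch of the technique, not text_utils' actual code.
    """
    tracked = set(words)
    counts = {}
    for line in lines:
        present = tracked.intersection(line.split())
        for pair in combinations(sorted(present), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

sample = ['big data analytics', 'big data google', 'google analytics']
pairs = co_occurrences(sample, ['big', 'google', 'analytics'])
```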

An interesting summary of the graph structure can be obtained by ranking nodes based on a centrality measure. NetworkX offers several centrality measures, in this case we look at the Eigenvector Centrality:

In [20]:
centrality = nx.eigenvector_centrality_numpy(wgraph)
print 'Graph centrality'
for node in sorted(centrality, key=centrality.get, reverse=True):
    print '%15s: %.3g' % (node, centrality[node])
Graph centrality
         google: 0.424
       ventures: 0.421
          funds: 0.421
         tracks: 0.421
      discovery: 0.421
           like: 0.33
        obama's: 0.0398
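Eigenvector centrality scores each node by the principal eigenvector of the graph's adjacency matrix, so a node ranks highly when its neighbors themselves rank highly. NetworkX computes this with NumPy under the hood; the same quantity can be approximated by plain power iteration, sketched here on a hypothetical toy graph (the node names echo the output above but the graph itself is made up):

```python
def eigenvector_centrality(adj, n_iter=100):
    """Approximate eigenvector centrality by power iteration on an
    adjacency dict {node: set_of_neighbors} for an undirected graph.
    A didactic sketch, not NetworkX's implementation."""
    nodes = list(adj)
    x = dict((n, 1.0) for n in nodes)
    for _ in range(n_iter):
        # One multiplication by the adjacency matrix...
        new = dict((n, sum(x[m] for m in adj[n])) for n in nodes)
        # ...followed by normalization to unit Euclidean length.
        norm = sum(v * v for v in new.values()) ** 0.5
        x = dict((n, v / norm) for n, v in new.items())
    return x

# Toy graph: 'google' sits in a triangle and also has a pendant neighbor.
adj = {'google':   {'funds', 'tracks', 'ventures'},
       'funds':    {'google', 'tracks'},
       'tracks':   {'google', 'funds'},
       'ventures': {'google'}}
c = eigenvector_centrality(adj)
```

Note that 'ventures' scores below 'funds' even though both touch 'google': centrality flows from well-connected neighbors, and 'funds' also borders 'tracks'.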

And we can use this measure to provide an interesting view of the structure of our query dataset:

In [21]:
print "Graph visualization for query:", query
tu.plot_graph(wgraph, tu.centrality_layout(wgraph, centrality), plt.figure(figsize=(8,8)),
    title='Centrality and term co-occurrence graph, q="%s"' % query)
Graph visualization for query: Big Data