Location: this example is available as a public Gist that contains both this notebook and the accompanying Python module, text_utils.py.
In this example, we use the IPython notebook to mine data from Twitter with the twython library. Once we have fetched the raw stream for a specific query, we will at first do some basic word frequency analysis on the results using Python's builtin dictionaries, and then we will use the excellent NetworkX library developed at Los Alamos National Laboratory to look at the results as a network and understand some of its properties.
Using NetworkX, we aim to answer the following questions: for a given query, which words tend to appear together in tweets, and what global pattern of relationships between these words emerges from the entire set of results?
Obviously the analysis of text corpora of this kind is a complex topic at the intersection of natural language processing, graph theory and statistics, and here we do not pretend to provide an exhaustive coverage of it. Rather, we want to show you how with a small amount of easy to write code, it is possible to do a few non-trivial things based on real-time data from the Twitter stream. Hopefully this will serve as a good starting point; for further reading you can find in-depth discussions of analysing social network data in Python in the book Mining the Social Web.
Note: for this you'll need to have NetworkX as well as twython installed. You can do so in Ubuntu with the command
sudo apt-get install python-networkx
sudo pip install twython
or in other systems with
sudo pip install networkx
sudo pip install twython
where you should simply omit the sudo command if you are using Windows.
We start by loading the pylab plot support as well as importing NetworkX and the free Twython library to query Twitter's stream:
%pylab inline
import networkx as nx
from twython import Twython
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.
Now, we load a local library with some analysis utilities whose code is a bit long to display inline. If you downloaded the complete gist you should already have it.
import text_utils as tu # shorthand for convenience
And we create the main Twitter object we'll use later for all queries:
twitter = Twython()
Here we define which query we want to perform, as well as which words we want to filter out from our analysis because they appear very commonly and we're not interested in them.
Typically you want to run the query once, and after seeing what comes out, fine-tune the removal list, as the definition of which words are considered 'noise' is fairly query-specific (and also changes over time, depending on what's happening out there on Twitter):
query = "Big Data"

words_to_remove = """with what some your just have from it's /via & that
they your there this"""
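The `tu.removal_set` helper lives in text_utils.py; as a rough sketch (the function name is real, but this body is an assumption and the actual implementation may differ), you can think of it as building a lowercase set from the removal string plus the query terms themselves, so the query words don't dominate their own results:

```python
def removal_set(words, query):
    """Hypothetical sketch: lowercase set of noise words plus the query terms.

    The real tu.removal_set in text_utils.py may differ in detail."""
    remove = set(words.lower().split())
    remove.update(query.lower().split())
    return remove

demo_remove = removal_set("with what some", "Big Data")
# demo_remove now contains 'big' and 'data' along with the listed noise words
```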
This is the cell that actually fetches data from Twitter. We limit the output to at most the first 30 pages of search results (though typically Twitter stops returning results before that).
n_pages = 30
results = []
retweets = []
for page in range(1, n_pages+1):
    search = twitter.search(q=query+' lang:en', page=str(page))
    res = search['results']
    if not res:
        print 'No more results returned, stopping at page:', page
        break
    for t in res:
        if t['text'].startswith('RT '):
            retweets.append(t)
        else:
            results.append(t)

tweets = [t['text'] for t in results]

# Quick summary
print 'Query:   ', query
print 'Results: ', len(results)
print 'Retweets:', len(retweets)
print 'Variable `tweets` has a list of all the tweet texts'
No more results returned, stopping at page: 23
Query:    Big Data
Results:  219
Retweets: 111
Variable `tweets` has a list of all the tweet texts
Let's see what the first 10 tweets look like:

tweets[:10]
[u'If enough people played the game, they could really muck up BIG DATA, with wrong information. Yes!',
 u'Most girls with big bums are facially challenged, there is statistical data to prove this.',
 u'The Real Reason #Hadoop Is Such a Big Deal in #BigData http://t.co/H3kAyFl3mV via @TheTechScribe, @RWW',
 u'@Mark_Goldberg Hey Mark, what are your top 3 concerns for Big Data? Take the poll.\nhttp://t.co/XtUmoPXVev',
 u"Weird! RT A Big Day Out at... the Guardian's New 'Data-Driven' Coffee Shop! | VICE United Kingdom http://t.co/C3VyMNuSXH via @VICEUK",
 u'CW500: Forget the fancy graphics: how to make big data work for ordinary people \xa0 http://t.co/ge5Xhi3KDN #b2b',
 u'Big data needs thick data - http://t.co/7q3HZRsOkC',
 u'Big Data And Simulations Are Transforming Marketing http://t.co/f8LpZrlDlm"',
 u'Without Analytics, Big Data is Just Noise @briansolis - http://t.co/wW1WFKOnKF',
 u'Toyota announces upcoming launch in Japan of the Big Data Traffic Information Service http://t.co/N9j0xSojnk']
Now we do some cleanup of the common words above, so that we can then compute some basic statistics:
remove = tu.removal_set(words_to_remove, query)
lines = tu.lines_cleanup([tweet['text'].encode('utf-8') for tweet in results], remove=remove)
words = '\n'.join(lines).split()
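To make the cleanup step concrete, here is a simplified stand-in for `tu.lines_cleanup` (the name matches the helper above, but this body is an assumption; the real version may also strip punctuation and handle encoding issues): lowercase each tweet and drop every token found in the removal set.

```python
def lines_cleanup(texts, remove=None):
    """Hypothetical sketch of tu.lines_cleanup: lowercase each text and
    drop any whitespace-separated token found in the removal set."""
    remove = remove or set()
    cleaned = []
    for text in texts:
        tokens = [t for t in text.lower().split() if t not in remove]
        cleaned.append(' '.join(tokens))
    return cleaned

cleaned = lines_cleanup(["Big Data is BIG"], remove={"big", "data"})
# cleaned -> ['is']
```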
Compute frequency histogram:
wf = tu.word_freq(words)
sorted_wf = tu.sort_freqs(wf)
Let's look at a summary of the word frequencies from this dataset:
Number of unique words: 1138

10 least frequent words:
http://t.co/o68p8nbtnl -> 1
shop -> 1
managed -> 1
http://t.co/obrc49sxle -> 1
@rnrworks -> 1
https://t.co/5wosulam2t -> 1
#aapor -> 1
four -> 1
avalanche -> 1
@klout -> 1

10 most frequent words:
real -> 13
like -> 13
tracks -> 14
ventures -> 14
discovery -> 14
obama's -> 14
funds -> 14
#humanswarm -> 16
analytics -> 17
google -> 17
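The frequency helpers can be approximated with the standard library's `collections.Counter`. This sketch assumes (the actual text_utils code may differ) that `tu.word_freq` returns a word-to-count mapping and `tu.sort_freqs` a list of (word, count) pairs sorted by ascending frequency, which is consistent with `sorted_wf[-n:]` picking the most frequent words later on:

```python
from collections import Counter

# Equivalent computation using only the standard library, on a toy word list.
demo_words = "big data big analytics data big".split()
wf = Counter(demo_words)                              # word -> count
sorted_wf = sorted(wf.items(), key=lambda kv: kv[1])  # ascending by frequency
# sorted_wf[-1] is the most frequent (word, count) pair: ('big', 3)
```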
Now we can plot the histogram of the n_words most frequent words:
n_words = 15
tu.plot_word_histogram(sorted_wf, n_words, "Frequencies for %s most frequent words" % n_words);
Above we trimmed the histogram to only show n_words words because the distribution is very sharply peaked; this is what the histogram for the whole word list looks like:
tu.plot_word_histogram(sorted_wf, 1.0, "Frequencies for entire word list");
An interesting question to ask is: which pairs of words co-occur in the same tweets? We can find these relations and use them to construct a graph, which we can then analyze with NetworkX and plot with Matplotlib.
We limit the graph to at most n_nodes nodes (taken from the most frequent words), just to keep the visualization easier to read.
n_nodes = 10
popular = sorted_wf[-n_nodes:]
pop_words = [wc[0] for wc in popular]
co_occur = tu.co_occurrences(lines, pop_words)
wgraph = tu.co_occurrences_graph(popular, co_occur, cutoff=1)
wgraph = nx.connected_component_subgraphs(wgraph)[0]
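The co-occurrence computation can be pictured as follows. This is only an illustrative sketch under stated assumptions (the real `tu.co_occurrences` may use a different signature and data structure): for each tweet, count every unordered pair of tracked words that appears in it.

```python
from itertools import combinations
from collections import Counter

def co_occurrences(lines, words):
    """Hypothetical sketch: count unordered pairs of tracked words
    appearing together in the same line of text."""
    tracked = set(words)
    pairs = Counter()
    for line in lines:
        present = sorted(tracked.intersection(line.split()))
        for pair in combinations(present, 2):
            pairs[pair] += 1
    return pairs

pairs = co_occurrences(["google funds big data discovery",
                        "google tracks analytics"],
                       ["google", "funds", "tracks", "analytics"])
# pairs[('funds', 'google')] == 1, pairs[('google', 'tracks')] == 1, ...
```

Each counted pair then becomes a weighted edge in the word graph.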
An interesting summary of the graph structure can be obtained by ranking nodes based on a centrality measure. NetworkX offers several centrality measures; in this case we look at the eigenvector centrality:
centrality = nx.eigenvector_centrality_numpy(wgraph)
tu.summarize_centrality(centrality)
Graph centrality
google: 0.424
ventures: 0.421
funds: 0.421
tracks: 0.421
discovery: 0.421
like: 0.33
obama's: 0.0398
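Eigenvector centrality scores a node by the scores of its neighbours, so a word connected to other well-connected words ranks highest. NetworkX computes it with a proper eigensolver; to show the idea, here is a minimal pure-Python power-iteration sketch on a toy star graph (the graph and function here are illustrative, not the notebook's actual data):

```python
def eigenvector_centrality(adj, iterations=100):
    """Toy power iteration on A+I (adding the node's own score avoids
    oscillation on bipartite graphs), normalized by the maximum score."""
    c = {n: 1.0 for n in adj}
    for _ in range(iterations):
        new = {n: c[n] + sum(c[m] for m in adj[n]) for n in adj}
        norm = max(new.values())
        c = {n: v / norm for n, v in new.items()}
    return c

# A star: 'google' linked to three words that are not linked to each
# other, so 'google' accumulates the highest centrality.
adj = {'google': ['funds', 'tracks', 'ventures'],
       'funds': ['google'], 'tracks': ['google'], 'ventures': ['google']}
c = eigenvector_centrality(adj)
# c['google'] == 1.0, each leaf converges to 1/sqrt(3) ~= 0.577
```

This mirrors the ranking above, where the hub word of the query dominates.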
And we can use this measure to provide an interesting view of the structure of our query dataset:
print "Graph visualization for query:", query
tu.plot_graph(wgraph, tu.centrality_layout(wgraph, centrality),
              plt.figure(figsize=(8, 8)),
              title='Centrality and term co-occurrence graph, q="%s"' % query)
Graph visualization for query: Big Data