Wikipedia Network Analysis

By Brian Keegan, Ph.D. -- October 4, 2014

Released under a CC-BY-SA 3.0 License.

Web; @bkeegan; GitHub

We start by importing the libraries we'll use throughout the analysis.

In [3]:
# Standard packages for data analysis
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# pandas handles tabular data
import pandas as pd

# networkx handles network data
import networkx as nx

# json handles reading and writing JSON data
import json

# To visualize webpages within this webpage
from IPython.display import HTML

# To run queries against MediaWiki APIs
from wikitools import wiki, api

# Some other helper functions
from collections import Counter
from operator import itemgetter

Write a basic query against the Wikipedia API

Running a test query in the browser

We are going to make a query using the list=users option of the query action. This page contains all the documentation for the different queries you can run from Wikipedia's MediaWiki API. For our first test query, we'll want to get the information about a single user. Search for "list=users". You can also find similar information about this specific query option here in the general MediaWiki documentation.

We can actually write a test query in the URL, which will return results if the parameters are all valid. Use the example given on the api.php documentation page: https://en.wikipedia.org/w/api.php?action=query&list=users&ususers=Madcoverboy|Jimbo_Wales&usprop=blockinfo|groups|editcount|registration|gender

There are four parameters in this API call, separated by & signs:

  • action - We pass a query option here to differentiate it from other actions we can run on the API like parse. But action=query will be what we use much of the time.
  • list - This is one of several parameters we can use to make a query; search for "action=query" for others besides list. We pass a users option to list because we want to generate information about users. This lets us run the sub-options detailed in the documentation below.
  • ususers - Here we list the names of Wikipedia users we want to get information about. We can pass more than one name by adding a pipe "|" between names. The documentation says we can only pass up to 50 names per request. Here we pass two names Madcoverboy for yours truly and Jimbo_Wales for the founder of Wikipedia.
  • usprop - Here we pass a list of options detailed under the list=users about information we can obtain about any user. Again we use pipes to connect multiple options together. We are going to get information about whether a user is currently blocked (blockinfo), what powers the user has (groups), their total number of edits (editcount), the date and time they registered their account (registration), and their self-reported gender (gender).

In summary, this API request is going to perform a query action that expects us to pass a list of user names and will return information about the users. We have given the query the names of the users we want information about as well as the specific types of information about each of these users.
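The same parameters can also be assembled into a query URL programmatically rather than typed by hand. Here is a minimal sketch (not from the original notebook) using the standard library's urlencode; the endpoint and usernames are the ones from the example above:

```python
# Sketch: build the API query URL from a dict of parameters.
try:
    from urllib import urlencode        # Python 2, which this notebook targets
except ImportError:
    from urllib.parse import urlencode  # Python 3

params = {'action': 'query',
          'list': 'users',
          'ususers': 'Madcoverboy|Jimbo Wales',
          'usprop': 'blockinfo|groups|editcount|registration|gender'}

# The pipes get percent-encoded as %7C, which the API treats as equivalent
url = 'https://en.wikipedia.org/w/api.php?' + urlencode(params)
print(url)
```

This is exactly the dictionary-of-parameters shape we'll use for every query below.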

The codeblock below shows what clicking the URL should return.

In [27]:
MediaWiki API Result
You are looking at the HTML representation of the XML format.
HTML is good for debugging, but is unsuitable for application use.
Specify the format parameter to change the output format.
To see the non HTML representation of the XML format, set format=xml.
See the complete documentation, or API help for more information.
<?xml version="1.0"?>
<api>
  <query>
    <users>
      <user userid="304994" name="Madcoverboy" editcount="12348" registration="2005-06-21T13:52:16Z" gender="male" />
      <user userid="24" name="Jimbo Wales" editcount="11768" registration="2001-03-27T20:47:31Z" gender="unknown" />
    </users>
  </query>
</api>

There's a lot of padding and fields from the XML markup this returns by default, but the data are all in there. My userid is "304994", my username is "Madcoverboy" (which we already knew), I have 12,348 edits, I registered my account on June 21, 2005 at 1:52:16pm GMT, I identify as male, and I'm a member of four "groups" corresponding to my editing privileges: reviewer, *, user, and autoconfirmed.

Running the same query in Python

Clicking on the link will run the query and return the results in your web browser. However, the point of using an API is not for you to make queries with a URL and then copy-paste the results into Python. We're going to run the query within Python (rather than the web browser) and return the data back to us in a format that we can continue to use for analysis.

First we need to write a function that will accept something that corresponds to the query we want to run, goes out and connects to the English Wikipedia's MediaWiki API "spigot", formats our query for this API to understand, runs the query until all the results come back, and then returns the results to us as some data object. The function below does all of those things, but it's best to just treat it as a black box for now that accepts queries and spits out the results from the English Wikipedia.

(If you want to use another MediaWiki API, replace the URL following site_url with the corresponding API location. Memory Alpha, for example, has its own api.php endpoint.)

In [5]:
def wikipedia_query(query_params,site_url='https://en.wikipedia.org/w/api.php'):
    site = wiki.Wiki(url=site_url)
    request = api.APIRequest(site, query_params)
    result = request.query()
    return result[query_params['action']]

We can write the exact same query as we used above using a dictionary for all the same request parameters as key value pairs and save the dictionary as user_query. For example, where we used action=query in the URL above, we use 'action':'query' as a key-value pair of strings (make sure to include the quotes marking these as strings, rather than variables!) in the query dictionary. Then we can pass this query dictionary to the wikipedia_query black box function defined above to get the exact same information out. We save the output in query_results and can look at the results by calling this variable.

In [6]:
user_query = {'action':'query',
              'list':'users',
              'ususers':'Madcoverboy|Jimbo Wales',
              'usprop':'blockinfo|groups|editcount|registration|gender'}

query_results = wikipedia_query(user_query)
query_results
{u'users': [{u'editcount': 12348,
   u'gender': u'male',
   u'groups': [u'reviewer', u'*', u'user', u'autoconfirmed'],
   u'name': u'Madcoverboy',
   u'registration': u'2005-06-21T13:52:16Z',
   u'userid': 304994},
  {u'editcount': 11778,
   u'gender': u'unknown',
   u'groups': [u'checkuser', u'founder', u'oversight', u'sysop', u'*', u'user', ...],
   u'name': u'Jimbo Wales',
   u'registration': u'2001-03-27T20:47:31Z',
   u'userid': 24}]}

The data structure that is returned is a dictionary keyed by 'users' which returns a list of dictionaries. Knowing that the data corresponding to Jimbo Wales is the second element in the list of dictionaries (remember Python indices start at 0, so the 2nd element corresponds to 1), we can access his edit count.

In [7]:
query_results['users'][1]['editcount']
11778

Write a function to simplify the process

Instead of writing each query manually, we can define a function get_user_properties that accepts a user name (or names) and returns the results of the query used above, but replaces "Madcoverboy" and "Jimbo Wales" with the user name(s) passed.

In [8]:
def get_user_properties(user):
    result = wikipedia_query({'action':'query',
                              'list':'users',
                              'ususers':user,
                              'usprop':'blockinfo|groups|editcount|registration|gender'})
    return result

We can test this function on another user, "Koavf" who is the most active user on the English Wikipedia. We'll save his results to koavf_query_results.

In [9]:
koavf_query_results = get_user_properties('Koavf')
koavf_query_results
{u'users': [{u'editcount': 1335895,
   u'gender': u'male',
   u'groups': [u'accountcreator', ...],
   u'name': u'Koavf',
   u'registration': u'2005-03-05T20:22:47Z',
   u'userid': 205121}]}

Write the data we collected to disk

All the data we've collected in query_results and koavf_query_results exist only in memory. Once we shut this notebook down, these data will cease to exist. So we'll want to save these data to disk by "serializing" into a format that other programs can use. Two very common file formats are JavaScript Object Notation (JSON) and Comma-separated Values (CSV). JSON is better for more complex data that contains a mixture of strings, arrays (lists), dictionaries, and booleans while CSV is better for "flatter" data that you might want to read into a spreadsheet.
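To see the difference concretely, here's a small sketch with toy data (not the query results themselves): JSON round-trips a nested structure like the groups list intact, while CSV stores every cell as flat text. (Python 3 shown for the csv module; the Python 2 version differs slightly in how files are opened.)

```python
import json
import csv
import io

record = {'name': 'ExampleUser', 'editcount': 42, 'groups': ['user', 'autoconfirmed']}

# JSON round-trip: the nested list survives as a real list
restored = json.loads(json.dumps(record))

# CSV round-trip: every cell becomes a string, so the list arrives as its text form
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['name', 'editcount', 'groups'])
writer.writeheader()
writer.writerow(record)
buf.seek(0)
row = next(csv.DictReader(buf))

print(type(restored['groups']))  # a list again
print(type(row['groups']))       # just a string
```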

We can save the koavf_query_results as a JSON file by creating and opening a file named koavf_query_results.json and referring to this connection as f. We use the json.dump function to write all the data in the koavf_query_results dictionary into the file, and once this process is done, the with statement automatically closes the file so that other programs can access it.

In [10]:
with open('koavf_query_results.json','wb') as f:
    json.dump(koavf_query_results,f)

Check to make sure this data was properly exported by reading it back in as loaded_koavf_query_results.

In [11]:
with open('koavf_query_results.json','rb') as f:
    loaded_koavf_query_results = json.load(f)

loaded_koavf_query_results
{u'users': [{u'editcount': 1335895,
   u'gender': u'male',
   u'groups': [u'accountcreator', ...],
   u'name': u'Koavf',
   u'registration': u'2005-03-05T20:22:47Z',
   u'userid': 205121}]}

The query_results data has two "observations" corresponding to "Madcoverboy" and "Jimbo Wales". We could create a CSV with the columns corresponding to the field names (editcount, gender, groups, name, registration, userid) and then two rows containing the corresponding values for each user.

Using a powerful library called "pandas" (short for "panel data", not the cute bears), we can pass the list of data inside query_results and pandas will attempt to convert it to a tabular format called a DataFrame that can be exported to CSV. We save this as df and then use the to_csv function to write this DataFrame to a CSV file. We use two extra options: declaring a quoting behavior makes sure the data in groups, which already contains commas, doesn't get split up later, and since we don't care about the row numbers (the index), we declare index=False so they aren't exported.

In [12]:
query_results['users']
[{u'editcount': 12348,
  u'gender': u'male',
  u'groups': [u'reviewer', u'*', u'user', u'autoconfirmed'],
  u'name': u'Madcoverboy',
  u'registration': u'2005-06-21T13:52:16Z',
  u'userid': 304994},
 {u'editcount': 11778,
  u'gender': u'unknown',
  u'groups': [u'checkuser', u'founder', u'oversight', u'sysop', u'*', u'user', ...],
  u'name': u'Jimbo Wales',
  u'registration': u'2001-03-27T20:47:31Z',
  u'userid': 24}]
In [13]:
df = pd.DataFrame(query_results['users'])
df.to_csv('query_results.csv',quoting=1,index=False) # filename assumed for illustration; quoting=1 is csv.QUOTE_ALL
df
editcount gender groups name registration userid
0 12348 male [reviewer, *, user, autoconfirmed] Madcoverboy 2005-06-21T13:52:16Z 304994
1 11778 unknown [checkuser, founder, oversight, sysop, *, user... Jimbo Wales 2001-03-27T20:47:31Z 24

Check to make sure this data was properly exported by reading it back in.

In [14]:
pd.read_csv('query_results.csv') # filename assumed for illustration
editcount gender groups name registration userid
0 12348 male [u'reviewer', u'*', u'user', u'autoconfirmed'] Madcoverboy 2005-06-21T13:52:16Z 304994
1 11778 unknown [u'checkuser', u'founder', u'oversight', u'sys... Jimbo Wales 2001-03-27T20:47:31Z 24


In this section, we've covered the basics of how:

  • to look up documentation about the MediaWiki API
  • to format a basic query and test that it works within a web browser
  • to run the same query within Python
  • to write a function to simplify querying data
  • to write the results of these queries to files so we can access these data later.

In the next sections, we'll use other queries to get more interesting data about relationships and more advanced data manipulation techniques to prepare these data for social network analysis.

Create hyperlink network

List of articles currently linked from an article

We are going to use the prop=links query to identify a list of articles that are currently linked from an article. We will use the article for "Hillary Rodham Clinton". The general MediaWiki documentation for this query is here. We will specify a query using the action=query to define a general class of query, the prop=links to indicate we want the current links from a page, and then pass the name of a page with titles=Hillary Rodham Clinton.

There are many "namespaces" of Wikipedia pages that reflect different kinds of pages: articles, article talk pages, user pages, user talk pages, and other administrative pages. Links to and from a Wikipedia article can come from all these namespaces, but because the Wikipedia articles that 99% of us ever read are located inside the "0" namespace, we'll limit ourselves to links in that namespace rather than these "backchannel" links. We enforce this limit with the plnamespace=0 option.

There could potentially be hundreds of links from a single article, but the API will only return some number per request. The wikitools library takes care of automatically generating additional requests if there is more data to obtain after the first request. Ideally, we could specify a large number like 10,000 to make sure we get all the links with a single request, but the API enforces a limit of 500 links per request and defaults to only 10 per request. We use pllimit=500 to make sure we get the maximum number of links per request instead of issuing dozens of small requests.
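wikitools handles this continuation behind the scenes, but the pattern is simple enough to sketch. Below, fetch_batch is a made-up stand-in for one HTTP request to api.php (the real API instead returns a continue token when more results remain):

```python
def fetch_batch(offset, limit=500, total=1419):
    """Stand-in for one api.php request: returns a batch of fake link
    titles plus the offset to continue from, or None when exhausted."""
    batch = ['Link %d' % i for i in range(offset, min(offset + limit, total))]
    next_offset = offset + limit if offset + limit < total else None
    return batch, next_offset

def fetch_all():
    """Keep issuing requests until no continuation offset comes back."""
    results, offset, requests = [], 0, 0
    while offset is not None:
        batch, offset = fetch_batch(offset)
        results.extend(batch)
        requests += 1
    return results, requests

links, n_requests = fetch_all()
print(len(links), n_requests)  # all 1419 pretend links, gathered in 3 requests
```

At the default limit of 10 per request, the same 1,419 links would take 142 round trips, which is why pllimit=500 matters.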

In [16]:
outlink_query = {'action': 'query',
                 'prop': 'links',
                 'titles': 'Hillary Rodham Clinton',
                 'pllimit': '500',
                 'plnamespace': '0'}

hrc_outlink_data = wikipedia_query(outlink_query)
In [78]:
hrc_outlink_data['pages'][u'5043192']['links'][:5]
[{u'ns': 0, u'title': u'Virginia Clinton Kelley'},
 {u'ns': 0, u'title': u'2008 Democratic National Convention'},
 {u'ns': 0, u'title': u'Scottish people'},
 {u'ns': 0, u'title': u'United Press International'},
 {u'ns': 0, u'title': u'Edmund Hillary'}]

The data returned by this query is a dictionary of dictionaries that you'll need to dive "into" more deeply to access the data itself. The top dictionary contains a single key 'pages' which returns a dictionary containing a single key u'5043192' corresponding to the page ID for the article. Once you're inside this dictionary, you can access the list of links, which are unfortunately a list of dictionaries! Using something called a "list comprehension", I can clean this data up to get a nice concise list of links, which we save as hrc_outlink_list. I also print out the number of links in this list and 10 examples of these links.
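Hard-coding the page ID u'5043192' works here, but the ID can also be read off the result itself, since this query returns exactly one page. A sketch with a toy dictionary shaped like the API response:

```python
# Toy result shaped like the nested dictionary the API returns
toy_result = {'pages': {'5043192': {'title': 'Hillary Rodham Clinton',
                                    'links': [{'ns': 0, 'title': 'Bill Clinton'},
                                              {'ns': 0, 'title': 'United States Senate'}]}}}

# Pull out whichever single page ID came back instead of hard-coding it
page_id = list(toy_result['pages'].keys())[0]
link_titles = [link['title'] for link in toy_result['pages'][page_id]['links']]
print(link_titles)
```

This is the same trick the get_article_links function below uses to stay generic across articles.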

In [19]:
hrc_outlink_list = [link['title'] for link in hrc_outlink_data['pages'][u'5043192']['links']]
print "There are {0} links from the Hillary Rodham Clinton article".format(len(hrc_outlink_list))
There are 1419 links from the Hillary Rodham Clinton article
[u'Virginia Clinton Kelley',
 u'2008 Democratic National Convention',
 u'Scottish people',
 u'United Press International',
 u'Edmund Hillary',
 u'Madeleine Albright',
 u'Alan Keyes presidential campaign, 2008',
 u'Margaret Bourke-White',
 u'Anna Harrison',
 u'Jeremiah S. Black']

A note on redirects

Note that there is an article for "Hillary Clinton" as well, but this article is a redirect. In other words, this article exists and has data that can be accessed from the API, but it's suspiciously sparse and just points to "Hillary Rodham Clinton".

In [20]:
outlink_query_hc = {'action': 'query',
                    'prop': 'links',
                    'titles': 'Hillary Clinton',
                    'pllimit': '500',
                    'plnamespace': '0'}

hc_outlink_data = wikipedia_query(outlink_query_hc)
{u'pages': {u'39797486': {u'links': [{u'ns': 0,
     u'title': u'Hillary Rodham Clinton'}],
   u'ns': 0,
   u'pageid': 39797486,
   u'title': u'Hillary Clinton'}}}

The MediaWiki API has a redirects option that lets us ignore these placeholder redirect pages and will follow the redirect to take us to the intended page. Adding this option to the query but specifying the same Hillary Clinton value for the titles parameter that previously led to a redirect now returns all the data at the "Hillary Rodham Clinton" article. We'll make sure to use this redirects option in future queries.

In [21]:
outlink_query_hc_redirect = {'action': 'query',
                             'prop': 'links',
                             'titles': 'Hillary Clinton', # still "Hillary Clinton"
                             'pllimit': '500',
                             'plnamespace': '0',
                             'redirects': 'True'} # redirects parameter added

hcr_outlink_data = wikipedia_query(outlink_query_hc_redirect)
hcr_outlink_list = [link['title'] for link in hcr_outlink_data['pages'][u'5043192']['links']]
print "There are {0} links from the Hillary Clinton article".format(len(hcr_outlink_list))
There are 1419 links from the Hillary Clinton article
[u'Virginia Clinton Kelley',
 u'2008 Democratic National Convention',
 u'Scottish people',
 u'United Press International',
 u'Edmund Hillary',
 u'Madeleine Albright',
 u'Alan Keyes presidential campaign, 2008',
 u'Margaret Bourke-White',
 u'Anna Harrison',
 u'Jeremiah S. Black']

List of articles currently linking to an article

We are going to use the prop=linkshere query to identify a list of articles that currently link to the Hillary Rodham Clinton article. The parameters for this query are a bit different. We still use a namespace limitation so we are only getting pages in the article namespace by specifying lhnamespace=0 and we want to maximize the number of links per query that the API allows by specifying lhlimit=500. However, we don't want to include redirects that point to this article (e.g., "Hillary Clinton" points to "Hillary Rodham Clinton") by specifying lhshow=!redirect. Finally we only want the names of the articles, rather than less important information like "pageid" or "redirects", so we can limit this by specifying lhprop=title.

In [22]:
inlink_query_hrc = {'action': 'query',
                    'redirects': 'True',
                    'prop': 'linkshere',
                    'titles': 'Hillary Rodham Clinton',
                    'lhlimit': '500',
                    'lhnamespace': '0',
                    'lhshow': '!redirect',
                    'lhprop': 'title'}

hrc_inlink_data = wikipedia_query(inlink_query_hrc)

Again some data processing and cleanup is necessary to drill down into the dictionaries of dictionaries to extract the list of links from the data returned by the query. I use a similar list comprehension as above to get this list of links out. Again, I count the number of links in this list and give an example of 10 links.

In [24]:
hrc_inlink_list = [link['title'] for link in hrc_inlink_data['pages'][u'5043192']['linkshere']]
print "There are {0} links to the Hillary Rodham Clinton article".format(len(hrc_inlink_list))
There are 1835 links to the Hillary Rodham Clinton article
[u'Virginia Clinton Kelley',
 u'2011 attack on the British Embassy in Iran',
 u'James Watson (New York)',
 u'International security',
 u'Shannon County, Missouri',
 u'Margaret Bourke-White',
 u'2008 Democratic National Convention',
 u'Alice Paul',
 u'Foreign relations of Mexico',
 u'HRC: State Secrets and the Rebirth of Hillary Clinton']

Combining queries

In the previous two sections, we came up with two separate queries to get both the links from an article and the links to an article. However, much to the credit of the MediaWiki API engineers, you can combine both queries into one. We'll need all the same parameter information that we had included before (pllimit, lhlimit, etc.), but we can combine the queries together by combining prop=links and prop=linkshere with a pipe (like we did with user names in the very first query), prop=links|linkshere.

In [25]:
alllinks_query_hrc = {'action': 'query',
                      'redirects': 'True',
                      'prop': 'links|linkshere', #combined both prop calls with a pipe
                      'titles': 'Hillary Rodham Clinton',
                      'pllimit': '500', #still need the "prop=links" "pl" parameters and below
                      'plnamespace': '0',
                      'lhlimit': '500', #still need the "prop=linkshere" "lh" parameters and below
                      'lhnamespace': '0',
                      'lhshow': '!redirect',
                      'lhprop': 'title'}

hrc_alllink_data = wikipedia_query(alllinks_query_hrc)

Again, we need to do some data processing and cleanup to get the lists of links out. However, there are now two different sub-dictionaries within hrc_alllink_data object reflecting the output from the links and the linkshere calls.

In [26]:
hrc_alllink_outlist = [link['title'] for link in hrc_alllink_data['pages'][u'5043192']['links']]
hrc_alllink_inlist = [link['title'] for link in hrc_alllink_data['pages'][u'5043192']['linkshere']]
print "There are {0} out links from and {1} in links to the Hillary Rodham Clinton article".format(len(hrc_alllink_outlist),len(hrc_alllink_inlist))
There are 1419 out links from and 1835 in links to the Hillary Rodham Clinton article

We can also write a function get_article_links that takes an article name as an input and returns the lists containing all the in and out links for that article. We use the combined query described above, but replace Hillary's article title with a generic article variable, run the query, pull out the page_id, and then do the data processing and cleanup to produce a list of outlinks and a list of inlinks, both of which are passed back out of the function. Again, this query will only pull out the current links on the article, not historical links.

In [27]:
def get_article_links(article):
    query = {'action': 'query',
             'redirects': 'True',
             'prop': 'links|linkshere',
             'titles': article, # the article variable is passed into here
             'pllimit': '500',
             'plnamespace': '0',
             'lhlimit': '500',
             'lhnamespace': '0',
             'lhshow': '!redirect',
             'lhprop': 'title'}
    results = wikipedia_query(query) # do the query
    page_id = results['pages'].keys()[0] # get the page_id
    if 'links' in results['pages'][page_id].keys(): # sometimes there are no links
        outlist = [link['title'] for link in results['pages'][page_id]['links']] # clean up outlinks
    else:
        outlist = [] # return empty list if no outlinks
    if 'linkshere' in results['pages'][page_id].keys(): # sometimes there are no links
        inlist = [link['title'] for link in results['pages'][page_id]['linkshere']] # clean up inlinks
    else:
        inlist = [] # return empty list if no inlinks
    return outlist,inlist

We can test this on Bill Clinton's article, for example.

In [28]:
bc_out, bc_in = get_article_links("Bill Clinton")
print "There are {0} out links from and {1} in links to the Bill Clinton article".format(len(bc_out),len(bc_in))
There are 1249 out links from and 9460 in links to the Bill Clinton article

Let's put the data for both these queries into a dictionary called clinton_link_data so it's easier to access and save. We'll save this data to disk as a JSON file as well so we can access it in the future.

In [30]:
clinton_link_data = {"Hillary Rodham Clinton": {"In": hrc_alllink_inlist,
                                                "Out": hrc_alllink_outlist},
                     "Bill Clinton": {"In": bc_in,
                                      "Out": bc_out}}

with open('clinton_link_data.json','wb') as f:
    json.dump(clinton_link_data,f)

Make a network

Having collected data about the neighboring articles that are linked to or from one article, we can turn these data into a network. Using the NetworkX library (shortened to nx on import at the top), we will create a DiGraph object called hrc_g and then fill it with the connection data we just collected. We do this by iterating over the lists of links (hrc_alllink_outlist and hrc_alllink_inlist) and adding a directed edge between each neighbor and the original article. It's important to pay attention to edge direction as the out links should start at "Hillary Rodham Clinton" and end at the neighboring article whereas the in links should start at the neighboring article and end at "Hillary Rodham Clinton".

In [31]:
[u'Virginia Clinton Kelley',
 u'2008 Democratic National Convention',
 u'Scottish people',
 u'United Press International',
 u'Edmund Hillary']
In [32]:
hrc_g = nx.DiGraph()

for article in hrc_alllink_outlist:
    hrc_g.add_edge("Hillary Rodham Clinton",article)
for article in hrc_alllink_inlist:
    hrc_g.add_edge(article,"Hillary Rodham Clinton")

We can compute some basic statistics about the network, such as the number of nodes and edges. Note that the sum of the two link lists exceeds the number of nodes because some articles appear in both lists.

In [34]:
len(hrc_alllink_outlist) + len(hrc_alllink_inlist)
3254
In [122]:
print "There are {0} edges and {1} nodes in the network".format(hrc_g.number_of_edges(), hrc_g.number_of_nodes())
There are 3254 edges and 2647 nodes in the network

We might also ask how many of these hyperlink edges are reciprocated, or link in both directions. We start with an empty container reciprocal_edges that we'll fill with edges that are reciprocated. Next, we iterate through all the edges in the graph (hrc_g.edges() returns a list of all edges) and check two things. The first check is whether the graph contains an edge that goes in the opposite direction: given an edge (i,j), we check if there's also a (j,i). The second check makes sure we haven't already added this edge to the reciprocal_edges list. If both these conditions are true, then we add the edge to reciprocal_edges.

In [35]:
reciprocal_edges = list()
for (i,j) in hrc_g.edges():
    if hrc_g.has_edge(j,i) and (j,i) not in reciprocal_edges:
        reciprocal_edges.append((i,j))

reciprocation_fraction = round(float(len(reciprocal_edges))/hrc_g.number_of_edges(),3)
print "There are {0} reciprocated edges out of {1} edges in the network, giving a reciprocation fraction of {2}.".format(len(reciprocal_edges),hrc_g.number_of_edges(),reciprocation_fraction)
There are 608 reciprocated edges out of 3254 edges in the network, giving a reciprocation fraction of 0.187.

We can compare this to the network for Bill Clinton. There are many more edges in his network, but a much smaller fraction of these edges are reciprocated. This suggests that there are fewer articles expressing some similarity or relationship with Bill Clinton that his article also acknowledges by linking. This in turn invites questions about:

  • how the rate of reciprocity differs among biographies versus geographic entities
  • contemporary versus historical elites
  • high versus low quality articles
  • how these rates of reciprocation change over the evolution of the article's collaboration

With the queries we've covered above, you can begin to answer these open questions.

In [36]:
bc_g = nx.DiGraph()

for article in bc_out:
    bc_g.add_edge("Bill Clinton",article)
for article in bc_in:
    bc_g.add_edge(article,"Bill Clinton")

bc_reciprocal_edges = list()
for (i,j) in bc_g.edges():
    if bc_g.has_edge(j,i) and (j,i) not in bc_reciprocal_edges:
        bc_reciprocal_edges.append((i,j))

bc_reciprocation_fraction = round(float(len(bc_reciprocal_edges))/bc_g.number_of_edges(),3)
print "There are {0} reciprocated edges out of {1} edges in the network, giving a reciprocation fraction of {2}.".format(len(bc_reciprocal_edges),bc_g.number_of_edges(),bc_reciprocation_fraction)
There are 926 reciprocated edges out of 10709 edges in the network, giving a reciprocation fraction of 0.086.

This is a pretty basic "star"-shaped network that contains Hillary's article at the center, surrounded by all the articles linking to and from it. In particular, we could "snowball" out from the articles that link to and are linked from a given page, visit each of those articles, and create their local networks. We could continue to do this until we traverse the whole hyperlink network, but that would take a very long time, involve a lot of data, and would be an abusive use of the API (if you want the whole hyperlink network, you can download the data directly here by clicking a backup date and searching for "Wiki page-to-page link records").

We could also create the "1.5-step ego" hyperlink network around a given page that consists of the focal article, all the articles that link to or from it, and then whether these neighboring articles are linked to each other. This could provide a better picture of which neighboring articles link to which other articles.
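A sketch of that 1.5-step construction with toy link lists in place of live API calls (in practice, each neighbor's out-links would come from a function like get_article_links above):

```python
import networkx as nx

focal = 'Focal Article'
neighbors = ['B', 'C', 'D']               # articles linked to or from the focal article
toy_outlinks = {'B': [focal, 'C'],        # made-up out-link lists for each neighbor
                'C': ['D', 'Elsewhere'],
                'D': []}

g = nx.DiGraph()
for n in neighbors:                       # step 1: the star around the focal article
    g.add_edge(focal, n)
for n in neighbors:                       # the extra half-step: edges among the neighbors
    for target in toy_outlinks[n]:
        if target == focal or target in neighbors:  # ignore links leaving the ego network
            g.add_edge(n, target)

print(g.number_of_edges())  # 3 star edges plus 3 neighbor-to-neighbor edges
```

The link to 'Elsewhere' is dropped because it leaves the ego network; that filter is what keeps the 1.5-step network small.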

Unfortunately, even the scrape for the 2-step ego hyperlink network could take over an hour of data collection and generate hundreds of megabytes of data. Furthermore, Wikipedia articles also contain templates, which create lots of "redundant" links between articles that share templates even though these links don't appear in the body of the article itself. You'll need to do much more advanced text parsing of wiki-markup to actually get links in the body of an article, but that's beyond the scope of the present tutorial.

I don't recommend crawling more than the immediate (1-step) neighbors of Wikipedia articles.

Links from historical versions of an article

The queries above only looked at the links coming from the current version of the article. However, Wikipedia archives every version of the article, so we can rewind the tape all the way back to the first version of Hillary's article in 2001, a few months after Wikipedia was created. Specific versions of a Wikipedia article are identified with a revid, which is also called an oldid in some contexts. In subsequent sections, we'll go into more detail on how to get a list of all revisions to an article and find the oldest revision. For the time being, just trust me that revid "256189" is the oldest version of the Hillary Rodham Clinton article. Take a peek at what the article looked like back then below:

In [37]:
HTML('<iframe src= width=700 height=350></iframe>')

The MediaWiki API allows us to extract the out links from this old version of the article. Here we'll perform a different kind of action on the API than the query action we've used so far. The action=parse will extract information from a given version of an article, such as the links. We can specify that links should be parsed out with the prop=links parameter. Finally, we pass oldid=256189 so that this specific revision is parsed.

In [38]:
oldest_outlinks_query_hrc = {'action': 'parse', #query changes to parse
                             'prop': 'links',
                             'oldid': '256189'}

oldest_outlinks_data = wikipedia_query(oldest_outlinks_query_hrc)
URLError: <urlopen error [Errno 65] No route to host> trying request again in 5 seconds
{u'links': [{u'*': u'Baby Boom', u'exists': u'', u'ns': 0},
  {u'*': u'Bill Clinton', u'exists': u'', u'ns': 0},
  {u'*': u'First Lady', u'exists': u'', u'ns': 0},
  {u'*': u'New York', u'exists': u'', u'ns': 0},
  {u'*': u'October 26', u'exists': u'', u'ns': 0},
  {u'*': u'Senators Of The United States', u'exists': u'', u'ns': 0},
  {u'*': u'United States/President', u'exists': u'', u'ns': 0},
  {u'*': u'United States Senate', u'exists': u'', u'ns': 0},
  {u'*': u'Watergate', u'exists': u'', u'ns': 0}],
 u'revid': 256189,
 u'title': u'Hillary Rodham Clinton'}

Again, data processing and cleanup using a list comprehension is necessary to get a list of links from this result.

In [86]:
oldest_outlink_list = [link['*'] for link in oldest_outlinks_data['links']]
print "There are {0} out links from the Hillary Rodham Clinton article".format(len(oldest_outlink_list))
There are 9 out links from the Hillary Rodham Clinton article
[u'Baby Boom',
 u'Bill Clinton',
 u'First Lady',
 u'New York',
 u'October 26',
 u'Senators Of The United States',
 u'United States/President',
 u'United States Senate',
 u'Watergate']

So now we can also extract links from historical versions of the article. However, it's much more difficult to get the history of what links in to an article (e.g., linkshere), as this would require looking at the history of every other article to check whether it ever linked to the article in question. This is not impossible, just very time-consuming.


In this section we learned to write and combine queries that get the links to and from the current version of an article, clean the output of these queries into lists of links, use these lists to make a network object, and perform some preliminary analysis of an article's ego network. There are some limitations on the specificity of the links the API passes back, which limits our ability to generate more complex networks using this query. We also showed that it's possible to get the out links from a historical version of an article using a new kind of API action called parse. Using the out links from all the changes to an article could let us look at the evolution of what the article linked to over time. We'll go into how to get all the changes to an article in the next section.

Get all the revisions made to an article

The previous section showed how to make a basic network from the current hyperlinks to and from a Wikipedia article. It also alluded to the fact that Wikipedia captures the history of every change made to the article since it was created as well as who made these changes and when (among other meta-data). In this section, we'll explore some queries around how to extract the "revision history" of an article from the API. We'll do some exploratory analysis using these data to understand patterns in the distribution of editors' activity, changes in content, and the persistence of revisions. Additionally, we'll construct a co-authorship network of what editors made a change to the article.

Starting with a basic query, we'll get every change that's been made to the "Hillary Rodham Clinton" article. We'll use action=query and prop=revisions to get the list of changes to an article (see detailed documentation here). There are many options to specify here. We pass several options to rvprop to get the revision IDs, timestamp, user, user ID, revision comment, and the size of the article; 500 to rvlimit to get revisions in batches of up to 500 per request (our wikipedia_query function follows the query continuations until all revisions are returned); and "newer" to rvdir so the revisions come back in chronological order (oldest to newest). There are many other options that can be specified, such as rvprop=content to get the content of each revision, rvstart and rvend to get revisions within a specific timeframe, or rvexcludeuser to omit changes from a particular user (a bot, for example).

In [135]:
revisions_query_hrc = {'action': 'query',
                      'redirects': 'True',
                      'prop': 'revisions',
                      'titles': "Hillary Rodham Clinton",
                      'rvprop': 'ids|user|timestamp|userid|comment|size',
                      'rvlimit': '500',
                      'rvdir': 'newer'}

revisions_data_hrc = wikipedia_query(revisions_query_hrc)
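For instance, the rvstart/rvend options mentioned above could restrict the same query to a single year. The dictionary below is illustrative only (the timestamp values are hypothetical examples in the API's ISO 8601 format); we don't run it here.

```python
# Hypothetical variant of the revisions query restricted to 2008 via
# rvstart/rvend; with rvdir 'newer', rvstart is the earlier timestamp.
windowed_query_hrc = {'action': 'query',
                      'prop': 'revisions',
                      'titles': 'Hillary Rodham Clinton',
                      'rvprop': 'ids|user|timestamp|userid|comment|size',
                      'rvlimit': '500',
                      'rvdir': 'newer',
                      'rvstart': '2008-01-01T00:00:00Z',
                      'rvend': '2008-12-31T23:59:59Z'}
print('rvstart' in windowed_query_hrc and 'rvend' in windowed_query_hrc)   # True
```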

There's a lot of data in there and you can already expect that we'll need to do some data processing and cleaning to get it into a more usable form.

  • As before, the query returns the list of revisions buried deep within a dictionary, so we extract that out and, like we did with the user information in the very first section, convert this list of revisions to a pandas DataFrame object that we'll call hrc_rv_df.
  • The timestamp column inside of this new DataFrame still contains strings rather than meaningful dates that we can sort on, so we need to convert them using the to_datetime function, passing a strftime format string so that pandas knows which character sequences correspond to the year, month, day, hour, minute, and second values.
  • The anon column has a strange mixture of NaN and empty strings corresponding to whether the revision was made by a registered account or not. The replace method swaps the NaNs out for False and the empty strings for True booleans to make this more interpretable.
  • We sort the DataFrame on these newly-meaningful timestamp values, reset the index (row numbers) so they correspond to the revision count, and label this index as "revision".
  • Finally, we save the data to disk as a CSV file named hrc_revisions.csv making sure that we encode non-ASCII characters in "utf8". You'll want to make a habit out of doing this.
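As a quick check of that strftime pattern, we can parse the first revision's timestamp string (taken from the output table below) into a real datetime value:

```python
# Parse one ISO-style timestamp string from the revision history using the
# same strftime pattern as the cleanup code; the literal 'T' and 'Z' are
# matched as fixed characters.
import pandas as pd

ts = pd.to_datetime('2001-08-01T20:21:17Z', format="%Y-%m-%dT%H:%M:%SZ")
print(ts.year)   # 2001
```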
In [271]:
# Extract and convert to DataFrame
hrc_rv_df = pd.DataFrame(revisions_data_hrc['pages']['5043192']['revisions']) 

# Make it clear what's being edited
hrc_rv_df['page'] = [u'Hillary Rodham Clinton']*len(hrc_rv_df)

# Clean up timestamps
hrc_rv_df['timestamp'] = pd.to_datetime(hrc_rv_df['timestamp'],format="%Y-%m-%dT%H:%M:%SZ",unit='s')

# Clean up anon column
hrc_rv_df = hrc_rv_df.replace({'anon':{np.nan:False,u'':True}})

# Sort the data on timestamp and reset the index
hrc_rv_df = hrc_rv_df.sort('timestamp').reset_index(drop=True)
hrc_rv_df.index.name = 'revision'
hrc_rv_df = hrc_rv_df.reset_index()

# Set the index to a MultiIndex of (page, revision)
hrc_rv_df = hrc_rv_df.set_index(['page','revision'])

# Save the data to disk
hrc_rv_df.to_csv('hrc_revisions.csv',encoding='utf8')

# Show the first 5 rows
hrc_rv_df.head()
anon comment parentid revid size timestamp user userid
page revision
Hillary Rodham Clinton 0 False * 0 256189 380 2001-08-01 20:21:17 Koyaanis Qatsi 90
1 True *added a bit on Hillary Clinton 256189 256190 697 2001-12-07 01:58:07 0
2 False Took out the slander 256190 256191 663 2001-12-07 02:14:20 Paul Drye 6
3 True Automated conversion 256191 72270 877 2002-02-25 15:51:15 Conversion script 0
4 True * 72270 72271 920 2002-05-18 16:37:57 0

User activity

We might be interested in looking at the most active editors over the history of the article. We can perform a groupby operation that effectively creates a mini-DataFrame for each user's revisions. We use the aggregate function to collect information (len gets us the number of revisions they made) across all these mini-DataFrames, which returns a Series object indexed by username with the number of their revisions. Sorting these revisions in descending order and looking at the top users, we see variation over nearly two orders of magnitude.

In [272]:
hrc_rv_gb_user = hrc_rv_df.groupby('user')
hrc_user_revisions = hrc_rv_gb_user['revid'].aggregate(len).sort(ascending=False,inplace=False)
print "There are {0} unique users who have made a contribution to the article.".format(len(hrc_user_revisions))
There are 3567 unique users who have made a contribution to the article.
Wasted Time R      2189
LukeTH              656
Tvoz                296
K157                137
Mark Miller          81
Anythingyouwant      78
StuffOfInterest      74
Ohnoitsjamie         56
Gamaliel             52
Kelw                 48
Name: revid, dtype: int64

Given the wide variation among the number of contributions from users, we can create a kind of "histogram" that plots how many users made how many revisions. Because there is so much variation in the data, we use logged axes. In the upper left, there are several thousand editors who made only a single contribution. In the lower right are the few editors, listed above, who made hundreds or thousands of revisions to this article.

In [273]:
revisions_counter = Counter(hrc_user_revisions.values)
# Plot how many users made how many revisions on log-log axes
plt.scatter(revisions_counter.keys(),revisions_counter.values())
plt.xscale('log')
plt.yscale('log')
plt.ylabel('Number of users',fontsize=15)
plt.xlabel('Number of revisions',fontsize=15)
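The Counter step above can be illustrated on toy data: it maps each observed revision count to the number of users with that count.

```python
# Toy illustration of the histogram step: how many users made each number
# of revisions. The per-user counts here are hypothetical.
from collections import Counter

per_user_revisions = [1, 1, 1, 2, 5]
histogram = Counter(per_user_revisions)
print(histogram[1])   # 3 (three users made exactly one revision)
```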

We'll add some information to the DataFrame about the cumulative number of unique users who've ever edited the article. This should give us a sense of how the size of the collaboration changed over time. We start with an empty list unique_users, to which we add each user's name the first time they make an edit, and a list unique_count that records the number of unique users seen at each point in time. We add the unique_count list to the DataFrame under the unique_users column.

In [307]:
def count_unique_users(user_series):
    unique_users = []
    unique_count = []
    for user in user_series.values:
        if user not in unique_users:
            unique_users.append(user)
        unique_count.append(len(unique_users))
    return unique_count
hrc_rv_df['unique_users'] = count_unique_users(hrc_rv_df['user'])
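A self-contained sketch of that running unique-editor count, applied to a toy sequence of usernames (a pandas Series to mirror the notebook):

```python
# Running count of unique editors: the count only increments the first time
# a name appears. The usernames are hypothetical.
import pandas as pd

def running_unique_count(user_series):
    seen = []
    counts = []
    for user in user_series.values:
        if user not in seen:
            seen.append(user)
        counts.append(len(seen))
    return counts

toy = pd.Series(['Alice', 'Bob', 'Alice', 'Carol'])
print(running_unique_count(toy))   # [1, 2, 2, 3]
```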

We can look at changes to the contribution patterns on the article over time. First we need to do some data processing to convert the timestamps into generic dates. Then we group the activity by date together and use aggregate to create a new DataFrame called activity_by_day that contains the number of unique users and number of revisions made on each day. Finally, plot the distribution of this activity over time.

Looking at the blue line for the number of unique users, we see the collaboration is initially small through 2004, but then between 2005 and 2008 undergoes rapid growth from a few hundred editors to over 3,000 editors. After 2008, however, the number of new users grows much more slowly and steadily. This is somewhat surprising as this timeframe includes a number of historic events like Hillary's campaign for president in 2008 as well as her tenure as Secretary of State.

Looking at the green line for the number of revisions made per day, there is a lot of variation in daily editing activity, but much of it seems to again occur between 2005 and 2009, and slows down substantially thereafter. Peaks might correspond to major news events (like nominations) or to edit wars (editors fighting over content).

In [275]:
hrc_rv_df['date'] = hrc_rv_df['timestamp'].apply(lambda x: x.date())
activity_by_day = hrc_rv_df.groupby('date').aggregate({'unique_users':max,'revid':len})
ax = activity_by_day.plot(lw=1,secondary_y=['revid'])
<matplotlib.text.Text at 0x2c58f5f8>

Changes in article size

We can also look at the distribution in changes to the article's size. In other words, how much content (in bytes) was introduced or removed from the article by an editor's changes? We see there is a very wide (axes are still on log scales) and mostly symmetrical distribution in additions and removals of content. In other words, the most frequent changes are extremely minor (-1 to 1 bytes) and the biggest changes (dozens of kilobytes) are very rare --- and likely the result of vandalism and reversion of vandalism. Nevertheless it's the case that this Wikipedia article's history is as much about the removal of content as it is about the addition of content.

In [276]:
hrc_rv_df['diff'] = hrc_rv_df['size'].diff()
diff_counter = Counter(hrc_rv_df['diff'].values)
# Plot how many revisions made a change of each size; symlog handles negative diffs
plt.scatter(diff_counter.keys(),diff_counter.values())
plt.xscale('symlog')
plt.yscale('log')
plt.xlabel('Difference (bytes)',fontsize=15)
plt.ylabel('Number of revisions',fontsize=15)
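The diff step can be seen in miniature on the first few article sizes from the revision table above (380, 697, 663, 877 bytes): each entry is the byte change relative to the previous revision.

```python
# Series.diff on the first few article sizes from the revision history above;
# the first entry is NaN because there is no prior revision to compare to.
import pandas as pd

sizes = pd.Series([380, 697, 663, 877])
print(sizes.diff().tolist()[1:])   # [317.0, -34.0, 214.0]
```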

Re-compute the activity_by_day DataFrame to include the diff variable computed above, using the np.median method to get the median change in the article on a given day. Substantively, this means that we can track how much content was added or removed on each day. This is noisy, so we can smooth it using rolling_mean with a 60-day window. There's a general tendency for the article to grow on any given day, but there are a few time periods when the article shrinks drastically, likely reflecting sections of the article being split out into sub-articles.

In [277]:
activity_by_day = hrc_rv_df.groupby('date').aggregate({'unique_users':max,
                                                       'revid':len,
                                                       'diff':np.median})

# Compute a 60-day rolling average to remove spikiness, plot
pd.rolling_mean(activity_by_day['diff'],60).plot(lw=2)
plt.ylabel('Difference (bytes)',fontsize=15)
plt.axhline(0,color='k',ls='--')
<matplotlib.lines.Line2D at 0x2b78fd68>

Distribution of edit latencies

We can also explore how long an edit persists on the article before another edit is subsequently made. The average edit persists for ~34,500 seconds (~9.5 hours), but the median edit persists for only 881 seconds (~15 minutes).
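The unit conversions behind those figures are simple arithmetic on the statistics quoted above:

```python
# Convert the mean and median persistence (seconds) into friendlier units;
# the values come from the describe() summary of the latency column.
mean_latency_s = 34469.45
median_latency_s = 881.0
print(round(mean_latency_s / 3600, 1))   # 9.6 hours
print(round(median_latency_s / 60, 1))   # 14.7 minutes
```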

In [278]:
# The diff returns timedeltas, but dividing by a 1-second timedelta returns a float
# Round these numbers off to smooth out the distribution and add 1 second to everything to make the plot behave
hrc_rv_df['latency'] = [round(i/np.timedelta64(1,'s'),-1) + 1 for i in hrc_rv_df['timestamp'].diff().values]
diff_counter = Counter(hrc_rv_df['latency'].values)
# Plot how many changes persisted for each latency on log-log axes
plt.scatter(diff_counter.keys(),diff_counter.values())
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Latency time (seconds)',fontsize=15)
plt.ylabel('Number of changes',fontsize=15)
In [279]:
hrc_rv_df['latency'].describe()
count       12045.000000
mean        34469.446658
std        216675.835396
min             1.000000
25%           141.000000
50%           881.000000
75%         13591.000000
max      10993011.000000
dtype: float64

As we did above, we can recompute activity_by_day to include daily median changes in the latency between edits. There is substantial variation in how long edits persist. Again, the pre-2006 era is marked by content that goes days or weeks without changes, but between 2006 and 2009 the time between edits becomes much shorter, presumably corresponding with the attention around her presidential campaign. After 2008, the time between changes increases again and stabilizes at its (smoothed) current value of around 2 days between edits.

In [280]:
activity_by_day = hrc_rv_df.groupby('date').aggregate({'unique_users':max,
                                                       'revid':len,
                                                       'diff':np.median,
                                                       'latency':np.median})

# Compute a 60-day rolling average to remove spikiness, plot
pd.rolling_mean(activity_by_day['latency'],60).plot(lw=2,logy=True)
plt.ylabel('Latency time (seconds)',fontsize=15)
<matplotlib.text.Text at 0x2a1032e8>

Co-authorship network

We previously created a directed network of hyperlinks where the nodes were all articles and the edges indicated the direction of the link(s) between the central article and its neighbors. In this section, we're going to construct a different kind of network that contains a mixture of editors and articles, where the edges indicate whether an editor contributed to an article. For simplicity's sake, we're going to start with the 1-step ego co-authorship network of the "Hillary Rodham Clinton" article and the set of editors who have ever made changes to it. Because there are two types of nodes in this network (articles and editors) and editors can't edit editors and articles can't edit articles, we call this network a "bipartite network" (also known as an "affiliation" or "two-mode" network).

Even though bipartite networks are traditionally undirected, we're going to use a directed network because NetworkX does some wacky things when using an undirected network with bipartite properties. We're also going to make this a weighted network where the edges have values that correspond to the number of times an editor made a change to the article. This basically replicates the analysis we did above in "User Activity" but is an example of the information from the revision history that we might want to include in the network representation.

We go over every user in the user column inside hrc_rv_df and first check whether or not a (user,"Hillary Rodham Clinton") edge exists. If one already exists, then we increment its weight attribute by 1. Otherwise if there is no such edge in the network, we add a (user,"Hillary Rodham Clinton") edge with a weight of 1. We can inspect five of the edges to make sure this worked.

In [289]:
hrc_bg = nx.DiGraph()

for user in hrc_rv_df['user'].values:
    if hrc_bg.has_edge(user,u'Hillary Rodham Clinton'):
        hrc_bg[user][u'Hillary Rodham Clinton']['weight'] += 1
    else:
        hrc_bg.add_edge(user,u'Hillary Rodham Clinton',weight=1)

print "There are {0} nodes and {1} edges in the network.".format(hrc_bg.number_of_nodes(),hrc_bg.number_of_edges())

There are 3568 nodes and 3567 edges in the network.
[(u'Mansfieldkelly', u'Hillary Rodham Clinton', {'weight': 7}),
 (u'', u'Hillary Rodham Clinton', {'weight': 2}),
 (u'Ottava Rima', u'Hillary Rodham Clinton', {'weight': 2}),
 (u'JudithSouth', u'Hillary Rodham Clinton', {'weight': 1}),
 (u'Haroldandkumar', u'Hillary Rodham Clinton', {'weight': 2})]
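The weight-increment pattern above can be checked on a toy edit log: 'A' edits three times and 'B' once, yielding two weighted edges into a hypothetical article node.

```python
# Toy version of the weighted bipartite edge construction; usernames and the
# 'Article' node are hypothetical.
import networkx as nx

toy_g = nx.DiGraph()
for user in ['A', 'B', 'A', 'A']:
    if toy_g.has_edge(user, 'Article'):
        toy_g[user]['Article']['weight'] += 1
    else:
        toy_g.add_edge(user, 'Article', weight=1)

print(toy_g['A']['Article']['weight'])   # 3
print(toy_g.number_of_edges())           # 2
```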

Co-authorship network of the hyperlink neighborhood

Based on everything we did in the previous analysis to query the revisions, reshape and clean up the data, and extract new features for analysis, we are now going to write a big function that does all of this automatically. The function get_revision_df will accept an article name, perform the query, and proceed to do many of the steps outlined above, and returns a cleaned DataFrame at the end.

In [39]:
def get_revision_df(article):
    revisions_query = {'action': 'query',
                      'redirects': 'True',
                      'prop': 'revisions',
                      'titles': article,
                      'rvprop': 'ids|user|timestamp|userid|comment|size',
                      'rvlimit': '500',
                      'rvdir': 'newer'}

    revisions_data = wikipedia_query(revisions_query)
    page_id = revisions_data['pages'].keys()[0]

    # Extract and convert to DataFrame. Try/except for links to pages that don't exist
    try:
        df = pd.DataFrame(revisions_data['pages'][page_id]['revisions'])
    except KeyError:
        print u"{0} doesn't exist!".format(article)
        # Return an empty DataFrame so a scrape loop can continue past missing pages
        return pd.DataFrame()

    # Make it clear what's being edited
    df['page'] = [article]*len(df)

    # Clean up timestamps
    df['timestamp'] = pd.to_datetime(df['timestamp'],format="%Y-%m-%dT%H:%M:%SZ",unit='s')

    # Clean up anon column. If/else for articles that have all non-anon editors
    if 'anon' in df.columns:
        df = df.replace({'anon':{np.nan:False,u'':True}})
    else:
        df['anon'] = [False] * len(df)

    # Sort the data on timestamp and reset the index
    df = df.sort('timestamp').reset_index(drop=True)
    df.index.name = 'revision'
    df = df.reset_index()

    # Set the index to a MultiIndex of (page, revision)
    df = df.set_index(['page','revision'])

    # Compute additional features
    df['date'] = df['timestamp'].apply(lambda x: x.date())
    df['diff'] = df['size'].diff()
    df['unique_users'] = count_unique_users(df['user'])
    df['latency'] = [round(i/np.timedelta64(1,'s'),-1) + 1 for i in df['timestamp'].diff().values]
    # Don't return random other columns
    df = df[[u'anon',u'comment',u'parentid',u'revid',u'size',
             u'timestamp',u'user',u'userid',u'unique_users',
             u'date', u'diff', u'latency']]
    return df

Try this out on "Bill Clinton".

In [79]:
bc_rv_df = get_revision_df("Bill Clinton")
Server lag, sleeping for 6 seconds
NameError                                 Traceback (most recent call last)
<ipython-input-79-39fe8fbb441e> in <module>()
----> 1 bc_rv_df = get_revision_df("Bill Clinton")
      2 bc_rv_df.head()

<ipython-input-39-f79189062d9b> in get_revision_df(article)
     41     df['date'] = df['timestamp'].apply(lambda
     42     df['diff'] = df['size'].diff()
---> 43     df['unique_users'] = count_unique_users(df['user'])
     44     df['latency'] = [round(i/np.timedelta64(1,'s'),-1) + 1 for i in df['timestamp'].diff().values]

NameError: global name 'count_unique_users' is not defined

We've created a DataFrame for both Hillary's revision history (hrc_rv_df) as well as Bill's revision history (bc_rv_df). We can now combine both of these together (cross your fingers!!!) using the concat method. We can check to make sure that they both made it into the DataFrame by checking the first level of the index, and we see they're both there. We also save all the data we've scraped and cleaned to disk --- the resulting file takes up just under 5 MB.

In [321]:
clinton_df = pd.concat([bc_rv_df,hrc_rv_df])

print clinton_df.index.levels[0]
print "There are a total of {0} revisions across both the Hillary and Bill Clinton articles.".format(len(clinton_df))

Index([u'Bill Clinton', u'Hillary Rodham Clinton'], dtype='object')
There are a total of 26952 revisions across both the Hillary and Bill Clinton articles.

Clinton co-authorship network

We are going to use these data to create a coauthorship network of all the editors who contributed to both these articles. If we've already crawled this data, we can just load it from disk, specifying options to make sure we have the right encoding, the columns are properly indexed, and the dates are parsed.

In [41]:
clinton_df = pd.read_csv('clinton_revisions.csv',encoding='utf8',
                         index_col=['page','revision'],
                         parse_dates=['timestamp','date'])
clinton_df.head()
anon comment parentid revid size timestamp user userid unique_users date diff latency
page revision
Bill Clinton 0 True builing -> building* 330742655 331410539 6851 2001-10-26 16:25:00 0 1 2001-10-26 NaN NaN
1 True * 331410539 238014 7006 2001-11-17 22:51:50 Wmorrow 0 2 2001-11-17 155 1924011
2 True * 238014 238015 7046 2001-12-08 22:14:42 0 3 2001-12-08 40 1812171
3 True brady bill, partial birth abortion veto, DOMA ... 238015 238016 7317 2001-12-09 03:01:09 Alan_D 0 4 2001-12-09 271 17191
4 True DADT was implemented in 1993 238016 238017 7323 2001-12-09 08:15:36 Dmerrill 0 5 2001-12-09 6 18871

We want to create an "edgelist" that contains all the (editor, article) pairs of who contributed to which articles. This could be done by looping over the list, but that is inefficient on larger datasets like the one we crawled. Instead, we'll use a groupby approach to compute not only the number of times an editor contributed to an article (the weight we defined previously), but a whole host of other potentially interesting attributes.

We use the agg method on the data that's been grouped by page and user to aggregate the information into summary statistics. We count the number of revisions using len and relabel this variable weight. For the timestamp, diff, latency, and revision variables, we compute summary statistics for the minimum, median, and maximum values. This operation returns a new DataFrame, indexed by (page, user), with columns corresponding to labels like weight, ts_min, etc. Each row in this DataFrame will become attributes in the graph object we make below. The operation creates a hierarchical two-level column index, so we drop the redundant 0-level to leave a concise set of column labels.
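That droplevel cleanup can be seen on a toy DataFrame with a two-level column index like the one the agg produces:

```python
# Dropping the redundant top level of a hierarchical column index; the
# column pairs here are illustrative stand-ins for the agg output.
import pandas as pd

cols = pd.MultiIndex.from_tuples([('revid', 'weight'), ('timestamp', 'ts_min')])
toy = pd.DataFrame([[3, 150.0]], columns=cols)
toy.columns = toy.columns.droplevel(0)
print(list(toy.columns))   # ['weight', 'ts_min']
```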

We're going to do something different with the timestamp data because these data are stored as Timestamp objects that don't always play nicely with other functions. Instead, we're going to convert these data to counts of the amount of time (in days) since January 16, 2001, the day after Wikipedia was founded. In effect, we're counting how "old" Wikipedia was when an action occurred, and this float count will work better in subsequent steps.
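The conversion works because subtracting two Timestamps gives a timedelta, and dividing that by a one-day timedelta64 yields a plain float count of days:

```python
# Timestamp arithmetic in miniature: two days after the epoch date used in
# the notebook divides out to exactly 2.0 days.
import numpy as np
import pandas as pd

age = (pd.Timestamp('2001-1-18') - pd.Timestamp('2001-1-16')) / np.timedelta64(1, 'D')
print(age)   # 2.0
```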

In [43]:
clinton_gb_edge = clinton_df.reset_index().groupby(['page','user'])
clinton_edgelist = clinton_gb_edge.agg({'revid':{'weight':len},
                                        'timestamp':{'ts_min':min,'ts_max':max},
                                        'revision':{'revision_min':min,'revision_max':max},
                                        'latency':{'latency_min':min,'latency_median':np.median,'latency_max':max},
                                        'diff':{'diff_min':min,'diff_median':np.median,'diff_max':max}})

# Drop the legacy/redundant column names
clinton_edgelist.columns = clinton_edgelist.columns.droplevel(0)

# Convert the ts_min and ts_max to floats for the number of days since Wikipedia was founded
clinton_edgelist['ts_min'] = (clinton_edgelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_edgelist['ts_max'] = (clinton_edgelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')

ts_min ts_max revision_min revision_max weight latency_min latency_median latency_max diff_median diff_max diff_min
page user
Bill Clinton 00666 2083.873866 2083.941644 9970 9972 3 1551 3751 4311 0.0 5072 -72
1.21 jigwatts 2336.214977 2903.078646 11337 12762 2 211 10951 21691 47380.0 94780 -20
100110100 2098.175683 2098.207037 10096 10101 6 31 656 42181 -0.5 23 -33 3483.056007 3483.056007 13668 13668 1 41581 41581 41581 4.0 4 4
10shistory 1861.760359 1864.803692 5663 5676 2 461 29186 57911 56.0 124 -12

The nodes in this bipartite network also have attributes we can extract from the data. Remember, because this is a bipartite network, we'll need to generate attribute data for both the users and the pages. We can perform an analogous groupby operation as we used above, but simply group on either the user or the page values. After each of these groupby operations, we can perform similar agg operations to aggregate the data into summary statistics. In the case of the user, these summary statistics are across all articles in the data. Thus the clinton_usernodelist summarizes how many total edits a user made, their first and last observed edits, and the distribution of their diff, latency, and revision statistics. The clinton_pagenodelist summarizes how many total edits were made to the page, the date of the first and last edit, and so on.

In [44]:
# Create the usernodelist by grouping on user and aggregating
clinton_gb_user = clinton_df.reset_index().groupby(['user'])
clinton_usernodelist = clinton_gb_user.agg({'revid':{'revisions':len},
                                            'timestamp':{'ts_min':min,'ts_max':max},
                                            'revision':{'revision_min':min,'revision_median':np.median,'revision_max':max},
                                            'latency':{'latency_min':min,'latency_median':np.median,'latency_max':max},
                                            'diff':{'diff_min':min,'diff_median':np.median,'diff_max':max}})

# Clean up the columns and convert the timestamps to counts
clinton_usernodelist.columns = clinton_usernodelist.columns.droplevel(0)
clinton_usernodelist['ts_min'] = (clinton_usernodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_usernodelist['ts_max'] = (clinton_usernodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')

# Create the pagenodelist by grouping on page and aggregating
clinton_gb_page = clinton_df.reset_index().groupby(['page'])
clinton_pagenodelist = clinton_gb_page.agg({'revid':{'revisions':len},
                                            'timestamp':{'ts_min':min,'ts_max':max},
                                            'revision':{'revision_min':min,'revision_median':np.median,'revision_max':max},
                                            'latency':{'latency_min':min,'latency_median':np.median,'latency_max':max},
                                            'diff':{'diff_min':min,'diff_median':np.median,'diff_max':max}})

# Clean up the columns and convert the timestamps to counts
clinton_pagenodelist.columns = clinton_pagenodelist.columns.droplevel(0)
clinton_pagenodelist['ts_min'] = (clinton_pagenodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_pagenodelist['ts_max'] = (clinton_pagenodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')

ts_min ts_max revision_min revision_max revision_median revisions latency_min latency_median latency_max diff_median diff_max diff_min
Bill Clinton 283.684028 5008.568900 0 14905 7452.5 14906 1 891 6267951 1 417533 -417533
Hillary Rodham Clinton 197.848113 5003.015405 0 12045 6022.5 12046 1 881 10993011 3 1878770 -1878770

Now that we've created all this rich contextual data about edges, pages, and editors, we can load it all into a NetworkX DiGraph object called clinton_g. We start by looping over the index in the clinton_edgelist dataframe that corresponds to the edges in the network, convert the edge attributes to a dictionary for NetworkX to better digest, and then add this edge and all its data to the clinton_g graph object. This creates placeholder nodes, but we want to add the rich node data we created above as well. We can loop over the clinton_usernodelist, convert the node attributes to a dictionary, and then overwrite the placeholder nodes by adding the data-rich user nodes to the clinton_g graph object. We do the same for the clinton_pagenodelist, then check the number of nodes and edges in the network, and finally print out a few examples of the data-rich nodes and edges.

In [45]:
clinton_g = nx.DiGraph()
# Add the edges and edge attributes
for (article,editor) in iter(clinton_edgelist.index.values):
    edge_attributes = dict(clinton_edgelist.ix[(article,editor)])
    clinton_g.add_edge(editor,article,attr_dict=edge_attributes)

# Add the user nodes and attributes
for node in iter(clinton_usernodelist.index):
    node_attributes = dict(clinton_usernodelist.ix[node])
    clinton_g.add_node(node,attr_dict=node_attributes)

# Add the page nodes and attributes
for node in iter(clinton_pagenodelist.index):
    node_attributes = dict(clinton_pagenodelist.ix[node])
    clinton_g.add_node(node,attr_dict=node_attributes)
print "There are {0} nodes and {1} edges in the network.".format(clinton_g.number_of_nodes(),clinton_g.number_of_edges())

There are 8539 nodes and 9270 edges in the network.
  u'Hillary Rodham Clinton',
  {'diff_max': 9285.0,
   'diff_median': 218.0,
   'diff_min': 29.0,
   'latency_max': 106201.0,
   'latency_median': 121.0,
   'latency_min': 21.0,
   'revision_max': 6392.0,
   'revision_min': 6381.0,
   'ts_max': 2307.2359837962963,
   'ts_min': 2307.2230671296297,
   'weight': 7.0}),
  u'Bill Clinton',
  {'diff_max': 199.0,
   'diff_median': 3.0,
   'diff_min': 0.0,
   'latency_max': 74931.0,
   'latency_median': 641.0,
   'latency_min': 371.0,
   'revision_max': 3008.0,
   'revision_min': 2989.0,
   'ts_max': 1706.7711458333333,
   'ts_min': 1705.1043865740742,
   'weight': 3.0}),
  u'Bill Clinton',
  {'diff_max': -57410.0,
   'diff_median': -57410.0,
   'diff_min': -57410.0,
   'latency_max': 49391.0,
   'latency_median': 49391.0,
   'latency_min': 49391.0,
   'revision_max': 1879.0,
   'revision_min': 1879.0,
   'ts_max': 1566.8687615740741,
   'ts_min': 1566.8687615740741,
   'weight': 1.0})]

Now it's time to do a really audacious data scrape. We're going to get the revision histories for all 2,646 articles linked to and from Hillary's article. The data will be stored in the dataframe_dict dictionary that will be keyed by article title and the values will be the dataframes themselves. We'll use a for loop to go over every article in the all_links and call the get_revision_df function we defined and tested above to get the cleaned revision DataFrame and store it in the dataframe_dict object. Because this scrape may take a while, we're going to put in some exception handling (try, except) so that if an error occurs, we don't lose all our progress. When an exception occurs, we'll add the article name to the errors list so we can go back and check what happened.

We'll concatenate all these DataFrames together into a gigantic DataFrame containing all the data we've scraped and then save it. This is a 485 MB file!

This will take a long time and a lot of memory!!! To prevent you from accidentally executing this, the block below is in a "raw" format that you'll need to convert to "Code" from the dropdown above.

# List of DataFrames
dataframe_dict = {u'Bill Clinton': bc_rv_df, u'Hillary Rodham Clinton': hrc_rv_df}

# Set operations to combine the in and out links
all_links = list(set(hrc_alllink_outlist) | set(hrc_alllink_inlist))

# Start the scrape
errors = list()
for article in all_links:
    try:
        df = get_revision_df(article)
        dataframe_dict[article] = df
    except:
        errors.append(article)
        pass

# Combine and save everything
gigantic_df = pd.concat(dataframe_dict.values())
gigantic_df.to_csv('gigantic_df.csv',encoding='utf8')

And there are nearly 3 million revisions in the dataset!

In [338]:

Make a coauthorship network

The analysis can start again here by loading the CSV file rather than having to re-scrape the data from above. Loading the file to gigantic_df, there are a few rows that seem to be broken, so we'll use drop to remove them. We also use to_datetime to make sure the timestamp information is using the appropriate units.

In [46]:
gigantic_df = pd.read_csv('gigantic_df.csv',encoding='utf8',
                          index_col=['page','revision'])

gigantic_df = gigantic_df.drop(("[[History of the United States]] at [[History of the United States#British colonization|British Colonization]]. ([[WP:TW|TW]])",589285361))
gigantic_df = gigantic_df.drop(("United States",32868))

gigantic_df['timestamp'] = pd.to_datetime(gigantic_df['timestamp'],unit='s')
gigantic_df['date'] = pd.to_datetime(gigantic_df['date'],unit='d')

/Users/brianckeegan/anaconda/lib/python2.7/site-packages/pandas/io/ DtypeWarning: Columns (2,4,5) have mixed types. Specify dtype option on import or set low_memory=False.
  data =
anon comment parentid revid size timestamp user userid unique_users date diff latency
page revision
Yucaipa Companies 0 False create org-stub 0 38555542 799 2006-02-07 02:16:04 Rj 43158 1 2006-02-07 NaN NaN
1 False hq 38555542 38557692 878 2006-02-07 02:32:44 Rj 43158 1 2006-02-07 79 1001
2 False [[GameSpy]] 38557692 38565610 920 2006-02-07 03:42:00 Rj 43158 1 2006-02-07 42 4161
3 False Disambiguate [[Franchise]] to [[Franchising]] ... 38565610 40811579 932 2006-02-23 04:12:35 Deville 364144 2 2006-02-23 12 1384241
4 True NaN 40811579 49638872 1025 2006-04-22 19:51:18 0 3 2006-04-22 93 5067521

Now do all the groupby and agg operations to create the edgelists and nodelists we'll need to make a network as well as the data cleanup steps we did above.

In [47]:
edge_agg_function = {'revid':{'weight':len},
                     'timestamp':{'ts_min':min,'ts_max':max},
                     'revision':{'revision_min':min,'revision_max':max},
                     'latency':{'latency_min':min,'latency_median':np.median,'latency_max':max},
                     'diff':{'diff_min':min,'diff_median':np.median,'diff_max':max}}

# Create the edgelist by grouping on both page and user 
gigantic_gb_edge = gigantic_df.reset_index().groupby(['page','user'])
gigantic_edgelist = gigantic_gb_edge.agg(edge_agg_function)

# Drop the legacy/redundant column names
gigantic_edgelist.columns = gigantic_edgelist.columns.droplevel(0)

# Convert the ts_min and ts_max to floats for the number of days since Wikipedia was founded
gigantic_edgelist['ts_min'] = (gigantic_edgelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
gigantic_edgelist['ts_max'] = (gigantic_edgelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')

print "There are {0} edges in the network.".format(len(gigantic_edgelist))
There are 1198564 edges in the network.
In [48]:
node_agg_function = {'revid':{'revisions':len},
                     'timestamp':{'ts_min':min,'ts_max':max},
                     'revision':{'revision_min':min,'revision_median':np.median,'revision_max':max},
                     'latency':{'latency_min':min,'latency_median':np.median,'latency_max':max},
                     'diff':{'diff_min':min,'diff_median':np.median,'diff_max':max}}

# Create the usernodelist by grouping on user and aggregating
gigantic_gb_user = gigantic_df.reset_index().groupby(['user'])
gigantic_usernodelist = gigantic_gb_user.agg(node_agg_function)

# Clean up the columns and convert the timestamps to counts
gigantic_usernodelist.columns = gigantic_usernodelist.columns.droplevel(0)
gigantic_usernodelist['ts_min'] = (gigantic_usernodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
gigantic_usernodelist['ts_max'] = (gigantic_usernodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')

print "There are {0} editor nodes in the network.".format(len(gigantic_usernodelist))

There are 609386 editor nodes in the network.
ts_min ts_max revision_min revision_max revision_median revisions latency_min latency_median latency_max diff_median diff_max diff_min
!!2011WorldProtests!! 3667.980139 3671.951539 473 4059 3106.0 33 31 111 5031 57.0 324 -22
!!2011WorldProtests!!Appletart!! 3667.906146 3667.917222 2459 2466 2462.5 2 151 301 451 13.5 27 0
!"£$ 2103.991366 2103.991366 664 664 664.0 1 94341 94341 94341 113.0 113 113
!1029qpwoalskzmxn 3032.998449 3032.998449 17228 17228 17228.0 1 15381 15381 15381 185.0 185 185
!ComputerAlert! 3328.060347 3328.060347 1266 1266 1266.0 1 426451 426451 426451 21.0 21 21
In [49]:
# Create the pagenodelist by grouping on page and aggregating
gigantic_gb_page = gigantic_df.reset_index().groupby(['page'])
gigantic_pagenodelist = gigantic_gb_page.agg(node_agg_function)

# Clean up the columns and convert the timestamps to counts
gigantic_pagenodelist.columns = gigantic_pagenodelist.columns.droplevel(0)
gigantic_pagenodelist['ts_min'] = (gigantic_pagenodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
gigantic_pagenodelist['ts_max'] = (gigantic_pagenodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')

print "There are {0} page nodes in the network.".format(len(gigantic_pagenodelist))

There are 2644 page nodes in the network.
ts_min ts_max revision_min revision_max revision_median revisions latency_min latency_median latency_max diff_median diff_max diff_min
10 Janpath 1952.730394 4918.583762 0 86 43.0 87 21 142856 34497461 13.5 568 -548
107th United States Congress 1068.466620 4866.595417 0 497 248.5 498 11 40651 11445051 9.0 95328 -95328
11/22/63 3697.706644 4997.840532 0 955 477.5 956 1 2041 4650531 5.0 34146 -34146
111th United States Congress 1647.791111 5001.887164 0 3410 1705.0 3411 11 1476 27124741 1.0 26558 -17513
14 Women 2805.139792 4828.690787 0 39 19.5 40 21 2127091 17709771 16.0 705 -95

Having created the edge and node lists in the previous step, we can now add these data to a NetworkX DiGraph object we'll call gigantic_g. As before, we add the edges and edge attributes from gigantic_edgelist and then add the nodes and node attributes from gigantic_usernodelist and gigantic_pagenodelist. We perform a dictionary comprehension to convert the values of the attributes to the plain float data type rather than numpy.float64, which doesn't play nicely with the graph-writing functions in NetworkX. Then we can do the "grand reveal" to describe the coauthorship network of the articles in the hyperlink network neighborhood of Hillary's article.
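The float-conversion idiom can be seen in isolation: numpy.float64 values become plain Python floats while the keys are untouched (toy attribute dictionary for illustration):

```python
import numpy as np

# Attribute dict as it comes out of a pandas row: numpy.float64 values
attrs = {'weight': np.float64(2.0), 'ts_min': np.float64(1893.72)}

# Convert every value to a plain Python float for the NetworkX graph writers
attrs = {k: float(v) for k, v in attrs.items()}
```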

In [70]:
gigantic_g = nx.DiGraph()

# Add the edges and edge attributes. Edges point from editors to articles, so
# editors have out-degree and articles have in-degree. (The add_edge/add_node
# calls were lost in export and are reconstructed here.)
for (article,editor) in iter(gigantic_edgelist.index.values):
    edge_attributes = dict(gigantic_edgelist.ix[(article,editor)])
    edge_attributes = {k:float(v) for k,v in edge_attributes.iteritems()}
    gigantic_g.add_edge(editor,article,edge_attributes)

# Add the user nodes and attributes
for node in iter(gigantic_usernodelist.index):
    node_attributes = dict(gigantic_usernodelist.ix[node])
    node_attributes = {k:float(v) for k,v in node_attributes.iteritems()}
    gigantic_g.add_node(node,node_attributes)

# Add the page nodes and attributes
for node in iter(gigantic_pagenodelist.index):
    node_attributes = dict(gigantic_pagenodelist.ix[node])
    node_attributes = {k:float(v) for k,v in node_attributes.iteritems()}
    gigantic_g.add_node(node,node_attributes)

print "There are {0} nodes and {1} edges in the network.".format(gigantic_g.number_of_nodes(),gigantic_g.number_of_edges())

There are 612010 nodes and 1198564 edges in the network.
A few edges with their attribute dictionaries (the cell input and source-node names were truncated in the output; elided parts shown as …):

  (…, u'John McCain',
   {'diff_max': 33.0,
    'diff_median': 13.5,
    'diff_min': -6.0,
    'latency_max': 71.0,
    'latency_median': 51.0,
    'latency_min': 31.0,
    'revision_max': 858.0,
    'revision_min': 857.0,
    'ts_max': 1893.7183449074073,
    'ts_min': 1893.7175578703705,
    'weight': 2.0}),
  (…, u'John Marshall',
   {'diff_max': 50.0,
    'diff_median': 50.0,
    'diff_min': 50.0,
    'latency_max': 46211.0,
    'latency_median': 46211.0,
    'latency_min': 46211.0,
    'revision_max': 1164.0,
    'revision_min': 1164.0,
    'ts_max': 2809.7745601851852,
    'ts_min': 2809.7745601851852,
    'weight': 1.0})]

Finally, having gone through all this effort to make a co-authorship network with such rich attributes and complex properties, we should save our work. There are many different file formats for storing network objects to disk, but the two I use the most are "graphml" and "gexf". They do slightly different things, but they're generally interoperable and compatible with many programs for visualizing networks like Gephi.

In [71]:
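The body of this cell didn't survive the export. A minimal sketch of what it presumably did, demonstrated on a toy graph so it's self-contained (file names are illustrative; the original would have written gigantic_g):

```python
import networkx as nx

# Toy graph standing in for gigantic_g
G = nx.DiGraph()
G.add_edge('Madcoverboy', 'Hillary Rodham Clinton', weight=2.0)

# Write the graph to disk in both formats
nx.write_graphml(G, 'toy.graphml')
nx.write_gexf(G, 'toy.gexf')

# Round-trip check: read the graphml file back
H = nx.read_graphml('toy.graphml')
```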

Analyze the gigantic_g network

Now let's perform some basic network analyses on this gigantic graph we've created. An extremely easy and important metric to compute is the degree centrality of nodes in the network: how well-connected a node is based on the number of edges it has to other nodes. We use the directed nature of the edges to distinguish between articles (which receive links in) and editors (which send links out), computing the in- and out-degree centralities with the nx.in_degree_centrality and nx.out_degree_centrality functions respectively. These functions return a normalized degree centrality: the values aren't the integer count of connected edges but rather the fraction of the other nodes in the graph to which a node is connected. The values are returned in a dictionary keyed by node ID (article title or user name), which we save as g_idc and g_odc.
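To make the normalization concrete, here's a toy directed graph (hypothetical names) where edges run from editors to the articles they edit:

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([('EditorA', 'Article1'),
                  ('EditorA', 'Article2'),
                  ('EditorB', 'Article1')])

# With n = 4 nodes, centralities are normalized by n - 1 = 3:
# Article1 has in-degree 2, so its in-degree centrality is 2/3;
# EditorA has out-degree 2, so its out-degree centrality is 2/3
idc = nx.in_degree_centrality(G)
odc = nx.out_degree_centrality(G)
```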

In [51]:
g_idc = nx.in_degree_centrality(gigantic_g)
g_odc = nx.out_degree_centrality(gigantic_g)

We can use a handy function called itemgetter to quickly sort these dictionaries and return the 10 best-connected articles and users. Hillary, despite being the seed node we started from, is not actually the best-connected article; other major people and entities are. The top editors, interestingly enough, are not people at all but automated bots that perform a variety of maintenance and cleanup tasks across articles.
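The sorting idiom can be seen on a toy dictionary: itemgetter(1) keys the sort on the value of each (key, value) pair rather than the key.

```python
from operator import itemgetter

# Toy centrality scores (hypothetical)
toy_centrality = {'Article1': 0.1, 'Article2': 0.5, 'Article3': 0.3}

# Sort by value, descending, and take the two best-connected entries
top_two = sorted(toy_centrality.items(), key=itemgetter(1), reverse=True)[:2]
# → [('Article2', 0.5), ('Article3', 0.3)]
```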

In [58]:
sorted(g_idc.iteritems(), key=itemgetter(1),reverse=True)[:10]
[(u'George W. Bush', 0.023424492123481844),
 (u'United States', 0.015543889060454993),
 (u'World War II', 0.012124004712349002),
 (u'Chicago', 0.011112581677720425),
 (u'Barack Obama', 0.010483505961513638),
 (u'India', 0.010083185051200228),
 (u'Ronald Reagan', 0.009625675439413473),
 (u'Diana, Princess of Wales', 0.009333196080449796),
 (u'Bill Clinton', 0.009318490414356652),
 (u'New York City', 0.00905215446178079)]
In [53]:
sorted(g_odc.iteritems(), key=itemgetter(1),reverse=True)[:10]
[(u'SmackBot', 0.0032515861694844355),
 (u'Addbot', 0.002730352004627383),
 (u'Cydebot', 0.002475453792346191),
 (u'Yobot', 0.0024378726456637076),
 (u'RjwilmsiBot', 0.0023659782780972175),
 (u'AnomieBOT', 0.0019640234048845687),
 (u'ClueBot NG', 0.0019199064066051316),
 (u'Rjwilmsi', 0.00177775163437139),
 (u'ClueBot', 0.0017140270813010919),
 (u'FrescoBot', 0.001547362865578774)]

We can plot a histogram of connectivity patterns for the articles and editors, which shows a very skewed distribution: most editors edit only a single article, while a few editors make thousands of contributions. The distribution for articles is less severe but still very long-tailed.
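The histogram itself is just a Counter over degree values: each key is a degree and each count is the number of nodes with that degree. A toy illustration:

```python
from collections import Counter

# Degrees of six hypothetical nodes
degrees = [1, 1, 1, 2, 3, 3]

# Map each degree to the number of nodes having it
histogram = Counter(degrees)
# → Counter({1: 3, 3: 2, 2: 1})
```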

In [54]:
g_size = gigantic_g.number_of_nodes()
g_idc_counter = Counter([v*(g_size-1) for v in g_idc.itervalues() if v != 0])
g_odc_counter = Counter([v*(g_size-1) for v in g_odc.itervalues() if v != 0])

plt.xlabel('Number of connections',fontsize=15)
plt.ylabel('Number of nodes',fontsize=15)
plt.legend(loc='upper right',scatterpoints=1)
<matplotlib.legend.Legend at 0x1403ce890>

We can also look at the distribution of edge weights, or the number of times an editor contributed to an article. We could do this using the gigantic_edgelist DataFrame, but let's practice using the attributes we've stored in the graph object. Using a list comprehension as before, we iterate over the edges (note the use of edges_iter(data=True), which is both more memory-efficient and returns the edge attributes); each edge comes back as a tuple (i,j,attributes_dict). We access each tuple's 'weight' attribute and store it in the list weights, then apply a Counter and plot the results.
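A toy illustration of the iteration pattern (note that in NetworkX 2.x, edges_iter was folded into edges, which became lazy, so the same comprehension works with G.edges(data=True)):

```python
import networkx as nx

# Two hypothetical editor-to-article edges with weight attributes
G = nx.DiGraph()
G.add_edge('Madcoverboy', 'Hillary Rodham Clinton', weight=2.0)
G.add_edge('Jimbo_Wales', 'Hillary Rodham Clinton', weight=5.0)

# Each iterated edge is a (source, target, attribute_dict) tuple
weights = [d['weight'] for u, v, d in G.edges(data=True)]
```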

In [55]:
weights = [attributes['weight'] for i,j,attributes in gigantic_g.edges_iter(data=True)]
weight_counter = Counter(weights)

plt.xlabel('Number of contributions',fontsize=15)
plt.ylabel('Number of edges',fontsize=15)
<matplotlib.text.Text at 0x114ab9090>

We can compute another degree-related metric, tied to "assortativity", that measures how well-connected a node's neighbors are on average. We compute this statistic on the sets of article and editor nodes using the nx.assortativity.average_degree_connectivity function, paying special attention to the direction of the ties and limiting the nodes to those in the set of pages or users, respectively. Plotting the distribution, both the articles and the editors exhibit negative correlations: editors (articles) connected to few articles (editors) tend to have well-connected neighbors, while editors (articles) connected to many articles (editors) tend to have poorly connected neighbors. Articles exhibit a stronger correlation than editors.

In [56]:
article_nn_degree = nx.assortativity.average_degree_connectivity(gigantic_g,source='in',target='out',nodes=gigantic_pagenodelist.index)
editor_nn_degree = nx.assortativity.average_degree_connectivity(gigantic_g,source='out',target='in',nodes=gigantic_usernodelist.index)

plt.ylabel('Average neighbor degree',fontsize=15)
plt.legend(loc='upper right',scatterpoints=1)
<matplotlib.legend.Legend at 0x13f0e9b50>
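As a sanity check on what average_degree_connectivity returns, consider a toy star graph: the function maps each degree k to the average degree of the neighbors of degree-k nodes. The hub (degree 3) has only degree-1 neighbors, and each leaf's sole neighbor is the degree-3 hub.

```python
import networkx as nx

# A star with one hub and three leaves
star = nx.star_graph(3)

# Maps degree -> average degree of that degree class's neighbors
knn = nx.average_degree_connectivity(star)
# → {1: 3.0, 3: 1.0}
```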

Testing other things out

Just trying other things.

In [75]:
edge_weight_centrality = [list(),list()]
# For each edge (i,j), record the gap in in-degree centrality between its
# endpoints alongside the edge weight (the second append was truncated in
# export and is reconstructed here)
for (i,j,attributes) in gigantic_g.edges_iter(data=True):
    edge_weight_centrality[0].append(g_idc[j] - g_idc[i])
    edge_weight_centrality[1].append(attributes['weight'])

Appendix - There be dragons here

Throughout this section, we've gotten the links for a single article. As we did with the user information in the previous section, we can wrap these queries in a function so that they're easier to run. Once we do this, we can do more interesting things like examine the hyperlink ego-network surrounding a single article.

We take the lists of linked articles we extracted from Hillary's article and iterate over them, getting the lists of articles for each of them. We need to place the output of these into a larger data object that will hold everything. I'll use a dictionary keyed by article name that returns a dictionary containing the lists of links for that article. We'll put Hillary's data in there to start it up, but we'll add more.

Next we come up with the list of articles we're going to iterate over. We could just concatenate the outlist and inlist articles, but there might be redundancies between them. Instead we'll cast these lists into sets containing only unique article names, and the union of these sets creates a master set of all unique article names. We then convert this joined set back into a list called all_links so we can iterate over it.
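The set arithmetic, on toy lists of article titles: the union operator | merges the two sets and drops any duplicates.

```python
# Hypothetical outgoing and incoming link lists with overlap
outlist = ['Bill Clinton', 'Barack Obama', 'Chicago']
inlist = ['Chicago', 'Barack Obama', 'United States']

# Union of the two sets keeps each article name exactly once
all_links = list(set(outlist) | set(inlist))
```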

This set of unique links has 2,646 articles in it, so scraping the data for all of them will take some time. This may take over an hour to run and will generate ~190MB of data: convert the cell below back to "Code" if you really want to execute it.

# Start up the data structure
link_data = {u'Hillary Rodham Clinton': {'Out':hrc_alllink_outlist, 'In':hrc_alllink_inlist}}

# Set operations
all_links = list(set(hrc_alllink_outlist) | set(hrc_alllink_inlist))

# Start the scrape
for article in all_links:
    try:
        _out_links,_in_links = get_article_links(article)
        link_data[article] = {'Out':_out_links, 'In':_in_links}
    except:
        print article
        pass

# Save the data
with open('link_data.json','wb') as f:
    json.dump(link_data,f)
In [ ]:
dtype_dict = {'page':unicode,