Social Graph and Interactions - Final Project

The Lord of the Rings books, scripts and wiki analysis

1 - Motivation

Why The Lord Of The Rings? Well, we could go on for a long time with this answer, but we will keep it short in three points.

  • Because it has everything. The first question we asked ourselves was whether we could actually use this as a project. Otherwise, if data is not sufficient, the work cannot be rewarding. The answer was yes! The three books are easily available online, and to analyse the movies we could easily find the three complete scripts. For the network, the wiki provides a great starting point; it doesn't have a lot of nodes, though, so even if a lot of results are interesting we decided to analyse two more networks extracted from the scripts, as you will see later. With such a huge amount of text, all the tools from NLTK are applicable with good results. We have dug into different online sources to make sure to pick the best books and script versions; and we believe we did! One thing we have learned from this course, though, is that no data is perfect, but chosing the best option can save at least some time! Here are our sources:

  • Because it is complex. Well, we didn't really fear to be left with too few things to do. But that's a question that we asked ourselves as well. Maybe we will just find Frodo popping up everywhere, "fellowship" and "ring" will be the first word in every count, and guess which word will be the biggest one in Gollum's wordcloud? Well, not the one you would expect! Our dataset is never obvious; the good old JRR has set up a good solid structure, and his fans have built up even more data to work on, well connected and interesting! This makes up for a great dataset!

  • Because we are nerd(ish). Well, we may not be the real kind of nerds that one could think of, but we are pretty passionate about fantasy. We wanted to try and explore a universe that is interesting for us; and we may not be able to use it for the new groundbreaking start-up. But we can do that in the future: when does it happen again to have the opportunity to do a course project on "Lord of the Rings"? Furthermore, we may discover some interesting facts...

So, let's begin our journey!

It's the job that's never started that takes longest to finish.


2 - Basic stats. Let's understand the dataset better

2.1 - The Wiki

The first network generated from the pages of the Wiki was built using this dataset, which provides, through a well-formatted XML file, the whole content of the Wiki. The size of the dataset is around 50 MB. We extracted the relevant information from the XML file and generated a network of characters, cities and name of battles, where all these represent nodes, while the links (edges, in a formal network syntax) are represented by the Wiki links inside each page. Wiki links are nothing but simple links with a specific syntax commonly used by many Wikis including the most famous Wikipedia.

2.2 - The Books

The books are fully available at the links above; they are not that easy to handle though! The pages have a lot of good features (the line breaks are correct and the text is quite plain), but some other things are not coherent. For example, the HTML tags are often different throughout the trilogy, and the header and footer of the HTML pages differ as well (which can make it hard to distil the text to analyse and to find the HTML titles of the chapters).

Each book has between 160000 and 250000 words, but the truth is that in total the trilogy has 6 so-called books, each having not less than 10 chapters. So it's even more structured than it looks like!

During the book handling we had some problems with encoding and recognising accents, but in the end we manage to analyse the books and take out the text (... not the elfic runes, though!).

2.3 - The Scripts

The scripts are a bit of a mess. Each one is formatted differently, and frequently enough a non-ascii character pops up (sometimes with a meaning, e.g. the accent on Theoden's first 'e', but sometimes totally random, as some centered-dots). We decided to work with the unidecoded version, that transforms each fancy unicode character to its closest ASCII counterpart. This allows to search for a name in the wiki data (and make our datasets communicate), as well as guaranteeing that the three scripts have just the same format regarding names and spellings.

Note that the scripts have a different encoding than the books: whereas the formers are encoded in Latin1 (iso-8859-1), the latters are encoded in cp1251 (what?) which gave us some problems in handling the data at some point. To get only the script from the web page, on the other hand, we could easily pick the content of the HTML-tag pre.

The script lengths are between 1100 and 2300 lines (yes, the Return of the King is long), with around 25000-30000 words in total for each movie. The characters that speak in the movies are around 100, and no two female ever talk to each other (... Or do they? Read on!).


3 - Tools, theory and analysis. Describe the process of theory to insight

Path setup to have the notebook work in different machines

In [1]:
fontpath = None # 'C:\\Windows\\Fonts\\RINGM__.ttf'
general_path = "../"
path_to_pictures = "pictures/"

Useful imports

In [2]:
# parsing and encoding
import json, ujson, xmltodict, re, urllib2, codecs
from HTMLParser import HTMLParser
from unidecode import unidecode

# networking and plotting
import networkx as nx
import community
from matplotlib import pyplot as plt
%matplotlib inline

# advanced type handling
import operator, math, string, random
import numpy as np
from toolz import dissoc
from __future__ import division
from string import punctuation

# printing and imaging
import pprint
from prettytable import PrettyTable
from PIL import Image

# language processing and sentiment
import nltk
from nltk.corpus import stopwords
import wordcloud
stopwords_EN = nltk.corpus.stopwords.words('english')
from bs4 import BeautifulSoup
#from BeautifulSoup import BeautifulSoup, Comment

The following function will be used to display output in a nicer way, using PTable:

In [3]:
def print_table(data,column_names):
    table = PrettyTable()
    table.field_names = column_names
    for row in data:
        table.add_row(row)
    print table

Parsing the XML file in order to get a dictionary, which is easier to handle:

In [4]:
with open(general_path+"lotr_pages_current.xml") as f:
    parsed = xmltodict.parse(f.read())

Here we define a function for cleaning a raw text and extract the relevant wiki links, which will be returned as a list:

In [5]:
 #Removing wiki links to filter out
def clean_up(raw):
    raw = re.sub(r'\[\[(?:.+?\|)?[cC]ategory.*?\]\]','',raw)
    raw = re.sub(r'\[\[(?:.+?\|)?[lL]ist.*?\]\]','',raw)
    raw = re.sub(r'\[\[(.*Hobbits.*?)\]\]','',raw)
    raw = re.sub(r'\[\[(.*Peter\ Jackson.*?)\]\]','',raw)
    raw = re.sub(r'\[\[(.*Tolkien.*?)\]\]','',raw)
    raw = re.sub(r'\[\[(?:.+?\|)?#.*?\]\]','',raw)
    raw = re.sub(r'\[\[([a-z].+?)\]\]','',raw)
    raw = re.sub(r'\[\[(File.+?)\]\]','',raw)
    matches = re.findall(r'\[\[(.+?)\]\]',raw)
    return matches

#Removing duplicate pages due to REDIRECTS
def find_redirects(raw):
    if '#REDIRECT' in raw or '#redirect' in raw:
        return True
    else:
        return False

A convenient way to get all the necessary nodes and wiki links to build the network, is to access those pages that list all the elements we are interested in, like list of characters, battles, cities and deaths and clean up their raw text in order to get all the pages of each category. We save this elements in four different lists, and we'll use these lists to build the nodes of the network. Special treatment for the list of deaths, that we'll use later on.

In [6]:
#Dictionary with all the pages as keys
all_pages = parsed['mediawiki']['page']
raw_text = {}

#For each category (characters,battles,cities,deaths) we save the raw text
for page in all_pages:
    if page['title']=='List of characters':
        raw_text['characters'] = page['revision']['text']['#text']        
    elif page['title']=='Battles':
        raw_text['battles'] = page['revision']['text']['#text']        
    elif page['title']=='List of deaths in the Lord of the Rings films':
        raw_text['deaths'] = page['revision']['text']['#text']        
    elif page['title']=='Cities':
        raw_text['cities'] = page['revision']['text']['#text']

#Removing possible duplicates and cleaning up the raw text of each category
list_of_characters = list(set(clean_up(raw_text['characters'])))
list_of_battles = list(set(clean_up(raw_text['battles'])))
list_of_deaths = list(set(clean_up(raw_text['deaths'])))
list_of_cities = list(set(clean_up(raw_text['cities'])))

#REMOVING REDIRECTS
list_of_characters = [elm for elm in list_of_characters
                      if not find_redirects(page['revision']['text']['#text'])
                                             for page in all_pages
                                             if page['title'] == elm]
list_of_battles = [elm for elm in list_of_battles
                      if not find_redirects(page['revision']['text']['#text'])
                                             for page in all_pages
                                             if page['title'] == elm]
list_of_cities = [elm for elm in list_of_cities
                      if not find_redirects(page['revision']['text']['#text'])
                                             for page in all_pages
                                             if page['title'] == elm]
list_of_deaths = [elm for elm in list_of_deaths
                      if not find_redirects(page['revision']['text']['#text'])
                                             for page in all_pages
                                             if page['title'] == elm]

Now that we saved all the titles of the pages we're interested in analyzing, we need to analyze their content and extract the wiki links. We'll iterate again through all the pages and save the raw text of each page. Next, we'll clean up again this text and extract the wiki links as we did before.

In [7]:
wiki_links = {}
wiki_links['characters'] = {}
wiki_links['cities'] = {}
wiki_links['battles'] = {}
edges = []

for page in all_pages:
    if page['title'] in list_of_characters:
        wiki_links['characters'][page['title']] = page['revision']['text']['#text']
    elif page['title'] in list_of_cities:
        wiki_links['cities'][page['title']] = page['revision']['text']['#text']
    elif page['title'] in list_of_battles:
        wiki_links['battles'][page['title']] = page['revision']['text']['#text']

for key in wiki_links:
    for page in wiki_links[key]:
        wiki_links[key][page] = clean_up(wiki_links[key][page])
        # Worth saying that we keep those wiki links that are contained in one of those 
        #  lists we created in the previous cell
        wiki_links[key][page] = [elm for elm in wiki_links[key][page]
                               if elm in list_of_characters
                               or elm in list_of_battles
                               or elm in list_of_cities]
        edges.extend([(page,elm) for elm in wiki_links[key][page]])
edges = list(set(edges))

3.2 - Network Analysis

Now we build the directed graph; as you can easily see, the nodes of the graph will be characters, cities and names of battles. Both the color and the size represent the node degree (the higher the degree,the bigger the node and the darker the color)

In [8]:
G = nx.DiGraph()
G.add_nodes_from(list_of_characters)
G.add_nodes_from(list_of_battles)
G.add_nodes_from(list_of_cities)
G.add_edges_from(edges)
In [9]:
print 'The graph has',len(G.nodes()),'nodes',
a = len(nx.isolates(G))
if a == 0:
    print 'and is connected'
else:
    print '(%d isolated)' % a
print 'The graph has',len(G.edges()),'edges'
plt.figure(figsize=(12,8))

nodes = sorted(G.nodes(), key = G.degree, reverse=True)
degrees = sorted([G.degree(node) for node in nodes], reverse=True)
pos = nx.spring_layout(G,k=1/(np.sqrt(len(nodes))/20))
nx.draw(G,
        pos,
        cmap=plt.cm.YlOrBr_r,
        edge_color='gray',
        nodelist=nodes,
        node_size=[20*(d+1) for d in degrees],
        node_color=range(len(G.nodes())),
        arrows=False)
plt.savefig(path_to_pictures+'full_network.png')
plt.show()
The graph has 653 nodes (32 isolated)
The graph has 3971 edges

It would be interesting to see which of tese big nodes are in fact characters! So let's add some color to show that (no modification in the size)...

In [10]:
plt.figure(figsize=(12,8))

colors = []
for n in nodes:
    if n in list_of_battles:
        colors.append('#CC4C39')
    elif n in list_of_characters:
        colors.append('#5899FF')
    elif n in list_of_cities:
        colors.append('#94B23B')
    else:
        colors.append('#FFFFFF')

# Hack to add a legend
plt.hlines(0,0,0,label='Characters',colors='#5899FF', linewidth=4)
plt.hlines(0,0,0,label='Cities',colors='#94B23B',linewidth=4)
plt.hlines(0,0,0,label='Battles',colors='#CC4C39',linewidth=4)

plt.legend()
nx.draw(G,
        pos,
        edge_color='gray',
        nodelist=nodes,
        node_size=[20*(d+1) for d in degrees],
        node_color=colors,
        arrows=False)
plt.savefig(path_to_pictures+'full_network_and_categories.png')
plt.show()

3.2.1 - Centrality and graph statistics

We will need some graph statistics in various parts of the notebook, so we wrap them in a function for convenience.

In [11]:
def graph_nodes_stats(Gr):
    data = sorted(nx.betweenness_centrality(Gr).iteritems(),key=operator.itemgetter(1),reverse=True)[:10]
    print_table(data,['Node','Betweenness Centrality'])

    data = sorted(nx.eigenvector_centrality(Gr).iteritems(),key=operator.itemgetter(1),reverse=True)[:10]
    print_table(data,['Node','Eigenvector Centrality'])

    all_degrees_values = Gr.degree().values()
    all_degrees = sorted(Gr.degree().iteritems(), key=operator.itemgetter(1), reverse=True)
    
    if isinstance(Gr, nx.DiGraph):
        in_deg_values = sorted(Gr.in_degree().values())
        in_degree = sorted(Gr.in_degree().iteritems(), key=operator.itemgetter(1), reverse=True)

        out_deg_values = sorted(Gr.out_degree().values())
        out_degree = sorted(Gr.out_degree().iteritems(), key=operator.itemgetter(1), reverse=True)

    print 'Average degree:', np.average(all_degrees_values)
    print 'Median Degree:', np.median(all_degrees_values)
    if isinstance(Gr, nx.DiGraph):
        print 'Average in-degree:', np.average(in_deg_values)
        print 'Average out-degree:', np.average(out_deg_values)
    print '\nTop degree values:\n'
    print_table(all_degrees[:10],['Node','Degree'])
    if isinstance(Gr, nx.DiGraph):
        print '\nTop in-degree values:\n'
        print_table(in_degree[:10],['Node','Degree'])
        print '\nTop out-degree values:\n'
        print_table(out_degree[:10],['Node','Degree'])

... and now let's see what are the numbers for our wiki graph!

In [12]:
graph_nodes_stats(G)
+----------------------+------------------------+
|         Node         | Betweenness Centrality |
+----------------------+------------------------+
|        Sauron        |     0.143590578864     |
|       Gandalf        |    0.0741412431694     |
|  Aragorn II Elessar  |    0.0614710296562     |
|      Galadriel       |    0.0514085102322     |
|       Isildur        |    0.0457204460073     |
|        Elrond        |    0.0449688443148     |
|     Minas Tirith     |    0.0431990441545     |
|        Umbar         |    0.0410685370346     |
|       Saruman        |    0.0368799242238     |
| Witch-king of Angmar |    0.0355133218318     |
+----------------------+------------------------+
+--------------+------------------------+
|     Node     | Eigenvector Centrality |
+--------------+------------------------+
|    Sauron    |     0.366451927533     |
|    Elrond    |     0.24497381971      |
|   Gandalf    |     0.216553483135     |
|  Galadriel   |     0.208281097753     |
|  Rivendell   |     0.190434733593     |
|   Saruman    |     0.165068868452     |
| Minas Tirith |     0.162934843502     |
|    Gimli     |     0.152667926063     |
|   Lúthien    |     0.149038191822     |
|   Legolas    |     0.147504052641     |
+--------------+------------------------+
Average degree: 12.1623277182
Median Degree: 7.0
Average in-degree: 6.08116385911
Average out-degree: 6.08116385911

Top degree values:

+---------------+--------+
|      Node     | Degree |
+---------------+--------+
|     Sauron    |  156   |
|    Gandalf    |  104   |
|   Galadriel   |   90   |
|     Elrond    |   89   |
|    Saruman    |   79   |
|   Rivendell   |   76   |
| Frodo Baggins |   72   |
|  Minas Tirith |   69   |
|     Melkor    |   65   |
|     Fëanor    |   62   |
+---------------+--------+

Top in-degree values:

+---------------+--------+
|      Node     | Degree |
+---------------+--------+
|     Sauron    |  125   |
|    Gandalf    |   66   |
|   Rivendell   |   64   |
|     Elrond    |   59   |
|  Minas Tirith |   53   |
|    Saruman    |   49   |
|   Galadriel   |   42   |
|     Gimli     |   39   |
|     Fëanor    |   39   |
| Frodo Baggins |   38   |
+---------------+--------+

Top out-degree values:

+--------------------+--------+
|        Node        | Degree |
+--------------------+--------+
|     Galadriel      |   48   |
| Aragorn II Elessar |   41   |
|      Gandalf       |   38   |
|       Melkor       |   34   |
|   Frodo Baggins    |   34   |
|       Sauron       |   31   |
|      Saruman       |   30   |
|       Elrond       |   30   |
|       Arwen        |   29   |
|      Celeborn      |   28   |
+--------------------+--------+

Let us compute and plot the giant connected component and plot it with some labels that popped out in the previous tables...

In [13]:
gcc = sorted(nx.connected_component_subgraphs(G.to_undirected()),key=len,reverse=True)[0]
print 'The giant connected component has',len(gcc.nodes()),'nodes'
print 'The giant connected component has',len(gcc.edges()),'edges'
plt.figure(figsize=(12,8))

nodes = sorted(gcc.nodes(), key = gcc.degree, reverse=True)
degrees = sorted([gcc.degree(node) for node in nodes], reverse=True)
pos = nx.spring_layout(gcc,k=1/(np.sqrt(len(nodes))/45))

nx.draw(gcc,
        pos,
        cmap=plt.cm.YlOrBr_r,
        edge_color='gray',
        nodelist=nodes,
        node_size=[50*(d+1) for d in degrees],
        node_color=range(len(gcc.nodes())),
        arrows=False)
labels={node:node for node in ['Minas Tirith','Sauron','Frodo Baggins','Gandalf','Gollum',
                               'Galadriel','Aragorn II Elessar','Elrond','Saruman']}
nx.draw_networkx_labels(gcc,pos,labels,font_size=16,font_weight='bold')
plt.savefig(path_to_pictures+'giant_component.png')
plt.show()
The giant connected component has 610 nodes
The giant connected component has 3099 edges