The Global Knowledge Graph (GKG) is a companion product to GDELT, created by Kalev Leetaru and generated from the same firehose of daily media reports as GDELT. Instead of events, GKG focuses on entities: people, places, organizations, even themes. GKG also computes tonal data for articles, estimating what fraction of the source text contains 'positive' or 'negative' words, as well as 'active' words and names of entities. Finally, it also contains counts: number of people affected by a hurricane, for example, or participating in a demonstration.
Each row of GKG data contains all the entities and themes extracted from a set of documents, along with their tonal data. Essentially, you can think of GKG as an extremely multipartite graph. You can use it to see which individuals are mentioned together -- or which individuals are linked by theme. Or which themes are associated with different organizations. And how these change over time. The possibilities are massive -- so let's get started!
NOTE: GKG is still considered to be an experimental Alpha-release product. The information in this tutorial (and in GKG itself) is likely to change substantially between now and when it is considered finalized.
GKG comes in two sets of files: count files, which contain only the event-linked counts and some supporting information, and full GKG files, which add all the entity information. We'll focus on the full files here.
Here is some quick and dirty code that will download all the GKG data currently available. Note that as of now, it only goes back to the beginning of October, updated (mostly) daily. Eventually, there will be data available all the way back to 1979.
Start running the code below -- and if you get bored waiting for it to download, skip below to read about the data generating process.
import os
import datetime as dt
import time
import io
import numpy as np
import requests
URL = "http://gdelt.utdallas.edu/data/gkg/" # The GKG data directory
PATH = "/Users/dmasad/Data/GDELT/GKG/" # The local directory to store the data
# Specify the start and end date for your data
start_date = dt.datetime(2013, 10, 5)
end_date = dt.datetime.today()
date = start_date
# For each date in between, download the corresponding file
while date <= end_date:
    filename = date.strftime("%Y%m%d") + ".gkg.csv.zip"
    req = requests.get(URL + filename, stream=True)
    with io.open(PATH + filename, "wb") as dl:
        for chunk in req.iter_content(chunk_size=1024):
            if chunk:
                dl.write(chunk)
    time.sleep(30)  # Be nice and don't overload the server.
    date += dt.timedelta(days=1)
When all the downloads are done, unzip the files. If you're on a Mac or Linux machine, you can unzip them all at once from the command line by going to the directory where the files are and running:
> unzip \*.zip
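If you're on Windows, or would rather stay in Python, the standard library's zipfile module does the same job. A minimal sketch (unzip_all is a helper of my own, not part of any GDELT tooling):

```python
import os
import zipfile

def unzip_all(path):
    """Extract every .zip archive in `path` into that same directory."""
    extracted = []
    for filename in sorted(os.listdir(path)):
        if filename.endswith(".zip"):
            with zipfile.ZipFile(os.path.join(path, filename)) as z:
                z.extractall(path)
                extracted.extend(z.namelist())
    return extracted

# unzip_all("/Users/dmasad/Data/GDELT/GKG/")  # the PATH directory from above
```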
GKG is generated from the same set of documents as GDELT -- mostly English-language news sources, plus reports provided by BBC Monitoring and (experimentally) machine-translated foreign-language sources.
The GKG parser extracts the entity names and themes (via keyword-matching) that appear in each document. It then groups them by namesets -- unique combinations of entities and themes. For example, all the documents mentioning Barack Obama, John Kerry and Hassan Rouhani (and matching themes and entities) will be grouped together, and separately from articles mentioning Obama, Kerry, Rouhani AND Benjamin Netanyahu. Each daily nameset is one row in GKG.
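To make the grouping concrete, here's a toy sketch of the idea (the documents and names are invented for illustration -- this isn't the actual GKG parser):

```python
from collections import defaultdict

# Made-up per-document name extractions, for illustration only:
docs = [
    ("barack obama", "john kerry", "hassan rouhani"),
    ("barack obama", "john kerry", "hassan rouhani"),
    ("barack obama", "john kerry", "hassan rouhani", "benjamin netanyahu"),
]

# Documents with identical name combinations collapse into one nameset;
# the number of documents in each nameset becomes the row's NUMARTS value.
namesets = defaultdict(int)
for names in docs:
    namesets[frozenset(names)] += 1

for names, numarts in sorted(namesets.items(), key=lambda kv: -kv[1]):
    print(sorted(names), numarts)
```

The first two documents land in one nameset (NUMARTS of 2); the third, with Netanyahu added, forms a separate nameset of its own.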
The GKG parser isn't perfect yet -- for example, it at least occasionally identifies a person named Al Qaeda. Some of these issues will be ironed out in time (remember, alpha release), but any dataset this large is all but guaranteed to contain noise.
GKG is stored in uniquely-formatted tab-delimited files, where some columns are small sub-tables. Let's open one up and take a look:
f = open(PATH + "20131001.gkg.csv")
Conveniently, each GKG file includes headers (at least for now) as its first row:
headers = f.readline()
print headers.split("\t")
['DATE', 'NUMARTS', 'COUNTS', 'THEMES', 'LOCATIONS', 'PERSONS', 'ORGANIZATIONS', 'TONE', 'CAMEOEVENTIDS', 'SOURCES', 'SOURCEURLS\n']
Now let's see the actual data. I'm going to cheat a bit here and go to an entry I know is interesting.
f.readline() # Skip a row
row = f.readline()
row = row.split("\t")
for entry in row:
    print entry
20131001 1 KILL#309##1#Pakistan#PK#PK#30#70#PK;WOUND#500##1#Pakistan#PK#PK#30#70#PK NATURAL_DISASTER;NATURAL_DISASTER_NATURAL_DISASTERS;MEDIA_MSM;NATURAL_DISASTER_EARTHQUAKE;KILL;WOUND;ARMEDCONFLICT;TAX_FNCACT;TAX_FNCACT_BABY;TAX_FNCACT_CHILD;AFFECT;TAX_FNCACT_STUDENTS 1#Egypt#EG#EG#27#30#EG;1#Syria#SY#SY#35#38#SY;4#Awaran, Balochistan, Pakistan#PK#PK02#26.4555#65.2312#-2755131;1#Pakistan#PK#PK#30#70#PK;1#China#CH#CH#35#105#CH;1#Ethiopia#ET#ET#8#38#ET;1#Iran#IR#IR#32#53#IR;1#Sudan#SU#SU#15#30#SU;1#Oman#MU#MU#21#57#MU;1#Afghanistan#AF#AF#33#65#AF joel osteen;mohammed al balushi united nations -1.27388535031847,2.07006369426752,3.34394904458599,5.4140127388535,23.7261146496815,0.318471337579618 269967042,269967401 main.omanobserver.om http://main.omanobserver.om/?p=17731&c=8oNJ4q9WWDBTCPNQDVVAm5a7sg99Rx0ggsXaeQ4orZI&mkt=en-us
Whoa, what's going on here? Each entry isn't just a straightforward data column -- some are lists of items, separated by different characters.
The first two entries are straightforward -- the date the row is coming from, and the number of articles included in this particular nameset.
The next entry is the COUNTS data. Essentially, this is a small sub-table: count entries are separated by semicolons, and 'columns' within those entries are separated by hashmarks. So:
for entry in row[2].split(";"):
    print entry.split("#")
['KILL', '309', '', '1', 'Pakistan', 'PK', 'PK', '30', '70', 'PK'] ['WOUND', '500', '', '1', 'Pakistan', 'PK', 'PK', '30', '70', 'PK']
So this nameset deals with an event in Pakistan described as having 309 killed and 500 wounded. The first field is the COUNTTYPE -- what's being counted; next is the NUMBER, the actual count; next (missing here) is the OBJECTTYPE, describing who was affected. The remaining fields are geographic: the GEO_TYPE (1 here indicates a country), followed by the GEO_FULLNAME, the GEO_COUNTRYCODE and GEO_ADM1CODE (the FIPS codes for the location), the latitude and longitude (here, just the centroid of Pakistan), and a FEATUREID.
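Since these positional fields are easy to mix up, it can help to zip each record up with labels. A minimal sketch -- the field names here are my own shorthand for the columns just described, not official GKG identifiers:

```python
# Shorthand labels for the COUNTS sub-table columns (my own names,
# not official GKG identifiers):
COUNT_FIELDS = ["COUNTTYPE", "NUMBER", "OBJECTTYPE", "GEO_TYPE",
                "GEO_FULLNAME", "GEO_COUNTRYCODE", "GEO_ADM1CODE",
                "GEO_LAT", "GEO_LONG", "GEO_FEATUREID"]

def parse_counts(field):
    """Split a raw COUNTS field into a list of labeled dicts."""
    counts = []
    for record in field.split(";"):
        if record:
            counts.append(dict(zip(COUNT_FIELDS, record.split("#"))))
    return counts

raw = ("KILL#309##1#Pakistan#PK#PK#30#70#PK;"
       "WOUND#500##1#Pakistan#PK#PK#30#70#PK")
for count in parse_counts(raw):
    print(count["COUNTTYPE"], count["NUMBER"], count["GEO_FULLNAME"])
```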
The next entry is just a list of THEMES (see the full theme spreadsheet for their interpretation), separated by semicolons.
print row[3].split(";")
['NATURAL_DISASTER', 'NATURAL_DISASTER_NATURAL_DISASTERS', 'MEDIA_MSM', 'NATURAL_DISASTER_EARTHQUAKE', 'KILL', 'WOUND', 'ARMEDCONFLICT', 'TAX_FNCACT', 'TAX_FNCACT_BABY', 'TAX_FNCACT_CHILD', 'AFFECT', 'TAX_FNCACT_STUDENTS']
This suggests that the event in question is an earthquake, though the article also mentions armed conflict, and the functional actors (role designations, denoted with the TAX_FNCACT prefix) BABY, CHILD, and STUDENTS.
LOCATIONS are another sub-table, listing all the locations mentioned in the article. Like COUNTS, its 'rows' are split with a semicolon, and 'columns' with a hash-sign.
for entry in row[4].split(";"):
    print entry.split("#")
['1', 'Egypt', 'EG', 'EG', '27', '30', 'EG'] ['1', 'Syria', 'SY', 'SY', '35', '38', 'SY'] ['4', 'Awaran, Balochistan, Pakistan', 'PK', 'PK02', '26.4555', '65.2312', '-2755131'] ['1', 'Pakistan', 'PK', 'PK', '30', '70', 'PK'] ['1', 'China', 'CH', 'CH', '35', '105', 'CH'] ['1', 'Ethiopia', 'ET', 'ET', '8', '38', 'ET'] ['1', 'Iran', 'IR', 'IR', '32', '53', 'IR'] ['1', 'Sudan', 'SU', 'SU', '15', '30', 'SU'] ['1', 'Oman', 'MU', 'MU', '21', '57', 'MU'] ['1', 'Afghanistan', 'AF', 'AF', '33', '65', 'AF']
These follow the same columns as above -- a location type, fullname and FIPS codes, and lat-long coordinates.
PERSONS and ORGANIZATIONS are both simply lists of the people and organizations extracted from the document, split by semicolons.
# PERSONS:
print row[5].split(";")
['joel osteen', 'mohammed al balushi']
# ORGANIZATIONS
print row[6].split(";")
['united nations']
So this article mentions American televangelist Joel Osteen, Omani football/soccer player Mohammed Al Balushi, and the UN. The source link itself is dead, but we know that there was a major earthquake in Pakistan in late September. Maybe this is an article about earthquake relief efforts?
Next we get to the TONE data, which is comma-delimited:
print row[7].split(",")
['-1.27388535031847', '2.07006369426752', '3.34394904458599', '5.4140127388535', '23.7261146496815', '0.318471337579618']
The first value is TONE, the average 'tone' of the article, measured from -100 to +100; a value of -1.27 is neutral-leaning-negative. In fact, it is simply the difference of the next two values: the Positive Score minus the Negative Score, each measured on a 0-100 scale.
Polarity, the next value, is the percent of 'tonal' words in the document; here, only ~5% of words in the text were tonal, suggesting a mostly-neutral document.
Activity Reference Density is the percent of 'active' words in the document (23.72% here), and Self/Group Reference Density is the percent of words that are self- or group-referencing pronouns -- extremely low here, though the GKG documentation notes that this is typical of news media.
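As a sanity check, we can parse the field and confirm the arithmetic. The labels here are my own shorthand for the six values just described:

```python
# Shorthand labels for the six tone values, in order (my own names):
TONE_FIELDS = ["TONE", "POSITIVE", "NEGATIVE", "POLARITY",
               "ACTIVITY_DENSITY", "SELF_GROUP_DENSITY"]

raw_tone = ("-1.27388535031847,2.07006369426752,3.34394904458599,"
            "5.4140127388535,23.7261146496815,0.318471337579618")
tone = dict(zip(TONE_FIELDS, (float(v) for v in raw_tone.split(","))))

# TONE should be the Positive Score minus the Negative Score:
print(tone["POSITIVE"] - tone["NEGATIVE"])
print(tone["TONE"])
```

Sure enough, the two printed values match.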
After TONE come CAMEOEVENTIDS, which are GDELT GlobalEventID codes that can be used to tie GKG back to GDELT:
print row[8].split(",")
['269967042', '269967401']
Finally comes sourcing information: the SOURCES and SOURCEURLS. This nameset contains only one article, so it includes only one source. If there were more sources, they would be comma-delimited.
# SOURCE:
print row[9]
main.omanobserver.om
#SOURCEURL
print row[10]
http://main.omanobserver.om/?p=17731&c=8oNJ4q9WWDBTCPNQDVVAm5a7sg99Rx0ggsXaeQ4orZI&mkt=en-us
Next, I'm going to run through a fairly simple application: analyzing the co-mention network surrounding the Iranian leadership, inspired by Drew Conway's analysis of Chinese leadership. To do this, I'll start with a list of the names of Iran's leaders, taken from the CIA World Factbook, and pull any names that co-appear with them across all the available GKG data. Then I'll create a network of co-mentions and analyze it.
First, I grab the names of the leadership from the CIA World Factbook. (I did this by hand; it's probably possible to scrape the names, but I'd rather not do anything to the CIA website that may be mistaken for 'hacking').
LEADERS = ["Ali Hoseini-KHAMENEI", "Hasan Fereidun RUHANI", "Mohsen HAJI-MIRZAIE",
"Mohammad NAHAVANDIAN", "Eshaq JAHANGIRI", "Mohammad SHARIATMADARI",
"Elham AMINZADEH", "Mohammad Baqer NOBAKHT", "Majid ANSARI",
"Sorena SATARI", "Shahindokht MOLAVERDI",
"Ali Akbar SALEHI", "Mohammad Ali NAJAFI", "Masumeh EBTEKAR",
"Mohammad Ali SHAHADI", "Mohammad HOJJATI", "Mahmud VAEZI-Jazai",
"Ali JANATI", "Hosein DEHQAN", "Ali TAYEBNIA", "Ali Asqar FANI",
"Hamid CHITCHIAN", "Mohammad Javad ZARIF-Khonsari",
"Seyed Hasan QAZIZADEH-Hashemi", "Mohammad Reza NEMATZADEH",
"Mahmud ALAVI, Hojjat ol-Eslam", "Abdolreza Rahmani-FAZLI",
"Mostafa PUR-MOHAMMADI", "Ali RABIEI", "Bijan Namdar-ZANGANEH",
"Abbas Ahmad AKHUNDI", "Reza FARAJI-DANA", "Valiollah SEIF",
"Mohammad KHAZAI-Torshizi"]
# Convert the names to all lower-case
LEADERS = [name.lower() for name in LEADERS]
Next, we iterate over all the GKG files and look for matching names.
entries = []
for path in os.listdir(PATH):
    if path[-3:] != "csv":
        continue
    with open(PATH + path) as f:
        for row in f:
            actors = row.split("\t")[5].split(";")
            for actor in actors:
                if actor in LEADERS:
                    entries.append(actors)
                    break
print len(entries)
966
We want to translate each list of co-appearing names into dyads, and count the number of times each dyad appears. The itertools module in the Standard Library has a combinations(...) function that provides all possible combinations of elements in a list. We'll also import defaultdict to store the dyad counts in.
import itertools
from collections import defaultdict
dyads = defaultdict(int)
for entry in entries:
    for p1, p2 in itertools.combinations(entry, 2):
        if (p2, p1) in dyads:
            dyads[(p2, p1)] += 1
        else:
            dyads[(p1, p2)] += 1
We can take a quick histogram to see how many dyads occur in different frequencies.
import matplotlib
import matplotlib.pyplot as plt
# Some initial styling, to make our graphs look good:
matplotlib.rcParams['axes.facecolor'] = "#eeeeee"
matplotlib.rcParams['axes.grid'] = True
matplotlib.rcParams['xtick.labelsize'] = 14
matplotlib.rcParams['ytick.labelsize'] = 14
fig = plt.figure(figsize=(20,12))
ax = fig.add_subplot(111)
ax.set_yscale('log')
h = ax.hist(dyads.values(), bins=np.linspace(1, 250, 26))
So the vast majority of dyads occur only a small number of times, and a small number occur many more times. We can check just how many dyads occur more than once:
counts = np.array(dyads.values())
print len(counts[counts>1])/(1.0*len(counts))
0.181183974487
Now we can build a network based on the co-mentions, using the NetworkX library. We'll filter out all dyads that occur only once, in order to avoid spurious relationships and get to the core of the network.
import networkx as nx
# Build the graph
G = nx.Graph()
for dyad, count in dyads.iteritems():
    if count > 1:
        G.add_edge(dyad[0], dyad[1], weight=count)
If you want to explore the network visually, the best thing to do is probably to save it to a file (for example, as GraphML) and load it in Gephi or another network analysis tool.
nx.write_graphml(G, "iran.graphml")
However, we can also do some analysis here. NetworkX has a decent drawing ability, so let's visualize the network.
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(111)
pos = nx.spring_layout(G, k=0.2, iterations=25)
nx.draw_networkx_edges(G, pos=pos, ax=ax, edge_color='#eeeeee')
nx.draw_networkx_labels(G, pos=pos, ax=ax, font_size=16)
_ = ax.axis('off')
Not bad, but this doesn't necessarily tell us all that much. Another thing we can do is look at network centralities. Eigenvector centrality is considered a good measure of influence or power within a network, while betweenness centrality measures a node's role as a gatekeeper or boundary-spanner between different groups. Plotting one against the other is a good way of finding both the overall most-important actors and those with special roles.
eigen_centralities = nx.eigenvector_centrality(G)
between_centralities = nx.betweenness_centrality(G)
fig = plt.figure(figsize=(20,12))
ax = fig.add_subplot(111)
for name in eigen_centralities.keys():
    ax.text(eigen_centralities[name], between_centralities[name], name,
            fontdict={"size": 16})
ax.set_xlabel("Eigenvector Centrality", size=20)
ax.set_ylabel("Betweenness Centrality", size=20)
<matplotlib.text.Text at 0x109458e50>
We notice right away that the right-hand side of the chart pulls out the key heads of government: Rouhani, Obama and Netanyahu. However, the actor with the highest centrality score is actually Ali Akbar Salehi, the head of Iran's Atomic Energy Organization. This shouldn't be surprising, given how much of the news about Iran focuses on the nuclear issue. The other non-head of government on the right-hand side is Mohammad Javad Zarif, Iran's foreign minister.
Note the name sticking out on the left: Mohammad-Ali Najafi. His low eigenvector centrality suggests that he is less influential than the names on the right, but nevertheless mentioned almost as frequently along with different individuals who are otherwise not mentioned together. In fact, Najafi is the head of Iran's Cultural Heritage and Tourism Organization, and has been involved with (and quoted on) Iran's new openness to tourists; it makes sense that the network would show him as more connected to the outside world.
There may be a lot going on in the lower left-hand portion, which isn't visible at the current scale. So let's zoom in on it:
fig = plt.figure(figsize=(20,12))
ax = fig.add_subplot(111)
for name in eigen_centralities.keys():
    x = eigen_centralities[name]
    y = between_centralities[name]
    if x < 0.25 and y < 0.2 and y > 0:
        ax.text(x, y, name, fontdict={"size": 16})
ax.set_xlim(0, 0.2)
ax.set_ylim(0, 0.16)
ax.set_xlabel("Eigenvector Centrality", size=20)
ax.set_ylabel("Betweenness Centrality", size=20)
<matplotlib.text.Text at 0x109202c10>
The names that stand out here are the second tier of diplomatic figures: John Kerry, his EU counterpart Catherine Ashton, and Yukiya Amano, head of the International Atomic Energy Agency (again reflecting the salience of the nuclear issue).
Note that Ali Khamenei, the Supreme Leader of Iran, is also in this second tier. This highlights that although he is the ultimate authority over the ongoing negotiations, his name comes up in 'lesser' contexts, as he isn't as publicly active a participant. It's also a good reminder of the limits of some network metrics -- the fact that Khamenei doesn't emerge as highly influential within media reports doesn't mean he has less power.
Let's zoom in one more step, since it highlights some of the other issues to be aware of when using GKG:
fig = plt.figure(figsize=(20,12))
ax = fig.add_subplot(111)
for name in eigen_centralities.keys():
    x = eigen_centralities[name]
    y = between_centralities[name]
    if x < 0.05 and y < 0.02 and y > 0:
        ax.text(x, y, name, fontdict={"size": 16})
ax.set_xlim(0, 0.05)
ax.set_ylim(0, 0.02)
ax.set_xlabel("Eigenvector Centrality", size=20)
ax.set_ylabel("Betweenness Centrality", size=20)
<matplotlib.text.Text at 0x108fc7c90>
Here we start to see some more issues. Notice on the left-hand side 'hassan rohani' and 'ali hoseini-khamenei', alternate spellings of the names of the President and Supreme Leader. This kind of issue is likely to crop up a lot, particularly with non-Western names. Ideally, you can identify these spellings in advance and consolidate them, so that each individual is referred to only by one name. Note also that 'Shia Islam' is being miscoded as an individual.
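One simple mitigation is a hand-maintained alias table that maps known alternate spellings onto a canonical form before counting dyads. A minimal sketch -- the entries here are illustrative; in practice you'd build the table by inspecting the spellings that actually appear in your data:

```python
# Illustrative alias table mapping alternate spellings to canonical names:
ALIASES = {
    "hassan rohani": "hassan rouhani",
    "ali hoseini-khamenei": "ali khamenei",
}

def canonical(name):
    """Map a name onto its canonical spelling, if we know one."""
    return ALIASES.get(name, name)

actors = ["hassan rohani", "john kerry", "ali hoseini-khamenei"]
print([canonical(a) for a in actors])
```

Running each extracted actor list through canonical(...) before building dyads would merge the duplicate nodes into one.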
GKG isn't perfect (yet), but it's an incredibly powerful new tool. This is just a simple example of what might be done with it, particularly once the full dataset is released and we have more than one month to work with. With enough data, we'll be able to study not just static networks, but how they change over time. Who is becoming more prominent, and who is receding? What themes are associated with which individuals? With which cities, or countries? And can we understand (and ultimately, forecast) how these will change?
# Style the code using CSS shamelessly lifted from Bayesian Methods for Hackers
# https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
from IPython.core.display import HTML
styles = open("Style.css").read()
HTML(styles)