Introduction to the Global Knowledge Graph

David Masad

david.masad[at]gmail[dot]com / @badnetworker

Department of Computational Social Science, George Mason University

The Global Knowledge Graph (GKG) is a companion product to GDELT, created by Kalev Leetaru and generated from the same firehose of daily media reports as GDELT. Instead of events, GKG focuses on entities: people, places, organizations, even themes. GKG also computes tonal data for articles, estimating what fraction of the source text contains 'positive' or 'negative' words, as well as 'active' words and names of entities. Finally, it also contains counts: number of people affected by a hurricane, for example, or participating in a demonstration.

Each row of GKG data contains all the entities and themes extracted from some documents, along with the tonal data for them. Essentially, you can think of GKG as an extremely multipartite graph. You can use it to see which individuals are mentioned together -- or which individuals are linked by theme. Or which themes are associated with different organizations. And how these change over time. The possibilities are massive -- so let's get started!

NOTE: GKG is still considered to be an experimental Alpha-release product. The information in this tutorial (and in GKG itself) is likely to change substantially between now and when it is considered finalized.

Getting the data

GKG comes in two sets of files: count files, which contain only the event-linked counts and some supporting information, and the full GKG files, which add all the entity information. The full files are the ones we'll focus on for now.

Here is some quick and dirty code that will download all the GKG data currently available. Note that as of now, the data only goes back to the beginning of October 2013 and is updated (mostly) daily. Eventually, there will be data available all the way back to 1979.

Start running the code below -- and if you get bored waiting for it to download, skip below to read about the data generating process.

In [2]:
import os
import datetime as dt
import time
import io

import numpy as np
import requests
In [3]:
URL = "http://gdelt.utdallas.edu/data/gkg/" # The GKG data directory
PATH = "/Users/dmasad/Data/GDELT/GKG/" # The local directory to store the data
In [ ]:
# Specify the start and end date for your data
start_date = dt.datetime(2013, 10, 1) # The walkthrough below opens the October 1 file
end_date = dt.datetime.today()
date = start_date

# For each date in between, download the corresponding file
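while date <= end_date:
    filename = date.strftime("%Y%m%d") + ".gkg.csv.zip"
    # stream=True fetches the body in chunks instead of all at once
    req = requests.get(URL + filename, stream=True)
    # Only save successful responses, so that a missing day doesn't
    # leave an error page on disk disguised as a zip file
    if req.status_code == 200:
        with io.open(PATH + filename, "wb") as dl:
            for chunk in req.iter_content(chunk_size=1024):
                if chunk:
                    dl.write(chunk)
    time.sleep(30) # Be nice and don't overload the server.
    date += dt.timedelta(days=1)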

When all the downloads are done, unzip the files. If you're on a Mac or Linux machine, you can unzip them all at once from the command line by going to the directory where the files are and running:

> unzip \*.zip

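If you're on Windows, or would rather stay in Python, the standard library's zipfile module can do the same job. The following is a minimal sketch rather than part of the original workflow: the helper name unzip_all is made up for this example, and it is written for Python 3 (where the exception is called zipfile.BadZipFile).

```python
import glob
import os
import zipfile


def unzip_all(directory):
    """Extract every .zip archive in `directory` into that same directory.

    Returns the list of archives that were extracted successfully.
    """
    extracted = []
    for archive in sorted(glob.glob(os.path.join(directory, "*.zip"))):
        try:
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(directory)
        except zipfile.BadZipFile:
            # Skip corrupt or incomplete downloads instead of stopping
            print("Skipping unreadable archive: " + archive)
            continue
        extracted.append(archive)
    return extracted
```

Calling unzip_all(PATH) would then unpack everything the download loop saved.
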
The Data-Generating Process

GKG is generated from the same set of documents as GDELT -- mostly English-language news sources, plus reports provided by BBC Monitoring and (experimentally) machine-translated foreign-language sources.

The GKG parser extracts the entity names and themes (via keyword-matching) that appear in each document. It then groups them by namesets -- unique combinations of entities and themes. For example, all the documents mentioning Barack Obama, John Kerry and Hassan Rouhani (and matching themes and entities) will be grouped together, and separately from articles mentioning Obama, Kerry, Rouhani AND Benjamin Netanyahu. Each daily nameset is one row in GKG.
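
To make the idea of a nameset concrete, here is a toy sketch of the grouping step. It is not the actual GKG parser: the documents list and the EXAMPLE_THEME label are invented for illustration, and the real system certainly does more than this.

```python
from collections import defaultdict

# Hypothetical parser output: one (persons, themes) pair per document
documents = [
    ({"barack obama", "john kerry", "hassan rouhani"}, {"EXAMPLE_THEME"}),
    ({"john kerry", "hassan rouhani", "barack obama"}, {"EXAMPLE_THEME"}),
    ({"barack obama", "john kerry", "hassan rouhani", "benjamin netanyahu"},
     {"EXAMPLE_THEME"}),
]

# A nameset is a unique combination of entities and themes. frozenset
# makes the key hashable and independent of the order of mention.
namesets = defaultdict(int)
for persons, themes in documents:
    key = (frozenset(persons), frozenset(themes))
    namesets[key] += 1

# Number of articles in each nameset, smallest first
article_counts = sorted(namesets.values())
```

The first two documents collapse into a single nameset with two articles, while the one that also mentions Netanyahu forms its own.
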

The GKG parser isn't perfect yet -- for example, it at least occasionally identifies a person named Al Qaeda. Some of these issues will be ironed out in time (remember, alpha release), but any dataset this large is all but guaranteed to have noise.

The Data Format

GKG is stored in uniquely-formatted tab-delimited files, where some columns are small sub-tables. Let's open one up and take a look:

In [4]:
f = open(PATH + "20131001.gkg.csv")

Conveniently, each GKG file has headers included (at least for now) as the first row:

In [5]:
headers = f.readline()
print headers.split("\t")
['DATE', 'NUMARTS', 'COUNTS', 'THEMES', 'LOCATIONS', 'PERSONS', 'ORGANIZATIONS', 'TONE', 'CAMEOEVENTIDS', 'SOURCES', 'SOURCEURLS\n']

Now let's see the actual data. I'm going to cheat a bit here and go to an entry I know is interesting.

In [6]:
f.readline() # Skip a row
row = f.readline()
row = row.split("\t")
for entry in row:
    print entry
20131001
1
KILL#309##1#Pakistan#PK#PK#30#70#PK;WOUND#500##1#Pakistan#PK#PK#30#70#PK
NATURAL_DISASTER;NATURAL_DISASTER_NATURAL_DISASTERS;MEDIA_MSM;NATURAL_DISASTER_EARTHQUAKE;KILL;WOUND;ARMEDCONFLICT;TAX_FNCACT;TAX_FNCACT_BABY;TAX_FNCACT_CHILD;AFFECT;TAX_FNCACT_STUDENTS
1#Egypt#EG#EG#27#30#EG;1#Syria#SY#SY#35#38#SY;4#Awaran, Balochistan, Pakistan#PK#PK02#26.4555#65.2312#-2755131;1#Pakistan#PK#PK#30#70#PK;1#China#CH#CH#35#105#CH;1#Ethiopia#ET#ET#8#38#ET;1#Iran#IR#IR#32#53#IR;1#Sudan#SU#SU#15#30#SU;1#Oman#MU#MU#21#57#MU;1#Afghanistan#AF#AF#33#65#AF
joel osteen;mohammed al balushi
united nations
-1.27388535031847,2.07006369426752,3.34394904458599,5.4140127388535,23.7261146496815,0.318471337579618
269967042,269967401
main.omanobserver.om
http://main.omanobserver.om/?p=17731&c=8oNJ4q9WWDBTCPNQDVVAm5a7sg99Rx0ggsXaeQ4orZI&mkt=en-us

Whoa, what's going on here? Each entry isn't just a straightforward data column -- some are lists of items, separated by different characters.

The first two entries are straightforward -- the date the row is coming from, and the number of articles included in this particular nameset.

The next entry is the COUNTS data. Essentially, this is a small sub-table: its 'rows' are separated by semicolons, and the 'columns' within each row are separated by hash marks. So:

In [7]:
for entry in row[2].split(";"):
    print entry.split("#")
['KILL', '309', '', '1', 'Pakistan', 'PK', 'PK', '30', '70', 'PK']
['WOUND', '500', '', '1', 'Pakistan', 'PK', 'PK', '30', '70', 'PK']

So this nameset deals with an event in Pakistan that is described as having 309 killed and 500 wounded. The first entry is the COUNTTYPE, what's being counted; next is the NUMBER, the actual count; next (missing here) is the OBJECTTYPE, describing who was affected. The remaining fields are geographic: the GEO_TYPE (1 here indicates a country), followed by the GEO_FULLNAME, the GEO_COUNTRYCODE and GEO_ADM1CODE (the FIPS codes for the location), the latitude and longitude (just the centroid of Pakistan, here) and a FEATUREID.
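
If you'd rather refer to these fields by name than by position, a small helper can turn the COUNTS string into a list of dictionaries. This is only a sketch: parse_counts is a made-up name, the GEO_LAT, GEO_LONG and GEO_FEATUREID keys are labels chosen for this example, and it assumes NUMBER is always a whole number, as it is in this row.

```python
COUNT_FIELDS = ["COUNTTYPE", "NUMBER", "OBJECTTYPE", "GEO_TYPE",
                "GEO_FULLNAME", "GEO_COUNTRYCODE", "GEO_ADM1CODE",
                "GEO_LAT", "GEO_LONG", "GEO_FEATUREID"]


def parse_counts(raw):
    """Split a COUNTS entry into one dict per count record."""
    records = []
    for entry in raw.strip().split(";"):
        if not entry:
            continue  # an empty COUNTS column yields no records
        record = dict(zip(COUNT_FIELDS, entry.split("#")))
        record["NUMBER"] = int(record["NUMBER"])
        records.append(record)
    return records


sample = "KILL#309##1#Pakistan#PK#PK#30#70#PK;WOUND#500##1#Pakistan#PK#PK#30#70#PK"
count_records = parse_counts(sample)
```

With this in place, count_records[0]["NUMBER"] gives the number killed as an integer, and the empty OBJECTTYPE comes through as an empty string.
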

The next entry is just a list of THEMES (see the full theme spreadsheet for their interpretation), separated by semicolons.

In [8]:
print row[3].split(";")
['NATURAL_DISASTER', 'NATURAL_DISASTER_NATURAL_DISASTERS', 'MEDIA_MSM', 'NATURAL_DISASTER_EARTHQUAKE', 'KILL', 'WOUND', 'ARMEDCONFLICT', 'TAX_FNCACT', 'TAX_FNCACT_BABY', 'TAX_FNCACT_CHILD', 'AFFECT', 'TAX_FNCACT_STUDENTS']

This suggests that the event in question is an earthquake, though the article also mentions Armed Conflict, and the functional actors (role designations, denoted with the TAX_FNCACT prefix) BABY, CHILD, and STUDENTS.

LOCATIONS are another sub-table, listing all the locations mentioned in the article. Like COUNTS, its 'rows' are split with a semicolon, and 'columns' with a hash-sign.

In [9]:
for entry in row[4].split(";"):
    print entry.split("#")
['1', 'Egypt', 'EG', 'EG', '27', '30', 'EG']
['1', 'Syria', 'SY', 'SY', '35', '38', 'SY']
['4', 'Awaran, Balochistan, Pakistan', 'PK', 'PK02', '26.4555', '65.2312', '-2755131']
['1', 'Pakistan', 'PK', 'PK', '30', '70', 'PK']
['1', 'China', 'CH', 'CH', '35', '105', 'CH']
['1', 'Ethiopia', 'ET', 'ET', '8', '38', 'ET']
['1', 'Iran', 'IR', 'IR', '32', '53', 'IR']
['1', 'Sudan', 'SU', 'SU', '15', '30', 'SU']
['1', 'Oman', 'MU', 'MU', '21', '57', 'MU']
['1', 'Afghanistan', 'AF', 'AF', '33', '65', 'AF']

These follow the same geographic columns as the COUNTS entries above: a location type, full name, FIPS codes, lat-long coordinates, and a feature ID.
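
One thing the location type makes easy is separating whole countries from more specific places. The sketch below assumes only what we've seen so far, namely that a type of 1 marks a country; the Location tuple and its field names are invented for this example, and the sample string is a shortened copy of the row above. It also assumes the coordinates are always numeric, which may not hold for every row.

```python
from collections import namedtuple

Location = namedtuple("Location", ["geo_type", "fullname", "country_code",
                                   "adm1_code", "lat", "lon", "feature_id"])


def parse_locations(raw):
    """Turn a LOCATIONS entry into a list of Location tuples."""
    locations = []
    for entry in raw.strip().split(";"):
        fields = entry.split("#")
        if len(fields) != 7:
            continue  # skip anything that doesn't have all seven columns
        geo_type, fullname, country, adm1, lat, lon, feature_id = fields
        locations.append(Location(int(geo_type), fullname, country, adm1,
                                  float(lat), float(lon), feature_id))
    return locations


sample = ("1#Egypt#EG#EG#27#30#EG;"
          "4#Awaran, Balochistan, Pakistan#PK#PK02#26.4555#65.2312#-2755131;"
          "1#Pakistan#PK#PK#30#70#PK")
locations = parse_locations(sample)

# Keep only the places that are more specific than a whole country
subnational = [loc.fullname for loc in locations if loc.geo_type != 1]
```

Here the only sub-national location is Awaran, which fits the reading that the article is about the earthquake there.
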

PERSONS and ORGANIZATIONS are both simply lists of the people and organizations extracted from the document, split by semicolons.

In [10]:
# PERSONS:
print row[5].split(";")
['joel osteen', 'mohammed al balushi']
In [11]:
# ORGANIZATIONS
print row[6].split(";")
['united nations']

So this article mentions American televangelist Joel Osteen, Omani football/soccer player Mohammed Al Balushi, and the UN. The source link itself is dead, but we know that there was a major earthquake in Pakistan in late September. Maybe this is an article about earthquake relief efforts?

Next we get to the TONE data, which is comma-delimited:

In [12]:
print row[7].split(",")
['-1.27388535031847', '2.07006369426752', '3.34394904458599', '5.4140127388535', '23.7261146496815', '0.318471337579618']

The first value is TONE, the average 'tone' of the article. Tone is measured from -100 to +100, so a value of -1.27 is neutral-leaning-negative. In fact, it is simply the difference between the next two values: Positive Score minus Negative Score, both measured on a 0-100 scale.

Polarity, the next value, is the percent of 'tonal' words in the document; here, only ~5% of words in the text were tonal, suggesting a mostly-neutral document.

Activity Reference Density is the percent of active words in the document (23.73% here), and Self/Group Reference Density is the percent of words that are pronouns (0.32% here) -- extremely low, though the GKG documentation says that this is typical of news media.
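
The arithmetic behind the tone values can be checked directly. In this sketch the dictionary keys are labels made up for the example (only the order of the six values comes from the data), and parse_tone is a hypothetical helper.

```python
TONE_FIELDS = ["tone", "positive", "negative", "polarity",
               "activity_density", "self_group_density"]


def parse_tone(raw):
    """Convert the comma-delimited TONE entry into a dict of floats."""
    values = [float(v) for v in raw.strip().split(",")]
    return dict(zip(TONE_FIELDS, values))


sample = ("-1.27388535031847,2.07006369426752,3.34394904458599,"
          "5.4140127388535,23.7261146496815,0.318471337579618")
tone = parse_tone(sample)

# Tone should equal the positive score minus the negative score
tone_check = tone["positive"] - tone["negative"]

# In this row, at least, polarity matches positive plus negative
polarity_check = tone["positive"] + tone["negative"]
```

Both checks agree with the stored values up to floating-point rounding, so compare them with a small tolerance rather than with ==.
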

After TONE come CAMEOEVENTIDS, which are GDELT GlobalEventID codes that can be used to tie GKG back to GDELT:

In [13]:
print row[8].split(",")
['269967042', '269967401']

Finally comes sourcing information: the SOURCES and SOURCEURLS. This nameset contains only one article, so it includes only one source. If there were more sources, they would be comma-delimited.

In [14]:
# SOURCE:
print row[9]
main.omanobserver.om
In [15]:
#SOURCEURL
print row[10]
http://main.omanobserver.om/?p=17731&c=8oNJ4q9WWDBTCPNQDVVAm5a7sg99Rx0ggsXaeQ4orZI&mkt=en-us
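
Since every file carries its own header row, the columns can also be addressed by name instead of by index. The generator below is a minimal sketch of that idea; read_gkg is a made-up name, and the in-memory sample is an abbreviated stand-in with only three of the real columns.

```python
import io


def read_gkg(handle):
    """Yield each GKG row as a dict keyed by the file's own header names."""
    headers = handle.readline().rstrip("\n").split("\t")
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(headers):
            continue  # skip rows that don't line up with the header
        yield dict(zip(headers, fields))


# Tiny stand-in for a real file, cut down to three columns
sample = io.StringIO(
    u"DATE\tNUMARTS\tPERSONS\n"
    u"20131001\t1\tjoel osteen;mohammed al balushi\n"
)
rows = list(read_gkg(sample))
persons = rows[0]["PERSONS"].split(";")
```

On real data you would pass in an open file handle instead of the sample, and row["PERSONS"] would then play the role of row[5] in the cells above. Because the format may still change during the alpha period, looking columns up by name is a little more robust than hard-coding positions.
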

Example Application: Analyzing Iranian Leadership

Next, I'm going to run through a fairly simple application: analyzing the co-mention network surrounding the Iranian leadership, inspired by Drew Conway's analysis of Chinese leadership. To do this, I'll start with a list of the names of Iran's leaders, taken from the CIA World Factbook, and pull any names that co-appear with them across all the available GKG data. Then I'll create a network of co-mentions and analyze it.

First, I grab the names of the leadership from the CIA World Factbook. (I did this by hand; it's probably possible to scrape the names, but I'd rather not do anything to the CIA website that may be mistaken for 'hacking').

In [16]:
LEADERS = ["Ali Hoseini-KHAMENEI", "Hasan Fereidun RUHANI", "Mohsen HAJI-MIRZAIE", 
    "Mohammad NAHAVANDIAN", "Eshaq JAHANGIRI", "Mohammad SHARIATMADARI", 
    "Elham AMINZADEH", "Mohammad Baqer NOBAKHT", "Majid ANSARI", 
    "Mohammad Baqer NOBAKHT", "Sorena SATARI", "Shahindokht MOLAVERDI", 
    "Ali Akbar SALEHI", "Mohammad Ali NAJAFI", "Masumeh EBTEKAR", 
    "Mohammad Ali SHAHADI", "Mohammad HOJJATI", "Mahmud VAEZI-Jazai", 
    "Ali JANATI", "Hosein DEHQAN", "Ali TAYEBNIA", "Ali Asqar FANI", 
    "Hamid CHITCHIAN", "Mohammad Javad ZARIF-Khonsari", 
    "Seyed Hasan QAZIZADEH-Hashemi", "Mohammad Reza NEMATZADEH", 
    "Mahmud ALAVI, Hojjat ol-Eslam", "Abdolreza Rahmani-FAZLI", 
    "Mostafa PUR-MOHAMMADI", "Ali RABIEI", "Bijan Namdar-ZANGANEH", 
    "Abbas Ahmad AKHUNDI", "Reza FARAJI-DANA", "Valiollah SEIF", 
    "Mohammad KHAZAI-Torshizi"]
# Convert the names to all lower-case
LEADERS = [name.lower() for name in LEADERS]

Next, we iterate over all the GKG files and look for matching names.

In [17]:
entries = []
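for path in os.listdir(PATH):
    if not path.endswith(".csv"): continue
    # The with-block closes each file once we're done with it
    with open(PATH + path) as f:
        for row in f:
            columns = row.split("\t")
            if len(columns) < 6: continue # Skip malformed rows
            actors = columns[5].split(";")
            for actor in actors:
                if actor in LEADERS:
                    entries.append(actors)
                    break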
print len(entries)
966

We want to translate each list of co-appearing names into dyads, and count the number of times each dyad appears. The itertools module in the Standard Library has a combinations(...) function that yields every possible combination of a given size from a list. We'll also import defaultdict to store the dyad-counts in.

In [18]:
import itertools
from collections import defaultdict
In [19]:
dyads = defaultdict(int)
for entry in entries:
    for p1, p2 in itertools.combinations(entry, 2):
        if (p2, p1) in dyads:
            dyads[(p2, p1)] += 1
        else:
            dyads[(p1, p2)] += 1
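
To see what this produces, here is a toy version of the same counting. It is a variation rather than a copy of the cell above: each name list is de-duplicated and sorted first, so every dyad has one canonical order and the reversed-key check isn't needed. The function name and the two toy entries are made up for the example.

```python
import itertools
from collections import Counter


def count_dyads(entries):
    """Count how often each unordered pair of names appears together."""
    dyads = Counter()
    for entry in entries:
        # set() drops repeated names; sorted() fixes the order in each pair
        for pair in itertools.combinations(sorted(set(entry)), 2):
            dyads[pair] += 1
    return dyads


toy_entries = [
    ["hassan rouhani", "john kerry", "barack obama"],
    ["john kerry", "hassan rouhani"],
]
toy_dyads = count_dyads(toy_entries)
```

The Rouhani-Kerry pair appears in both entries and so gets a count of 2, while the two pairs involving Obama each get 1. Note that dropping repeated names is a small behavioral difference from the original cell, which would count a name paired with itself if it were listed twice.
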

We can take a quick histogram to see how many dyads occur in different frequencies.

In [20]:
import matplotlib # Needed for the rcParams settings below
import matplotlib.pyplot as plt
# Some initial styling, to make our graphs look good:
matplotlib.rcParams['axes.facecolor'] = "#eeeeee"
matplotlib.rcParams['axes.grid'] = True
matplotlib.rcParams['xtick.labelsize'] = 14
matplotlib.rcParams['ytick.labelsize'] = 14
In [21]:
fig = plt.figure(figsize=(20,12))
ax = fig.add_subplot(111)
ax.set_yscale('log')
h = ax.hist(dyads.values(), bins=np.linspace(1, 250, 26))

So the vast majority of dyads occur only a small number of times, and a small number occur many more times. We can check just how many dyads occur more than once:

In [22]:
counts = np.array(dyads.values())
print len(counts[counts>1])/(1.0*len(counts))
0.181183974487

Now we can build a network based on the co-mentions, using the NetworkX library. We'll filter out all dyads that occur only once, in order to avoid spurious relationships and get to the core of the network.

In [23]:
import networkx as nx
In [24]:
# Build the graph
G = nx.Graph()
for dyad, count in dyads.iteritems():
    if count > 1:
        G.add_edge(dyad[0], dyad[1], weight=count)

If you want to explore the network visually, the best thing to do is probably to save it to a file (for example, as GraphML) and load it in Gephi or another network analysis tool.

In [25]:
nx.write_graphml(G, "iran.graphml")

However, we can also do some analysis here. NetworkX has a decent drawing ability, so let's visualize the network.

In [26]:
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(111)
pos = nx.spring_layout(G, k=0.2, iterations=25)
nx.draw_networkx_edges(G, pos=pos, ax=ax, edge_color='#eeeeee')
nx.draw_networkx_labels(G, pos=pos, ax=ax, font_size=16)
_ = ax.axis('off')