Getting Started with ICEWS in Python

David Masad

Department of Computational Social Science, George Mason University

The Integrated Crisis Early Warning System (ICEWS) is a machine-coded event dataset developed by Lockheed Martin and others for DARPA and the Office of Naval Research. For a long time, ICEWS was available only within the Department of Defense, and to a few select academics. Now, for the first time, a checkpointed version of ICEWS is being released to the general public (or, at least, the parts of the general public that care about political event data).

Unlike some event datasets, the public version of ICEWS will only be updated annually or so, but it still includes almost 20 years' worth of event data that has been used successfully in both government and academic research.

This document is mostly a cleaned-up version of my own initial exploration of the dataset. Hopefully it'll prove useful to others who want to use ICEWS in their own research.

UPDATE (03/29/15): Jennifer Lautenschlager, from the ICEWS team at Lockheed Martin, was kind enough to provide some clarifications, which I've added.

Technical note

This is done in Python 3.4.2, with pandas version 0.15.2. The only requirement that might be tricky to install is Basemap, which is only used for the mapping section. You won't miss much without it.

In [1]:
import os
from collections import defaultdict

# Other libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# Show plots inline
%matplotlib inline

Downloading the data

The data is available via the Harvard Dataverse, at http://thedata.harvard.edu/dvn/dv/icews. The two datasets I use are the ICEWS Coded Event Data and the Ground Truth Data Set. The easiest way to download both is to go to the Data & Analysis tab, click Select all files at the top, and then Download Selected Files.

The ICEWS event data comes as one file per year, initially zipped. On OS X or Linux, you can unzip all the files in a directory at once from the terminal with:

$ unzip "*.zip"

And you can delete all the zipped files with

$ rm *.zip

In this document, I assume that all the annual data files, as well as the one Ground Truth data file, are in the same directory.
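If you'd rather stay in Python (for example on Windows, where the shell commands above may not be available), the same unzip step can be sketched with the standard library's zipfile module. The helper name unzip_all is my own invention, and the sketch assumes the archives sit directly in the data directory:

```python
import glob
import os
import zipfile


def unzip_all(data_dir):
    """Extract every .zip archive in data_dir into data_dir.

    Returns the list of archive paths that were found.
    """
    archives = sorted(glob.glob(os.path.join(data_dir, "*.zip")))
    for path in archives:
        with zipfile.ZipFile(path) as zf:
            zf.extractall(data_dir)
    return archives
```

This leaves the original .zip files in place, so deleting them is still a separate step.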

Loading the data

In [2]:
# Path to directory where the data is stored
DATA = "/Users/dmasad/Data/ICEWS/"

For testing purposes, I start by loading a single year into a pandas DataFrame. The data files are tab-delimited, and have the column names as the first row.

In [3]:
one_year = pd.read_csv(DATA + "events.1995.20150313082510.tab", sep="\t")
In [4]:
one_year.head()
Out[4]:
Event ID Event Date Source Name Source Sectors Source Country Event Text CAMEO Code Intensity Target Name Target Sectors Target Country Story ID Sentence Number Publisher City District Province Country Latitude Longitude
0 926685 1995-01-01 Extremist (Russia) Radicals / Extremists / Fundamentalists,Dissident Russian Federation Praise or endorse 51 3.4 Boris Yeltsin Elite,Executive,Executive Office,Government Russian Federation 28235806 5 The Toronto Star Moscow NaN Moskva Russian Federation 55.7522 37.6156
1 926687 1995-01-01 Government (Bosnia and Herzegovina) Government Bosnia and Herzegovina Express intent to cooperate 30 4.0 Citizen (Serbia) General Population / Civilian / Social,Social Serbia 28235807 1 The Toronto Star NaN NaN Bosnia Bosnia and Herzegovina 44.0000 18.0000
2 926686 1995-01-01 Citizen (Serbia) General Population / Civilian / Social,Social Serbia Express intent to cooperate 30 4.0 Government (Bosnia and Herzegovina) Government Bosnia and Herzegovina 28235807 1 The Toronto Star NaN NaN Bosnia Bosnia and Herzegovina 44.0000 18.0000
3 926688 1995-01-01 Canada NaN Canada Praise or endorse 51 3.4 City Mayor (Canada) Government,Local,Municipal Canada 28235809 3 The Toronto Star NaN NaN Ontario Canada 49.2501 -84.4998
4 926689 1995-01-01 Lawyer/Attorney (Canada) Legal,Social Canada Arrest, detain, or charge with legal action 173 -5.0 Police (Canada) Government,Police Canada 28235964 1 The Toronto Star Montreal Montreal Quebec Canada 45.5088 -73.5878
In [5]:
one_year.dtypes
Out[5]:
Event ID             int64
Event Date          object
Source Name         object
Source Sectors      object
Source Country      object
Event Text          object
CAMEO Code           int64
Intensity          float64
Target Name         object
Target Sectors      object
Target Country      object
Story ID             int64
Sentence Number      int64
Publisher           object
City                object
District            object
Province            object
Country             object
Latitude           float64
Longitude          float64
dtype: object

Looks pretty good! Notice that the Event Date column is an object (meaning a string), so when we load in all of the data we should tell pandas to parse it automatically.

Loading all the data

The ICEWS data isn't too big to hold in memory all at once, so I go ahead and load the entire thing. To do it, we'll iterate over all the data files, read each into a DataFrame, and then concatenate them together.

Note that in this code, I added the parse_dates=[1] argument to the .read_csv(...) method, telling pandas to parse the second column as a date.

This code assumes that the ICEWS data files are the only .tab files in your DATA directory. If that isn't the case, adjust as needed.

In [6]:
all_data = []
for f in os.listdir(DATA): # Iterate over all files
    if f[-3:] != "tab":  # Skip non-tab files.
        continue
    df = pd.read_csv(DATA + f, sep='\t', parse_dates=[1])
    all_data.append(df)

data = pd.concat(all_data)
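A slightly more defensive variant of the loading loop, as a sketch: it matches files with glob instead of slicing the filename, sorts them so the load order is the same on every run, refers to the date column by name rather than position, and asks concat for a fresh index so row labels aren't repeated across years. The function name load_icews is hypothetical.

```python
import glob
import os

import pandas as pd


def load_icews(data_dir):
    """Read every .tab file in data_dir into one DataFrame."""
    # sorted() makes the load order deterministic; os.listdir() order is arbitrary
    paths = sorted(glob.glob(os.path.join(data_dir, "*.tab")))
    frames = [pd.read_csv(p, sep="\t", parse_dates=["Event Date"])
              for p in paths]
    # ignore_index=True gives the combined frame a single 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Note that with the original loop, each yearly file keeps its own 0-based index, so the same index label can appear once per year in the combined data.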

Some of the ICEWS column names have spaces in them, which means they can't be referenced using pandas's period notation. To fix this, I rename the columns to replace the spaces with underscores:

In [7]:
cols = {col: col.replace(" ", "_") for col in data.columns}
data.rename(columns=cols, inplace=True)
In [8]:
data.dtypes
Out[8]:
Event_ID                    int64
Event_Date         datetime64[ns]
Source_Name                object
Source_Sectors             object
Source_Country             object
Event_Text                 object
CAMEO_Code                  int64
Intensity                 float64
Target_Name                object
Target_Sectors             object
Target_Country             object
Story_ID                    int64
Sentence_Number             int64
Publisher                  object
City                       object
District                   object
Province                   object
Country                    object
Latitude                  float64
Longitude                 float64
dtype: object
In [9]:
print(data.Event_Date.min())
print(data.Event_Date.max())
1995-01-01 00:00:00
2014-02-28 00:00:00
In [10]:
len(data)
Out[10]:
13514121

Looks good! The data types are what we expect, and the dates seem to have been parsed correctly.

Examining the data

A good initial examination of the data is seeing who the most frequent actors are. The following code counts how often each actor appears as the source or target of an event:

In [11]:
actors_source = data.Source_Name.value_counts()
actors_target = data.Target_Name.value_counts()
actor_counts = pd.DataFrame({"SourceFreq": actors_source,
                             "TargetFreq": actors_target})
actor_counts.fillna(0, inplace=True)
actor_counts["Total"] = actor_counts.SourceFreq + actor_counts.TargetFreq
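To make the counting logic concrete, here is the same pattern on a tiny made-up table (the actor names A through D are placeholders, not ICEWS actors). It also uses sort_values, which to my knowledge is the replacement for DataFrame.sort in pandas releases newer than the 0.15.2 used in this document:

```python
import pandas as pd

# Made-up events, only to illustrate the counting
events = pd.DataFrame({
    "Source_Name": ["A", "A", "B", "A"],
    "Target_Name": ["B", "C", "A", "D"],
})

counts = pd.DataFrame({
    "SourceFreq": events.Source_Name.value_counts(),
    "TargetFreq": events.Target_Name.value_counts(),
})
# Actors that only ever appear on one side come out as NaN; treat that as 0
counts = counts.fillna(0).astype(int)
counts["Total"] = counts.SourceFreq + counts.TargetFreq
counts = counts.sort_values("Total", ascending=False)
```

Actors C and D never appear as a source, which is exactly the case the fillna(0) step handles.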

Now let's look at the top 50 actors. For people like me who are more used to GDELT and Phoenix, the actor list might look a little different than what we expect:

In [12]:
actor_counts.sort("Total", ascending=False, inplace=True)
actor_counts.head(50)
Out[12]:
SourceFreq TargetFreq Total
United States 330446 341603 672049
Russia 195571 260635 456206
China 192944 254747 447691
Israel 116427 150810 267237
Japan 103651 145657 249308
India 96871 147406 244277
Iran 84099 150208 234307
Citizen (India) 65350 136966 202316
United Nations 92022 107413 199435
Unspecified Actor 0 198718 198718
European Union 90824 100310 191134
Iraq 48025 128246 176271
Vladimir Putin 102453 73104 175557
George W. Bush 100763 72909 173672
North Korea 65394 105958 171352
Turkey 64995 104946 169941
South Korea 69509 89696 159205
Pakistan 57450 98419 155869
Police (India) 117726 32771 150497
United Kingdom 65195 81465 146660
Palestinian Territory, Occupied 39861 101029 140890
France 61803 72704 134507
Australia 41393 71031 112424
Afghanistan 31678 75604 107282
North Atlantic Treaty Organization 51437 52715 104152
Syria 33110 64113 97223
Germany 42116 51114 93230
Egypt 34961 47969 82930
Barack Obama 46660 34782 81442
Indonesia 28997 43795 72792
Georgia 26227 45432 71659
Hu Jintao 39679 28938 68617
Citizen (Palestinian Territory, Occupied) 19921 48444 68365
Yasir Arafat 31196 35228 66424
Citizen (Russia) 23473 42401 65874
Thailand 25541 40200 65741
Mahmoud Abbas 34629 29995 64624
Citizen (Australia) 22541 41030 63571
Serbia 21006 41502 62508
Taliban 31678 30688 62366
Government (India) 12117 49629 61746
Israeli Defense Forces 42364 19130 61494
UN Security Council 27184 34210 61394
Ukraine 23290 37906 61196
Taiwan 19421 41052 60473
Kofi Annan 37150 22442 59592
Vietnam 24456 34951 59407
Tony Blair 33083 25713 58796
Lebanon 17041 40094 57135
Mexico 24690 32142 56832

What stood out to me was the mix of country-level actors with named individuals. Unlike event datasets that use CAMEO coding, leaders or sub-state organizations don't seem to be coded as add-ons to a state actor code (e.g. USAGOV) but as separate actors in their own right.

Update (03/29/2015): The _Sectors column contains the role information that would otherwise be contained in the chained CAMEO designations. For example, if you scroll back to the first row of 1995 data, the target name is Boris Yeltsin, and the target sectors associated with him are "Elite,Executive,Executive Office,Government".

The Citizen (Country) actor stood out to me in particular, especially since it isn't mentioned specifically in the included documentation -- so let's take a look at some of the rows that use it:

In [13]:
data[data.Source_Name=="Citizen (India)"].head()
Out[13]:
Event_ID Event_Date Source_Name Source_Sectors Source_Country Event_Text CAMEO_Code Intensity Target_Name Target_Sectors Target_Country Story_ID Sentence_Number Publisher City District Province Country Latitude Longitude
676 927826 1995-01-11 Citizen (India) General Population / Civilian / Social,Social India Reject proposal to meet, discuss, or negotiate 125 -5.0 Narasimha Rao Executive Office,Executive,Government India 28239021 2 The Associated Press Political Service NaN NaN State of Tamil Nadu India 11.0000 78.0000
783 927996 1995-01-12 Citizen (India) General Population / Civilian / Social,Social India Express intent to meet or negotiate 36 4.0 United States NaN United States 28239081 1 The Associated Press Political Service Swanton Saline County Nebraska United States 40.3781 -97.0728
2151 2547954 1995-01-26 Citizen (India) Social,General Population / Civilian / Social India Demonstrate or rally 141 -6.5 Unspecified Actor Unspecified NaN 28242253 2 Reuters News Jammu NaN State of Jammu and Kashmir India 32.7357 74.8691
5760 935070 1995-03-06 Citizen (India) Social,General Population / Civilian / Social India Kill by physical assault 1823 -10.0 Congress Party (National) Major Party,Government Major Party ... India 28915932 4 The Associated Press Political Service Hyderabad Hyderabad State of Andhra Pradesh India 17.3841 78.4564
5766 935081 1995-03-06 Citizen (India) Social,General Population / Civilian / Social India Use unconventional violence 180 -9.0 Militant (India) Unidentified Forces India 28915955 6 The Associated Press Political Service New Delhi NaN National Capital Territory of Delhi India 28.6358 77.2244

So it looks like Citizen really means civilians, or possibly civil society actors unaffiliated with any organization the ICEWS coding system recognizes.

Update (03/29/2015): I had trouble finding news events that corresponded to the events above, but Jennifer Lautenschlager pointed me to this news article that indicates that there was election violence in India in that time frame.

To get country-level actors comparable to other event datasets, it looks like we need to use the source and target country columns:

In [14]:
country_source = data.Source_Country.value_counts()
country_target = data.Target_Country.value_counts()
country_counts = pd.DataFrame({"SourceFreq": country_source,
                             "TargetFreq": country_target})
country_counts.fillna(0, inplace=True)
country_counts["Total"] = country_counts.SourceFreq + country_counts.TargetFreq
In [15]:
country_counts.sort("Total", ascending=False, inplace=True)
country_counts.head(10)
Out[15]:
                                SourceFreq  TargetFreq    Total
United States                       997696      803460  1801156
India                               773712      760583  1534295
Russian Federation                  746829      706700  1453529
China                               541432      525955  1067387
Japan                               344413      332380   676793
Australia                           340339      320329   660668
Israel                              338118      315501   653619
United Kingdom                      331735      302389   634124
Occupied Palestinian Territory      251678      317883   569561
Iran                                286274      283276   569550

This looks pretty good too! India seems more represented compared to what I've seen in other datasets, and of course Israel/Palestine maintain their usual place on the event data leaderboard.

Update (03/29/2015): Since the Sectors are also an important way of understanding the relevant data, let's get their frequencies too. Sectors are a bit trickier, since each cell can contain multiple sectors, separated by commas. So we need to loop over each cell, split the sectors mentioned, and count each one.

In [16]:
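# Count source sectors.
# value_counts() yields each distinct sector string and its frequency.
# Note that the loop adds 1 per distinct string, not `count`, so the totals
# tally distinct sector combinations rather than individual events.
source_sectors = defaultdict(int)
source_sector_counts = data.Source_Sectors.value_counts()
for sectors, count in source_sector_counts.iteritems():
    sectors = sectors.split(",")
    for sector in sectors:
        source_sectors[sector] += 1

# Count target sectors (same approach as for the sources)
target_sectors = defaultdict(int)
target_sector_counts = data.Target_Sectors.value_counts()
for sectors, count in target_sector_counts.iteritems():
    sectors = sectors.split(",")
    for sector in sectors:
        target_sectors[sector] += 1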
        
# Convert into series
source_sectors = pd.Series(source_sectors)
target_sectors = pd.Series(target_sectors)
# Combine into a dataframe, and fill missing with 0
sector_counts = pd.DataFrame({"SourceFreq": source_sectors,
                              "TargetFreq": target_sectors})

sector_counts.fillna(0, inplace=True)
sector_counts["Total"] = sector_counts.SourceFreq + sector_counts.TargetFreq
In [17]:
sector_counts.sort("Total", ascending=False, inplace=True)
In [18]:
sector_counts.head(10)
Out[18]:
                             SourceFreq  TargetFreq   Total
Government                       176897      138684  315581
Parties                          171411      135383  306794
Ideological                      153750      121842  275592
(National) Major Party           134134      106262  240396
Executive                        129382      103265  232647
Elite                             92926       80163  173089
Legislative / Parliamentary       63654       45670  109324
Executive Office                  54710       49770  104480
Cabinet                           57273       42038   99311
Center Left                       48678       37593   86271
In [19]:
sector_counts.tail(10)
Out[19]:
                            SourceFreq  TargetFreq  Total
International Exiles                 2           1      3
Bedouin                              2           0      2
Nepali-Pahari                        1           1      2
Western                              1           1      2
Navy Headquarters                    1           1      2
Army Education / Training            0           1      1
Unspecified                          0           1      1
Consumer Services MNCs               1           0      1
State-Owned Consumer Goods           1           0      1
Saharan                              1           0      1

In addition to CAMEO-type actor designations (e.g. Government) it looks like some of the Sectors also resemble the Issues in Phoenix, or Themes in the GDELT GKG.
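One caveat about the sector tallies: the counting loop adds 1 for each distinct combination of sectors, so a sector's total reflects how many different combinations it appears in, not how many events involve it. If event-level frequencies are what you want, a minimal sketch looks like this. It assumes a pandas version recent enough to have Series.explode, which the 0.15.2 used here does not, and the sector strings are made up for illustration:

```python
import pandas as pd

# One made-up sector string per event
sectors = pd.Series([
    "Government,Executive",
    "Government",
    None,                      # actors without sectors are stored as NaN
    "Government,Executive",
])

per_event = (sectors.dropna()        # skip events with no sector information
                    .str.split(",")  # "A,B" -> ["A", "B"]
                    .explode()       # one row per (event, sector) pair
                    .value_counts())
```

With the original loop, the equivalent change would be to add `count` instead of 1 on each pass.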

Daily Event Counts

An easy way to get an idea of whether there were significant changes in data collection over time is to look at total events over time. ICEWS events have the full daily date only, so let's go with that and look at daily events.

In [20]:
daily_events = data.groupby("Event_Date").aggregate(len)["Event_ID"]
In [21]:
daily_events.plot(color='k', lw=0.2, figsize=(12,6), 
                  title="ICEWS Daily Event Count")
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1044704e0>

There seems to be a definite ramp-up period from 1995 through 1999 or so, and some sort of fall in event volume around 2009. Notice that there are also a few individual days, especially around 2004, with very few events for some reason.

Update (03/29/2015): Jennifer Lautenschlager clarified that the jumps in the 1995-2001 period reflect publishers entering incrementally into the commercial data system that feeds into ICEWS. The post-2008 dip reflects a decline in number of stories overall, possibly driven by budget cuts due to the recession.

Since each event has an associated Story ID, we can count how many unique stories are processed by ICEWS every day and end up generating events.

In [22]:
daily_stories = data.groupby("Event_Date").aggregate(pd.Series.nunique)["Story_ID"]
In [23]:
daily_stories.plot(color='k', lw=0.2, figsize=(12,6), 
                   title="ICEWS Daily Story Count")
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a31dfd0>

With these two series, we can measure the daily average events generated per story:

In [24]:
events_per_story = daily_events / daily_stories
events_per_story.plot(color='k', lw=0.2, figsize=(12,6), 
                      title="ICEWS Daily Events Per Story")
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a317080>

This confirms that indeed, except for a few anomalies, the number of events generated per story stays relatively consistent over time. Nevertheless, it's probably important to at least try to distinguish between fewer stories as caused by fewer newsworthy events, and fewer stories as caused by fewer journalists writing them.
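The three daily series can be reproduced on a toy table to check the arithmetic. This sketch uses groupby's size() and nunique(), which should give the same numbers as aggregate(len) and aggregate(pd.Series.nunique) while only touching the columns that are needed. The dates and story IDs are made up:

```python
import pandas as pd

events = pd.DataFrame({
    "Event_Date": pd.to_datetime(["2001-01-01", "2001-01-01",
                                  "2001-01-01", "2001-01-02"]),
    "Story_ID": [10, 10, 11, 12],
})

by_day = events.groupby("Event_Date")
daily_events = by_day.size()                   # events per day
daily_stories = by_day["Story_ID"].nunique()   # distinct stories per day
events_per_story = daily_events / daily_stories
```

On January 1 there are three events from two stories, giving 1.5 events per story; on January 2 a single story produces a single event.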

Mapping

Another good way to get an idea of the dataset's coverage is to put the events on a map. To do that, let's group the data by the latitude and longitude for each event, and count the number of events at each point. Then we can put those points on a world map using the basemap package.

In [25]:
points = data.groupby(["Latitude", "Longitude"]).aggregate(len)["Event_ID"]
points = points.reset_index()

Nobody will be surprised that the distribution of events-per-point is very long-tailed, with many points having only a small number of events, and a small number of points having hundreds of thousands of events.

In [26]:
points.Event_ID.hist()
plt.yscale('log')

So the best way to deal with this is to plot point size based on the log of the number of events recorded there.

The following code draws a world map using Basemap's default, built-in map, and then iterates over all the points, putting a dot on the map for each one. Finally, it exports the resulting map to a PNG file.

In [27]:
plt.figure(figsize=(16,16))

# Draw the world map itself
m = Basemap(projection='eck4',lon_0=0,resolution='c')
m.drawcoastlines()
m.fillcontinents()
# draw parallels and meridians.
m.drawparallels(np.arange(-90.,120.,30.))
m.drawmeridians(np.arange(0.,360.,60.))
m.drawmapboundary()
m.drawcountries()
plt.title("ICEWS Total Events", fontsize=24)

# Plot the points
for row in points.iterrows():
    row = row[1]
    lat = row.Latitude
    lon = row.Longitude
    count = np.log10(row.Event_ID + 1) * 2
    x, y = m(lon, lat) # Convert lat-long to plot coordinates
    m.plot(x, y, 'ro', markersize=count, alpha=0.3)
plt.savefig("ICEWS.png", dpi=120, facecolor="#FFFFFF")
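The point aggregation and the log scaling of marker sizes can be checked without Basemap. The sketch below uses made-up coordinates and computes the same size formula for all points at once instead of inside the plotting loop; the column names Events and MarkerSize are my own:

```python
import numpy as np
import pandas as pd

events = pd.DataFrame({
    "Latitude":  [31.5, 31.5, 31.5, 55.75, None],
    "Longitude": [34.47, 34.47, 34.47, 37.62, None],
})

# Rows with missing coordinates are dropped by groupby by default
points = (events.groupby(["Latitude", "Longitude"])
                .size()
                .reset_index(name="Events"))

# Same formula as in the plotting loop: log10(count + 1) * 2
points["MarkerSize"] = np.log10(points.Events + 1) * 2
```

The event with no coordinates simply doesn't appear in points, which is worth keeping in mind when reading the map: it only shows events that ICEWS managed to geolocate.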

This looks... shockingly good to me. A few regions -- particularly the Indian subcontinent, East Asia and South America -- seem much better covered than in some other datasets. US Pacific Command was one of ICEWS's first customers, so it makes sense that its area of responsibility (AOR) would be well covered. Nigeria also seems to be relatively densely covered, though whether this is because of particular attention or simply its population and regional significance isn't clear.

The ICEWS documentation says that purely domestic US events aren't included. This explains why the continental US appears sparser than in some other datasets -- but there are obviously many points still left. Most of these events have at least one foreign actor, and apparently very few purely domestic events slip past the filters ICEWS has in place.

The Israeli-Palestinian Dyad

Back to everyone's favorite dyad, which has had more than its fair share of event data analysis pointed at it. Let's subset all events originating from Israel and targeting what ICEWS codes as the Occupied Palestinian Territory, and vice-versa.

In [28]:
dyad = ["Israel", "Occupied Palestinian Territory"]
ilpalcon = data[(data.Source_Country.isin(dyad)) & 
                (data.Target_Country.isin(dyad))]
In [29]:
ilpalcon.head()
Out[29]:
Event_ID Event_Date Source_Name Source_Sectors Source_Country Event_Text CAMEO_Code Intensity Target_Name Target_Sectors Target_Country Story_ID Sentence_Number Publisher City District Province Country Latitude Longitude
50 926771 1995-01-03 Citizen (Palestinian Territory, Occupied) Social,General Population / Civilian / Social Occupied Palestinian Territory Criticize or denounce 111 -2.0 Yitzhak Rabin Government,State-Owned Defense / Security,Exec... Israel 28235898 4 The Toronto Star Bethlehem NaN West Bank Occupied Palestinian Territory 31.7049 35.2038
132 926876 1995-01-03 Yitzhak Rabin Government,State-Owned Defense / Security,Exec... Israel Make statement 10 0.0 Government (Israel) Government Israel 28242652 3 The Wall Street Journal Europe Jerusalem NaN Jerusalem District Israel 31.7690 35.2163
133 926877 1995-01-03 Cabinet / Council of Ministers / Advisors (Isr... Government,Cabinet,Executive Israel Praise or endorse 51 3.4 Yitzhak Rabin Government,State-Owned Defense / Security,Exec... Israel 28242652 4 The Wall Street Journal Europe Jerusalem NaN Jerusalem District Israel 31.7690 35.2163
161 926955 1995-01-04 Yasir Arafat Executive Office,Government,Executive Occupied Palestinian Territory Engage in diplomatic cooperation 50 3.5 Israel NaN Israel 28241261 3 The Christian Science Monitor Gaza NaN Gaza Strip Occupied Palestinian Territory 31.5000 34.4667
162 926956 1995-01-04 Yasir Arafat Executive Office,Government,Executive Occupied Palestinian Territory Engage in diplomatic cooperation 50 3.5 Hamas Parties,(National) Major Party,Dissident,Nongo... Occupied Palestinian Territory 28241261 3 The Christian Science Monitor Gaza NaN Gaza Strip Occupied Palestinian Territory 31.5000 34.4667

Unlike GDELT and Phoenix, ICEWS doesn't include a quad/penta-code categorizing events into broadly cooperative or conflict actions (though you can create them yourself using the ICEWS CAMEO code, e.g. as described in the Phoenix documentation). Instead, it provides an Intensity score -- positive intensity indicates positive events (providing assistance, etc.) while negative scores indicate conflict (criticism, physical attacks). Taking the average intensity for some period of time should provide a rough estimate of each side's posture towards the other.

Let's break down the subset further, one for Israeli-initiated actions and one for Palestinian-initiated ones. That will give us a rough estimate of reciprocity -- is one side behaving more peacefully towards the other, or are their actions relatively mirrored?

First, we select Israel-initiated events, and get the mean intensity by day.

In [30]:
il_initiated = ilpalcon[ilpalcon.Source_Country=="Israel"]
il_initiated = il_initiated.groupby("Event_Date")
il_initiated = il_initiated.aggregate(np.mean)["Intensity"]
In [31]:
il_initiated.plot()
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x112ee3630>

It looks like daily events are too noisy to give us a good picture of what's going on. So let's use pandas's rolling mean tool to see the average intensity across a 30-day window:

In [32]:
pd.rolling_mean(il_initiated, 30).plot()
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x26d873908>

Notice the sharp drop that occurs in late 2000, marking the beginning of the Second Intifada.
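Two notes on the rolling mean, with a sketch on made-up numbers. First, pd.rolling_mean matches the pandas 0.15.2 used here; in later releases, to my knowledge, the same operation is spelled series.rolling(...).mean(). Second, an integer window counts rows, not calendar days, so if some days have no events at all, a 30-row window covers more than 30 days. A time-based window such as "30D" avoids that. The example uses a 2-step window to keep it small:

```python
import pandas as pd

# Made-up daily mean intensities, with no events on January 3 and 4
intensity = pd.Series(
    [-2.0, 0.0, 4.0],
    index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-05"]),
)

by_rows = intensity.rolling(window=2).mean()   # last 2 observations
by_days = intensity.rolling("2D").mean()       # last 2 calendar days
```

Here by_rows averages January 2 and January 5 together for the last value, while by_days only sees January 5 in its window.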

Now let's get the same dataset for Palestinian-initiated actions. This time, I chain the pandas operations together using the '\' line-continuation character, which lets a single statement be split across multiple lines for legibility:

In [33]:
pal_initiated = ilpalcon[ilpalcon.Source_Country=="Occupied Palestinian Territory"] \
                        .groupby("Event_Date") \
                        .aggregate(np.mean) \
                        ["Intensity"]

Next, combining the two mean intensity series into a single dataframe:

In [34]:
df = pd.DataFrame({"IL_Initiated": pd.rolling_mean(il_initiated, 30),
                   "PAL_Initiated": pd.rolling_mean(pal_initiated, 30)})

And now we can plot the mean intensity of actions initiated by each side.

In [35]:
fig, ax = plt.subplots(figsize=(12,6))
df.plot(ax=ax)
ax.set_ylabel("Mean Intensity Coding")
Out[35]:
<matplotlib.text.Text at 0x26db48550>

Not too surprisingly, they seem to overlap almost perfectly. There are a few points that stand out where the lines diverge significantly -- in a more in-depth analysis, they might warrant further examination to see whether they represent something interesting happening on the ground, or just a blip in the data collection.

We can correlate the series, and see that they do indeed track each other pretty closely (though not as perfectly as they may look on visual examination):

In [36]:
df.corr()
Out[36]:
               IL_Initiated  PAL_Initiated
IL_Initiated       1.000000       0.802497
PAL_Initiated      0.802497       1.000000

Ground Truth Dataset

One of ICEWS's biggest advantages is that it includes not only machine-coded event data, but hand-validated ground truth data on whether, on a monthly basis, each country is experiencing one of several types of conflict events.

Let's load it and take a look:

In [37]:
ground_truth = pd.read_csv(DATA + "gtds_2001.to.feb.2014.csv")
In [38]:
ground_truth.head()
Out[38]:
ccode country year month time ins reb dpc erv ic notes coder insnotes dpcnotes rebnotes ervnotes icnotes
0 20 CANADA 2001 1 2001m1 0 0 0 0 0 NaN Bentley & Leonard NaN NaN NaN NaN NaN
1 20 CANADA 2001 2 2001m2 0 0 0 0 0 NaN Bentley & Leonard NaN NaN NaN NaN NaN
2 20 CANADA 2001 3 2001m3 0 0 0 0 0 NaN Bentley & Leonard NaN NaN NaN NaN NaN
3 20 CANADA 2001 4 2001m4 0 0 0 0 0 NaN Bentley & Leonard NaN NaN NaN NaN NaN
4 20 CANADA 2001 5 2001m5 0 0 0 0 0 NaN Bentley & Leonard NaN NaN NaN NaN NaN

The columns ins to ic are 1 if the country experienced that event during that month, and 0 otherwise. They are:

  • ins: Insurgency
  • reb: Rebellion
  • dpc: Domestic political crisis
  • erv: Ethnic or religious violence
  • ic: International conflict

For more details, see the GTDS documentation.

In [39]:
# Convert the 'time' column to datetime:
ground_truth["time"] = pd.to_datetime(ground_truth.time, format="%Ym%m")

We can do some simple analysis on the ground truth dataset alone, for example see how many insurgencies are going on in the world on a month-by-month basis:

In [40]:
insurgency_count = ground_truth.groupby("time").aggregate(sum)["ins"]
In [41]:
insurgency_count.plot()
plt.ylabel("# of countries")
plt.title("Number of countries experiencing insurgencies")
Out[41]:
<matplotlib.text.Text at 0x26d89afd0>

Combining the event data with ground truth

The real advantage that the ground truth data gives us is being able to combine it with the machine-coded event data for analysis and ultimately prediction.

In this example, I'm going to do a very simple analysis, and try and see whether countries experiencing one of the conflicts measured by the GTDS generate more events, and events of lower intensity.

First, we count how many 'bad things' are happening per country-month:

In [42]:
ground_truth["Conflict"] = 0
for col in ["ins", "reb", "dpc", "erv", "ic"]:
    ground_truth.Conflict += ground_truth[col]
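The same per-row count can also be written as a single row-wise sum. A small sketch on made-up indicator values, using the GTDS column names:

```python
import pandas as pd

# Three made-up country-months of 0/1 conflict indicators
gt = pd.DataFrame({
    "ins": [0, 1, 1],
    "reb": [0, 0, 1],
    "dpc": [0, 1, 0],
    "erv": [0, 0, 0],
    "ic":  [0, 0, 1],
})

conflict_cols = ["ins", "reb", "dpc", "erv", "ic"]
# axis=1 sums across the columns of each row
gt["Conflict"] = gt[conflict_cols].sum(axis=1)
```

Either way, the result is a count between 0 and 5 of how many conflict types are active in each country-month.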

All we care about for now is the country, the month, and the count of conflict types:

In [43]:
monthly_conflict = ground_truth[["time", "country", "Conflict"]]
In [44]:
monthly_conflict.head()
Out[44]:
        time country  Conflict
0 2001-01-01  CANADA         0
1 2001-02-01  CANADA         0
2 2001-03-01  CANADA         0
3 2001-04-01  CANADA         0
4 2001-05-01  CANADA         0

Now let's go back to the ICEWS event data, and aggregate it on a country-month basis too. For purposes of this analysis, I'll associate events with the country that ICEWS places them in, rather than the source or target country.

I'll collect two measures: how many events were generated per country-month, and what their average intensity was.

ICEWS events are on a daily basis, so we need to associate a year-month with each event. A pandas datetime always carries a day, so the simplest way to line the two datasets up is to represent each month by its first day. Notice that the ground truth dates were already converted to the first day of the relevant month. We'll do the same for the ICEWS events:

In [45]:
get_month = lambda x: pd.datetime(x.year, x.month, 1)
data["YearMonth"] = data.Event_Date.apply(get_month)
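Applying a Python lambda to 13 million rows is slow, and as far as I know pd.datetime is no longer available in recent pandas releases. A vectorized sketch of the same first-of-the-month truncation, assuming a pandas version that has the .dt accessor methods used below, and with made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2001-01-15", "2001-01-31", "2001-02-01"]))

# Convert each date to a monthly period, then back to a timestamp,
# which lands on the first day of that month
year_month = dates.dt.to_period("M").dt.to_timestamp()
```

Both January dates end up as 2001-01-01, which is the same value the ground truth 'time' column holds for that month.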

Now we'll group the data by country and month (really, first-day-of-the-month) and get the number and mean intensity of events for each.

In [46]:
monthly_grouped = data.groupby(["YearMonth", "Country"])
monthly_counts = monthly_grouped.aggregate(len)["Event_ID"]
monthly_intensity = monthly_grouped.aggregate(np.mean)["Intensity"]

And combine these series into a single DataFrame:

In [47]:
monthly_events = pd.DataFrame({"EventCounts": monthly_counts,
                               "MeanIntensity": monthly_intensity})
monthly_events.reset_index(inplace=True)
In [48]:
monthly_events.head()
Out[48]:
   YearMonth      Country  EventCounts  MeanIntensity
0 1995-01-01  Afghanistan            5       1.640000
1 1995-01-01      Albania           10       0.800000
2 1995-01-01      Algeria           25      -3.064000
3 1995-01-01       Angola            3      -6.333333
4 1995-01-01    Argentina           18       1.022222

So this is fun: country names in the ICEWS event dataset are written with only the first letters capitalized, but the GTDS country names are in ALL CAPS. We need to convert one to the other in order to be able to match them -- and making country names all-caps is easier than dealing with title-casing multi-word all-cap country names.

In [49]:
capitalize = lambda x: x.upper()
monthly_events["Country"] = monthly_events.Country.apply(capitalize)

Now that we've done that, we can merge the dataframes on month and country name. The merge includes all the columns from both dataframes by default, so we need to only keep the ones we're interested in:

In [50]:
monthly_data = monthly_conflict.merge(monthly_events,
                       left_on=["time","country"], right_on=["YearMonth", "Country"])
monthly_data = monthly_data[["YearMonth", "Country", "Conflict", "EventCounts", "MeanIntensity"]]
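One thing to watch with this merge: it is an inner join by default, so any country whose name is spelled differently in the two datasets silently disappears. A sketch of how to check for that, using a left join and the indicator option on made-up data (EXAMPLELAND is a hypothetical country, included only to show a mismatch):

```python
import pandas as pd

conflict = pd.DataFrame({
    "time": pd.to_datetime(["2001-01-01", "2001-01-01"]),
    "country": ["CANADA", "EXAMPLELAND"],
    "Conflict": [0, 1],
})
events = pd.DataFrame({
    "YearMonth": pd.to_datetime(["2001-01-01", "2001-01-01"]),
    "Country": ["Canada", "Republic of Exampleland"],
    "EventCounts": [18, 5],
})
events["Country"] = events.Country.str.upper()

# how="left" keeps every ground-truth row; indicator=True adds a "_merge"
# column saying whether a matching event row was found
merged = conflict.merge(events, how="left",
                        left_on=["time", "country"],
                        right_on=["YearMonth", "Country"],
                        indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].unique()
```

Anything listed in unmatched is a ground-truth country with no event rows under that exact name, and would need a manual name mapping before the analysis.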
In [51]:
monthly_data.head()
Out[51]:
YearMonth Country Conflict