Getting Started with ICEWS in Python

David Masad

The Integrated Crisis Early Warning System (ICEWS) is a machine-coded event dataset developed by Lockheed Martin and others for DARPA and the Office of Naval Research. For a long time, ICEWS was available only within the Department of Defense, and to a few select academics. Now, for the first time, a checkpointed version of ICEWS is being released to the general public (or, at least, the parts of the general public that care about political event data).

Unlike some event datasets, the public version of ICEWS will only be updated annually or so, but it still includes almost 20 years' worth of event data that has been used successfully in both government and academic research.

This document is mostly a cleaned-up version of my own initial exploration of the dataset. Hopefully it'll prove useful to others who want to use ICEWS in their own research.

UPDATE (03/29/15): Jennifer Lautenschlager, from the ICEWS team at Lockheed Martin, was kind enough to provide some clarifications, which I've added.

Technical note

This is done in Python 3.4.2, with pandas version 0.15.2. The only requirement that might be tricky to install is Basemap, which is only used for the mapping section. You won't miss much without it.

In [1]:

import os
from collections import defaultdict

# Other libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# Show plots inline
%matplotlib inline

Downloading the data

The data is available via the Harvard Dataverse, at http://thedata.harvard.edu/dvn/dv/icews. The two datasets I use are the ICEWS Coded Event Data and the Ground Truth Data Set. The easiest way to download both is to go to the Data & Analysis tab, click Select all files at the top, and then Download Selected Files.

The ICEWS event data comes as one file per year, initially zipped. On OS X or Linux, you can unzip all the files in a directory at once from the terminal with

$ unzip "*.zip"

And you can delete all the zipped files with

$ rm *.zip
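If you'd rather stay in Python (or you're on Windows, where the shell commands above won't work as written), the standard library's zipfile module can do the same job. This is a minimal sketch, not part of the original workflow: `unzip_all` is a hypothetical helper name, and deleting the archives is opt-in.

```python
import zipfile
from pathlib import Path


def unzip_all(directory, delete_archives=False):
    """Extract every .zip file in `directory` into that same directory.

    Returns the sorted list of archive names that were processed.
    """
    directory = Path(directory)
    processed = []
    for archive in sorted(directory.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(str(directory))
        if delete_archives:
            archive.unlink()  # same effect as `rm` on this one file
        processed.append(archive.name)
    return processed
```

Calling `unzip_all("/path/to/ICEWS/", delete_archives=True)` would then mirror the two terminal commands in one step.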

In this document, I assume that all the annual data files, as well as the one Ground Truth data file, are in the same directory.

Loading the data

In [2]:

# Path to directory where the data is stored
DATA = "/Users/dmasad/Data/ICEWS/"

For testing purposes, I start by loading a single year into a pandas DataFrame. The data files are tab-delimited, and have the column names as the first row.

In [3]:

one_year = pd.read_csv(DATA + "events.1995.20150313082510.tab", sep="\t")

In [4]:

one_year.head()

Out[4]:

	Event ID	Event Date	Source Name	Source Sectors	Source Country	Event Text	CAMEO Code	Intensity	Target Name	Target Sectors	Target Country	Story ID	Sentence Number	Publisher	City	District	Province	Country	Latitude	Longitude
0	926685	1995-01-01	Extremist (Russia)	Radicals / Extremists / Fundamentalists,Dissident	Russian Federation	Praise or endorse	51	3.4	Boris Yeltsin	Elite,Executive,Executive Office,Government	Russian Federation	28235806	5	The Toronto Star	Moscow	NaN	Moskva	Russian Federation	55.7522	37.6156
1	926687	1995-01-01	Government (Bosnia and Herzegovina)	Government	Bosnia and Herzegovina	Express intent to cooperate	30	4.0	Citizen (Serbia)	General Population / Civilian / Social,Social	Serbia	28235807	1	The Toronto Star	NaN	NaN	Bosnia	Bosnia and Herzegovina	44.0000	18.0000
2	926686	1995-01-01	Citizen (Serbia)	General Population / Civilian / Social,Social	Serbia	Express intent to cooperate	30	4.0	Government (Bosnia and Herzegovina)	Government	Bosnia and Herzegovina	28235807	1	The Toronto Star	NaN	NaN	Bosnia	Bosnia and Herzegovina	44.0000	18.0000
3	926688	1995-01-01	Canada	NaN	Canada	Praise or endorse	51	3.4	City Mayor (Canada)	Government,Local,Municipal	Canada	28235809	3	The Toronto Star	NaN	NaN	Ontario	Canada	49.2501	-84.4998
4	926689	1995-01-01	Lawyer/Attorney (Canada)	Legal,Social	Canada	Arrest, detain, or charge with legal action	173	-5.0	Police (Canada)	Government,Police	Canada	28235964	1	The Toronto Star	Montreal	Montreal	Quebec	Canada	45.5088	-73.5878

In [5]:

one_year.dtypes

Out[5]:

Event ID             int64
Event Date          object
Source Name         object
Source Sectors      object
Source Country      object
Event Text          object
CAMEO Code           int64
Intensity          float64
Target Name         object
Target Sectors      object
Target Country      object
Story ID             int64
Sentence Number      int64
Publisher           object
City                object
District            object
Province            object
Country             object
Latitude           float64
Longitude          float64
dtype: object

Looks pretty good! Notice that the Event Date column is an object (meaning a string), so when we load in all of the data we should tell pandas to parse it automatically.

Loading all the data

The ICEWS data isn't too big to hold in memory all at once, so I go ahead and load the entire thing. To do it, we'll iterate over all the data files, read each into a DataFrame, and then concatenate them together.

Note that in this code, I added the parse_dates=[1] argument to the .read_csv(...) method, telling pandas to parse the second column as a date.

This code assumes that the ICEWS data files are the only .tab files in your DATA directory. If that isn't the case, adjust as needed.

In [6]:

all_data = []
for f in os.listdir(DATA): # Iterate over all files
    if f[-3:] != "tab":  # Skip non-tab files.
        continue
    df = pd.read_csv(DATA + f, sep='\t', parse_dates=[1])
    all_data.append(df)

data = pd.concat(all_data)
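If your DATA directory does hold other .tab files, one way to adjust is to match on the file name instead of just the extension. The sketch below assumes the annual files all follow the `events.<year>.<timestamp>.tab` naming seen in the 1995 file above; the pattern, the `load_icews_events` name and the choice to reset the index are my assumptions, not anything from the ICEWS documentation.

```python
import glob
import os

import pandas as pd


def load_icews_events(data_dir, pattern="events.*.tab"):
    """Read every file matching `pattern` in `data_dir` into one DataFrame."""
    # Sorting the paths loads the years in a predictable order
    paths = sorted(glob.glob(os.path.join(data_dir, pattern)))
    frames = [pd.read_csv(path, sep="\t", parse_dates=["Event Date"])
              for path in paths]
    # ignore_index=True gives the combined frame one unique 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Note that with `ignore_index=True` the row labels would no longer restart at zero for each year, so they would differ from the labels shown in the outputs in this document.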

Some of the ICEWS column names have spaces in them, which means they can't be referenced using pandas's period notation. To fix this, I rename the columns to replace the spaces with underscores:

In [7]:

cols = {col: col.replace(" ", "_") for col in data.columns}
data.rename(columns=cols, inplace=True)

In [8]:

data.dtypes

Out[8]:

Event_ID                    int64
Event_Date         datetime64[ns]
Source_Name                object
Source_Sectors             object
Source_Country             object
Event_Text                 object
CAMEO_Code                  int64
Intensity                 float64
Target_Name                object
Target_Sectors             object
Target_Country             object
Story_ID                    int64
Sentence_Number             int64
Publisher                  object
City                       object
District                   object
Province                   object
Country                    object
Latitude                  float64
Longitude                 float64
dtype: object

In [9]:

print(data.Event_Date.min())
print(data.Event_Date.max())

1995-01-01 00:00:00
2014-02-28 00:00:00

In [10]:

len(data)

Out[10]:

13514121

Looks good! The data types are what we expect, and the dates seem to have been parsed correctly.

Examining the data

A good initial examination of the data is seeing who the most frequent actors are. The following code counts how often each actor appears as the source or target of an event:

In [11]:

actors_source = data.Source_Name.value_counts()
actors_target = data.Target_Name.value_counts()
actor_counts = pd.DataFrame({"SourceFreq": actors_source,
                             "TargetFreq": actors_target})
actor_counts.fillna(0, inplace=True)
actor_counts["Total"] = actor_counts.SourceFreq + actor_counts.TargetFreq
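Since the same count-then-sort pattern comes up again for countries, it can be wrapped in a small helper. This is only a sketch on toy data with made-up actor names; `source_target_counts` is a hypothetical name. It uses `sort_values`, which replaces the `DataFrame.sort` method in later pandas releases, so it departs from the pandas 0.15.2 code used in this document.

```python
import pandas as pd


def source_target_counts(df, source_col, target_col):
    """Count how often each value appears as a source and as a target."""
    source = df[source_col].value_counts()
    target = df[target_col].value_counts()
    counts = pd.DataFrame({"SourceFreq": source, "TargetFreq": target})
    # Values seen on only one side come out as NaN; treat them as zero
    counts = counts.fillna(0).astype(int)
    counts["Total"] = counts.SourceFreq + counts.TargetFreq
    return counts.sort_values("Total", ascending=False)


# Toy data with made-up actors, purely for illustration
toy = pd.DataFrame({"Source_Name": ["A", "A", "B", "C"],
                    "Target_Name": ["B", "B", "A", "B"]})
toy_counts = source_target_counts(toy, "Source_Name", "Target_Name")
```

On the real data, `source_target_counts(data, "Source_Name", "Target_Name")` should give the same kind of table as the actor counts.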

Now let's look at the top 50 actors. For people like me who are more used to GDELT and Phoenix, the actor list might look a little different than what we expect:

In [12]:

actor_counts.sort("Total", ascending=False, inplace=True)
actor_counts.head(50)

Out[12]:

	SourceFreq	TargetFreq	Total
United States	330446	341603	672049
Russia	195571	260635	456206
China	192944	254747	447691
Israel	116427	150810	267237
Japan	103651	145657	249308
India	96871	147406	244277
Iran	84099	150208	234307
Citizen (India)	65350	136966	202316
United Nations	92022	107413	199435
Unspecified Actor	0	198718	198718
European Union	90824	100310	191134
Iraq	48025	128246	176271
Vladimir Putin	102453	73104	175557
George W. Bush	100763	72909	173672
North Korea	65394	105958	171352
Turkey	64995	104946	169941
South Korea	69509	89696	159205
Pakistan	57450	98419	155869
Police (India)	117726	32771	150497
United Kingdom	65195	81465	146660
Palestinian Territory, Occupied	39861	101029	140890
France	61803	72704	134507
Australia	41393	71031	112424
Afghanistan	31678	75604	107282
North Atlantic Treaty Organization	51437	52715	104152
Syria	33110	64113	97223
Germany	42116	51114	93230
Egypt	34961	47969	82930
Barack Obama	46660	34782	81442
Indonesia	28997	43795	72792
Georgia	26227	45432	71659
Hu Jintao	39679	28938	68617
Citizen (Palestinian Territory, Occupied)	19921	48444	68365
Yasir Arafat	31196	35228	66424
Citizen (Russia)	23473	42401	65874
Thailand	25541	40200	65741
Mahmoud Abbas	34629	29995	64624
Citizen (Australia)	22541	41030	63571
Serbia	21006	41502	62508
Taliban	31678	30688	62366
Government (India)	12117	49629	61746
Israeli Defense Forces	42364	19130	61494
UN Security Council	27184	34210	61394
Ukraine	23290	37906	61196
Taiwan	19421	41052	60473
Kofi Annan	37150	22442	59592
Vietnam	24456	34951	59407
Tony Blair	33083	25713	58796
Lebanon	17041	40094	57135
Mexico	24690	32142	56832

What stood out to me was the mix of country-level actors with named individuals. Unlike event datasets that use CAMEO actor codes, ICEWS doesn't seem to code leaders or sub-state organizations as add-ons to a state actor code (e.g. USAGOV), but as separate actors in their own right.

Update (03/29/2015): The _Sectors column contains the role information that would otherwise be contained in the chained CAMEO designations. For example, if you scroll back to the first row of 1995 data, the target name is Boris Yeltsin, and the target sectors associated with him are "Elite,Executive,Executive Office,Government".

The Citizen (Country) actor stood out to me in particular, especially since it isn't mentioned specifically in the included documentation -- so let's take a look at some of the rows that use it:

In [13]:

data[data.Source_Name=="Citizen (India)"].head()

Out[13]:

	Event_ID	Event_Date	Source_Name	Source_Sectors	Source_Country	Event_Text	CAMEO_Code	Intensity	Target_Name	Target_Sectors	Target_Country	Story_ID	Sentence_Number	Publisher	City	District	Province	Country	Latitude	Longitude
676	927826	1995-01-11	Citizen (India)	General Population / Civilian / Social,Social	India	Reject proposal to meet, discuss, or negotiate	125	-5.0	Narasimha Rao	Executive Office,Executive,Government	India	28239021	2	The Associated Press Political Service	NaN	NaN	State of Tamil Nadu	India	11.0000	78.0000
783	927996	1995-01-12	Citizen (India)	General Population / Civilian / Social,Social	India	Express intent to meet or negotiate	36	4.0	United States	NaN	United States	28239081	1	The Associated Press Political Service	Swanton	Saline County	Nebraska	United States	40.3781	-97.0728
2151	2547954	1995-01-26	Citizen (India)	Social,General Population / Civilian / Social	India	Demonstrate or rally	141	-6.5	Unspecified Actor	Unspecified	NaN	28242253	2	Reuters News	Jammu	NaN	State of Jammu and Kashmir	India	32.7357	74.8691
5760	935070	1995-03-06	Citizen (India)	Social,General Population / Civilian / Social	India	Kill by physical assault	1823	-10.0	Congress Party	(National) Major Party,Government Major Party ...	India	28915932	4	The Associated Press Political Service	Hyderabad	Hyderabad	State of Andhra Pradesh	India	17.3841	78.4564
5766	935081	1995-03-06	Citizen (India)	Social,General Population / Civilian / Social	India	Use unconventional violence	180	-9.0	Militant (India)	Unidentified Forces	India	28915955	6	The Associated Press Political Service	New Delhi	NaN	National Capital Territory of Delhi	India	28.6358	77.2244

So it looks like Citizen really means civilians, or possibly civil society actors unaffiliated with any organization the ICEWS coding system recognizes.

Update (03/29/2015): I had trouble finding news events that corresponded to the events above, but Jennifer Lautenschlager pointed me to this news article that indicates that there was election violence in India in that time frame.

To get country-level actors comparable to other event datasets, it looks like we need to use the source and target country columns:

In [14]:

country_source = data.Source_Country.value_counts()
country_target = data.Target_Country.value_counts()
country_counts = pd.DataFrame({"SourceFreq": country_source,
                             "TargetFreq": country_target})
country_counts.fillna(0, inplace=True)
country_counts["Total"] = country_counts.SourceFreq + country_counts.TargetFreq

In [15]:

country_counts.sort("Total", ascending=False, inplace=True)
country_counts.head(10)

Out[15]:

	SourceFreq	TargetFreq	Total
United States	997696	803460	1801156
India	773712	760583	1534295
Russian Federation	746829	706700	1453529
China	541432	525955	1067387
Japan	344413	332380	676793
Australia	340339	320329	660668
Israel	338118	315501	653619
United Kingdom	331735	302389	634124
Occupied Palestinian Territory	251678	317883	569561
Iran	286274	283276	569550

This looks pretty good too! India seems more represented compared to what I've seen in other datasets, and of course Israel/Palestine maintain their usual place on the event data leaderboard.

Update (03/29/2015): Since the Sectors are also an important way of understanding the relevant data, let's get their frequencies too. Sectors are a bit trickier, since each cell can contain multiple sectors, separated by commas. So we need to loop over each cell, split the sectors mentioned, and count each one.

In [16]:

# Count source sectors
# Note: each distinct sector string adds 1 here, so the tallies count
# distinct sector combinations; use `+= count` to weight by events instead.
source_sectors = defaultdict(int)
source_sector_counts = data.Source_Sectors.value_counts()
for sectors, count in source_sector_counts.iteritems():
    sectors = sectors.split(",")
    for sector in sectors:
        source_sectors[sector] += 1

# Count Target sectors (same approach as for the source sectors)
target_sectors = defaultdict(int)
target_sector_counts = data.Target_Sectors.value_counts()
for sectors, count in target_sector_counts.iteritems():
    sectors = sectors.split(",")
    for sector in sectors:
        target_sectors[sector] += 1
        
# Convert into series
source_sectors = pd.Series(source_sectors)
target_sectors = pd.Series(target_sectors)
# Combine into a dataframe, and fill missing with 0
sector_counts = pd.DataFrame({"SourceFreq": source_sectors,
                              "TargetFreq": target_sectors})

sector_counts.fillna(0, inplace=True)
sector_counts["Total"] = sector_counts.SourceFreq + sector_counts.TargetFreq
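One thing to keep in mind about the cell above: it adds one per distinct sector string, so its tallies are numbers of distinct sector combinations rather than numbers of events. If you want event-weighted frequencies instead, a sketch along these lines should work. `count_sectors` is a hypothetical helper, shown on a toy column, and its numbers would not match the outputs shown below.

```python
from collections import Counter

import pandas as pd


def count_sectors(sector_column):
    """Count events per sector, splitting comma-separated sector lists."""
    totals = Counter()
    # value_counts() drops NaN and gives one row per distinct sector string
    for sectors, n_events in sector_column.value_counts().items():
        for sector in sectors.split(","):
            # Weight by how many events share this exact sector string
            totals[sector] += n_events
    return pd.Series(totals).sort_values(ascending=False)


# Toy column using sector labels that appear in the data above
toy = pd.Series(["Government,Police", "Government", "Government,Police", None])
toy_sector_counts = count_sectors(toy)
```

In the toy column, Government is counted three times (once per event that mentions it) and Police twice.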

In [17]:

sector_counts.sort("Total", ascending=False, inplace=True)

In [18]:

sector_counts.head(10)

Out[18]:

	SourceFreq	TargetFreq	Total
Government	176897	138684	315581
Parties	171411	135383	306794
Ideological	153750	121842	275592
(National) Major Party	134134	106262	240396
Executive	129382	103265	232647
Elite	92926	80163	173089
Legislative / Parliamentary	63654	45670	109324
Executive Office	54710	49770	104480
Cabinet	57273	42038	99311
Center Left	48678	37593	86271

In [19]:

sector_counts.tail(10)

Out[19]:

	SourceFreq	TargetFreq	Total
International Exiles	2	1	3
Bedouin	2	0	2
Nepali-Pahari	1	1	2
Western	1	1	2
Navy Headquarters	1	1	2
Army Education / Training	0	1	1
Unspecified	0	1	1
Consumer Services MNCs	1	0	1
State-Owned Consumer Goods	1	0	1
Saharan	1	0	1

In addition to CAMEO-type actor designations (e.g. Government) it looks like some of the Sectors also resemble the Issues in Phoenix, or Themes in the GDELT GKG.

Daily Event Counts

An easy way to get an idea of whether there were significant changes in data collection over time is to look at total events over time. ICEWS events have the full daily date only, so let's go with that and look at daily events.

In [20]:

daily_events = data.groupby("Event_Date").aggregate(len)["Event_ID"]

In [21]:

daily_events.plot(color='k', lw=0.2, figsize=(12,6), 
                  title="ICEWS Daily Event Count")

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x1044704e0>

There seems to be a definite ramp-up period from 1995 through 1999 or so, and some sort of fall in event volume around 2009. Notice that there are also a few individual days, especially around 2004, with very few events for some reason.

Update (03/29/2015): Jennifer Lautenschlager clarified that the jumps in the 1995-2001 period reflect publishers entering incrementally into the commercial data system that feeds into ICEWS. The post-2008 dip reflects a decline in number of stories overall, possibly driven by budget cuts due to the recession.

Since each event has an associated Story ID, we can count how many unique stories are processed by ICEWS every day and end up generating events.

In [22]:

daily_stories = data.groupby("Event_Date").aggregate(pd.Series.nunique)["Story_ID"]

In [23]:

daily_stories.plot(color='k', lw=0.2, figsize=(12,6), 
                   title="ICEWS Daily Story Count")

Out[23]:

<matplotlib.axes._subplots.AxesSubplot at 0x11a31dfd0>

With these two series, we can measure the daily average events generated per story:

In [24]:

events_per_story = daily_events / daily_stories
events_per_story.plot(color='k', lw=0.2, figsize=(12,6), 
                      title="ICEWS Daily Events Per Story")

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x11a317080>

This confirms that indeed, except for a few anomalies, the number of events generated per story stays relatively consistent over time. Nevertheless, it's probably important to at least try to distinguish between fewer stories as caused by fewer newsworthy events, and fewer stories as caused by fewer journalists writing them.
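To make the events-per-story measure concrete, here is a small worked example on made-up data. It uses `groupby(...).size()` and `nunique()` directly, which is just an alternative spelling of the aggregations above, not a different method.

```python
import pandas as pd

# Toy events: three on one day (from two stories), one on the next day
toy = pd.DataFrame({
    "Event_Date": pd.to_datetime(["1995-01-01"] * 3 + ["1995-01-02"]),
    "Event_ID": [1, 2, 3, 4],
    "Story_ID": [10, 10, 11, 12],
})

by_day = toy.groupby("Event_Date")
toy_daily_events = by_day.size()                   # events per day
toy_daily_stories = by_day["Story_ID"].nunique()   # distinct stories per day
toy_events_per_story = toy_daily_events / toy_daily_stories
```

The first day has three events from two stories (1.5 events per story), the second has one event from one story (1.0).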

Mapping

Another good way to get an idea of the dataset's coverage is to put the events on a map. To do that, let's group the data by the latitude and longitude for each event, and count the number of events at each point. Then we can put those points on a world map using the basemap package.

In [25]:

points = data.groupby(["Latitude", "Longitude"]).aggregate(len)["Event_ID"]
points = points.reset_index()

Nobody will be surprised that the distribution of events-per-point is very long-tailed, with many points having only a small number of events, and a small number of points having hundreds of thousands of events.

In [26]:

points.Event_ID.hist()
plt.yscale('log')

So the best way to deal with this is to plot point size based on the log of the number of events recorded there.

The following code draws a world map using Basemap's default, built-in map, and then iterates over all the points, putting a dot on the map for each one. Finally, it exports the resulting map to a PNG file.

In [27]:

plt.figure(figsize=(16,16))

# Draw the world map itself
m = Basemap(projection='eck4',lon_0=0,resolution='c')
m.drawcoastlines()
m.fillcontinents()
# draw parallels and meridians.
m.drawparallels(np.arange(-90.,120.,30.))
m.drawmeridians(np.arange(0.,360.,60.))
m.drawmapboundary()
m.drawcountries()
plt.title("ICEWS Total Events", fontsize=24)

# Plot the points
for row in points.iterrows():
    row = row[1]
    lat = row.Latitude
    lon = row.Longitude
    count = np.log10(row.Event_ID + 1) * 2
    x, y = m(lon, lat) # Convert lat-long to plot coordinates
    m.plot(x, y, 'ro', markersize=count, alpha=0.3)
plt.savefig("ICEWS.png", dpi=120, facecolor="#FFFFFF")
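To see what the log scaling does to the marker sizes, here is the same formula pulled out into a small function and applied to a few made-up counts. `marker_sizes` is a hypothetical helper name; the factor of 2 is the one used in the loop above.

```python
import numpy as np


def marker_sizes(event_counts, scale=2.0):
    """Map raw event counts to marker sizes on a log10 scale."""
    counts = np.asarray(event_counts, dtype=float)
    # The +1 keeps a count of zero at size 0 instead of -inf
    return np.log10(counts + 1) * scale


sizes = marker_sizes([9, 99, 99999])
```

Going from 99 events to 99,999 events only moves the marker size from 4 to 10, which is what keeps the handful of very busy points from swamping the map.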

This looks... shockingly good to me. A few regions -- particularly the Indian subcontinent, East Asia and South America -- seem much better covered than in some other datasets. US Pacific Command was one of ICEWS's first customers, so it makes sense that its area of responsibility (AOR) would be well covered. Nigeria also seems to be relatively densely covered, though whether this is because of particular attention or simply its population and regional significance isn't clear.

The ICEWS documentation says that purely domestic US events aren't included. This explains why the continental US appears sparser than in some other datasets -- but there are obviously many points still left. Most of these events have at least one foreign actor, and apparently very few purely domestic events slip past the filters ICEWS has in place.

The Israeli-Palestinian Dyad

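Back to everyone's favorite dyad, which has had more than its fair share of event data analysis pointed at it. Let's subset all events originating from Israel and targeting what ICEWS codes as the Occupied Palestinian Territory, and vice-versa. Note that the filter below keeps any event whose source and target countries are both in the pair, so it also retains events internal to each side (Israel to Israel, for example), as the first few rows of the subset show.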

In [28]:

dyad = ["Israel", "Occupied Palestinian Territory"]
ilpalcon = data[(data.Source_Country.isin(dyad)) & 
                (data.Target_Country.isin(dyad))]
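Using `isin` on both columns also lets through events where the source and target are in the same country. If you want only the cross-border direction of the dyad, a stricter filter could look like the sketch below. `cross_dyad` is a hypothetical helper, the toy rows are made up, and this is not the filter that produced the results in the rest of this section.

```python
import pandas as pd


def cross_dyad(df, country_a, country_b):
    """Keep only events where one country is the source and the other the target."""
    a_to_b = (df.Source_Country == country_a) & (df.Target_Country == country_b)
    b_to_a = (df.Source_Country == country_b) & (df.Target_Country == country_a)
    return df[a_to_b | b_to_a]


# Toy rows with made-up intensities, purely for illustration
toy = pd.DataFrame({
    "Source_Country": ["Israel", "Israel",
                       "Occupied Palestinian Territory", "Egypt"],
    "Target_Country": ["Israel", "Occupied Palestinian Territory",
                       "Israel", "Israel"],
    "Intensity": [0.0, -2.0, 3.5, 1.0],
})
toy_cross = cross_dyad(toy, "Israel", "Occupied Palestinian Territory")
```

Only the second and third toy rows survive: the Israel-to-Israel row and the row with a third country as the source are both dropped.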

In [29]:

ilpalcon.head()

Out[29]:

	Event_ID	Event_Date	Source_Name	Source_Sectors	Source_Country	Event_Text	CAMEO_Code	Intensity	Target_Name	Target_Sectors	Target_Country	Story_ID	Sentence_Number	Publisher	City	District	Province	Country	Latitude	Longitude
50	926771	1995-01-03	Citizen (Palestinian Territory, Occupied)	Social,General Population / Civilian / Social	Occupied Palestinian Territory	Criticize or denounce	111	-2.0	Yitzhak Rabin	Government,State-Owned Defense / Security,Exec...	Israel	28235898	4	The Toronto Star	Bethlehem	NaN	West Bank	Occupied Palestinian Territory	31.7049	35.2038
132	926876	1995-01-03	Yitzhak Rabin	Government,State-Owned Defense / Security,Exec...	Israel	Make statement	10	0.0	Government (Israel)	Government	Israel	28242652	3	The Wall Street Journal Europe	Jerusalem	NaN	Jerusalem District	Israel	31.7690	35.2163
133	926877	1995-01-03	Cabinet / Council of Ministers / Advisors (Isr...	Government,Cabinet,Executive	Israel	Praise or endorse	51	3.4	Yitzhak Rabin	Government,State-Owned Defense / Security,Exec...	Israel	28242652	4	The Wall Street Journal Europe	Jerusalem	NaN	Jerusalem District	Israel	31.7690	35.2163
161	926955	1995-01-04	Yasir Arafat	Executive Office,Government,Executive	Occupied Palestinian Territory	Engage in diplomatic cooperation	50	3.5	Israel	NaN	Israel	28241261	3	The Christian Science Monitor	Gaza	NaN	Gaza Strip	Occupied Palestinian Territory	31.5000	34.4667
162	926956	1995-01-04	Yasir Arafat	Executive Office,Government,Executive	Occupied Palestinian Territory	Engage in diplomatic cooperation	50	3.5	Hamas	Parties,(National) Major Party,Dissident,Nongo...	Occupied Palestinian Territory	28241261	3	The Christian Science Monitor	Gaza	NaN	Gaza Strip	Occupied Palestinian Territory	31.5000	34.4667

Unlike GDELT and Phoenix, ICEWS doesn't include a quad/penta-code categorizing events into broadly cooperative or conflict actions (though you can create them yourself using the ICEWS CAMEO code, e.g. as described in the Phoenix documentation). Instead, it provides an Intensity score -- positive intensity indicates positive events (providing assistance, etc.) while negative scores indicate conflict (criticism, physical attacks). Taking the average intensity for some period of time should provide a rough estimate of each side's posture towards the other.

Let's break down the subset further into two: one for Israeli-initiated actions and one for Palestinian-initiated ones. That will give us a rough estimate of reciprocity -- is one side behaving more peacefully towards the other, or are their actions relatively mirrored?

First, we select Israel-initiated events, and get the mean intensity by day.

In [30]:

il_initiated = ilpalcon[ilpalcon.Source_Country=="Israel"]
il_initiated = il_initiated.groupby("Event_Date")
il_initiated = il_initiated.aggregate(np.mean)["Intensity"]

In [31]:

il_initiated.plot()

Out[31]:

<matplotlib.axes._subplots.AxesSubplot at 0x112ee3630>

It looks like daily events are too noisy to give us a good picture of what's going on. So let's use pandas's rolling mean tool to see the average intensity across a 30-day window:

In [32]:

pd.rolling_mean(il_initiated, 30).plot()

Out[32]:

<matplotlib.axes._subplots.AxesSubplot at 0x26d873908>

Notice the sharp drop that occurs in late 2000, marking the beginning of the Second Intifada.
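A note for anyone running this on a newer pandas than the 0.15.2 used here: later releases spell the rolling mean as a method on the series rather than as `pd.rolling_mean`. A small sketch on a made-up series, with a window of 3 so the result is easy to check by hand:

```python
import pandas as pd

# Toy daily series with made-up values
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
                index=pd.date_range("2000-01-01", periods=5))

# Method-based spelling of a 3-observation rolling mean; the first
# (window - 1) entries are NaN because the window is not yet full
toy_rolling = toy.rolling(3).mean()
```

Either way, an integer window counts observations rather than calendar days, so any day with no events in the subset simply isn't part of the window. On a date-indexed series, recent pandas versions also accept an offset such as `rolling("30D")` if a true calendar window is what you want.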

Now let's get the same dataset for Palestinian-initiated actions. This time, I string together the pandas operations using the backslash ('\') line-continuation character, which allows multiple lines to be strung together for legibility as if they were a single line of code:

In [33]:

pal_initiated = ilpalcon[ilpalcon.Source_Country=="Occupied Palestinian Territory"] \
                        .groupby("Event_Date") \
                        .aggregate(np.mean) \
                        ["Intensity"]

Next, combining the two mean intensity series into a single dataframe:

In [34]:

df = pd.DataFrame({"IL_Initiated": pd.rolling_mean(il_initiated, 30),
                   "PAL_Initiated": pd.rolling_mean(pal_initiated, 30)})

And now we can plot the mean intensity of actions initiated by each side.

In [35]:

fig, ax = plt.subplots(figsize=(12,6))
df.plot(ax=ax)
ax.set_ylabel("Mean Intensity Coding")

Out[35]:

<matplotlib.text.Text at 0x26db48550>

Not too surprisingly, they seem to overlap almost perfectly. There are a few points that stand out where the lines diverge significantly -- in a more in-depth analysis, they might warrant further examination to see whether they represent something interesting happening on the ground, or just a blip in the data collection.

We can correlate the series, and see that they do indeed track each other pretty closely (though not as perfectly as they may look on visual examination):

In [36]:

df.corr()

Out[36]:

	IL_Initiated	PAL_Initiated
IL_Initiated	1.000000	0.802497
PAL_Initiated	0.802497	1.000000

Ground Truth Dataset

One of ICEWS's biggest advantages is that it includes not only machine-coded event data, but hand-validated ground truth data on whether, on a monthly basis, each country is experiencing one of several types of conflict events.

Let's load it and take a look:

In [37]:

ground_truth = pd.read_csv(DATA + "gtds_2001.to.feb.2014.csv")

In [38]:

ground_truth.head()

Out[38]:

	ccode	country	year	month	time	notes	coder	insnotes	dpcnotes	rebnotes	ervnotes	icnotes
0	20	CANADA	2001	1	2001m1	NaN	Bentley & Leonard	NaN	NaN	NaN	NaN	NaN
1	20	CANADA	2001	2	2001m2	NaN	Bentley & Leonard	NaN	NaN	NaN	NaN	NaN
2	20	CANADA	2001	3	2001m3	NaN	Bentley & Leonard	NaN	NaN	NaN	NaN	NaN
3	20	CANADA	2001	4	2001m4	NaN	Bentley & Leonard	NaN	NaN	NaN	NaN	NaN
4	20	CANADA	2001	5	2001m5	NaN	Bentley & Leonard	NaN	NaN	NaN	NaN	NaN

The columns ins to ic are 1 if the country experienced that event during that month, and 0 otherwise. They are:

ins: Insurgency
reb: Rebellion
dpc: Domestic political crisis
erv: Ethnic or religious violence
ic: International conflict

For more details, see the GTDS documentation.

In [39]:

# Convert the 'time' column to datetime:
ground_truth["time"] = pd.to_datetime(ground_truth.time, format="%Ym%m")
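The format string is worth a second look, since the `time` values (e.g. 2001m1) aren't in a shape pandas would guess on its own. Here is the same call on a few hand-written strings in that shape:

```python
import pandas as pd

# Strings in the same "<year>m<month>" shape as the GTDS `time` column
toy_time = pd.Series(["2001m1", "2001m2", "2013m12"])

# %Y consumes the four-digit year, the literal "m" matches itself,
# and %m consumes the one- or two-digit month number
toy_parsed = pd.to_datetime(toy_time, format="%Ym%m")
```

Since no day is given, each value is parsed as the first day of its month.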

We can do some simple analysis on the ground truth dataset alone, for example see how many insurgencies are going on in the world on a month-by-month basis:

In [40]:

insurgency_count = ground_truth.groupby("time").aggregate(sum)["ins"]

In [41]:

insurgency_count.plot()
plt.ylabel("# of countries")
plt.title("Number of countries experiencing insurgencies")

Out[41]:

<matplotlib.text.Text at 0x26d89afd0>

Combining the event data with ground truth

The real advantage that the ground truth data gives us is being able to combine it with the machine-coded event data for analysis and ultimately prediction.

In this example, I'm going to do a very simple analysis, and try and see whether countries experiencing one of the conflicts measured by the GTDS generate more events, and events of lower intensity.

First, we count how many 'bad things' are happening per country-month:

In [42]:

ground_truth["Conflict"] = 0
for col in ["ins", "reb", "dpc", "erv", "ic"]:
    ground_truth.Conflict += ground_truth[col]
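The same count can also be written without the loop, as a row-wise sum over the five indicator columns. A sketch on a toy frame with made-up 0/1 values:

```python
import pandas as pd

conflict_cols = ["ins", "reb", "dpc", "erv", "ic"]

# Toy country-months with made-up 0/1 indicators
toy = pd.DataFrame({"ins": [0, 1, 1], "reb": [0, 0, 1], "dpc": [0, 1, 0],
                    "erv": [0, 0, 1], "ic": [0, 0, 0]})

# Row-wise sum over the five indicator columns
toy["Conflict"] = toy[conflict_cols].sum(axis=1)
```

One difference to be aware of: `sum` skips missing values by default, whereas the `+=` loop would turn the whole count into NaN for any row with a missing indicator. Which behavior you want depends on how you'd like to treat uncoded country-months.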

All we care about for now is the country, the month, and the count of conflict types:

In [43]:

monthly_conflict = ground_truth[["time", "country", "Conflict"]]

In [44]:

monthly_conflict.head()

Out[44]:

	time	country
0	2001-01-01	CANADA
1	2001-02-01	CANADA
2	2001-03-01	CANADA
3	2001-04-01	CANADA
4	2001-05-01	CANADA

Now let's go back to the ICEWS event data, and aggregate it on a country-month basis too. For purposes of this analysis, I'll associate events with the country that ICEWS places them in, rather than the source or target country.

I'll collect two measures: how many events were generated per country-month, and what their average intensity was.

ICEWS events are on a daily basis, so we need to associate a year-month with each event. The ground truth dates were parsed into the first day of the relevant month, so the simplest way to line the two up is to do the same for the ICEWS events:

In [45]:

get_month = lambda x: pd.datetime(x.year, x.month, 1)
data["YearMonth"] = data.Event_Date.apply(get_month)
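The `pd.datetime` alias used above is no longer provided by recent pandas releases. If you are on a newer version, one alternative (a sketch, not the approach used in this document) is to go through a monthly period and back, which also avoids calling a Python function on every row:

```python
import pandas as pd

# Toy dates, purely for illustration
toy_dates = pd.Series(pd.to_datetime(["1995-01-15", "1995-01-31", "1995-02-01"]))

# Convert each date to a monthly period, then back to a timestamp,
# which lands on the first day of that month
toy_year_month = toy_dates.dt.to_period("M").dt.to_timestamp()
```

On the real data this would be `data["YearMonth"] = data.Event_Date.dt.to_period("M").dt.to_timestamp()`.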

Now we'll group the data by country and month (really, first-day-of-the-month) and get the number and mean intensity of events for each.

In [46]:

monthly_grouped = data.groupby(["YearMonth", "Country"])
monthly_counts = monthly_grouped.aggregate(len)["Event_ID"]
monthly_intensity = monthly_grouped.aggregate(np.mean)["Intensity"]

And combine these series into a single DataFrame:

In [47]:

monthly_events = pd.DataFrame({"EventCounts": monthly_counts,
                               "MeanIntensity": monthly_intensity})
monthly_events.reset_index(inplace=True)

In [48]:

monthly_events.head()

Out[48]:

	YearMonth	Country	EventCounts	MeanIntensity
0	1995-01-01	Afghanistan	5	1.640000
1	1995-01-01	Albania	10	0.800000
2	1995-01-01	Algeria	25	-3.064000
3	1995-01-01	Angola	3	-6.333333
4	1995-01-01	Argentina	18	1.022222

So this is fun: country names in the ICEWS event dataset are written with only the first letters capitalized, but the GTDS country names are in ALL CAPS. We need to convert one to the other in order to be able to match them -- and making country names all-caps is easier than dealing with title-casing multi-word all-cap country names.

In [49]:

capitalize = lambda x: x.upper()
monthly_events["Country"] = monthly_events.Country.apply(capitalize)

Now that we've done that, we can merge the dataframes on month and country name. The merge includes all the columns from both dataframes by default, so we need to only keep the ones we're interested in:

In [50]:

monthly_data = monthly_conflict.merge(monthly_events,
                       left_on=["time","country"], right_on=["YearMonth", "Country"])
monthly_data = monthly_data[["YearMonth", "Country", "Conflict", "EventCounts", "MeanIntensity"]]
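One caveat with this merge: by default it keeps only country-months that match exactly in both frames, so any country whose name is spelled differently in the two datasets drops out silently. I haven't checked which names (if any) differ, but a sketch like the following, on toy frames with deliberately mismatched made-up names, shows one way to look. It relies on the `indicator` option of `merge`, which is only available in pandas versions newer than the one used here.

```python
import pandas as pd

# Toy frames; the names and numbers are made up to force a mismatch
conflict = pd.DataFrame({"time": pd.to_datetime(["2001-01-01", "2001-01-01"]),
                         "country": ["CANADA", "NAME VARIANT A"],
                         "Conflict": [0, 1]})
events = pd.DataFrame({"YearMonth": pd.to_datetime(["2001-01-01", "2001-01-01"]),
                       "Country": ["CANADA", "NAME VARIANT B"],
                       "EventCounts": [363, 12]})

# An outer merge with indicator=True labels each row by where it came from
check = conflict.merge(events, how="outer",
                       left_on=["time", "country"],
                       right_on=["YearMonth", "Country"],
                       indicator=True)
unmatched = check[check["_merge"] != "both"]
```

Rows marked `left_only` exist only in the ground truth frame and rows marked `right_only` only in the event frame; listing the distinct country names among them is a quick way to spot spelling differences worth fixing before the real merge.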

In [51]:

monthly_data.head()

Out[51]:

	YearMonth	Country	EventCounts	MeanIntensity
0	2001-01-01	CANADA	363	0.307989
1	2001-02-01	CANADA	438	0.773973
2	2001-03-01	CANADA	443	0.583070
3	2001-04-01	CANADA	716	0.095670
4	2001-05-01	CANADA	445	0.352584

Now let's make some quick box plots and eyeball whether conflicts make a difference for data generation:

In [52]:

monthly_data.boxplot(column="EventCounts", by="Conflict")

Out[52]:

<matplotlib.axes._subplots.AxesSubplot at 0x1cbf73198>

In [53]:

monthly_data.boxplot(column="MeanIntensity", by="Conflict")

Out[53]:

<matplotlib.axes._subplots.AxesSubplot at 0x1cc07da20>

We see a similar thing here -- no- or low-conflict country-months generate a wide variety of mean intensities, but the median mean intensity seems to become more negative with higher conflict scores.

However, what's the deal with the data points showing a very low mean intensity (which indicates conflict) when the ground truth doesn't indicate that there were conflicts occurring? Let's check:

In [54]:

monthly_data[(monthly_data.Conflict==0) & (monthly_data.MeanIntensity<-9)]

Out[54]:

	YearMonth	Country	EventCounts	MeanIntensity
1188	2010-12-01	BARBADOS	1	-10.0
1491	2011-11-01	BELIZE	1	-9.5
1494	2012-02-01	BELIZE	1	-10.0
10373	2007-05-01	CAPE VERDE	1	-9.5
10395	2010-09-01	CAPE VERDE	1	-10.0
10614	2007-11-01	GUINEA-BISSAU	1	-10.0
10675	2013-03-01	GUINEA-BISSAU	1	-10.0
10791	2010-10-01	EQUATORIAL GUINEA	1	-10.0
10797	2011-05-01	EQUATORIAL GUINEA	1	-10.0
11495	2004-06-01	MAURITANIA	1	-9.2
13064	2004-03-01	GABON	2	-9.5
16465	2012-12-01	BOTSWANA	1	-9.5
21876	2010-03-01	BHUTAN	1	-10.0
21898	2012-04-01	BHUTAN	1	-9.5
24581	2012-05-01	SOLOMON ISLANDS	2	-9.1

Ah -- it looks like all of these had very low event counts. Remember that these are monthly, and one or two intensely negative events generated in an entire month probably are not themselves strong indicators of conflict. At the very least, so few events probably also indicate that there isn't much collection of events happening for that country in general.

Summary

This was just a quick tour of things I tried playing around with ICEWS. There's a lot of public research that's already been done with ICEWS that it could be fun to attempt to replicate now that the data is finally public. It'll also be interesting to compare the data to other public event datasets, to figure out strengths and gaps, and improve both. The ground truth dataset alone could also be useful for building and testing models with completely different event data.

Comments? Suggestions? Questions? Find me on Twitter or let me know by email.
