By Felipe Hoffa (@felipehoffa, /r/bigquery)


In [13]:
# Some pre-flight work:
# - Libraries.
# - Patch to handle authentication automatically on Google Compute Engine.
from oauth2client.gce import AppAssertionCredentials
from bigquery_client import BigqueryClient
from pandas.io import gbq
import pandas as pd

def GetMetadata(path):
  # Fetch a value from the GCE instance metadata server.
  import urllib2
  BASE_PATH = 'http://metadata/computeMetadata/v1/'
  request = urllib2.Request(BASE_PATH + path, headers={'Metadata-Flavor': 'Google'})
  return urllib2.urlopen(request).read()

credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery')

client = BigqueryClient(credentials=credentials,
                        api='https://www.googleapis.com',
                        api_version='v2',
                        project_id=GetMetadata('project/project-id'))

# Monkey-patch pandas' gbq module so read_gbq() uses this authenticated client.
gbq._authenticate = lambda: client

How many rows does GDELT have for each country?

We are going to use a view defined specifically for this: 'sample_views.country_date_matconf_numarts' counts the number of material conflicts reported per country, per day.

To see the view definition, go to https://bigquery.cloud.google.com/table/gdelt-bq:sample_views.country_date_matconf_numarts and click on 'Details'.
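
For reference, a view like this conceptually boils down to an aggregation over the raw GDELT events table. The snippet below is only an illustrative guess at its shape (the real definition is behind the 'Details' link above): the column names come from the public [gdelt-bq:full.events] table, where QuadClass 4 marks material conflict, but the actual view also maps country codes to full names and converts dates to epoch microseconds.

# Illustrative guess only -- not the real view definition (see the
# 'Details' link above for that).
view_sketch = """
SELECT
  ActionGeo_CountryCode AS country,  -- the real view returns full names
  SQLDATE AS date,                   -- the real view uses epoch microseconds
  SUM(NumArticles) AS c
FROM [gdelt-bq:full.events]
WHERE QuadClass = 4  -- GDELT QuadClass 4 = material conflict
GROUP BY country, date
"""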

In [2]:
query = """
SELECT country, SUM(c) AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts]
GROUP BY country
ORDER BY count DESC
LIMIT 10
"""
gbq.read_gbq(query)
Waiting on bqjob_r3494ef2663a3c45e_00000147cc145563_1 ... (0s) Current status: DONE   
Out[2]:
country count
0 Israel 13138702
1 Pakistan 11547139
2 Afghanistan 11334579
3 Iraq 10181510
4 Syria 9538310
5 United Kingdom 8764009
6 Russia 7569903
7 India 6988340
8 Egypt 6511839
9 China 6268915

Focusing solely on Egypt, let's plot the number of material conflicts reported for each day over the last 35 years:

In [9]:
query = """
SELECT date, c AS count
FROM [gdelt-bq:sample_views.country_date_matconf_numarts]
WHERE country='Egypt'
ORDER BY date
"""
data = gbq.read_gbq(query)
# The view's date column is microseconds since the epoch;
# pandas expects nanoseconds, hence the *1000.
data.index = pd.to_datetime(data['date']*1000)
x = data['count'].plot()
Waiting on bqjob_r46d7af66323c4ad9_00000147cc18e43f_8 ... (0s) Current status: DONE   

What if we focus only on the 30 days leading up to January 27th, 2011?

In [28]:
query = """
  SELECT country, date, c AS count
  FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a 
  CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
  WHERE country='Egypt'
  AND date+i*86400000000 = PARSE_UTC_USEC('2011-01-27')
"""
data_egypt = gbq.read_gbq(query)
data_egypt.index = pd.to_datetime(data_egypt['date']*1000)
x = data_egypt['count'].plot()
Waiting on bqjob_r2f64a01b0c420ec3_00000147cc356c4a_11 ... (0s) Current status: DONE   
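
The WHERE clause does all the work here: 86400000000 is one day in microseconds, so date+i*86400000000 = PARSE_UTC_USEC('2011-01-27') keeps exactly the rows whose date falls i days before January 27th, with the cross join supplying i = 0 through 29. A quick pure-Python sanity check of that arithmetic (no BigQuery needed):

import datetime

USEC_PER_DAY = 86400000000  # 24 * 60 * 60 * 10**6 microseconds

epoch = datetime.datetime(1970, 1, 1)
target = datetime.datetime(2011, 1, 27)
target_usec = int((target - epoch).total_seconds()) * 10**6

# Recover the date that satisfies date + i*USEC_PER_DAY == target_usec:
for i in (0, 1, 29):
    date_usec = target_usec - i * USEC_PER_DAY
    print('i=%2d -> %s' % (i, epoch + datetime.timedelta(microseconds=date_usec)))
# i= 0 -> 2011-01-27, i= 1 -> 2011-01-26, ..., i=29 -> 2010-12-29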

With that timeline defined, we can look for similar timelines across all years and countries.

To make this work, we used the trick explained at http://stackoverflow.com/questions/24923101/computing-a-moving-maximum-in-bigquery/24943950#24943950.
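
The essence of the trick: cross joining the daily counts with the integers 0..29 makes every (country, date, c) row appear 30 times, once per sliding window it belongs to, tagged with the window's end date (ending_at = date + i days) and its position i counted back from that end. Joining the two expansions ON a.i=b.i then lines up every candidate window against the Egypt reference window position by position, which is exactly what CORR(a.c, b.c) needs. A toy pandas illustration of the expansion, using made-up data and 3-day windows for brevity:

import pandas as pd

# Hypothetical daily counts (a stand-in for the view's output).
days = pd.DataFrame({'date': pd.to_datetime(['2011-01-01', '2011-01-02',
                                             '2011-01-03', '2011-01-04']),
                     'c': [1, 3, 2, 5]})
offsets = pd.DataFrame({'i': range(3)})  # 3-day windows instead of 30

# Cross join via a dummy key (works on old pandas versions too).
expanded = (days.assign(key=1)
                .merge(offsets.assign(key=1), on='key')
                .drop('key', axis=1))
expanded['ending_at'] = expanded['date'] + pd.to_timedelta(expanded['i'], unit='D')

# Every ending_at group is now one complete sliding window, with i marking
# how many days back from the window's end each count sits.
print(expanded.sort_values(['ending_at', 'i']))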

In [12]:
query = """
SELECT
  STRFTIME_UTC_USEC(a.ending_at, "%Y-%m-%d") ending_at1,
  STRFTIME_UTC_USEC(b.ending_at-30*86400000000, "%Y-%m-%d") starting_at2,
  STRFTIME_UTC_USEC(b.ending_at, "%Y-%m-%d") ending_at2,
  a.country, b.country, CORR(a.c, b.c) corr, COUNT(*) c
FROM (
  SELECT country, date+i*86400000000 ending_at, c, i
  FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a 
  CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
) b
JOIN (
  SELECT country, date+i*86400000000 ending_at, c, i
  FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a 
  CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
  WHERE country='Egypt'
  AND date+i*86400000000 = PARSE_UTC_USEC('2011-01-27')
) a
ON a.i=b.i
WHERE a.ending_at != b.ending_at
GROUP EACH BY ending_at1, ending_at2, starting_at2, a.country, b.country
HAVING (c = 30 AND ABS(corr) > 0.254)
ORDER BY corr DESC

LIMIT 10
"""
data=gbq.read_gbq(query)
data
Waiting on bqjob_r5ccda16b63fbcd4a_00000147cc1a0174_11 ... (203s) Current status: DONE   
Out[12]:
ending_at1 starting_at2 ending_at2 a_country b_country corr c
0 2011-01-27 2013-10-19 2013-11-18 Egypt Chad 0.875190 30
1 2011-01-27 2011-05-15 2011-06-14 Egypt Jordan 0.862724 30
2 2011-01-27 2005-08-07 2005-09-06 Egypt Turkey 0.856022 30
3 2011-01-27 2006-10-03 2006-11-02 Egypt Italy 0.847174 30
4 2011-01-27 2006-05-29 2006-06-28 Egypt Syria 0.846571 30
5 2011-01-27 2013-05-15 2013-06-14 Egypt South Korea 0.842728 30
6 2011-01-27 1995-06-11 1995-07-11 Egypt Iraq 0.842362 30
7 2011-01-27 2011-10-10 2011-11-09 Egypt Portugal 0.841235 30
8 2011-01-27 2002-08-04 2002-09-03 Egypt Sudan 0.835436 30
9 2011-01-27 2009-08-07 2009-09-06 Egypt Germany 0.832848 30

Those are the 10 periods that most closely resemble the timeline of material conflicts in Egypt before January 27th, 2011.

Let's plot Chad over its matching window: the 30 days starting October 19th and ending November 18th, 2013.

In [34]:
query = """
  SELECT country, date, c AS count
  FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a 
  CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 30) b
  WHERE country='Chad'
  AND date+i*86400000000 = PARSE_UTC_USEC('2013-11-18')
"""
data_chad = gbq.read_gbq(query)
data_chad.index = pd.to_datetime(data_chad['date']*1000)
data_chad['count'].plot()
Waiting on bqjob_r39c0b56b485a6249_00000147cc384b90_16 ... (0s) Current status: DONE   
Out[34]:
<matplotlib.axes.AxesSubplot at 0x7feb5c1ecbd0>

In fact, both periods look remarkably similar (though not necessarily significantly so; more analysis is needed before reaching any conclusions):

In [33]:
data_egypt['count'].plot()
Out[33]:
<matplotlib.axes.AxesSubplot at 0x7feb5cd93890>
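
To compare the two shapes directly, we can overlay both 30-day series on one axis (aligned by position rather than by calendar date) and recompute the correlation in pandas; it should land on BigQuery's 0.875 for this pair. A minimal sketch, assuming data_egypt and data_chad from the cells above are still in scope:

egypt = data_egypt.sort_index()['count'].reset_index(drop=True)
chad = data_chad.sort_index()['count'].reset_index(drop=True)

# Normalize so shape, not absolute article volume, drives the comparison.
ax = (egypt / egypt.max()).plot(label='Egypt, 30 days ending 2011-01-27')
(chad / chad.max()).plot(ax=ax, label='Chad, 30 days ending 2013-11-18')
ax.legend(loc='best')

# Pearson correlation of the aligned series, to compare with CORR above.
print(egypt.corr(chad))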

To look further into how to read these results to understand history, see:

By Felipe Hoffa (@felipehoffa, /r/bigquery)