James Houghton; April 8th, 2014 [email protected]
One of the challenges of dealing with GDELT is that its size makes implementation in a SQL database challenging. It is easier (although slower) to extract a subset of the data that we wish to work with, and do our actual data investigation afterwards. If we're interested in data broken down by date, we have it easy: GDELT files are provided this way. If instead we want to look at a single country, we have some work to do. This notebook parses through each of the GDELT files, one at a time, extracts the relevant lines, and exports them again to a smaller set of csv files. It then gives the option to load these files into a Pandas DataFrame, pickle the result, and remove the intermediate files.
There are certainly more computationally efficient methods for doing this, but this one works well enough.
Links to each of the datafiles are available on the GDELT website. We extract this list to help collect and process the data. We probably want to perform this task every time we rebuild a dataset, so that we know we have the most up-to-date set of files.
import requests import lxml.html as lh gdelt_base_url = 'http://data.gdeltproject.org/events/' # get the list of all the links on the gdelt file page page = requests.get(gdelt_base_url+'index.html') doc = lh.fromstring(page.content) link_list = doc.xpath("//*/ul/li/a/@href") # separate out those links that begin with four digits file_list = [x for x in link_list if str.isdigit(x[0:4])]
In this example, we pull out all of the rows with a specific country code, and create a set of files which mirrors that of GDELT itself in quantity and format. Each output file is smaller than its corresponding input file.
As we're pulling out rows based upon location, we'll use GDELT cells:
These use the FIPS country codes, for inconvenience.
The general algorithm we will follow can be loosely described as:
- Check and see if the files are available locally (We assume we can afford to keep a local copy of the compressed files, so that we don't have to continually ping the gdelt server.)
- If not, get the files from the web
- Unzip the compressed raw GDELT file
- For each resulting CSV file (probably only one)
- Create and open an output file
- For each line of the input file
- Read into a string
- Split string by tab delimiter
- Check for the desired values in the appropriate list indexes
- If they are not there, continue the loop
- If the are there, write the line to the output file
- Increment a current-file counter
- Close the input and output files
- Delete the input file
We reset the infilecounter and outfilecounter external to the main algorithm cell so that if the algorithm encounters an error, and quits, we can pick up where we left off.
We choose to check for the files, and potentially download them once for each file that we try to open, as opposed to doing all file collection in advance. That way, we can start to run additional code as soon as possible. This will help with debugging as we'll "fail fast".
Remember to set the local_path to somewhere you're happy to store large files.
infilecounter = 0 outfilecounter = 0
import os.path import urllib import zipfile import glob import operator local_path = '/Users/me/Desktop/GDELT_Data/' fips_country_code = 'UK' for compressed_file in file_list[infilecounter:]: print compressed_file, # if we dont have the compressed file stored locally, go get it. Keep trying if necessary. while not os.path.isfile(local_path+compressed_file): print 'downloading,', urllib.urlretrieve(url=gdelt_base_url+compressed_file, filename=local_path+compressed_file) # extract the contents of the compressed file to a temporary directory print 'extracting,', z = zipfile.ZipFile(file=local_path+compressed_file, mode='r') z.extractall(path=local_path+'tmp/') # parse each of the csv files in the working directory, print 'parsing,', for infile_name in glob.glob(local_path+'tmp/*'): outfile_name = local_path+'country/'+fips_country_code+'%04i.tsv'%outfilecounter # open the infile and outfile with open(infile_name, mode='r') as infile, open(outfile_name, mode='w') as outfile: for line in infile: # extract lines with our interest country code if fips_country_code in operator.itemgetter(51, 37, 44)(line.split('\t')): outfile.write(line) outfilecounter +=1 # delete the temporary file os.remove(infile_name) infilecounter +=1 print 'done'
We may be content to use the data we just sampled into csv files in its present state. However, if we are working in python, it is convenient to load them into a DataFrame, save that DataFrame to a pickle, and delete the temporary files. This can save space on the disk, and make our future analysis of the data more simple.
Our algorithm here is simple - we build dataframes out of each of the temporary files, and then merge them into one big dataframe. We save that big dataframe, and delete the temporary files.
We use a helper file here which lists the column names. You can download the file to your working directory with this link:
import glob import pandas as pd # Get the GDELT field names from a helper file colnames = pd.read_excel('CSV.header.fieldids.xlsx', sheetname='Sheet1', index_col='Column ID', parse_cols=1)['Field Name'] # Build DataFrames from each of the intermediary files files = glob.glob(local_path+'country/'+fips_country_code+'*') DFlist =  for active_file in files: print active_file DFlist.append(pd.read_csv(active_file, sep='\t', header=None, dtype=str, names=colnames, index_col=['GLOBALEVENTID'])) # Merge the file-based dataframes and save a pickle DF = pd.concat(DFlist) DF.to_pickle(local_path+'backup'+fips_country_code+'.pickle') # once everythin is safely stored away, remove the temporary files for active_file in files: os.remove(active_file)