Our idea is to take the Global Terrorism Database and perform a comprehensive analysis on it in order to get insight into the terrorism and create tools which let the user to explore the data interactively. The dataset is semistructured so it requires a lot of cleaning and preprocessing tasks in order to make it accessible for the entire project, hence we use some of the key techniques introduced in the course. First of all, we format and clean the data. Afterwards, we compress it to the client-side friendly form to require less space and make the UI more responsive. We then create many kinds of charts using D3.js, each having its own function and key objectives.
Important: We decided to distribute code in separate notebooks.
To find the code and corresponding documentation on each analysis step just jump to the linked notebooks below.
In this notebook we decompress, clean and format the data (so perform the initial preprocessing of the data) to a shape which is best for use in further analysis.
In this notebook we do some basic analysis using Plotly
, which is beyond the scope of this project but we liked the idea to show something in the static notebook in case something goes wrong online.
We decided to dive a bit more into visualizations and limit used machine learning methods to 2, both of which must be interactive enough to embed them into charts. Hence, we use k-nearest neighbors and k-means, both of which just make fun to play with.
We didn't go further to multiple/logistic regression, because we encountered some constraints using them with our data. The most of the predicting features we can feed to the classifier are categorical, meaning that each algorithm would have preferred the feature with the most classes, leading to a clear bias. We also have seen some of the examples on Kaggle, where no model performance was higher than middle score.
More on KNN and K-Means you find in the section Visualizations.
In this notebook you find the code for aggregation of data we use in all of our charts.
The idea of using histogram to highlight the temporal development of terrorism is straightforward. Each bar corresponds to a year. The length of a bar is linked to the killed/wounded/attacks in that year.
To compare different weapon, attack or target types, we would like to use as many dimensions as possible. For this, the scatterplot is almost perfect, as it enables encoding of 4 different metrics (X, Y, size, color).
The scatterplot above has one big issue: we can display up to ~30 circles before we run out of space. But what if we'd love to compare countries? Even on a rectangular map with Mercador projection we need some kind of zoom. To tackle the problem we decided on another, more difficult, but also interesting solution: map countries on a virtual globe and let the user rotate it!
The visualization of K-Means should let the user customize the number of groups, as well as to display the centroids and the data points they belong to. Here we arrive at the next problem: how do we display all 150,000 terrorist attacks on a small map?
Solution: We use hexbins!
Each hexagonal bin is a shape which aggregates attacks occurred in a area of a specific radius. This way we avoid plotting thousands of data points. Instead, we just plot hexbins and display further information on hover. Brilliant, isn't it?
The first idea we had on visualization was to create a grid of already classified points with iPython
, and import it into d3.js
. But we encountered many problems:
k
and Type
, which is space expensiveWe ended up with the calculation of kNN on fly. This way, the visualization allows to classify the point user explicitly selects on the map. It also shows the related neighbors and their types without flooding the map with data.
Preprocessing of data from the Global Terrorism Database was very challenging, as the provided data was encoded by rules specified in the (very, very long) codebook. That is, we had to decode every single column we were interested in, clean it from bad values like negatives, zeros, or nans. Decide how we proceed with unknown values. And reformat the data for further use (we used pandas.DataFrame
for that, pretty handsome).
The thing we are really proud of is our super-flexible client-side infrastructure: we have almost no hard-coded data structures, variables or constants. The JavaScript functions we provide take care of recognizing what the corresponding csv-file contains and populates the dropdowns and charts automatically! By changing csv-files with iPython you're good with refreshing the page to see the changed data to be visualized, (maybe almost) without rewriting js
-functions. Every piece of code was aimed at flexibility and customization. You can put almost any kind of information into those charts just by knowing how they handle the data (please read the comments).
This also applies to the project structure, which (we hope) complies with many principles of a good design.
We decided to dig deeper into visualizations and UX instead of modeling and bulding ensembles. We hope you will have a lot of fun.