#!/usr/bin/env python
# coding: utf-8
# ## Understanding the Panama Papers using pandas and plotly
#
# In this notebook, I use the pandas and plotly libraries to clean, wrangle and visualise the Panama Papers dataset. It is in part to help me learn python, pandas and plotly, and in part to help me satisfy my curiosity about the Panama Papers. Specifically, I hope to answer the following questions: (i) which countries do tax-evading companies originate from, (ii) which countries are popular tax havens, and (iii) zooming into Singapore, where do tax evaders live?
#
# Everything I learnt about pandas is from two amazing tutorials by Brandon Rhodes (PyCon 2015) https://www.youtube.com/watch?v=5JnMutdy6Fw&list=PLO9pkowc_99YYMFnW9GZkIE93jbip1jZb&index=4 and Jonathan Rocher (SciPy 2015) https://www.youtube.com/watch?v=0CFFTJUZ2dc.
#
#
# In the process, I learnt how to use pandas to:
# 1. Load data
# 2. Clean and format data
# 3. Transform datasets
# 4. Aggregate and summarise data
#
# I also learnt how to use the plotly library for data visualisation and how to import and embed HTML files.
#
# Note:
# - This is an exercise in pandas (for data cleaning) and plotly (for visualisation). I haven't learnt about data validation at this point, so please take my conclusions with a pinch of salt.
# - Source: https://offshoreleaks.icij.org/pages/database ICIJ Offshore Leaks Database is licensed under the Open Database License and contents under Creative Commons Attribution-ShareAlike license.
#
# In[143]:
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import cufflinks as cf
import plotly.tools as tls
import numpy as np
import pandas as pd
from pandas import set_option
set_option("display.max_rows", 25)
from IPython.core.display import HTML
LARGE_FIGSIZE = (16, 12)
# ## Describing the data
# The data is taken from 5 csv files, which contain data on the (i) entities, (ii) intermediaries, (iii) officers, and (iv) addresses of officers, while the fifth file appears to let Neo4j map data across the csv files, using node_id as a unique identifier.
#
# More details here: https://offshoreleaks.icij.org/pages/about
# Note that this contains only a fraction of the Panama Papers leaks from the Panama-based offices of Mossack Fonseca. Those leaks are made up of 2.6 terabytes of data, a total of 11.5 million records, which are not included en masse in the public database. The raw documents have not been published. The Mossack Fonseca leaks include vast amounts of email correspondence, bank account details, identity documents and financial transactions, and not all the names on those files appear in the database. Much of the information is buried in emails, power-of-attorney letters and internal notes of Mossack Fonseca employees and cannot easily be extracted in a systematic manner, according to the International Consortium of Investigative Journalists (ICIJ).
#
#
# In[144]:
addresses = pd.read_csv("data/Addresses.csv")
all_edges = pd.read_csv("data/all_edges.csv")
entities = pd.read_csv("data/Entities.csv")
intermediaries = pd.read_csv("data/Intermediaries.csv")
officers = pd.read_csv("data/Officers.csv")
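# Before diving in, here is a minimal sketch of how the fifth file appears to tie the others
# together: each row of all_edges connects two node_ids from different tables. The column names
# node_1, rel_type and node_2, and the toy data, are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the real files (column names assumed from the ICIJ schema)
entities_demo = pd.DataFrame({'node_id': [100], 'name': ['Acme Offshore Ltd']})
officers_demo = pd.DataFrame({'node_id': [200], 'name': ['J. Doe']})
edges_demo = pd.DataFrame({'node_1': [200], 'rel_type': ['officer_of'], 'node_2': [100]})

# Walk an edge from an officer to the entity it points at
linked = (edges_demo
          .merge(officers_demo, left_on='node_1', right_on='node_id')
          .merge(entities_demo, left_on='node_2', right_on='node_id',
                 suffixes=('_officer', '_entity')))
print(linked[['name_officer', 'rel_type', 'name_entity']])
```

# The same two-merge pattern should work on the real dataframes loaded above, with node_id as
# the join key on both sides.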
# In[145]:
all_data = {'1. entities' : pd.Series(entities.columns.values),
'2. intermediaries' : pd.Series(intermediaries.columns.values),
'3. officers': pd.Series(officers.columns.values),
'4. addresses' : pd.Series(addresses.columns.values),
'5. all_edges': pd.Series(all_edges.columns.values)}
data_overview = pd.DataFrame(all_data)
data_overview.head(20)
# ## Exploring the entities dataframe
#
# In[146]:
entities.shape
# In[147]:
entities.info()
# In[148]:
entities.columns
# In[150]:
#from the above, the columns of interest are: name, address, jurisdiction_description, countries,
#node_id and internal_id. The last 2 columns may be useful later on as an index to reference other csv files.
entities_subset = entities[[u'name', u'address', u'jurisdiction_description', u'countries', u'node_id',
u'internal_id']]
entities_subset.head()
# In[151]:
#I'm not sure what jurisdiction and countries mean - which is the source and which is the destination? Let's find
#out if they are identical.
entities.jurisdiction == entities.country_codes
# The output shows that they are not the same. Following some exploratory data analyses (which I've excluded from
#here for brevity), I found out that (i) jurisdiction refers to the tax havens, and (ii) the distinction between
#source country and tax havens is not so clear cut. For instance, based on the second chart below, you'll see that
#several thousand companies investing into the British Virgin Islands are from the British Virgin Islands.
# In[9]:
#The country_codes and countries column sometimes contain more than 1 element. This may lead to undercounting. Let's
#fix this.
split = lambda x: pd.Series([i for i in reversed(x.split(';'))])
countries_split = entities_subset['countries'].dropna().apply(split)
countries_split.rename(columns={0:'Country1',1:'Country2',2:'Country3', 3: 'Country4'},inplace=True)
countries_split = countries_split[['Country1', 'Country2', 'Country3', 'Country4']]
print(countries_split)
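# As an aside, pandas has a vectorised `str.split` that does this split in one call (without the
# row-by-row apply). Note one difference: the lambda above reverses the order, so Country1 is the
# *last* country listed, whereas `str.split` keeps the original order. Toy data for illustration:

```python
import pandas as pd

countries = pd.Series(['Singapore', 'Hong Kong;Singapore', None])
# str.split with expand=True returns one column per split element
wide = countries.dropna().str.split(';', expand=True)
wide.columns = ['Country%d' % (i + 1) for i in range(wide.shape[1])]
print(wide)
```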
# In[154]:
print(countries_split.Country3.value_counts())
print(countries_split.Country4.value_counts())
# For the sake of simplicity, I am going to drop Country3 and Country4, since countries appear in these 2 columns
#only 20 times in total (well under 0.01% of our dataset).
# In[153]:
#Dropping two columns and merging this back to the main dataframe, and then dropping the country_codes column
countries_split = countries_split.drop(['Country3', 'Country4'], axis = 1)
entities_subset[['Country1', 'Country2']] = countries_split[['Country1', 'Country2']]
# In[142]:
entities_subset = entities_subset.drop('countries', axis = 1)
entities_subset.head()
# In[13]:
#Let's move on to harmonise the references to British Virgin Islands (in Country1 they are referred to as
#Virgin Islands, British).
entities_subset = entities_subset.replace(to_replace='Virgin Islands, British', value='British Virgin Islands')
entities_subset.head()
# ## The data is cleaner now, so let's prepare it for visualisation!
#
# Based on what's available in the entities_subset dataframe, I think we can visualise the following: (i) top countries with companies found in the Panama Papers, (ii) breakdown of these top countries by destination countries, (iii) the breakdown in source and destination countries (using a heatmap), and (iv) zooming into Singapore, we can use the address column to visualise where these companies are located in Singapore (I will enlist the help of Google Maps for this last visualisation)
# ### 1. Top 20 countries with companies found in the Panama Papers
# In[23]:
#First, let's build a dataframe for visualisation
#use .copy() so that assigning new columns later doesn't trigger a SettingWithCopyWarning
source_country = entities_subset[['Country1', 'Country2']].copy()
rank = source_country.apply(pd.Series.value_counts).fillna(0)
rank['total'] = rank.sum(axis = 1)
rank = rank.sort_values(by = 'total', ascending = False).total
top_20 = rank[:20]
top_20
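# To make the ranking logic above concrete: `apply(pd.Series.value_counts)` counts each country
# separately per column, and summing across the columns gives the total number of appearances.
# A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Country1': ['SG', 'HK', 'SG'],
                   'Country2': ['HK', None, None]})
# value_counts per column; countries missing from a column become NaN, hence fillna(0)
counts = df.apply(pd.Series.value_counts).fillna(0)
counts['total'] = counts.sum(axis=1)
totals = counts['total'].sort_values(ascending=False)
print(totals)
```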
# In[124]:
#Let's visualise the data! I'll use the plotly cufflinks library (why not? it's interactive! try clicking on the chart!)
top_20.iplot(kind ='bar', yTitle = 'Number of Companies', title = 'Top 20 Countries (Source) Listed in Panama Papers')
# ### 2. Breakdown of top countries by destination countries (i.e. jurisdictions)
# In[25]:
#First, let's structure our data into a 2-level dataframe, so that it can be used to create a stacked bar chart.
#Caveat - This section is slightly more messy and I wish there were less trial and error in my approach!
source_country['jurisdiction'] = entities_subset['jurisdiction_description']
source_country
# In[75]:
source_country_2_level_counts = source_country.groupby(['Country1', 'jurisdiction']).size().reset_index()
source_country_2_level_counts.rename(columns={0:'total_companies'},inplace=True)
source_country_2_level_counts
# In[159]:
source_country_2_level_counts_sorted = source_country_2_level_counts.sort_values(by=['total_companies'], ascending = False)
source_country_2_level_counts_sorted.head(10)
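# The long (tidy) table above can also be pivoted into wide form, with one row per source country
# and one column per jurisdiction, which is the shape a stacked bar chart wants. A sketch with
# made-up counts:

```python
import pandas as pd

counts = pd.DataFrame({'Country1': ['Singapore', 'Singapore', 'Hong Kong'],
                       'jurisdiction': ['BVI', 'Panama', 'BVI'],
                       'total_companies': [5, 2, 7]})
# one row per source country, one column per jurisdiction
wide = counts.pivot(index='Country1', columns='jurisdiction',
                    values='total_companies').fillna(0)
# wide.plot(kind='bar', stacked=True) would then draw the stacked chart directly
print(wide)
```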
# In[106]:
#In the interest of time, I will export the data for visualisation in plotly.
source_country_2_level_counts_sorted.to_csv("data/country_jurisdiction_sorted.csv")
# In[156]:
#I've visualised the data in plotly (it was a breeze!), and it even allows me to embed an interactive chart in
#the Jupyter notebook. WOW!
HTML('')
#Note: You can zoom in by dragging and selecting the section of interest. Double click to zoom out
# ### 3. A heatmap representing the number of companies found in each country, broken down by destination jurisdictions
# In[133]:
#I exported the source_country_2_level_counts_sorted dataframe to create a heatmap on Google Sheets.
HTML('')
# ### 4. Visualising the locations of the Singapore-based companies found in the Panama Papers (with the help of Google Maps)
# In[139]:
HTML('')
#Hover over the placemarkers to see details of the company
# ## Conclusion
#
# ### About Python
#
# 1. Pandas is really powerful and fast
# 2. Plotly is really intuitive and it syncs well with Python and Jupyter Notebook as well
#
# ### About the Panama Papers
#
# Based on the exploratory data analyses above, I've found out that:
# 1. The top 5 source countries of such shady dealings are: British Virgin Islands, Hong Kong, Switzerland, "Not Identified" and Panama. I guess the lawyers and accountants have learnt to hide their trail well. Apart from the clear suspects (BVI, Panama), there is a big chunk which is "Not Identified" - and this makes it quite hard to conclusively identify tax havens.
# 2. Even though Singapore is often touted as a possible tax haven, the figures show that the degree to which it houses companies related to the Panama Papers is much lower than for Hong Kong, Switzerland and the British Virgin Islands. That said, as a source country, it is ranked 15th (5,867 companies), right after the United States (6,254 companies). As a jurisdiction, it is ranked 11th (668 companies), right after Nevada (1,260 companies) and Hong Kong (1,331).
#
# Do explore the charts above yourself and let me know what else you can conclude from them :-)
#