Understanding the Panama Papers using pandas and plotly

In this notebook, I use the pandas and plotly libraries to clean, wrangle and visualise the Panama Papers dataset. I wrote it partly to learn Python, pandas and plotly, and partly to satisfy my curiosity about the Panama Papers. Specifically, I hope to answer the following questions: (i) which countries do tax-evading companies originate from, (ii) which countries are popular tax havens, and (iii) zooming in on Singapore, where do tax evaders live?

Everything I learnt about pandas comes from two amazing tutorials by Brandon Rhodes (PyCon 2015) https://www.youtube.com/watch?v=5JnMutdy6Fw&list=PLO9pkowc_99YYMFnW9GZkIE93jbip1jZb&index=4 and Jonathan Rocher (SciPy 2015) https://www.youtube.com/watch?v=0CFFTJUZ2dc.

In the process, I learnt how to use pandas to:

  1. Load data
  2. Clean and format data
  3. Transform datasets
  4. Aggregate and summarise data

I also learnt how to use the plotly library for data visualisation, and how to import and embed HTML files.

Note:

  • This is an exercise in pandas (for data cleaning) and plotly (for visualisation). I haven't learnt about data validation at this point, so please take my conclusions with a pinch of salt.
  • Source: https://offshoreleaks.icij.org/pages/database ICIJ Offshore Leaks Database is licensed under the Open Database License and contents under Creative Commons Attribution-ShareAlike license.
In [143]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import cufflinks as cf
import plotly.tools as tls

import numpy as np
import pandas as pd
from pandas import set_option
set_option("display.max_rows", 25)
from IPython.core.display import HTML


LARGE_FIGSIZE = (16, 12)

Describing the data

The data is taken from 5 csv files, which contain data on the (i) entities, (ii) intermediaries, (iii) officers, and (iv) addresses of officers; the fifth file (all_edges) appears to allow Neo4j to map data across the csv files, using node_id as a unique identifier.

More details here: https://offshoreleaks.icij.org/pages/about Note that this contains only a fraction of the Panama Papers leaks from the Panama-based offices of Mossack Fonseca. Those leaks are made up of 2.6 terabytes of data, a total of 11.5 million records, which are not included en masse in the public database. The raw documents have not been published. The Mossack Fonseca leaks include vast amounts of email correspondence, bank account details, identity documents and financial transactions, and not all the names on those files appear in the database. Much of the information is buried in emails, power-of-attorney letters and internal notes of Mossack Fonseca employees and cannot easily be extracted in a systematic manner, according to the International Consortium of Investigative Journalists (ICIJ).
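The node_id linkage described above can be illustrated with a small, self-contained sketch. The toy names and ids below are purely made up (the real files are loaded in the next cell); only the column names (node_id, node_1, rel_type, node_2) mirror the actual csv files:

```python
import pandas as pd

# Toy stand-ins for the real CSVs: entities and officers each carry a
# node_id, and all_edges records typed relationships between node_ids.
entities = pd.DataFrame({'name': ['ACME LTD', 'ZEN CORP'], 'node_id': [100, 101]})
officers = pd.DataFrame({'name': ['J. Doe'], 'node_id': [200]})
edges = pd.DataFrame({'node_1': [200], 'rel_type': ['officer_of'], 'node_2': [100]})

# Resolve an edge into human-readable names with two merges
linked = (edges
          .merge(officers, left_on='node_1', right_on='node_id')
          .merge(entities, left_on='node_2', right_on='node_id',
                 suffixes=('_officer', '_entity')))
print(linked[['name_officer', 'rel_type', 'name_entity']])
```

This resolves each edge row ("node 200 is an officer_of node 100") into "J. Doe is an officer_of ACME LTD".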

In [144]:
addresses = pd.read_csv("data/Addresses.csv")
all_edges = pd.read_csv("data/all_edges.csv")
entities = pd.read_csv("data/Entities.csv")
intermediaries = pd.read_csv("data/Intermediaries.csv")
officers = pd.read_csv("data/Officers.csv")
In [145]:
all_data = {'1. entities' : pd.Series(entities.columns.values),
            '2. intermediaries' : pd.Series(intermediaries.columns.values),
            '3. officers': pd.Series(officers.columns.values),
            '4. addresses' : pd.Series(addresses.columns.values),
            '5. all_edges': pd.Series(all_edges.columns.values)}

data_overview = pd.DataFrame(all_data)
data_overview.head(20)
Out[145]:
1. entities 2. intermediaries 3. officers 4. addresses 5. all_edges
0 name name name address node_1
1 original_name internal_id icij_id icij_id rel_type
2 former_name address valid_until valid_until node_2
3 jurisdiction valid_until country_codes country_codes NaN
4 jurisdiction_description country_codes countries countries NaN
5 company_type countries node_id node_id NaN
6 address status sourceID sourceID NaN
7 internal_id node_id NaN NaN NaN
8 incorporation_date sourceID NaN NaN NaN
9 inactivation_date NaN NaN NaN NaN
10 struck_off_date NaN NaN NaN NaN
11 dorm_date NaN NaN NaN NaN
12 status NaN NaN NaN NaN
13 service_provider NaN NaN NaN NaN
14 ibcRUC NaN NaN NaN NaN
15 country_codes NaN NaN NaN NaN
16 countries NaN NaN NaN NaN
17 note NaN NaN NaN NaN
18 valid_until NaN NaN NaN NaN
19 node_id NaN NaN NaN NaN

Exploring the entities dataframe

In [146]:
entities.shape
Out[146]:
(319150, 21)
In [147]:
entities.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319150 entries, 0 to 319149
Data columns (total 21 columns):
name                        319146 non-null object
original_name               213634 non-null object
former_name                 6454 non-null object
jurisdiction                319150 non-null object
jurisdiction_description    319150 non-null object
company_type                103227 non-null object
address                     299319 non-null object
internal_id                 213634 non-null float64
incorporation_date          309736 non-null object
inactivation_date           144760 non-null object
struck_off_date             157872 non-null object
dorm_date                   20207 non-null object
status                      310106 non-null object
service_provider            319150 non-null object
ibcRUC                      255375 non-null object
country_codes               318360 non-null object
countries                   318360 non-null object
note                        8984 non-null object
valid_until                 319150 non-null object
node_id                     319150 non-null int64
sourceID                    319150 non-null object
dtypes: float64(1), int64(1), object(19)
memory usage: 51.1+ MB
In [148]:
entities.columns
Out[148]:
Index([u'name', u'original_name', u'former_name', u'jurisdiction',
       u'jurisdiction_description', u'company_type', u'address',
       u'internal_id', u'incorporation_date', u'inactivation_date',
       u'struck_off_date', u'dorm_date', u'status', u'service_provider',
       u'ibcRUC', u'country_codes', u'countries', u'note', u'valid_until',
       u'node_id', u'sourceID'],
      dtype='object')
In [150]:
#From the above, the columns of interest are: name, address, jurisdiction_description, countries,
#node_id and internal_id. The last two columns may be useful later on as indexes to reference other csv files.

entities_subset = entities[[u'name', u'address', u'jurisdiction_description', u'countries', u'node_id',
                     u'internal_id']]
entities_subset.head()
Out[150]:
name address jurisdiction_description countries node_id internal_id
0 ZODIAK LTD Christabel Corporate Services Limited Christab... Undetermined Virgin Islands, British;Cyprus 67028 NaN
1 Zven Ltd. Alesta Consulting, S.A. INVOICE BY EMAIL ONLY Undetermined Not identified;Virgin Islands, British 67243 NaN
2 Anson 11A Ltd. Express Co Registration Pte Ltd 138 Cecil Stre... Undetermined Singapore;Virgin Islands, British 67258 NaN
3 GIADA LTD Christabel Corporate Services Limited Christab... Undetermined Cyprus;Virgin Islands, British 67266 NaN
4 Scott D. Howard Scott D. Howard P.O. Box 811 Brunswick ME 0401... Undetermined United States;Not identified 108050 NaN
In [151]:
#I'm not sure what jurisdiction and countries mean - which is the source and which is the destination? Let's find
#out if they are identical.

entities.jurisdiction == entities.country_codes

# The output shows that they are not the same. Following some exploratory data analysis (which I've excluded from
#here for brevity), I found that (i) jurisdiction refers to the tax havens, and (ii) the distinction between
#source country and tax haven is not so clear cut. For instance, based on the second chart below, you'll see that
#several thousand of the companies investing into the British Virgin Islands are themselves from the British Virgin Islands.
Out[151]:
0         False
1         False
2         False
3         False
4         False
5          True
6         False
7         False
8         False
9          True
10        False
11        False
          ...  
319138    False
319139    False
319140    False
319141    False
319142    False
319143    False
319144    False
319145    False
319146    False
319147    False
319148    False
319149    False
dtype: bool
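Rather than eyeballing the boolean Series above, the agreement between the two columns can be summarised directly. A minimal sketch on made-up codes (not the real data):

```python
import pandas as pd

jurisdiction = pd.Series(['BVI', 'PAN', 'BVI', 'SGP'])
country_codes = pd.Series(['CYP', 'PAN', 'BVI', 'HKG'])

matches = jurisdiction == country_codes   # elementwise boolean Series
print(matches.value_counts())             # counts of True/False
print(matches.mean())                     # fraction of rows where they agree
```

`value_counts()` and `mean()` on a boolean Series give the match counts and match rate in one line each.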
In [9]:
#The country_codes and countries columns sometimes contain more than one country, which may lead to undercounting.
#Let's fix this.

split = lambda x: pd.Series([i for i in reversed(x.split(';'))])
countries_split = entities_subset['countries'].dropna().apply(split)

countries_split.rename(columns={0:'Country1',1:'Country2',2:'Country3', 3: 'Country4'},inplace=True)
countries_split = countries_split[['Country1', 'Country2', 'Country3', 'Country4']]
print countries_split
                       Country1                 Country2 Country3 Country4
0                        Cyprus  Virgin Islands, British      NaN      NaN
1       Virgin Islands, British           Not identified      NaN      NaN
2       Virgin Islands, British                Singapore      NaN      NaN
3       Virgin Islands, British                   Cyprus      NaN      NaN
4                Not identified            United States      NaN      NaN
5                Not identified                      NaN      NaN      NaN
6       Virgin Islands, British       Russian Federation      NaN      NaN
7                   Netherlands           Not identified      NaN      NaN
...                         ...                      ...      ...      ...
319142                  Andorra                      NaN      NaN      NaN
319143              Isle of Man                      NaN      NaN      NaN
319144                Hong Kong                      NaN      NaN      NaN
319145                Hong Kong                      NaN      NaN      NaN
319146                Hong Kong                      NaN      NaN      NaN
319147                Hong Kong                      NaN      NaN      NaN
319148     United Arab Emirates                      NaN      NaN      NaN
319149     United Arab Emirates                      NaN      NaN      NaN

[318360 rows x 4 columns]
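As an aside, pandas can do this kind of split without a lambda, via the vectorised `.str.split(..., expand=True)`. A small sketch on made-up values (note that, unlike the cell above, this keeps the listed order rather than reversing it):

```python
import pandas as pd

countries = pd.Series(['Virgin Islands, British;Cyprus',
                       'Singapore;Virgin Islands, British',
                       'Andorra'])

# Vectorised split into columns; rows with fewer entries are padded with NaN
split_df = countries.str.split(';', expand=True)
split_df.columns = ['Country1', 'Country2']
print(split_df)
```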
In [154]:
print countries_split.Country3.value_counts()
print countries_split.Country4.value_counts()

# For the sake of simplicity, I am going to drop Country3 and Country4, since countries appear in these two
#columns only 17 times in total (about 0.005% of our dataset).
Virgin Islands, British    4
Cayman Islands             3
Singapore                  2
Hong Kong                  2
Not identified             2
Seychelles                 1
United Kingdom             1
Curaçao                    1
Name: Country3, dtype: int64
Virgin Islands, British    1
Name: Country4, dtype: int64
In [153]:
#Dropping two columns and merging this back to the main dataframe, and then dropping the country_codes column.
#Note that drop returns a new dataframe, so the result has to be assigned back.
countries_split = countries_split.drop(['Country3', 'Country4'], axis = 1)
entities_subset[['Country1', 'Country2']] = countries_split[['Country1', 'Country2']]
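A quick illustration of why the `drop` above needs its result assigned back (or `inplace=True`): by default `drop` returns a new dataframe and leaves the original untouched. Toy frame, hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

df.drop('c', axis=1)            # returns a new frame; df itself is unchanged
assert 'c' in df.columns

df = df.drop('c', axis=1)       # reassign to actually keep the dropped version
assert 'c' not in df.columns
```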
In [142]:
entities_subset = entities_subset.drop('countries', axis = 1)
entities_subset.head()
Out[142]:
name address jurisdiction_description node_id internal_id Country2
0 ZODIAK LTD Christabel Corporate Services Limited Christab... Undetermined 67028 NaN British Virgin Islands
1 Zven Ltd. Alesta Consulting, S.A. INVOICE BY EMAIL ONLY Undetermined 67243 NaN Not identified
2 Anson 11A Ltd. Express Co Registration Pte Ltd 138 Cecil Stre... Undetermined 67258 NaN Singapore
3 GIADA LTD Christabel Corporate Services Limited Christab... Undetermined 67266 NaN Cyprus
4 Scott D. Howard Scott D. Howard P.O. Box 811 Brunswick ME 0401... Undetermined 108050 NaN United States
In [13]:
#Let's move on to harmonise the references to the British Virgin Islands (in Country1 they are referred to as
#'Virgin Islands, British').

entities_subset = entities_subset.replace(to_replace='Virgin Islands, British', value='British Virgin Islands')
entities_subset.head()
Out[13]:
name address jurisdiction_description node_id internal_id Country1 Country2
0 ZODIAK LTD Christabel Corporate Services Limited Christab... Undetermined 67028 NaN Cyprus British Virgin Islands
1 Zven Ltd. Alesta Consulting, S.A. INVOICE BY EMAIL ONLY Undetermined 67243 NaN British Virgin Islands Not identified
2 Anson 11A Ltd. Express Co Registration Pte Ltd 138 Cecil Stre... Undetermined 67258 NaN British Virgin Islands Singapore
3 GIADA LTD Christabel Corporate Services Limited Christab... Undetermined 67266 NaN British Virgin Islands Cyprus
4 Scott D. Howard Scott D. Howard P.O. Box 811 Brunswick ME 0401... Undetermined 108050 NaN Not identified United States

The data is cleaner now, so let's prepare it for visualisation!

Based on what's available in the entities_subset dataframe, I think we can visualise the following: (i) top countries with companies found in the Panama Papers, (ii) breakdown of these top countries by destination countries, (iii) the breakdown in source and destination countries (using a heatmap), and (iv) zooming into Singapore, we can use the address column to visualise where these companies are located in Singapore (I will enlist the help of Google Maps for this last visualisation)

1. Top 20 countries with companies found in the Panama Papers

In [23]:
#First, let's build a dataframe for visualisation

source_country = entities_subset[['Country1', 'Country2']]
rank = source_country.apply(pd.Series.value_counts).fillna(0)
rank['total'] = rank.sum(axis = 1)
rank = rank.sort_values(by = 'total', ascending = False).total

top_20 = rank[:20]
top_20
Out[23]:
British Virgin Islands    69087.0
Hong Kong                 51293.0
Switzerland               38077.0
Not identified            25698.0
Panama                    18122.0
United Kingdom            17972.0
Jersey                    14562.0
Russian Federation        11516.0
                           ...   
Cyprus                     6374.0
United States              6254.0
Singapore                  5867.0
Bahamas                    5021.0
Uruguay                    4906.0
Isle of Man                4893.0
China                      4188.0
Monaco                     3168.0
Name: total, dtype: float64
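The counting pattern above (per-column `value_counts`, then a row-wise total) can be seen in miniature on toy data:

```python
import pandas as pd

df = pd.DataFrame({'Country1': ['BVI', 'HK', 'BVI'],
                   'Country2': ['HK', None, 'SGP']})

# Count appearances per column (NaNs are ignored), pad missing with 0,
# then total across both columns per country
counts = df.apply(pd.Series.value_counts).fillna(0)
counts['total'] = counts.sum(axis=1)
print(counts.sort_values('total', ascending=False))
```

Each country ends up counted once whether it appeared as the first or the second listed country, which is exactly what the ranking needs.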
In [124]:
#Let's visualise the data! I'll use the plotly cufflinks library (why not? it's interactive! try clicking on the chart!)

top_20.iplot(kind ='bar', yTitle = 'Number of Companies', title = 'Top 20 Countries (Source) Listed in Panama Papers')
Out[124]:

2. Breakdown of top countries by destination countries (i.e. jurisdictions)

In [25]:
#First, let's structure our data into a 2-level dataframe, so that it can be used to create a stacked bar chart.
#Caveat - this section is slightly messier, and I wish there had been less trial and error in my approach!

source_country['jurisdiction'] = entities_subset['jurisdiction_description']
source_country
/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Out[25]:
Country1 Country2 jurisdiction
0 Cyprus British Virgin Islands Undetermined
1 British Virgin Islands Not identified Undetermined
2 British Virgin Islands Singapore Undetermined
3 British Virgin Islands Cyprus Undetermined
4 Not identified United States Undetermined
5 Not identified NaN Undetermined
6 British Virgin Islands Russian Federation Undetermined
7 Netherlands Not identified Undetermined
... ... ... ...
319142 Andorra NaN Hong Kong
319143 Isle of Man NaN British Virgin Islands
319144 Hong Kong NaN British Virgin Islands
319145 Hong Kong NaN British Virgin Islands
319146 Hong Kong NaN British Virgin Islands
319147 Hong Kong NaN British Virgin Islands
319148 United Arab Emirates NaN British Virgin Islands
319149 United Arab Emirates NaN British Virgin Islands

319150 rows × 3 columns
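The `SettingWithCopyWarning` above appears because `source_country` was created as a slice of `entities_subset`. Taking an explicit `.copy()` when building the subset silences it; a minimal sketch on a toy frame:

```python
import pandas as pd

parent = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# An explicit copy makes the new frame independent of `parent`, so
# assigning a column cannot trigger SettingWithCopyWarning
child = parent[['a']].copy()
child['c'] = child['a'] * 10
print(child)
```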

In [75]:
source_country_2_level_counts = source_country.groupby(['Country1', 'jurisdiction']).size().reset_index()
source_country_2_level_counts.rename(columns={0:'total_companies'},inplace=True)
source_country_2_level_counts
Out[75]:
Country1 jurisdiction total_companies
0 Albania Bahamas 2
1 American Samoa United Kingdom 1
2 Andorra Bahamas 26
3 Andorra British Virgin Islands 39
4 Andorra Costa Rica 3
5 Andorra Hong Kong 9
6 Andorra Nevada 24
7 Andorra Niue 12
... ... ... ...
947 Virgin Islands, U.S. Bahamas 2
948 Virgin Islands, U.S. British Virgin Islands 1
949 Virgin Islands, U.S. Panama 1
950 Virgin Islands, U.S. Undetermined 7
951 Yemen Panama 1
952 Zambia British Virgin Islands 2
953 Zimbabwe British Virgin Islands 5
954 Zimbabwe Panama 3

955 rows × 3 columns
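The two-level counting above, in miniature; as an aside, `reset_index(name=...)` also saves the separate `rename` step:

```python
import pandas as pd

df = pd.DataFrame({'Country1': ['Andorra', 'Andorra', 'Zambia'],
                   'jurisdiction': ['Bahamas', 'Bahamas', 'BVI']})

# Count rows per (source country, jurisdiction) pair
pair_counts = (df.groupby(['Country1', 'jurisdiction'])
                 .size()
                 .reset_index(name='total_companies'))
print(pair_counts)
```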

In [159]:
source_country_2_level_counts_sorted = source_country_2_level_counts.sort_values(by=['total_companies'], ascending = False)
source_country_2_level_counts_sorted.head(10)
Out[159]:
Country1 jurisdiction total_companies
117 British Virgin Islands British Virgin Islands 37843
358 Hong Kong British Virgin Islands 27300
805 Switzerland British Virgin Islands 19650
634 Not identified Undetermined 18193
124 British Virgin Islands Undetermined 14817
811 Switzerland Panama 11590
448 Jersey British Virgin Islands 10467
735 Samoa Samoa 7677
370 Hong Kong Undetermined 6110
651 Panama Panama 6090
In [106]:
#In the interest of time, I will export the data for visualisation in plotly. 
source_country_2_level_counts_sorted.to_csv("data/country_jurisdiction_sorted.csv")
In [156]:
#I've visualised the data in plotly (it was a breeze!), and it even allows me to embed an interactive chart in
#a Jupyter notebook. WOW!
HTML('<iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://plot.ly/~davified/17.embed" height="400" width="100%"></iframe>')

#Note: You can zoom in by dragging and selecting the section of interest. Double click to zoom out
Out[156]:

3. A heatmap representing the number of companies found in each country, broken down by destination jurisdictions

In [133]:
#I exported the source_country_2_level_counts_sorted dataframe to create a heatmap on Google Sheets.

HTML('<iframe src="https://docs.google.com/spreadsheets/d/1kqYZrG7GPBPDWsJWBWlBHBV7DCt9kFaWNi83VkVOZac/pubhtml?gid=1600001876&amp;single=true&amp;widget=true&amp;headers=false" width=1000 height=500></iframe>')
Out[133]:

4. Visualising the locations of the Singapore-based companies found in the Panama Papers (with the help of Google Maps)

In [139]:
HTML('<iframe src="https://www.google.com/maps/d/embed?mid=1nrJFPuC2TWjS3vVHQbPCnU2jmFQ" width="1000" height="500"></iframe>')

#Hover over the placemarkers to see details of the company
Out[139]:

Conclusion

About Python

  1. Pandas is really powerful and fast
  2. Plotly is really intuitive, and it works well with both Python and Jupyter Notebook

About the Panama Papers

Based on the exploratory data analyses above, I've found out that:

  1. The top 5 source countries of such shady dealings are: British Virgin Islands, Hong Kong, Switzerland, "Not Identified" and Panama. I guess the lawyers and accountants have learnt to hide their trail well. Apart from the clear suspects (BVI, Panama), there is a big chunk which is "Not Identified" - and this makes it quite hard to conclusively trace where the money comes from.
  2. Even though Singapore is often touted as a possible tax haven, the figures show that the degree to which it houses companies related to the Panama Papers is much smaller than that of Hong Kong, Switzerland or the British Virgin Islands. That said, as a source country it is ranked 15th (5,867 companies), right after the United States (6,254 companies). As a jurisdiction it is ranked 11th (668 companies), right after Nevada (1,260 companies) and Hong Kong (1,331).

Do explore the charts above yourself and let me know what else you can conclude from them :-)