In this notebook, I use the pandas and plotly libraries to clean, wrangle and visualise the Panama Papers dataset. It is in part to help me learn Python, pandas and plotly, and in part to satisfy my curiosity about the Panama Papers. Specifically, I hope to answer the following questions: (i) which countries do tax-evading companies originate from, (ii) which countries are popular tax havens, and (iii) zooming into Singapore, where do tax evaders live?
Everything I learnt about pandas is from two amazing tutorials by Brandon Rhodes (PyCon 2015) https://www.youtube.com/watch?v=5JnMutdy6Fw&list=PLO9pkowc_99YYMFnW9GZkIE93jbip1jZb&index=4 and Jonathan Rocher (SciPy 2015) https://www.youtube.com/watch?v=0CFFTJUZ2dc.
In the process, I learnt how to use pandas to:

  * read csv files into dataframes and inspect them (read_csv, info, head)
  * select subsets of columns and split multi-valued columns (apply, str.split)
  * clean and harmonise values (replace, dropna, fillna)
  * count, group and sort data (value_counts, groupby, sort_values)

I also learnt how to use the plotly library for data visualisation, and how to import and embed HTML files in a notebook.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import cufflinks as cf
import plotly.tools as tls
import numpy as np
import pandas as pd
from pandas import set_option
set_option("display.max_rows", 25)
from IPython.core.display import HTML
LARGE_FIGSIZE = (16, 12)
The data is taken from 5 csv files, which contain data on the (i) entities, (ii) intermediaries, (iii) officers, and (iv) addresses of officers; the fifth file (all_edges) appears to let Neo4j link records across the csv files, using node_id as a unique identifier.
More details here: https://offshoreleaks.icij.org/pages/about

Note that this contains only a fraction of the Panama Papers leak from the Panama-based offices of Mossack Fonseca. That leak is made up of 2.6 terabytes of data, a total of 11.5 million records, which are not included en masse in the public database. The raw documents have not been published. The Mossack Fonseca leaks include vast amounts of email correspondence, bank account details, identity documents and financial transactions, and not all the names in those files appear in the database. Much of the information is buried in emails, power-of-attorney letters and internal notes of Mossack Fonseca employees and cannot easily be extracted in a systematic manner, according to the International Consortium of Investigative Journalists (ICIJ).
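As a toy sketch of how that linking file works (assuming the column names shown in the overview below: node_1, rel_type, node_2), an officer row can be joined to the entity it relates to via node_id. All values here are made up for illustration:

```python
import pandas as pd

# toy frames mirroring the real column names; the rows are invented
entities = pd.DataFrame({'name': ['ZODIAK LTD'], 'node_id': [67028]})
officers = pd.DataFrame({'name': ['JOHN DOE'], 'node_id': [12000001]})
all_edges = pd.DataFrame({'node_1': [12000001],
                          'rel_type': ['officer_of'],
                          'node_2': [67028]})

# join officers to entities through the edge list, using node_id as the key
linked = (all_edges
          .merge(officers, left_on='node_1', right_on='node_id')
          .merge(entities, left_on='node_2', right_on='node_id',
                 suffixes=('_officer', '_entity')))
print(linked[['name_officer', 'rel_type', 'name_entity']])
```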
addresses = pd.read_csv("data/Addresses.csv")
all_edges = pd.read_csv("data/all_edges.csv")
entities = pd.read_csv("data/Entities.csv")
intermediaries = pd.read_csv("data/Intermediaries.csv")
officers = pd.read_csv("data/Officers.csv")
all_data = {'1. entities' : pd.Series(entities.columns.values),
'2. intermediaries' : pd.Series(intermediaries.columns.values),
'3. officers': pd.Series(officers.columns.values),
'4. addresses' : pd.Series(addresses.columns.values),
'5. all_edges': pd.Series(all_edges.columns.values)}
data_overview = pd.DataFrame(all_data)
data_overview.head(20)
1. entities | 2. intermediaries | 3. officers | 4. addresses | 5. all_edges | |
---|---|---|---|---|---|
0 | name | name | name | address | node_1 |
1 | original_name | internal_id | icij_id | icij_id | rel_type |
2 | former_name | address | valid_until | valid_until | node_2 |
3 | jurisdiction | valid_until | country_codes | country_codes | NaN |
4 | jurisdiction_description | country_codes | countries | countries | NaN |
5 | company_type | countries | node_id | node_id | NaN |
6 | address | status | sourceID | sourceID | NaN |
7 | internal_id | node_id | NaN | NaN | NaN |
8 | incorporation_date | sourceID | NaN | NaN | NaN |
9 | inactivation_date | NaN | NaN | NaN | NaN |
10 | struck_off_date | NaN | NaN | NaN | NaN |
11 | dorm_date | NaN | NaN | NaN | NaN |
12 | status | NaN | NaN | NaN | NaN |
13 | service_provider | NaN | NaN | NaN | NaN |
14 | ibcRUC | NaN | NaN | NaN | NaN |
15 | country_codes | NaN | NaN | NaN | NaN |
16 | countries | NaN | NaN | NaN | NaN |
17 | note | NaN | NaN | NaN | NaN |
18 | valid_until | NaN | NaN | NaN | NaN |
19 | node_id | NaN | NaN | NaN | NaN |
entities.shape
(319150, 21)
entities.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 319150 entries, 0 to 319149 Data columns (total 21 columns): name 319146 non-null object original_name 213634 non-null object former_name 6454 non-null object jurisdiction 319150 non-null object jurisdiction_description 319150 non-null object company_type 103227 non-null object address 299319 non-null object internal_id 213634 non-null float64 incorporation_date 309736 non-null object inactivation_date 144760 non-null object struck_off_date 157872 non-null object dorm_date 20207 non-null object status 310106 non-null object service_provider 319150 non-null object ibcRUC 255375 non-null object country_codes 318360 non-null object countries 318360 non-null object note 8984 non-null object valid_until 319150 non-null object node_id 319150 non-null int64 sourceID 319150 non-null object dtypes: float64(1), int64(1), object(19) memory usage: 51.1+ MB
entities.columns
Index([u'name', u'original_name', u'former_name', u'jurisdiction', u'jurisdiction_description', u'company_type', u'address', u'internal_id', u'incorporation_date', u'inactivation_date', u'struck_off_date', u'dorm_date', u'status', u'service_provider', u'ibcRUC', u'country_codes', u'countries', u'note', u'valid_until', u'node_id', u'sourceID'], dtype='object')
#From the above, the columns of interest are: name, address, jurisdiction_description, countries,
#node_id and internal_id. node_id and internal_id may be useful later as keys to reference the other csv files.
entities_subset = entities[[u'name', u'address', u'jurisdiction_description', u'countries', u'node_id',
u'internal_id']]
entities_subset.head()
name | address | jurisdiction_description | countries | node_id | internal_id | |
---|---|---|---|---|---|---|
0 | ZODIAK LTD | Christabel Corporate Services Limited Christab... | Undetermined | Virgin Islands, British;Cyprus | 67028 | NaN |
1 | Zven Ltd. | Alesta Consulting, S.A. INVOICE BY EMAIL ONLY | Undetermined | Not identified;Virgin Islands, British | 67243 | NaN |
2 | Anson 11A Ltd. | Express Co Registration Pte Ltd 138 Cecil Stre... | Undetermined | Singapore;Virgin Islands, British | 67258 | NaN |
3 | GIADA LTD | Christabel Corporate Services Limited Christab... | Undetermined | Cyprus;Virgin Islands, British | 67266 | NaN |
4 | Scott D. Howard | Scott D. Howard P.O. Box 811 Brunswick ME 0401... | Undetermined | United States;Not identified | 108050 | NaN |
#I'm not sure how jurisdiction and country_codes relate - which is the source and which is the destination?
#Let's check whether they are identical.
entities.jurisdiction == entities.country_codes
# The output shows that they are not the same. Following some exploratory data analysis (excluded here
#for brevity), I found that (i) jurisdiction refers to the tax haven, and (ii) the distinction between
#source country and tax haven is not so clear cut. For instance, based on the second chart below, you'll see that
#several thousand companies investing in the British Virgin Islands are themselves from the British Virgin Islands.
0 False 1 False 2 False 3 False 4 False 5 True 6 False 7 False 8 False 9 True 10 False 11 False ... 319138 False 319139 False 319140 False 319141 False 319142 False 319143 False 319144 False 319145 False 319146 False 319147 False 319148 False 319149 False dtype: bool
#The country_codes and countries columns sometimes contain more than one country, separated by ';'.
#Counting the raw strings as-is would undercount individual countries. Let's fix this.
#split each ';'-separated string into one column per country, in reversed order
split = lambda x: pd.Series([i for i in reversed(x.split(';'))])
countries_split = entities_subset['countries'].dropna().apply(split)
countries_split.rename(columns={0:'Country1',1:'Country2',2:'Country3', 3: 'Country4'},inplace=True)
countries_split = countries_split[['Country1', 'Country2', 'Country3', 'Country4']]
print countries_split
Country1 Country2 Country3 Country4 0 Cyprus Virgin Islands, British NaN NaN 1 Virgin Islands, British Not identified NaN NaN 2 Virgin Islands, British Singapore NaN NaN 3 Virgin Islands, British Cyprus NaN NaN 4 Not identified United States NaN NaN 5 Not identified NaN NaN NaN 6 Virgin Islands, British Russian Federation NaN NaN 7 Netherlands Not identified NaN NaN ... ... ... ... ... 319142 Andorra NaN NaN NaN 319143 Isle of Man NaN NaN NaN 319144 Hong Kong NaN NaN NaN 319145 Hong Kong NaN NaN NaN 319146 Hong Kong NaN NaN NaN 319147 Hong Kong NaN NaN NaN 319148 United Arab Emirates NaN NaN NaN 319149 United Arab Emirates NaN NaN NaN [318360 rows x 4 columns]
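As an aside, the row-wise lambda above can be replaced by pandas' vectorised string methods: str.split with expand=True produces one column per ';'-separated value (without the reversal), padding shorter rows with missing values. A minimal sketch on made-up data:

```python
import pandas as pd

countries = pd.Series(['Cyprus;Virgin Islands, British',
                       'Singapore',
                       None])
# expand=True returns a DataFrame with one column per token;
# rows with fewer tokens are padded with missing values
split_cols = countries.dropna().str.split(';', expand=True)
print(split_cols)
```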
print countries_split.Country3.value_counts()
print countries_split.Country4.value_counts()
# For the sake of simplicity, I am going to drop Country3 and Country4, since countries appear in these
#two columns only 17 times in total (around 0.005% of the dataset).
Virgin Islands, British 4 Cayman Islands 3 Singapore 2 Hong Kong 2 Not identified 2 Seychelles 1 United Kingdom 1 Curaçao 1 Name: Country3, dtype: int64 Virgin Islands, British 1 Name: Country4, dtype: int64
#Dropping two columns and merging this back to the main dataframe, and then dropping the country_codes column
countries_split = countries_split.drop(['Country3', 'Country4'], axis = 1)
entities_subset[['Country1', 'Country2']] = countries_split[['Country1', 'Country2']]
entities_subset = entities_subset.drop('countries', axis = 1)
entities_subset.head()
name | address | jurisdiction_description | node_id | internal_id | Country1 | Country2 | |
---|---|---|---|---|---|---|---|
0 | ZODIAK LTD | Christabel Corporate Services Limited Christab... | Undetermined | 67028 | NaN | Cyprus | Virgin Islands, British |
1 | Zven Ltd. | Alesta Consulting, S.A. INVOICE BY EMAIL ONLY | Undetermined | 67243 | NaN | Virgin Islands, British | Not identified |
2 | Anson 11A Ltd. | Express Co Registration Pte Ltd 138 Cecil Stre... | Undetermined | 67258 | NaN | Virgin Islands, British | Singapore |
3 | GIADA LTD | Christabel Corporate Services Limited Christab... | Undetermined | 67266 | NaN | Virgin Islands, British | Cyprus |
4 | Scott D. Howard | Scott D. Howard P.O. Box 811 Brunswick ME 0401... | Undetermined | 108050 | NaN | Not identified | United States |
#Let's move on to harmonise the references to the British Virgin Islands (in Country1 and Country2 they
#are referred to as 'Virgin Islands, British').
entities_subset = entities_subset.replace(to_replace='Virgin Islands, British', value='British Virgin Islands')
entities_subset.head()
name | address | jurisdiction_description | node_id | internal_id | Country1 | Country2 | |
---|---|---|---|---|---|---|---|
0 | ZODIAK LTD | Christabel Corporate Services Limited Christab... | Undetermined | 67028 | NaN | Cyprus | British Virgin Islands |
1 | Zven Ltd. | Alesta Consulting, S.A. INVOICE BY EMAIL ONLY | Undetermined | 67243 | NaN | British Virgin Islands | Not identified |
2 | Anson 11A Ltd. | Express Co Registration Pte Ltd 138 Cecil Stre... | Undetermined | 67258 | NaN | British Virgin Islands | Singapore |
3 | GIADA LTD | Christabel Corporate Services Limited Christab... | Undetermined | 67266 | NaN | British Virgin Islands | Cyprus |
4 | Scott D. Howard | Scott D. Howard P.O. Box 811 Brunswick ME 0401... | Undetermined | 108050 | NaN | Not identified | United States |
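replace also accepts a mapping dict, which scales better if more names need harmonising later. A minimal sketch on made-up data (the mapping shown covers only the one rename actually needed here):

```python
import pandas as pd

df = pd.DataFrame({'Country1': ['Virgin Islands, British', 'Cyprus'],
                   'Country2': ['Singapore', 'Virgin Islands, British']})
# one canonical name per variant; replace applies the mapping across all columns
canonical = {'Virgin Islands, British': 'British Virgin Islands'}
df = df.replace(canonical)
print(df)
```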
Based on what's available in the entities_subset dataframe, I think we can visualise the following: (i) the top countries with companies named in the Panama Papers, (ii) a breakdown of these top countries by destination country, (iii) the relationship between source and destination countries (using a heatmap), and (iv) zooming into Singapore, where these companies are located, using the address column (I will enlist the help of Google Maps for this last visualisation).
#First, let's build a dataframe for visualisation
source_country = entities_subset[['Country1', 'Country2']]
rank = source_country.apply(pd.Series.value_counts).fillna(0)
rank['total'] = rank.sum(axis = 1)
rank = rank.sort_values(by = 'total', ascending = False).total
top_20 = rank[:20]
top_20
British Virgin Islands 69087.0 Hong Kong 51293.0 Switzerland 38077.0 Not identified 25698.0 Panama 18122.0 United Kingdom 17972.0 Jersey 14562.0 Russian Federation 11516.0 ... Cyprus 6374.0 United States 6254.0 Singapore 5867.0 Bahamas 5021.0 Uruguay 4906.0 Isle of Man 4893.0 China 4188.0 Monaco 3168.0 Name: total, dtype: float64
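To see why this ranking works, here is the same value_counts/sum pattern on a tiny made-up frame: each column is counted separately, and the per-country counts are then summed across the two columns. Note that a company listed under two countries is counted once for each, so totals can exceed the number of companies:

```python
import pandas as pd

df = pd.DataFrame({'Country1': ['BVI', 'Hong Kong', 'BVI'],
                   'Country2': ['Hong Kong', None, 'Singapore']})
# value_counts per column, missing counts filled with 0, then summed row-wise
rank = df.apply(pd.Series.value_counts).fillna(0)
rank['total'] = rank.sum(axis = 1)
rank = rank.sort_values(by = 'total', ascending = False).total
print(rank)
```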
#Let's visualise the data! I'll use the plotly cufflinks library (why not? it's interactive! try clicking on the chart!)
top_20.iplot(kind ='bar', yTitle = 'Number of Companies', title = 'Top 20 Countries (Source) Listed in Panama Papers')
#Next, let's structure our data into a 2-level dataframe, so that it can be used to create a stacked bar chart.
#Caveat - this section is slightly messier and involved more trial and error than I would have liked!
#(Assigning to source_country, a slice of entities_subset, triggers the SettingWithCopyWarning below;
#creating the slice with .copy() would have avoided it.)
source_country['jurisdiction'] = entities_subset['jurisdiction_description']
source_country
/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Country1 | Country2 | jurisdiction | |
---|---|---|---|
0 | Cyprus | British Virgin Islands | Undetermined |
1 | British Virgin Islands | Not identified | Undetermined |
2 | British Virgin Islands | Singapore | Undetermined |
3 | British Virgin Islands | Cyprus | Undetermined |
4 | Not identified | United States | Undetermined |
5 | Not identified | NaN | Undetermined |
6 | British Virgin Islands | Russian Federation | Undetermined |
7 | Netherlands | Not identified | Undetermined |
... | ... | ... | ... |
319142 | Andorra | NaN | Hong Kong |
319143 | Isle of Man | NaN | British Virgin Islands |
319144 | Hong Kong | NaN | British Virgin Islands |
319145 | Hong Kong | NaN | British Virgin Islands |
319146 | Hong Kong | NaN | British Virgin Islands |
319147 | Hong Kong | NaN | British Virgin Islands |
319148 | United Arab Emirates | NaN | British Virgin Islands |
319149 | United Arab Emirates | NaN | British Virgin Islands |
319150 rows × 3 columns
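For reference, the SettingWithCopyWarning above arises because source_country was created as a slice of entities_subset, so pandas cannot tell whether the assignment should propagate back. Taking an explicit .copy() when creating the slice makes the intent unambiguous, as this toy sketch shows:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
sub = df[['a']].copy()    # independent frame, not a view of df
sub['c'] = sub['a'] * 2   # no SettingWithCopyWarning, df is untouched
print(sub)
```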
source_country_2_level_counts = source_country.groupby(['Country1', 'jurisdiction']).size().reset_index()
source_country_2_level_counts.rename(columns={0:'total_companies'},inplace=True)
source_country_2_level_counts
Country1 | jurisdiction | total_companies | |
---|---|---|---|
0 | Albania | Bahamas | 2 |
1 | American Samoa | United Kingdom | 1 |
2 | Andorra | Bahamas | 26 |
3 | Andorra | British Virgin Islands | 39 |
4 | Andorra | Costa Rica | 3 |
5 | Andorra | Hong Kong | 9 |
6 | Andorra | Nevada | 24 |
7 | Andorra | Niue | 12 |
... | ... | ... | ... |
947 | Virgin Islands, U.S. | Bahamas | 2 |
948 | Virgin Islands, U.S. | British Virgin Islands | 1 |
949 | Virgin Islands, U.S. | Panama | 1 |
950 | Virgin Islands, U.S. | Undetermined | 7 |
951 | Yemen | Panama | 1 |
952 | Zambia | British Virgin Islands | 2 |
953 | Zimbabwe | British Virgin Islands | 5 |
954 | Zimbabwe | Panama | 3 |
955 rows × 3 columns
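The groupby/size step above can be illustrated on a toy frame: size() counts the rows in each (Country1, jurisdiction) group, and reset_index turns the grouped result back into a flat dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Country1': ['Andorra', 'Andorra', 'Zimbabwe'],
                   'jurisdiction': ['Bahamas', 'Bahamas', 'Panama']})
# size() counts rows per group; reset_index flattens the MultiIndex result
counts = df.groupby(['Country1', 'jurisdiction']).size().reset_index()
counts.rename(columns={0: 'total_companies'}, inplace=True)
print(counts)
```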
source_country_2_level_counts_sorted = source_country_2_level_counts.sort_values(by=['total_companies'], ascending = False)
source_country_2_level_counts_sorted.head(10)
Country1 | jurisdiction | total_companies | |
---|---|---|---|
117 | British Virgin Islands | British Virgin Islands | 37843 |
358 | Hong Kong | British Virgin Islands | 27300 |
805 | Switzerland | British Virgin Islands | 19650 |
634 | Not identified | Undetermined | 18193 |
124 | British Virgin Islands | Undetermined | 14817 |
811 | Switzerland | Panama | 11590 |
448 | Jersey | British Virgin Islands | 10467 |
735 | Samoa | Samoa | 7677 |
370 | Hong Kong | Undetermined | 6110 |
651 | Panama | Panama | 6090 |
#In the interest of time, I will export the data for visualisation in plotly.
source_country_2_level_counts_sorted.to_csv("data/country_jurisdiction_sorted.csv")
#I've visualised the data in plotly (it was a breeze!), and it even allows me to embed an interactive chart
#in a Jupyter notebook. WOW!
HTML('<iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://plot.ly/~davified/17.embed" height="400" width="100%"></iframe>')
#Note: You can zoom in by dragging and selecting the section of interest. Double click to zoom out
#I exported the source_country_2_level_counts_sorted dataframe to create a heatmap on Google Sheets.
HTML('<iframe src="https://docs.google.com/spreadsheets/d/1kqYZrG7GPBPDWsJWBWlBHBV7DCt9kFaWNi83VkVOZac/pubhtml?gid=1600001876&single=true&widget=true&headers=false" width=1000 height=500></iframe>')
HTML('<iframe src="https://www.google.com/maps/d/embed?mid=1nrJFPuC2TWjS3vVHQbPCnU2jmFQ" width="1000" height="500"></iframe>')
#Hover over the placemarkers to see details of the company
Based on the exploratory data analyses above, I've found that:

  * the British Virgin Islands is both the most common source country and the most common jurisdiction (tax haven) in the dataset
  * Hong Kong, Switzerland, Panama and the United Kingdom are the next largest source countries
  * the line between source country and tax haven is blurry: for tens of thousands of companies (e.g. those from the British Virgin Islands, Samoa and Panama), the source country and the jurisdiction are the same
Do explore the charts above yourself and let me know what else you can conclude from them :-)