In this notebook, I use the pandas and plotly libraries to clean, wrangle and visualise the Panama Papers dataset. It is in part to help me learn Python, pandas and plotly, and in part to satisfy my curiosity about the Panama Papers. Specifically, I hope to answer the following questions: (i) which countries do tax-evading companies originate from, (ii) which countries are popular tax havens, and (iii) zooming into Singapore, where do tax evaders live?
Everything I learnt about pandas is from two amazing tutorials by Brandon Rhodes (PyCon 2015) https://www.youtube.com/watch?v=5JnMutdy6Fw&list=PLO9pkowc_99YYMFnW9GZkIE93jbip1jZb&index=4 and Jonathan Rocher (SciPy 2015) https://www.youtube.com/watch?v=0CFFTJUZ2dc.
In the process, I learnt how to use pandas to:

  * read csv files into dataframes and inspect them (read_csv, info, head)
  * select subsets of columns and split multi-valued columns (apply, str.split)
  * clean and harmonise values (replace, dropna, fillna)
  * count, group and sort data (value_counts, groupby, sort_values)

I also learnt how to use the plotly library for data visualisation, and how to import and embed HTML files in a notebook.
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import cufflinks as cf
import plotly.tools as tls
import numpy as np
import pandas as pd
from pandas import set_option
set_option("display.max_rows", 25)
from IPython.core.display import HTML
LARGE_FIGSIZE = (16, 12)
The data is taken from 5 csv files, which contain data on the (i) entities, (ii) intermediaries, (iii) officers, and (iv) addresses of officers; the fifth file (all_edges) appears to let Neo4j link records across the csv files, using node_id as a unique identifier.
More details here: https://offshoreleaks.icij.org/pages/about

Note that this contains only a fraction of the Panama Papers leak from the Panama-based offices of Mossack Fonseca. That leak is made up of 2.6 terabytes of data, a total of 11.5 million records, which are not included en masse in the public database. The raw documents have not been published. The Mossack Fonseca leaks include vast amounts of email correspondence, bank account details, identity documents and financial transactions, and not all the names in those files appear in the database. Much of the information is buried in emails, power-of-attorney letters and internal notes of Mossack Fonseca employees and cannot easily be extracted in a systematic manner, according to the International Consortium of Investigative Journalists (ICIJ).
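As a toy sketch of how that linking file works (assuming the column names shown in the overview below: node_1, rel_type, node_2), an officer row can be joined to the entity it relates to via node_id. All values here are made up for illustration:

```python
import pandas as pd

# toy frames mirroring the real column names; the rows are invented
entities = pd.DataFrame({'name': ['ZODIAK LTD'], 'node_id': [67028]})
officers = pd.DataFrame({'name': ['JOHN DOE'], 'node_id': [12000001]})
all_edges = pd.DataFrame({'node_1': [12000001],
                          'rel_type': ['officer_of'],
                          'node_2': [67028]})

# join officers to entities through the edge list, using node_id as the key
linked = (all_edges
          .merge(officers, left_on='node_1', right_on='node_id')
          .merge(entities, left_on='node_2', right_on='node_id',
                 suffixes=('_officer', '_entity')))
print(linked[['name_officer', 'rel_type', 'name_entity']])
```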
addresses = pd.read_csv("data/Addresses.csv")
all_edges = pd.read_csv("data/all_edges.csv")
entities = pd.read_csv("data/Entities.csv")
intermediaries = pd.read_csv("data/Intermediaries.csv")
officers = pd.read_csv("data/Officers.csv")
all_data = {'1. entities' : pd.Series(entities.columns.values),
'2. intermediaries' : pd.Series(intermediaries.columns.values),
'3. officers': pd.Series(officers.columns.values),
'4. addresses' : pd.Series(addresses.columns.values),
'5. all_edges': pd.Series(all_edges.columns.values)}
data_overview = pd.DataFrame(all_data)
data_overview.head(20)
1. entities | 2. intermediaries | 3. officers | 4. addresses | 5. all_edges | |
---|---|---|---|---|---|
0 | name | name | name | address | node_1 |
1 | original_name | internal_id | icij_id | icij_id | rel_type |
2 | former_name | address | valid_until | valid_until | node_2 |
3 | jurisdiction | valid_until | country_codes | country_codes | NaN |
4 | jurisdiction_description | country_codes | countries | countries | NaN |
5 | company_type | countries | node_id | node_id | NaN |
6 | address | status | sourceID | sourceID | NaN |
7 | internal_id | node_id | NaN | NaN | NaN |
8 | incorporation_date | sourceID | NaN | NaN | NaN |
9 | inactivation_date | NaN | NaN | NaN | NaN |
10 | struck_off_date | NaN | NaN | NaN | NaN |
11 | dorm_date | NaN | NaN | NaN | NaN |
12 | status | NaN | NaN | NaN | NaN |
13 | service_provider | NaN | NaN | NaN | NaN |
14 | ibcRUC | NaN | NaN | NaN | NaN |
15 | country_codes | NaN | NaN | NaN | NaN |
16 | countries | NaN | NaN | NaN | NaN |
17 | note | NaN | NaN | NaN | NaN |
18 | valid_until | NaN | NaN | NaN | NaN |
19 | node_id | NaN | NaN | NaN | NaN |
entities.shape
(319150, 21)
entities.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 319150 entries, 0 to 319149 Data columns (total 21 columns): name 319146 non-null object original_name 213634 non-null object former_name 6454 non-null object jurisdiction 319150 non-null object jurisdiction_description 319150 non-null object company_type 103227 non-null object address 299319 non-null object internal_id 213634 non-null float64 incorporation_date 309736 non-null object inactivation_date 144760 non-null object struck_off_date 157872 non-null object dorm_date 20207 non-null object status 310106 non-null object service_provider 319150 non-null object ibcRUC 255375 non-null object country_codes 318360 non-null object countries 318360 non-null object note 8984 non-null object valid_until 319150 non-null object node_id 319150 non-null int64 sourceID 319150 non-null object dtypes: float64(1), int64(1), object(19) memory usage: 51.1+ MB
entities.columns
Index([u'name', u'original_name', u'former_name', u'jurisdiction', u'jurisdiction_description', u'company_type', u'address', u'internal_id', u'incorporation_date', u'inactivation_date', u'struck_off_date', u'dorm_date', u'status', u'service_provider', u'ibcRUC', u'country_codes', u'countries', u'note', u'valid_until', u'node_id', u'sourceID'], dtype='object')
#From the above, the columns of interest are: name, address, jurisdiction_description, countries,
#node_id and internal_id. node_id and internal_id may be useful later as keys to reference the other csv files.
entities_subset = entities[[u'name', u'address', u'jurisdiction_description', u'countries', u'node_id',
u'internal_id']]
entities_subset.head()
name | address | jurisdiction_description | countries | node_id | internal_id | |
---|---|---|---|---|---|---|
0 | ZODIAK LTD | Christabel Corporate Services Limited Christab... | Undetermined | Virgin Islands, British;Cyprus | 67028 | NaN |
1 | Zven Ltd. | Alesta Consulting, S.A. INVOICE BY EMAIL ONLY | Undetermined | Not identified;Virgin Islands, British | 67243 | NaN |
2 | Anson 11A Ltd. | Express Co Registration Pte Ltd 138 Cecil Stre... | Undetermined | Singapore;Virgin Islands, British | 67258 | NaN |
3 | GIADA LTD | Christabel Corporate Services Limited Christab... | Undetermined | Cyprus;Virgin Islands, British | 67266 | NaN |
4 | Scott D. Howard | Scott D. Howard P.O. Box 811 Brunswick ME 0401... | Undetermined | United States;Not identified | 108050 | NaN |
#I'm not sure how jurisdiction and country_codes relate - which is the source and which is the destination?
#Let's check whether they are identical.
entities.jurisdiction == entities.country_codes
# The output shows that they are not the same. Following some exploratory data analysis (excluded here
#for brevity), I found that (i) jurisdiction refers to the tax haven, and (ii) the distinction between
#source country and tax haven is not so clear cut. For instance, based on the second chart below, you'll see that
#several thousand companies investing in the British Virgin Islands are themselves from the British Virgin Islands.
0 False 1 False 2 False 3 False 4 False 5 True 6 False 7 False 8 False 9 True 10 False 11 False ... 319138 False 319139 False 319140 False 319141 False 319142 False 319143 False 319144 False 319145 False 319146 False 319147 False 319148 False 319149 False dtype: bool
#The country_codes and countries columns sometimes contain more than one country, separated by ';'.
#Counting the raw strings as-is would undercount individual countries. Let's fix this.
#split each ';'-separated string into one column per country, in reversed order
split = lambda x: pd.Series([i for i in reversed(x.split(';'))])
countries_split = entities_subset['countries'].dropna().apply(split)
countries_split.rename(columns={0:'Country1',1:'Country2',2:'Country3', 3: 'Country4'},inplace=True)
countries_split = countries_split[['Country1', 'Country2', 'Country3', 'Country4']]
print countries_split
Country1 Country2 Country3 Country4 0 Cyprus Virgin Islands, British NaN NaN 1 Virgin Islands, British Not identified NaN NaN 2 Virgin Islands, British Singapore NaN NaN 3 Virgin Islands, British Cyprus NaN NaN 4 Not identified United States NaN NaN 5 Not identified NaN NaN NaN 6 Virgin Islands, British Russian Federation NaN NaN 7 Netherlands Not identified NaN NaN ... ... ... ... ... 319142 Andorra NaN NaN NaN 319143 Isle of Man NaN NaN NaN 319144 Hong Kong NaN NaN NaN 319145 Hong Kong NaN NaN NaN 319146 Hong Kong NaN NaN NaN 319147 Hong Kong NaN NaN NaN 319148 United Arab Emirates NaN NaN NaN 319149 United Arab Emirates NaN NaN NaN [318360 rows x 4 columns]
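As an aside, the row-wise lambda above can be replaced by pandas' vectorised string methods: str.split with expand=True produces one column per ';'-separated value (without the reversal), padding shorter rows with missing values. A minimal sketch on made-up data:

```python
import pandas as pd

countries = pd.Series(['Cyprus;Virgin Islands, British',
                       'Singapore',
                       None])
# expand=True returns a DataFrame with one column per token;
# rows with fewer tokens are padded with missing values
split_cols = countries.dropna().str.split(';', expand=True)
print(split_cols)
```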
print countries_split.Country3.value_counts()
print countries_split.Country4.value_counts()
# For the sake of simplicity, I am going to drop Country3 and Country4, since countries appear in these
#two columns only 17 times in total (around 0.005% of the dataset).
Virgin Islands, British 4 Cayman Islands 3 Singapore 2 Hong Kong 2 Not identified 2 Seychelles 1 United Kingdom 1 Curaçao 1 Name: Country3, dtype: int64 Virgin Islands, British 1 Name: Country4, dtype: int64
#Dropping two columns and merging this back to the main dataframe, and then dropping the country_codes column
countries_split = countries_split.drop(['Country3', 'Country4'], axis = 1)
entities_subset[['Country1', 'Country2']] = countries_split[['Country1', 'Country2']]
entities_subset = entities_subset.drop('countries', axis = 1)
entities_subset.head()
name | address | jurisdiction_description | node_id | internal_id | Country1 | Country2 | |
---|---|---|---|---|---|---|---|
0 | ZODIAK LTD | Christabel Corporate Services Limited Christab... | Undetermined | 67028 | NaN | Cyprus | Virgin Islands, British |
1 | Zven Ltd. | Alesta Consulting, S.A. INVOICE BY EMAIL ONLY | Undetermined | 67243 | NaN | Virgin Islands, British | Not identified |
2 | Anson 11A Ltd. | Express Co Registration Pte Ltd 138 Cecil Stre... | Undetermined | 67258 | NaN | Virgin Islands, British | Singapore |
3 | GIADA LTD | Christabel Corporate Services Limited Christab... | Undetermined | 67266 | NaN | Virgin Islands, British | Cyprus |
4 | Scott D. Howard | Scott D. Howard P.O. Box 811 Brunswick ME 0401... | Undetermined | 108050 | NaN | Not identified | United States |
#Let's move on to harmonise the references to the British Virgin Islands (in Country1 and Country2 they
#are referred to as 'Virgin Islands, British').
entities_subset = entities_subset.replace(to_replace='Virgin Islands, British', value='British Virgin Islands')
entities_subset.head()
name | address | jurisdiction_description | node_id | internal_id | Country1 | Country2 | |
---|---|---|---|---|---|---|---|
0 | ZODIAK LTD | Christabel Corporate Services Limited Christab... | Undetermined | 67028 | NaN | Cyprus | British Virgin Islands |
1 | Zven Ltd. | Alesta Consulting, S.A. INVOICE BY EMAIL ONLY | Undetermined | 67243 | NaN | British Virgin Islands | Not identified |
2 | Anson 11A Ltd. | Express Co Registration Pte Ltd 138 Cecil Stre... | Undetermined | 67258 | NaN | British Virgin Islands | Singapore |
3 | GIADA LTD | Christabel Corporate Services Limited Christab... | Undetermined | 67266 | NaN | British Virgin Islands | Cyprus |
4 | Scott D. Howard | Scott D. Howard P.O. Box 811 Brunswick ME 0401... | Undetermined | 108050 | NaN | Not identified | United States |
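replace also accepts a mapping dict, which scales better if more names need harmonising later. A minimal sketch on made-up data (the mapping shown covers only the one rename actually needed here):

```python
import pandas as pd

df = pd.DataFrame({'Country1': ['Virgin Islands, British', 'Cyprus'],
                   'Country2': ['Singapore', 'Virgin Islands, British']})
# one canonical name per variant; replace applies the mapping across all columns
canonical = {'Virgin Islands, British': 'British Virgin Islands'}
df = df.replace(canonical)
print(df)
```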
Based on what's available in the entities_subset dataframe, I think we can visualise the following: (i) the top countries with companies named in the Panama Papers, (ii) a breakdown of these top countries by destination country, (iii) the relationship between source and destination countries (using a heatmap), and (iv) zooming into Singapore, where these companies are located, using the address column (I will enlist the help of Google Maps for this last visualisation).
#First, let's build a dataframe for visualisation
source_country = entities_subset[['Country1', 'Country2']]
rank = source_country.apply(pd.Series.value_counts).fillna(0)
rank['total'] = rank.sum(axis = 1)
rank = rank.sort_values(by = 'total', ascending = False).total
top_20 = rank[:20]
top_20
British Virgin Islands 69087.0 Hong Kong 51293.0 Switzerland 38077.0 Not identified 25698.0 Panama 18122.0 United Kingdom 17972.0 Jersey 14562.0 Russian Federation 11516.0 ... Cyprus 6374.0 United States 6254.0 Singapore 5867.0 Bahamas 5021.0 Uruguay 4906.0 Isle of Man 4893.0 China 4188.0 Monaco 3168.0 Name: total, dtype: float64
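To see why this ranking works, here is the same value_counts/sum pattern on a tiny made-up frame: each column is counted separately, and the per-country counts are then summed across the two columns. Note that a company listed under two countries is counted once for each, so totals can exceed the number of companies:

```python
import pandas as pd

df = pd.DataFrame({'Country1': ['BVI', 'Hong Kong', 'BVI'],
                   'Country2': ['Hong Kong', None, 'Singapore']})
# value_counts per column, missing counts filled with 0, then summed row-wise
rank = df.apply(pd.Series.value_counts).fillna(0)
rank['total'] = rank.sum(axis = 1)
rank = rank.sort_values(by = 'total', ascending = False).total
print(rank)
```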
#Let's visualise the data! I'll use the plotly cufflinks library (why not? it's interactive! try clicking on the chart!)
top_20.iplot(kind ='bar', yTitle = 'Number of Companies', title = 'Top 20 Countries (Source) Listed in Panama Papers')
#Next, let's structure our data into a 2-level dataframe, so that it can be used to create a stacked bar chart.
#Caveat - this section is slightly messier and involved more trial and error than I would have liked!
#(Assigning to source_country, a slice of entities_subset, triggers the SettingWithCopyWarning below;
#creating the slice with .copy() would have avoided it.)
source_country['jurisdiction'] = entities_subset['jurisdiction_description']
source_country
/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Country1 | Country2 | jurisdiction | |
---|---|---|---|
0 | Cyprus | British Virgin Islands | Undetermined |
1 | British Virgin Islands | Not identified | Undetermined |
2 | British Virgin Islands | Singapore | Undetermined |
3 | British Virgin Islands | Cyprus | Undetermined |
4 | Not identified | United States | Undetermined |
5 | Not identified | NaN | Undetermined |
6 | British Virgin Islands | Russian Federation | Undetermined |
7 | Netherlands | Not identified | Undetermined |
... | ... | ... | ... |
319142 | Andorra | NaN | Hong Kong |
319143 | Isle of Man | NaN | British Virgin Islands |
319144 | Hong Kong | NaN | British Virgin Islands |
319145 | Hong Kong | NaN | British Virgin Islands |
319146 | Hong Kong | NaN | British Virgin Islands |
319147 | Hong Kong | NaN | British Virgin Islands |
319148 | United Arab Emirates | NaN | British Virgin Islands |
319149 | United Arab Emirates | NaN | British Virgin Islands |
319150 rows × 3 columns
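For reference, the SettingWithCopyWarning above arises because source_country was created as a slice of entities_subset, so pandas cannot tell whether the assignment should propagate back. Taking an explicit .copy() when creating the slice makes the intent unambiguous, as this toy sketch shows:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
sub = df[['a']].copy()    # independent frame, not a view of df
sub['c'] = sub['a'] * 2   # no SettingWithCopyWarning, df is untouched
print(sub)
```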
source_country_2_level_counts = source_country.groupby(['Country1', 'jurisdiction']).size().reset_index()
source_country_2_level_counts.rename(columns={0:'total_companies'},inplace=True)
source_country_2_level_counts
Country1 | jurisdiction | total_companies | |
---|---|---|---|
0 | Albania | Bahamas | 2 |
1 | American Samoa | United Kingdom | 1 |
2 | Andorra | Bahamas | 26 |
3 | Andorra | British Virgin Islands | 39 |
4 | Andorra | Costa Rica | 3 |
5 | Andorra | Hong Kong | 9 |
6 | Andorra | Nevada | 24 |
7 | Andorra | Niue | 12 |
... | ... | ... | ... |
947 | Virgin Islands, U.S. | Bahamas | 2 |
948 | Virgin Islands, U.S. | British Virgin Islands | 1 |
949 | Virgin Islands, U.S. | Panama | 1 |
950 | Virgin Islands, U.S. | Undetermined | 7 |
951 | Yemen | Panama | 1 |
952 | Zambia | British Virgin Islands | 2 |
953 | Zimbabwe | British Virgin Islands | 5 |
954 | Zimbabwe | Panama | 3 |
955 rows × 3 columns
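The groupby/size step above can be illustrated on a toy frame: size() counts the rows in each (Country1, jurisdiction) group, and reset_index turns the grouped result back into a flat dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Country1': ['Andorra', 'Andorra', 'Zimbabwe'],
                   'jurisdiction': ['Bahamas', 'Bahamas', 'Panama']})
# size() counts rows per group; reset_index flattens the MultiIndex result
counts = df.groupby(['Country1', 'jurisdiction']).size().reset_index()
counts.rename(columns={0: 'total_companies'}, inplace=True)
print(counts)
```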
source_country_2_level_counts_sorted = source_country_2_level_counts.sort_values(by=['total_companies'], ascending = False)
source_country_2_level_counts_sorted.head(10)
Country1 | jurisdiction | total_companies | |
---|---|---|---|
117 | British Virgin Islands | British Virgin Islands | 37843 |
358 | Hong Kong | British Virgin Islands | 27300 |
805 | Switzerland | British Virgin Islands | 19650 |
634 | Not identified | Undetermined | 18193 |
124 | British Virgin Islands | Undetermined | 14817 |
811 | Switzerland | Panama | 11590 |
448 | Jersey | British Virgin Islands | 10467 |
735 | Samoa | Samoa | 7677 |
370 | Hong Kong | Undetermined | 6110 |
651 | Panama | Panama | 6090 |
#In the interest of time, I will export the data for visualisation in plotly.
source_country_2_level_counts_sorted.to_csv("data/country_jurisdiction_sorted.csv")
#I've visualised the data in plotly (it was a breeze!), and it even allows me to embed an interactive chart
#in a Jupyter notebook. WOW!
HTML('<iframe id="igraph" scrolling="no" style="border:none;" seamless="seamless" src="https://plot.ly/~davified/17.embed" height="400" width="100%"></iframe>')
#Note: You can zoom in by dragging and selecting the section of interest. Double click to zoom out
#I exported the source_country_2_level_counts_sorted dataframe to create a heatmap on Google Sheets.
HTML('<iframe src="https://docs.google.com/spreadsheets/d/1kqYZrG7GPBPDWsJWBWlBHBV7DCt9kFaWNi83VkVOZac/pubhtml?gid=1600001876&single=true&widget=true&headers=false" width=1000 height=500></iframe>')
HTML('<iframe src="https://www.google.com/maps/d/embed?mid=1nrJFPuC2TWjS3vVHQbPCnU2jmFQ" width="1000" height="500"></iframe>')
#Hover over the placemarkers to see details of the company
Based on the exploratory data analyses above, I've found that:

  * the British Virgin Islands is both the most common source country and the most common jurisdiction (tax haven) in the dataset
  * Hong Kong, Switzerland, Panama and the United Kingdom are the next largest source countries
  * the line between source country and tax haven is blurry: for tens of thousands of companies (e.g. those from the British Virgin Islands, Samoa and Panama), the source country and the jurisdiction are the same
Do explore the charts above yourself and let me know what else you can conclude from them :-)