
My first data science project!

Helicopter Escapes

Our objective in this project is to answer the following questions:

* In which year is the number of attempted escapes highest?
* In which country is the number of attempted escapes highest?

In order to do this, we will:

* Clean up our data to slice out unnecessary information (see steps up to [11]);
* Format our data in a list that associates years and numbers of attempts (see steps [13-14]);
* Plot our results and deduce the answer to the first question (see step [15]);
* Import a function and display the results to answer the second question (see steps [16-17]).

We begin by importing some helper functions:

In [4]:

from helper import *

Now, let's get the data from the List of helicopter prison escapes Wikipedia article.

In [5]:

url = 'https://en.wikipedia.org/wiki/List_of_helicopter_prison_escapes'
data = data_from_url(url)

Let's print the first four rows.

In [6]:

print(data[:4])

[['August 19, 1971', 'Santa Martha Acatitla', 'Mexico', 'Yes', 'Joel David Kaplan Carlos Antonio Contreras Castro', "Joel David Kaplan was a New York businessman who had been arrested for murder in 1962 in Mexico City and was incarcerated at the Santa Martha Acatitla prison in the Iztapalapa borough of Mexico City. Joel's sister, Judy Kaplan, arranged the means to help Kaplan escape, and on August 19, 1971, a helicopter landed in the prison yard. The guards mistakenly thought this was an official visit. In two minutes, Kaplan and his cellmate Carlos Antonio Contreras, a Venezuelan counterfeiter, were able to board the craft and were piloted away, before any shots were fired.[9] Both men were flown to Texas and then different planes flew Kaplan to California and Castro to Guatemala.[3] The Mexican government never initiated extradition proceedings against Kaplan.[9] The escape is told in a book, The 10-Second Jailbreak: The Helicopter Escape of Joel David Kaplan.[4] It also inspired the 1975 action movie Breakout, which starred Charles Bronson and Robert Duvall.[9]"], ['October 31, 1973', 'Mountjoy Jail', 'Ireland', 'Yes', "JB O'Hagan Seamus TwomeyKevin Mallon", 'On October 31, 1973 an IRA member hijacked a helicopter and forced the pilot to land in the exercise yard of Dublin\'s Mountjoy Jail\'s D Wing at 3:40\xa0p.m., October 31, 1973. Three members of the IRA were able to escape: JB O\'Hagan, Seamus Twomey and Kevin Mallon. Another prisoner who also was in the prison was quoted as saying, "One shamefaced screw apologised to the governor and said he thought it was the new Minister for Defence (Paddy Donegan) arriving. I told him it was our Minister of Defence leaving." The Mountjoy helicopter escape became Republican lore and was immortalized by "The Helicopter Song", which contains the lines "It\'s up like a bird and over the city. 
There\'s three men a\'missing I heard the warder say".[1]'], ['May 24, 1978', 'United States Penitentiary, Marion', 'United States', 'No', 'Garrett Brock TrapnellMartin Joseph McNallyJames Kenneth Johnson', "43-year-old Barbara Ann Oswald hijacked a Saint Louis-based charter helicopter and forced the pilot to land in the yard at USP Marion. While landing the aircraft, the pilot, Allen Barklage, who was a Vietnam War veteran, struggled with Oswald and managed to wrestle the gun away from her. Barklage then shot and killed Oswald, thwarting the escape.[10] A few months later Oswald's daughter hijacked TWA Flight 541 in an effort to free Trapnell."], ['February 27, 1981', 'Fleury-Mérogis, Essonne, Ile de France', 'France', 'Yes', 'Gérard DupréDaniel Beaumont', "With the help of Serge Coutel, Gérard Dupré and Daniel Beaumont, succeed in the first and double helicopter escape of a French prison, in Fleury-Mérogis (Essonne), the best kept prison of France. The men hijacked a helicopter and its pilot that they rented to fly from Paris to Orléans. The pilot, Claude Fourcade, was taken hostage and was told that they were holding his wife and daughter hostage (which was not true) ... The flight turned into Paris - Fleury -Merogis - Porte d'Orléans.[11]"]]

Now we slice the description out of our data:

In [7]:

index = 0

In [8]:

for row in data:
    data[index] = row[:5]
    index += 1
    

In [9]:

print(data[:3])

[['August 19, 1971', 'Santa Martha Acatitla', 'Mexico', 'Yes', 'Joel David Kaplan Carlos Antonio Contreras Castro'], ['October 31, 1973', 'Mountjoy Jail', 'Ireland', 'Yes', "JB O'Hagan Seamus TwomeyKevin Mallon"], ['May 24, 1978', 'United States Penitentiary, Marion', 'United States', 'No', 'Garrett Brock TrapnellMartin Joseph McNallyJames Kenneth Johnson']]
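The same trimming can also be written without a manual counter. This is only a minimal sketch, assuming each row is a list whose sixth element is the description; the two rows below are shortened stand-ins, not the full data:

```python
# Shortened stand-in rows; the last element plays the role of the description.
sample = [
    ["August 19, 1971", "Santa Martha Acatitla", "Mexico", "Yes",
     "Joel David Kaplan Carlos Antonio Contreras Castro", "long description ..."],
    ["October 31, 1973", "Mountjoy Jail", "Ireland", "Yes",
     "JB O'Hagan Seamus TwomeyKevin Mallon", "long description ..."],
]

# Keep only the first five columns of every row.
trimmed = [row[:5] for row in sample]

print(trimmed[0])
```

A list comprehension builds a new list instead of overwriting the rows in place, so the original rows stay available if they are needed later.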

We reformat the date to only select the year, then define the range of years:

In [10]:

for row in data: 
    row[0] = fetch_year(row[0])

In [11]:

print(data[:3])

[[1971, 'Santa Martha Acatitla', 'Mexico', 'Yes', 'Joel David Kaplan Carlos Antonio Contreras Castro'], [1973, 'Mountjoy Jail', 'Ireland', 'Yes', "JB O'Hagan Seamus TwomeyKevin Mallon"], [1978, 'United States Penitentiary, Marion', 'United States', 'No', 'Garrett Brock TrapnellMartin Joseph McNallyJames Kenneth Johnson']]
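The fetch_year function comes from the helper module, whose code is not shown here. As a rough stand-in, here is one way such a function could work; the name year_from_date is hypothetical and chosen to make clear that this is not the helper itself:

```python
import re

def year_from_date(date_string):
    """Return the first four-digit number in the string as an int.

    Hypothetical stand-in for the helper's fetch_year, not its actual code.
    """
    match = re.search(r"\d{4}", date_string)
    if match is None:
        raise ValueError(f"no year found in {date_string!r}")
    return int(match.group())

print(year_from_date("August 19, 1971"))  # 1971
```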

In [12]:

min_year = min(data, key=lambda x: x[0])[0]
max_year = max(data, key=lambda x: x[0])[0]

We now create a list whose elements look like [year, 0]:

In [13]:

years = []
for y in range(min_year, max_year + 1):
    years.append(y)

attempts_per_year = []
for item in years:
    attempts_per_year.append([item, 0])

print(attempts_per_year)

[[1971, 0], [1972, 0], [1973, 0], [1974, 0], [1975, 0], [1976, 0], [1977, 0], [1978, 0], [1979, 0], [1980, 0], [1981, 0], [1982, 0], [1983, 0], [1984, 0], [1985, 0], [1986, 0], [1987, 0], [1988, 0], [1989, 0], [1990, 0], [1991, 0], [1992, 0], [1993, 0], [1994, 0], [1995, 0], [1996, 0], [1997, 0], [1998, 0], [1999, 0], [2000, 0], [2001, 0], [2002, 0], [2003, 0], [2004, 0], [2005, 0], [2006, 0], [2007, 0], [2008, 0], [2009, 0], [2010, 0], [2011, 0], [2012, 0], [2013, 0], [2014, 0], [2015, 0], [2016, 0], [2017, 0], [2018, 0], [2019, 0], [2020, 0]]
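For comparison, the year range and the [year, 0] list can be built more compactly. This is a sketch on three stand-in rows, assuming the first element of each row is already an integer year:

```python
# Stand-in rows whose first element is already an integer year.
sample = [[1971, "Mexico"], [1973, "Ireland"], [1978, "United States"]]

# Smallest and largest year, read directly from the first column.
min_year = min(row[0] for row in sample)
max_year = max(row[0] for row in sample)

# One [year, 0] pair per year in the inclusive range.
attempts_per_year = [[year, 0] for year in range(min_year, max_year + 1)]

print(attempts_per_year)
```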

Now we use a nested loop: for each row in data, we go through every [year, count] pair in attempts_per_year and, when the row's year matches the pair's year, we increase that pair's count by 1:

In [14]:

for row in data: 
    for ya in attempts_per_year:  # ya is a [year, attempts] pair
        y = ya[0]
        if row[0] == y:
            ya[1] += 1

print(attempts_per_year)
 

[[1971, 1], [1972, 0], [1973, 1], [1974, 0], [1975, 0], [1976, 0], [1977, 0], [1978, 1], [1979, 0], [1980, 0], [1981, 2], [1982, 0], [1983, 1], [1984, 0], [1985, 2], [1986, 3], [1987, 1], [1988, 1], [1989, 2], [1990, 1], [1991, 1], [1992, 2], [1993, 1], [1994, 0], [1995, 0], [1996, 1], [1997, 1], [1998, 0], [1999, 1], [2000, 2], [2001, 3], [2002, 2], [2003, 1], [2004, 0], [2005, 2], [2006, 1], [2007, 3], [2008, 0], [2009, 3], [2010, 1], [2011, 0], [2012, 1], [2013, 2], [2014, 1], [2015, 0], [2016, 1], [2017, 0], [2018, 1], [2019, 0], [2020, 1]]
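The nested loop compares every row with every year. An alternative that tallies the years in a single pass is collections.Counter from the standard library. The rows below are made-up stand-ins, and the year range is hard-coded for the example:

```python
from collections import Counter

# Made-up stand-in rows; only the year in the first column matters here.
sample = [[1981, "x"], [1981, "x"], [1983, "x"], [1985, "x"]]

# Tally how many rows fall in each year.
counts = Counter(row[0] for row in sample)

# A Counter returns 0 for missing keys, so years without attempts are filled in.
attempts_per_year = [[year, counts[year]] for year in range(1981, 1986)]

print(attempts_per_year)
```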

In [15]:

%matplotlib inline
barplot(attempts_per_year)
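To cross-check what the bar plot shows, the peak years can also be computed directly. The sketch below copies the counts from the printed attempts_per_year list:

```python
# Counts copied from the printed attempts_per_year list.
attempts_per_year = [
    [1971, 1], [1972, 0], [1973, 1], [1974, 0], [1975, 0], [1976, 0], [1977, 0],
    [1978, 1], [1979, 0], [1980, 0], [1981, 2], [1982, 0], [1983, 1], [1984, 0],
    [1985, 2], [1986, 3], [1987, 1], [1988, 1], [1989, 2], [1990, 1], [1991, 1],
    [1992, 2], [1993, 1], [1994, 0], [1995, 0], [1996, 1], [1997, 1], [1998, 0],
    [1999, 1], [2000, 2], [2001, 3], [2002, 2], [2003, 1], [2004, 0], [2005, 2],
    [2006, 1], [2007, 3], [2008, 0], [2009, 3], [2010, 1], [2011, 0], [2012, 1],
    [2013, 2], [2014, 1], [2015, 0], [2016, 1], [2017, 0], [2018, 1], [2019, 0],
    [2020, 1],
]

# Highest count, then every year that reaches it (ties are kept).
most = max(count for _, count in attempts_per_year)
peak_years = [year for year, count in attempts_per_year if count == most]

print(most, peak_years)
```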

The most helicopter prison break attempts occurred in 1986, 2001, 2007, and 2009, with three attempts each! Now let's answer the second question by counting the attempts per country and displaying them with a helper function.

In [16]:

countries_frequency = df["Country"].value_counts()

In [17]:

print_pretty_table(countries_frequency)

Country	Number of Occurrences
France	15
United States	8
Greece	4
Belgium	4
Canada	4
Brazil	2
United Kingdom	2
Australia	2
Italy	1
Puerto Rico	1
Chile	1
Mexico	1
Netherlands	1
Ireland	1
Russia	1

France has the most attempts (15), followed by the United States (8). Next, I will try to answer the following three unguided questions:

* In which countries do helicopter prison breaks have a higher chance of success? 
* How does the number of escapees affect the success?
* Which escapees have done it more than once?

First of all, I create a list of [country, 0] pairs like we did with the years. The only difficulty is that each country must appear only once, so I check that it is not already in the list before appending it.

In [38]:

countries = []
for row in data:
    country = row[2]
    if country not in countries:
        countries.append(country)

attempts_per_country = []
for item in countries:
    attempts_per_country.append([item, 0])

print(attempts_per_country)

[['Mexico', 0], ['Ireland', 0], ['United States', 0], ['France', 0], ['Canada', 0], ['Australia', 0], ['Brazil', 0], ['Italy', 0], ['United Kingdom', 0], ['Puerto Rico', 0], ['Chile', 0], ['Netherlands', 0], ['Greece', 0], ['Belgium', 0], ['Russia', 0]]
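A shorter way to collect each country once while keeping the order of first appearance is dict.fromkeys, since dictionaries preserve insertion order in Python 3.7 and later. This sketch uses stand-in rows with placeholder prison names:

```python
# Stand-in rows; the country sits in the third column, as in the data above.
sample = [
    [1971, "prison A", "Mexico"],
    [1973, "prison B", "Ireland"],
    [1978, "prison C", "United States"],
    [1981, "prison D", "France"],
    [1986, "prison E", "France"],
]

# Dictionary keys are unique and keep insertion order, so duplicates drop out.
countries = list(dict.fromkeys(row[2] for row in sample))

attempts_per_country = [[country, 0] for country in countries]
print(attempts_per_country)
```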

Now I iterate over data and, for each row, increase the number of succeeded attempts for its country by one if the "Succeeded" column is "Yes". (Despite its name, attempts_per_country counts successes at this point.)

In [39]:

for row in data: 
    for sa in attempts_per_country:
        if row[2] == sa[0]:
            if row[3] == "Yes":
                sa[1] += 1

print(attempts_per_country)

[['Mexico', 1], ['Ireland', 1], ['United States', 6], ['France', 11], ['Canada', 3], ['Australia', 1], ['Brazil', 2], ['Italy', 1], ['United Kingdom', 1], ['Puerto Rico', 1], ['Chile', 1], ['Netherlands', 0], ['Greece', 2], ['Belgium', 2], ['Russia', 1]]

Since I want to know the chance of success, I have to divide the number of succeeded attempts by the total number of attempts for each country. (Multiplying by 100 would give a percentage, but the decimal values can be compared just as well, so I skip that step.)

In order to get the total number of attempts for each country, I use the two objects "countries" and "occurrences" created in the helper file. However, I need them as lists, so I use "tolist()" to change their type, then iterate over them and use "append" to create a list in which each element has the structure [country, number of attempts].

In [62]:

countries = df.Country.value_counts().index
list_countries = countries.tolist()
print(list_countries)

occurrences = df.Country.value_counts()
list_occurrences = occurrences.tolist()
print(list_occurrences)

country_occurrences = []
index = 0
for element in list_occurrences: 
    country_occurrences.append([list_countries[index],element])
    index += 1
print(country_occurrences)

['France', 'United States', 'Greece', 'Belgium', 'Canada', 'Brazil', 'United Kingdom', 'Australia', 'Italy', 'Puerto Rico', 'Chile', 'Mexico', 'Netherlands', 'Ireland', 'Russia']
[15, 8, 4, 4, 4, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1]
[['France', 15], ['United States', 8], ['Greece', 4], ['Belgium', 4], ['Canada', 4], ['Brazil', 2], ['United Kingdom', 2], ['Australia', 2], ['Italy', 1], ['Puerto Rico', 1], ['Chile', 1], ['Mexico', 1], ['Netherlands', 1], ['Ireland', 1], ['Russia', 1]]
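Pairing two lists element by element is what the built-in zip is for, so the manual index is not strictly needed. This is a sketch on the first three entries of the two printed lists:

```python
# First three entries of the two lists printed above.
list_countries = ["France", "United States", "Greece"]
list_occurrences = [15, 8, 4]

# zip walks both lists in step and yields one (country, count) pair at a time.
country_occurrences = [[country, count]
                       for country, count in zip(list_countries, list_occurrences)]

print(country_occurrences)
```

With pandas, iterating over `occurrences.items()` should yield the same (country, count) pairs directly, since the countries are the index of that Series.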

Now I iterate over both lists and, when the names of the countries match, I divide the number of succeeded attempts (taken from one list) by the number of total attempts for that country (taken from the other list):

In [63]:

for item in attempts_per_country: 
    for row in country_occurrences:
        if item[0] == row[0]:
            item[1] /= row[1]

print(attempts_per_country)

[['Mexico', 1.0], ['Ireland', 1.0], ['United States', 0.75], ['France', 0.7333333333333333], ['Canada', 0.75], ['Australia', 0.5], ['Brazil', 1.0], ['Italy', 1.0], ['United Kingdom', 0.5], ['Puerto Rico', 1.0], ['Chile', 1.0], ['Netherlands', 0.0], ['Greece', 0.5], ['Belgium', 0.5], ['Russia', 1.0]]
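One thing to watch in the cell above: it overwrites the success counts in place, so running it a second time would divide the rates again. A sketch that keeps the counts and the rates in separate dictionaries avoids this; the numbers are copied from the outputs above for three of the countries:

```python
# Successes and totals copied from the outputs above (three countries only).
successes = {"France": 11, "United States": 6, "Netherlands": 0}
totals = {"France": 15, "United States": 8, "Netherlands": 1}

# The rates go into a new dictionary, so the counts stay untouched on re-runs.
success_rate = {country: successes[country] / totals[country] for country in totals}

for country, rate in success_rate.items():
    print(f"{country}: {rate:.2f}")
```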

In [64]:

%matplotlib inline
barplot(attempts_per_country)

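So the highest chance of success (100%) is found in Mexico, Ireland, Brazil, Italy, Chile, Puerto Rico, and Russia. Note, however, that each of these countries has only one or two recorded attempts, so these rates rest on very little data.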

Now let's move on to the second question. The most difficult part is determining the number of escapees, since the original data has only a single string that puts all of them together, often with no separator. My plan is to count the number of words in each string and divide it by two, assuming the format "Name Surname", and then manually check the cases in which the result does not add up, since there is no universal naming convention.

First of all, I only select the columns "Escapees" and "Succeeded" from the general data:

In [114]:

escapees = []
for row in data: 
    escapees.append([row[3], row[4]])

print(escapees)

[['Yes', 'Joel David Kaplan Carlos Antonio Contreras Castro'], ['Yes', "JB O'Hagan Seamus TwomeyKevin Mallon"], ['No', 'Garrett Brock TrapnellMartin Joseph McNallyJames Kenneth Johnson'], ['Yes', 'Gérard DupréDaniel Beaumont'], ['No', 'Marina Paquet (hijacker)Giles Arseneault (prisoner)'], ['No', 'David McMillan'], ['Yes', 'James Rodney LeonardWilliam Douglas BallewJesse Glenn Smith'], ['Yes', 'José Carlos dos Reis Encina, a.k.a. "Escadinha"'], ['Yes', 'Michel Vaujour'], ['Yes', 'Samantha Lopez'], ['Yes', 'André BellaïcheGianluigi EspositoLuciano Cipollari'], ['Yes', 'Sydney DraperJohn Kendall'], ['Yes', 'Mahoney Danny Francis MitchellRandy Lackey'], ['No', 'Ben Kramer'], ['Yes', 'Ralph BrownFreddie Gonzales'], ['Yes', 'Robert FordDavid Thomas'], ['Yes', 'William Lane'], ['Yes', '—'], ['No', '—'], ['No', 'Michel Vaujour'], ['Yes', 'Four members of the Manuel Rodriguez Patriotic Front'], ['No', '—'], ['Yes', 'John Killick'], ['Yes', 'Steven Whitsett'], ['Yes', '—'], ['Yes', 'Pascal Payet'], ['Yes', 'Abdelhamid CarnousEmile Forma-SariJean-Philippe Lecase'], ['No', '—'], ['Yes', '—'], ['Yes', 'Orlando Cartagena Jose Rodriguez Victor Diaz Hector Diaz Jose Tapia'], ['Yes', 'Eric AlboreoFranck PerlettoMichel Valero'], ['No', '—'], ['Yes', 'Hubert SellesJean-Claude MorettiMohamed Bessame'], ['Yes', 'Vassilis Paleokostas'], ['Yes', 'Eric Ferdinand'], ['Yes', 'Pascal Payet'], ['No', 'Nordin Benallal'], ['Yes', 'Vasilis PaleokostasAlket Rizai'], ['Yes', 'Alexin JismyFabrice Michel'], ['Yes', 'Ashraf Sekkaki plus three other criminals'], ['No', 'Brian Lawrence'], ['Yes', 'Alexey Shestakov'], ['No', 'Panagiotis Vlastos'], ['Yes', 'Benjamin Hudon-BarbeauDanny Provençal'], ['Yes', 'Yves DenisDenis LefebvreSerge Pomerleau'], ['No', 'Pola RoupaNikos Maziotis'], ['Yes', 'Rédoine Faïd'], ['No', 'Kristel A.']]

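Now, for each string of names, I split it into words, divide the number of words by two, and round the result to the nearest integer. (In the original data the space between two names is often missing, so the number of words is often lower than it should be. Note also that Python's round() rounds halves to the nearest even integer, which is why a lone '—' gives 0.)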

In [115]:

for row in escapees: 
    row[1] = round(len(row[1].split())/2)
    
print(escapees)

[['Yes', 4], ['Yes', 2], ['No', 4], ['Yes', 2], ['No', 2], ['No', 1], ['Yes', 4], ['Yes', 4], ['Yes', 1], ['Yes', 1], ['Yes', 2], ['Yes', 2], ['Yes', 2], ['No', 1], ['Yes', 2], ['Yes', 2], ['Yes', 1], ['Yes', 0], ['No', 0], ['No', 1], ['Yes', 4], ['No', 0], ['Yes', 1], ['Yes', 1], ['Yes', 0], ['Yes', 1], ['Yes', 2], ['No', 0], ['Yes', 0], ['Yes', 5], ['Yes', 2], ['No', 0], ['Yes', 2], ['Yes', 1], ['Yes', 1], ['Yes', 1], ['No', 1], ['Yes', 2], ['Yes', 2], ['Yes', 3], ['No', 1], ['Yes', 1], ['No', 1], ['Yes', 2], ['Yes', 2], ['No', 2], ['Yes', 1], ['No', 1]]
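The zeros in this output come from the way round() treats values that end in .5: it rounds to the nearest even integer rather than always up. A small sketch comparing it with math.ceil, which always rounds up, for a few odd word counts:

```python
import math

# (word count, round(words / 2), ceil(words / 2)) for a few odd word counts.
results = [(words, round(words / 2), math.ceil(words / 2)) for words in (1, 3, 5, 7)]

for words, rounded, ceiled in results:
    print(f"{words} words -> round: {rounded}, ceil: {ceiled}")
```

Neither function recovers the true number of escapees on its own, so the manual check that follows is still needed either way.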

Many names still need to be adjusted because they do not reflect the "Name Surname" format in the original data, so I need to do it manually.

In [118]:

escapees[0][1] = 2
escapees[1][1] = 3
escapees[2][1] = 3
escapees[6][1] = 3
escapees[7][1] = 2
escapees[9][1] = 1
escapees[10][1] = 3
escapees[12][1] = 3
escapees[17][1] = 2
escapees[18][1] = 1
escapees[21][1] = 1
escapees[24][1] = 3
escapees[26][1] = 3
escapees[27][1] = 2
escapees[28][1] = 2
escapees[30][1] = 3
escapees[31][1] = 1
escapees[32][1] = 3
escapees[36][1] = 1
escapees[39][1] = 4
escapees[44][1] = 3

print(escapees)

[['Yes', 2], ['Yes', 3], ['No', 3], ['Yes', 2], ['No', 2], ['No', 1], ['Yes', 3], ['Yes', 2], ['Yes', 1], ['Yes', 1], ['Yes', 3], ['Yes', 2], ['Yes', 3], ['No', 1], ['Yes', 2], ['Yes', 2], ['Yes', 1], ['Yes', 2], ['No', 1], ['No', 1], ['Yes', 4], ['No', 1], ['Yes', 1], ['Yes', 1], ['Yes', 3], ['Yes', 1], ['Yes', 3], ['No', 2], ['Yes', 2], ['Yes', 5], ['Yes', 3], ['No', 1], ['Yes', 3], ['Yes', 1], ['Yes', 1], ['Yes', 1], ['No', 1], ['Yes', 2], ['Yes', 2], ['Yes', 4], ['No', 1], ['Yes', 1], ['No', 1], ['Yes', 2], ['Yes', 3], ['No', 2], ['Yes', 1], ['No', 1]]

Now I want to analyze separately the attempts with 1, 2, 3, or more than 3 escapees, and plot how many of them succeeded and how many did not. This lets me compare the success rate in each case while also seeing how many attempts involve a given number of escapees. Alternatively, I could split my list into two lists, one with all the attempts that succeeded and one with all those that did not, and then plot each by number of escapees. However, that would hide the total number of attempts per group (and I can see that there are many more attempts with one or two people than with three or more).

In [142]:

list_one = []
list_two = []
list_three = []
list_more_than_three = []

for row in escapees: 
    if row[1] == 1:
        list_one.append(row[0])
    elif row[1] == 2:
        list_two.append(row[0])
    elif row[1] == 3:
        list_three.append(row[0])
    else:
        list_more_than_three.append(row[0])

print(list_one)
print(list_two)
print(list_three)
print(list_more_than_three)

['No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No']
['Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
['Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes']
['Yes', 'Yes', 'Yes']
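The printed lists can also be summarised as numbers, which makes the groups easier to compare than counting by eye. This sketch copies the four lists from the output above:

```python
# Outcome lists copied from the printed output above.
outcomes_by_size = {
    "1": ['No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes',
          'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No'],
    "2": ['Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
          'Yes', 'Yes', 'Yes', 'No'],
    "3": ['Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes'],
    "more than 3": ['Yes', 'Yes', 'Yes'],
}

# Share of successful attempts within each group.
success_share = {
    size: outcomes.count("Yes") / len(outcomes)
    for size, outcomes in outcomes_by_size.items()
}

for size, share in success_share.items():
    total = len(outcomes_by_size[size])
    print(f"{size} escapee(s): {share:.0%} of {total} attempts succeeded")
```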

In [143]:

plt.hist(list_one, bins = 2)
plt.show()

plt.hist(list_two, bins = 2)
plt.show()

plt.hist(list_three, bins = 2)
plt.show()

plt.hist(list_more_than_three, bins = 2)
plt.show()

In conclusion, there are far more attempts with one or two escapees (35) than with three or more (13). However, the more people are involved, the more likely the attempt is to succeed (surprisingly): 11 of 21 attempts succeeded with one escapee, 11 of 14 with two, 9 of 10 with three, and 3 of 3 with more than three!

Now let's move on to the last question: which escapees have done it more than once? In order to answer this, I create a list with the names of the escapees, then iterate over it. I add every name I encounter to a list called "list_names", and if the name is already in that list, I also add it to a different list, called "repeat", which will only include names of the escapees who did it more than once.

In [156]:

names = []
for row in data:
    names.append(row[4])
    #names.append(row[4].split())

print(names)

['Joel David Kaplan Carlos Antonio Contreras Castro', "JB O'Hagan Seamus TwomeyKevin Mallon", 'Garrett Brock TrapnellMartin Joseph McNallyJames Kenneth Johnson', 'Gérard DupréDaniel Beaumont', 'Marina Paquet (hijacker)Giles Arseneault (prisoner)', 'David McMillan', 'James Rodney LeonardWilliam Douglas BallewJesse Glenn Smith', 'José Carlos dos Reis Encina, a.k.a. "Escadinha"', 'Michel Vaujour', 'Samantha Lopez', 'André BellaïcheGianluigi EspositoLuciano Cipollari', 'Sydney DraperJohn Kendall', 'Mahoney Danny Francis MitchellRandy Lackey', 'Ben Kramer', 'Ralph BrownFreddie Gonzales', 'Robert FordDavid Thomas', 'William Lane', '—', '—', 'Michel Vaujour', 'Four members of the Manuel Rodriguez Patriotic Front', '—', 'John Killick', 'Steven Whitsett', '—', 'Pascal Payet', 'Abdelhamid CarnousEmile Forma-SariJean-Philippe Lecase', '—', '—', 'Orlando Cartagena Jose Rodriguez Victor Diaz Hector Diaz Jose Tapia', 'Eric AlboreoFranck PerlettoMichel Valero', '—', 'Hubert SellesJean-Claude MorettiMohamed Bessame', 'Vassilis Paleokostas', 'Eric Ferdinand', 'Pascal Payet', 'Nordin Benallal', 'Vasilis PaleokostasAlket Rizai', 'Alexin JismyFabrice Michel', 'Ashraf Sekkaki plus three other criminals', 'Brian Lawrence', 'Alexey Shestakov', 'Panagiotis Vlastos', 'Benjamin Hudon-BarbeauDanny Provençal', 'Yves DenisDenis LefebvreSerge Pomerleau', 'Pola RoupaNikos Maziotis', 'Rédoine Faïd', 'Kristel A.']

In [158]:

list_names = []
repeat = []
for item in names: 
    if item in list_names:
        repeat.append(item)
    list_names.append(item)

print(repeat)

['—', 'Michel Vaujour', '—', '—', '—', '—', '—', 'Pascal Payet']
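The '—' entries show up here because they are repeated strings too. One way to leave them out, and to list each repeated name only once, is to count the strings with collections.Counter and filter the result. This is a sketch on a handful of entries taken from the names list:

```python
from collections import Counter

# A few entries taken from the names list above.
names = ["Michel Vaujour", "—", "—", "Michel Vaujour", "John Killick",
         "Pascal Payet", "—", "Pascal Payet"]

# Count every string, then keep those seen more than once, skipping the placeholder.
counts = Counter(names)
repeat = [name for name, n in counts.items() if n > 1 and name != "—"]

print(repeat)
```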

Note that I compared the entire "Escapees" string. This misses the cases in which an escapee appears again with different accomplices. For example, Vasilis Paleokostas appears once alone (spelled "Vassilis") and once fused with Alket Rizai, so he is not detected. To include those cases, I could split the string into single words and check each one (and then find a way to make sure a repeated name does not belong to a namesake, e.g., if I find two "James", I would have to check the last name too). However, since in the original data one person's last name often runs straight into the next person's first name with no space or comma, even checking single words would not suffice unless I manually separated the names in the original data. This is a limitation I could not get around, and I wanted to address it explicitly.

With that caveat, my answer is: the escapees who did it more than once are Michel Vaujour and Pascal Payet! (The '—' entries in "repeat" are placeholders where no name is listed, not repeat escapees.)
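The limitation discussed above could be partly worked around with a heuristic: treat a lowercase letter immediately followed by an uppercase letter as the seam between two people. The function split_glued_names below is a hypothetical sketch, not part of the helper module, and the "Mc" exception is an assumption added so that surnames like "McMillan" stay whole:

```python
from collections import Counter

def split_glued_names(text):
    """Split a string at lowercase-to-uppercase seams such as 'DraperJohn'."""
    pieces = []
    start = 0
    for i in range(1, len(text)):
        glued = text[i - 1].islower() and text[i].isupper()
        # Assumed exception: a capital right after "Mc" belongs to the same surname.
        after_mc = text[max(0, i - 2):i] == "Mc"
        if glued and not after_mc:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces

# A few entries taken from the names list above.
rows = ["Michel Vaujour", "Sydney DraperJohn Kendall", "Michel Vaujour",
        "Vassilis Paleokostas", "Pascal Payet", "Vasilis PaleokostasAlket Rizai",
        "Pascal Payet", "David McMillan"]

# Count individual people instead of whole strings.
people = Counter(name for row in rows for name in split_glued_names(row))
repeated = [name for name, n in people.items() if n > 1]

print(repeated)
```

This still falls short in two ways: names that the source separates with a plain space (such as "Joel David Kaplan Carlos Antonio Contreras Castro") are not split, and spelling variants such as "Vassilis" and "Vasilis" are counted as different people.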
