We begin by importing some helper functions.
from helper import *
Now, let's get the data from the Wikipedia article [List of helicopter prison escapes](https://en.wikipedia.org/wiki/List_of_helicopter_prison_escapes).
url = 'https://en.wikipedia.org/wiki/List_of_helicopter_prison_escapes'
data = data_from_url(url)
Let's print the first row to look at the data structure and information provided.
print(data[0])
['August 19, 1971', 'Santa Martha Acatitla', 'Mexico', 'Yes', 'Joel David Kaplan Carlos Antonio Contreras Castro', "Joel David Kaplan was a New York businessman who had been arrested for murder in 1962 in Mexico City and was incarcerated at the Santa Martha Acatitla prison in the Iztapalapa borough of Mexico City. Joel's sister, Judy Kaplan, arranged the means to help Kaplan escape, and on August 19, 1971, a helicopter landed in the prison yard. The guards mistakenly thought this was an official visit. In two minutes, Kaplan and his cellmate Carlos Antonio Contreras, a Venezuelan counterfeiter, were able to board the craft and were piloted away, before any shots were fired.[9] Both men were flown to Texas and then different planes flew Kaplan to California and Contreras to Guatemala.[3] The Mexican government never initiated extradition proceedings against Kaplan.[9] The escape is told in a book, The 10-Second Jailbreak: The Helicopter Escape of Joel David Kaplan.[4] It also inspired the 1975 action movie Breakout, which starred Charles Bronson and Robert Duvall.[9]"]
The data is organized with date, prison name, country, success, name of escapee(s), and a detailed description.
Now, let's remove the last element of each row, the description, which we will not use in the analysis. Then we print the first three rows to check that it was removed.
index = 0
for row in data:
    # Keep every element except the trailing description.
    data[index] = row[:-1]
    index += 1
print(data[:3])
[['August 19, 1971', 'Santa Martha Acatitla', 'Mexico', 'Yes', 'Joel David Kaplan Carlos Antonio Contreras Castro'], ['October 31, 1973', 'Mountjoy Jail', 'Ireland', 'Yes', "JB O'Hagan Seamus TwomeyKevin Mallon"], ['May 24, 1978', 'United States Penitentiary, Marion', 'United States', 'No', 'Garrett Brock TrapnellMartin Joseph McNallyJames Kenneth Johnson']]
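The same trimming step can also be written as a list comprehension. The following is a minimal sketch on two sample rows shaped like the scraped data; the `sample` and `trimmed` names and the shortened description strings are placeholders, not part of the notebook.

```python
# Two sample rows shaped like the scraped data; the long description is shortened here.
sample = [
    ['August 19, 1971', 'Santa Martha Acatitla', 'Mexico', 'Yes',
     'Joel David Kaplan Carlos Antonio Contreras Castro', 'description ...'],
    ['October 31, 1973', 'Mountjoy Jail', 'Ireland', 'Yes',
     "JB O'Hagan Seamus TwomeyKevin Mallon", 'description ...'],
]

# row[:-1] keeps every element except the last one.
trimmed = [row[:-1] for row in sample]
print(trimmed[0])
```

Unlike the loop above, this builds a new list and leaves the original rows untouched, which can be handy when re-running cells.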
The first analysis we will do is looking at which year had the most attempts of prison escapes by helicopter.
Now, let's append the year as a separate item at the end of each row. I want to keep the full date intact for other analyses.
for row in data:
    row.append(fetch_year(row[0]))
print(data[:3])
[['August 19, 1971', 'Santa Martha Acatitla', 'Mexico', 'Yes', 'Joel David Kaplan Carlos Antonio Contreras Castro', 1971], ['October 31, 1973', 'Mountjoy Jail', 'Ireland', 'Yes', "JB O'Hagan Seamus TwomeyKevin Mallon", 1973], ['May 24, 1978', 'United States Penitentiary, Marion', 'United States', 'No', 'Garrett Brock TrapnellMartin Joseph McNallyJames Kenneth Johnson', 1978]]
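The `fetch_year` helper comes from the `helper` module, whose source is not shown here. As a rough idea of what such a function might do, here is a hypothetical stand-in (`fetch_year_sketch` is an invented name, and the regular-expression approach is an assumption, not the helper's actual implementation):

```python
import re

def fetch_year_sketch(date_string):
    """Return the first four-digit number in date_string as an int, or None."""
    # Assumption: the year is the only standalone four-digit number in the date.
    match = re.search(r"\b(\d{4})\b", date_string)
    return int(match.group(1)) if match else None

print(fetch_year_sketch('August 19, 1971'))
```

Returning `None` for strings without a year makes malformed dates easy to spot before they reach the counting step.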
From the data set, we will find the first and last years with an escape attempt and then create a list of every year in that range, including years with no attempts.
min_year = min(data, key=lambda x: x[5])[5]
max_year = max(data, key=lambda x: x[5])[5]
years = []
for y in range(min_year, max_year + 1):
    years.append(y)
print(years)
[1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
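For reference, the same list of consecutive years can be built directly from `range`. This is a small sketch on a few sample years taken from the rows printed earlier; `sample_years` is a placeholder name.

```python
# Years taken from the three rows printed earlier.
sample_years = [1971, 1973, 1978]

min_year = min(sample_years)
max_year = max(sample_years)

# range() excludes its stop value, so add 1 to include max_year.
years = list(range(min_year, max_year + 1))
print(years)
```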
Starting with the list of years, we will construct a list of [year, attempts] pairs, with every count initialized to zero.
attempts_per_year = []
for year in years:
    attempts_per_year.append([year, 0])
print(attempts_per_year)
[[1971, 0], [1972, 0], [1973, 0], [1974, 0], [1975, 0], [1976, 0], [1977, 0], [1978, 0], [1979, 0], [1980, 0], [1981, 0], [1982, 0], [1983, 0], [1984, 0], [1985, 0], [1986, 0], [1987, 0], [1988, 0], [1989, 0], [1990, 0], [1991, 0], [1992, 0], [1993, 0], [1994, 0], [1995, 0], [1996, 0], [1997, 0], [1998, 0], [1999, 0], [2000, 0], [2001, 0], [2002, 0], [2003, 0], [2004, 0], [2005, 0], [2006, 0], [2007, 0], [2008, 0], [2009, 0], [2010, 0], [2011, 0], [2012, 0], [2013, 0], [2014, 0], [2015, 0], [2016, 0], [2017, 0], [2018, 0], [2019, 0], [2020, 0]]
Next, we will loop over the data and fill in the attempts-per-year list created above.
for row in data:
    for ya in attempts_per_year:
        y = ya[0]
        # row[5] holds the year appended earlier.
        if row[5] == y:
            ya[1] += 1
print(attempts_per_year)
[[1971, 1], [1972, 0], [1973, 1], [1974, 0], [1975, 0], [1976, 0], [1977, 0], [1978, 1], [1979, 0], [1980, 0], [1981, 2], [1982, 0], [1983, 1], [1984, 0], [1985, 2], [1986, 3], [1987, 1], [1988, 1], [1989, 2], [1990, 1], [1991, 1], [1992, 2], [1993, 1], [1994, 0], [1995, 0], [1996, 1], [1997, 1], [1998, 0], [1999, 1], [2000, 2], [2001, 3], [2002, 2], [2003, 1], [2004, 0], [2005, 2], [2006, 1], [2007, 3], [2008, 0], [2009, 3], [2010, 1], [2011, 0], [2012, 1], [2013, 2], [2014, 1], [2015, 0], [2016, 1], [2017, 0], [2018, 1], [2019, 0], [2020, 1]]
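The nested loop above compares every row against every year. An alternative, sketched here with `collections.Counter` on a handful of sample years (the `sample_years` values are illustrative, not the full data set), tallies the years in one pass and then fills in the gaps:

```python
from collections import Counter

# Illustrative years; a repeated year stands for two attempts in that year.
sample_years = [1971, 1973, 1978, 1981, 1981]

# Counter returns 0 for years that never appear, so gap years need no special case.
counts = Counter(sample_years)
attempts_per_year = [
    [year, counts[year]]
    for year in range(min(sample_years), max(sample_years) + 1)
]
print(attempts_per_year)
```

The result has the same [year, attempts] shape as the list built by hand, so it could be passed to the same plotting helper.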
Finally, we will use matplotlib to create a bar graph of the attempts per year to answer our initial question: which year had the greatest number of attempts?
%matplotlib inline
barplot(attempts_per_year)
There were several years with 3 prison breakout attempts with a helicopter: 1986, 2001, 2007, and 2009.
Next, we will look at the frequency of attempts per country and create a table from that data.
countries_frequency = df["Country"].value_counts()
print_pretty_table(countries_frequency)
Country | Number of Occurrences |
---|---|
France | 15 |
United States | 8 |
Belgium | 4 |
Canada | 4 |
Greece | 4 |
Brazil | 2 |
Australia | 2 |
United Kingdom | 2 |
Netherlands | 1 |
Ireland | 1 |
Italy | 1 |
Chile | 1 |
Puerto Rico | 1 |
Mexico | 1 |
Russia | 1 |
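Based on the table, France has had the most helicopter breakout attempts.
Let's determine which month has the most breakout attempts. I would hypothesize that warmer months would have a greater number of breakout attempts, as the weather would be more favorable.
Just as we did with the yearly counts, we will create a list of attempts per month. Note that the months list below omits October, so attempts dated in October (for example, the October 31, 1973 escape from Mountjoy Jail) are not counted in the monthly results that follow.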
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'November', 'December']
attempts_per_month = []
for month in months:
    attempts_per_month.append([month, 0])
print(attempts_per_month)
[['January', 0], ['February', 0], ['March', 0], ['April', 0], ['May', 0], ['June', 0], ['July', 0], ['August', 0], ['September', 0], ['November', 0], ['December', 0]]
Next, we will iterate over the data and fill in the attempts-per-month list.
for row in data:
    # Dates look like 'August 19, 1971', so the month is the first word.
    date = row[0].split(" ")
    datamonth = date[0]
    for month in attempts_per_month:
        m = month[0]
        if datamonth == m:
            month[1] += 1
print(attempts_per_month)
[['January', 2], ['February', 5], ['March', 4], ['April', 5], ['May', 4], ['June', 6], ['July', 5], ['August', 2], ['September', 2], ['November', 2], ['December', 8]]
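One way to avoid a hand-typed month list is to take the names from the standard library. The sketch below uses the three dates printed earlier and assumes every date has the form 'Month day, year' in English, which may not hold for every row of the real data; `sample_dates` and `attempts_per_month_sketch` are placeholder names.

```python
import calendar
from datetime import datetime

# The three dates from the rows printed earlier.
sample_dates = ['August 19, 1971', 'October 31, 1973', 'May 24, 1978']

# calendar.month_name[1:] lists all twelve month names (index 0 is an empty string).
attempts_per_month_sketch = [[name, 0] for name in calendar.month_name[1:]]

for text in sample_dates:
    # Assumption: every date string matches the 'Month day, year' pattern.
    month_number = datetime.strptime(text, '%B %d, %Y').month
    attempts_per_month_sketch[month_number - 1][1] += 1

print(attempts_per_month_sketch)
```

Because the names come from `calendar`, all twelve months are present, and a date in an unexpected format raises a `ValueError` instead of being silently skipped.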
Let's use a barplot like before with the attempts per month array.
%matplotlib inline
barplot(attempts_per_month)
Surprisingly, December had the greatest number of attempts, contrary to my hypothesis. Now let's see if it also had the greatest number of successful attempts.
We will use the same logic as before: create a list, iterate over the data to fill it in, and then display a bar graph.
success_attempts_per_month = []
for month in months:
    success_attempts_per_month.append([month, 0])
print(success_attempts_per_month)
[['January', 0], ['February', 0], ['March', 0], ['April', 0], ['May', 0], ['June', 0], ['July', 0], ['August', 0], ['September', 0], ['November', 0], ['December', 0]]
for row in data:
    date = row[0].split(" ")
    datamonth = date[0]
    for month in success_attempts_per_month:
        m = month[0]
        # row[3] is the success flag ('Yes' or 'No').
        if datamonth == m and row[3] == 'Yes':
            month[1] += 1
print(success_attempts_per_month)
[['January', 2], ['February', 3], ['March', 4], ['April', 4], ['May', 1], ['June', 4], ['July', 4], ['August', 2], ['September', 0], ['November', 2], ['December', 7]]
%matplotlib inline
barplot(success_attempts_per_month)
December also has the most successful breakouts. Let's see if that holds true for the success rate.
As before, we will create a list, initialized to zero, for the rates by month.
success_rate_per_month = []
for month in months:
    success_rate_per_month.append([month, 0])
print(success_rate_per_month)
[['January', 0], ['February', 0], ['March', 0], ['April', 0], ['May', 0], ['June', 0], ['July', 0], ['August', 0], ['September', 0], ['November', 0], ['December', 0]]
Now, using the two lists created earlier, one for total attempts per month and one for successful attempts per month, we will fill in the list of rates by dividing successes by attempts.
for i in range(len(success_rate_per_month)):
    success_rate_per_month[i][1] = success_attempts_per_month[i][1] / attempts_per_month[i][1]
print(success_rate_per_month)
[['January', 1.0], ['February', 0.6], ['March', 1.0], ['April', 0.8], ['May', 0.25], ['June', 0.6666666666666666], ['July', 0.8], ['August', 1.0], ['September', 0.0], ['November', 1.0], ['December', 0.875]]
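The division above works because every listed month has at least one attempt. A slightly more defensive version is sketched below; `success_rate` is a hypothetical helper, and the counts are the January, May, and December values from the outputs above.

```python
def success_rate(successes, attempts):
    """Return successes / attempts, or None when there were no attempts."""
    return successes / attempts if attempts else None

# Counts copied from the attempts and successes printed above.
attempts = [['January', 2], ['May', 4], ['December', 8]]
successes = [['January', 2], ['May', 1], ['December', 7]]

# zip pairs the two lists by position, so both must list months in the same order.
rates = [
    [month, success_rate(won, total)]
    for (month, total), (_, won) in zip(attempts, successes)
]
print(rates)
```

Returning `None` rather than 0 keeps a month with no attempts from looking like a month where every attempt failed.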
%matplotlib inline
barplot(success_rate_per_month)
Looking at the success rate per month, December (0.875) is no longer the best month to attempt a helicopter escape. January, March, August, and November all have a perfect success rate of 1.0. These rates rest on small samples, though: only 2 attempts each in January, August, and November, and 4 in March.