Probability and an Introduction to Jupyter, Python and Pandas¶

Data Science School, Nyeri, Kenya¶

15th June 2015 Neil Lawrence¶

Welcome to the Data Science School in Nyeri, Kenya. In this school we will introduce the basic concepts of machine learning and data science. In particular we will look at tools and techniques that describe how to model. An integrated part of that is how we approach data with the computer. We are choosing to do that with the tool you see in front of you: the Jupyter Notebook.

The notebook provides us with a way of interacting with the data that allows us to give the computer instructions and explore the nature of a data set. It is different to normal coding, but it is related. In this course you will, through intensive practical sessions and labs, develop your understanding of the interaction between data and computers.

The first thing we are going to do is ask you to forget a bit about what you think about normal programming, or 'classical software engineering'. Classical software engineering demands a large amount of design and testing. In data analysis, testing remains very important, but the design is often evolving. The design evolves through a process known as exploratory data analysis. You will learn some of the techniques of exploratory data analysis in this course.

A particular difference between classical software engineering and data analysis is the way in which programs are run. Classically we spend a great deal of time working with a text editor, writing code. Compilations are done on a regular basis and aspects of the code are tested (perhaps with unit tests).

Data analysis is more like coding in a debugger. In a debugger (particularly a visual debugger) you interact with the data stored in the memory of the computer to try and understand what is happening in the computer. You need to understand exactly what your bug is: you often have a fixed idea of what the program is trying to do, and you are just struggling to find out why it isn't doing it.

Naturally, debugging is an important part of data analysis also, but in some sense it can be seen as its entire premise. You load a data set that you don't understand into a computer, and your entire objective is to understand the data. This is best done by interrogating the data to visualize it or summarize it, just like in a powerful visual debugger. However, for data science the requirements for visualization and summarization are far greater than in a regular program. When the data is well understood, the actual number of lines of your program may well be very few (particularly if you disregard commands that load in the data and commands which plot your results). If a powerful data science library is available, you may be able to summarize your code with just two or three lines, but the amount of intellectual energy that is expended on writing those three lines is far greater than in standard code.

In the first lecture we will think a little about 'how we got here' in terms of computer science. In the lecture itself, this will be done by taking a subjective perspective, that of my own 'data autobiography'.

Assumed Knowledge¶

Linear Algebra, Probability and Differential Calculus¶

We will be assuming that you have a good background in maths. In particular we will be making use of linear algebra (matrix operations including inverse, inner products, determinant etc), probability (sum rule of probability, product rule of probability), and the calculus of differentiation (and integration!). A new concept for the course is multivariate differentiation and integration. This combines linear algebra and differential calculus. These techniques are vital in understanding probability distributions over high dimensional spaces.

Choice of Language¶

In this course we will be using Python for our programming language. A prerequisite of attending this course is that you have learnt at least one programming language in the past. It is not our objective to teach you Python. At Level 4 and Masters we expect our students to be able to pick up a language as they go. If you have not experienced Python before it may be worth your while spending some time understanding the language. There are resources available for you to do this here that are based on the standard console. An introduction to the Jupyter notebook (formerly known as the IPython notebook) is available here.

Choice of Environment¶

We are working in the Jupyter notebook (formerly known as the IPython notebook). It provides an environment for interacting with data in a natural way which is reproducible. We will be learning how to make use of the notebook throughout the course. The notebook allows us to combine code with descriptions, interactive visualizations, plots etc. In fact it allows us to do many of the things we need for data science. Notebooks can also be easily shared through the internet for ease of communication of ideas. The box this text is written in is a markdown box. Below we have a code box.

In [3]:

print("This is the Jupyter notebook")
print("It provides a platform for:")
words = ['Open', 'Data', 'Science']
from random import shuffle
for i in range(3):
    shuffle(words)  # shuffle reorders the list in place
    print(' '.join(words))

This is the Jupyter notebook
It provides a platform for:
Data Open Science
Science Open Data
Science Open Data

Have a play with the code in the above box. Think about the following questions: what is the difference between CTRL-enter and SHIFT-enter in running the code? What does the command shuffle do? Can you find out by typing shuffle? in a code box? Once you've had a play with the code we can load in some data using the pandas library for data analysis.

Movie Body Count Example¶

There is a crisis in the movie industry, deaths are occurring on a massive scale. In every feature film the body count is tolling up. But what is the cause of all these deaths? Let's try and investigate.

For our first example of data science, we take inspiration from work by researchers at NJIT. The researchers were comparing the qualities of Python with R (my brief thoughts on the subject are available in a Google+ post here: https://plus.google.com/116220678599902155344/posts/5iKyqcrNN68). They put together a database of results from the "Internet Movie Database" and the Movie Body Count website which will allow us to do some preliminary investigation.

We will make use of data that has already been 'scraped' from the Movie Body Count website. The code and the data are available at a github repository. Git is a version control system and github is a website that hosts code that can be accessed through git. By sharing the code publicly through github, the authors are licensing the code publicly and allowing you to access and edit it. As well as accessing the code via github you can also download the zip file. But let's do that in Python.

In [5]:

import pods
import pandas as pd
data = pods.datasets.movie_body_count()
film_deaths = data['Y']

Once the data is downloaded we can unzip it into the same directory where we are running the lab class.

Once it is loaded in, the data can be summarized using the describe method in pandas.

In [2]:

film_deaths.describe()
film_deaths

Out[2]:

	Film	Year	Body_Count	MPAA_Rating	Genre	Director	Actors	Length_Minutes	IMDB_Rating
0	24 Hour Party People	2002	7	R	Biography\|Comedy\|Drama\|Music	Michael Winterbottom	Steve Coogan\|John Thomson\|Paul Popplewell\|Lenn...	117	7.4
1	3:10 to Yuma	2007	45	R	Adventure\|Crime\|Drama\|Western	James Mangold	Russell Crowe\|Christian Bale\|Logan Lerman\|Dall...	122	7.8
2	300	2006	0	R	Action\|Fantasy\|History\|War	Zack Snyder	Gerard Butler\|Lena Headey\|Dominic West\|David W...	117	7.8
3	8MM	1999	7	R	Crime\|Mystery\|Thriller	Joel Schumacher	Nicolas Cage\|Joaquin Phoenix\|James Gandolfini\|...	123	6.4
4	The Abominable Dr. Phibes	1971	10	PG-13	Fantasy\|Horror	Robert Fuest	Vincent Price\|Joseph Cotten\|Hugh Griffith\|Terr...	94	7.2
5	Above the Law	1988	18	NaN	Action\|Crime\|Drama\|Thriller	Andrew Davis	Steven Seagal\|Pam Grier\|Henry Silva\|Ron Dean\|D...	99	5.9
6	Action Jackson	1988	17	NaN	Action\|Comedy\|Crime\|Thriller	Craig R. Baxley	Carl Weathers\|Craig T. Nelson\|Vanity\|Sharon St...	96	5.0
7	The Adventures of Ford Fairlane	1990	7	NaN	Action\|Adventure\|Comedy\|Music	Renny Harlin	Andrew Dice Clay\|Wayne Newton\|Priscilla Presle...	104	6.2
8	?on Flux	2005	58	PG-13	Action\|Sci-Fi	Karyn Kusama	Charlize Theron\|Marton Csokas\|Jonny Lee Miller...	93	5.5
9	Akira	1988	119	R	Animation\|Action\|Adventure\|Drama\|Horror\|Myster...	Katsuhiro Ohtomo	Mitsuo Iwata\|Nozomu Sasaki\|Mami Koyama\|Tesshô ...	124	8.1
10	Ali G Indahouse	2002	11	R	Comedy	Mark Mylod	Sacha Baron Cohen\|Emilio Rivera\|Gina La Piana\|...	85	6.2
11	Alien	1979	9	R	Horror\|Sci-Fi	Ridley Scott	Tom Skerritt\|Sigourney Weaver\|Veronica Cartwri...	117	8.5
12	AVPR: Aliens vs Predator - Requiem	2007	115	R	Action\|Horror\|Sci-Fi\|Thriller	Colin Strause\|Greg Strause	Steven Pasquale\|Reiko Aylesworth\|John Ortiz\|Jo...	94	4.7
13	Alpha Dog	2006	3	R	Biography\|Crime\|Drama	Nick Cassavetes	Bruce Willis\|Matthew Barry\|Emile Hirsch\|Fernan...	122	6.9
14	Altered States	1980	5	NaN	Drama\|Fantasy\|Horror\|Sci-Fi\|Thriller	Ken Russell	William Hurt\|Blair Brown\|Bob Balaban\|Charles H...	102	6.9
15	American Gangster	2007	15	R	Biography\|Crime\|Drama	Ridley Scott	Denzel Washington\|Russell Crowe\|Chiwetel Ejiof...	157	7.8
16	American Ninja	1985	114	NaN	Action\|Adventure\|Romance\|Sport	Sam Firstenberg	Michael Dudikoff\|Steve James\|Judie Aronson\|Gui...	95	5.2
17	American Pop	1981	61	NaN	Animation\|Drama\|Music	Ralph Bakshi	Ron Thompson\|Mews Small\|Jerry Holland\|Lisa Jan...	96	7.1
18	American Psycho	2000	18	R	Crime\|Drama	Mary Harron	Christian Bale\|Justin Theroux\|Josh Lucas\|Bill ...	102	7.6
19	American Yakuza	1993	53	NaN	Action\|Crime\|Drama\|Thriller	Frank A. Cappello	Viggo Mortensen\|Ryo Ishibashi\|Michael Nouri\|Fr...	96	5.7
20	Another Day in Paradise	1998	9	R	Crime\|Drama\|Thriller	Larry Clark	James Woods\|Melanie Griffith\|Vincent Kartheise...	101	6.5
21	Apocalypse Now	1979	62	R	Drama\|War	Francis Ford Coppola	Marlon Brando\|Martin Sheen\|Robert Duvall\|Frede...	153	8.5
22	Apocalypto	2006	114	R	Action\|Adventure\|Drama\|Thriller	Mel Gibson	Rudy Youngblood\|Dalia Hernández\|Jonathan Brewe...	139	7.8
23	Appaloosa	2008	10	R	Crime\|Drama\|Western	Ed Harris	Robert Jauregui\|Jeremy Irons\|Timothy V. Murphy...	115	6.8
24	Armageddon	1998	16	PG-13	Action\|Adventure\|Sci-Fi\|Thriller	Michael Bay	Bruce Willis\|Billy Bob Thornton\|Ben Affleck\|Li...	151	6.6
25	Army of Darkness	1992	107	R	Comedy\|Fantasy\|Horror	Sam Raimi	Bruce Campbell\|Embeth Davidtz\|Marcus Gilbert\|I...	81	7.6
26	Assault on Precinct 13	1976	39	NaN	Action\|Crime\|Thriller	John Carpenter	Austin Stoker\|Darwin Joston\|Laurie Zimmer\|Mart...	91	7.4
27	Assault on Precinct 13	2005	21	R	Action\|Drama\|Crime\|Thriller	Jean-François Richet	Ethan Hawke\|Laurence Fishburne\|Gabriel Byrne\|M...	109	6.3
28	Atonement	2007	34	R	Drama\|Mystery\|Romance\|War	Joe Wright	Saoirse Ronan\|Ailidh Mackay\|Brenda Blethyn\|Jul...	123	7.8
29	Austin Powers in Goldmember	2002	6	PG-13	Action\|Comedy\|Crime	Jay Roach	Mike Myers\|Beyoncé Knowles\|Seth Green\|Michael ...	94	6.2
...	...	...	...	...	...	...	...	...	...
391	The Usual Suspects	1995	39	R	Crime\|Mystery\|Thriller	Bryan Singer	Stephen Baldwin\|Gabriel Byrne\|Benicio Del Toro...	106	8.7
392	V for Vendetta	2005	59	R	Action\|Sci-Fi\|Thriller	James McTeigue	Natalie Portman\|Hugo Weaving\|Stephen Rea\|Steph...	132	8.2
393	Valkyrie	2008	18	PG-13	Drama\|History\|Thriller\|War	Bryan Singer	Tom Cruise\|Kenneth Branagh\|Bill Nighy\|Tom Wilk...	121	7.1
394	Vampires: The Turning	2005	53	R	Action\|Horror\|Thriller	Marty Weiss	Colin Egglesfield\|Stephanie Chao\|Roger Yuan\|Pa...	84	3.3
395	Versus	2000	127	R	Action\|Fantasy\|Horror	Ryûhei Kitamura	Tak Sakaguchi\|Hideo Sakaki\|Chieko Misaka\|Kenji...	119	6.6
396	Videodrome	1983	8	NaN	Horror\|Sci-Fi	David Cronenberg	James Woods\|Sonja Smits\|Deborah Harry\|Peter Dv...	87	7.3
397	A View to a Kill	1985	37	NaN	Action\|Adventure\|Crime\|Thriller	John Glen	Roger Moore\|Christopher Walken\|Tanya Roberts\|G...	131	6.3
398	Waist Deep	2006	6	R	Action\|Crime\|Drama\|Thriller	Vondie Curtis-Hall	Tyrese Gibson\|Shawn Parr\|Henry Hunter Hall\|Joh...	97	5.9
399	Walk Hard: The Dewey Cox Story	2007	5	R	Comedy\|Drama\|Music	Jake Kasdan	Nat Faxon\|John C. Reilly\|Tim Meadows\|Conner Ra...	96	6.7
400	Walking Tall	2004	6	PG-13	Action\|Crime	Kevin Bray	Michael Bowen\|Johnny Knoxville\|Dwayne Johnson\|...	86	6.2
401	War	2007	97	R	Action\|Crime\|Thriller	Philip G. Atwell	Jet Li\|Jason Statham\|John Lone\|Devon Aoki\|Luis...	103	6.2
402	War Inc.	2008	73	R	Action\|Comedy\|Thriller	Joshua Seftel	John Cusack\|Hilary Duff\|Marisa Tomei\|Joan Cusa...	107	5.7
403	War of the Worlds	2005	52	PG-13	Adventure\|Sci-Fi\|Thriller	Steven Spielberg	Tom Cruise\|Dakota Fanning\|Miranda Otto\|Justin ...	116	6.5
404	Wasabi	2001	22	R	Action\|Drama\|Comedy\|Crime\|Thriller	Gérard Krawczyk	Jean Reno\|Ryôko Hirosue\|Michel Muller\|Carole B...	94	6.6
405	The Way of the Gun	2000	18	R	Action\|Crime\|Drama\|Thriller	Christopher McQuarrie	Ryan Phillippe\|Benicio Del Toro\|Juliette Lewis...	119	6.7
406	We Were Soldiers	2002	305	R	Action\|Drama\|History\|War	Randall Wallace	Mel Gibson\|Madeleine Stowe\|Greg Kinnear\|Sam El...	138	7.1
407	Where Eagles Dare	1968	100	NaN	Action\|Adventure\|War	Brian G. Hutton	Richard Burton\|Clint Eastwood\|Mary Ure\|Patrick...	158	7.7
408	The Wild Bunch	1969	145	NaN	Western	Sam Peckinpah	William Holden\|Ernest Borgnine\|Robert Ryan\|Edm...	145	8.1
409	X-Men	2000	6	PG-13	Action\|Adventure\|Sci-Fi	Bryan Singer	Hugh Jackman\|Patrick Stewart\|Ian McKellen\|Famk...	104	7.4
410	X2	2003	26	PG-13	Action\|Adventure\|Sci-Fi\|Thriller	Bryan Singer	Patrick Stewart\|Hugh Jackman\|Ian McKellen\|Hall...	133	7.5
411	X-Men: The Last Stand	2006	57	PG-13	Action\|Adventure\|Sci-Fi\|Thriller	Brett Ratner	Hugh Jackman\|Halle Berry\|Ian McKellen\|Patrick ...	104	6.8
412	xXx	2002	75	PG-13	Action\|Thriller	Rob L. Cohen	Vin Diesel\|Asia Argento\|Marton Csokas\|Samuel L...	124	5.8
413	xXx: State of the Union	2005	66	PG-13	Action\|Crime\|Adventure\|Thriller	Lee Tamahori	Willem Dafoe\|Samuel L. Jackson\|Ice Cube\|Scott ...	101	4.2
414	The Yakuza	1974	31	NaN	Action\|Crime\|Drama\|Thriller	Sydney Pollack	Robert Mitchum\|Ken Takakura\|Brian Keith\|Herb E...	123	7.3
415	The Yards	2000	2	R	Crime\|Drama\|Romance\|Thriller	James Gray	Mark Wahlberg\|Joaquin Phoenix\|Charlize Theron\|...	115	6.4
416	You Kill Me	2007	10	R	Comedy\|Crime\|Romance\|Thriller	John Dahl	Ben Kingsley\|Téa Leoni\|Luke Wilson\|Dennis Fari...	93	6.6
417	You Only Live Twice	1967	91	NaN	Action\|Adventure\|Crime\|Thriller	Lewis Gilbert	Sean Connery\|Akiko Wakabayashi\|Mie Hama\|Tetsur...	117	6.9
418	Zodiac	2007	3	R	Crime\|Drama\|Mystery\|Thriller	David Fincher	Jake Gyllenhaal\|Mark Ruffalo\|Anthony Edwards\|R...	157	7.7
419	Zoolander	2001	4	PG-13	Comedy	Ben Stiller	Ben Stiller\|Owen Wilson\|Christine Taylor\|Will ...	89	6.6
420	Zulu	1964	140	NaN	Drama\|History\|War	Cy Endfield	Stanley Baker\|Jack Hawkins\|Ulla Jacobsson\|Jame...	138	7.8

421 rows × 9 columns

In IPython and the Jupyter notebook it is possible to see a list of all available functions and attributes by typing the name of the object followed by a dot and then pressing TAB. For example, in the above case, if we type film_deaths. and press TAB, it shows the columns available (these are attributes in pandas data frames), such as Body_Count, and also functions, such as .describe().

For functions we can also see the documentation about the function by following the name with a question mark. This will open a box with documentation at the bottom which can be closed with the x button.

In [4]:

film_deaths.describe?

The film deaths data is stored in an object known as a 'data frame'. Data frames come from the statistical family of programming languages based on S, the most widely used of which is R. The data frame gives us a convenient object for manipulating data. The describe method summarizes which columns there are in the data frame and gives us counts, means, standard deviations and percentiles for the values in those columns. To access a column directly we can write

In [5]:

film_deaths['Year']
#print film_deaths['Body_Count']

Out[5]:

0     2002
1     2007
2     2006
3     1999
4     1971
5     1988
6     1988
7     1990
8     2005
9     1988
10    2002
11    1979
12    2007
13    2006
14    1980
...
406    2002
407    1968
408    1969
409    2000
410    2003
411    2006
412    2002
413    2005
414    1974
415    2000
416    2007
417    1967
418    2007
419    2001
420    1964
Name: Year, Length: 421, dtype: int64

This shows the year in which each film was released. The number of deaths per film is stored in the Body_Count column. We can plot the number of deaths against the year as follows.

In [6]:

# this ensures the plot appears in the web browser
%matplotlib inline 
import matplotlib.pyplot as plt # this imports the plotting library in python

plt.plot(film_deaths['Year'], film_deaths['Body_Count'], 'rx')

Out[6]:

[<matplotlib.lines.Line2D at 0x10df91e50>]

You may be curious what the arguments we give to plt.plot are for. Now is the perfect time to look at the documentation.

In [7]:

plt.plot?

We immediately note that some films have a lot of deaths, which prevents us from seeing the detail of the main body of films. First let's identify the films with the most deaths.

In [8]:

film_deaths[film_deaths['Body_Count']>200]

Out[8]:

	Film	Year	Body_Count	MPAA_Rating	Genre	Director	Actors	Length_Minutes	IMDB_Rating
60	Dip huet gaai tau	1990	214	NaN	Crime\|Drama\|Thriller	John Woo	Tony Leung Chiu Wai\|Jacky Cheung\|Waise Lee\|Sim...	136	7.7
117	Equilibrium	2002	236	R	Action\|Drama\|Sci-Fi\|Thriller	Kurt Wimmer	Christian Bale\|Dominic Purcell\|Sean Bean\|Chris...	107	7.6
154	Grindhouse	2007	310	R	Action\|Horror\|Thriller	Robert Rodriguez\|Eli Roth\|Quentin Tarantino\|Ed...	Kurt Russell\|Zoë Bell\|Rosario Dawson\|Vanessa F...	191	7.7
159	Lat sau san taam	1992	307	R	Action\|Crime\|Drama\|Thriller	John Woo	Yun-Fat Chow\|Tony Leung Chiu Wai\|Teresa Mo\|Phi...	128	8.0
193	Kingdom of Heaven	2005	610	R	Action\|Adventure\|Drama\|History\|War	Ridley Scott	Martin Hancock\|Michael Sheen\|Nathalie Cox\|Eriq...	144	7.2
206	The Last Samurai	2003	558	R	Action\|Drama\|History\|War	Edward Zwick	Ken Watanabe\|Tom Cruise\|William Atherton\|Chad ...	154	7.7
222	The Lord of the Rings: The Two Towers	2002	468	PG-13	Action\|Adventure\|Fantasy	Peter Jackson	Bruce Allpress\|Sean Astin\|John Bach\|Sala Baker...	179	8.8
223	The Lord of the Rings: The Return of the King	2003	836	PG-13	Action\|Adventure\|Fantasy	Peter Jackson	Noel Appleby\|Alexandra Astin\|Sean Astin\|David ...	201	8.9
291	Rambo	2008	247	R	Action\|Thriller\|War	Sylvester Stallone	Sylvester Stallone\|Julie Benz\|Matthew Marsden\|...	92	7.1
317	Saving Private Ryan	1998	255	R	Action\|Drama\|War	Steven Spielberg	Tom Hanks\|Tom Sizemore\|Edward Burns\|Barry Pepp...	169	8.6
349	Starship Troopers	1997	256	R	Action\|Sci-Fi	Paul Verhoeven	Casper Van Dien\|Dina Meyer\|Denise Richards\|Jak...	129	7.2
375	Titanic	1997	307	PG-13	Drama\|Romance	James Cameron	Leonardo DiCaprio\|Kate Winslet\|Billy Zane\|Kath...	194	7.7
382	Troy	2004	572	R	Adventure\|Drama	Wolfgang Petersen	Julian Glover\|Brian Cox\|Nathan Jones\|Adoni Mar...	163	7.2
406	We Were Soldiers	2002	305	R	Action\|Drama\|History\|War	Randall Wallace	Mel Gibson\|Madeleine Stowe\|Greg Kinnear\|Sam El...	138	7.1

Here we are using the command film_deaths['Body_Count']>200 to index the films in the pandas data frame which have over 200 deaths. The result of this command on its own is a data series of True and False values. However, when it is passed to the film_deaths data frame it returns a new data frame which contains only those rows for which the data series is True. We can also sort the result. To sort the result by the values in the Body_Count column in descending order we use the following command.

In [9]:

film_deaths[film_deaths['Body_Count']>200].sort_values(by='Body_Count', ascending=False)

Out[9]:

	Film	Year	Body_Count	MPAA_Rating	Genre	Director	Actors	Length_Minutes	IMDB_Rating
223	The Lord of the Rings: The Return of the King	2003	836	PG-13	Action\|Adventure\|Fantasy	Peter Jackson	Noel Appleby\|Alexandra Astin\|Sean Astin\|David ...	201	8.9
193	Kingdom of Heaven	2005	610	R	Action\|Adventure\|Drama\|History\|War	Ridley Scott	Martin Hancock\|Michael Sheen\|Nathalie Cox\|Eriq...	144	7.2
382	Troy	2004	572	R	Adventure\|Drama	Wolfgang Petersen	Julian Glover\|Brian Cox\|Nathan Jones\|Adoni Mar...	163	7.2
206	The Last Samurai	2003	558	R	Action\|Drama\|History\|War	Edward Zwick	Ken Watanabe\|Tom Cruise\|William Atherton\|Chad ...	154	7.7
222	The Lord of the Rings: The Two Towers	2002	468	PG-13	Action\|Adventure\|Fantasy	Peter Jackson	Bruce Allpress\|Sean Astin\|John Bach\|Sala Baker...	179	8.8
154	Grindhouse	2007	310	R	Action\|Horror\|Thriller	Robert Rodriguez\|Eli Roth\|Quentin Tarantino\|Ed...	Kurt Russell\|Zoë Bell\|Rosario Dawson\|Vanessa F...	191	7.7
159	Lat sau san taam	1992	307	R	Action\|Crime\|Drama\|Thriller	John Woo	Yun-Fat Chow\|Tony Leung Chiu Wai\|Teresa Mo\|Phi...	128	8.0
375	Titanic	1997	307	PG-13	Drama\|Romance	James Cameron	Leonardo DiCaprio\|Kate Winslet\|Billy Zane\|Kath...	194	7.7
406	We Were Soldiers	2002	305	R	Action\|Drama\|History\|War	Randall Wallace	Mel Gibson\|Madeleine Stowe\|Greg Kinnear\|Sam El...	138	7.1
349	Starship Troopers	1997	256	R	Action\|Sci-Fi	Paul Verhoeven	Casper Van Dien\|Dina Meyer\|Denise Richards\|Jak...	129	7.2
317	Saving Private Ryan	1998	255	R	Action\|Drama\|War	Steven Spielberg	Tom Hanks\|Tom Sizemore\|Edward Burns\|Barry Pepp...	169	8.6
291	Rambo	2008	247	R	Action\|Thriller\|War	Sylvester Stallone	Sylvester Stallone\|Julie Benz\|Matthew Marsden\|...	92	7.1
117	Equilibrium	2002	236	R	Action\|Drama\|Sci-Fi\|Thriller	Kurt Wimmer	Christian Bale\|Dominic Purcell\|Sean Bean\|Chris...	107	7.6
60	Dip huet gaai tau	1990	214	NaN	Crime\|Drama\|Thriller	John Woo	Tony Leung Chiu Wai\|Jacky Cheung\|Waise Lee\|Sim...	136	7.7
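
As an aside, the two steps described above (build a True/False series, then use it to select rows) can be seen more easily on a tiny data frame. The sketch below uses invented film names and counts, not values from the movie data set, and assumes a recent pandas release in which sorting is done with the sort_values method.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the movie data set).
toy = pd.DataFrame({
    'Film': ['A', 'B', 'C', 'D'],
    'Body_Count': [12, 250, 7, 410],
})

mask = toy['Body_Count'] > 200   # a series of True/False values, one per row
selected = toy[mask]             # keeps only the rows where mask is True
ranked = selected.sort_values(by='Body_Count', ascending=False)

print(mask.tolist())             # [False, True, False, True]
print(ranked['Film'].tolist())   # ['D', 'B']
```

Keeping the mask in its own variable is a design choice for readability: it can be inspected on its own before it is used to index the data frame.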

We now see that 'The Lord of the Rings: The Return of the King' is a large outlier with a very large number of kills. We can try and determine how much of an outlier by histogramming the data.

Plotting the Data¶

In [10]:

film_deaths['Body_Count'].hist(bins=20) # histogram the data with 20 bins.
plt.title('Histogram of Film Kill Count')

Out[10]:

<matplotlib.text.Text at 0x10e2622d0>

We could try and remove these outliers, but another approach would be to plot the logarithm of the counts against the year.

In [11]:

plt.plot(film_deaths['Year'], film_deaths['Body_Count'], 'rx')
ax = plt.gca() # obtain a handle to the current axis
ax.set_yscale('log') # use a logarithmic death scale
# give the plot some titles and labels
plt.title('Film Deaths against Year')
plt.ylabel('deaths')
plt.xlabel('year')

Out[11]:

<matplotlib.text.Text at 0x10e2bfa50>

Note a few things. We are interacting with our data. In particular, we are replotting the data according to what we have learned so far. We are using the programming language as a scripting language to give the computer one command or another, and then the next command we enter is dependent on the result of the previous. This is a very different paradigm to classical software engineering. In classical software engineering we normally write many lines of code (entire object classes or functions) before compiling the code and running it. Our approach is more similar to the approach we take whilst debugging. Historically, researchers interacted with data using a console: a command line window which allowed command entry. The notebook format we are using is slightly different. Each of the code entry boxes acts like a separate console window. We can move up and down the notebook and run each part in a different order. The state of the program is always as we left it after running the most recently executed part.

Probabilities¶

We are now going to do some simple review of probabilities and use this review to explore some aspects of our data.

A probability distribution expresses uncertainty about the outcome of an event. We often encode this uncertainty in a variable. So if we are considering the outcome of an event, $Y$, to be a coin toss, then we might consider $Y=1$ to be heads and $Y=0$ to be tails. We represent the probability of a given outcome with the notation: $$ P(Y=1) = 0.5 $$ The first rule of probability is that the probability must normalize. The sum of the probability of all events must equal 1. So if the probability of heads ($Y=1$) is 0.5, then the probability of tails (the only other possible outcome) is given by $$ P(Y=0) = 1-P(Y=1) = 0.5 $$

Probabilities are often defined as the limit of the ratio between the number of positive outcomes (e.g. heads) and the number of trials. If the number of positive outcomes for event $y$ is denoted by $n_y$ and the number of trials is denoted by $N$ then this gives the ratio $$ P(Y=y) = \lim_{N\rightarrow \infty}\frac{n_y}{N}. $$ In practice we never get to observe an event infinitely many times, so rather than considering this limit we often use the following estimate $$ P(Y=y) \approx \frac{n_y}{N}. $$ Let's use this rule to compute the approximate probability that a film from the movie body count website has over 40 deaths.

In [12]:

deaths = (film_deaths.Body_Count>40).sum()  # number of positive outcomes (in sum True counts as 1, False counts as 0)
total_films = film_deaths.Body_Count.count()

prob_death = float(deaths)/float(total_films)
print("Probability of deaths being greater than 40 is: {}".format(prob_death))

Probability of deaths being greater than 40 is: 0.377672209026
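
The limit definition given earlier in this section can be illustrated by simulation. The sketch below is not part of the movie analysis: it simulates tosses of a fair coin with Python's random module and computes the estimate $n_y/N$ for increasing $N$. The function name estimate_heads and the choice of trial counts are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so that the run is repeatable

def estimate_heads(num_trials):
    """Estimate P(Y=1) for a fair coin as n_y / N from simulated tosses."""
    heads = sum(random.random() < 0.5 for _ in range(num_trials))
    return heads / float(num_trials)

# The estimate should settle near 0.5 as the number of trials grows.
estimates = {n: estimate_heads(n) for n in [10, 1000, 100000]}
for n in sorted(estimates):
    print("N = {:6d}: estimate = {:.4f}".format(n, estimates[n]))
```

With only 10 tosses the estimate can be far from 0.5. With 100000 tosses it is typically within a fraction of a percent.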

Conditioning¶

When predicting whether a coin turns up heads or tails, we might think that this event is independent of the year or time of day. If we include an observation such as time, then in probability this is known as conditioning. We use this notation, $P(Y=y|T=t)$, to condition the outcome on a second variable (in this case time). Or, often, for a shorthand we use $P(y|t)$ to represent this distribution (the $Y=$ and $T=$ being implicit). Because we don't believe a coin toss depends on time, we might write that $$ P(y|t) = P(y). $$ However, we might believe that the number of deaths is dependent on the year. For this we can try estimating $P(Y>40 | T=2000)$ and compare the result, for example, to $P(Y>40|T=2002)$ using our empirical estimate of the probability.

In [13]:

for year in [2000, 2002]:
    deaths = (film_deaths.Body_Count[film_deaths.Year==year]>40).sum()
    total_films = (film_deaths.Year==year).sum()

    prob_death = float(deaths)/float(total_films)
    print("Probability of deaths being greater than 40 in year {} is: {}".format(year, prob_death))

Probability of deaths being greater than 40 in year 2000 is: 0.166666666667
Probability of deaths being greater than 40 in year 2002 is: 0.407407407407

Rules of Probability¶

We've now introduced conditioning and independence to the notion of probability and computed some conditional probabilities on a practical example. The scatter plot of deaths vs year that we created above can be seen as a joint probability distribution. We represent a joint probability using the notation $P(Y=y, T=t)$ or $P(y, t)$ for short. Computing a joint probability is equivalent to answering simultaneous questions such as: what's the probability that the number of deaths was over 40 and the year was 2002? Or any other question that may occur to us. Again we can easily use pandas to ask such questions.

In [14]:

year = 2000
deaths = (film_deaths.Body_Count[film_deaths.Year==year]>40).sum()
total_films = film_deaths.Body_Count.count() # this is total number of films
prob_death = float(deaths)/float(total_films)
print("Probability of deaths being greater than 40 and year being {} is: {}".format(year, prob_death))

Probability of deaths being greater than 40 and year being 2000 is: 0.00712589073634

The Product Rule¶

This number is the joint probability, $P(Y, T)$, which is much smaller than the conditional probability. The joint probability can never be bigger than the conditional probability because it is computed using the product rule, $$ P(Y=y, T=t) = P(Y=y|T=t)P(T=t), $$ and $P(T=t)$ is a probability, which is less than or equal to 1. This ensures that the joint probability is never larger than the corresponding conditional probability.

The product rule is a fundamental rule of probability, and you must remember it! It gives the relationship between the two questions: 1) What's the probability that a film was made in 2002 and has over 40 deaths? and 2) What's the probability that a film has over 40 deaths given that it was made in 2002?

In our shorter notation we can write the product rule as $$ P(y, t) = P(y|t)P(t). $$ We can see the relation working in practice for our data above by computing the different values for $t=2002$.

In [15]:

p_t = float((film_deaths.Year==2002).sum())/float(film_deaths.Body_Count.count())
p_y_given_t = float((film_deaths.Body_Count[film_deaths.Year==2002]>40).sum())/float((film_deaths.Year==2002).sum())
p_y_and_t = float((film_deaths.Body_Count[film_deaths.Year==2002]>40).sum())/float(film_deaths.Body_Count.count())

print("P(t) is {}".format(p_t))
print("P(y|t) is {}".format(p_y_given_t))
print("P(y,t) is {}".format(p_y_and_t))

P(t) is 0.0641330166271
P(y|t) is 0.407407407407
P(y,t) is 0.0261282660333
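
The product rule can be checked exactly on these numbers. The sketch below does not reload the data. Instead it uses counts inferred from the printed values above for the year 2002 (27 films out of 421 were made in 2002, and 11 of those have over 40 deaths), and works with exact fractions so that no rounding is involved. The variable names are chosen for illustration.

```python
from fractions import Fraction

# Counts inferred from the printed values above (not read from the data here).
total_films = 421      # rows in the data frame
films_in_2002 = 27     # 27/421 = 0.0641...
deadly_in_2002 = 11    # 11/27 = 0.4074..., films from 2002 with over 40 deaths

p_t = Fraction(films_in_2002, total_films)             # P(t)
p_y_given_t = Fraction(deadly_in_2002, films_in_2002)  # P(y|t)
p_y_and_t = Fraction(deadly_in_2002, total_films)      # P(y, t)

# Product rule: P(y, t) = P(y|t) P(t); the count of 2002 films cancels.
print(p_y_given_t * p_t == p_y_and_t)  # True
```

The cancellation of the 27 is the whole content of the rule for empirical estimates: the conditional divides by the number of films in the year, and multiplying by $P(t)$ puts that number back.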

The Sum Rule¶

The other fundamental rule of probability is the sum rule. This tells us how to get a marginal distribution from the joint distribution. Simply put, it says that we need to sum across the value we'd like to remove. $$ P(Y=y) = \sum_{t} P(Y=y, T=t) $$ Or in our shortened notation $$ P(y) = \sum_{t} P(y, t) $$
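
A small sketch of the sum rule in action is given below. It uses a toy list of (year, body count) pairs, invented for illustration rather than taken from the movie data, computes the joint probability $P(Y>40, T=t)$ for each year and sums over the years to recover the marginal $P(Y>40)$.

```python
from collections import Counter
from fractions import Fraction

# Toy (year, body_count) pairs, invented for illustration.
films = [(2000, 12), (2000, 55), (2001, 80), (2001, 3), (2002, 41), (2002, 90)]
num_films = len(films)

# Joint probabilities P(Y>40, T=t), one entry per year t.
joint_counts = Counter(year for year, count in films if count > 40)
joint = {year: Fraction(n, num_films) for year, n in joint_counts.items()}

# Sum rule: sum the joint over t to remove the year.
p_y_sum_rule = sum(joint.values())

# Direct estimate of the marginal P(Y>40), ignoring the year altogether.
p_y_direct = Fraction(sum(1 for _, count in films if count > 40), num_films)

print(p_y_sum_rule == p_y_direct)  # True
```

The same check could be run on the film_deaths data frame by looping over the distinct values in the Year column.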

Bayes' Rule¶

Bayes' rule is a very simple rule, it's hardly worth the name of a rule at all. It follows directly from the product rule of probability. Because $P(y, t) = P(y|t)P(t)$ and by symmetry $P(y,t)=P(t,y)=P(t|y)P(y)$ then by equating these two expressions and dividing through by $P(y)$ we have $$ P(t|y) = \frac{P(y|t)P(t)}{P(y)} $$ which is known as Bayes' rule (or Bayes's rule, it depends how you choose to pronounce it). It's not difficult to derive, and its importance is more to do with the semantic operation that it enables. Each of these probability distributions represents the answer to a question we have about the world. Bayes' rule (via the product rule) tells us how to invert the probability.
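
As a worked example of this inversion, we can ask: given that a film has over 40 deaths, what is the probability that it was made in 2002? The sketch below uses counts inferred from the values printed earlier in this notebook (421 films in total, 27 of them from 2002, 11 of those with over 40 deaths, and 159 films with over 40 deaths overall, since 159/421 matches the printed 0.3776...). These counts are inferred from the printed output rather than read from the data.

```python
from fractions import Fraction

# Counts inferred from the printed values earlier in the notebook.
total_films = 421
films_in_2002 = 27
deadly_in_2002 = 11    # films from 2002 with over 40 deaths
deadly_films = 159     # films with over 40 deaths in any year

p_t = Fraction(films_in_2002, total_films)             # P(t)
p_y_given_t = Fraction(deadly_in_2002, films_in_2002)  # P(y|t)
p_y = Fraction(deadly_films, total_films)              # P(y)

# Bayes' rule: P(t|y) = P(y|t) P(t) / P(y)
p_t_given_y = p_y_given_t * p_t / p_y

print(p_t_given_y)  # 11/159
```

The answer, 11/159 (roughly 0.07), is simply the fraction of the high body count films that come from 2002, which is what we would expect from counting directly.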

More Fun on the Python Data Farm¶

If you want to explore more of the things you can do with movies and python you might be interested in the imdbpy python library.

You can try installing it using pip as follows.

In [ ]:

!pip install -U IMDbPY

If this doesn't work on your machine, try following the instructions at http://imdbpy.sourceforge.net/.

Once you've installed imdbpy you can test it works with the following script, which should list movies with the word 'python' in their title. To run the code in the following box, simply click the box and press SHIFT-enter or CTRL-enter. Then you can try running the code below.

In [ ]:

from imdb import IMDb
ia = IMDb()

for movie in ia.search_movie('python'):
    print(movie)

In [ ]:

from IPython.display import YouTubeVideo
YouTubeVideo('GX8VLYUYScM')
