Exploring Movie Body Counts

Author: Ramiro Gómez

A look at movie body counts based on information from the website Movie Body Counts.

About the data source

Movie Body Counts is a forum where users collect on-screen body counts for a selection of films and the characters and actors who appear in these films. The dataset currently contains counts for 545 films from 1949 to 2013, which is a very small sample of all the films produced in this time frame.

To be counted, a kill or dead body has to be visible on screen; implied deaths, like those of the people who died in the explosion of the Death Star, are not counted. For more details on how counts should be conducted, see their guidelines. The first one reads:

The "body counts" for this site are mostly "on screen kills/deaths" or fatal/critical/mortal shots/hits of human, humanoid, or creatures (ie monsters, aliens, zombies.) The rule of thumb is "do they bleed" which will leave the concept of cyborgs somewhat open and decided per film. The human and creature counts should be separate. These will be added together for a final tally.

Apart from the small number of films in this dataset, we can safely assume a selection bias. So take this exploration with a grain of salt and don't generalize any of the results. This is mainly a demonstration of some of the things you can do with the tools being used, and a fun dataset to look at.

The CSV dataset is kindly provided by Randal Olson (@randal_olson), who went to the effort of collecting the death toll data from Movie Body Counts and adding MPAA and IMDB ratings as well as film lengths.

Import packages

To explore and visualize the data I'll be using several Python packages that greatly facilitate these tasks, namely: NumPy, pandas and matplotlib.

In [1]:
%load_ext signature
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('ramiro')

chartinfo = 'Author: Ramiro Gómez - ramiro.org • Data: Movie Body Counts - moviebodycounts.com'

Load data and first look

We can directly download the CSV file from the Web and read it into a pandas DataFrame object.

In [2]:
df = pd.read_csv('http://files.figshare.com/1332945/film_death_counts.csv')

To get a grasp of the data, let's look at the first few rows of the DataFrame.

In [3]:
df.head()
Out[3]:
   Film                  Year  Body_Count  MPAA_Rating  Genre                         Director                 Length_Minutes  IMDB_Rating
0  24 Hour Party People  2002  7           R            Biography|Comedy|Drama|Music  Michael Winterbottom     117             7.3
1  28 Days Later         2002  53          R            Horror|Sci-Fi|Thriller        Danny Boyle              113             7.6
2  28 Weeks Later        2007  212         R            Horror|Sci-Fi|Thriller        Juan Carlos Fresnadillo  100             7.0
3  30 Days of Night      2007  67          R            Horror|Thriller               David Slade              113             6.6
4  300                   2007  600         R            Action|Fantasy|History|War    Zack Snyder              117             7.7

This dataset looks well suited for some exploration and visualization. I'll rename some columns now, so that the labels used later on are shorter and a little nicer.

In [4]:
df.columns = ['Film', 'Year', 'Body count', 'MPAA', 'Genre', 'Director', 'Minutes', 'IMDB']

Let's also add a Film count column, to keep track of the number of films when grouping, and a column for the body count per minute.

In [5]:
df['Film count'] = 1
df['Body count/min'] = df['Body count'] / df['Minutes'].astype(float)
df.head()
Out[5]:
   Film                  Year  Body count  MPAA  Genre                         Director                 Minutes  IMDB  Film count  Body count/min
0  24 Hour Party People  2002  7           R     Biography|Comedy|Drama|Music  Michael Winterbottom     117      7.3   1           0.059829
1  28 Days Later         2002  53          R     Horror|Sci-Fi|Thriller        Danny Boyle              113      7.6   1           0.469027
2  28 Weeks Later        2007  212         R     Horror|Sci-Fi|Thriller        Juan Carlos Fresnadillo  100      7.0   1           2.120000
3  30 Days of Night      2007  67          R     Horror|Thriller               David Slade              113      6.6   1           0.592920
4  300                   2007  600         R     Action|Fantasy|History|War    Zack Snyder              117      7.7   1           5.128205

Body counts over time

Next we look at how body counts have evolved over the time frame covered by the dataset. To do so, the DataFrame is grouped by year, calculating the means, medians, and sums of the numeric columns. For a change, we print the last few records.

In [6]:
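# aggregate only the numeric columns; text columns such as Film cannot be averaged
group_year = df.select_dtypes(include='number').groupby('Year').agg(['mean', 'median', 'sum'])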
group_year.tail()
Out[6]:
      Body count                Minutes                   IMDB                     Film count         Body count/min
      mean        median  sum   mean        median  sum   mean      median  sum    mean  median  sum  mean      median    sum
Year
2007  85.312500   45.5    4095  114.062500  111.0   5475  6.829167  7.00    327.8  1     1       48   0.749838  0.366966  35.992220
2008  68.653846   37.0    1785  109.615385  108.5   2850  6.573077  6.65    170.9  1     1       26   0.635468  0.371208  16.522174
2009  55.000000   59.0    605   112.272727  110.0   1235  6.845455  6.60    75.3   1     1       11   0.518937  0.385621  5.708305
2010  129.750000  126.0   519   115.750000  111.0   463   7.250000  7.25    29.0   1     1       4    1.132110  1.005280  4.528441
2013  156.000000  156.0   156   119.000000  119.0   119   6.500000  6.50    6.5    1     1       1    1.310924  1.310924  1.310924

The group_year DataFrame now contains several columns that are not useful, like the mean and median film count. We simply ignore them and look at the film and body counts instead.
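As a side note, the unused columns could be avoided altogether with pandas' named aggregation, which produces only the requested statistics under flat column names. The following is a minimal sketch on a toy frame built from the five rows shown by head() earlier; the frame name films and the output column names are my own choices for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the full dataset: the five rows shown by head() earlier.
films = pd.DataFrame({
    'Year': [2002, 2002, 2007, 2007, 2007],
    'Body count': [7, 53, 212, 67, 600],
    'Minutes': [117, 113, 100, 113, 117],
})
films['Body count/min'] = films['Body count'] / films['Minutes']

# Named aggregation: each keyword becomes one flat output column,
# defined as a (source column, aggregation) pair.
by_year = films.groupby('Year').agg(
    film_count=('Body count', 'size'),
    body_count_sum=('Body count', 'sum'),
    body_count_mean=('Body count', 'mean'),
    body_count_median=('Body count', 'median'),
    per_min_mean=('Body count/min', 'mean'),
)
print(by_year)
```

With this layout a single statistic is selected with one label, for example by_year['body_count_sum'], instead of a two-level lookup.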

With matplotlib's subplots function multiple graphs can be combined into one graphic. This allows comparing several distributions that have differing scales, as is the case for the film, total and average body counts.

In [7]:
df_bc = pd.DataFrame({'mean': group_year['Body count']['mean'],
                      'median': group_year['Body count']['median']})

df_bc_min = pd.DataFrame({'mean': group_year['Body count/min']['mean'], 
                          'median': group_year['Body count/min']['median']})

fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(16, 22))

group_year['Film count']['sum'].plot(kind='bar', ax=axes[0]); axes[0].set_title('Film Count')
group_year['Body count']['sum'].plot(kind='bar', ax=axes[1]); axes[1].set_title('Total Body Count')
df_bc.plot(kind='bar', ax=axes[2]); axes[2].set_title('Body Count by Film')
df_bc_min.plot(kind='bar', ax=axes[3]); axes[3].set_title('Body Count by Minute')

for i in range(4):
    axes[i].set_xlabel('', visible=False)
    
plt.annotate(chartinfo, xy=(0, -1.2), xycoords='axes fraction')
Out[7]:
<matplotlib.text.Annotation at 0x7f8de8a1a150>

What we can safely say is that 2007 is the year with the most films in our dataset. The charts also show the selection bias quite well: only one film was reviewed for each of the years 1978 and 2013, and both have a pretty high body count.
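One way to make this kind of thin coverage explicit is to count the films per year and list the years that fall below some minimum. This is only a sketch on made-up toy data; the threshold min_films is an arbitrary value chosen for illustration and does not come from the dataset.

```python
import pandas as pd

# Toy year column; in the notebook this would be df['Year'].
years = pd.Series([1978, 2007, 2007, 2007, 2008, 2008, 2013], name='Year')

# Number of films per year, ordered chronologically.
films_per_year = years.value_counts().sort_index()

# min_films is an arbitrary illustrative threshold, not a value from the notebook.
min_films = 2
thin_years = films_per_year[films_per_year < min_films].index.tolist()
print(thin_years)
```

Yearly means and medians for the years listed this way rest on a single film and say little about the year as a whole.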

Most violent films

Now let's see which films have the highest total body counts and body counts per minute. This time we plot two horizontal bar charts next to each other, again using the subplots function.

Note that sorting is ascending by default, so we call tail to get the 10 films with the highest total body count and the 10 with the highest body count per minute. We could set ascending to False in the sort call and use head, but this would plot the highest value at the bottom. Also, the y-axis labels of the right chart are printed on the right, so they don't overlap with the left chart.
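The effect of the sort order can be checked without plotting anything. The sketch below uses a toy Series built from the five films shown by head() earlier, and nlargest as an alternative way to select the top entries; it is an illustration of the ordering issue, not the code used for the charts.

```python
import pandas as pd

# Toy body counts, taken from the five rows shown by head() earlier.
counts = pd.Series(
    {'300': 600, '28 Weeks Later': 212, '30 Days of Night': 67,
     '28 Days Later': 53, '24 Hour Party People': 7},
    name='Body count',
)

# nlargest returns the top values in descending order ...
top3 = counts.nlargest(3)

# ... so they are re-sorted ascending: a horizontal bar chart draws the first
# element at the bottom, which puts the largest value at the top of the chart.
top3_for_barh = top3.sort_values()
print(top3_for_barh)
```

Calling top3_for_barh.plot(kind='barh') would then show 300 as the topmost bar.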

In [8]:
df_film = df.set_index('Film')

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 8))

# DataFrame.sort was removed from pandas; sort_values is its replacement
bc = df_film['Body count'].sort_values().tail(10)
bc.plot(kind='barh', ax=axes[0])
axes[0].set_title('Total Body Count')

bc_min = df_film['Body count/min'].sort_values().tail(10)
bc_min.plot(kind='barh', ax=axes[1])
axes[1].set_title('Body Count per Minute')
axes[1].yaxis.set_ticks_position('right')

for i in range(2):
    axes[i].set_ylabel('', visible=False)
    
plt.annotate(chartinfo, xy=(0, -1.07), xycoords='axes fraction')
Out[8]:
<matplotlib.text.Annotation at 0x7f8de84306d0>

There is a considerable gap between Lord of the Rings and the runner-up Kingdom of Heaven in the left chart, but when you take runtime into account, the latter is slightly more violent. Both are surpassed by 300 when one looks at deaths per minute, which shouldn't surprise anyone who has seen it. Below you can see why.

In [9]:
from IPython.display import IFrame
IFrame('https://www.youtube-nocookie.com/embed/HdNn5TZu6R8', width=800, height=450)
Out[9]:

Most violent directors

Now let's look at directors. As you may have noticed above, the Genre column can contain multiple values separated by | characters. This also applies to the Director column; here are some examples.

In [10]:
df[df['Director'].str.contains('|', regex=False)].head()
Out[10]:
    Film                                               Year  Body count  MPAA  Genre                          Director                       Minutes  IMDB  Film count  Body count/min
26  Aliens vs. Predator: Requiem                       2007  5           R     Action|Horror|Sci-Fi|Thriller  Colin Strause|Greg Strause     94       4.7   1           0.053191
38  Aqua Teen Hunger Force Colon Movie Film for Th...  2007  67          R     Animation|Action|Adventure     Matt Maiellaro|Dave Willis     86       6.8   1           0.779070
46  Bangkok Dangerous                                  2008  38          R     Action|Crime|Thriller          Oxide Pang Chun|Danny Pang     99       5.4   1           0.383838
47  Barton Fink                                        1991  3           R     Drama|Mystery                  Joel Coen|Ethan Coen           116      7.7   1           0.025862
83  City of God                                        2002  60          R     Crime|Drama                    Fernando Meirelles|Katia Lund  130      8.7   1           0.461538

Since I want to group by director later, I have to decide what to do with these multi-value entries. I'll create a new data frame in which a film with a single director gets one row, and a film with several directors gets one row per director, each with the same count values. I also considered dividing the body counts by the number of directors, but decided against it.

The following function does this. I suspect there is a more elegant way to do it with pandas, but it works for arbitrary columns.

In [11]:
def expand_col(df_src, col, sep='|'):
    di = {}
    idx = 0
    for i in df_src.iterrows():
        d = i[1]
        names = d[col].split(sep)
        for name in names:
            # operate on a copy to not overwrite previous director names
            c = d.copy()
            c[col] = name
            di[idx] = c
            idx += 1

    df_new = pd.DataFrame(di).transpose()
    # these two columns are not recognized as numeric
    df_new['Body count'] = df_new['Body count'].astype(float)
    df_new['Body count/min'] = df_new['Body count/min'].astype(float)
    
    return df_new
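For comparison, one candidate for the more elegant way suspected above is the combination of str.split and DataFrame.explode, available in pandas 0.25 and later. This is a sketch on a three-row toy frame, not a drop-in replacement tested against the full dataset; the function name expand_col_explode is made up for this example.

```python
import pandas as pd

# Toy frame with two multi-director films and one single-director film.
films = pd.DataFrame({
    'Film': ['Barton Fink', 'City of God', '300'],
    'Director': ['Joel Coen|Ethan Coen', 'Fernando Meirelles|Katia Lund', 'Zack Snyder'],
    'Body count': [3, 60, 600],
})

def expand_col_explode(df_src, col, sep='|'):
    """Hypothetical variant of expand_col built on str.split and explode."""
    # Split each cell into a list, then give every list element its own row.
    # The other columns are repeated for each new row and keep their dtypes.
    return (df_src.assign(**{col: df_src[col].str.split(sep)})
                  .explode(col)
                  .reset_index(drop=True))

df_dir = expand_col_explode(films, 'Director')
print(df_dir)
```

Because the numeric columns keep their dtypes, the casts back to float at the end of expand_col would not be needed with this approach.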

Now, similar to the film ranking, let's plot a director ranking.

In [12]:
df_dir = expand_col(df, 'Director')

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 8))

# select the column before aggregating, so the text columns are never summed or averaged
bc_sum = df_dir.groupby('Director')['Body count'].sum().sort_values().tail(10)
bc_sum.plot(kind='barh', ax=axes[0])
axes[0].set_title('Total Body Count')

bc_mean = df_dir.groupby('Director')['Body count/min'].mean().sort_values().tail(10)
bc_mean.plot(kind='barh', ax=axes[1])
axes[1].set_title('Body Count per Minute')
axes[1].yaxis.set_ticks_position('right')

for i in range(2):
    axes[i].set_ylabel('', visible=False)

plt.annotate(chartinfo, xy=(0, -1.07), xycoords='axes fraction')
Out[12]:
<matplotlib.text.Annotation at 0x7f8de82d7f50>

Body counts in film genres

As mentioned above, Genre is a multi-value column too. So let's again create a new data frame, in which each film can count towards multiple genres, and look at the frequency distribution of films by genre.

In [13]:
df_genre = expand_col(df, 'Genre')
df_genre['Genre'].value_counts().plot(kind='bar', figsize=(12, 6), title='Genres by film count')

plt.annotate(chartinfo, xy=(0, -1.28), xycoords='axes fraction')
Out[13]:
<matplotlib.text.Annotation at 0x7f8de7fe2a10>

Looking at the total body counts for genres doesn't make much sense, since some genres occur much more frequently. Instead, let's look at genres by average body count per minute.

In [14]:
bc_mean = df_genre.groupby('Genre')['Body count/min'].mean().sort_values(ascending=False)
ax = bc_mean.plot(kind='bar', figsize=(12, 6), title='Genres by body count per minute')
ax.set_xlabel('', visible=False)
plt.annotate(chartinfo, xy=(0, -1.32), xycoords='axes fraction')
Out[14]:
<matplotlib.text.Annotation at 0x7f8de81fe310>

It is not a huge surprise to see war movies on top, and since many of them are also classified as history movies, that genre comes in second. Several of the most deadly films are counted in both of these genres; see some examples below.

In [15]:
df_genre[(df_genre['Genre'] == 'War') | (df_genre['Genre'] == 'History')].sort_values('Body count/min', ascending=False).head(20)
Out[15]:
      Film                                     Year  Body count  MPAA     Genre    Director            Minutes  IMDB  Film count  Body count/min
14    300                                      2007  600         R        History  Zack Snyder         117      7.7   1           5.128205
15    300                                      2007  600         R        War      Zack Snyder         117      7.7   1           5.128205
668   Kingdom of Heaven                        2005  610         R        History  Ridley Scott        144      7.1   1           4.236111
669   Kingdom of Heaven                        2005  610         R        War      Ridley Scott        144      7.1   1           4.236111
1204  Tae Guk Gi: The Brotherhood of War       2004  590         R        War      Je-kyu Kang         140      8.1   1           4.214286
1388  The Last Samurai                         2003  558         R        History  Edward Zwick        154      7.7   1           3.623377
1389  The Last Samurai                         2003  558         R        War      Edward Zwick        154      7.7   1           3.623377
1254  The Big Red One                          1980  338         R        War      Samuel Fuller       113      7.3   1           2.991150
1658  Windtalkers                              2002  389         R        War      John Woo            134      5.9   1           2.902985
956   Rambo                                    2008  247         R        War      Sylvester Stallone  92       7.1   1           2.684783
1649  We Were Soldiers                         2002  305         R        War      Randall Wallace     138      7     1           2.210145
1648  We Were Soldiers                         2002  305         R        History  Randall Wallace     138      7     1           2.210145
521   Glory                                    1989  258         R        War      Edward Zwick        122      7.9   1           2.114754
520   Glory                                    1989  258         R        History  Edward Zwick        122      7.9   1           2.114754
736   Lone Wolf and Cub: White Heaven in Hell  1974  169         Unrated  History  Yoshiyuki Kuroda    83       7.5   1           2.036145
1243  The Alamo                                2004  249         PG-13    History  John Lee Hancock    137      5.9   1           1.817518
1244  The Alamo                                2004  249         PG-13    War      John Lee Hancock    137      5.9   1           1.817518
1642  Waterloo                                 1970  210         G        History  Sergey Bondarchuk   123      7.2   1           1.707317
815   Mongol: The Rise of Genghis Khan         2007  200         R        History  Sergey Bodrov       126      7.3   1           1.587302
1393  The Last of the Mohicans                 1992  172         PG-13    History  Michael Mann        112      7.8   1           1.535714

MPAA and IMDB ratings

Finally let's look at the MPAA and IMDB ratings and how they relate to the movie body counts by creating two scatter plots.

Since MPAA ratings are not numeric, their values need to be mapped to numbers in some way to produce a scatter plot. We can use the value_counts method to get a series of the different MPAA ratings and their counts, sorted by frequency.

In [16]:
ratings = df['MPAA'].value_counts()
ratings
Out[16]:
R           338
PG-13       118
PG           35
Unrated      28
Approved      9
M             5
GP            4
X             4
G             3
NR            1
dtype: int64

Next, the rating names are used as the keys of a dictionary that maps each name to an integer. This dictionary is then used to map the rating values of the MPAA column to the corresponding integers.

In [17]:
rating_names = ratings.index
rating_index = range(len(rating_names))
rating_map = dict(zip(rating_names, rating_index))
mpaa = df['MPAA'].map(rating_map)
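The same kind of integer encoding can also be obtained without building a dictionary by hand, using a pandas Categorical whose categories are ordered by frequency. This is a sketch on a toy Series of ratings, shown as an alternative to the dictionary mapping above rather than as the approach used in this notebook.

```python
import pandas as pd

# Toy ratings column; in the notebook this would be df['MPAA'].
mpaa = pd.Series(['R', 'PG-13', 'R', 'G', 'PG-13', 'R'])

# Order the categories by frequency, mirroring the value_counts ordering.
order = mpaa.value_counts().index
cat = pd.Categorical(mpaa, categories=order)

codes = cat.codes               # integer position of each rating within `order`
labels = list(cat.categories)   # tick labels, in the same order as the codes
print(list(codes), labels)
```

The codes can serve as x positions in a scatter plot, and labels as the matching tick labels.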

Now we can create a scatter plot with the following few lines of code, where the MPAA ratings are shown on the x-axis, the body counts per minute on the y-axis, and the circle sizes are determined by the total body counts of the movies.

In [18]:
fig, ax = plt.subplots(figsize=(14, 10))
ax.scatter(mpaa, df['Body count/min'], s=df['Body count'], alpha=.5)
ax.set_title('Body counts and MPAA ratings')
ax.set_xlabel('MPAA Rating')
ax.set_xticks(rating_index)
ax.set_xticklabels(rating_names)
ax.set_ylabel('Body count per minute')
plt.annotate(chartinfo, xy=(0, -1.12), xycoords='axes fraction')
Out[18]:
<matplotlib.text.Annotation at 0x7f8de8368190>

One of the things this diagram shows is that the film with the highest body count, which also has a pretty high body count per minute, is rated PG-13. Looking back at the film rankings above, we know that it is Lord of the Rings: Return of the King, but wouldn't it be nice to have labels for at least some of the circles? Annotating graphs is demonstrated in the next scatter plot, which shows body counts and IMDB ratings.

To avoid cluttering the graph, only the 3 movies with the highest body count will be labeled. They go into a list of lists called annotations, where each inner list is made up of the label text and the x and y positions of the label. Then, after setting up the basic plot, the annotations are added to it in a loop. The label positions can be adjusted with the xytext, textcoords, ha, and va arguments to the annotate method.

In [19]:
bc_top = df.sort_values('Body count', ascending=False)[:3]
annotations = []
for r in bc_top.iterrows():
    annotations.append([r[1]['Film'], r[1]['IMDB'], r[1]['Body count/min']])

fig, ax = plt.subplots(figsize=(14, 10))
ax.scatter(df['IMDB'], df['Body count/min'], s=df['Body count'], alpha=.5)
ax.set_title('Body count and IMDB ratings')
ax.set_xlabel('IMDB Rating')
ax.set_ylabel('Body count per minute')

for annotation, x, y in annotations:
    plt.annotate(
        annotation,
        xy=(x, y),
        xytext=(0, 30),
        textcoords='offset points',
        ha='center',
        va='bottom',
        size=12.5,
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='-'))

plt.annotate(chartinfo, xy=(0, -1.12), xycoords='axes fraction')
Out[19]:
<matplotlib.text.Annotation at 0x7f8de81e6fd0>

Summary

This notebook demonstrates some of the basic features of pandas, NumPy and matplotlib for processing, exploring and visualizing data.

Due to the dataset's limitations mentioned in the introduction, I refrained from interpreting the results too much. The focus of this notebook is how you can use these tools to get to know a dataset. They offer many more possibilities and advanced features, and they are all free and open source. I can only recommend using them and will certainly keep on doing so myself.

In [20]:
%signature
Out[20]:
Author: Ramiro Gómez • Last edited: July 31, 2015