from IPython.display import YouTubeVideo, Image, HTML
YouTubeVideo('0Q14rHLvMco')
For some reason I've been thinking a lot about LOST lately, enough that I rewatched the pilot a few nights ago. I got to thinking: how have all of the actors fared in their post-LOST careers? Despite its trials and tribulations, did acting on LOST give them a sense of purpose, just like Jack felt with the Island? Is it time yet for a career-revitalizing LOST reboot?
Normally these questions are relegated to some very simple slide show listicle. However, we don't have to settle for that! We've got data! We can perform a far more interesting analysis than googling "Matthew Fox."
I scraped all of this data from IMDB, following the process below:
Minor Roles: To eliminate minor roles, I only counted roles where that actor appeared on the main cast list for that movie/tv. For example: Jorge Garcia was in two episodes of How I Met Your Mother, but doesn't appear on the main HIMYM IMDB cast page.
Year info: For TV shows, the year included is the year that TV show premiered. It's not the year in which an actor might have appeared on the show. For example: everyone who appeared in LOST will have a year of 2004, regardless of when they actually started on the show. Actors this impacts:
Language: I'll use actors to refer to both actors and actresses throughout this exploration. I'll use the term media to refer to the general collection of TV or Movies.
import numpy as np  # needed later for aggfunc=np.size
import pandas as pd
%matplotlib inline
We'll first read in the dataset and see what our data looks like.
we_have_to_go_back = pd.read_csv('./data/LOST_clean.csv')
print("Total rows: %d" % len(we_have_to_go_back))
we_have_to_go_back.head()
Total rows: 353
| | actor | title | score | start_year | type |
---|---|---|---|---|---|
0 | Jorge Garcia | The Wedding Ringer | 6.8 | 2015 | Movie |
1 | Jorge Garcia | Cooties | 5.3 | 2014 | Movie |
2 | Jorge Garcia | iSteve | 5.4 | 2013 | Movie |
3 | Jorge Garcia | The Ordained | 6.9 | 2013 | TV Movie |
4 | Jorge Garcia | Alcatraz | 7.1 | 2012 | TV Series |
We have 353 total rows listing the actor, the title of the media, the IMDB score, the year that media first aired, and the type of media. Let's first take a look at what different types of media we're working with.
we_have_to_go_back['type'].value_counts()
Movie            177
TV Movie          79
TV Series         55
Other/Unknown     23
Video Game        19
dtype: int64
We're only going to include the data from Television or Film, and exclude Other/Unknown and Video Game.
# Keep only film and television rows; copy so later column assignments don't touch a view.
big_and_small_screen = we_have_to_go_back[
    we_have_to_go_back['type'].isin(['TV Series', 'Movie', 'TV Movie'])].copy()
Now that we've got a clean dataset, let's get a little more information about the scores. LOST's IMDB score is an 8.5, but we have no context to understand whether that's high or low. (Sidebar: Here is a good analysis of the distribution of all IMDB scores)
We've also got to remove the duplicates for this next step. LOST is listed 15 times (once for each actor) hence the spike around 8.5. We'll assume a duplicate is an item with the same title and score.
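As a minimal sketch of that duplicate rule, here is `drop_duplicates` applied to a toy frame. The rows and actor names below are hypothetical and are not taken from the real dataset; only the `title` and `score` column names match it.

```python
import pandas as pd

# Hypothetical rows for illustration only, not the real dataset.
toy = pd.DataFrame({
    'actor': ['Actor A', 'Actor B', 'Actor A'],
    'title': ['Lost', 'Lost', 'Some Movie'],
    'score': [8.5, 8.5, 6.0],
})

# Rows that share both title and score collapse to the first occurrence.
unique_titles = toy.drop_duplicates(['title', 'score'])
print("%d rows -> %d unique titles" % (len(toy), len(unique_titles)))
```

Note that two different titles that happen to share a name and a score would also be merged under this rule, which is the assumption stated above.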
Let's look at the distribution with a histogram, and also print out some summary statistics.
big_and_small_screen.drop_duplicates(['title','score'])['score'].hist(bins=16)
big_and_small_screen.drop_duplicates(['title','score'])['score'].describe()
count    296.000000
mean       6.399324
std        1.059773
min        2.900000
25%        5.800000
50%        6.500000
75%        7.100000
max        9.000000
dtype: float64
Comparing LOST's 8.5 score to these numbers shows us a few things:
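One rough way to quantify the comparison is to measure how far 8.5 sits from the mean in units of standard deviation. This is only a sketch: the numbers are copied by hand from the `describe()` output above, and the variable names are introduced here for illustration.

```python
# Summary statistics copied from the describe() output above.
mean, std = 6.399324, 1.059773
third_quartile = 7.1
lost_score = 8.5

# Distance from the mean, expressed in standard deviations.
z_score = (lost_score - mean) / std
print("Standard deviations above the mean: %.2f" % z_score)
print("Above the 75th percentile: %s" % (lost_score > third_quartile))
```

The score distribution is not guaranteed to be normal, so the z-score is best read as a loose indicator rather than a precise percentile.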
Also notice the top scored media for any actor is 9.0. Out of curiosity, let's take a look at the top 5 scored items in our dataset:
big_and_small_screen.sort_values('score', ascending=False).head(5)
| | actor | title | score | start_year | type |
---|---|---|---|---|---|
171 | Terry O'Quinn | Guts and Glory: The Rise and Fall of Oliver North | 9.0 | 1989 | TV Movie |
295 | Harold Perrineau | Oz | 8.9 | 1997 | TV Series |
23 | Naveen Andrews | Lost | 8.5 | 2004 | TV Series |
284 | Harold Perrineau | Lost | 8.5 | 2004 | TV Series |
337 | Ken Leung | Lost | 8.5 | 2004 | TV Series |
Don't tell Terry O'Quinn what he can't do, because he can clearly star in a highly rated 1989 TV Movie.
YouTubeVideo('arMtFxv7jlw')
Even in our listing of top 5 scores, we already see LOST appearing in there. It seems time to ask the ultimate question:
The next cell finds the maximum score for each actor, then prints the row that score appears on.
big_and_small_screen.loc[big_and_small_screen.groupby('actor')['score'].idxmax()]
| | actor | title | score | start_year | type |
---|---|---|---|---|---|
80 | Daniel Dae Kim | Lost | 8.5 | 2004 | TV Series |
255 | Dominic Monaghan | Lost | 8.5 | 2004 | TV Series |
315 | Elizabeth Mitchell | Lost | 8.5 | 2004 | TV Series |
193 | Emilie de Ravin | Lost | 8.5 | 2004 | TV Series |
116 | Evangeline Lilly | Lost | 8.5 | 2004 | TV Series |
295 | Harold Perrineau | Oz | 8.9 | 1997 | TV Series |
233 | Henry Ian Cusick | Lost | 8.5 | 2004 | TV Series |
6 | Jorge Garcia | Lost | 8.5 | 2004 | TV Series |
67 | Josh Holloway | Lost | 8.5 | 2004 | TV Series |
337 | Ken Leung | Lost | 8.5 | 2004 | TV Series |
50 | Matthew Fox | Lost | 8.5 | 2004 | TV Series |
205 | Michael Emerson | Person of Interest | 8.5 | 2011 | TV Series |
23 | Naveen Andrews | Lost | 8.5 | 2004 | TV Series |
171 | Terry O'Quinn | Guts and Glory: The Rise and Fall of Oliver North | 9.0 | 1989 | TV Movie |
102 | Yunjin Kim | Lost | 8.5 | 2004 | TV Series |
Of the 15 most frequent actors on LOST, only 2 (Harold Perrineau and Terry O'Quinn) have ever had a major role in something that scores higher than LOST. Note that Person of Interest, Michael Emerson's top title, is rated the same as LOST, so we're excluding him from the club.
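The "club" rule is a strict greater-than comparison against 8.5. Here is a small sketch using four best scores copied by hand from the table above (not the full set of 15), with `best` and `beat_lost` as names introduced for illustration:

```python
import pandas as pd

# Best scores copied from a few rows of the table above.
best = pd.Series({
    "Harold Perrineau": 8.9,
    "Jorge Garcia": 8.5,
    "Michael Emerson": 8.5,
    "Terry O'Quinn": 9.0,
})
lost_score = 8.5

# Strict comparison: a tie with LOST (Michael Emerson) does not count.
beat_lost = best[best > lost_score]
print(sorted(beat_lost.index))
```

Switching `>` to `>=` would admit every actor in the table, since each of them has at least one 8.5 title.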
We can also explore how many appearances each actor has had before and after LOST. In order to do that, we'll flag every entry as post-LOST if it started after 2004, then count the number of titles that come before or after.
# side note: not happy with this code... there must be a better way.
big_and_small_screen['post_lost'] = big_and_small_screen['start_year'] > 2004
before_and_after = pd.pivot_table(big_and_small_screen, columns=['post_lost'],
values=['start_year'], index=['actor'], aggfunc=np.size).reset_index()
before_and_after['more_after_lost'] = (before_and_after['start_year'][True] - before_and_after['start_year'][False] > 0)
before_and_after
| | actor | start_year (post_lost = False) | start_year (post_lost = True) | more_after_lost |
---|---|---|---|---|
0 | Daniel Dae Kim | 9 | 4 | False |
1 | Dominic Monaghan | 6 | 10 | True |
2 | Elizabeth Mitchell | 15 | 8 | False |
3 | Emilie de Ravin | 3 | 12 | True |
4 | Evangeline Lilly | 1 | 3 | True |
5 | Harold Perrineau | 16 | 21 | True |
6 | Henry Ian Cusick | 8 | 9 | True |
7 | Jorge Garcia | 7 | 8 | True |
8 | Josh Holloway | 6 | 8 | True |
9 | Ken Leung | 12 | 8 | False |
10 | Matthew Fox | 7 | 5 | False |
11 | Michael Emerson | 9 | 10 | True |
12 | Naveen Andrews | 18 | 9 | False |
13 | Terry O'Quinn | 61 | 5 | False |
14 | Yunjin Kim | 5 | 8 | True |
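The side note in the code above wonders whether there is a better way. One option that may read more cleanly is `pd.crosstab`, which counts rows for each actor and period directly. This is a sketch on hypothetical rows, not the real dataset; the `period` column and its `before`/`after` labels are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration only, not the real dataset.
toy = pd.DataFrame({
    'actor': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'start_year': [2000, 2004, 2010, 2012, 2003, 2006, 2008],
})

# Label each title relative to LOST's 2004 premiere.
toy['period'] = (toy['start_year'] > 2004).map({False: 'before', True: 'after'})

# One row per actor, one column per period, cell values are title counts.
counts = pd.crosstab(toy['actor'], toy['period'])
counts['more_after_lost'] = counts['after'] > counts['before']
print(counts)
```

Using string labels instead of `True`/`False` as column names keeps the result flat, so there is no multi-level column header to index into.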
9 out of 15 actors had more major roles after 2004. This is a pretty naive comparison, though, since a recurring role on a TV show is only going to count for 1, while an actor who chooses to go to the big screen is going to have multiple movies they're starring in. It also doesn't take into account things like Terry O'Quinn's massive 61 roles before LOST.
On that note, let's see if there's a difference in what type of media the actors starred in before and after LOST. We'll count the number of Movie, TV Series, and TV Movie roles to each actor's name before and after LOST, then see which of those categories is the highest.
# Titles that started in 2004 or earlier (this includes LOST itself).
pre_LOST_roles = big_and_small_screen[~big_and_small_screen['post_lost']]
actor_type_counts = pre_LOST_roles.groupby(['actor','type']).size().reset_index()
actor_type_counts.columns = ['actor','type','occurrences']
# For each actor, keep the row of the most frequent media type.
actor_type_counts.loc[actor_type_counts.groupby('actor')['occurrences'].idxmax()]
| | actor | type | occurrences |
---|---|---|---|
0 | Daniel Dae Kim | Movie | 4 |
5 | Dominic Monaghan | TV Series | 3 |
6 | Elizabeth Mitchell | Movie | 6 |
10 | Emilie de Ravin | TV Series | 2 |
11 | Evangeline Lilly | TV Series | 1 |
12 | Harold Perrineau | Movie | 13 |
16 | Henry Ian Cusick | TV Movie | 4 |
18 | Jorge Garcia | Movie | 5 |
21 | Josh Holloway | Movie | 4 |
24 | Ken Leung | Movie | 9 |
29 | Matthew Fox | TV Series | 4 |
30 | Michael Emerson | Movie | 6 |
33 | Naveen Andrews | Movie | 12 |
37 | Terry O'Quinn | TV Movie | 32 |
39 | Yunjin Kim | Movie | 4 |
Note that these also include starring in LOST itself. Henry Ian Cusick is in good company with Terry O'Quinn as a major TV Movie actor! Alright!
Let's quickly tally up the types:
actor_type_counts.loc[actor_type_counts.groupby('actor')['occurrences'].idxmax()]['type'].value_counts()
Movie        9
TV Series    4
TV Movie     2
dtype: int64
# Titles that started after 2004.
post_LOST_roles = big_and_small_screen[big_and_small_screen['post_lost']]
actor_type_counts = post_LOST_roles.groupby(['actor','type']).size().reset_index()
actor_type_counts.columns = ['actor','type','occurrences']
# For each actor, keep the row of the most frequent media type.
actor_type_counts.loc[actor_type_counts.groupby('actor')['occurrences'].idxmax()]
| | actor | type | occurrences |
---|---|---|---|
0 | Daniel Dae Kim | Movie | 3 |
2 | Dominic Monaghan | Movie | 6 |
5 | Elizabeth Mitchell | Movie | 4 |
8 | Emilie de Ravin | Movie | 7 |
11 | Evangeline Lilly | Movie | 3 |
12 | Harold Perrineau | Movie | 15 |
15 | Henry Ian Cusick | Movie | 8 |
17 | Jorge Garcia | Movie | 6 |
20 | Josh Holloway | Movie | 6 |
23 | Ken Leung | Movie | 3 |
26 | Matthew Fox | Movie | 5 |
27 | Michael Emerson | Movie | 6 |
30 | Naveen Andrews | Movie | 5 |
34 | Terry O'Quinn | TV Series | 3 |
35 | Yunjin Kim | Movie | 6 |
Everyone but Terry O'Quinn seemed to go to the big screen after LOST.
Our quick, rudimentary analysis gave us some wonderful insight into the acting lives of 15 actors from LOST. Here's what we've learned: