Python Pandas Tutorial

Note: This is an abridged version of Python Pandas Tutorial: A Complete Introduction for Beginners, by George McIntire, Brendan Martin, and Lauren Washington.

In [8]:
import pandas as pd

Core components of pandas: Series and DataFrames

The primary two components of pandas are the Series and DataFrame.

A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series.

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.

Creating DataFrames from scratch

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:

In [9]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

And then pass it to the pandas DataFrame constructor:

In [10]:
purchases = pd.DataFrame(data)

purchases
Out[10]:
apples oranges
0 3 0
1 2 3
2 0 7
3 1 2

The Index of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.

Let's have customer names as our index:

In [11]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases
Out[11]:
apples oranges
June 3 0
Robert 2 3
Lily 0 7
David 1 2

So now we could locate a customer's order by using their name:

In [12]:
purchases.loc['June']
Out[12]:
apples     3
oranges    0
Name: June, dtype: int64

We can also access colums:

In [13]:
purchases['oranges']
Out[13]:
June      0
Robert    3
Lily      7
David     2
Name: oranges, dtype: int64

Reading data from CSVs

With CSV files all you need is a single line to load in the data:

In [14]:
df = pd.read_csv('purchases.csv')

df
Out[14]:
Unnamed: 0 apples oranges
0 June 3 0
1 Robert 2 3
2 Lily 0 7
3 David 1 2

CSVs don't have indexes like our DataFrames, so all we need to do is just designate the index_col when reading:

In [15]:
df = pd.read_csv('purchases.csv', index_col=0)

df
Out[15]:
apples oranges
June 3 0
Robert 2 3
Lily 0 7
David 1 2

Most important DataFrame operations

Let's load in the IMDB movies dataset to begin:

In [16]:
movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")

We're loading this dataset from a CSV and designating the movie titles to be our index.

Viewing your data

The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head():

In [17]:
movies_df.head()
Out[17]:
Rank Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
Title
Guardians of the Galaxy 1 Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0
Prometheus 2 Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
Split 3 Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
Sing 4 Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0
Suicide Squad 5 Action,Adventure,Fantasy A secret government agency recruits some of th... David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D... 2016 123 6.2 393727 325.02 40.0

.head() outputs the first five rows of your DataFrame by default, but we could also pass a number as well: movies_df.head(10) would output the top ten rows, for example.

To see the last five rows use .tail(). tail() also accepts a number, and in this case we printing the bottom two rows.:

In [18]:
movies_df.tail(2)
Out[18]:
Rank Genre Description Director Actors Year Runtime (Minutes) Rating Votes Revenue (Millions) Metascore
Title
Search Party 999 Adventure,Comedy A pair of friends embark on a mission to reuni... Scot Armstrong Adam Pally, T.J. Miller, Thomas Middleditch,Sh... 2014 93 5.6 4881 NaN 22.0
Nine Lives 1000 Comedy,Family,Fantasy A stuffy businessman finds himself trapped ins... Barry Sonnenfeld Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... 2016 87 5.3 12435 19.64 11.0

Getting info about your data

.info() should be one of the very first commands you run after loading your data:

In [19]:
movies_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
Rank                  1000 non-null int64
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB

.info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

Notice in our movies dataset we have some obvious missing values in the Revenue and Metascore columns. We'll look at how to handle those in a bit.

In [20]:
movies_df.shape
Out[20]:
(1000, 11)

Note that .shape has no parentheses and is a simple tuple of format (rows, columns). So we have 1000 rows and 11 columns in our movies DataFrame.

You'll be going to .shape a lot when cleaning and transforming data. For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed.

Column cleanup

Many times datasets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names.

Here's how to print the column names of our dataset:

In [21]:
movies_df.columns
Out[21]:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

We can use the .rename() method to rename certain or all columns via a dict. We don't want parentheses, so let's rename those:

In [22]:
movies_df.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)


movies_df.columns
Out[22]:
Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
       'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
      dtype='object')

Excellent. But what if we want to lowercase all names? Instead of using .rename() we could also set a list of names to the columns like so:

In [23]:
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 
                     'rating', 'votes', 'revenue_millions', 'metascore']


movies_df.columns
Out[23]:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
       'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')

But that's too much work. Instead of just renaming each column manually we can do a list comprehension:

In [24]:
movies_df.columns = [col.lower() for col in movies_df]

movies_df.columns
Out[24]:
Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
       'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')

Understanding your variables

Using describe() on an entire DataFrame we can get a summary of the distribution of continuous variables:

In [25]:
movies_df.describe()
Out[25]:
rank year runtime rating votes revenue_millions metascore
count 1000.000000 1000.000000 1000.000000 1000.000000 1.000000e+03 872.000000 936.000000
mean 500.500000 2012.783000 113.172000 6.723200 1.698083e+05 82.956376 58.985043
std 288.819436 3.205962 18.810908 0.945429 1.887626e+05 103.253540 17.194757
min 1.000000 2006.000000 66.000000 1.900000 6.100000e+01 0.000000 11.000000
25% 250.750000 2010.000000 100.000000 6.200000 3.630900e+04 13.270000 47.000000
50% 500.500000 2014.000000 111.000000 6.800000 1.107990e+05 47.985000 59.500000
75% 750.250000 2016.000000 123.000000 7.400000 2.399098e+05 113.715000 72.000000
max 1000.000000 2016.000000 191.000000 9.000000 1.791916e+06 936.630000 100.000000

Understanding which numbers are continuous also comes in handy when thinking about the type of plot to use to represent your data visually.

.describe() can also be used on a categorical variable to get the count of rows, unique count of categories, top category, and freq of top category:

In [26]:
movies_df['genre'].describe()
Out[26]:
count                        1000
unique                        207
top       Action,Adventure,Sci-Fi
freq                           50
Name: genre, dtype: object

This tells us that the genre column has 207 unique values, the top value is Action/Adventure/Sci-Fi, which shows up 50 times (freq).

.value_counts() can tell us the frequency of all values in a column:

In [27]:
movies_df['genre'].value_counts().head(10)
Out[27]:
Action,Adventure,Sci-Fi       50
Drama                         48
Comedy,Drama,Romance          35
Comedy                        32
Drama,Romance                 31
Action,Adventure,Fantasy      27
Animation,Adventure,Comedy    27
Comedy,Drama                  27
Comedy,Romance                26
Crime,Drama,Thriller          24
Name: genre, dtype: int64

Relationships between continuous variables

By using the correlation method .corr() we can generate the relationship between each continuous variable:

In [28]:
movies_df.corr()
Out[28]:
rank year runtime rating votes revenue_millions metascore
rank 1.000000 -0.261605 -0.221739 -0.219555 -0.283876 -0.271592 -0.191869
year -0.261605 1.000000 -0.164900 -0.211219 -0.411904 -0.126790 -0.079305
runtime -0.221739 -0.164900 1.000000 0.392214 0.407062 0.267953 0.211978
rating -0.219555 -0.211219 0.392214 1.000000 0.511537 0.217654 0.631897
votes -0.283876 -0.411904 0.407062 0.511537 1.000000 0.639661 0.325684
revenue_millions -0.271592 -0.126790 0.267953 0.217654 0.639661 1.000000 0.142397
metascore -0.191869 -0.079305 0.211978 0.631897 0.325684 0.142397 1.000000

Correlation tables are a numerical representation of the bivariate relationships in the dataset.

Positive numbers indicate a positive correlation — one goes up the other goes up — and negative numbers represent an inverse correlation — one goes up the other goes down. 1.0 indicates a perfect correlation.

So looking in the first row, first column we see rank has a perfect correlation with itself, which is obvious. On the other hand, the correlation between votes and revenue_millions is 0.6. A little more interesting.

Examining bivariate relationships comes in handy when you have an outcome or dependent variable in mind and would like to see the features most correlated to the increase or decrease of the outcome. You can visually represent bivariate relationships with scatterplots (seen below in the plotting section).

For a deeper look into data summarizations check out Essential Statistics for Data Science.

DataFrame slicing, selecting, extracting

Below are the other methods of slicing, selecting, and extracting you'll need to use constantly.

By column

You already saw how to extract a column using square brackets like this:

In [37]:
genre_col = movies_df['genre']

type(genre_col)
Out[37]:
pandas.core.series.Series

This will return a Series. To extract a column as a DataFrame, you need to pass a list of column names. In our case that's just a single column:

In [38]:
genre_col = movies_df[['genre']]

type(genre_col)
Out[38]:
pandas.core.frame.DataFrame

Since it's just a list, adding another column name is easy:

In [39]:
subset = movies_df[['genre', 'rating']]

subset.head()
Out[39]:
genre rating
Title
Guardians of the Galaxy Action,Adventure,Sci-Fi 8.1
Prometheus Adventure,Mystery,Sci-Fi 7.0
Split Horror,Thriller 7.3
Sing Animation,Comedy,Family 7.2
Suicide Squad Action,Adventure,Fantasy 6.2

Now we'll look at getting data by rows.

By rows

For rows, we have two options:

  • .loc - locates by name
  • .iloc- locates by numerical index

Remember that we are still indexed by movie Title, so to use .loc we give it the Title of a movie:

In [40]:
prom = movies_df.loc["Prometheus"]

prom
Out[40]:
rank                                                                2
genre                                        Adventure,Mystery,Sci-Fi
description         Following clues to the origin of mankind, a te...
director                                                 Ridley Scott
actors              Noomi Rapace, Logan Marshall-Green, Michael Fa...
year                                                             2012
runtime                                                           124
rating                                                              7
votes                                                          485820
revenue_millions                                               126.46
metascore                                                          65
Name: Prometheus, dtype: object

On the other hand, with iloc we give it the numerical index of Prometheus:

In [41]:
prom = movies_df.iloc[1]

loc and iloc can be thought of as similar to Python list slicing. To show this even further, let's select multiple rows.

How would you do it with a list? In Python, just slice with brackets like example_list[1:4]. It's works the same way in pandas:

In [42]:
movie_subset = movies_df.loc['Prometheus':'Sing']

movie_subset = movies_df.iloc[1:4]

movie_subset
Out[42]:
rank genre description director actors year runtime rating votes revenue_millions metascore
Title
Prometheus 2 Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
Split 3 Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0
Sing 4 Animation,Comedy,Family In a city of humanoid animals, a hustling thea... Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma... 2016 108 7.2 60545 270.32 59.0

One important distinction between using .loc and .iloc to select multiple rows is that .loc includes the movie Sing in the result, but when using .iloc we're getting rows 1:4 but the movie at index 4 (Suicide Squad) is not included.

Slicing with .iloc follows the same rules as slicing with lists, the object at the index at the end is not included.

Conditional selections

We’ve gone over how to select columns and rows, but what if we want to make a conditional selection?

For example, what if we want to filter our movies DataFrame to show only films directed by Ridley Scott or films with a rating greater than or equal to 8.0?

To do that, we take a column from the DataFrame and apply a Boolean condition to it. Here's an example of a Boolean condition:

In [43]:
condition = (movies_df['director'] == "Ridley Scott")

condition.head()
Out[43]:
Title
Guardians of the Galaxy    False
Prometheus                  True
Split                      False
Sing                       False
Suicide Squad              False
Name: director, dtype: bool

Similar to isnull(), this returns a Series of True and False values: True for films directed by Ridley Scott and False for ones not directed by him.

We want to filter out all movies not directed by Ridley Scott, in other words, we don’t want the False films. To return the rows where that condition is True we have to pass this operation into the DataFrame:

In [44]:
movies_df[movies_df['director'] == "Ridley Scott"].head()
Out[44]:
rank genre description director actors year runtime rating votes revenue_millions metascore
Title
Prometheus 2 Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
The Martian 103 Adventure,Drama,Sci-Fi An astronaut becomes stranded on Mars after hi... Ridley Scott Matt Damon, Jessica Chastain, Kristen Wiig, Ka... 2015 144 8.0 556097 228.43 80.0
Robin Hood 388 Action,Adventure,Drama In 12th century England, Robin and his band of... Ridley Scott Russell Crowe, Cate Blanchett, Matthew Macfady... 2010 140 6.7 221117 105.22 53.0
American Gangster 471 Biography,Crime,Drama In 1970s America, a detective works to bring d... Ridley Scott Denzel Washington, Russell Crowe, Chiwetel Eji... 2007 157 7.8 337835 130.13 76.0
Exodus: Gods and Kings 517 Action,Adventure,Drama The defiant leader Moses rises up against the ... Ridley Scott Christian Bale, Joel Edgerton, Ben Kingsley, S... 2014 150 6.0 137299 65.01 52.0

You can get used to looking at these conditionals by reading it like:

Select movies_df where movies_df director equals Ridley Scott

Let's look at conditional selections using numerical values by filtering the DataFrame by ratings:

In [45]:
movies_df[movies_df['rating'] >= 8.6].head(3)
Out[45]:
rank genre description director actors year runtime rating votes revenue_millions metascore
Title
Interstellar 37 Adventure,Drama,Sci-Fi A team of explorers travel through a wormhole ... Christopher Nolan Matthew McConaughey, Anne Hathaway, Jessica Ch... 2014 169 8.6 1047747 187.99 74.0
The Dark Knight 55 Action,Crime,Drama When the menace known as the Joker wreaks havo... Christopher Nolan Christian Bale, Heath Ledger, Aaron Eckhart,Mi... 2008 152 9.0 1791916 533.32 82.0
Inception 81 Action,Adventure,Sci-Fi A thief, who steals corporate secrets through ... Christopher Nolan Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen... 2010 148 8.8 1583625 292.57 74.0

We can make some richer conditionals by using logical operators | for "or" and & for "and".

Let's filter the the DataFrame to show only movies by Christopher Nolan OR Ridley Scott:

In [46]:
movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()
Out[46]:
rank genre description director actors year runtime rating votes revenue_millions metascore
Title
Prometheus 2 Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
Interstellar 37 Adventure,Drama,Sci-Fi A team of explorers travel through a wormhole ... Christopher Nolan Matthew McConaughey, Anne Hathaway, Jessica Ch... 2014 169 8.6 1047747 187.99 74.0
The Dark Knight 55 Action,Crime,Drama When the menace known as the Joker wreaks havo... Christopher Nolan Christian Bale, Heath Ledger, Aaron Eckhart,Mi... 2008 152 9.0 1791916 533.32 82.0
The Prestige 65 Drama,Mystery,Sci-Fi Two stage magicians engage in competitive one-... Christopher Nolan Christian Bale, Hugh Jackman, Scarlett Johanss... 2006 130 8.5 913152 53.08 66.0
Inception 81 Action,Adventure,Sci-Fi A thief, who steals corporate secrets through ... Christopher Nolan Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen... 2010 148 8.8 1583625 292.57 74.0

We need to make sure to group evaluations with parentheses so Python knows how to evaluate the conditional.

Using the isin() method we could make this more concise though:

In [47]:
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()
Out[47]:
rank genre description director actors year runtime rating votes revenue_millions metascore
Title
Prometheus 2 Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0
Interstellar 37 Adventure,Drama,Sci-Fi A team of explorers travel through a wormhole ... Christopher Nolan Matthew McConaughey, Anne Hathaway, Jessica Ch... 2014 169 8.6 1047747 187.99 74.0
The Dark Knight 55 Action,Crime,Drama When the menace known as the Joker wreaks havo... Christopher Nolan Christian Bale, Heath Ledger, Aaron Eckhart,Mi... 2008 152 9.0 1791916 533.32 82.0
The Prestige 65 Drama,Mystery,Sci-Fi Two stage magicians engage in competitive one-... Christopher Nolan Christian Bale, Hugh Jackman, Scarlett Johanss... 2006 130 8.5 913152 53.08 66.0
Inception 81 Action,Adventure,Sci-Fi A thief, who steals corporate secrets through ... Christopher Nolan Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen... 2010 148 8.8 1583625 292.57 74.0

Let's say we want all movies that were released between 2005 and 2010, have a rating above 8.0, but made below the 25th percentile in revenue.

Here's how we could do all of that:

In [48]:
movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]
Out[48]:
rank genre description director actors year runtime rating votes revenue_millions metascore
Title
3 Idiots 431 Comedy,Drama Two friends are searching for their long lost ... Rajkumar Hirani Aamir Khan, Madhavan, Mona Singh, Sharman Joshi 2009 170 8.4 238789 6.52 67.0
The Lives of Others 477 Drama,Thriller In 1984 East Berlin, an agent of the secret po... Florian Henckel von Donnersmarck Ulrich Mühe, Martina Gedeck,Sebastian Koch, Ul... 2006 137 8.5 278103 11.28 89.0
Incendies 714 Drama,Mystery,War Twins journey to the Middle East to discover t... Denis Villeneuve Lubna Azabal, Mélissa Désormeaux-Poulin, Maxim... 2010 131 8.2 92863 6.86 80.0
Taare Zameen Par 992 Drama,Family,Music An eight-year-old boy is thought to be a lazy ... Aamir Khan Darsheel Safary, Aamir Khan, Tanay Chheda, Sac... 2007 165 8.5 102697 1.20 42.0

If you recall up when we used .describe() the 25th percentile for revenue was about 17.4, and we can access this value directly by using the quantile() method with a float of 0.25.

Applying functions

It is possible to iterate over a DataFrame or Series as you would with a list, but doing so — especially on large datasets — is very slow.

An efficient alternative is to apply() a function to the dataset. For example, we could use a function to convert movies with an 8.0 or greater to a string value of "good" and the rest to "bad" and use this transformed values to create a new column.

First we would create a function that, when given a rating, determines if it's good or bad:

In [50]:
def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"

Now we want to send the entire rating column through this function, which is what apply() does:

In [52]:
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)

movies_df.head(3)
Out[52]:
rank genre description director actors year runtime rating votes revenue_millions metascore rating_category
Title
Guardians of the Galaxy 1 Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0 good
Prometheus 2 Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0 bad
Split 3 Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0 bad

The .apply() method passes every value in the rating column through the rating_function and then returns a new Series. This Series is then assigned to a new column called rating_category.

You can also use anonymous functions as well. This lambda function achieves the same result as rating_function:

In [54]:
movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if x >= 8.0 else 'bad')

movies_df.head(3)
Out[54]:
rank genre description director actors year runtime rating votes revenue_millions metascore rating_category
Title
Guardians of the Galaxy 1 Action,Adventure,Sci-Fi A group of intergalactic criminals are forced ... James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... 2014 121 8.1 757074 333.13 76.0 good
Prometheus 2 Adventure,Mystery,Sci-Fi Following clues to the origin of mankind, a te... Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael Fa... 2012 124 7.0 485820 126.46 65.0 bad
Split 3 Horror,Thriller Three girls are kidnapped by a man with a diag... M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar... 2016 117 7.3 157606 138.12 62.0 bad

Brief Plotting

Another great thing about pandas is that it integrates with Matplotlib, so you get the ability to plot directly off DataFrames and Series. To get started we need to import Matplotlib (pip install matplotlib):

In [57]:
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) # set font and plot size to be larger

Now we can begin. There won't be a lot of coverage on plotting, but it should be enough to explore you're data easily.

Side note: For categorical variables utilize Bar Charts* and Boxplots. For continuous variables utilize Histograms, Scatterplots, Line graphs, and Boxplots.

Let's plot the relationship between ratings and revenue. All we need to do is call .plot() on movies_df with some info about how to construct the plot:

In [58]:
movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');

What's with the semicolon? It's not a syntax error, just a way to hide the <matplotlib.axes._subplots.AxesSubplot at 0x26613b5cc18> output when plotting in Jupyter notebooks.

If we want to plot a simple Histogram based on a single column, we can call plot on a column:

In [59]:
movies_df['rating'].plot(kind='hist', title='Rating');

Do you remember the .describe() example at the beginning of this tutorial? Well, there's a graphical representation of the interquartile range, called the Boxplot. Let's recall what describe() gives us on the ratings column:

In [60]:
movies_df['rating'].describe()
Out[60]:
count    1000.000000
mean        6.723200
std         0.945429
min         1.900000
25%         6.200000
50%         6.800000
75%         7.400000
max         9.000000
Name: rating, dtype: float64

Using a Boxplot we can visualize this data:

In [61]:
movies_df['rating'].plot(kind="box");
Source: *Flowing Data*

By combining categorical and continuous data, we can create a Boxplot of revenue that is grouped by the Rating Category we created above:

In [62]:
movies_df.boxplot(column='revenue_millions', by='rating_category');

That's the general idea of plotting with pandas. There's too many plots to mention, so definitely take a look at the plot() docs here for more information on what it can do.