How do I merge DataFrames in pandas? (video)

Table of contents

  1. Selecting a Function
  2. Joining (Merging) DataFrames
  3. What if...?
  4. Four Types of Joins

Part 1: Selecting a Function

Taken from Merging DataFrames with pandas (DataCamp course):

  • df1.append(df2): stacking vertically
  • pd.concat([df1, df2]):
    • stacking many horizontally or vertically
    • simple inner/outer joins on Indexes
  • df1.join(df2): inner/outer/left/right joins on Indexes
  • pd.merge(df1, df2): many joins on multiple columns

Part 2: Joining (Merging) DataFrames

Using the MovieLens 100k data, let's create two DataFrames:

  • movies: shows information about movies, namely a unique movie_id and its title
  • ratings: shows the rating that a particular user_id gave to a particular movie_id at a particular timestamp
In [1]:
import pandas as pd

Movies

In [2]:
movie_cols = ['movie_id', 'title']
movies = pd.read_table('data/u.item', sep='|', header=None, names=movie_cols, usecols=[0, 1])
movies.head()
Out[2]:
movie_id title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)
In [3]:
movies.shape
Out[3]:
(1682, 2)
In [4]:
movies.movie_id.nunique()
Out[4]:
1682

Ratings

In [5]:
rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('data/u.data', sep='\t', header=None, names=rating_cols)
ratings.head()
Out[5]:
user_id movie_id rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
In [6]:
ratings.shape
Out[6]:
(100000, 4)
In [7]:
ratings.movie_id.nunique()
Out[7]:
1682
In [8]:
ratings.loc[ratings.movie_id == 1, :].head()
Out[8]:
user_id movie_id rating timestamp
24 308 1 4 887736532
454 287 1 5 875334088
957 148 1 4 877019411
971 280 1 4 891700426
1324 66 1 3 883601324

Merging Movies and Ratings

Let's pretend that you want to examine the ratings DataFrame, but you want to know the title of each movie rather than its movie_id. The best way to accomplish this objective is by "joining" (or "merging") the DataFrames using the Pandas merge function:

In [9]:
movies.columns
Out[9]:
Index(['movie_id', 'title'], dtype='object')
In [10]:
ratings.columns
Out[10]:
Index(['user_id', 'movie_id', 'rating', 'timestamp'], dtype='object')
In [11]:
movie_ratings = pd.merge(movies, ratings)
movie_ratings.columns
Out[11]:
Index(['movie_id', 'title', 'user_id', 'rating', 'timestamp'], dtype='object')
In [12]:
movie_ratings.head()
Out[12]:
movie_id title user_id rating timestamp
0 1 Toy Story (1995) 308 4 887736532
1 1 Toy Story (1995) 287 5 875334088
2 1 Toy Story (1995) 148 4 877019411
3 1 Toy Story (1995) 280 4 891700426
4 1 Toy Story (1995) 66 3 883601324
In [13]:
movie_ratings.shape
Out[13]:
(100000, 5)

Here's what just happened:

  • Pandas noticed that movies and ratings had one column in common, namely movie_id. This is the "key" on which the DataFrames will be joined.
  • The first movie_id in movies is 1. Thus, Pandas looked through every row in the ratings DataFrame, searching for a movie_id of 1. Every time it found such a row, it recorded the user_id, rating, and timestamp listed in that row. In this case, it found 452 matching rows.
  • The second movie_id in movies is 2. Again, Pandas did a search of ratings and found 131 matching rows.
  • This process was repeated for all of the remaining rows in movies.

At the end of the process, the movie_ratings DataFrame is created, which contains the two columns from movies (movie_id and title) and the three other colums from ratings (user_id, rating, and timestamp).

  • movie_id 1 and its title are listed 452 times, next to the user_id, rating, and timestamp for each of the 452 matching ratings.
  • movie_id 2 and its title are listed 131 times, next to the user_id, rating, and timestamp for each of the 131 matching ratings.
  • And so on, for every movie in the dataset.
In [14]:
print(movies.shape)
print(ratings.shape)
print(movie_ratings.shape)
(1682, 2)
(100000, 4)
(100000, 5)

Notice the shapes of the three DataFrames:

  • There are 1682 rows in the movies DataFrame.
  • There are 100000 rows in the ratings DataFrame.
  • The merge function resulted in a movie_ratings DataFrame with 100000 rows, because every row from ratings matched a row from movies.
  • The movie_ratings DataFrame has 5 columns, namely the 2 columns from movies, plus the 4 columns from ratings, minus the 1 column in common.

By default, the merge function joins the DataFrames using all column names that are in common (movie_id, in this case). The documentation explains how you can override this behavior.

Part 3: What if...?

What if the columns you want to join on don't have the same name?

In [15]:
movies.columns = ['m_id', 'title']
movies.columns
Out[15]:
Index(['m_id', 'title'], dtype='object')
In [16]:
ratings.columns
Out[16]:
Index(['user_id', 'movie_id', 'rating', 'timestamp'], dtype='object')
In [17]:
pd.merge(movies, ratings, left_on='m_id', right_on='movie_id').head()
Out[17]:
m_id title user_id movie_id rating timestamp
0 1 Toy Story (1995) 308 1 4 887736532
1 1 Toy Story (1995) 287 1 5 875334088
2 1 Toy Story (1995) 148 1 4 877019411
3 1 Toy Story (1995) 280 1 4 891700426
4 1 Toy Story (1995) 66 1 3 883601324

What if you want to join on one index?

In [18]:
movies = movies.set_index('m_id')
movies.head()
Out[18]:
title
m_id
1 Toy Story (1995)
2 GoldenEye (1995)
3 Four Rooms (1995)
4 Get Shorty (1995)
5 Copycat (1995)
In [19]:
pd.merge(movies, ratings, left_index=True, right_on='movie_id').head()
Out[19]:
title user_id movie_id rating timestamp
24 Toy Story (1995) 308 1 4 887736532
454 Toy Story (1995) 287 1 5 875334088
957 Toy Story (1995) 148 1 4 877019411
971 Toy Story (1995) 280 1 4 891700426
1324 Toy Story (1995) 66 1 3 883601324

What if you want to join on two indexes?

In [20]:
ratings = ratings.set_index('movie_id')
ratings.head()
Out[20]:
user_id rating timestamp
movie_id
242 196 3 881250949
302 186 3 891717742
377 22 1 878887116
51 244 2 880606923
346 166 1 886397596
In [21]:
pd.merge(movies, ratings, left_index=True, right_index=True).head()
Out[21]:
title user_id rating timestamp
1 Toy Story (1995) 308 4 887736532
1 Toy Story (1995) 287 5 875334088
1 Toy Story (1995) 148 4 877019411
1 Toy Story (1995) 280 4 891700426
1 Toy Story (1995) 66 3 883601324

Part 4: Four Types of Joins

There are actually four types of joins supported by the Pandas merge function. Here's how they are described by the documentation:

  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
  • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order
  • right: use only keys from right frame, similar to a SQL right outer join; preserve key order

The default is the "inner join", which was used when creating the movie_ratings DataFrame.

It's easiest to understand the different types by looking at some simple examples:

Example DataFrames A and B

In [22]:
A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num':[1, 2, 3]})
A
Out[22]:
color num
0 green 1
1 yellow 2
2 red 3
In [23]:
B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size':['S', 'M', 'L']})
B
Out[23]:
color size
0 green S
1 yellow M
2 pink L

Inner join

Only include observations found in both A and B:

In [24]:
pd.merge(A, B, how='inner')
Out[24]:
color num size
0 green 1 S
1 yellow 2 M

Outer join

Include observations found in either A or B:

In [25]:
pd.merge(A, B, how='outer')
Out[25]:
color num size
0 green 1.0 S
1 yellow 2.0 M
2 red 3.0 NaN
3 pink NaN L

Left join

Include all observations found in A:

In [26]:
pd.merge(A, B, how='left')
Out[26]:
color num size
0 green 1 S
1 yellow 2 M
2 red 3 NaN

Right join

Include all observations found in B:

In [27]:
pd.merge(A, B, how='right')
Out[27]:
color num size
0 green 1.0 S
1 yellow 2.0 M
2 pink NaN L