Joining (Merging) DataFrames

Using the MovieLens 100k data, let's create two DataFrames:

movies: shows information about movies, namely a unique movie_id and its title
ratings: shows the rating that a particular user_id gave to a particular movie_id at a particular timestamp

Movies

In [1]:

import pandas as pd
movie_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.item'
movie_cols = ['movie_id', 'title']
movies = pd.read_table(movie_url, sep='|', header=None, names=movie_cols, usecols=[0, 1])
movies.head()

Out[1]:

	movie_id	title
0	1	Toy Story (1995)
1	2	GoldenEye (1995)
2	3	Four Rooms (1995)
3	4	Get Shorty (1995)
4	5	Copycat (1995)

Ratings

In [2]:

rating_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.data'
rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(rating_url, sep='\t', header=None, names=rating_cols)
ratings.head()

Out[2]:

	user_id	movie_id	rating	timestamp
0	196	242	3	881250949
1	186	302	3	891717742
2	22	377	1	878887116
3	244	51	2	880606923
4	166	346	1	886397596

Suppose you want to examine the ratings DataFrame, but you want to see the title of each movie rather than its movie_id. The best way to accomplish this is by "joining" (or "merging") the two DataFrames using the Pandas merge function:

In [3]:

movie_ratings = pd.merge(movies, ratings)
movie_ratings.head()

Out[3]:

	movie_id	title	user_id	rating	timestamp
0	1	Toy Story (1995)	308	4	887736532
1	1	Toy Story (1995)	287	5	875334088
2	1	Toy Story (1995)	148	4	877019411
3	1	Toy Story (1995)	280	4	891700426
4	1	Toy Story (1995)	66	3	883601324

Here's what just happened:

Pandas noticed that movies and ratings had one column in common, namely movie_id. This is the "key" on which the DataFrames will be joined.
The first movie_id in movies is 1. Thus, Pandas looked through every row in the ratings DataFrame, searching for a movie_id of 1. Every time it found such a row, it recorded the user_id, rating, and timestamp listed in that row. In this case, it found 452 matching rows.
The second movie_id in movies is 2. Again, Pandas did a search of ratings and found 131 matching rows.
This process was repeated for all of the remaining rows in movies.

At the end of the process, the movie_ratings DataFrame contains the two columns from movies (movie_id and title) and the three other columns from ratings (user_id, rating, and timestamp).

movie_id 1 and its title are listed 452 times, next to the user_id, rating, and timestamp for each of the 452 matching ratings.
movie_id 2 and its title are listed 131 times, next to the user_id, rating, and timestamp for each of the 131 matching ratings.
And so on, for every movie in the dataset.
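
To watch this matching happen on a smaller scale, here is a minimal sketch using two made-up DataFrames. The names small_movies and small_ratings, and all of the values in them, are hypothetical stand-ins and not part of the MovieLens data:

```python
import pandas as pd

# Hypothetical miniature versions of movies and ratings (made-up values)
small_movies = pd.DataFrame({'movie_id': [1, 2],
                             'title': ['Movie A', 'Movie B']})
small_ratings = pd.DataFrame({'user_id': [10, 11, 12],
                              'movie_id': [1, 1, 2],
                              'rating': [4, 5, 3]})

# movie_id is the only column the two frames share, so it becomes the key
small_merged = pd.merge(small_movies, small_ratings)

# Count how many times each title appears in the merged result
title_counts = small_merged['title'].value_counts()

print(small_merged)
print(title_counts)
```

Movie A has two matching ratings, so its movie_id and title are repeated twice in small_merged, while Movie B appears once.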

In [4]:

print(movies.shape)
print(ratings.shape)
print(movie_ratings.shape)

(1682, 2)
(100000, 4)
(100000, 5)

Notice the shapes of the three DataFrames:

There are 1682 rows in the movies DataFrame.
There are 100000 rows in the ratings DataFrame.
The merge function resulted in a movie_ratings DataFrame with 100000 rows, because every row from ratings matched exactly one row from movies (each movie_id appears only once in movies).
The movie_ratings DataFrame has 5 columns, namely the 2 columns from movies, plus the 4 columns from ratings, minus the 1 column in common.

By default, the merge function joins the DataFrames using all column names that are in common (movie_id, in this case). You can override this behavior with the on, left_on, and right_on parameters, which are explained in the documentation.
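
As a minimal sketch of naming the key yourself rather than relying on the default, the example below uses two hypothetical frames, films and scores, with made-up values. The key column is called id in one frame and movie_id in the other, so there is no shared column name for merge to find:

```python
import pandas as pd

# Hypothetical frames (made-up values) whose key columns have different names
films = pd.DataFrame({'id': [1, 2], 'title': ['Movie A', 'Movie B']})
scores = pd.DataFrame({'movie_id': [1, 1, 2], 'rating': [4, 5, 3]})

# left_on and right_on name the key column in each frame separately;
# both key columns are kept in the result
by_diff_names = pd.merge(films, scores, left_on='id', right_on='movie_id')

# on names the key explicitly when both frames use the same column name
films_renamed = films.rename(columns={'id': 'movie_id'})
by_same_name = pd.merge(films_renamed, scores, on='movie_id')

print(by_diff_names)
print(by_same_name)
```

Passing on explicitly is also a reasonable habit when two frames happen to share more column names than you intend to join on.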

Four Types of Joins

There are actually four types of joins supported by the Pandas merge function. Here's how they are described by the documentation:

inner: use intersection of keys from both frames (SQL: inner join)
outer: use union of keys from both frames (SQL: full outer join)
left: use only keys from left frame (SQL: left outer join)
right: use only keys from right frame (SQL: right outer join)

The default is the "inner join", which was used when creating the movie_ratings DataFrame.

It's easiest to understand the different types by looking at some simple examples:

Example DataFrames A and B

In [5]:

A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num': [1, 2, 3]})
A

Out[5]:

	color	num
0	green	1
1	yellow	2
2	red	3

In [6]:

B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size': ['S', 'M', 'L']})
B

Out[6]:

	color	size
0	green	S
1	yellow	M
2	pink	L

Inner join

Only include observations found in both A and B:

In [7]:

pd.merge(A, B, how='inner')

Out[7]:

	color	num	size
0	green	1	S
1	yellow	2	M
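
Because inner is the default, leaving out the how argument should give the same result. Here is a minimal sketch that checks this, with A and B re-created so that it runs on its own:

```python
import pandas as pd

A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num': [1, 2, 3]})
B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size': ['S', 'M', 'L']})

# Merge once without how, and once with how='inner'
default_join = pd.merge(A, B)
inner_join = pd.merge(A, B, how='inner')

# equals compares the values, index, and dtypes of the two results
same_result = default_join.equals(inner_join)
print(same_result)
```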

Outer join

Include observations found in either A or B:

In [8]:

pd.merge(A, B, how='outer')

Out[8]:

	color	num	size
0	green	1	S
1	yellow	2	M
2	red	3	NaN
3	pink	NaN	L
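
The NaN values mark the cells for which the other DataFrame had no matching row. One side effect worth knowing: NaN is a floating-point value, so the integer num column is converted to a float column by the outer join. A minimal sketch that checks this, with A and B re-created so that it runs on its own:

```python
import pandas as pd

A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num': [1, 2, 3]})
B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size': ['S', 'M', 'L']})

outer = pd.merge(A, B, how='outer')

# num holds integers in A, but becomes float once a NaN is introduced
print(A['num'].dtype)
print(outer['num'].dtype)

# Count the missing values in each column of the result
missing_counts = outer.isna().sum()
print(missing_counts)
```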

Left join

Include all observations found in A:

In [9]:

pd.merge(A, B, how='left')

Out[9]:

	color	num	size
0	green	1	S
1	yellow	2	M
2	red	3	NaN

Right join

Include all observations found in B:

In [10]:

pd.merge(A, B, how='right')

Out[10]:

	color	num	size
0	green	1	S
1	yellow	2	M
2	pink	NaN	L
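
To tie the four examples back to the documentation's descriptions, here is a minimal sketch that compares the keys in each result against the intersection, the union, and the original key sets. The helper names keys_a, keys_b, expected_keys, and matches are made up for this illustration:

```python
import pandas as pd

A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num': [1, 2, 3]})
B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size': ['S', 'M', 'L']})

keys_a = set(A['color'])
keys_b = set(B['color'])

# The set of keys each join type should keep, per the descriptions above
expected_keys = {
    'inner': keys_a & keys_b,  # intersection of keys
    'outer': keys_a | keys_b,  # union of keys
    'left': keys_a,            # only keys from the left frame
    'right': keys_b,           # only keys from the right frame
}

# For each join type, check that the merged keys equal the expected set
matches = {}
for how, expected in expected_keys.items():
    result_keys = set(pd.merge(A, B, how=how)['color'])
    matches[how] = (result_keys == expected)

print(matches)
```

Comparing sets works here because every color appears only once in A and in B. With repeated keys, the sets would still match, but the row counts would differ, as in the movie_ratings example.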