Using the MovieLens 100k data, let's create two DataFrames:
import pandas as pd
movie_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.item'
movie_cols = ['movie_id', 'title']
movies = pd.read_table(movie_url, sep='|', header=None, names=movie_cols, usecols=[0, 1])
movies.head()
movie_id | title | |
---|---|---|
0 | 1 | Toy Story (1995) |
1 | 2 | GoldenEye (1995) |
2 | 3 | Four Rooms (1995) |
3 | 4 | Get Shorty (1995) |
4 | 5 | Copycat (1995) |
rating_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.data'
rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(rating_url, sep='\t', header=None, names=rating_cols)
ratings.head()
user_id | movie_id | rating | timestamp | |
---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 |
1 | 186 | 302 | 3 | 891717742 |
2 | 22 | 377 | 1 | 878887116 |
3 | 244 | 51 | 2 | 880606923 |
4 | 166 | 346 | 1 | 886397596 |
Let's pretend that you want to examine the ratings DataFrame, but you want to know the title of each movie rather than its movie_id. The best way to accomplish this objective is by "joining" (or "merging") the DataFrames using the Pandas merge
function:
movie_ratings = pd.merge(movies, ratings)
movie_ratings.head()
movie_id | title | user_id | rating | timestamp | |
---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | 308 | 4 | 887736532 |
1 | 1 | Toy Story (1995) | 287 | 5 | 875334088 |
2 | 1 | Toy Story (1995) | 148 | 4 | 877019411 |
3 | 1 | Toy Story (1995) | 280 | 4 | 891700426 |
4 | 1 | Toy Story (1995) | 66 | 3 | 883601324 |
Here's what just happened:
At the end of the process, the movie_ratings DataFrame is created, which contains the two columns from movies (movie_id and title) and the three other colums from ratings (user_id, rating, and timestamp).
print movies.shape
print ratings.shape
print movie_ratings.shape
(1682, 2) (100000, 4) (100000, 5)
Notice the shapes of the three DataFrames:
merge
function resulted in a movie_ratings DataFrame with 100000 rows, because every row from ratings matched a row from movies.By default, the merge
function joins the DataFrames using all column names that are in common (movie_id, in this case). The documentation explains how you can override this behavior.
There are actually four types of joins supported by the Pandas merge
function. Here's how they are described by the documentation:
The default is the "inner join", which was used when creating the movie_ratings DataFrame.
It's easiest to understand the different types by looking at some simple examples:
A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num':[1, 2, 3]})
A
color | num | |
---|---|---|
0 | green | 1 |
1 | yellow | 2 |
2 | red | 3 |
B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size':['S', 'M', 'L']})
B
color | size | |
---|---|---|
0 | green | S |
1 | yellow | M |
2 | pink | L |
Only include observations found in both A and B:
pd.merge(A, B, how='inner')
color | num | size | |
---|---|---|---|
0 | green | 1 | S |
1 | yellow | 2 | M |
Include observations found in either A or B:
pd.merge(A, B, how='outer')
color | num | size | |
---|---|---|---|
0 | green | 1 | S |
1 | yellow | 2 | M |
2 | red | 3 | NaN |
3 | pink | NaN | L |
Include all observations found in A:
pd.merge(A, B, how='left')
color | num | size | |
---|---|---|---|
0 | green | 1 | S |
1 | yellow | 2 | M |
2 | red | 3 | NaN |
Include all observations found in B:
pd.merge(A, B, how='right')
color | num | size | |
---|---|---|---|
0 | green | 1 | S |
1 | yellow | 2 | M |
2 | pink | NaN | L |