Taken from Merging DataFrames with pandas (DataCamp course):
df1.append(df2)
: stacking verticallypd.concat([df1, df2])
:df1.join(df2)
: inner/outer/left/right joins on Indexespd.merge(df1, df2)
: many joins on multiple columnsUsing the MovieLens 100k data, let's create two DataFrames:
import pandas as pd
movie_cols = ['movie_id', 'title']
movies = pd.read_table('data/u.item', sep='|', header=None, names=movie_cols, usecols=[0, 1])
movies.head()
movie_id | title | |
---|---|---|
0 | 1 | Toy Story (1995) |
1 | 2 | GoldenEye (1995) |
2 | 3 | Four Rooms (1995) |
3 | 4 | Get Shorty (1995) |
4 | 5 | Copycat (1995) |
movies.shape
(1682, 2)
movies.movie_id.nunique()
1682
rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('data/u.data', sep='\t', header=None, names=rating_cols)
ratings.head()
user_id | movie_id | rating | timestamp | |
---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 |
1 | 186 | 302 | 3 | 891717742 |
2 | 22 | 377 | 1 | 878887116 |
3 | 244 | 51 | 2 | 880606923 |
4 | 166 | 346 | 1 | 886397596 |
ratings.shape
(100000, 4)
ratings.movie_id.nunique()
1682
ratings.loc[ratings.movie_id == 1, :].head()
user_id | movie_id | rating | timestamp | |
---|---|---|---|---|
24 | 308 | 1 | 4 | 887736532 |
454 | 287 | 1 | 5 | 875334088 |
957 | 148 | 1 | 4 | 877019411 |
971 | 280 | 1 | 4 | 891700426 |
1324 | 66 | 1 | 3 | 883601324 |
Let's pretend that you want to examine the ratings DataFrame, but you want to know the title of each movie rather than its movie_id. The best way to accomplish this objective is by "joining" (or "merging") the DataFrames using the Pandas merge
function:
movies.columns
Index(['movie_id', 'title'], dtype='object')
ratings.columns
Index(['user_id', 'movie_id', 'rating', 'timestamp'], dtype='object')
movie_ratings = pd.merge(movies, ratings)
movie_ratings.columns
Index(['movie_id', 'title', 'user_id', 'rating', 'timestamp'], dtype='object')
movie_ratings.head()
movie_id | title | user_id | rating | timestamp | |
---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | 308 | 4 | 887736532 |
1 | 1 | Toy Story (1995) | 287 | 5 | 875334088 |
2 | 1 | Toy Story (1995) | 148 | 4 | 877019411 |
3 | 1 | Toy Story (1995) | 280 | 4 | 891700426 |
4 | 1 | Toy Story (1995) | 66 | 3 | 883601324 |
movie_ratings.shape
(100000, 5)
Here's what just happened:
At the end of the process, the movie_ratings DataFrame is created, which contains the two columns from movies (movie_id and title) and the three other colums from ratings (user_id, rating, and timestamp).
print(movies.shape)
print(ratings.shape)
print(movie_ratings.shape)
(1682, 2) (100000, 4) (100000, 5)
Notice the shapes of the three DataFrames:
merge
function resulted in a movie_ratings DataFrame with 100000 rows, because every row from ratings matched a row from movies.By default, the merge
function joins the DataFrames using all column names that are in common (movie_id, in this case). The documentation explains how you can override this behavior.
movies.columns = ['m_id', 'title']
movies.columns
Index(['m_id', 'title'], dtype='object')
ratings.columns
Index(['user_id', 'movie_id', 'rating', 'timestamp'], dtype='object')
pd.merge(movies, ratings, left_on='m_id', right_on='movie_id').head()
m_id | title | user_id | movie_id | rating | timestamp | |
---|---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | 308 | 1 | 4 | 887736532 |
1 | 1 | Toy Story (1995) | 287 | 1 | 5 | 875334088 |
2 | 1 | Toy Story (1995) | 148 | 1 | 4 | 877019411 |
3 | 1 | Toy Story (1995) | 280 | 1 | 4 | 891700426 |
4 | 1 | Toy Story (1995) | 66 | 1 | 3 | 883601324 |
movies = movies.set_index('m_id')
movies.head()
title | |
---|---|
m_id | |
1 | Toy Story (1995) |
2 | GoldenEye (1995) |
3 | Four Rooms (1995) |
4 | Get Shorty (1995) |
5 | Copycat (1995) |
pd.merge(movies, ratings, left_index=True, right_on='movie_id').head()
title | user_id | movie_id | rating | timestamp | |
---|---|---|---|---|---|
24 | Toy Story (1995) | 308 | 1 | 4 | 887736532 |
454 | Toy Story (1995) | 287 | 1 | 5 | 875334088 |
957 | Toy Story (1995) | 148 | 1 | 4 | 877019411 |
971 | Toy Story (1995) | 280 | 1 | 4 | 891700426 |
1324 | Toy Story (1995) | 66 | 1 | 3 | 883601324 |
ratings = ratings.set_index('movie_id')
ratings.head()
user_id | rating | timestamp | |
---|---|---|---|
movie_id | |||
242 | 196 | 3 | 881250949 |
302 | 186 | 3 | 891717742 |
377 | 22 | 1 | 878887116 |
51 | 244 | 2 | 880606923 |
346 | 166 | 1 | 886397596 |
pd.merge(movies, ratings, left_index=True, right_index=True).head()
title | user_id | rating | timestamp | |
---|---|---|---|---|
1 | Toy Story (1995) | 308 | 4 | 887736532 |
1 | Toy Story (1995) | 287 | 5 | 875334088 |
1 | Toy Story (1995) | 148 | 4 | 877019411 |
1 | Toy Story (1995) | 280 | 4 | 891700426 |
1 | Toy Story (1995) | 66 | 3 | 883601324 |
There are actually four types of joins supported by the Pandas merge
function. Here's how they are described by the documentation:
The default is the "inner join", which was used when creating the movie_ratings DataFrame.
It's easiest to understand the different types by looking at some simple examples:
A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num':[1, 2, 3]})
A
color | num | |
---|---|---|
0 | green | 1 |
1 | yellow | 2 |
2 | red | 3 |
B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size':['S', 'M', 'L']})
B
color | size | |
---|---|---|
0 | green | S |
1 | yellow | M |
2 | pink | L |
Only include observations found in both A and B:
pd.merge(A, B, how='inner')
color | num | size | |
---|---|---|---|
0 | green | 1 | S |
1 | yellow | 2 | M |
Include observations found in either A or B:
pd.merge(A, B, how='outer')
color | num | size | |
---|---|---|---|
0 | green | 1.0 | S |
1 | yellow | 2.0 | M |
2 | red | 3.0 | NaN |
3 | pink | NaN | L |
Include all observations found in A:
pd.merge(A, B, how='left')
color | num | size | |
---|---|---|---|
0 | green | 1 | S |
1 | yellow | 2 | M |
2 | red | 3 | NaN |
Include all observations found in B:
pd.merge(A, B, how='right')
color | num | size | |
---|---|---|---|
0 | green | 1.0 | S |
1 | yellow | 2.0 | M |
2 | pink | NaN | L |