This notebook explains how to identify and handle duplicate rows with pandas
.
This tutorial uses:
import pandas as pd
We will create a dataframe that contains multiple occurances of duplication for this example.
df = pd.DataFrame({'A': ['A']*2 + ['A', 'A', 'B', 'A', 'B']*3 + ['A', 'A', 'B'],
'B': ['A']*2 + ['A', 'a', 'B', 'A', 'b']*3 + ['A', 'a', 'B'],
'C': ['A']*2 + ['A', 'B', 'C']*5 + ['A', 'A', 'B'],
'D': ['A']*2 + ['A', 'a', 'B']*5 + ['A', 'A', 'B']
})
df
A | B | C | D | |
---|---|---|---|---|
0 | A | A | A | A |
1 | A | A | A | A |
2 | A | A | A | A |
3 | A | a | B | a |
4 | B | B | C | B |
5 | A | A | A | A |
6 | B | b | B | a |
7 | A | A | C | B |
8 | A | a | A | A |
9 | B | B | B | a |
10 | A | A | C | B |
11 | B | b | A | A |
12 | A | A | B | a |
13 | A | a | C | B |
14 | B | B | A | A |
15 | A | A | B | a |
16 | B | b | C | B |
17 | A | A | A | A |
18 | A | a | A | A |
19 | B | B | B | B |
The function duplicated
will return a Boolean series indicating if that row is a duplicate. The parameter keep
can take on the values 'first' (default) to label the first duplicate False and the rest True, 'last' to mark the last duplicate False and the rest True, or False to mark all duplicates True.
dups = df.duplicated()
dups
0 False 1 True 2 True 3 False 4 False 5 True 6 False 7 False 8 False 9 False 10 True 11 False 12 False 13 False 14 False 15 True 16 False 17 True 18 True 19 False dtype: bool
To see the duplicate rows, use the Boolean series dups to select rows from the original dataframe.
df[dups]
A | B | C | D | |
---|---|---|---|---|
1 | A | A | A | A |
2 | A | A | A | A |
5 | A | A | A | A |
10 | A | A | C | B |
15 | A | A | B | a |
17 | A | A | A | A |
18 | A | a | A | A |
The function duplicated
will return a Boolean series indicating if that row is a duplicate based on just the specified columns when the parameter subset
is passed a list of the columns to use (in this case, A and B).
dups = df.duplicated(subset=['A', 'B'])
dups
0 False 1 True 2 True 3 False 4 False 5 True 6 False 7 True 8 True 9 True 10 True 11 True 12 True 13 True 14 True 15 True 16 True 17 True 18 True 19 True dtype: bool
df[dups]
A | B | C | D | |
---|---|---|---|---|
1 | A | A | A | A |
2 | A | A | A | A |
5 | A | A | A | A |
7 | A | A | C | B |
8 | A | a | A | A |
9 | B | B | B | a |
10 | A | A | C | B |
11 | B | b | A | A |
12 | A | A | B | a |
13 | A | a | C | B |
14 | B | B | A | A |
15 | A | A | B | a |
16 | B | b | C | B |
17 | A | A | A | A |
18 | A | a | A | A |
19 | B | B | B | B |
The function drop_duplicates
will return a dataframe after dropping duplicates. The parameter keep
can take on the values 'first' (default) to keep the first duplicate and drop the rest, 'last' to keep the last duplicate and drop the rest, or False to drop all duplicates.
dedup_df = df.drop_duplicates()
dedup_df
A | B | C | D | |
---|---|---|---|---|
0 | A | A | A | A |
3 | A | a | B | a |
4 | B | B | C | B |
6 | B | b | B | a |
7 | A | A | C | B |
8 | A | a | A | A |
9 | B | B | B | a |
11 | B | b | A | A |
12 | A | A | B | a |
13 | A | a | C | B |
14 | B | B | A | A |
16 | B | b | C | B |
19 | B | B | B | B |
The function drop_duplicates
will return a dataframe after dropping all duplicates based on just the specified columns when the parameter subset
is passed a list of the columns to use (in this case, A and B).
dedup_df = df.drop_duplicates(subset=['A', 'B'])
dedup_df
A | B | C | D | |
---|---|---|---|---|
0 | A | A | A | A |
3 | A | a | B | a |
4 | B | B | C | B |
6 | B | b | B | a |