Pandas Duplicate Data¶

This notebook explains how to identify and handle duplicate rows with pandas.

Packages¶

This tutorial uses:

pandas

In [1]:

import pandas as pd

Creating the data¶

We will create a dataframe that contains multiple occurances of duplication for this example.

In [2]:

df = pd.DataFrame({'A': ['A']*2 + ['A', 'A', 'B', 'A', 'B']*3 + ['A', 'A', 'B'],
                   'B': ['A']*2 + ['A', 'a', 'B', 'A', 'b']*3 + ['A', 'a', 'B'],
                   'C': ['A']*2 + ['A', 'B', 'C']*5 + ['A', 'A', 'B'],
                   'D': ['A']*2 + ['A', 'a', 'B']*5 + ['A', 'A', 'B']
                  })
df

Out[2]:

	A	B	C	D
0	A	A	A	A
1	A	A	A	A
2	A	A	A	A
3	A	a	B	a
4	B	B	C	B
5	A	A	A	A
6	B	b	B	a
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
10	A	A	C	B
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
15	A	A	B	a
16	B	b	C	B
17	A	A	A	A
18	A	a	A	A
19	B	B	B	B

Identify duplicates¶

Duplicate in all columns¶

The function duplicated will return a Boolean series indicating if that row is a duplicate. The parameter keep can take on the values 'first' (default) to label the first duplicate False and the rest True, 'last' to mark the last duplicate False and the rest True, or False to mark all duplicates True.

In [3]:

dups = df.duplicated()
dups

Out[3]:

0     False
1      True
2      True
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10     True
11    False
12    False
13    False
14    False
15     True
16    False
17     True
18     True
19    False
dtype: bool

To see the duplicate rows, use the Boolean series dups to select rows from the original dataframe.

In [4]:

df[dups]

Out[4]:

	A	B	C	D
1	A	A	A	A
2	A	A	A	A
5	A	A	A	A
10	A	A	C	B
15	A	A	B	a
17	A	A	A	A
18	A	a	A	A

Duplicate in selected columns¶

The function duplicated will return a Boolean series indicating if that row is a duplicate based on just the specified columns when the parameter subset is passed a list of the columns to use (in this case, A and B).

In [5]:

dups = df.duplicated(subset=['A', 'B'])
dups

Out[5]:

0     False
1      True
2      True
3     False
4     False
5      True
6     False
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
19     True
dtype: bool

In [6]:

df[dups]

Out[6]:

	A	B	C	D
1	A	A	A	A
2	A	A	A	A
5	A	A	A	A
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
10	A	A	C	B
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
15	A	A	B	a
16	B	b	C	B
17	A	A	A	A
18	A	a	A	A
19	B	B	B	B

Delete duplicates¶

Delete only if all columns are duplicated¶

The function drop_duplicates will return a dataframe after dropping duplicates. The parameter keep can take on the values 'first' (default) to keep the first duplicate and drop the rest, 'last' to keep the last duplicate and drop the rest, or False to drop all duplicates.

In [7]:

dedup_df = df.drop_duplicates()
dedup_df

Out[7]:

	A	B	C	D
0	A	A	A	A
3	A	a	B	a
4	B	B	C	B
6	B	b	B	a
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
16	B	b	C	B
19	B	B	B	B

Delete only if specified columns are duplicated¶

The function drop_duplicates will return a dataframe after dropping all duplicates based on just the specified columns when the parameter subset is passed a list of the columns to use (in this case, A and B).

In [8]:

dedup_df = df.drop_duplicates(subset=['A', 'B'])
dedup_df

Out[8]:

	A	B	C	D
0	A	A	A	A
3	A	a	B	a
4	B	B	C	B
6	B	b	B	a

	A	B	C	D
0	A	A	A	A
1	A	A	A	A
2	A	A	A	A
3	A	a	B	a
4	B	B	C	B
5	A	A	A	A
6	B	b	B	a
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
10	A	A	C	B
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
15	A	A	B	a
16	B	b	C	B
17	A	A	A	A
18	A	a	A	A
19	B	B	B	B

	A	B	C	D
1	A	A	A	A
2	A	A	A	A
5	A	A	A	A
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
10	A	A	C	B
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
15	A	A	B	a
16	B	b	C	B
17	A	A	A	A
18	A	a	A	A
19	B	B	B	B

	A	B	C	D
0	A	A	A	A
3	A	a	B	a
4	B	B	C	B
6	B	b	B	a
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
16	B	b	C	B
19	B	B	B	B

	A	B	C	D
0	A	A	A	A
1	A	A	A	A
2	A	A	A	A
3	A	a	B	a
4	B	B	C	B
5	A	A	A	A
6	B	b	B	a
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
10	A	A	C	B
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
15	A	A	B	a
16	B	b	C	B
17	A	A	A	A
18	A	a	A	A
19	B	B	B	B

	A	B	C	D
1	A	A	A	A
2	A	A	A	A
5	A	A	A	A
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
10	A	A	C	B
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
15	A	A	B	a
16	B	b	C	B
17	A	A	A	A
18	A	a	A	A
19	B	B	B	B

	A	B	C	D
0	A	A	A	A
3	A	a	B	a
4	B	B	C	B
6	B	b	B	a
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
16	B	b	C	B
19	B	B	B	B

	A	B	C	D
0	A	A	A	A
1	A	A	A	A
2	A	A	A	A
3	A	a	B	a
4	B	B	C	B
5	A	A	A	A
6	B	b	B	a
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
10	A	A	C	B
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
15	A	A	B	a
16	B	b	C	B
17	A	A	A	A
18	A	a	A	A
19	B	B	B	B

	A	B	C	D
1	A	A	A	A
2	A	A	A	A
5	A	A	A	A
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
10	A	A	C	B
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
15	A	A	B	a
16	B	b	C	B
17	A	A	A	A
18	A	a	A	A
19	B	B	B	B

	A	B	C	D
0	A	A	A	A
3	A	a	B	a
4	B	B	C	B
6	B	b	B	a
7	A	A	C	B
8	A	a	A	A
9	B	B	B	a
11	B	b	A	A
12	A	A	B	a
13	A	a	C	B
14	B	B	A	A
16	B	b	C	B
19	B	B	B	B