Tuesday, July 2, 2024
How to Drop Duplicate Rows in a Pandas DataFrame

The easiest way to drop duplicate rows in a pandas DataFrame is by using the drop_duplicates() function, which uses the following syntax:

df.drop_duplicates(subset=None, keep='first', inplace=False)

where:

  • subset: Column label or list of labels to consider when identifying duplicates. Default is all columns.
  • keep: Indicates which duplicates (if any) to keep. Default is 'first'.
    • 'first': Drop all duplicate rows except the first occurrence.
    • 'last': Drop all duplicate rows except the last occurrence.
    • False: Drop every row that has a duplicate, keeping none of them.
  • inplace: If True, modify the DataFrame in place and return None. If False (the default), return a new DataFrame and leave the original unchanged.
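As a minimal sketch of how these arguments behave, the snippet below applies each keep option to a small frame. The frame named demo and its values are made up for illustration and are not the tutorial's DataFrame.

```python
import pandas as pd

# Hypothetical three-row frame: rows 0 and 1 are identical
demo = pd.DataFrame({'team': ['x', 'x', 'y'],
                     'points': [1, 1, 2]})

# keep='first' (the default): row 0 stays, row 1 is dropped
kept_first = demo.drop_duplicates(keep='first')

# keep='last': row 1 stays, row 0 is dropped
kept_last = demo.drop_duplicates(keep='last')

# keep=False: both copies of the duplicated row are dropped
kept_none = demo.drop_duplicates(keep=False)

# inplace=True changes demo itself and returns None
result = demo.drop_duplicates(inplace=True)

print(kept_first.index.tolist())  # [0, 2]
print(kept_last.index.tolist())   # [1, 2]
print(kept_none.index.tolist())   # [2]
print(result)                     # None
```

Because inplace=True returns None, assigning its result back to the DataFrame variable would overwrite the DataFrame with None.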

This tutorial provides several examples of how to use this function in practice on the following DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'team': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'points': [3, 7, 7, 8, 8, 9],
                   'assists': [8, 6, 7, 9, 9, 3]})

#display DataFrame
print(df)

  team  points  assists
0    a       3        8
1    b       7        6
2    b       7        7
3    c       8        9
4    c       8        9
5    d       9        3

Example 1: Remove Duplicates Across All Columns

The following code shows how to remove rows that have duplicate values across all columns:

df.drop_duplicates()

  team  points  assists
0    a       3        8
1    b       7        6
2    b       7        7
3    c       8        9
5    d       9        3

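Row 4 is removed because it is identical to row 3. By default, the drop_duplicates() function keeps the first occurrence of each duplicated row and deletes the rest. Rows 1 and 2 both remain because their assists values differ.

However, we could use the keep=False argument to delete every row that has a duplicate, including the first occurrence: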

df.drop_duplicates(keep=False)

  team  points  assists
0    a       3        8
1    b       7        6
2    b       7        7
5    d       9        3
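To check which rows pandas treats as duplicates before dropping them, one option is the related duplicated() method, which returns a boolean mask. The sketch below rebuilds the tutorial's DataFrame and confirms that filtering with the mask gives the same result as drop_duplicates(keep=False). The variable names mask and same are illustrative.

```python
import pandas as pd

# Same DataFrame as in the tutorial
df = pd.DataFrame({'team': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'points': [3, 7, 7, 8, 8, 9],
                   'assists': [8, 6, 7, 9, 9, 3]})

# keep=False marks every member of a duplicated group as True
mask = df.duplicated(keep=False)
print(mask.tolist())  # [False, False, False, True, True, False]

# Dropping the marked rows matches drop_duplicates(keep=False)
same = df[~mask].equals(df.drop_duplicates(keep=False))
print(same)  # True
```

Only rows 3 and 4 are marked, which is why both disappear from the output above.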

Example 2: Remove Duplicates Across Specific Columns

The following code shows how to remove rows that have duplicate values across just the columns titled team and points:

df.drop_duplicates(subset=['team', 'points'])

  team  points  assists
0    a       3        8
1    b       7        6
3    c       8        9
5    d       9        3
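Row 2 is dropped here even though its assists value differs from row 1, because only team and points are compared. The subset argument can also be combined with keep. The sketch below keeps the last row of each team and points pair instead of the first, and renumbers the index with ignore_index=True, an argument that is not part of the syntax shown earlier and assumes pandas 1.0 or later.

```python
import pandas as pd

# Same DataFrame as in the tutorial
df = pd.DataFrame({'team': ['a', 'b', 'b', 'c', 'c', 'd'],
                   'points': [3, 7, 7, 8, 8, 9],
                   'assists': [8, 6, 7, 9, 9, 3]})

# Compare only team and points, keep the last match in each group,
# and reset the index to 0..n-1 (ignore_index assumes pandas >= 1.0)
deduped = df.drop_duplicates(subset=['team', 'points'],
                             keep='last',
                             ignore_index=True)

print(deduped['assists'].tolist())  # [8, 7, 9, 3]
print(deduped.index.tolist())       # [0, 1, 2, 3]
```

With keep='last', team b now carries the assists value 7 from original row 2 rather than 6 from row 1.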

Additional Resources

How to Drop Duplicate Columns in Pandas
How to Sort Values in a Pandas DataFrame
How to Filter a Pandas DataFrame on Multiple Conditions
How to Insert a Column Into a Pandas DataFrame
