How to Delete Duplicate Rows in SQL?

ayushi9821704 28 Aug, 2024

4 min read

Introduction

Managing databases often means dealing with duplicate records that can complicate data analysis and operations. Whether you’re cleaning up customer lists, transaction logs, or other datasets, removing duplicate rows is vital for maintaining data quality. This guide will explore practical techniques for deleting duplicate rows in SQL databases, including detailed syntax and real-world examples to help you efficiently address and eliminate these duplicates.

Overview

Identify the common causes of duplicate records in SQL databases.
Discover various methods to pinpoint and remove duplicate entries.
Understand SQL syntax and practical approaches for duplicate removal.
Learn best practices to ensure data integrity while cleaning up duplicates.

How to Delete Duplicate Rows in SQL?

Removing duplicate rows in SQL can be achieved through several methods. Each approach has its own advantages depending on the database system you’re using and the specific needs of your task. Below are some effective techniques for deleting duplicate records.

Common Causes of Duplicate Rows

Duplicate rows can appear in your database due to several reasons:

Data Entry Errors: Human mistakes during data input.
Merging Datasets: Combining data from multiple sources without proper de-duplication.
Improper Import Procedures: Incorrect data import processes can lead to duplication.

Identifying Duplicate Rows

Before deleting duplicates, you need to locate them. Duplicates often occur when multiple rows contain identical values in one or more columns. Here’s how to identify such duplicates:

Syntax:

SELECT column1, column2, COUNT(*)
FROM table_name
GROUP BY column1, column2
HAVING COUNT(*) > 1;

Example:

Suppose you have a table employees with the following data:

id	name	email
1	Alice	[email protected]
2	Bob	[email protected]
3	Carol	[email protected]
4	Alice	[email protected]
5	Dave	[email protected]

To find duplicate emails:

SELECT email, COUNT(*)
FROM employees
GROUP BY email
HAVING COUNT(*) > 1;

Output:

email	COUNT(*)
[email protected]	2

This query identifies emails that appear more than once in the table.

Deleting Duplicates Using `ROW_NUMBER()`

A powerful method for removing duplicates involves the ROW_NUMBER() window function, which assigns a unique sequential number to each row within a partition.

Syntax:

WITH CTE AS (
    SELECT column1, column2, 
           ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY (SELECT NULL)) AS rn
    FROM table_name
)
DELETE FROM CTE
WHERE rn > 1;

Example:

To eliminate duplicate rows from the employees table based on email:

sqlCopy codeWITH CTE AS (
    SELECT id, name, email, 
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM employees
)
DELETE FROM CTE
WHERE rn > 1;

Output:

After running the above query, the table will be cleaned up, resulting in:

id	name	email
1	Alice	[email protected]
2	Bob	[email protected]
3	Carol	[email protected]
5	Dave	[email protected]

The duplicate row with id = 4 has been removed.

Deleting Duplicates Using a Self Join

Another effective strategy involves using a self join to detect and delete duplicate rows.

Syntax:

DELETE t1
FROM table_name t1
JOIN table_name t2
ON t1.column1 = t2.column1
AND t1.column2 = t2.column2
AND t1.id < t2.id;

Example:

To remove duplicate entries from the employees table:

sqlCopy codeDELETE e1
FROM employees e1
JOIN employees e2
ON e1.email = e2.email
AND e1.id < e2.id;

Output:

After executing this query, the table will look like:

id	name	email
1	Alice	[email protected]
2	Bob	[email protected]
3	Carol	[email protected]
5	Dave	[email protected]

The row with id = 4 is deleted, leaving only unique entries.

Deleting Duplicates Using `DISTINCT` in a New Table

Sometimes, creating a new table with unique records and replacing the old table is the safest method.

Syntax:

CREATE TABLE new_table AS
SELECT DISTINCT *
FROM old_table;

DROP TABLE old_table;

ALTER TABLE new_table RENAME TO old_table;

Example:

To clean up duplicates in the employees table:

sqlCopy codeCREATE TABLE employees_unique AS
SELECT DISTINCT *
FROM employees;

DROP TABLE employees;

ALTER TABLE employees_unique RENAME TO employees;

Output:

The new table employees will now have:

id	name	email
1	Alice	[email protected]
2	Bob	[email protected]
3	Carol	[email protected]
5	Dave	[email protected]

The employees table is now free of duplicates.

Best Practices for Avoiding Duplicates

Implement Data Validation Rules: Ensure data is validated before insertion.
Use Unique Constraints: Apply unique constraints to columns to prevent duplicate entries.
Regular Data Audits: Periodically check for duplicates and clean data to maintain accuracy.

Conclusion

Effectively managing duplicate rows is a crucial aspect of database maintenance. By using methods like ROW_NUMBER(), self joins, or creating new tables, you can efficiently remove duplicates and maintain a clean dataset. Each method offers different advantages depending on your needs, so select the one that best suits your specific scenario. Always remember to back up your data before performing any deletion operations to safeguard against accidental loss.

Frequently Asked Questions

Q1. What are some common reasons for duplicate rows in SQL databases?

A. Duplicates can arise from data entry errors, issues during data import, or incorrect merging of datasets.

Q2. How can I avoid accidentally deleting important data when removing duplicates?

A. Make sure to back up your data before performing deletions and carefully review your queries to target only the intended records.

Q3. Is it possible to remove duplicates without affecting the original table?

A. Yes, you can create a new table with unique records and then replace the original table with this new one.

Q4. What distinguishes ROW_NUMBER() from DISTINCT for removing duplicates?

A. ROW_NUMBER() provides more control by allowing you to keep specific rows based on criteria, whereas DISTINCT simply eliminates duplicate rows in the new table.

ayushi9821704 28 Aug, 2024

My name is Ayushi Trivedi. I am a B. Tech graduate. I have 3 years of experience working as an educator and content editor. I have worked with various python libraries, like numpy, pandas, seaborn, matplotlib, scikit, imblearn, linear regression and many more. I am also an author. My first book named #turning25 has been published and is available on amazon and flipkart. Here, I am technical content editor at Analytics Vidhya. I feel proud and happy to be AVian. I have a great team to work with. I love building the bridge between the technology and the learner.

Beginner Data Analysis Database SQL