
Duplicate Rows/XML-Nodes Identification and Clean-up from Database

27 Nov 2013 · CPOL · 8 min read
In this article I explain how we can identify duplicate rows in a database table and clean them up. I also cover duplicate XML nodes inside XML type fields, which present the same problem, and a solution.



Everyone knows duplicate rows create many problems:

  • Incorrect output
  • Unexplained exceptions
  • Failures in working, tested code
  • Increased database size
  • Application crashes
  • And many more

In this article I explain how we can identify duplicate rows in a database table and clean them up. I also cover duplicate XML nodes inside XML type fields, which have the same problem. In real life, various unique scenarios arise, and each is handled with a different technique. It is difficult to cover every scenario in a single article and provide solutions for all of them. My intention here is to explain the basic part of the duplicate row/XML-node issue and its solutions, so that we can handle the different scenarios appropriately.


Nobody creates duplicate rows in a database intentionally. Duplicate rows are created easily when tables have no primary/unique key, when there is no proper data validation, and when business logic changes but the correction of existing data is not properly completed. Beyond that, a few scenarios commonly cause duplicate rows:

  • Importing data from another database whose source tables contain duplicate rows.
  • Restructuring or reorganizing the database, during which duplicate rows may be created.
  • Existing application bugs.
  • Data-entry operators entering duplicate rows.

Identify Unique and Duplicate Rows

To identify unique and duplicate rows in a database table, I create a sample table, insert sample rows into it, and execute SQL queries against it. My table structure consists of three fields:

  1. Id
  2. Name
  3. DateOfBirth
create table MyTestTable(Id int, Name varchar(50), DateOfBirth datetime); 
insert into MyTestTable(Id, Name, DateOfBirth)
values(1, 'A', '01-Jan-2010'), (2, 'B', '01-Jan-2011'), (2, 'B', '01-Jan-2011'),
(3, 'C', '01-Jan-2012'),(4, 'D', '01-Jan-2013'),(4, 'D', '01-Jan-2013'); 

[Image: sample rows in MyTestTable]

After analyzing the data, it is clear that ids 1 and 3 are unique rows and the remaining rows are duplicates. To identify the unique rows:

select max(Id) as Id, Name, DateOfBirth FROM MyTestTable
group by Name, DateOfBirth
having count(*) = 1; 

[Image: unique rows returned by the query]

The query groups by the Name and DateOfBirth fields and applies the Max aggregate function to the Id field (if the table had more columns, we would need to group by all of them). If we used Min instead of Max, the result would be the same. The having count(*) = 1 clause filters the rows: effectively we ask the database for only those groups that contain a single record.

To get the duplicate rows from the table, we just need to change the filter clause to count(*) > 1:

select max(Id) as Id, Name, DateOfBirth FROM MyTestTable
group by Name, DateOfBirth
having count(*) > 1

To know the frequency of duplicate rows:

select id,name,dateofbirth, count(*) as Total
  from MyTestTable
  group by id,name,dateofbirth
  having count(*) > 1

[Image: duplicate rows with their frequency]

Now I will explain a real-life scenario. In the sample rows I did not insert a time in the DateOfBirth field, but a datetime field often contains a time along with the date. To understand this better, I clear the table and insert the sample rows again, this time with time values.

truncate table  MyTestTable;--delete all rows from the table   
insert into MyTestTable (Id, Name, DateOfBirth)--insert sample data again
values (1, 'A', '01-Jan-2010 12:00:01'), (2, 'B', '01-Jan-2011 12:00:05'), (2, 'B', '01-Jan-2011 12:00:06'),
(3, 'C', '01-Jan-2012 12:00:01'),(4, 'D', '01-Jan-2013 12:00:07'),(4, 'D', '01-Jan-2013 12:00:08');  

[Image: sample rows with time values]

After analyzing the new row-set, you will see there are no exact duplicate rows any more: adding a time to the date makes each row unique.
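A quick way to confirm this (my own check, not part of the original walkthrough) is to compare the total row count with the count of distinct rows; when the two match, every row is unique:

```sql
select (select count(*) from MyTestTable) as TotalRows,
       (select count(*) from
          (select distinct Id, Name, DateOfBirth from MyTestTable) d) as DistinctRows;
```

With the time values included, both counts come out equal.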

Run duplicate rows identification query again:

select max(Id) as Id, Name, DateOfBirth FROM MyTestTable
group by Name, DateOfBirth
having count(*) > 1 

It returns an empty result, which shows that no exactly duplicate rows exist in the table. But business policy often requires the time portion to be ignored when deciding whether rows are duplicates. So I rewrite the query with that in mind:

select max(Id) as Id, Name, max(DateOfBirth) as DateOfBirth FROM MyTestTable
group by  Name, convert(varchar(20),DateOfBirth, 104)
having count(*) > 1 

[Image: duplicate rows once the time portion is ignored]

To ignore the time portion of the datetime, I use the T-SQL CONVERT scalar function (style 104 formats the value as dd.mm.yyyy, discarding the time) together with the Max aggregate function.
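On SQL Server 2008 and later, an alternative to the string conversion (my suggestion, not from the original query) is to cast the datetime to the date type, which drops the time portion directly:

```sql
select max(Id) as Id, Name, max(DateOfBirth) as DateOfBirth
from MyTestTable
group by Name, cast(DateOfBirth as date)
having count(*) > 1;
```

Grouping by a typed date also avoids any dependence on a particular string style.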

Delete Duplicate Rows

If the database contains duplicate rows, we must clean them up. We have identified the duplicate rows; now we start the clean-up operation. We may think deleting all duplicate rows is easy and straightforward. See the SQL below:

delete from MyTestTable
where id in(select max(Id) FROM MyTestTable
group by  Name, convert(varchar(20),DateOfBirth, 104)
having count(*) > 1)

[Image: table after the naive delete]

Now there are no duplicate rows, but the query deleted real data too. How? If you analyze the data, you will find that ids 2 and 4 each had two rows; the first row of each pair was valid and only the second was actually a duplicate.

[Image: original rows, showing which ones were valid]

So we should keep the first row of each group and delete the rest. This makes the delete query trickier. We need help from a CTE (Common Table Expression) and the ROW_NUMBER function: ROW_NUMBER creates a unique row id for each row so that we can track every row individually, and the CTE lets us delete through that computed column. The delete query is:

with cte as
(
    select row_number() over(order by id asc) as rowid, Id, Name, DateOfBirth
    from MyTestTable
)
delete from cte where rowid in (select max(rowid) from cte
group by id, Name, DateOfBirth having count(*) > 1);

After execution of the above query, the output is:

[Image: table after the CTE delete]

A note about the delete query: the table has no unique field, such as a primary key, by which we can identify each row, so we create a dynamic unique field using the ROW_NUMBER function. The rowid field is virtual and has no physical existence, so we cannot run the delete statement directly against the table based on that field. Instead we use a CTE to execute the delete statement based on the virtual field and clean up the duplicate rows from the table.

Please wait, the story is not finished yet. In our sample, ids 2 and 4 each have two rows, so the delete query simply removes the second row of each pair. But what happens if a group contains more than two duplicate rows? The query deletes only the last duplicate row of each group, because of the Max aggregate:

select max(rowid) from cte   

The problem is not fully solved yet, but we are very close to a solution. We need a complete T-SQL query that deletes all duplicate rows except the first of each group. Solution-1:


while (1=1)--infinite loop: each pass deletes one duplicate per group
begin
    ;with cte as
    (
        select row_number() over(order by id asc) as rowid, Id, Name, DateOfBirth
        from MyTestTable
    )
    delete from cte where rowid in (select max(rowid) from cte group by id, Name, DateOfBirth having count(*) > 1);
    --the loop finishes when no duplicate rows are found
    if @@ROWCOUNT = 0 break;
end

We start an infinite loop that deletes duplicate rows until none remain. The loop finishes when the SQL Server global variable @@ROWCOUNT returns 0, which happens when the delete statement finds no more duplicate rows in the source table.

Solution-1 works fine, but the code is not very elegant and is a little complex because it uses an infinite loop; many people dislike it for its lack of simplicity. Solution-2:


with cte as
(
    select row_number() over (partition by id, Name, DateOfBirth
                              order by id asc) as rowid, Id, Name, DateOfBirth
    from MyTestTable
)
delete from cte where rowid > 1;

Solution-2 is simpler than Solution-1. The virtual field rowid is generated group-wise by the ROW_NUMBER function, with the Id, Name, and DateOfBirth fields treated as the group; the generated rowid restarts from 1 for each group. The filter condition is now rowid > 1, which means each group keeps only the row whose rowid is 1 and the remaining rows are deleted.

[Image: table after Solution-2]

Duplicate XML node Identification and Delete

The XML data type arrived with SQL Server 2005. There are many reasons to use an XML type field:

  • Store multiple field values in a single field.
  • Store schema and data together.
  • Store hierarchical data.
  • Store dynamic data whose schema is defined at runtime.
  • Store a schema that can be configured at runtime.
  • Store various data templates.
  • Validate data against a fixed schema.
  • And many more

Duplicate nodes can be created inside an XML field for various reasons. Just as it is important to identify and clean up duplicate rows in a table, duplicate nodes in an XML field should be identified and cleaned up for data accuracy and to avoid various exceptions.

Identify Duplicate XML Nodes

To visualize the problem and solution accurately, we create a sample table named MyTestTable2 with two columns:

  1. Id (datatype-int)
  2. Data (datatype-xml)
create table MyTestTable2(Id int, Data xml);
--The XML literals in the original were truncated; these sample values are
--reconstructed to match the description below (only @data1 contains a
--duplicate <code> node, repeated after the <name> node).
declare @data1 xml = '<employee><name>A</name><code>101</code><code>101</code></employee>';
declare @data2 xml = '<employee><name>B</name><code>102</code></employee>';
declare @data3 xml = '<employee><name>C</name><code>103</code></employee>';

insert into MyTestTable2(Id, data) values(1, @data1), (2, @data2), (3, @data3);

Now the table with data looks like:

select id, data from MyTestTable2;

[Image: rows of MyTestTable2 with XML data]

After analyzing the data we see that only the row with id 1 contains a duplicate XML node: the code node appears twice inside the employee element (repeated after the name node). The following SQL identifies the rows in which duplicate XML nodes exist:

select Id, Data from MyTestTable2 where data.value('count(/employee/code)', 'int') > 1; 

It returns a single row, the one whose id is 1, so the query correctly identifies rows containing duplicate XML nodes.

[Image: row with id 1 returned by the query]

Clean-up Duplicate XML Node

The following query will delete duplicate nodes from the source table.

update MyTestTable2 set 
data.modify('delete /employee/code[position()>1]')  
where data.value('count(/employee/code)', 'int') > 1;

After executing the above query, the output will be as follows:

select id, data from MyTestTable2; 

[Image: XML data after duplicate node removal]

Now we see there is no duplicate XML node in the data field (XML type) of the table.

When to Delete?

Try to identify and delete duplicate rows as early as possible. If you wait too long, a duplicate row might already be referenced by another row (as a foreign key). If it is used as a reference key, a referential integrity error will be raised and the delete operation becomes more complex: in that situation you first need to update the referencing table/field with the valid key, and only then delete the duplicate.
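As a sketch of that fix-up (the referencing table MyOrders, its EmployeeId column, and the ids used here are hypothetical, invented for illustration):

```sql
--re-point rows that reference the duplicate key at the valid key,
--then delete the duplicate without raising a referential integrity error
update MyOrders set EmployeeId = 4 where EmployeeId = 5;--4 = valid, 5 = duplicate
delete from MyTestTable where Id = 5;
```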

Any way for continuous search and delete of duplicate rows?

We can write a stored procedure that identifies duplicate rows/XML nodes and deletes them, and execute that stored procedure at a regular interval. We can use the following tools to execute a stored procedure periodically:

  • SQL Server Agent
  • Windows Service
  • Windows Task Scheduler

If you choose a Windows service or Windows Task Scheduler, you need to create a console or Windows-service application and, inside it, use ADO.NET or the Enterprise Library to access the database and execute the stored procedure. In a Windows service you can also use a Timer object for periodic execution. My personal choice is Windows Task Scheduler: it is simpler and easily configurable.
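A minimal sketch of such a stored procedure (the name usp_CleanupDuplicates is my own choice, not from the article), combining the Solution-2 row delete with the XML node clean-up shown earlier:

```sql
create procedure dbo.usp_CleanupDuplicates
as
begin
    set nocount on;

    --delete duplicate rows, keeping the first of each group (Solution-2)
    ;with cte as
    (
        select row_number() over (partition by Id, Name, DateOfBirth
                                  order by Id asc) as rowid
        from MyTestTable
    )
    delete from cte where rowid > 1;

    --delete duplicate <code> nodes inside the XML field
    update MyTestTable2
    set data.modify('delete /employee/code[position()>1]')
    where data.value('count(/employee/code)', 'int') > 1;
end
```

SQL Server Agent can then run `exec dbo.usp_CleanupDuplicates;` on a schedule.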

Any Precautions?

The duplicate-delete stored procedure (SP) must be well tested. Take enough time to test it properly; it is better to get help from the testing team or a third person if you have the option.

A delete operation is very risky, especially in a production database. The client never accepts data loss, and nobody will accept that, while deleting duplicate rows, you deleted real data too. So before the delete operation you must make sure the source data is properly backed up, so that you can recover it if any mistake happens.
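One simple precaution, as a sketch (the backup table name is my own): copy the table before running any destructive statement.

```sql
--keep a copy of the table before the delete
select * into MyTestTable_Backup from MyTestTable;

--if a mistake happens, the original rows can be restored:
--truncate table MyTestTable;
--insert into MyTestTable select * from MyTestTable_Backup;
```

A full database backup before the operation is, of course, the safer option.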

Points of Interest

I have tried to explain and show the various ways duplicate records are created in a database, as well as duplicate nodes in an XML type field. Duplicate data creates maintenance overhead and produces various kinds of bugs, so we need to clean up the production database on a regular basis, especially in projects/products with a long maintenance life.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
