My SQL query returns around 6 lakh (600,000) records. From these I have to remove the rows that have a repeated value in one of the columns: AccountID. I am iterating through the DataSet in C# code using a for loop and deleting a record if the value of its AccountID column has already been seen. This process is taking a very long time. Please suggest alternate ways to remove the duplicate records based on the value of the AccountID column.

What I have tried:

// Requires: using System.Collections; using System.Data;
public DataTable RemoveDuplicateRows(DataTable dTable, string colName)
{
   Hashtable hTable = new Hashtable();
   ArrayList duplicateList = new ArrayList();

   // Record the first row seen for each value of colName in the hashtable;
   // every later row with the same value is collected as a duplicate.
   foreach (DataRow drow in dTable.Rows)
   {
      if (hTable.Contains(drow[colName]))
         duplicateList.Add(drow);
      else
         hTable.Add(drow[colName], string.Empty);
   }

   // Remove the collected duplicate rows from the table.
   foreach (DataRow dRow in duplicateList)
      dTable.Rows.Remove(dRow);

   // The table now contains only the first row for each value of colName.
   return dTable;
}
Posted
Updated 16-Dec-18 23:07pm
Comments
Maciej Los 17-Dec-18 4:33am    
6 lakh records?!? That is bound to take some time.
Why don't you do that on the SQL Server side?

1 solution

Quote:
My SQL query returns around 6 lakh (600,000) records.
You should have a really good reason to fetch that many records; are you building a reporting server?
Quote:
I have to remove the rows that have a repeated value in one of the columns: AccountID.
As Maciej already mentioned, do this operation in the database. The reason is quite obvious:
Quote:
I am iterating through the DataSet in C# code using a for loop and deleting a record if the value of its AccountID column has already been seen
See for yourself: you ask the server to give you 6 lakh records (each record takes time to fetch, and then more time on the network; you might be running this locally, but your users will be on a remote connection). Once that is done, you force your CPU to find the duplicates and remove them. That is the worst way to do it: write a broad SQL query, then find and remove duplicates on the client. A better way is to let the database do the work, either with SELECT DISTINCT or by grouping the data with GROUP BY.
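
For example, here is a minimal sketch of the DISTINCT route, assuming a hypothetical table dbo.Accounts with columns AccountID and AccountName (adjust the names and the connection string to your schema):

using System.Data;
using System.Data.SqlClient;

public static class AccountQueries
{
   public static DataTable GetDistinctAccounts(string connectionString)
   {
      // DISTINCT runs inside the database engine, so only the unique
      // rows ever cross the network.
      const string sql = @"SELECT DISTINCT AccountID, AccountName
                           FROM dbo.Accounts;";

      var table = new DataTable();
      using (var connection = new SqlConnection(connectionString))
      using (var adapter = new SqlDataAdapter(sql, connection))
      {
         adapter.Fill(table); // Fill opens and closes the connection itself.
      }
      return table;
   }
}
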
Quote:
This process is taking a very long time. Please suggest alternate ways to remove the duplicate records based on the value of the AccountID column.
Do you want to permanently remove the records? I doubt it; you want the data aggregated and de-duplicated. In that case, please use the approach I mentioned above.
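
One caveat, since it matters for your case: SELECT DISTINCT only removes a row when every selected column repeats. If you need one whole row per AccountID even when the other columns differ, a window function does that on the server. A sketch with the same hypothetical names as above:

using System.Data;
using System.Data.SqlClient;

public static class AccountQueries
{
   public static DataTable GetOneRowPerAccount(string connectionString)
   {
      // ROW_NUMBER() numbers the rows inside each AccountID group;
      // keeping only rn = 1 returns exactly one row per AccountID.
      const string sql = @"
         WITH Ranked AS
         (
            SELECT AccountID, AccountName,
                   ROW_NUMBER() OVER (PARTITION BY AccountID
                                      ORDER BY AccountID) AS rn
            FROM dbo.Accounts
         )
         SELECT AccountID, AccountName
         FROM Ranked
         WHERE rn = 1;";

      var table = new DataTable();
      using (var connection = new SqlConnection(connectionString))
      using (var adapter = new SqlDataAdapter(sql, connection))
      {
         adapter.Fill(table);
      }
      return table;
   }
}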

If you want to remove the duplicate data permanently, and do not want to allow duplicates in the future (which is unlikely, but still), then you need to make the column a primary key of the table.
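
A hedged sketch of that route, assuming the same hypothetical dbo.Accounts table and that AccountID is not nullable (a primary key column cannot be); take a backup before running anything like this:

using System.Data.SqlClient;

public static class AccountMaintenance
{
   public static void DeduplicateAndAddPrimaryKey(string connectionString)
   {
      // First delete every duplicate AccountID row except one, then add
      // a primary key so duplicates cannot come back.
      const string sql = @"
         WITH Duplicates AS
         (
            SELECT ROW_NUMBER() OVER (PARTITION BY AccountID
                                      ORDER BY AccountID) AS rn
            FROM dbo.Accounts
         )
         DELETE FROM Duplicates WHERE rn > 1;

         ALTER TABLE dbo.Accounts
            ADD CONSTRAINT PK_Accounts PRIMARY KEY (AccountID);";

      using (var connection = new SqlConnection(connectionString))
      using (var command = new SqlCommand(sql, connection))
      {
         connection.Open();
         command.ExecuteNonQuery();
      }
   }
}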

You see the pattern in my answer above? I am asking you to make the database engine do all this work for you. The benefit is that the database engine automatically skips the records it does not need to return, and your CPU only has to render the results, not filter them.

SELECT Examples (Transact-SQL) | Microsoft Docs[^] (See the DISTINCT examples)
GROUP BY (Transact-SQL) | Microsoft Docs[^]

A quick analysis of your algorithm, which you copied from here: c# - Best way to remove duplicate entries from a data table - Stack Overflow[^]. You clearly missed a better answer, posted on the same thread, than the one you picked: c# - Best way to remove duplicate entries from a data table - Stack Overflow[^]. Your copy is the same code already shown above, so there is no need to repeat it.
I would tell you that this is not a good approach: since you have access to the database, write a better query. This approach means your 6 lakh records get stored in memory all over again, and then iterated over to check whether each one is a duplicate or not.

This also means that even if your database table has no duplicates at all, this algorithm still runs over the whole result set, roughly half a million iterations, possibly storing every value in the hash table along the way, and worse.

I can go on, but you get the idea. :-)
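
For completeness, if you do end up de-duplicating on the client anyway, a commonly cited alternative (and most likely what that better Stack Overflow answer amounts to) is the DataView.ToTable overload with distinct set to true. A sketch; note that only the columns you list survive in the result:

using System.Data;

public static class DataTableHelpers
{
   // Returns a new table containing the distinct values of colName.
   // Caveat: only the listed columns are kept; if you list more columns,
   // uniqueness applies to the combination of their values.
   public static DataTable DistinctBy(DataTable table, string colName)
   {
      return table.DefaultView.ToTable(true, colName);
   }
}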
 
Comments
Maciej Los 17-Dec-18 6:21am    
5ed!
Afzaal Ahmad Zeeshan 17-Dec-18 7:26am    
Thank you, Maciej.
