My SQL query returns around 6 lakh (600,000) records. From these I have to remove the rows that have a repeated value in one of the columns: AccountID. I am iterating through the DataSet in C# code using a for loop and deleting a record if the value of its AccountID column has already been seen. This process is taking a very long time. Please suggest alternate ways to remove the duplicate records based on the value of the AccountID column.

What I have tried:

// Requires: using System.Collections; using System.Data;
public DataTable RemoveDuplicateRows(DataTable dTable, string colName)
{
   Hashtable hTable = new Hashtable();
   ArrayList duplicateList = new ArrayList();

   // Record the first row seen for each value of colName in the hashtable;
   // every later row with the same value is collected as a duplicate.
   foreach (DataRow drow in dTable.Rows)
   {
      if (hTable.Contains(drow[colName]))
         duplicateList.Add(drow);
      else
         hTable.Add(drow[colName], string.Empty);
   }

   // Remove the collected duplicate rows from the table.
   foreach (DataRow dRow in duplicateList)
      dTable.Rows.Remove(dRow);

   // The table now contains only the first row for each value of colName.
   return dTable;
}
Posted
Updated 16-Dec-18 23:07pm
Comments
Maciej Los 17-Dec-18 4:33am    
6 lakh records?!? That is bound to take some time.
Why don't you do that on the SQL Server side?

1 solution

Quote:
My SQL query returns around 6 lakh (600,000) records.
You should have a really good reason to fetch that many records; are you building a reporting server?
Quote:
I have to remove the rows that have a repeated value in one of the columns: AccountID.
As Maciej already mentioned, do this operation in the database. The reason is quite obvious:
Quote:
I am iterating through the DataSet in C# code using a for loop and deleting a record if the value of its AccountID column has already been seen
See for yourself: you ask the server to give you 6 lakh records (each record takes time to fetch, and then more time on the network; you might be running this locally, but your users will be on a remote connection). Once that is done, you force your CPU to find the duplicates and remove them. That is the worst way to do it: write a broad SQL query, then find and remove duplicates on the client. A better way is to let the database do the work, either with SELECT DISTINCT or by grouping the data with GROUP BY.
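
For example, here is a minimal sketch of the DISTINCT route, assuming a hypothetical table dbo.Accounts with columns AccountID and AccountName (adjust the names and the connection string to your schema):

using System.Data;
using System.Data.SqlClient;

public static class AccountQueries
{
   public static DataTable GetDistinctAccounts(string connectionString)
   {
      // DISTINCT runs inside the database engine, so only the unique
      // rows ever cross the network.
      const string sql = @"SELECT DISTINCT AccountID, AccountName
                           FROM dbo.Accounts;";

      var table = new DataTable();
      using (var connection = new SqlConnection(connectionString))
      using (var adapter = new SqlDataAdapter(sql, connection))
      {
         adapter.Fill(table); // Fill opens and closes the connection itself.
      }
      return table;
   }
}
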
Quote:
This process is taking a very long time. Please suggest alternate ways to remove the duplicate records based on the value of the AccountID column.
Do you want to permanently remove the records? I doubt it; you want the data aggregated and de-duplicated. In that case, please use the approach I mentioned above.
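
One caveat, since it matters for your case: SELECT DISTINCT only removes a row when every selected column repeats. If you need one whole row per AccountID even when the other columns differ, a window function does that on the server. A sketch with the same hypothetical names as above:

using System.Data;
using System.Data.SqlClient;

public static class AccountQueries
{
   public static DataTable GetOneRowPerAccount(string connectionString)
   {
      // ROW_NUMBER() numbers the rows inside each AccountID group;
      // keeping only rn = 1 returns exactly one row per AccountID.
      const string sql = @"
         WITH Ranked AS
         (
            SELECT AccountID, AccountName,
                   ROW_NUMBER() OVER (PARTITION BY AccountID
                                      ORDER BY AccountID) AS rn
            FROM dbo.Accounts
         )
         SELECT AccountID, AccountName
         FROM Ranked
         WHERE rn = 1;";

      var table = new DataTable();
      using (var connection = new SqlConnection(connectionString))
      using (var adapter = new SqlDataAdapter(sql, connection))
      {
         adapter.Fill(table);
      }
      return table;
   }
}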

If you want to remove the duplicate data permanently, and do not want to allow duplicates in the future (which is unlikely, but still), then you need to make the column a primary key of the table.
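
A hedged sketch of that route, assuming the same hypothetical dbo.Accounts table and that AccountID is not nullable (a primary key column cannot be); take a backup before running anything like this:

using System.Data.SqlClient;

public static class AccountMaintenance
{
   public static void DeduplicateAndAddPrimaryKey(string connectionString)
   {
      // First delete every duplicate AccountID row except one, then add
      // a primary key so duplicates cannot come back.
      const string sql = @"
         WITH Duplicates AS
         (
            SELECT ROW_NUMBER() OVER (PARTITION BY AccountID
                                      ORDER BY AccountID) AS rn
            FROM dbo.Accounts
         )
         DELETE FROM Duplicates WHERE rn > 1;

         ALTER TABLE dbo.Accounts
            ADD CONSTRAINT PK_Accounts PRIMARY KEY (AccountID);";

      using (var connection = new SqlConnection(connectionString))
      using (var command = new SqlCommand(sql, connection))
      {
         connection.Open();
         command.ExecuteNonQuery();
      }
   }
}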

You see the pattern in my answer above? I am asking you to make the database engine do all this work for you. The benefit is that the database engine automatically skips the records it does not need to return, and your CPU only has to render the results, not filter them.

SELECT Examples (Transact-SQL) | Microsoft Docs[^] (See the DISTINCT examples)
GROUP BY (Transact-SQL) | Microsoft Docs[^]

A quick analysis of your algorithm, which you copied from here: c# - Best way to remove duplicate entries from a data table - Stack Overflow[^]. You clearly missed a better answer, posted on the same thread, than the one you picked: c# - Best way to remove duplicate entries from a data table - Stack Overflow[^]. Your copy is the same code already shown above, so there is no need to repeat it.
I would tell you that this is not a good approach: since you have access to the database, write a better query. This approach means your 6 lakh records get stored in memory all over again, and then iterated over to check whether each one is a duplicate or not.

This also means that even if your database table has no duplicates at all, this algorithm still runs over the whole result set, roughly half a million iterations, possibly storing every value in the hash table along the way, and worse.

I can go on, but you get the idea. :-)
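
For completeness, if you do end up de-duplicating on the client anyway, a commonly cited alternative (and most likely what that better Stack Overflow answer amounts to) is the DataView.ToTable overload with distinct set to true. A sketch; note that only the columns you list survive in the result:

using System.Data;

public static class DataTableHelpers
{
   // Returns a new table containing the distinct values of colName.
   // Caveat: only the listed columns are kept; if you list more columns,
   // uniqueness applies to the combination of their values.
   public static DataTable DistinctBy(DataTable table, string colName)
   {
      return table.DefaultView.ToTable(true, colName);
   }
}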
 
Comments
Maciej Los 17-Dec-18 6:21am    
5ed!
Afzaal Ahmad Zeeshan 17-Dec-18 7:26am    
Thank you, Maciej.
