Click here to Skip to main content
15,867,686 members
Please Sign up or sign in to vote.
1.00/5 (1 vote)
SQL
SELECT a.*
FROM MRSVoid.dbo.Customer_Dataset$ a
CROSS JOIN
(SELECT 
[Customer_LastName]
      ,[Customer_FirstName]
      ,[Customer_AddressLine1]

      ,[Customer_HomePhone]
      ,[Customer_InternetEmail]
FROM MRSVoid.dbo.Customer_Dataset$ 
GROUP BY [Customer_LastName],
         [Customer_FirstName],
		 [Customer_AddressLine1],
		 [Customer_InternetEmail],
		 [Customer_HomePhone]
HAVING count(*) > 1) b
where ((a.Customer_LastName = b.Customer_LastName) OR (a.Customer_LastName is NULL AND b.Customer_LastName is NULL))
AND ((a.Customer_FirstName = b.Customer_FirstName) OR (a.Customer_FirstName is NULL AND b.Customer_FirstName is NULL))
AND ((a.Customer_AddressLine1 = b.Customer_AddressLine1) OR (a.Customer_AddressLine1 is NULL AND b.Customer_AddressLine1 is NULL))
AND ((a.Customer_InternetEmail = b.Customer_InternetEmail) OR (a.Customer_InternetEmail is NULL AND b.Customer_InternetEmail is NULL))
AND ((a.Customer_HomePhone = b.Customer_HomePhone) OR (a.Customer_HomePhone is NULL AND b.Customer_HomePhone is NULL))
order by Customer_AddressLine1


This query gives me duplicate rows from a dataset, now I need to merge into a single record per group, and the data merged in such a way that we have the most complete set of attributes as possible. Example:
a. If two duplicate records share an email address, but only one has a full mailing address, the resultant merged record should have both the email and the mailing address.
b. If two duplicate records have different values for one of the following, the merged record should use the more recent attribute as identified by the ModifiedOn and/or CreatedOn timestamp values.

Sample data
ID	CreatedOn	ModifiedOn	Customer_LastName	Customer_FirstName	Customer_AddressLine1	Customer_City	Customer_State	Customer_Zip	Customer_HomePhone	Customer_InternetEmail
27196	2012-11-14 18:51:07.000	2012-11-17 15:28:45.000	NULL	David	98 Pelmor Dr	Marmora	OR	85044	NULL	NULL
14983	2012-11-18 14:02:44.000	2012-11-18 14:02:44.000	NULL	David	98 Pelmor Dr	Marmora	OR	85044	NULL	NULL


What I have tried:

I am not getting a approach where to start, so need help in finding the best approach.
Posted
Updated 19-Feb-19 8:54am
v3
Comments
Santosh kumar Pithani 19-Feb-19 1:33am    
Use row_number() rank function AS ROw_number()over(partition by [Customer_LastName],
[Customer_FirstName],
[Customer_AddressLine1],
[Customer_InternetEmail],
[Customer_HomePhone] ORDER BY ISNULL(CreatedOn,ModifiedOn)) AS rn " in your query and later filter with "where rn>=2"
Member 10183768 19-Feb-19 2:18am    
Sorry I am not aware of this function, can you help me correct this.

SELECT a.*, ROw_number()over(partition by a.[Customer_LastName],
a.[Customer_FirstName],
a.[Customer_AddressLine1],
a.[Customer_InternetEmail],
a.[Customer_HomePhone] ORDER BY ISNULL(CreatedOn,ModifiedOn)) AS rn
FROM MRSVoid.dbo.Customer_Dataset$ a
CROSS JOIN
(SELECT
[Customer_LastName]
,[Customer_FirstName]
,[Customer_AddressLine1]

,[Customer_HomePhone]
,[Customer_InternetEmail]
FROM MRSVoid.dbo.Customer_Dataset$
GROUP BY [Customer_LastName],
[Customer_FirstName],
[Customer_AddressLine1],
[Customer_InternetEmail],
[Customer_HomePhone]
HAVING count(*) > 1) b
where ((a.Customer_LastName = b.Customer_LastName) OR (a.Customer_LastName is NULL AND b.Customer_LastName is NULL))
AND ((a.Customer_FirstName = b.Customer_FirstName) OR (a.Customer_FirstName is NULL AND b.Customer_FirstName is NULL))
AND ((a.Customer_AddressLine1 = b.Customer_AddressLine1) OR (a.Customer_AddressLine1 is NULL AND b.Customer_AddressLine1 is NULL))
AND ((a.Customer_InternetEmail = b.Customer_InternetEmail) OR (a.Customer_InternetEmail is NULL AND b.Customer_InternetEmail is NULL))
AND ((a.Customer_HomePhone = b.Customer_HomePhone) OR (a.Customer_HomePhone is NULL AND b.Customer_HomePhone is NULL))
Santosh kumar Pithani 19-Feb-19 3:18am    
--Im not Adding filter condition just sample query no need to group by
;WITH CTE AS(
SELECT a.*, Count(1)over(partition by a.[Customer_LastName],
a.[Customer_FirstName],
a.[Customer_AddressLine1],
a.[Customer_InternetEmail],
a.[Customer_HomePhone] ORDER BY ISNULL(CreatedOn,ModifiedOn)) AS rn
FROM MRSVoid.dbo.Customer_Dataset$ a )
select Distinct * from CTE WHERE rn>=2
Richard Deeming 22-Feb-19 11:41am    
REPOST
You have already posted this:
https://www.codeproject.com/Questions/1277670/Help-me-de-duplicate-a-provided-dataset[^]

If you want to update your question to add more information, go to the original question and click the green "Improve question" link. DO NOT post your update as a new question.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900