Introduction
In this tutorial I will show how to use Fuzzy Lookup for cleaning dirty data.
Background
Cleansing data before it is stored in a reporting database is necessary to provide value to consumers of business intelligence applications.
Fuzzy logic is an approach to computing based on "degrees of truth" rather than the usual "true or false" (1 or 0) Boolean logic on which the modern computer
is based. The idea of fuzzy logic was first advanced by Dr. Lotfi Zadeh of the University of California at Berkeley in the 1960s.
What you need
This script was tested in SQL Server 2008.
Create project
First create the database and table from this script.
create database Test
go
use Test
go
create table fuzzyLookupSource
(
firstName varchar(10),
LastName varchar(10),
BirthDate datetime
)
insert into fuzzyLookupSource
select 'Masud','Pervez','02/07/1980' union all
select 'Tamirul','Islam','03/31/1983' union all
select 'Animesh','Chandra','04/09/1980' union all
select 'Shahriar','Bin Elahi','05/05/1980' union all
select 'Masud','Rana','04/15/1980'
GO
create table fuzzyLookupReference
(
firstName varchar(10),
LastName varchar(10),
BirthDate datetime
)
insert into fuzzyLookupReference
select 'Masud','Parvez','02/07/1980' union all
select 'Tamirul','Islam','03/31/1983' union all
select 'Animesh','Chandra De','06/05/1980' union all
select 'Shahriar','Elahi','04/09/1980'
GO
Then open SQL Server Business Intelligence Development Studio.
![Image 1](/KB/database/528243/ssis_fuzzy_lookup1.png)
Then go to File->New->Project and select Integration Service Project.
![Image 2](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select "Data Flow Task" from "Control Menu" and Drag it on "Control Flow" tab. Then double click it.
![Image 3](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Click Connection for new connection or select from existing connection.
![Image 4](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Click New button to create new Data Connection or select from left tab.
![Image 5](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select "Server Name","Authentication" and "Database" which will be "Test" for this example. Click Test Connection for checking then click OK.
![Image 6](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select "Table", then click OK.
![Image 7](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select "Fuzzy Lookup" from "Data Flow Transformation" and Drag it on "Data Flow" tab. And connect extended green arrow from “OLE DB Source” to your fuzzy lookup.
Double click on “Fuzzy Lookup” task to configure it.
![Image 8](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select "OLE DB Connection" and "Reference Table name" in "Reference Table" tab.
![Image 9](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Map Lookup column and Output Column in "Columns tab. Add prefix "Ref_" in output column filed.
![Image 10](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Let all value as it is in "Advanced" tab.
![Image 11](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select "Conditional Split" from "Data Flow Transformation" and Drag it on "Data Flow" tab.
and connect extended green arrow from “Fuzzy Lookup” to your "Conditional Split". Double click on “Conditional Split” task to configure it.
![Image 12](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Create two output. One is "Solid Matched" which Condition is "_Similarity > 0.85 && _Confidence > 0.8"
and another is "Likely Matched" which condition is "_Similarity > .65 && _Confidence > 0.75". Click OK.
![Image 13](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select "Derived Column" from "Data Flow Transformation" and Drag it on "Data Flow" tab.
and connect extended green arrow from “Conditional Split” to your "Derived Column".
![Image 14](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select Output as "Solid Matched" and click OK.
![Image 15](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Double click on “Derived Column” task to configure it.
![Image 16](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select another "Derived Column" from "Data Flow Transformation" and Drag it on "Data Flow" tab.
and connect extended green arrow from “Conditional Split” to your "Derived Column 1".
![Image 17](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select Output as "Likely Matched" and click OK.
![Image 18](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Double click on “Derived Column 1” task to configure it.
![Image 19](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select another "Derived Column" from "Data Flow Transformation" and Drag it on "Data Flow" tab. And connect extended green arrow from “Conditional Split” to your "Derived Column 2".
![Image 20](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Double click on “Derived Column 2” task to configure it.
![Image 21](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select another "Union All" from "Data Flow Transformation" and Drag it on "Data Flow" tab.
and connect extended green arrow from “Derived Column” to your "Union All" and “Derived Column 1” to your "Union All" and “Derived Column 2” to your "Union All".
![Image 22](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Double click on “Union All” task to configure it.
![Image 23](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Select "SQL Server Destination" from "Data Flow Destination" and Drag it on "Data Flow" tab.
and connect extended green arrow from “Union All” to your "SQL Server Destination".
![Image 24](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Double click on “SQL Server Destination” task to configure it. Click New for create a New Table or Select from List.
![Image 25](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Click OK.
![Image 26](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
![Image 27](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
If you execute the package with debugging (press F5), the package should succeed and appear as shown here:
![Image 28](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
To check what is happining here:
SELECT [firstName]
,[LastName]
,[Ref_firstName]
,[Ref_LastName]
,[_Similarity]
,[_Confidence]
,[_Similarity_firstName]
,[_Similarity_LastName]
,[_Match]
FROM [Test].[dbo].[SQL Server Destination]
![Image 29](data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==)
Conclusion
Fuzzy logic seems closer to the way our brains work. We aggregate data and form a number of partial truths which we aggregate further into higher truths.
I hope this might be helpful to you!
References
History
None so far.
More than 8 years experience on Programming and Project implementation, I was primarily involved with projects for private organization,Govt.(Bangladesh Army,DG Health,RJSC), NGO (SEDF,WFP). Presently I am working at ICDDR,B and enhancing Hospital Management System developed by Microsoft Dynamic NAV and Windows Mobile Application 5.0
An active supporter of Open Source technology, my interested areas are ERP, IT Audit, Data warehouse, BI etc.
Playing Guitar for 15 years, my interested music style is Blues Rock,Neo Classical.
Certification
70-540:Microsoft® Windows Mobile® 5.0 - Application Development
MB7-514:Microsoft Dynamics™ NAV 5.0 C/SIDE Introduction
MB7-516:Microsoft Dynamics™ NAV 5.0 Solution Development
MB7-517:Microsoft Dynamics™ NAV 5.0 Installation and Configuration
MB7-515:Microsoft Dynamics™ NAV 5.0 Financials
70-432:Microsoft SQL Server 2008 - Implementation and Maintenance
70-450:PRO: Designing, Optimizing and Maintaining a Database Administrative Solution Using Microsoft SQL Server 2008
70-448:Microsoft SQL Server 2008, Business Intelligence Development and Maintenance
312-50:Certified Ethical Hacker
Web :http://masudparvezshabuz.appspot.com
Blog :http://masudparvezshabuz.wordpress.com
linkedin :http://www.linkedin.com/in/masudparvez