Click here to Skip to main content
15,891,828 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
I have a database with a "full name" field that can have the following content:
SMITH, J
SMITH, JOHN
SMITH, JOHN M
SMITH, JOHN MATHEW
SMITH, JOHN MATHEW MAJ
SMITH, J M MAJ
SMITH,J
SMITH,JOHN
JOHN M SMITH
JOHN MATHEW SMITH
JOHN MATHEW SMITH MAJ
SMITH J

In another database, there is a full name field that will contain one of the two following formats (but only one of them)
SQL
SMITH,JOHN
SMITH,JOHN M
SMITH,JOHN MATHEW

I've managed to "normalize" the various formats in the first database so that at least every thing looks like this:
SQL
SMITH,J
SMITH,JOHN
SMITH,JOHN M
SMITH,J M

and everything in the 2nd database to this (again, it will be one or the other depending on if the person has a middle name specified in that table):
SQL
SMITH,JOHN M
SMITH,JOHN

When I join the tables on full name fields, I get a high degree of matches, but at the same time, there are a moderately significant number of non-matches (as you might guess).

My new problem is this - what is the best way to try and automagically determine how to massage all (or as many as possible) of the names in the first table to the version being used in the 2nd table.

The way I have it figured is that the only names that could possibly be massaged are the ones that are closest to the name in the 2nd table, but how would I do that in SQL?

This matching stuff would be a whole lot easier if they (the source of the imported data) included an available numeric personID field (that exists in the 2nd table) in the data, but they aren't, and we haven't yet found out if they can. If they can and do, all of this work will have been for naught.

It required a SIGNIFICANT amount of SQL to normalize the names as far as I've gotten to this point, which involved numerous uses of REPLACE, TRIM, LEFT, RIGHT, and REVERSE, with a moderate sprinkling of CASE WHEN for good measure.

What I have tried:

I haven't tried anything yet because I don't even know if it's possible to do what I need to do, because it sounds more like a fuzzy logic problem than anything else. I abhor the thought of guessing based on some arbitrary ruleset that could be rendered useless with every new data import.
Posted
Updated 19-Apr-18 6:08am
v3
Comments
RedDk 19-Apr-18 14:20pm    
Thinking outside of the box, I'd assume there's some bigger reason to fashion an input method which requires a validation stage such as the use of a form whereby this "user" would be coralled into making better assertions as to what his identity might be.

But since the "box" is the thing use the most logical method and forget the fuzziness you mention.
#realJSOP 19-Apr-18 14:35pm    
The data entry side actually has a drop-down from which they can pick the name, but they also have the option of fat-fingering it, which is the option most chosen, apparently. This aspect of the data entry side "will not change" according to the people that developed the app.

I'm in hell...

1 solution

Hi John,
Quote:
because it sounds more like a fuzzy logic problem than anything else.

I agree with you, there is a degree of guessing, 100% match is impossible if not on a short list where you know exact real names without duplicates or almost duplicates.
Problems:
- 2 individuals can have the exact same name.
- 2 individuals with different names can be written the same. SMITH, JOHN MATHEW and SMITH, JOHN MAX can lead to SMITH, JOHN M and be undistinguishable.

I think there is no 100% accurate solution for your problem by using the name only.
 
Share this answer
 
Comments
#realJSOP 19-Apr-18 14:32pm    
Yeah, I know. From the boss - "I know the data is crap, so we match what we can and present the info to the idiots that entered it."

:)

I don't know who 1-voted you, but I 5'd your answer to offset it.
Patrice T 19-Apr-18 14:41pm    
The quote is realistic. :)
Thank you

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900