I have a database with a "full name" field that can have the following content:
SMITH, J
SMITH, JOHN
SMITH, JOHN M
SMITH, JOHN MATHEW
SMITH, JOHN MATHEW MAJ
SMITH, J M MAJ
SMITH,J
SMITH,JOHN
JOHN M SMITH
JOHN MATHEW SMITH
JOHN MATHEW SMITH MAJ
SMITH J
In another database, there is a full name field that will contain one of the two following formats (but only one of them)
SMITH,JOHN
SMITH,JOHN M
SMITH,JOHN MATHEW
I've managed to "normalize" the various formats in the first database so that at least every thing looks like this:
SMITH,J
SMITH,JOHN
SMITH,JOHN M
SMITH,J M
and everything in the 2nd database to this (again, it will be one or the other depending on if the person has a middle name specified in that table):
SMITH,JOHN M
SMITH,JOHN
When I join the tables on full name fields, I get a high degree of matches, but at the same time, there are a moderately significant number of non-matches (as you might guess).
My new problem is this - what is the best way to try and automagically determine how to massage all (or as many as possible) of the names in the first table to the version being used in the 2nd table.
The way I have it figured is that the only names that could possibly be massaged are the ones that are closest to the name in the 2nd table, but how would I do that in SQL?
This matching stuff would be a whole lot easier if they (the source of the imported data) included an available numeric personID field (that exists in the 2nd table) in the data, but they aren't, and we haven't yet found out if they can. If they can and do, all of this work will have been for naught.
It required a SIGNIFICANT amount of SQL to normalize the names as far as I've gotten to this point, which involved numerous uses of
REPLACE
,
TRIM
,
LEFT
,
RIGHT
, and
REVERSE
, with a moderate sprinkling of
CASE WHEN
for good measure.
What I have tried:
I haven't tried anything yet because I don't even know if it's possible to do what I need to do, because it sounds more like a fuzzy logic problem than anything else. I abhor the thought of guessing based on some arbitrary ruleset that could be rendered useless with every new data import.