In this puzzle, we’re going to learn how to join dirty data using SQL Server. Many times, you’re forced to use someone else’s data. Since it is their system and data, not yours, you can’t always fix the inconsistently formatted data before it is in the database.
So how can you use it? Do this puzzle and learn how!
Solving puzzles is a great way to learn SQL. Nothing beats practicing what you’ve learned. Once you have figured out the puzzle, post your answer in the comments so we can all learn from one another.
How Do You JOIN Dirty Data in SQL Server?
In business, the data you get to work with isn’t always clean. For instance, I know of a major automotive manufacturer, whose part number prefix and suffix are separated with dashes in their engineering system, and spaces in their EDI system (e.g. ‘1234-4567
’ versus ‘1234 4567
’).
This makes it difficult to compare and join data from the two systems, but in today’s puzzle, we’re going to explore this issue and learn how to do the join
s.
For our puzzle, we’ll use these tables:
DECLARE @PartDescription TABLE
(
PartNumber Varchar(20),
Name Varchar(20)
)
DECLARE @PartSales TABLE
(
PartNumber Varchar(20),
Year int,
Sales float
)
Your goal is to output PartNumber
, Name
, Year
, and Sales
. Sort your results by Name
and Year
.
Also, the PartNumber
you display should have dashes separating the prefix (first four characters) from the suffix (last four characters).
Keep in mind the part number data is inconsistently formatted. You can find the sample data here.
JOIN Dirty Data – No Results!
Let’s see what happens when we try to join the data without first cleaning it. In a perfect world, we’d expect the correct results however since the part number is inconsistently formatted, the join
condition fails and the query returns no results.
SELECT P.PartNumber,
P.NAme,
S.Year,
S.Sales
FROM @PartDescription P
INNER JOIN @PartSales S
ON P.PartNumber = S.PartNumber
ORDER BY P.Name, S.Year
Obviously, our data needs a good cleaning.
Getting Results with Dirty Data using REPLACE
One way to solve the puzzle is to replace the space and asterisk with a dash. Here is a query to do that:
SELECT REPLACE(REPLACE(P.PartNumber,'*','-'), ' ','-') as PartNumber,
P.NAme,
S.Year,
S.Sales
FROM @PartDescription P
INNER JOIN @PartSales S
ON REPLACE(REPLACE(P.PartNumber,'*','-'), ' ','-') =
REPLACE(REPLACE(S.PartNumber,'*','-'), ' ','-')
ORDER BY P.Name, S.Year
In this example, we used the REPLACE
function to search for incorrect characters, such as the asterisk and space, and substituted them with a dash.
You’ll notice that the REPLACE
functions are nested. This allows us to get around a limitation with REPLACE
.
Here is the result of the join
once the data is cleansed.
A flaw of this method is that you do need to know which characters you wish to cleanse. In our case, we know that spaces and asterisks are unwanted, but what if we encounter an equals sign?
The way our solution is written, these would not be found and replaced with a dash. The join
condition would fail.
Also, as you add more characters to search, you need to nest more REPLACE
functions. After a while, this becomes cumbersome to read and error prone to write.
In a future version of SQL, we’ll be able to use TRANSLATE. Then, I’m hoping we can replace all special characters as so:
SELECT REPLACE(REPLACE(P.PartNumber,'*','-'), ' ','-') as PartNumber,
P.NAme,
S.Year,
S.Sales
FROM @PartDescription P
INNER JOIN @PartSales S
ON TRANSLATE(P.PartNumber,'!@#$%^&*()_+=','-------------') =
TRANSLATE(S.PartNumber, '!@#$%^&*()_+=','-------------')
ORDER BY P.Name, S.Year
JOIN Dirty Data by Parsing the Part Number
Another way to solve is this problem is by parsing. That is, we can separate the part number into the prefix and suffix. Once done, we join on these to match rows from the two tables.
Here is a diagram of the part number structure. In our example, the prefix is the first four characters, and the suffix the last four.
The prefix is calculated using the LEFT
function. LEFT(P.PartNumber,4)
returns the first four characters. Likewise, RIGHT(P.PartNumber,4)
returns the last four characters.
Now all we need to do is write the join
condition to compare the prefix and suffix from each table. Check out this query to see what I mean:
SELECT REPLACE(REPLACE(P.PartNumber,'*','-'), ' ','-') as PartNumber,
P.NAme,
S.Year,
S.Sales
FROM @PartDescription P
INNER JOIN @PartSales S
ON LEFT(P.PartNumber,4) = LEFT(S.PartNumber,4) AND
RIGHT(P.PartNumber,4) = RIGHT(S.PartNumber,4)
ORDER BY P.Name, S.Year
Conclusion
There are certainly other ways to solve this puzzle. I hope by seeing the two solutions here, you have seen there are two drastically different approaches to the answer.
If you have a good answer, let me know. I would love to see it!
Hello my name is Kris. I’m here because I am passionate about helping non-techie people to overcome their fear of learning SQL.
I know what it is like to not know where to start or whether the time spent learning is worth the effort. That is why I am here to help you to:
- Get started in an easy to follow step-by-step manner.
- Use your time wisely so you focus on what is important to learn to get the most value from your time.
- Answer your questions. Really! Just post a comment and I’ll respond. I’m here to help.
It wasn’t long ago that I was helping a colleague with some reporting. She didn’t know where to start and soon got overwhelmed and lost as she didn’t know SQL.
I felt really bad, as she was under pressure to get some summary information to her boss, the built-in reports were falling short, and to make them better would require her to know SQL. At that time that seemed impossible! It in dawned on me, it doesn’t have to be that way.
Then I discovered a way for anyone with the desire to easily learn SQL. I worked with my co-worker, started to teach her what I learned and soon she was able to write reports and answer her boss’ questions without getting stressed or ploughing hours into manipulating data in Excel.
It hasn’t always been easy. Sometimes the information seems abstract or too conceptual. In this case I’ve found out that a visual explanation is best. I really like to use diagrams or videos to explain hard-to-grasp ideas.
Having video, pictures, and text really help to reinforce the point and enable learning.
And now I want to help you get the same results.
The first step is simple, click here http://www.essentialsql.com/get-started-with-sql-server/