When it comes to information management, duplicates present one of the most common challenges to data quality. In this article, I’ll explain how it is possible to find and distinguish duplicate names with the help of the SQL data programming language.
I really like my maiden name. The reason I like it so much is because it’s rare. My maiden name (first with last) provided a unique identifier on platforms such as LinkedIn, Facebook, Twitter and similar. But then I got married. In addition to all the happy life changes, my marriage brought me a new and much more common last name. When I typed my married name into the LinkedIn search engine, 138 profile accounts were returned! I began to ask myself: “How will recruiters relate my CV to my LinkedIn profile data when my name is not unique? Will they recognize and distinguish duplicate names in the data? And if yes, then how?” It happens that SQL is an excellent tool for detecting duplicates, so let’s take a look at the SQL syntax that can be used to solve just such a problem as this.
You might be wondering: ”Why SQL? Why not just search the web manually via Google search or a similar engine?” Primarily because SQL is much faster and more efficient. It’s a standard computer language designed specifically for storing, manipulating and retrieving data from databases. And you’ll find that SQL’s significant performance advantage is matched by the ease of learning the language, thanks to its simple & understandable syntax.
SQL for finding duplicate value names
To set up our examples, let’s assume that the LinkedIn table,
linkedin_profile_accounts, corresponding to candidate profile accounts from a recruiter’s database looks like this:
|Marija||Ilic||Attorney at Law at Kolsek, Markovic & Ilic Law Office||23||168||10.01.2017.||kolsek_law@KMI.be|
|Marija||Ilic||Senior Finance Manager||45||452||03.07.2017.||ilicsenior
|Tim||Leep||Digital Marketing Specialistemail@example.com|
|Tim||Leep||Head of Sales at Berg firstname.lastname@example.org|
We can see that two names in the table are duplicated: Marija Ilic (three occurrences) and Tim Leep (two occurrences).
The following SQL query will identify those values:
SELECT name,lastname,COUNT(*) AS duplicate_count FROM linkedin_profile_accounts GROUP BY name,lastname HAVING COUNT(*)>1;
Syntax is quite simple. Here is the explanation:
SELECT … FROMstatement – we are retrieving data from a specific database table. Table name is
lastnameare chosen variables that will be used and viewed on the output/result set.
COUNT(*)indicates that we will count occurrences. Here it is used in combination with a GROUP BY statement. We obtain the number of occurrences per specific group. In this case we count occurrences for each
ASdefines the column name for the output result set. Three columns will be generated, and the third (which represents count per group) will be labeled
GROUP BYstatement is used in combination with COUNT function. It groups the result-set by two columns –
lastname. This means we are counting the number of occurrences per each
lastnamecombination in the table
HAVING COUNT(*)>1– only duplicated records (names which are not unique) will appear in the output/result set.
The result set identifies duplicate names Marija Ilic and Tim Leep from the database table.
Be aware of the fact that
HAVING COUNT(*)>1 statement excludes unique names from a result data set. If you remove
HAVING COUNT(*)>1 statement then all the names will be included.
SELECT name,lastname,COUNT(*) as duplicate_count FROM linkedin_profile_accounts GROUP BY name,lastname;
Output will look like this:
All names are printed with their corresponding counts. Two names from the table are unique, and two have duplicate values. As we can see now, a recruiter will need additional information in order to uniquely identify a candidate and select the appropriate LinkedIn profile. Fortunately, the email address is also present in the profile account table, so perhaps this additional variable can be used to achieve a unique identifier for each name in the table.
Will the email address give us uniqueness? The following SQL will answer that question:
SELECT name,lastname,e_mail, COUNT(*) AS duplicate_count FROM linkedin_profile_accounts GROUP BY name,lastname,e_mail HAVING COUNT (*)>1;
Output now will look like this:
0 rows selected
No names are presented in the result set, because the
e_mail variables, together in combination, provide a unique identifier for every person in the
linkedin_profile_account table (all return count=1).
If we remove
HAVING COUNT(*)>1 all observations will be presented in the result set, because each row/person in the
linkedin_profile_accounts table has a different email account.
The code and output in that case will look like this:
SELECT name,lastname,e_mail, COUNT(*) AS duplicate_count FROM linkedin_profile_accounts GROUP BY NAME,lastname,e_mail;
It is clear, then, that a recruiter must use three attributes:
e_mail, to pull data from the LinkedIn database table in order to update his/her knowledge about potential candidates. With this combination he/she will get a better profile picture, so that smarter decisions can be made.
Complicated things can be solved by simple SQL queries
Despite the ease and syntactical simplicity of detecting duplicate data through the use of SQL, I have found duplicate data to be at the root of most of the data quality issues I have encountered. Simple SQL queries like the ones presented here often proved to be the answer. In most cases, the simplest code proves to be the most valuable. That is the beauty of SQL: it solves complicated problems with simple queries.
Many excellent interactive SQL courses are available on Vertabelo Academy for people with no IT background. If you would like to learn and benefit from the advantages of SQL, let me encourage you to go ahead and give it a try. Leave your fears behind, turn a new page in your life, and begin your SQL adventure. It’s an exciting and valuable journey that you just don’t want to miss!