Unwanted characters in text data can be a bit of a pain, but there’s an easy way to fix them!
Sometimes, we’ll find extra characters inside our string data because our SQL queries didn’t work as expected. Moreover, these extra characters may sometimes be invisible, which really complicates things. In this article, we’ll examine some string-related SQL functions that can handle unwanted characters—visible or not!
The Importance of Standardizing Strings in Your Database
When we work with data populated or loaded as free text, the data quality is often quite poor. Unwanted or extra characters can seriously hurt the quality of your data and make it more difficult to analyze the information you’ve collected. Thus, it’s important to understand how you can use SQL string functions to fix these common problems so you can clean up your database
Let’s start by exploring the SQL trim and length functions.
Problem-Solving with SQL String Functions
To explain how to solve problems with unwanted characters, we’ll work with a simple health care database. Imagine we have a doctor who stores all his patient data in a single table. The table contains the patient’s full name, the date of the visit, the doctor’s diagnosis, the suggested treatment, and any drugs that were prescribed.
Every time a patient visits his office, the doctor creates a new record. He manually types his notes into the database, so the data quality is occasionally poor.
|John Smith||’03/04/2016′||Flu||rest||“aspirin, paracetamol”|
|Mary Deep||’10/11/2016′||Food Poisoning||Rest & diet||“penicillin,paracetamol”|
|Agnes Jason||’03/12/2016′||flu||rest||“aspirin, paracetamol”|
|Johnny SMITH||’15/03/2017′||Food Poisoned||Rest & diet||“penicillin,paracetamol”|
|AGNES Jason||’19/072017′||angina||Rest and don’t speak||“amoxicillin”|
SQL Functions for Removing Unwanted And Invisible Characters
In some cases, a text string can have undesirable characters, such as blank spaces, quotes, commas, or even “|” separators. These can be on either or both sides of the string. We can remove those unwanted characters by using the SQL
LTRIM, and SQL
RTRIM functions. They are very similar and are explained in the following table:
|LTRIM||A string and a specific character||Removes the specified character from the left side only|
|RTRIM||A string and a specific character||Removes the specified character from the right side only|
|TRIM||A string and a specific character||Removes the specified character from both sides|
Let’s try these SQL functions, starting with
If you examine the original table, you’ll notice that some entries under the diagnostic column have unnecessary spaces at the beginning. But there are also some hidden spaces after each entry. To fix this, we’ll start by counting the number of characters in the diagnostic strings using the
LENGTH function. As blank spaces are not visible characters, we use angle brackets to show us where the extra spaces (if any) are.
Let’s see how the SQL
LENGTH function works in this query:
SELECT diagnostic as real_diagnostic, length(diagnostic) as field_length, '<' || diagnostic || '>' as delimited_diagnostic FROM patient_data
|Food Poisoning||14||<Food Poisoning>|
|Food Poisoned||15||< Food Poisoned>|
We can see that several records have unnecessary spaces. For “ flu”, the length is 4 instead of 3, and the delimited field shows the blank at the beginning. A proper query will fix this issue and remove the unnecessary space. Let’s go ahead and do that now!
SELECT diagnostic as real_diagnostic, trim(diagnostic) as trimmed_diagnostic FROM patient_data
|Food Poisoning||Food Poisoning|
|Food Poisoned||Food Poisoned|
Power Strings: Nested SQL String Functions
Let’s suppose our doctor wants to know how many patients were diagnosed with each of the illnesses in the
diagnostic column. We’d use the following query to get this information:
SELECT diagnostic, count(*) FROM patient_data GROUP BY diagnostic
As we can see, the result is not what we expected. The same illness is showing up several times because the doctor was not consistent with his typing. A diagnosis of flu shows up as “Flu”, “flu”, and “ flu”. We know they are the same, but the database engine sees them as three different things.
We have the skills to fix this query and get the result we want. We can use the same nested expression to get rid of the extra spaces and eliminate the capitalization mistakes. Here’s how it looks:
SELECT lower(trim(diagnostic)), count(*) FROM patient_data GROUP BY lower(trim(diagnostic))
In programming, it is really common to nest functions, or call a function from inside another function for use as a parameter. This is what we did in the previous example.
Learn More About SQL String Functions
In this article, we covered the important SQL string functions
LENGTH. If you want to detect hidden or unnecessary characters as part of an initial diagnosis, use
LENGTH. Then, use
TRIM to get rid of unwanted characters. We also discussed how you can nest SQL functions, which is a powerful technique in programming.
If you’re eager to learn more about SQL string functions, check out our Standard SQL Functions course. Try it for free today!