Excel is a powerful beast that lets you analyze complex data. Yet, operating on big chunks of data can sometimes be a daunting task. Let’s take a look at how SQL can help.
Today, we’ll tackle a common problem with importing data to an SQL database, using a real-life example. Suppose your company conducted a survey on the most popular programming trends and preferences, striving to meet the expectations of its users. Your user base was overwhelmingly responsive, as the questionnaire was completed by more than 15,000 people! Unfortunately, someone has to analyze those results—someone like you. So, let’s get right to the task!
Among many others, users had to answer the following questions, with possible answers listed in parentheses:
- How old are you? (17, 19, 25, 42, 33)
- What is your gender? (M, F, N–prefer not to answer, O–other)
- What is your current employment status? (E–employed; U–unemployed)
- What other topics would you like to learn? (Data science, Statistics, Data visualization, etc.)
- Do you have any experience in programming? (Yes, No)
- What is your preferred method of learning? (Course, Video, Articles, Webinar, Mix)
We’ve stored our data in an Excel file called survey.xlsx. Here’s what our table looks like:
Of course, this is only a small sample of data from a file that has around 15,000 rows containing the results of our survey.
If you already have some experience with SQL, you may know that you can insert data into a table with the INSERT INTO command:
INSERT INTO survey (id, age, gender, …) VALUES (1, 25, 'F', …);
If you have no idea how to insert data into a table using SQL, I recommend you take our free Operating on data in SQL course before reading any further. The INSERT INTO command can come in handy when you need to create a few rows of data manually. However, it would be tedious to fill a table with 100,000 rows of data this way unless you used a generated script. Fortunately, many database systems already provide a means of importing data from a spreadsheet into a database.
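For just a handful of rows, a single multi-row INSERT is enough. The values below are invented for illustration and assume the survey table defined later in this article:

```sql
-- Hypothetical sample responses; column names match the survey table.
INSERT INTO survey
  (id, age, gender, emp_status, preferred_language, other_topic, experience, pref_learn_way)
VALUES
  (1, 25, 'F', 'E', 'Python', 'Data science', TRUE, 'Course'),
  (2, 42, 'M', 'U', 'SQL', 'Statistics', FALSE, 'Video');
```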
Save XLSX file as a CSV
To make this work, we first need to convert our spreadsheet from .xlsx (or any other common spreadsheet format) to .csv, which is short for comma-separated values (CSV). This file format stores tabular data as plain text. Note that simply renaming the file extension is not enough; the contents must actually be converted. In most spreadsheet applications, all you have to do is click Save As and choose the .csv format. You may be asked to specify a field delimiter when you do this; you can use any character you like, as long as it does not appear anywhere in the data itself.
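After saving, the first few lines of survey.csv might look something like this (the rows here are made up for illustration, matching the columns we will define next):

```csv
id,age,gender,emp_status,preferred_language,other_topic,experience,pref_learn_way
1,25,F,E,Python,Data science,true,Course
2,42,M,U,SQL,Statistics,false,Video
```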
Create a table
Now that our file is ready, we can begin preparing the table that will store all our survey data. We’ll create the table with the same column names as in our CSV file to avoid confusion.
CREATE TABLE survey (
  id int,
  age int,
  gender char(1),
  emp_status char(1),
  preferred_language varchar(16),
  other_topic varchar(32),
  experience boolean,
  pref_learn_way varchar(16)
);
Regardless of their names, the columns must use the correct data types and be defined in the same order as the columns in the CSV file. As you've already seen, each column of our CSV file stores numbers, single characters, or strings. In our case, we've opted to use the following data types:
- int for numbers
- char(n) for single-character responses, where n is the exact number of characters to be stored in a given column's entry
- varchar(n) for strings, where n is the maximum number of characters that may be stored in a given column's entry
- boolean for yes/no responses
Remember, the order of the columns in our table must match the order of the columns in the CSV file.
Fill the table with data
Finally, we can import data from our CSV into our database. We’ll use the COPY command to do that, like so:
COPY survey FROM '/home/user/pathToFile/survey.csv' CSV HEADER DELIMITER ',';
What exactly does this query do? Basically, we tell our database to copy data from the given file path (note the quotation marks) into the survey table, using the following parameters:
- CSV: informs the database that the file we are copying from is in CSV format.
- HEADER: indicates that the first row of the file is a header and should be skipped.
- DELIMITER ',': sets the field delimiter to a comma.
However, keep in mind that the order in which these parameters appear after the path does not matter. We could just as well have written DELIMITER ',' HEADER CSV, and everything would have worked fine. You also have to remember that the COPY command appends data to the table: every time you use it, the database adds the new rows to whatever already exists in the table rather than replacing it.
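If you want to replace the existing contents rather than append to them, you can empty the table before importing. A minimal sketch:

```sql
-- Remove all existing rows, then load the file from scratch.
TRUNCATE survey;
COPY survey FROM '/home/user/pathToFile/survey.csv' CSV HEADER DELIMITER ',';
```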
The null parameter
Sometimes, the dataset we'd like to import is incomplete, meaning responses are missing from some of the cells. By default, the COPY command assumes that nulls are represented in a CSV file as empty, unquoted strings, i.e., as empty fields. This is not always the case: one person may leave a field empty, while another writes "NO DATA" to indicate that no response was provided. We can handle this by using NULL AS 'something' to tell our database what should be interpreted as a null value:
COPY survey FROM '/home/user/pathToFile/survey.csv' CSV HEADER DELIMITER ',' NULL AS 'NO DATA';
In the above example, all CSV fields that contain 'NO DATA' will be converted to nulls.
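After the import, a quick query can confirm that those fields really became nulls; for example, counting respondents who did not report their age (using the columns of our survey table):

```sql
-- Count the rows where the age field was missing in the CSV.
SELECT count(*) AS missing_age
FROM survey
WHERE age IS NULL;
```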
Analyze your data
Awesome! Our survey responses are now in the database. All that remains is to analyze the information we've gathered. We can query almost anything we'd like; it all depends on what you want to know. The following is a simple demonstration of the kind of SQL query you could execute:
WITH no_experience_python AS (
  SELECT count(*) AS no_experience_python_users
  FROM survey
  WHERE experience IS FALSE
    AND preferred_language = 'Python'
    AND other_topic = 'Data Science'
),
experienced_python AS (
  SELECT count(*) AS experienced_python_users
  FROM survey
  WHERE experience IS TRUE
    AND preferred_language = 'Python'
    AND other_topic = 'Data Science'
)
SELECT * FROM no_experience_python, experienced_python;
In the above query, we compared the number of users with no prior programming experience who want to learn Python and are interested in Data Science to the number of experienced users who want to learn Python and are interested in Data Science. It's easy, readable, and simple!
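As another simple example, a GROUP BY query can break the responses down by preferred learning method (a sketch using the same table):

```sql
-- Number of respondents per preferred learning method, most popular first.
SELECT pref_learn_way, count(*) AS respondents
FROM survey
GROUP BY pref_learn_way
ORDER BY respondents DESC;
```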
SQL’s quite a powerful and useful language to learn, especially if you plan to analyze data in your line of work. If you’d like to learn about data types, creating tables, and much more, check out our Creating Tables in SQL course. Or, if you’re just getting started with SQL, you can learn the core principles of the language by signing up for our SQL Basics course.