Whatever you’re hoping to do with data, having SQL skills is likely to be key.
That’s because despite the fact that SQL is quite old, some version of it is used in the databases of just about every company on earth. Whether you’re interested in doing data science or simply accessing and analyzing a little bit of your company’s data, chances are SQL fundamentals are going to be required.
What is SQL?
We’ll start this SQL fundamentals tutorial with a very fundamental question: what is SQL, anyway? SQL (often pronounced “sequel”) is an abbreviation for Structured Query Language. It’s a kind of programming language that’s designed specifically for use querying and updating databases.
This is useful because it’s not always practical or advisable to store data in other formats like spreadsheets or CSVs. Imagine, for example, a company with terabytes of data that’s being updated regularly by people in different departments across the company. Trying to manage this with spreadsheets would be a logistical and security nightmare! Databases allow for the secure storage of huge amounts of data, and SQL is the tool we use to access and make changes to that data without needing to store it all in a file on our local machine.
In SQL, we express our requests to the database as “queries”. For example, we might send a query — an instruction — to a database to return a specific subset of data, like a particular table, or to update a particular value in the database.
Let’s imagine we want to get some data on salaries from our company’s database. We certainly don’t want to download all of the company’s data, so instead we’ll write a SQL query that gives us only the specific data we need. If we assume there’s a table in the company database called, salaries
, that query would probably look something like this:
SELECT * FROM salaries
The *
character here means “everything” — select every column from the salaries
table. With this single short query, we’ll be able to get just the table we need from our company database.
But SQL can do a lot more than query a full table. It can help us efficiently merge, filter, update, and analyze data to answer all sorts of questions. So let’s dive in and start learning how to write SQL queries!
Note: for many students, it may be easier learn SQL interactively and write queries in your browser rather than following along with a tutorial. You can do that for free in our interactive SQL Fundamentals course:
SQL Fundamentals Tutorial: Introduction
In this tutorial, we’ll explore data from the American Community Survey on job outcome statistics based on college majors. While the original CSV version can be found on FiveThirtyEight’s Github, we’ll be using a slightly modified version of the data that’s stored as a database. (Remember, SQL is used for querying data stored in databases).
Specifically, we’ll be working with a of the data that contains the 2010-2012 data for recent college grads only. This data is stored as a SQLite database (SQLite is one of the most popular SQL-based database management systems).
If you’re familiar with how to query SQLite databases, you can download the database file here. If you’re not comfortable with getting a SQLite database set up and querying it on your local machine, it may be easier to work through our free SQL fundamentals course, which is interactive and allows you to write real SQL queries of this database using only your browser.
Previewing A Table Using SELECT
Whenever we’re working with a new data set, it’s helpful to start by previewing what we’re actually working with. For this tutorial, we’ll be working with the database file jobs.db
, which contains a single table named recent_grads
:
Here’s a visual representation of that structure:
Let’s start by displaying only the first five rows of our table. This will give us a good idea of what we’re working with, and limiting the query to five rows will ensure it’s quick and we don’t get overwhelmed with a huge table.
Like other programming languages, code in SQL has to adhere to a defined structure and vocabulary. To specify that we want to return the first 5 rows from recent_grads
, we need to run the following SQL query:
SELECT * FROM recent_grads LIMIT 5
In this query, we specify:
- the columns we wanted using
SELECT *
- the table we wanted to query using
FROM recent_grads
- the number of rows we wanted using
LIMIT 5
Here’s a visual breakdown of the different components of the query:
Now, let’s see what this query returns! Here’s what we’ll get when this query is run:
As we can see, our query returned exactly what we wanted: every column of our selected table, but only the first five rows. If we changed the end of the query to read LIMIT 10
, it would return the first ten rows instead. If we remove the LIMIT
command altogether, it would return every row in the table.
A Quick Note About Capitalization
Before we move on, you may have noticed that we’re typing some things, like SELECT
, in all capital letters, while other things like the table name are written in lower case.
Depending on the specific database you’re querying and what sort of database management system it uses, capitalization of table names may or may not matter. However, to ensure that your queries will work and be easy to read with any system, it’s best to stick to these conventions:
- Statements or commands (like
SELECT
,LIMIT
, etc.) should be written in all-caps. - Table and column names should be written however they are stored in the database you’re querying (so if the database stores them in all lower case, they should be all lower case in your queries, too).
Now, on to our next statement: WHERE
!
Filtering Rows Using WHERE
If we take a look at the details of this data set, we can get a better idea of what data is here and what types of questions we might be able to answer. Spend some time getting familiar with what each column represents.
Based on an understanding of what each column represents, here are some questions we may have:
- Which majors had mostly female students? Which ones had mostly male students?
- Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?
- Which engineering majors had the highest full time employment rates?
Let’s start by focusing on the first question. We need to translate this question into SQL so that we can get the answer we’re looking for from the database.
To determine which majors had mostly female students, we want the following subset of data:
- only the
Major
column - only the rows where
ShareWomen
is greater than0.5
(corresponding to 50%)
To return only the Major
column, we need to add the specific column name to the SELECT
statement part of the query (instead of using the *
operator to return all columns):
SELECT Major FROM recent_grads
This will return all of the values in the Major
column. We can specify multiple columns this way as well and the results table will preserve the order of the columns:
SELECT Major, Major_category FROM recent_grads
To return only the values where ShareWomen
is greater than or equal to 0.5
, we need to add a WHERE
clause:
SELECT Major FROM recent_gradsWHERE ShareWomen >= 0.5
Finally, we can limit the number of rows returned using LIMIT
:
SELECT Major FROM recent_gradsWHERE ShareWomen >= 0.5 LIMIT 5
If we run that query, here’s what we’ll see:
Here’s a breakdown of the different components of our query:
While in the SELECT
part of the query, we express the specific column we want. In the WHERE
part we express the specific rows we want. The beauty of SQL is that these can be independent.
Let’s write a SQL query that returns the majors where females were a minority. We’ll only return the Major
and ShareWomen
columns (in that order) and, we won’t limit the number of rows returned:
SELECT Major, ShareWomen FROM recent_grads WHERE ShareWomen < 0.5
Expressing Multiple Filter Criteria Using AND
To filter rows by specific criteria, we need to use the WHERE
statement. A simple WHERE
statement requires three things:
- The column we want the database to filter on:
ShareWomen
- A comparison operator that specifies how we want to compare a value in a column:
>
- The value we want the database to compare each value to:
0.5
Here are the comparison operators we can use:
- Less than:
<
- Less than or equal to:
<=
- Greater than:
>
- Greater than or equal to:
>=
- Equal to:
=
- Not equal to:
!=
The comparison value after the operator must be either text or a number, depending on the field. Because ShareWomen
is a numeric column, we don’t need to enclose the number 0.5
in quotes.
Finally, most database systems require that the SELECT
and FROM
statements come first, before WHERE
or any other statements. We can use the AND
operator to combine multiple filter criteria. For example, to determine which engineering majors had majority female, we’d need to specify 2 filtering criteria.
SELECT Major FROM recent_gradsWHERE Major_category = 'Engineering' AND ShareWomen > 0.5
It looks like only 2 majors met this criteria. If we wanted to “zoom” back out to look at all of the columns for both of these majors to see if they shared some other common attributes, we can modify the SELECT
statement and use the symbol *
to represent all columns
SELECT * FROM recent_gradsWHERE Major_category = 'Engineering' AND ShareWomen > 0.5
The ability to quickly iterate on queries as you think of new questions is one of the most powerful aspects of SQL. The SQL workflow lets data professionals focus on asking and answering questions, instead of worrying about lower-level programming concepts.
Now, let’s write a SQL query that returns all majors with a majority of female students and that had a median salary greater than 50000
. Let’s also limit the columns displayed in the results like so:
SELECT Major, Major_category, Median, ShareWomen FROM recent_grads WHERE ShareWomen > 0.5 AND Median > 50000
It looks like only two majors with mostly female students also had a median salary higher than $50,000.
Returning One of Several Conditions With OR
We used the AND
operator to specify that our filter needs to fit two Boolean conditions. Both of the conditions had to evaluate to True
for the record to appear in the results.
If we wanted to specify a filter that meets either of the conditions instead, we could use the OR
operator.
SELECT [column1, column2,...] FROM [table1]WHERE [condition1] OR [condition2]
Let’s write a SQL query that returns the first 20 majors that either have a Median
salary greater than or equal to 10,000
, or have less than or equal to 1,000
Unemployed
people. Let’s only include the following columns in the results and in this order:
Major
Median
Unemployed
SELECT Major, Median, Unemployed FROM recent_grads WHERE Median >= 10000 OR Unemployed <= 1000 LIMIT 20
Now, let’s start getting even more advanced with our SQL queries.
Grouping Operators With Parentheses
There’s a certain class of questions that we can’t answer using only the techniques we’ve learned so far. For example, if we wanted to write a query that returned all Engineering
majors that either had mostly female graduates or an unemployment rate below 5.1%, we would need to use parentheses to express this more complex logic.
Major_category = 'Engineering'ShareWomen >= 0.5Unemployment_rate < 0.051
What the SQL query looks like using parantheses:
SELECT Major, Major_category, ShareWomen, Unemployment_rate FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051);
Notice that we enclosed the logic we wanted to be evaluated together in parentheses. This is very similar to how we group mathematical calculations together using parentheses to define which are evaluated first. The parentheses makes it explictly clear to the database that we want all of the rows where both of the expressions in the statements evaluate to True
:
(Major_category = 'Engineering' AND ShareWomen > 0.5) -> True or False(ShareWomen > 0.5 OR Unemployment_rate < 0.051) -> True or False
If we had written the where
statement without any parentheses, the database would guess what our intentions are, and actually execute the following query instead:
WHERE (Major_category = 'Engineering' AND ShareWomen > 0.5) OR (Unemployment_rate < 0.051)
Leaving the parentheses out implies that we want the calculation to happen from left to right, in the order in which the logic is written, so it wouldn’t return us the data we want.
Now let’s run our intended query and see the results! We’ll run the query we explored above, which returns all Engineering
majors that either had mostly women graduates or had an unemployment rate below 5.1%, which was the nationwide unemployment rate in August 2015.
Let’s only include the following columns in the results and in this order:
Major
Major_category
ShareWomen
Unemployment_rate
Here’s our query:
SELECT Major, Major_category, ShareWomen, Unemployment_rateFROM recent_gradsWHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
That looks great, but what if we want to sort the results based on the values of a specific column to make them more readable?
Ordering Results Using ORDER BY
The results of every query we’ve written so far have been ordered by the Rank
column. Recall the query from earlier in this post that returned all of the columns and you’ll see the Rank column:
SELECT * FROM recent_grads LIMIT 5
As the questions we want to answer get more complex, we want more control over how the results are ordered. We can specify the order using the ORDER BY clause.
For example, we may want to understand which majors that met the criteria in the WHERE
statement had the lowest unemployment rate. The following query will return the results in ascending order by the Unemployment_rate
column.
SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate
If we instead want the results ordered by the same column but in descending order, we can add the DESC
keyword:
SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_gradsWHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate DESC
Let’s write a query that returns all majors where ShareWomen
is greater than 0.3
and Unemployment_rate
is less than .1
. Let’s only include the following columns in the results and in this order:
Major
,ShareWomen
,Unemployment_rate
We’ll order the results in descending order by the ShareWomen
column. This will allow us to easily compare the unemployment rates for majors with higher or lower shares of women.
SELECT Major, ShareWomen, Unemployment_rate FROM recent_grads
WHERE ShareWomen > 0.3 AND Unemployment_rate < .1O
RDER BY ShareWomen DESC
And here are our results!
Master the Fundamentals of SQL
As you can see, SQL is a powerful language for accessing data, and this post is really just scratching the surface. If you’d like to learn more, we encourage you to check out our free, interactive SQL Fundamentals course, which this blog post is based on. The course goes even deeper, though, and will teach you how to use SQL to do things like:
- Calculate summary statistics
- Further segment data using grouping
- Write more complex queries using subqueries
In the course, you’ll continue to work with salary data from recent college graduates as well as data from the CIA World Factbook.
This article has been published from the source link without modifications to the text. Only the headline has been changed.
Source link