Week 3 - Aggregation
Week 3 - Aggregation
This week we'll begin looking at aggregate functions. This will involve discussing data types, and some new keywords, like HAVING and GROUP BY.
Another dataset
I've created a table of Dungeons & Dragons characters called, creatively, characters
.
We'll use this to work on aggregate functions today.
Aggregate functions
So far, we've looked at scalar functions - functions that take one value and return one value.
Today, we look at aggregate functions. Aggregate functions take in multiple values (one value from each row) and return one value.
Let's see how many rows we have in our characters
table.
SELECT COUNT(*) FROM characters
Aggregate queries may also be referred to as summary queries, because its return table is a summary rather than a collection of rows.
5 common aggregate functions
AVG()
returns the average.
SUM()
returns the sum.
MIN()
returns the lowest value, while
MAX()
returns the highest.
COUNT()
returns the number of values (meaning, the number of rows).
All of these functions - except COUNT(*)
used with an asterisk - only work on non-null values. If you want to count null
as being equal to zero, you will have to convert null values with the COALESCE()
function.
Now, as we've said before, database tables are not spreadsheets, but the have some things in common. These aggregate functions are one of those things! Let's pop over to Google Sheets Opens in a new window and take a look at how these functions work there.
Let's see what the average strength is for a character in this set.
SELECT AVG(strength) FROM characters
If I was passing this value to another function, I would want to use the precise value. But since I'm using this information to display, I'll clean it up by nesting my AVG() function inside a scalar function - ROUND().
SELECT ROUND(AVG(strength), 2)
AS avg_strength FROM characters
In other words, if you're going to round off numbers, make sure it's the last thing you do with them.
Now, you probably won't have to use DISTINCT
with AVG()
or any of the other aggregate functions, but it does come in handy with COUNT()
.
Let's see how many unique classes (primary_class
) there are in our table:
SELECT COUNT(DISTINCT primary_class)
AS "Total Unique Classes"
FROM characters
GROUP BY
The GROUP BY
clause lets us create groups of rows. It groups together any rows that have matching data in the specified column(s).
The purpose of this is to be able to use aggregate functions on groups of rows.
Imagine we had a table of students who went to different schools.
We count the total number of students like this:
SELECT COUNT(*) FROM students
If we wanted to count the total number of students who go to each school, we'd write it like this:
SELECT school, COUNT(*) FROM students
GROUP BY school;
In other words, if we want to run aggregate functions on more than one set of rows, the GROUP BY
clause lets us run an aggregate function on each distinct set of values in the column that we are grouping by.
SELECT primary_class, ROUND(AVG(intelligence), 3)
AS smarts
FROM characters
GROUP BY primary_class
ORDER BY smarts DESC
This is important: we can only select columns outside of an aggregate function if they're referenced in GROUP BY
.
This is because the data in those columns is the same for the whole group, but rows within the group might differ in their data in other columns. Remember, what we're seeing in each row of our results table represents a group of rows.
As with other clauses, we can group by a combination of columns. You might want to do this, for example, if you've got columns for city and state (where there might be two cities with the same name in different states).
So, if we want to see what combination of subclass
and background
gives the average character the greatest dexterity:
SELECT
subclass,
background,
ROUND(AVG(dexterity))
AS dex
FROM characters
GROUP BY subclass, background
ORDER BY dex DESC
HAVING
HAVING
is like WHERE
for groups.
-- SELECT Primary Class and the
-- average Character Level...
SELECT primary_class,
ROUND(AVG(char_level), 3) AS "Character level"
FROM characters
-- ... for each Primary Class ...
GROUP BY primary_class
-- ... if, on average, the Armour of that
-- Primary Class is greater than 10
HAVING AVG(armour_class) > 10
ORDER BY "Character level" DESC
HAVING
filters for groups (not individual rows) that meet the condition.
Can we have more than one HAVING
condition?
Of course!
-- select the background and average wisdom
SELECT
background,
ROUND(AVG(wisdom), 4) AS wisdom,
AVG(charisma) AS charm
FROM characters
-- for each background
GROUP BY background
-- where there are more than
-- 25 instances of that background
HAVING COUNT(*) > 25
-- and the average charisma is between ten and twenty
AND AVG(charisma) BETWEEN 10 AND 20
Our 4 clauses go in this order:
- WHERE
- GROUP BY
- HAVING
- ORDER BY
WHERE
filters rows before they are grouped
HAVING
filters groups after grouping.
SELECT primary_class,
ROUND(AVG(char_level), 3) AS "Character level"
FROM characters
WHERE UPPER(primary_class) LIKE ('%BARD%')
GROUP BY primary_class
HAVING AVG(armour_class) > 10
ORDER BY ROUND(AVG(char_level), 3) DESC
Queries don't start with SELECT
I mean, when you're writing queries, of course they do begin with SELECT
, but that's not actually the order in which the database management system reads them.
What do they start with, then?
The first thing the RDBMS does is gets the table data, meaning the first part of your query that gets executed is FROM {tableName}
.
Next is WHERE
. This is important to know! It means that any other part of the statement that gets executed is only happening to rows that haven't already been filtered out.
After WHERE
comes GROUP BY
and HAVING
.
Then comes SELECT
, followed by ORDER BY
and LIMIT
.
Once again, I strongly recommend you read "SQL queries don't start with SELECT Opens in a new window", an excellent blog article by Montréal-based programmer Julia Evans.
Subqueries
Subqueries are queries used to return a value within another query.
They are powerful! And with great power comes great opportunities to screw things up.
Don't use a subquery because you're not sure of the WHERE
syntax, the order of operations, or how to code a self-join.
DO NOT use subqueries unless you don't have a choice or unless it reduces, rather than adds, complexity.
A use-case for subqueries
Using a value returned by an aggregate function inside a standard query.
Here is a query that selects all the Primary Classes that have above-average wisdom:
SELECT * FROM characters
WHERE wisdom > (SELECT AVG(wisdom) FROM characters);
Other than that, subqueries rarely have a use-case outside of deep complexity, and will usually introduce more complexity than they resolve.
We can learn more about them later in the year, if we have time. At that point you'll be comfortable with JOIN
s, clauses and the order of operations, and you'll be better able to judge when a subquery might be justified. Until then, avoid them unless you're comparing a value against the value of an aggregate.