Week 3 - Aggregation

This week we'll begin looking at aggregate functions. This will involve discussing data types, and some new keywords, like HAVING and GROUP BY.

Aggregate functions
1. Common aggregate functions
GROUP BY
HAVING
Queries don't start with SELECT
Subqueries
1. A use-case for subqueries

Another dataset

I've created a table of Dungeons & Dragons characters called, creatively, characters. Dungeons and Dragons characters

We'll use this to work on aggregate functions today.

Aggregate functions

So far, we've looked at scalar functions - functions that take one value and return one value.

Today, we look at aggregate functions. Aggregate functions take in multiple values (one value from each row) and return one value.

Let's see how many rows we have in our characters table.

SELECT COUNT(*) FROM characters

Aggregate queries may also be referred to as summary queries, because its return table is a summary rather than a collection of rows.

5 common aggregate functions

AVG() returns the average.

SUM() returns the sum.

MIN() returns the lowest value, while

MAX() returns the highest.

COUNT() returns the number of values (meaning, the number of rows).

All of these functions - except COUNT(*) used with an asterisk - only work on non-null values. If you want to count null as being equal to zero, you will have to convert null values with the COALESCE() function.

Now, as we've said before, database tables are not spreadsheets, but the have some things in common. These aggregate functions are one of those things! Let's pop over to Google Sheets Opens in a new window and take a look at how these functions work there.

Let's see what the average strength is for a character in this set.

SELECT AVG(strength) FROM characters

If I was passing this value to another function, I would want to use the precise value. But since I'm using this information to display, I'll clean it up by nesting my AVG() function inside a scalar function - ROUND().

SELECT ROUND(AVG(strength), 2) 
  AS avg_strength FROM characters

In other words, if you're going to round off numbers, make sure it's the last thing you do with them.

Now, you probably won't have to use DISTINCT with AVG() or any of the other aggregate functions, but it does come in handy with COUNT().

Let's see how many unique classes (primary_class) there are in our table:

SELECT COUNT(DISTINCT primary_class) 
  AS "Total Unique Classes"
FROM characters

GROUP BY

The GROUP BY clause lets us create groups of rows. It groups together any rows that have matching data in the specified column(s).

The purpose of this is to be able to use aggregate functions on groups of rows.

Imagine we had a table of students who went to different schools.

We count the total number of students like this:

SELECT COUNT(*) FROM students

If we wanted to count the total number of students who go to each school, we'd write it like this:

SELECT school, COUNT(*) FROM students
  GROUP BY school;

In other words, if we want to run aggregate functions on more than one set of rows, the GROUP BY clause lets us run an aggregate function on each distinct set of values in the column that we are grouping by.

SELECT primary_class, ROUND(AVG(intelligence), 3) 
  AS smarts
FROM characters
GROUP BY primary_class
ORDER BY smarts DESC

This is important: we can only select columns outside of an aggregate function if they're referenced in GROUP BY.

This is because the data in those columns is the same for the whole group, but rows within the group might differ in their data in other columns. Remember, what we're seeing in each row of our results table represents a group of rows.

As with other clauses, we can group by a combination of columns. You might want to do this, for example, if you've got columns for city and state (where there might be two cities with the same name in different states).

So, if we want to see what combination of subclass and background gives the average character the greatest dexterity:

SELECT 
    subclass, 
    background, 
    ROUND(AVG(dexterity)) 
        AS dex
FROM characters
GROUP BY subclass, background
ORDER BY dex DESC

HAVING

HAVING is like WHERE for groups.

-- SELECT Primary Class and the 
-- average Character Level...
SELECT primary_class, 
  ROUND(AVG(char_level), 3) AS "Character level" 
FROM characters
-- ... for each Primary Class ...
GROUP BY primary_class
-- ... if, on average, the Armour of that 
-- Primary Class is greater than 10
HAVING AVG(armour_class) > 10
ORDER BY "Character level" DESC

HAVING filters for groups (not individual rows) that meet the condition.

Can we have more than one HAVING condition?

Of course!

-- select the background and average wisdom
SELECT 
  background, 
  ROUND(AVG(wisdom), 4) AS wisdom, 
  AVG(charisma) AS charm
FROM characters
-- for each background
GROUP BY background
-- where there are more than
-- 25 instances of that background
HAVING COUNT(*) > 25 
-- and the average charisma is between ten and twenty
AND AVG(charisma) BETWEEN 10 AND 20

Our 4 clauses go in this order:

WHERE
GROUP BY
HAVING
ORDER BY

WHERE filters rows before they are grouped

HAVING filters groups after grouping.

SELECT primary_class, 
  ROUND(AVG(char_level), 3) AS "Character level" 
FROM characters
WHERE UPPER(primary_class) LIKE ('%BARD%')
GROUP BY primary_class
HAVING AVG(armour_class) > 10
ORDER BY ROUND(AVG(char_level), 3) DESC

Queries don't start with SELECT

I mean, when you're writing queries, of course they do begin with SELECT, but that's not actually the order in which the database management system reads them.

What do they start with, then?

The first thing the RDBMS does is gets the table data, meaning the first part of your query that gets executed is FROM {tableName}.

Next is WHERE. This is important to know! It means that any other part of the statement that gets executed is only happening to rows that haven't already been filtered out.

After WHERE comes GROUP BY and HAVING.

Then comes SELECT, followed by ORDER BY and LIMIT.

Once again, I strongly recommend you read "SQL queries don't start with SELECT Opens in a new window", an excellent blog article by Montréal-based programmer Julia Evans.

(She also does really cool programming 'zines and comics on many topics!)

Subqueries

Subqueries are queries used to return a value within another query.

They are powerful! And with great power comes great opportunities to screw things up.

Don't use a subquery because you're not sure of the WHERE syntax, the order of operations, or how to code a self-join.

DO NOT use subqueries unless you don't have a choice or unless it reduces, rather than adds, complexity.

A use-case for subqueries

Using a value returned by an aggregate function inside a standard query.

Here is a query that selects all the Primary Classes that have above-average wisdom:

SELECT * FROM characters 
WHERE wisdom > (SELECT AVG(wisdom) FROM characters);

Other than that, subqueries rarely have a use-case outside of deep complexity, and will usually introduce more complexity than they resolve.

We can learn more about them later in the year, if we have time. At that point you'll be comfortable with JOINs, clauses and the order of operations, and you'll be better able to judge when a subquery might be justified. Until then, avoid them unless you're comparing a value against the value of an aggregate.