Week 5 - Outer Joins & also Data Types
JOINS (Part ✌️)!
👉👈
Plus bonus datatypes
This week we'll cover types of table relationships; using OUTER JOINs to retrieve rows from multiple tables along with the various set operators, including the use of a self-join. We'll also discuss the various data types used in MySQL in preparation for learning about Data Manipulation Language starting next week.
You get your own database!
I've created a database for each of you! If you have your own login credentials (i.e. the username and password I emailed to you), then you should see a database named after you (or your student number). In that database is a table called demo
.
You won't need to use this database this week, but you will next week.
If you don't have unique working credentials either because I didn't email them to you, or there's some other issue, email me Opens in a new window or let me know at the end of class today.
Takeaways about non-inner joins this week:
- Outer JOINs return rows even when there isn't a match with the other table.
- There are some weird types of JOINs that don't come up very often.
- You can find entities that have things in common by using a self-join.
- JOINs have several syntax variations.
- UNION isn't a JOIN, but it will let you put rows from different tables into the same results table.
Okay: more JOINs, types of data, here we go.
Here's what we're going to do today:
- More JOINs
- Types of data
- Functions for working with data types
- Working with timezones
A join is when you merge data from two (or more) different tables to create your results table.
We match rows from each of the tables by selecting one column from each to match up.
SELECT [columns]
FROM [table] JOIN
[otherTable]
ON [table].[column] = [otherTable].[column]
Today we'll add "Outer JOINs" (and a few more ways of merging table data).
SELECT [columns]
FROM [table]
[INNER|OUTER (RIGHT|LEFT)] JOIN
[otherTable]
ON [table].[column] = [otherTable].[column]
Outer JOINs
Inner joins only include data from rows where a match from the ON
clause is found in both tables.
Outer joins also include data from rows where there isn't a match.
Left outer joins include data from rows where a match is found in both tables plus rows from the first table.
Right outer joins include data from rows where a match is found in both tables plus rows from the second table.
Compare the results for this query...
SELECT
m.member_id AS "Members Table Member ID",
w.member_id AS "Withdrawals Table Member ID",
first_name, last_name, withdrawal_id
FROM members m
JOIN withdrawals w
ON m.member_id = w.member_id
ORDER BY m.member_id DESC
...with the results for this query.
SELECT
m.member_id AS "Members Table Member ID",
w.member_id AS "Withdrawals Table Member ID",
first_name, last_name, withdrawal_id
FROM members m
LEFT OUTER JOIN withdrawals w
ON m.member_id = w.member_id
ORDER BY m.member_id DESC
Let's revisit our students and assignments tables from last week.
Say we have two tables: one for assignments, one for students.
Each assignment belongs to a class. Each student belongs to a class. If we use our join statement to "merge" these two tables, it will output a row for every instance of a match between the assignments table, and the students table. This tells us, based on what class a student is in, what assignment they have to do.
(Inner) join
SELECT assignments.class, student, assignment
FROM assignments
JOIN students
ON assignments.class = students.class
2 | Birinder | Lab |
4 | Ryan | Exam |
Left (outer) join
SELECT assignments.class, student, assignment
FROM assignments
LEFT JOIN students
ON assignments.class = students.class
2 | Birinder | Lab |
4 | Ryan | Exam |
1 | (null) | Paper |
Right (outer) join
SELECT students.class, student, assignment
FROM assignments
RIGHT JOIN students
ON assignments.class = students.class
2 | Birinder | Lab |
4 | Ryan | Exam |
3 | Amandeep | (null) |
Why would we RIGHT JOIN
table A with table B when we could LEFT JOIN
table B with table A?
The answer is, normally, we wouldn't. Right joins only really come in handy when we're joining more than one table to another and things start to get awkward syntactically.
Joining a single table (to itself)
SELECT DISTINCT
a1.first_name,
a1.last_name,
a1.stateProv
FROM authors a1 JOIN authors a2
ON a1.stateProv = a2.stateProv
AND a1.author_id <> a2.author_id
Okay, this is a bit weird, but let's think it through:
This query joins two tables (one just happens to be a duplicate of the other).
We're used to joining things on a key (like an id) that is unique for at least one of the tables.
In this case, however, we're joining on the state/province. So this query produces a list of all authors from one table that have the same state/province as the authors in the other table (a1.stateProv = a2.stateProv
).
Then we say we're only going to join the rows if the id in the first table doesn't match the id in the second table. So we're asking the question, "where are the rows with the same state/province but different authors?"
Or, another way to put that is, "what states or provinces have multiple authors?" Using DISTINCT
(otherwise we would find a row in each of the two tables that met the conditions), what we end up with is a list of authors that share a state or province.
Cross joins - don't let them throw you off.
There's another type of join, but don't get it confused with inner or outer joins. The cross join (or 'Cartesian join') doesn't merge rows - it multiplies them.
For every row in the first table, it returns a copy appended with every row in the second table.
This is not used often.
SELECT
assignments.class,
students.class,
student,
assignment
FROM assignments, students
1 | 2 | Birinder | Paper |
2 | 2 | Birinder | Lab |
4 | 2 | Birinder | Exam |
1 | 3 | Amandeep | Paper |
2 | 3 | Amandeep | Lab |
4 | 3 | Amandeep | Exam |
1 | 4 | Ryan | Paper |
2 | 4 | Ryan | Lab |
4 | 4 | Ryan | Exam |
Syntax variations
The syntax we've gone over today is chosen to make it clear what we're doing, but, as usual with SQL, there are other ways to write things. Shorter or implicit syntax is risky with joins, as it's easy to write something you don't mean. Still, you'll need to recognize it if you come across it.
Inner join - implicit (a.k.a. non-ANSI) syntax
The implicit syntax for an inner join puts both tables in the FROM
conditions, and uses WHERE
in place of ON
SELECT
m.member_id, first_name, last_name, withdrawal_id
FROM members m, withdrawals w
WHERE m.member_id = w.member_id
USING
Using USING
isn't necessarily bad practice - it's quite concise. But it doesn't have the flexibility of the ON
clause, hence why we didn't cover it today. It is a shorthand for ON table1.id = table2.id
.
SELECT
m.member_id, first_name, last_name, withdrawal_id
FROM members m
JOIN withdrawals w
USING (member_id)
NATURAL
NATURAL
, however, is sloppy syntax, and should be avoided. It tells the database to guess what columns to join on. It only works if there is a single, identically named shared column. It's very poor for maintainability.
-- Shame!
SELECT member_id, first_name, last_name, withdrawal_id
FROM members NATURAL JOIN withdrawals
Another way to 'join' table data
(that isn't a JOIN)
We're not going to dive into this today, but I want you to have heard of this...
UNION
There is an operator called UNION
. It's similar to a join, but it doesn't merge any rows. Instead it lets you add rows from multiple tables to a single results table. So, if you've got a table of current students, and a table of past students, you could do something like this:
SELECT 'current' AS status,
name_first, name_last, grade_point_average
FROM current_students
UNION
SELECT 'graduated' AS status,
first_name, last_name, final_grade_point_average
FROM past_students
Mimicking a Full (outer) join
Unlike most other SQL-driven RDBMS', MySQL does not feature a FULL JOIN
. However, we can mimic it with the UNION
clause, which combines queries (while omitting duplicate rows).
If we create a UNION
between a right join and a left join, we get an identical result to a full join.
SELECT assignments.class, student, assignment
FROM assignments
LEFT JOIN students
ON assignments.class = students.class
UNION
SELECT students.class, student, assignment
FROM assignments
RIGHT JOIN students
ON assignments.class = students.class
2 | Birinder | Lab |
4 | Ryan | Exam |
1 | (null) | Paper |
3 | Amandeep | (null) |
The INSERT command
Let's throw an error!
Try repeating the steps above, only replace the values
349, 'Paranoid', 12
with the values
'Three hundred and fifty', 'Master of Reality', 12
More ways to write an INSERT command
You can insert data for any set of columns, in any order, provided you specify the columns before the VALUES keyword.
INSERT INTO album (artistId, title)
VALUES (12, 'Master of Reality');
You can write an INSERT command without specifying the columns if you are inserting data for every column, in the same order as those columns appear in the database table.
INSERT INTO album
VALUES (352, 'Sabbath Bloody Sabbath', 12);
You can insert multiple rows provided they all have the same column data in the same order.
INSERT INTO album
VALUES
(353, 'Technical Ecstasy', 12),
(354, 'Sabotage', 12);
Why do data types matter?
Data types matter because data integrity matters.
Data integrity basically means that the data is going in the right place.
If try to put a date into your primary key column, that means something has gone wrong with your process. So we throw an error to show that something needs fixing, rather than letting the mistake happen over and over again.
We can also say that data types matter because you can do stuff with comparable data.
What's 2 days after today? We can find this out with a date function!
What's 10 times 54321? We can find this out with an arithmetic operator!
What's 72 divided by orange? That doesn't make any sense! Why not? Different data types.
Each column in a database contains only one data type. Any time you try to put data of the wrong type into a column, the whole row of data will get rejected note - I'm simplifying this a bit, more on this later.
Data type categories
SQL databases have a lot of different data types, but they all fall into three main categories:
- characters,
- numbers, and
- date/time
Technically speaking, there are two other categories of data type: Geospatial Opens in a new window and JSON Opens in a new window. But I'm not gonna teach those in class unless a majority of you are like "oh, no, please sir, teach us how to handle geospatial data!"
Characters
Characters, also known as 'strings', can contain just about anything - letters, numbers, emojis, punctuation, et al. You're pretty familiar with string data at this point, and you know that you can use functions like SUBSTR() and CONCAT() to work with them, and that they're wrapped in single quotes. You also have figured out, at this point, the trying to use a SUM() function on a column full of strings wouldn't work out too well.
Numbers
Numbers, or 'numerical data' are what allow us to do our math-based data mining, using arithmetic and a bunch of different aggregate functions.
Dates & times
Dates and times are a separate category of data from either characters or numbers - and probably the trickiest to work with. We haven't done too much with dates so far, but we'll learn a lot today about how to properly format different date and time data, and what sort of functions are available to help us make use of that data!
Specifics
Like I said, SQL databases have just a ton of different data types, and there are a lot of data types that will appear in one RDBMS but not another.
We will not be learning all the different data types today.
Instead, I've chosen to focus on the data types that are most generally useful, and which are most likely to be shared across different database management systems.
Why not just three data types?
There are a number of reasons for the diversity of data types, but the main difference you'll see between two data types of the same data type category is storage space.
TINYTEXT
, TEXT
, MEDIUMTEXT
, and LONGTEXT
, as you might guess, are all capable of storing the same kind of data. The difference is that...
TINYTEXT
can store up to 255 bytes,TEXT
can store up to 65,535 bytes,MEDIUMTEXT
can store up to 16,777,215 bytes, andLONGTEXT
can store up to 4,294,967,295 bytes.
When you're creating a column in a table, most string data types let you specify your own maximum number of characters (or, with numbers, the 'precision'), if you don't want to use the maximum.
This can both 1) make your database more efficient, and 2) protect your data integrity.
Just a note - the number of bytes is not the same as the number of characters. Emojis might be 4 bytes, for example.
Data types for characters
VARCHAR
VARCHAR lets you store up to 65,535 bytes. If you don't specify a length, it will default to 30 characters. Most commonly, you see the limit set to 255 characters, as that used to be the limit for a lot of RDBMSs.
The Arctic char or Arctic charr (Salvelinus alpinus) is a cold-water fish in the family Salmonidae, native to alpine lakes and arctic and subarctic coastal waters. Its distribution is circumpolar. It spawns in fresh water and populations can be lacustrine
VARCHAR is a "variable length" data type, meaning that each row only stores as much data as you put into it. This is good for saving space, but can impact the speed of your queries if you're dealing with databases that are 100GB+. If you need speedy look ups, use the CHAR data type instead.
VARCHAR is probably the most common data type you'll use.
MEDIUMTEXT
Need something bigger than VARCHAR? If you're storing, say, long-form articles, you may want to use MEDIUMTEXT (or even LONGTEXT, if you're storing entire books).
ENUM
ENUM is like VARCHAR, except that it lets you define a list of permissable strings. For example, if you were to create a column with the following data type:
ENUM('x-small', 'small', 'medium', 'large', 'x-large')
... the only strings allowed in that column would be those clothing sizes. This can be really good both for performance and data integrity!
Data types for numbers
INT
INT is short of 'integer', meaning it can store whole numbers.
By default, integer data types (like INT, SMALLINT, and BIGINT) are "signed", which just means you can put negative values in them (the '-' sign in '-12' is the 'sign'). When you create a column, you can specify whether the data type is signed or unsigned. Unsigned data has a higher maximum value. For example, by default, INT can hold values from -2147483648 to 2147483647. If it's unsigned, however, it can hold values up to 4294967295, but nothing below zero.
DECIMAL
DECIMAL, as you might guess, lets you hold numbers with decimal values to a specific number of places.
When you define the DECIMAL, you provide two arguments - precision and scale. Precision means how many number in total, and scale means how many decimal places. If you write DECIMAL(6,2), that column can hold values up to 9,999.99 (two decimal places, 6 numbers in total).
DECIMAL comes in handy for storing things like cash values.
Watch out! For some reason, the default number of decimal places in the DECIMAL data type is none.
FLOAT
Similar to DECIMAL, FLOAT lets you hold numbers with decimal values. The difference is that FLOAT doesn't have a fixed number of decimal places. It's more handy for storing values like "one third".
Data types for dates & times
DATE
DATE stores, well, a date! It will be in the following format by default: YYYY-MM-DD
.
On the off chance that you're storing data about dates before 1000 A.D., you'll need to go with a different format! The minimum value for DATE is '1000-01-01'.
TIME
TIME, predictably, stores data for hours, minutes, and seconds, in the format 'HH:MM:SS'. When defining a column's data type, you could add milliseconds by defining the data type as TIME(3).
DATETIME
A combination of the above, DATETIME stores values in the format 'YYYY-MM-DD HH:MM:SS'
TIMESTAMP
The TIMESTAMP data type is formatted the same as DATETIME ('YYYY-MM-DD HH:MM:SS'), but it is stored differently in the database.
Whereas DATETIME just stores whatever value you put into it, TIMESTAMP records the date and time relative to Universal Coordinated Time (UTC), meaning that it is a fixed point in time, independent of timezones.
TIMESTAMP is intended to record events.
As such, note that TIMESTAMP only allows for dates between 1970 and (early) 2038 - so not the best datatype for storing birth dates.
Functions for working with data types
We've already covered a great deal of functions that are appropriate for certain data types. I wanted to remind you of CAST() (for converting data from one type to another), and introduce some time/date based functions.
CAST()
The CAST function lets us convert data from one type to another.
It has a pretty simple syntax - CAST({data} AS {data type})
Date & time related functions
Okay, there's a ton of these Opens in a new window, so I'm just gonna tell you about the important ones, and throw them into a table .
Name | Purpose | Example |
---|---|---|
DATE_ADD/DATE_SUB | Adds or subtracts an interval (like minutes, days, months, etc.) |
|
NOW | Gets the current date and time (based on the server's clock). |
|
EXTRACT | Gets the a segment of date and time data (i.e. year, month). |
|
DATEDIFF | Subtracts one date from another, giving the difference between two dates. |
|
MONTH | For each segment of time and date data, there is a function that will extract that segment - YEAR(), MINUTE(), etc. |
|
MONTHNAME | Given a numbered month, returns the name of the month. |
|
DATE_FORMAT | Lets you format a date to a given format. This is it's own weird little syntax Opens in a new window. You can use this, as it's fairly concise, but I won't blame you if you want to just use EXTRACT, or any of the specific segment functions. |
|
Want to see DATE_FORMAT in action? Run the following in dBeaver:
SELECT DATE_FORMAT(NOW(), '%i minutes, %s seconds past %l %p on %W the %D of %M, in the year %Y.')
AS "Date statement";
Working with timezones
So, timezones complicate things. In the interests of keeping things simple, I have set our server's clock to Toronto time. I had to ssh into the server and update the mysql config file. If that's not a thing you know how to do, consider this a heads-up:
Most servers are set to UTC.
This is a good thing. Two words: daylight savings.
Also, it's good to get into the habit of converting things based on timezone, because you may very well have users that are in different time zones.
But wait!
Okay, remember when I said...
"Any time you try to put data of the wrong type into a column, the whole row of data will get rejected."
There are conditions under which you can insert data of one data type into a column of another data type.
For example, this is okay:
INSERT INTO timetable
(theDate, theTime, theDateTime)
VALUES(NOW(),NOW(),NOW());
But wait! NOW()
returns a date and a time, right? How could we insert data into the theDate or theTime columns (which have data types date
and time
, respectively) if it's the wrong data type?
Well, the truth is, you actually can write an insert statement like the one above - telling the database to put data of one data type into a column of another data type.
In these cases, the data the ends up in the database still matches the column's data type. For example, if you were to run this select statement:
SELECT * FROM timetable;
You'd get a result like this:
theDate |theTime |theDatetime |
----------|--------|---------------------|
2021-02-11|13:22:12|2021-02-11 13:22:12.0|
...even though you inserted NOW()
's same return value (in this case 2021-02-11 13:22:12.0
) three times.
This is because MySQL understands that a datetime value (as returned by NOW()
contains both a date and a time. If the column can only contain a date, MySQL is perfectly happy to throw away your time data and only store the date from the original datetime.
Something similar happens when you store a decimal in a column with the data type INT
:
INSERT INTO numbers (integer)
VALUES (7.51);
SELECT * FROM numbers;
/* Returns integer
|-----------------|
8|
*/
So, while it's always true that data in a column is always of the column's data type, (and while you will throw errors when your data is too big for the allocated memory, or you try to insert a string into a numerical column), there are instances where the database will do you a favour and turn the data being inserted into the correct data type before storing it in the database.
Exercises!
- Insert a row into your
demo
table. Specify the column names in your insert statement. - Insert another row into your
demo
table, this time without specifying the column names. - Insert another row, this time specify only three of the column names, and only insert data for those columns.
Note: some columns may require a value, while others may be okay being left empty. - Insert another row, with data for each column. Use arithmetic operations to populate any columns with numerical data types.
- Insert another row, with data for each column. Use the NOW function to populate the
dates
anddate_and_time
columns. - Insert another row, with data for each column. To populate the
whole_numbers
column, use the NOW function, along with DATEDIFF, to enter the number of days between now and Hallowe'en. - Insert another row, with data for each column. Use NOW and the DATE_SUB function to populate the
dates
column with the date 20 months before today. Use NOW and the DATE_ADD function to populate thedate_and_time
column with the date 500 hours from today. - Write a select statement that only returns the month names from your
dates
anddate_and_time
columns. Check back in the notes for the best function to accomplish this. - Write a select statement that only returns values from your decimalNumbers column as the data type DECIMAL, with zero decimal places.
- Write a select statement with a WHERE condition that uses the EXTRACT function.
- Determine the number of bytes for each character in the following sentence:
🔶 This string is 32 bytes ⭐
(Reminder thatLENGTH()
returns the number of bytes in a string.)