JOINS (Part ✌️)!
👉👈
Plus bonus datatypes
Plus bonus datatypes
This week we'll cover types of table relationships; using OUTER JOINs to retrieve rows from multiple tables along with the various set operators, including the use of a self-join. We'll also discuss the various data types used in MySQL in preparation for learning about Data Manipulation Language starting next week.
I've created a database for each of you! If you have your own login credentials (i.e. the username and password I emailed to you), then you should see a database named after you (or your student number). In that database is a table called demo
.
You won't need to use this database this week, but you will next week.
If you don't have unique working credentials either because I didn't email them to you, or there's some other issue, email me Opens in a new window or let me know at the end of class today.
A join is when you merge data from two (or more) different tables to create your results table.
We match rows from each of the tables by selecting one column from each to match up.
SELECT [columns]
FROM [table] JOIN
[otherTable]
ON [table].[column] = [otherTable].[column]
Today we'll add "Outer JOINs" (and a few more ways of merging table data).
SELECT [columns]
FROM [table]
[INNER|OUTER (RIGHT|LEFT)] JOIN
[otherTable]
ON [table].[column] = [otherTable].[column]
Inner joins only include data from rows where a match from the ON
clause is found in both tables.
Outer joins also include data from rows where there isn't a match.
Left outer joins include data from rows where a match is found in both tables plus rows from the first table.
Right outer joins include data from rows where a match is found in both tables plus rows from the second table.
Compare the results for this query...
SELECT
m.member_id AS "Members Table Member ID",
w.member_id AS "Withdrawals Table Member ID",
first_name, last_name, withdrawal_id
FROM members m
JOIN withdrawals w
ON m.member_id = w.member_id
ORDER BY m.member_id DESC
...with the results for this query.
SELECT
m.member_id AS "Members Table Member ID",
w.member_id AS "Withdrawals Table Member ID",
first_name, last_name, withdrawal_id
FROM members m
LEFT OUTER JOIN withdrawals w
ON m.member_id = w.member_id
ORDER BY m.member_id DESC
Let's revisit our students and assignments tables from last week.
Say we have two tables: one for assignments, one for students.
Each assignment belongs to a class. Each student belongs to a class. If we use our join statement to "merge" these two tables, it will output a row for every instance of a match between the assignments table, and the students table. This tells us, based on what class a student is in, what assignment they have to do.
Assignments | |
Class | Assignment |
1 | Paper |
2 | Lab |
4 | Exam |
Students | |
Class | Student |
2 | Birinder |
3 | Amandeep |
4 | Ryan |
SELECT assignments.class, student, assignment
FROM assignments
JOIN students
ON assignments.class = students.class
2 | Birinder | Lab |
4 | Ryan | Exam |
SELECT assignments.class, student, assignment
FROM assignments
LEFT JOIN students
ON assignments.class = students.class
2 | Birinder | Lab |
4 | Ryan | Exam |
1 | (null) | Paper |
SELECT students.class, student, assignment
FROM assignments
RIGHT JOIN students
ON assignments.class = students.class
2 | Birinder | Lab |
4 | Ryan | Exam |
3 | Amandeep | (null) |
Why would we RIGHT JOIN
table A with table B when we could LEFT JOIN
table B with table A?
The answer is, normally, we wouldn't. Right joins only really come in handy when we're joining more than one table to another and things start to get awkward syntactically.
SELECT DISTINCT
a1.first_name,
a1.last_name,
a1.stateProv
FROM authors a1 JOIN authors a2
ON a1.stateProv = a2.stateProv
AND a1.author_id <> a2.author_id
Okay, this is a bit weird, but let's think it through:
This query joins two tables (one just happens to be a duplicate of the other).
We're used to joining things on a key (like an id) that is unique for at least one of the tables.
In this case, however, we're joining on the state/province. So this query produces a list of all authors from one table that have the same state/province as the authors in the other table (a1.stateProv = a2.stateProv
).
Then we say we're only going to join the rows if the id in the first table doesn't match the id in the second table. So we're asking the question, "where are the rows with the same state/province but different authors?"
Or, another way to put that is, "what states or provinces have multiple authors?" Using DISTINCT
(otherwise we would find a row in each of the two tables that met the conditions), what we end up with is a list of authors that share a state or province.
There's another type of join, but don't get it confused with inner or outer joins. The cross join (or 'Cartesian join') doesn't merge rows - it multiplies them.
For every row in the first table, it returns a copy appended with every row in the second table.
This is not used often.
SELECT
assignments.class,
students.class,
student,
assignment
FROM assignments, students
1 | 2 | Birinder | Paper |
2 | 2 | Birinder | Lab |
4 | 2 | Birinder | Exam |
1 | 3 | Amandeep | Paper |
2 | 3 | Amandeep | Lab |
4 | 3 | Amandeep | Exam |
1 | 4 | Ryan | Paper |
2 | 4 | Ryan | Lab |
4 | 4 | Ryan | Exam |
The syntax we've gone over today is chosen to make it clear what we're doing, but, as usual with SQL, there are other ways to write things. Shorter or implicit syntax is risky with joins, as it's easy to write something you don't mean. Still, you'll need to recognize it if you come across it.
The implicit syntax for an inner join puts both tables in the FROM
conditions, and uses WHERE
in place of ON
SELECT
m.member_id, first_name, last_name, withdrawal_id
FROM members m, withdrawals w
WHERE m.member_id = w.member_id
USING
Using USING
isn't necessarily bad practice - it's quite concise. But it doesn't have the flexibility of the ON
clause, hence why we didn't cover it today. It is a shorthand for ON table1.id = table2.id
.
SELECT
m.member_id, first_name, last_name, withdrawal_id
FROM members m
JOIN withdrawals w
USING (member_id)
NATURAL
NATURAL
, however, is sloppy syntax, and should be avoided. It tells the database to guess what columns to join on. It only works if there is a single, identically named shared column. It's very poor for maintainability.
-- Shame!
SELECT member_id, first_name, last_name, withdrawal_id
FROM members NATURAL JOIN withdrawals
We're not going to dive into this today, but I want you to have heard of this...
There is an operator called UNION
. It's similar to a join, but it doesn't merge any rows. Instead it lets you add rows from multiple tables to a single results table. So, if you've got a table of current students, and a table of past students, you could do something like this:
SELECT 'current' AS status,
name_first, name_last, grade_point_average
FROM current_students
UNION
SELECT 'graduated' AS status,
first_name, last_name, final_grade_point_average
FROM past_students
Unlike most other SQL-driven RDBMS', MySQL does not feature a FULL JOIN
. However, we can mimic it with the UNION
clause, which combines queries (while omitting duplicate rows).
If we create a UNION
between a right join and a left join, we get an identical result to a full join.
SELECT assignments.class, student, assignment
FROM assignments
LEFT JOIN students
ON assignments.class = students.class
UNION
SELECT students.class, student, assignment
FROM assignments
RIGHT JOIN students
ON assignments.class = students.class
2 | Birinder | Lab |
4 | Ryan | Exam |
1 | (null) | Paper |
3 | Amandeep | (null) |
Try repeating the steps above, only replace the values
349, 'Paranoid', 12
with the values
'Three hundred and fifty', 'Master of Reality', 12
You can insert data for any set of columns, in any order, provided you specify the columns before the VALUES keyword.
INSERT INTO album (artistId, title)
VALUES (12, 'Master of Reality');
You can write an INSERT command without specifying the columns if you are inserting data for every column, in the same order as those columns appear in the database table.
INSERT INTO album
VALUES (352, 'Sabbath Bloody Sabbath', 12);
You can insert multiple rows provided they all have the same column data in the same order.
INSERT INTO album
VALUES
(353, 'Technical Ecstasy', 12),
(354, 'Sabotage', 12);
Data types matter because data integrity matters.
Data integrity basically means that the data is going in the right place.
If try to put a date into your primary key column, that means something has gone wrong with your process. So we throw an error to show that something needs fixing, rather than letting the mistake happen over and over again.
We can also say that data types matter because you can do stuff with comparable data.
What's 2 days after today? We can find this out with a date function!
What's 10 times 54321? We can find this out with an arithmetic operator!
What's 72 divided by orange? That doesn't make any sense! Why not? Different data types.
Each column in a database contains only one data type. Any time you try to put data of the wrong type into a column, the whole row of data will get rejected note - I'm simplifying this a bit, more on this later.
SQL databases have a lot of different data types, but they all fall into three main categories:
Technically speaking, there are two other categories of data type: Geospatial Opens in a new window and JSON Opens in a new window. But I'm not gonna teach those in class unless a majority of you are like "oh, no, please sir, teach us how to handle geospatial data!"
Characters, also known as 'strings', can contain just about anything - letters, numbers, emojis, punctuation, et al. You're pretty familiar with string data at this point, and you know that you can use functions like SUBSTR() and CONCAT() to work with them, and that they're wrapped in single quotes. You also have figured out, at this point, the trying to use a SUM() function on a column full of strings wouldn't work out too well.
Numbers, or 'numerical data' are what allow us to do our math-based data mining, using arithmetic and a bunch of different aggregate functions.
Dates and times are a separate category of data from either characters or numbers - and probably the trickiest to work with. We haven't done too much with dates so far, but we'll learn a lot today about how to properly format different date and time data, and what sort of functions are available to help us make use of that data!
Like I said, SQL databases have just a ton of different data types, and there are a lot of data types that will appear in one RDBMS but not another.
We will not be learning all the different data types today.
Instead, I've chosen to focus on the data types that are most generally useful, and which are most likely to be shared across different database management systems.
There are a number of reasons for the diversity of data types, but the main difference you'll see between two data types of the same data type category is storage space.
TINYTEXT
, TEXT
, MEDIUMTEXT
, and LONGTEXT
, as you might guess, are all capable of storing the same kind of data. The difference is that...
TINYTEXT
can store up to 255 bytes,TEXT
can store up to 65,535 bytes,MEDIUMTEXT
can store up to 16,777,215 bytes, andLONGTEXT
can store up to 4,294,967,295 bytes.When you're creating a column in a table, most string data types let you specify your own maximum number of characters (or, with numbers, the 'precision'), if you don't want to use the maximum.
This can both 1) make your database more efficient, and 2) protect your data integrity.
Just a note - the number of bytes is not the same as the number of characters. Emojis might be 4 bytes, for example.
VARCHAR lets you store up to 65,535 bytes. If you don't specify a length, it will default to 30 characters. Most commonly, you see the limit set to 255 characters, as that used to be the limit for a lot of RDBMSs.
The Arctic char or Arctic charr (Salvelinus alpinus) is a cold-water fish in the family Salmonidae, native to alpine lakes and arctic and subarctic coastal waters. Its distribution is circumpolar. It spawns in fresh water and populations can be lacustrine
VARCHAR is a "variable length" data type, meaning that each row only stores as much data as you put into it. This is good for saving space, but can impact the speed of your queries if you're dealing with databases that are 100GB+. If you need speedy look ups, use the CHAR data type instead.
VARCHAR is probably the most common data type you'll use.
Need something bigger than VARCHAR? If you're storing, say, long-form articles, you may want to use MEDIUMTEXT (or even LONGTEXT, if you're storing entire books).
ENUM is like VARCHAR, except that it lets you define a list of permissable strings. For example, if you were to create a column with the following data type:
ENUM('x-small', 'small', 'medium', 'large', 'x-large')
... the only strings allowed in that column would be those clothing sizes. This can be really good both for performance and data integrity!
INT is short of 'integer', meaning it can store whole numbers.
By default, integer data types (like INT, SMALLINT, and BIGINT) are "signed", which just means you can put negative values in them (the '-' sign in '-12' is the 'sign'). When you create a column, you can specify whether the data type is signed or unsigned. Unsigned data has a higher maximum value. For example, by default, INT can hold values from -2147483648 to 2147483647. If it's unsigned, however, it can hold values up to 4294967295, but nothing below zero.
DECIMAL, as you might guess, lets you hold numbers with decimal values to a specific number of places.
When you define the DECIMAL, you provide two arguments - precision and scale. Precision means how many number in total, and scale means how many decimal places. If you write DECIMAL(6,2), that column can hold values up to 9,999.99 (two decimal places, 6 numbers in total).
DECIMAL comes in handy for storing things like cash values.
Watch out! For some reason, the default number of decimal places in the DECIMAL data type is none.
Similar to DECIMAL, FLOAT lets you hold numbers with decimal values. The difference is that FLOAT doesn't have a fixed number of decimal places. It's more handy for storing values like "one third".
DATE stores, well, a date! It will be in the following format by default: YYYY-MM-DD
.
On the off chance that you're storing data about dates before 1000 A.D., you'll need to go with a different format! The minimum value for DATE is '1000-01-01'.
TIME, predictably, stores data for hours, minutes, and seconds, in the format 'HH:MM:SS'. When defining a column's data type, you could add milliseconds by defining the data type as TIME(3).
A combination of the above, DATETIME stores values in the format 'YYYY-MM-DD HH:MM:SS'
The TIMESTAMP data type is formatted the same as DATETIME ('YYYY-MM-DD HH:MM:SS'), but it is stored differently in the database.
Whereas DATETIME just stores whatever value you put into it, TIMESTAMP records the date and time relative to Universal Coordinated Time (UTC), meaning that it is a fixed point in time, independent of timezones.
TIMESTAMP is intended to record events.
As such, note that TIMESTAMP only allows for dates between 1970 and (early) 2038 - so not the best datatype for storing birth dates.
We've already covered a great deal of functions that are appropriate for certain data types. I wanted to remind you of CAST() (for converting data from one type to another), and introduce some time/date based functions.
The CAST function lets us convert data from one type to another.
It has a pretty simple syntax - CAST({data} AS {data type})
Okay, there's a ton of these Opens in a new window, so I'm just gonna tell you about the important ones, and throw them into a table in the notes Opens in a new window.
Name | Purpose | Example |
---|---|---|
DATE_ADD/DATE_SUB | Adds or subtracts an interval (like minutes, days, months, etc.) |
|
NOW | Gets the current date and time (based on the server's clock). |
|
EXTRACT | Gets the a segment of date and time data (i.e. year, month). |
|
DATEDIFF | Subtracts one date from another, giving the difference between two dates. |
|
MONTH | For each segment of time and date data, there is a function that will extract that segment - YEAR(), MINUTE(), etc. |
|
MONTHNAME | Given a numbered month, returns the name of the month. |
|
DATE_FORMAT | Lets you format a date to a given format. This is it's own weird little syntax Opens in a new window. You can use this, as it's fairly concise, but I won't blame you if you want to just use EXTRACT, or any of the specific segment functions. |
|
Want to see DATE_FORMAT in action? Run the following in dBeaver:
SELECT DATE_FORMAT(NOW(), '%i minutes, %s seconds past %l %p on %W the %D of %M, in the year %Y.')
AS "Date statement";
So, timezones complicate things. In the interests of keeping things simple, I have set our server's clock to Toronto time. I had to ssh into the server and update the mysql config file. If that's not a thing you know how to do, consider this a heads-up:
Most servers are set to UTC.
This is a good thing. Two words: daylight savings.
Also, it's good to get into the habit of converting things based on timezone, because you may very well have users that are in different time zones.
Okay, remember when I said...
"Any time you try to put data of the wrong type into a column, the whole row of data will get rejected."
There are conditions under which you can insert data of one data type into a column of another data type.
For example, this is okay:
INSERT INTO timetable
(theDate, theTime, theDateTime)
VALUES(NOW(),NOW(),NOW());
But wait! NOW()
returns a date and a time, right? How could we insert data into the theDate or theTime columns (which have data types date
and time
, respectively) if it's the wrong data type?
Well, the truth is, you actually can write an insert statement like the one above - telling the database to put data of one data type into a column of another data type.
In these cases, the data the ends up in the database still matches the column's data type. For example, if you were to run this select statement:
SELECT * FROM timetable;
You'd get a result like this:
theDate |theTime |theDatetime |
----------|--------|---------------------|
2021-02-11|13:22:12|2021-02-11 13:22:12.0|
...even though you inserted NOW()
's same return value (in this case 2021-02-11 13:22:12.0
) three times.
This is because MySQL understands that a datetime value (as returned by NOW()
contains both a date and a time. If the column can only contain a date, MySQL is perfectly happy to throw away your time data and only store the date from the original datetime.
Something similar happens when you store a decimal in a column with the data type INT
:
INSERT INTO numbers (integer)
VALUES (7.51);
SELECT * FROM numbers;
/* Returns integer
|-----------------|
8|
*/
So, while it's always true that data in a column is always of the column's data type, (and while you will throw errors when your data is too big for the allocated memory, or you try to insert a string into a numerical column), there are instances where the database will do you a favour and turn the data being inserted into the correct data type before storing it in the database.
demo
table. Specify the column names in your insert statement.demo
table, this time without specifying the column names.dates
and date_and_time
columns.whole_numbers
column, use the NOW function, along with DATEDIFF, to enter the number of days between now and Hallowe'en.dates
column with the date 20 months before today. Use NOW and the DATE_ADD function to populate the date_and_time
column with the date 500 hours from today.dates
and date_and_time
columns. Check back in the notes for the best function to accomplish this.🔶 This string is 32 bytes ⭐
LENGTH()
returns the number of bytes in a string.)