Week 5 - Outer Joins & also Data Types

JOINS (Part ✌️)!
👉👈

Plus bonus datatypes

You get your own database!

I've created a database for each of you! If you have your own login credentials (i.e. the username and password I emailed to you), then you should see a database named after you (or your student number). In that database is a table called demo.

You won't need to use this database this week, but you will next week.

If you don't have unique working credentials either because I didn't email them to you, or there's some other issue, email me Opens in a new window or let me know at the end of class today.

Takeaways about non-inner joins this week:

Outer JOINs return rows even when there isn't a match with the other table.
There are some weird types of JOINs that don't come up very often.
You can find entities that have things in common by using a self-join.
JOINs have several syntax variations.
UNION isn't a JOIN, but it will let you put rows from different tables into the same results table.

Okay: more JOINs, types of data, here we go.

A join is when you merge data from two (or more) different tables to create your results table.

We match rows from each of the tables by selecting one column from each to match up.

SELECT [columns] 
FROM [table] JOIN
  [otherTable]
  ON [table].[column] = [otherTable].[column]

Today we'll add "Outer JOINs" (and a few more ways of merging table data).

SELECT [columns] 
FROM [table] 
  [INNER|OUTER (RIGHT|LEFT)] JOIN
  [otherTable]
  ON [table].[column] = [otherTable].[column]

Outer JOINs

Inner joins only include data from rows where a match from the ON clause is found in both tables.

Outer joins also include data from rows where there isn't a match.

Left outer joins include data from rows where a match is found in both tables plus rows from the first table.

Right outer joins include data from rows where a match is found in both tables plus rows from the second table.

Compare the results for this query...

SELECT 
  m.member_id AS "Members Table Member ID",
  w.member_id AS "Withdrawals Table Member ID",
  first_name, last_name, withdrawal_id
FROM members m
  JOIN withdrawals w
  ON m.member_id = w.member_id
ORDER BY m.member_id DESC

...with the results for this query.

SELECT 
  m.member_id AS "Members Table Member ID",
  w.member_id AS "Withdrawals Table Member ID",
  first_name, last_name, withdrawal_id
FROM members m
  LEFT OUTER JOIN withdrawals w
  ON m.member_id = w.member_id
ORDER BY m.member_id DESC

Let's revisit our students and assignments tables from last week.

Say we have two tables: one for assignments, one for students.

Each assignment belongs to a class. Each student belongs to a class. If we use our join statement to "merge" these two tables, it will output a row for every instance of a match between the assignments table, and the students table. This tells us, based on what class a student is in, what assignment they have to do.

Assignments
Class	Assignment
1	Paper
2	Lab
4	Exam

Students
Class	Student
2	Birinder
3	Amandeep
4	Ryan

(Inner) join

SELECT assignments.class, student, assignment
FROM assignments
  JOIN students
    ON assignments.class = students.class

2	Birinder	Lab
4	Ryan	Exam

Left (outer) join

SELECT assignments.class, student, assignment
FROM assignments
LEFT JOIN students
ON assignments.class = students.class

2	Birinder	Lab
4	Ryan	Exam
1	`(null)`	Paper

Right (outer) join

SELECT students.class, student, assignment
FROM assignments
RIGHT JOIN students
ON assignments.class = students.class

2	Birinder	Lab
4	Ryan	Exam
3	Amandeep	`(null)`

Why would we RIGHT JOIN table A with table B when we could LEFT JOIN table B with table A?

The answer is, normally, we wouldn't. Right joins only really come in handy when we're joining more than one table to another and things start to get awkward syntactically.

Joining a single table (to itself)

SELECT DISTINCT 
  a1.first_name, 
  a1.last_name, 
  a1.stateProv
FROM authors a1 JOIN authors a2
    ON a1.stateProv = a2.stateProv 
      AND a1.author_id <> a2.author_id

Okay, this is a bit weird, but let's think it through:

This query joins two tables (one just happens to be a duplicate of the other).

We're used to joining things on a key (like an id) that is unique for at least one of the tables.

In this case, however, we're joining on the state/province. So this query produces a list of all authors from one table that have the same state/province as the authors in the other table (a1.stateProv = a2.stateProv).

Then we say we're only going to join the rows if the id in the first table doesn't match the id in the second table. So we're asking the question, "where are the rows with the same state/province but different authors?"

Or, another way to put that is, "what states or provinces have multiple authors?" Using DISTINCT (otherwise we would find a row in each of the two tables that met the conditions), what we end up with is a list of authors that share a state or province.

Cross joins - don't let them throw you off.

There's another type of join, but don't get it confused with inner or outer joins. The cross join (or 'Cartesian join') doesn't merge rows - it multiplies them.

For every row in the first table, it returns a copy appended with every row in the second table.

This is not used often.

SELECT 
  assignments.class, 
  students.class, 
  student, 
  assignment
FROM assignments, students

1	2	Birinder	Paper
2	2	Birinder	Lab
4	2	Birinder	Exam
1	3	Amandeep	Paper
2	3	Amandeep	Lab
4	3	Amandeep	Exam
1	4	Ryan	Paper
2	4	Ryan	Lab
4	4	Ryan	Exam

Syntax variations

The syntax we've gone over today is chosen to make it clear what we're doing, but, as usual with SQL, there are other ways to write things. Shorter or implicit syntax is risky with joins, as it's easy to write something you don't mean. Still, you'll need to recognize it if you come across it.

Another way to 'join' table data
(that isn't a JOIN)

We're not going to dive into this today, but I want you to have heard of this...

Note that the columns from the two tables are treated as single columns in the results table, and need the same data type.

Mimicking a Full (outer) join

Unlike most other SQL-driven RDBMS', MySQL does not feature a FULL JOIN. However, we can mimic it with the UNION clause, which combines queries (while omitting duplicate rows).

If we create a UNION between a right join and a left join, we get an identical result to a full join.

SELECT assignments.class, student, assignment 
FROM assignments
  LEFT JOIN students
    ON assignments.class = students.class
UNION
SELECT students.class, student, assignment 
FROM assignments
  RIGHT JOIN students
    ON assignments.class = students.class

2	Birinder	Lab
4	Ryan	Exam
1	`(null)`	Paper
3	Amandeep	`(null)`

Isn't that fun?

The INSERT command

Check out the notes for a walkthrough of how to insert data into a database, and why data types matter.

More ways to write an INSERT command

You can insert data for any set of columns, in any order, provided you specify the columns before the VALUES keyword.

INSERT INTO album (artistId, title)
  VALUES (12, 'Master of Reality');

You can write an INSERT command without specifying the columns if you are inserting data for every column, in the same order as those columns appear in the database table.

INSERT INTO album 
  VALUES (352, 'Sabbath Bloody Sabbath', 12);

You can insert multiple rows provided they all have the same column data in the same order.

INSERT INTO album
  VALUES 
    (353, 'Technical Ecstasy', 12), 
    (354, 'Sabotage', 12);

Why do data types matter?

Data types matter because data integrity matters.

Data integrity basically means that the data is going in the right place.

If try to put a date into your primary key column, that means something has gone wrong with your process. So we throw an error to show that something needs fixing, rather than letting the mistake happen over and over again.

We can also say that data types matter because you can do stuff with comparable data.

What's 2 days after today? We can find this out with a date function!

What's 10 times 54321? We can find this out with an arithmetic operator!

What's 72 divided by orange? That doesn't make any sense! Why not? Different data types.

Each column in a database contains only one data type. Any time you try to put data of the wrong type into a column, the whole row of data will get rejected note - I'm simplifying this a bit, more on this later.

Data type categories

SQL databases have a lot of different data types, but they all fall into three main categories:

characters,
numbers, and
date/time

Characters

Characters, also known as 'strings', can contain just about anything - letters, numbers, emojis, punctuation, et al. You're pretty familiar with string data at this point, and you know that you can use functions like SUBSTR() and CONCAT() to work with them, and that they're wrapped in single quotes. You also have figured out, at this point, the trying to use a SUM() function on a column full of strings wouldn't work out too well.

Numbers

Numbers, or 'numerical data' are what allow us to do our math-based data mining, using arithmetic and a bunch of different aggregate functions.

Dates & times

Dates and times are a separate category of data from either characters or numbers - and probably the trickiest to work with. We haven't done too much with dates so far, but we'll learn a lot today about how to properly format different date and time data, and what sort of functions are available to help us make use of that data!

Specifics

Like I said, SQL databases have just a ton of different data types, and there are a lot of data types that will appear in one RDBMS but not another.

We will not be learning all the different data types today.

Instead, I've chosen to focus on the data types that are most generally useful, and which are most likely to be shared across different database management systems.

Why not just three data types?

There are a number of reasons for the diversity of data types, but the main difference you'll see between two data types of the same data type category is storage space.

TINYTEXT, TEXT, MEDIUMTEXT, and LONGTEXT, as you might guess, are all capable of storing the same kind of data. The difference is that...

TINYTEXT can store up to 255 bytes,
TEXT can store up to 65,535 bytes,
MEDIUMTEXT can store up to 16,777,215 bytes, and
LONGTEXT can store up to 4,294,967,295 bytes.

When you're creating a column in a table, most string data types let you specify your own maximum number of characters (or, with numbers, the 'precision'), if you don't want to use the maximum.

This can both 1) make your database more efficient, and 2) protect your data integrity.

Just a note - the number of bytes is not the same as the number of characters. Emojis might be 4 bytes, for example.

Data types for characters

VARCHAR

VARCHAR lets you store up to 65,535 bytes. If you don't specify a length, it will default to 30 characters. Most commonly, you see the limit set to 255 characters, as that used to be the limit for a lot of RDBMSs.

For reference, this is 255 characters (including whitespace):

The Arctic char or Arctic charr (Salvelinus alpinus) is a cold-water fish in the family Salmonidae, native to alpine lakes and arctic and subarctic coastal waters. Its distribution is circumpolar. It spawns in fresh water and populations can be lacustrine

VARCHAR is a "variable length" data type, meaning that each row only stores as much data as you put into it. This is good for saving space, but can impact the speed of your queries if you're dealing with databases that are 100GB+. If you need speedy look ups, use the CHAR data type instead.

VARCHAR is probably the most common data type you'll use.

MEDIUMTEXT

Need something bigger than VARCHAR? If you're storing, say, long-form articles, you may want to use MEDIUMTEXT (or even LONGTEXT, if you're storing entire books).

Data types for numbers

INT

INT is short of 'integer', meaning it can store whole numbers.

By default, integer data types (like INT, SMALLINT, and BIGINT) are "signed", which just means you can put negative values in them (the '-' sign in '-12' is the 'sign'). When you create a column, you can specify whether the data type is signed or unsigned. Unsigned data has a higher maximum value. For example, by default, INT can hold values from -2147483648 to 2147483647. If it's unsigned, however, it can hold values up to 4294967295, but nothing below zero.

DECIMAL

DECIMAL, as you might guess, lets you hold numbers with decimal values to a specific number of places.

When you define the DECIMAL, you provide two arguments - precision and scale. Precision means how many number in total, and scale means how many decimal places. If you write DECIMAL(6,2), that column can hold values up to 9,999.99 (two decimal places, 6 numbers in total).

DECIMAL comes in handy for storing things like cash values.

Watch out! For some reason, the default number of decimal places in the DECIMAL data type is none.

FLOAT

Similar to DECIMAL, FLOAT lets you hold numbers with decimal values. The difference is that FLOAT doesn't have a fixed number of decimal places. It's more handy for storing values like "one third".

Data types for dates & times

DATE

DATE stores, well, a date! It will be in the following format by default: YYYY-MM-DD.

On the off chance that you're storing data about dates before 1000 A.D., you'll need to go with a different format! The minimum value for DATE is '1000-01-01'.

TIME

TIME, predictably, stores data for hours, minutes, and seconds, in the format 'HH:MM:SS'. When defining a column's data type, you could add milliseconds by defining the data type as TIME(3).

DATETIME

A combination of the above, DATETIME stores values in the format 'YYYY-MM-DD HH:MM:SS'

TIMESTAMP

The TIMESTAMP data type is formatted the same as DATETIME ('YYYY-MM-DD HH:MM:SS'), but it is stored differently in the database.

Whereas DATETIME just stores whatever value you put into it, TIMESTAMP records the date and time relative to Universal Coordinated Time (UTC), meaning that it is a fixed point in time, independent of timezones.

TIMESTAMP is intended to record events.

As such, note that TIMESTAMP only allows for dates between 1970 and (early) 2038 - so not the best datatype for storing birth dates.

Functions for working with data types

We've already covered a great deal of functions that are appropriate for certain data types. I wanted to remind you of CAST() (for converting data from one type to another), and introduce some time/date based functions.

CAST()

The CAST function lets us convert data from one type to another.

It has a pretty simple syntax - CAST({data} AS {data type})

See the notes for an example Opens in a new window

Date & time related functions

Okay, there's a ton of these Opens in a new window, so I'm just gonna tell you about the important ones, and throw them into a table in the notes Opens in a new window.

Note that I encourage you to try these functions out to see what they do, but many are MySQL-specific, so they may not work as queries in the browser demos.

Name	Purpose	Example
DATE_ADD/DATE_SUB	Adds or subtracts an interval (like minutes, days, months, etc.)	`DATE_ADD('2018-05-01', INTERVAL 1 DAY)`
NOW	Gets the current date and time (based on the server's clock).	`NOW()`
EXTRACT	Gets the a segment of date and time data (i.e. year, month).	`EXTRACT(YEAR FROM '2019-07-02')`
DATEDIFF	Subtracts one date from another, giving the difference between two dates.	`DATEDIFF('2021-03-31', NOW())`
MONTH	For each segment of time and date data, there is a function that will extract that segment - YEAR(), MINUTE(), etc.	`MONTH('2021-03-31')`
MONTHNAME	Given a numbered month, returns the name of the month.	`MONTHNAME('2021-03-31')`
DATE_FORMAT	Lets you format a date to a given format. This is it's own weird little syntax Opens in a new window. You can use this, as it's fairly concise, but I won't blame you if you want to just use EXTRACT, or any of the specific segment functions.	`DATE_FORMAT('2021-03-31', '%W %M %Y')`

A reminder about formatting dates - your data will need to be in the right date format to get put into a column. To compare one date to another, they need to be in the same format.

Working with timezones

So, timezones complicate things. In the interests of keeping things simple, I have set our server's clock to Toronto time. I had to ssh into the server and update the mysql config file. If that's not a thing you know how to do, consider this a heads-up:

Most servers are set to UTC.

This is a good thing. Two words: daylight savings.

Also, it's good to get into the habit of converting things based on timezone, because you may very well have users that are in different time zones.

But wait!

Okay, remember when I said...

"Any time you try to put data of the wrong type into a column, the whole row of data will get rejected."

There are conditions under which you can insert data of one data type into a column of another data type.

For example, this is okay:

INSERT INTO timetable 
  (theDate, theTime, theDateTime)
  VALUES(NOW(),NOW(),NOW());

But wait! NOW() returns a date and a time, right? How could we insert data into the theDate or theTime columns (which have data types date and time, respectively) if it's the wrong data type?

Well, the truth is, you actually can write an insert statement like the one above - telling the database to put data of one data type into a column of another data type.

In these cases, the data the ends up in the database still matches the column's data type. For example, if you were to run this select statement:

SELECT * FROM timetable;

You'd get a result like this:

theDate   |theTime |theDatetime          |
----------|--------|---------------------|
2021-02-11|13:22:12|2021-02-11 13:22:12.0|

...even though you inserted NOW()'s same return value (in this case 2021-02-11 13:22:12.0) three times.

This is because MySQL understands that a datetime value (as returned by NOW() contains both a date and a time. If the column can only contain a date, MySQL is perfectly happy to throw away your time data and only store the date from the original datetime.

Something similar happens when you store a decimal in a column with the data type INT:

INSERT INTO numbers (integer) 
    VALUES (7.51);
SELECT * FROM numbers;
/* Returns integer
|-----------------|
                 8|
*/

So, while it's always true that data in a column is always of the column's data type, (and while you will throw errors when your data is too big for the allocated memory, or you try to insert a string into a numerical column), there are instances where the database will do you a favour and turn the data being inserted into the correct data type before storing it in the database.

JOINS (Part ✌️)!👉👈

Let's throw an error!

Exercises!

JOINS (Part ✌️)!
👉👈