4. Joins and other ways to combining or create tables

Concepts

The JOIN command lets us combine data from two or more tables by matching values in a shared column.

There are 4 primary types of SQL joins: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN.

We will compare INNER JOIN and LEFT JOIN. In SQL, an INNER JOIN returns only the rows that have matching values in both tables.

A LEFT JOIN, on the other hand, returns all the rows from the left table (the first table in the join) and the matching rows from the right table (the second table in the join). If there is no match, NULL values are returned for the columns of the right table.

JOINS
Figure 1. 4 types of SQL JOINS

Additional ways to combine or create tables

Command Description Example Example Description

INTO

A directional command that may be used to create a new table from the results of a query.

SELECT * INTO new_table FROM table

Selects all columns from table and creates a new table called new_table with the results

INSERT

Used to insert new records into a table.

INSERT INTO table (column1, column2) VALUES (value1, value2)

Inserts a new record into the table with column1 set to value1 and column2 set to value2

UPDATE

Used to modify existing records in a table.

UPDATE table SET column = value WHERE condition

Updates the column in the table to value where the condition is met

DELETE

Used to delete records from a table.

DELETE FROM table WHERE condition

Deletes records from the table where the condition is met

DROP

Used to delete a table or database.

DROP TABLE table

Deletes the table from the database

CREATE

Used to create a new table or database.

CREATE TABLE table (column1 datatype, column2 datatype)

Creates a new table called table with column1 and column2 of specified data types

Primary and foreign keys

Maggie still needs to add content.

A primary key is a unique identifier for every record in a table. It cannot be left NULL and every row for this column must be unique.

A foreign key is an identifier used to reference a primary key in a different (foreign) table. In other words, a foreign key must exist in another table.

Demo

Let us simply INNER JOIN two tables first:

SELECT * FROM crew AS c INNER JOIN people AS p
ON c.person_id = p.person_id LIMIT 5;

This context tells SQL to add all columns from both tables, and then match up the rows based on the common column specified in the ON clause.

We can use the WHERE clause and the IS NULL operator to list only crew members who do not have a matching record in the people table.

SELECT * FROM crew AS c LEFT JOIN people AS p
ON c.person_id = p.person_id
WHERE p.name IS NULL
LIMIT 20;

RIGHT JOIN is the opposite of a LEFT JOIN, but SQLite does not support it natively. If you need to use a right join, you can typically just use a left join with the order of the tables flipped. Additionally, there is an OUTER JOIN, which is a combination of LEFT JOIN and RIGHT JOIN. It returns all rows from both tables, with NULLs in places where there is no match. However, SQLite does not support FULL OUTER JOIN natively either.

Join three tables

Let us see all the movies Christopher Nolan worked on:

SELECT * FROM crew
INNER JOIN people ON crew.person_id = people.person_id
INNER JOIN titles ON crew.title_id = titles.title_id
WHERE people.name = 'Christopher Nolan' AND titles.type = 'movie';

Order them by premiered year:

SELECT * FROM crew
INNER JOIN people ON crew.person_id = people.person_id
INNER JOIN titles ON crew.title_id = titles.title_id
WHERE people.name = 'Christopher Nolan' AND titles.type = 'movie'
ORDER BY titles.premiered;

Now, we want to see what type of productions (movies, video, etc.) Christopher Nolan is mainly working on (executing this code will take some time):

SELECT t.type, COUNT(*) AS count
FROM titles AS t
JOIN crew AS c ON t.title_id = c.title_id
JOIN people AS p ON c.person_id = p.person_id
WHERE p.name = 'Christopher Nolan'
GROUP BY t.type
ORDER BY count DESC;

We can find what type of movies Christopher Nolan is mainly working on:

SELECT t.genres, COUNT(*) AS count
FROM titles AS t
JOIN crew AS c ON t.title_id = c.title_id
JOIN people AS p ON c.person_id = p.person_id
WHERE p.name = 'Christopher Nolan' AND t.type = 'movie'
GROUP BY t.genres
ORDER BY count DESC;

Let us practise 'GROUP BY' and 'HAVING' with 'JOIN' by finding the average rating of movies by year and genre, but only for the movies with an average rating is greater than a certain amount.

SELECT t.type, t.genres, t.premiered, AVG(r.rating) AS average_rating
FROM ratings AS r
LEFT JOIN titles AS t ON r.title_id = t.title_id
WHERE t.type = 'movie'
GROUP BY t.genres, t.premiered
HAVING t.premiered = 2000 AND average_rating > 8;

The HAVING clause is very similar to the WHERE clause, but the key difference is that it is used to filter results after aggregation, while WHERE is used to filter results before aggregation.

WITH

Our queries are becoming longer and more complex with multiple joins. When we need to run several operations on the same joined data, repeating those joins can be inefficient and hard to read. To simplify this, we can use the WITH clause to create a temporary table for the joined data and run our operations on it. The example below shows how:

WITH joined_table AS (SELECT * FROM ratings AS r LEFT JOIN titles AS t ON r.title_id = t.title_id)

SELECT * FROM joined_table LIMIT 5;

This temporary table joined_data will only exist for the duration of the query (the current code cell in this case). If you want to make it permanent, you can use the CREATE TABLE command instead of WITH, with the exact same syntax. However, in this case, you will not be able to do this on the IMDB or Lahman dataset as it is stored as a read-only database in anvil.

Practice on your own

Deliverables

4.1. Look at first 5 records in at least 3 tables in the Lahman dataset, not including the People table.

4.2. Name any columns that appear in multiple tables, or only in a specific table.

4.3. Describe how the tables might relate to each other.

4.4 Write a SQL query to join the People and Salaries tables using the playerID column, and limit the results to the first 5 records.

4.5. Write a SQL query to join the People and Salaries tables, and return the average salary of players grouped by their weight, sorted in descending order, limiting the results to 10 records.

4.6. WriteJOIN command to combine the Batting table with the People table. Answer the next questions with this info.

4.7. Write a SQL query to find how many times Babe Ruth reached the home plate.

4.8. Write a SQL query to find the average number of points scored by players born in the 1980s, grouped and ordered by their birth year.

4.9. Query to find the 10 highest scoring players and their names.

4.10. Query to find the 10 highest hitting players and their names.

4.11. Are there any players who appear in both lists? If so, who are they?