4. Joins and other ways to combining or create tables
Concepts
The JOIN command lets us combine data from two or more tables by matching values in a shared column.
There are 4 primary types of SQL joins: INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, and FULL OUTER JOIN.
We will compare INNER JOIN and LEFT JOIN. In SQL, an INNER JOIN returns only the rows that have matching values in both tables.
A LEFT JOIN, on the other hand, returns all the rows from the left table (the first table in the join) and the matching rows from the right table (the second table in the join). If there is no match, NULL values are returned for the columns of the right table.
JOINSAdditional ways to combine or create tables
| Command | Description | Example | Example Description |
|---|---|---|---|
|
A directional command that may be used to create a new table from the results of a query. |
|
Selects all columns from |
|
Used to insert new records into a table. |
|
Inserts a new record into the |
|
Used to modify existing records in a table. |
|
Updates the |
|
Used to delete records from a table. |
|
Deletes records from the |
|
Used to delete a table or database. |
|
Deletes the |
|
Used to create a new table or database. |
|
Creates a new table called |
Primary and foreign keys
|
Maggie still needs to add content. |
A primary key is a unique identifier for every record in a table. It cannot be left NULL and every row for this column must be unique.
A foreign key is an identifier used to reference a primary key in a different (foreign) table. In other words, a foreign key must exist in another table.
Demo
Let us simply INNER JOIN two tables first:
SELECT * FROM crew AS c INNER JOIN people AS p
ON c.person_id = p.person_id LIMIT 5;
|
This context tells SQL to add all columns from both tables, and then match up the rows based on the common column specified in the |
We can use the WHERE clause and the IS NULL operator to list only crew members who do not have a matching record in the people table.
SELECT * FROM crew AS c LEFT JOIN people AS p
ON c.person_id = p.person_id
WHERE p.name IS NULL
LIMIT 20;
|
|
Join three tables
Let us see all the movies Christopher Nolan worked on:
SELECT * FROM crew
INNER JOIN people ON crew.person_id = people.person_id
INNER JOIN titles ON crew.title_id = titles.title_id
WHERE people.name = 'Christopher Nolan' AND titles.type = 'movie';
Order them by premiered year:
SELECT * FROM crew
INNER JOIN people ON crew.person_id = people.person_id
INNER JOIN titles ON crew.title_id = titles.title_id
WHERE people.name = 'Christopher Nolan' AND titles.type = 'movie'
ORDER BY titles.premiered;
Now, we want to see what type of productions (movies, video, etc.) Christopher Nolan is mainly working on (executing this code will take some time):
SELECT t.type, COUNT(*) AS count
FROM titles AS t
JOIN crew AS c ON t.title_id = c.title_id
JOIN people AS p ON c.person_id = p.person_id
WHERE p.name = 'Christopher Nolan'
GROUP BY t.type
ORDER BY count DESC;
We can find what type of movies Christopher Nolan is mainly working on:
SELECT t.genres, COUNT(*) AS count
FROM titles AS t
JOIN crew AS c ON t.title_id = c.title_id
JOIN people AS p ON c.person_id = p.person_id
WHERE p.name = 'Christopher Nolan' AND t.type = 'movie'
GROUP BY t.genres
ORDER BY count DESC;
Let us practise 'GROUP BY' and 'HAVING' with 'JOIN' by finding the average rating of movies by year and genre, but only for the movies with an average rating is greater than a certain amount.
SELECT t.type, t.genres, t.premiered, AVG(r.rating) AS average_rating
FROM ratings AS r
LEFT JOIN titles AS t ON r.title_id = t.title_id
WHERE t.type = 'movie'
GROUP BY t.genres, t.premiered
HAVING t.premiered = 2000 AND average_rating > 8;
|
The |
WITH
Our queries are becoming longer and more complex with multiple joins. When we need to run several operations on the same joined data, repeating those joins can be inefficient and hard to read. To simplify this, we can use the WITH clause to create a temporary table for the joined data and run our operations on it. The example below shows how:
WITH joined_table AS (SELECT * FROM ratings AS r LEFT JOIN titles AS t ON r.title_id = t.title_id)
SELECT * FROM joined_table LIMIT 5;
|
This temporary table joined_data will only exist for the duration of the query (the current code cell in this case). If you want to make it permanent, you can use the |
Practice on your own
4.1. Look at first 5 records in at least 3 tables in the Lahman dataset, not including the People table.
4.2. Name any columns that appear in multiple tables, or only in a specific table.
4.3. Describe how the tables might relate to each other.
4.4 Write a SQL query to join the People and Salaries tables using the playerID column, and limit the results to the first 5 records.
4.5. Write a SQL query to join the People and Salaries tables, and return the average salary of players grouped by their weight, sorted in descending order, limiting the results to 10 records.
4.6. WriteJOIN command to combine the Batting table with the People table. Answer the next questions with this info.
4.7. Write a SQL query to find how many times Babe Ruth reached the home plate.
4.8. Write a SQL query to find the average number of points scored by players born in the 1980s, grouped and ordered by their birth year.
4.9. Query to find the 10 highest scoring players and their names.
4.10. Query to find the 10 highest hitting players and their names.
4.11. Are there any players who appear in both lists? If so, who are they?