TDM 20100: Project 7 - SQL Introduction 3
Project Objectives
Learn how to use SQL to join multiple tables (at least 3), and different types of joins, including composite keys and aggregations. Additionally, you will learn about grouping and filtering results using GROUP BY
and HAVING
clauses in SQLite.
Dataset
-
/anvil/projects/tdm/data/lahman/lahman.db
If AI is used in any cases, such as for debugging, research, etc, we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is an “Share” option in the conversation sidebar. Click on “Create Link” and please add the shareable link as a part of your citation. The project template in the Examples Book now has a “Link to AI Chat History” section; please have this included in all your projects. If you did not use any AI tools, you may write “None”. We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must all be your own work and in your own words. No content or ideas should be directly applied or copy pasted to your projects. Please refer to the-examples-book.com/projects/fall2025/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered as academic dishonesty. |
Questions
Question 1 (2 points)
By now, you should be somewhat familiar with the Lahman baseball dataset. If you haven’t already, take a look at the People
table to understand the structure and the columns it contains. This table contains information about players, such as their names, birth dates, and player IDs. Additionally, you can look at the Salaries
table, which contains information about player salaries, and the Teams
table, which contains information about teams and their IDs.
When you look at the Salaries
table, you will notice that it contains a column called playerID
, which can be used to match up a player’s information from the People
table with their salary information in the Salaries
table. This is useful, as it can let us further analyze data such as comparing player heights to their salaries, or finding the average salary of players by their birth year.
Furthermore, the Teams table contains a column called teamID
, which can be used to match up a player’s team information with their salary information in the Salaries
table. This is useful, as it can let us analyze team budgets and player salaries.
Suppose you want to find the average salary of players by their team and year. To do this, you will need to join the Salaries
, People
, and Teams
tables together.
You should be familiar with joining two tables together, but joining three tables is a bit more complex. You will need to use the JOIN
clause to join the Salaries
table with the People
table on the playerID
column, and then join the resulting table with the Teams
table on the teamID
and yearID
columns. A typical pattern for composite joins is as follows (feel free to run this query in your SQLite environment to see how it works):
SELECT
p.nameFirst || ' ' || p.nameLast AS fullName,
t.name AS teamName,
s.yearID,
s.salary
FROM Salaries s
JOIN People p ON p.playerID = s.playerID
JOIN Teams t ON t.teamID = s.teamID AND t.yearID = s.yearID
ORDER BY s.salary DESC
LIMIT 10;
As you can see, we first join the Salaries
table with the People
table on the playerID
column, and then we join the resulting table with the Teams
table on both the teamID
and yearID
columns. This allows us to get the player’s full name, team name, year, and salary in a single query.
You may notice the second line of that query uses the |
Now that you have a basic understanding of how to join three tables together, you can try to write your own query to find the average salary of players by their team and year.
Last project, you used the |
1.1. A single SQL query returning fullName
, teamName
, yearID
, and salary
.
1.2. Order by salary
descending and show at least 10 rows.
1.3. A single SQL query returning the average salary of players by their team and year.
1.4. What was the average salary of players on the New York Yankees in 2005?
Question 2 (2 points)
Great! Now that you have successfully joined three tables together, let’s look into the different types of joins available in SQL. In this question, you will compare INNER JOIN
and LEFT JOIN
on a three-table query.
In SQL, an INNER JOIN
returns only the rows that have matching values in both tables. For example, if you join the Salaries
table with the People
table using an INNER JOIN
, you will only get the rows where there is a matching playerID
in both tables.
A LEFT JOIN
, on the other hand, returns all the rows from the left table (the first table in the join) and the matching rows from the right table (the second table in the join). If there is no match, NULL values are returned for the columns of the right table. This means that if you join the Salaries
table with the People
table using a LEFT JOIN
, you will get all the rows from the Salaries
table, even if there is no matching playerID
in the People
table.
In this situation, when using LEFT JOIN
the order of the tables is very important. For INNER JOIN
, however, the order does not matter as long as the join conditions are correct.
Let’s look at an example of how to use INNER JOIN
and LEFT JOIN
in a three-table query. Starting with the INNER JOIN
, you can use the following query:
SELECT
p.nameFirst || ' ' || p.nameLast AS fullName,
t.name AS teamName,
s.yearID,
s.salary
FROM Salaries s
INNER JOIN People p ON p.playerID = s.playerID
INNER JOIN Teams t ON t.teamID = s.teamID AND t.yearID = s.yearID
ORDER BY s.salary DESC
LIMIT 10;
Now, let’s look at the same query using LEFT JOIN
:
SELECT
p.nameFirst || ' ' || p.nameLast AS fullName,
t.name AS teamName,
s.yearID,
s.salary
FROM Salaries s
LEFT JOIN People p ON p.playerID = s.playerID
LEFT JOIN Teams t ON t.teamID = s.teamID AND t.yearID = s.yearID
ORDER BY s.salary DESC
LIMIT 10;
As you can see, both queries look the exact same as our original JOIN
query from Question 1, but the difference is in the type of join used. Because the Lahman database is extremely well-structured and maintained, you will not see a difference in the output of these queries. However, for more disorganized/less populated datasets, you may see a difference in the number of rows returned or the presence of NULL values in the output.
There also exists a |
For this question, you will need to write two queries: one using INNER JOIN
and one using LEFT JOIN
. You can use the same query structure as above, but make sure to use the correct join type. Then, compare the results of the two queries and answer the following questions:
-
What is the difference in the number of rows returned by the
INNER JOIN
andLEFT JOIN
queries? -
Are there any NULL values in the
LEFT JOIN
results? If so, what do they represent?
2.1. A single SQL query using INNER JOIN
returning fullName
, teamName
, yearID
, and salary
.
2.2. A single SQL query using LEFT JOIN
returning fullName
, teamName
, yearID
, and salary
.
2.3. A brief explanation of the differences in the results of the two queries, including the number of rows returned and the presence of NULL values.
Question 3 (2 points)
Now that you understand how to join three tables together and compare different types of joins, let’s explore how to filter and group the results using GROUP BY
and HAVING
clauses.
You’ve used the GROUP BY
clause in previous projects to group results by a specific column, such as yearID
or teamID
. However, we can also use HAVING
in conjunction with GROUP BY
to filter the grouped results based on aggregate functions. For example, suppose you want to find the average salary of players by their team and year, but only for teams with an average salary greater than a certain amount. You can use the HAVING
clause to filter the results after grouping.
An example of using GROUP BY
and HAVING
together is as follows:
SELECT
t.name AS teamName,
s.yearID,
AVG(s.salary) AS avgSalary
FROM Salaries s
JOIN Teams t ON t.teamID = s.teamID AND t.yearID = s.yearID
GROUP BY t.name, s.yearID
HAVING AVG(s.salary) > 5000000
ORDER BY avgSalary DESC
LIMIT 10;
In this query, we first join the Salaries
table with the Teams
table to get the team names and years. Then, we group the results by teamName
and yearID
, calculating the average salary for each group. Finally, we use the HAVING
clause to filter out teams with an average salary less than or equal to 5 million.
The |
For this question, write a query to join the Salaries
, People
, and Teams
tables, group the results by team and year, and filter the results using the HAVING
clause to only include teams with an average salary greater than 7.5 million. You can use the query structure provided above as a starting point.
3.1. A single SQL query returning teamName
, yearID
, and avgSalary
.
3.2. Use HAVING
to filter teams with an average salary greater than 7.5 million.
3.3. How many teams meet this criteria?
Question 4 (2 points)
We can also use the HAVING
clause to filter results based on conditions that involve multiple columns. For example, suppose you want to find teams where the average salary is greater than 5 million and the number of players on the team is greater than 20. You can use the HAVING
clause to filter the results based on both conditions.
An example of using HAVING
with multiple conditions is as follows:
SELECT
t.name AS teamName,
s.yearID,
AVG(s.salary) AS avgSalary,
COUNT(s.playerID) AS numPlayers
FROM Salaries s
JOIN Teams t ON t.teamID = s.teamID AND t.yearID = s.yearID
GROUP BY t.name, s.yearID
HAVING AVG(s.salary) > 5000000 AND COUNT(s.playerID) > 20
ORDER BY avgSalary DESC
LIMIT 10;
This will return the team names, years, average salaries, and number of players for teams that have an average salary greater than 5 million and more than 20 players.
For this question, please perform your join on the People
, Salaries
, and Teams
tables, and then use the WHERE
and HAVING
clauses to filter the results based on the following conditions:
-
The average salary is greater than 6.8 million.
-
The number of players on the team is greater than 20.
-
The year is between 2010 and 2015.
Previously, you may have used multiple |
4.1. A single SQL query returning teamName
, yearID
, avgSalary
, and numPlayers
.
4.2. Use HAVING
to filter teams with an average salary greater than 6.8 million and more than 20 players.
4.3. Use WHERE
to filter results for the years 2010 to 2015.
4.4. How many teams meet this criteria?
Question 5 (2 points)
You should have a good understanding of how to join multiple tables, filter results using WHERE
and HAVING
, and group results using GROUP BY
. Now, let’s put all of this knowledge together to answer a more complex question.
Suppose you want to find the top 5 cheapest team-seasons where the total payroll exceeded $150 million and at least one batter on that team hit 30 or more home runs (HR) in the same year. To do this, you will need to:
-
Create a payroll aggregation by
(teamID, yearID)
to compute the total payroll for each team in each year. -
Figure out the total number of home runs (HR) hit by each player in each year using the
SUM(HR)
function, and then aggregate this by(teamID, yearID)
to compute the maximum HR for that team in that year using theMAX(HR)
function. -
Join the two aggregates on
(teamID, yearID)
. -
Filter with
HAVING
(or aWHERE
on the joined aggregates) to keep only qualifying team-seasons. -
Order the results by total payroll in ascending order and limit the results to the top 5.
This may sound complex, but creating these aggregations is very similar to what you’ve done in previous problems. For example, aggregating the payroll would look something like this:
This creates a Common Table Expression (CTE) called |
5.1. A SQL query that meets the above criteria.
5.2. What are the top 5 cheapest team-seasons that meet the criteria?
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
-
firstname_lastname_project7.ipynb
You must double check your You will not receive full credit if your |