TDM 20100: Project 6 - SQL Introduction 2
Project Objectives
Learn how to use SQL to join tables, aggregate data, and perform more complex queries. This project will help you understand how to manipulate and analyze data using SQL.
Dataset
-
'/anvil/projects/tdm/data/lahman/lahman.db'
If AI is used in any cases, such as for debugging, research, etc, we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is an “Share” option in the conversation sidebar. Click on “Create Link” and please add the shareable link as a part of your citation. The project template in the Examples Book now has a “Link to AI Chat History” section; please have this included in all your projects. If you did not use any AI tools, you may write “None”. We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must all be your own work and in your own words. No content or ideas should be directly applied or copy pasted to your projects. Please refer to the-examples-book.com/projects/fall2025/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered as academic dishonesty. |
Questions
Question 1 (2 points)
Last project, we learned about a bunch of different SQL commands, operators, and functions. However, we stuck only to the People
table. For this question, go explore some of the other tables in the Lahman dataset. You can look back on the previous project for examples of how to initially load the Lahman dataset into SQLite, and how to query the sqlite_master
table to see all the tables in the database. Once you have this list of table names, please pick at least 3 tables and look at the first 5 records in each of them. You can use the LIMIT
clause to limit the number of records returned by your query. What do you notice about the data in these tables? Are there any columns that appear in multiple places? Are there any columns that are unique to a specific table? How do the tables relate to each other?
1.1. Look at first 5 records in at least 3 tables in the Lahman dataset, not including the People
table.
1.2. Name any columns that appear in multiple tables, or only in a specific table.
1.3. Describe how the tables might relate to each other.
NOTE Did you see the note regarding the new AI policy? Scroll back up to the top
Question 2 (2 points)
If you did not happen to look at the Salaries
table in the previous question, please do so now. This table contains information about player salaries, which can be useful for analyzing player performance and team budgets. You may notice that in both the People
and Salaries
tables, there is a column called playerID
. This column can be used to help match up a player’s information from the People
table, with their salary information in the Salaries
table. However, it must be very slow to find a playerID in one table, and then look it up in another table. Luckily, SQL has already solved this problem with the JOIN
command. The JOIN
command allows us to combine data from two or more tables based on a common column between them. In this case, we can use the playerID
column to join the People
and Salaries
tables into a single table that contains both player information and salary information. This is useful, as it can let us further analyze data such as comparing player heights to their salaries, or finding the average salary of players by their birth year.
Using the JOIN
command is quite simple. You can use the following syntax:
SELECT * FROM table1 JOIN table2 ON table1.columnname = table2.columnname;
This context tells SQL to add all columns from both tables, and then match up the rows based on the common column specified in the
|
Now, please write a SQL query to join the People
and Salaries
tables using the playerID
column. Limit the results to the first 5 records using LIMIT
:
%%sql
SELECT * FROM Salaries JOIN People ON Salaries.playerID = People.playerID LIMIT 5;
If we want to perform any other operations or functions on the joined data, we can add those to the end of the query. However, you may notice that these queries can get quite long and complex. Additionally, if we want to perform multiple different operations on the joined data, we need to write multiple queries and join the same tables multiple times. This can be inefficient and hard to read. To solve this problem, we can use the WITH
clause to create a temporary table to store the joined data. Then, we can perform our operations on this temporary table. The below code shows an example of how to do this:
%%sql
WITH joined_data AS (
SELECT * FROM Salaries JOIN People ON People.playerID = Salaries.playerID
)
SELECT * FROM joined_data LIMIT 5;
This temporary table |
Now that you know how to join tables and create temporary tables, please write a SQL query that joins the People
and Salaries
tables, and then returns the average salary of players grouped by their weight, sorted in descending order. Limit the results to 10 records.
2.1. Write a SQL query to join the People
and Salaries
tables using the playerID
column, and limit the results to the first 5 records.
2.2. Write a SQL query to join the People
and Salaries
tables, and return the average salary of players grouped by their weight, sorted in descending order, limiting the results to 10 records.
Did you see the note regarding the new AI policy? Click here and read it |
Question 3 (2 points)
Let’s look at some other fun statistics we can calculate with the JOIN
command. If you didn’t in Question 1, go look at the Batting
table. This table contains information about player batting statistics, such as hits, home runs, and runs batted in (RBIs). We can use the JOIN
command to combine this data with the People
table to analyze player performance. Firstly, write a SQL query to join the People
and Batting
tables using the playerID
column:
%%sql
WITH joined_data AS (
SELECT * FROM People JOIN Batting ON People.playerID = Batting.playerID
)
Find the career number of points scored, or how many times they reached the home plate (column name 'R') for Babe Ruth.
Babe Ruth reached the home plate 2174 times in his career, so that is the number you should get.
Then, write a SQL query to find the average number of points scored by players born in the 1980s, grouped and ordered by their birth year.
3.1. SQL query to find how many times Babe Ruth reached the home plate.
3.2. SQL query to find the average number of points scored by players born in the 1980s, grouped and ordered by their birth year.
Did you see the note regarding the new AI policy? Click here and read it |
Question 4 (2 points)
Write a SQL query to find the 10 highest scoring players in the Batting
table, ordered by their career runs scored (column name 'R') in descending order. Be sure to JOIN
with the People
table so you can get the name of the player, and also GROUP BY
the playerID
to ensure you are getting the total runs scored for each player across all seasons, not just a single season.
Additionally, write another SQL query to find the 10 highest hitting players in the Batting
table, ordered by their career hits (column name 'H') in descending order.
4.1. Query to find the 10 highest scoring players and their names.
4.2. Query to find the 10 highest hitting players and their names.
4.3. Are there any players who appear in both lists? If so, who are they?
Did you see the note regarding the new AI policy? Click here and read it |
Question 5 (2 points)
There’s another table that we may find interesting. The CollegePlaying
table contains information about players who played in college before joining the major leagues:
%%sql
SELECT * FROM CollegePlaying LIMIT 5;
This can be useful for analyzing where top players come from and how their college performance may have influenced their professional careers. Write a SQL query to find the players who played at Purdue University, and the sort them by their career number of runs scored (column name 'R') in descending order. Be sure to limit the results to the first 10 records. What is the highest number of runs scored by a player from Purdue University?
You don’t need to join with the
|
Then, write another SQL query to find the average number of runs scored by players, grouped by their college name, and ordered by the average runs scored in descending order. Again, you can use the CollegePlaying
and Batting
tables for this query. Which university has the highest average runs scored by its players?
5.1. SQL query to find the players who played at Purdue University, sorted by their career runs scored in descending order, limited to the first 10 records.
5.2. Highest number of runs scored by a player from Purdue University.
5.3. SQL query to find the average number of runs scored by players, grouped by their college name, ordered by the average runs scored in descending order.
5.4. Which university has the highest average runs scored by its players?
Did you see the note regarding the new AI policy? Click here and read it |
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
-
firstname_lastname_project6.ipynb
You must double check your You will not receive full credit if your |