TDM 20100: Project 9 — SQL

Motivation: Now we learn how to write SQL queries that rely on more than one table.

Context: The JOIN in SQL enables us to make queries that rely on information from multiple SQL tables. It is absolutely important to tell SQL which rows need to agree, by including the ON portion of the JOIN statement.

Scope: SQLite queries use a JOIN to gather information from more than one table.

Learning Objectives:
  • We will learn how to make SQLite queries on multiple tables at a time (using the JOIN)

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset:

  • /anvil/projects/tdm/data/lahman/lahman.db (Lahman baseball database)

Our page in The Examples Book about SQL (in general) is given here: the-examples-book.com/tools/SQL/

Before you begin the project, try the examples from the Lahman baseball database found on this webpage of The Examples Book: the-examples-book.com/tools/SQL/lahman-examples-one-join All of these examples rely on one JOIN statement, to extract information from two tables.

Questions

Using the seminar kernel, if you run this line in a cell by itself:

%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db

If your kernel dies, then you need to re-run the line above. You also need to re-run this line at the start of any new Jupyter Lab session.

Again, we remind students that the list of all of the tables in this database are:

AllstarFull
Appearances
AwardsManagers
AwardsPlayers
AwardsShareManagers
AwardsSharePlayers
Batting
BattingPost
CollegePlaying
Fielding
FieldingOF
FieldingOFsplit
FieldingPost
HallOfFame
HomeGames
Managers
ManagersHalf
Parks
People
Pitching
PitchingPost
Salaries
Schools
SeriesPost
Teams
TeamsFranchises
TeamsHalf

Please read the examples given here: the-examples-book.com/tools/SQL/lahman-examples-one-join and then you are ready to start the questions for this project!

In the page of examples, sometimes we write JOIN and sometimes we write INNER JOIN. These are interchangeable; in other words, JOIN and INNER JOIN mean the same thing. (There are other types of statements such as LEFT JOIN and RIGHT JOIN but we will not use either of these, in this project.)

Question 1 (2 pts)

Join the Batting table to the People by matching the playerID values in these two tables. For all 48 players on the 2023 Chicago Cubs team, print their PlayerID (from either table), as well as their hits (H) and home runs (HR) from the Batting table, and also their nameFirst and nameLast from the People table.

Deliverables
  • Print the playerID, H, HR, nameFirst, and nameLast values for all 48 of the players on the 2023 Chicago Cubs team.

Question 2 (2 pts)

Join the Batting table to the Pitching table by matching the playerID, yearID, stint, and yearID columns. There is only one person from 2023 appearing in both of these tables that hit more than 30 home runs. Print this person’s playerID and the number of home runs (HR) that they attained (from the Batting table).

Deliverables
  • Print the PlayerID and the number of home runs (HR) from the Batting table for the only person who is in both the Batting and Pitching table in 2023 who had more than 30 home runs (HR) in the Batting table.

Question 3 (2 pts)

In this question, we will accomplish everything from Project 8, Question 3abc in just one query.

Join the People and Batting table by matching the playerID values in these two tables. Print only 1 row, corresponding to Rickey Henderson, displaying his playerID, nameFirst, nameLast, SUM(R), and SUM(SB) values.

He had 2295 runs scored altogether and 1406 stolen bases.

Deliverables
  • Print only 1 row, corresponding to Rickey Henderson, displaying his playerID, nameFirst, nameLast, SUM(R), and SUM(SB) values.

Question 4 (2 pts)

  1. As in Project 8, Question 4a, use the Batting table but now also JOIN the People table (by matching the playerID values), to find the top 5 players of all time, in terms of their total number of hits, in other words, according to SUM(H). For the top 5 players (in terms of the total number of hits), print their playerID, the SUM(H) (in other words, their total number of hits in their careers), and their nameFirst and nameLast values.

  2. Same question as 4b, but this time use home runs (according to SUM(HR)) instead of hits.

Deliverables
  • For the top 5 players (in terms of the total number of hits), print their playerID, the SUM(H) (in other words, their total number of hits in their careers), and their nameFirst and nameLast values.

  • For the top 5 players (in terms of the total number of home runs), print their playerID, the SUM(HR) (in other words, their total number of home runs in their careers), and their nameFirst and nameLast values.

Question 5 (2 pts)

Join the CollegePlaying and People tables on the playerID values. Print the DISTINCT(playerID) and nameFirst and nameLast values from the playerID table for each of the 15 distinct players that have schoolID = 'purdue' in the CollegePlaying table.

Deliverables
  • Print the DISTINCT(playerID) and nameFirst and nameLast values from the playerID table for each of the 15 distinct players that have schoolID = 'purdue' in the CollegePlaying table.

Submitting your Work

We hope that you enjoyed learning about databases this week! Please let us know if we can assist, as you are learning these new ideas!

Items to submit
  • firstname-lastname-project9.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.