TDM 20100: Project 10 — SQL

Motivation: Now we learn how to write SQL queries that rely on three or more tables.

Context: We can use multiple JOIN statements in SQL. Each JOIN statement allows us to add data from an additional table. Each JOIN needs its own ON statement too.

Scope: We can use several JOIN statements in the same query, to pull data from several tables.

Learning Objectives:
  • We will learn how to make SQLite queries on three or more tables at a time (using the JOIN)

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset:

  • /anvil/projects/tdm/data/lahman/lahman.db (Lahman baseball database)

Our page in The Examples Book about SQL (in general) is given here: the-examples-book.com/tools/SQL/

Now we learn how to join THREE OR MORE tables using multiple JOIN statements! Before you begin the project, try the examples from the Lahman baseball database found on this webpage of The Examples Book: the-examples-book.com/tools/SQL/lahman-examples-two-joins All of these examples rely on two (or more) JOIN statements, to extract information from three (or more) tables.

INNER JOIN and JOIN are exactly the same thing. If you see INNER JOIN and you prefer to write JOIN, that is totally OK.

Questions

Using the seminar kernel, if you run this line in a cell by itself:

%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db

If your kernel dies, then you need to re-run the line above. You also need to re-run this line at the start of any new Jupyter Lab session.

Again, we remind students that the list of all of the tables in this database are:

AllstarFull
Appearances
AwardsManagers
AwardsPlayers
AwardsShareManagers
AwardsSharePlayers
Batting
BattingPost
CollegePlaying
Fielding
FieldingOF
FieldingOFsplit
FieldingPost
HallOfFame
HomeGames
Managers
ManagersHalf
Parks
People
Pitching
PitchingPost
Salaries
Schools
SeriesPost
Teams
TeamsFranchises
TeamsHalf

Please read the examples given here: the-examples-book.com/tools/SQL/lahman-examples-two-joins and then you are ready to start the questions for this project!

In the page of examples, sometimes we write JOIN and sometimes we write INNER JOIN. These are interchangeable; in other words, JOIN and INNER JOIN mean the same thing. (There are other types of statements such as LEFT JOIN and RIGHT JOIN but we will not use either of these, in this project.)

Question 1 (2 pts)

Revisit Project 9, Question 1: Now add a third table, namely, the AwardsPlayers table, so that we see the 4 Chicago Cubs who won a total of 7 awards in 2023.

For each of the 4 Chicago Cubs who won these 7 awards in 2023, in addition to printing their PlayerID, hits (H), home runs (HR), nameFirst, and nameLast, please also add one more variable in the SELECT statement to print, namely, the awardID from the AwardsPlayers table.

When you join the AwardsPlayers table, it is necessary to join ON the value of the playerID and the yearID.

Deliverables
  • Print the playerID, H, HR, nameFirst, nameLast, and awardID values for all 4 of the players on the 2023 Chicago Cubs team who won 7 awards altogether during that year.

Question 2 (2 pts)

Revisit Project 9, Question 2: Now add a third table, namely, the People table, so that we can extract the first and last name of the player:

Join the Batting table to the Pitching table by matching the playerID, yearID, and stint columns, and in addition, now also join the People table. There is only one person from 2023 appearing in both of these tables that hit more than 30 home runs. Print this person’s playerID, home runs (HR), first name (nameFirst), and last name (nameLast).

If you write Pitching as p and also People as p then your query will not work. Choose a different letter for one of the tables. For instance, you might write: People as m (for example!) or any letter you choose is OK. You simply cannot use the same letter for different tables.

Deliverables
  • Print the PlayerID, home runs (HR), first name (nameFirst), and last name (nameLast) for the only person who is in both the Batting and Pitching table in 2023 who had more than 30 home runs (HR) in the Batting table.

Question 3 (2 pts)

Revisit Project 9, Question 3, but this time study Omar Vizquel instead of Rickey Henderson: Now add a third table, namely, the Salaries table, so that we can find the total amount of salary that Omar Vizquel earned in his career:

Join the People and Batting and Salaries tables. Print only 1 row, corresponding to Omar Vizquel, displaying his playerID, nameFirst, nameLast, SUM(R), SUM(SB), and SUM(salary) values.

Omar Vizquel had 1445 runs scored altogether and 404 stolen bases, and he made more than 60 million dollars in salary during his career! Your solution will show his exact total salary during his career.

Deliverables
  • Print only 1 row, corresponding to Omar Vizquel, displaying his playerID, nameFirst, nameLast, SUM(R), SUM(SB), and SUM(salary) values.

Question 4 (2 pts)

  1. Revisit Project 9, Question 4, but now join the Batting, People, and Appearances table, to find the top 5 players of all time, in terms of their total number of hits, in other words, according to SUM(H). For the top 5 players (in terms of the total number of hits), print their playerID, the SUM(H) (in other words, their total number of hits in their careers), their nameFirst and nameLast values, and now also include a column that shows the SUM(G_all) which is the total number of games played in their career. [Do not change the ordering from Project 9, Question 4; in other words, please continue to keep the results in order by the total number of hits.]

  2. Same question as 4b, but this time use home runs (according to SUM(HR)) instead of hits.

When you join the Appearances table, make sure that the playerID and yearID and teamID are all in agreement with the Batting table.

Deliverables
  • For the top 5 players (in terms of the total number of hits), print their playerID, the SUM(H) (in other words, their total number of hits in their careers), their nameFirst and nameLast values, and now also include a column that shows the SUM(G_all) which is the total number of games played in their career.

  • For the top 5 players (in terms of the total number of home runs), print their playerID, the SUM(HR) (in other words, their total number of home runs in their careers), their nameFirst and nameLast values, and now also include a column that shows the SUM(G_all) which is the total number of games played in their career.

Question 5 (2 pts)

Join the CollegePlaying and People and HallOfFame tables to find the playerID, nameFirst, nameLast, yearID, ballots, needed, votes, and inducted values for the only player who had schoolID = 'purdue' in the CollegePlaying table and who also appears in the HallOfFame table. [There is only 1 such player!]

Deliverables
  • Print the playerID, nameFirst, nameLast, yearID, ballots, needed, votes, and inducted values for the only player who had schoolID = 'purdue' in the CollegePlaying table and who also appeared in the HallOfFame table.

Submitting your Work

Now that you know how to join three tables together, you are very knowledgeable about SQL databases!

Items to submit
  • firstname-lastname-project10.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.