TDM 20100: Project 8 — SQL

Motivation: Starting with this project, we will learn about databases, which allow you to write queries about data.

Context: We will focus on SQLite. Once you understand how to make queries in SQLite, you will have a strong foundation that can be used to learn other database resources, such as MySQL.

Scope: SQLite queries that do not need JOIN

Learning Objectives:
  • We will learn how to make SQLite queries on one table at a time (without using the JOIN)

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset:

  • /anvil/projects/tdm/data/lahman/lahman.db (Lahman baseball database)

Our page in The Examples Book about SQL (in general) is given here: the-examples-book.com/tools/SQL/

Before you begin the project, try the examples from the Lahman baseball database found on this webpage of The Examples Book: the-examples-book.com/tools/SQL/lahman-examples-no-joins All of these examples are (relatively) simple and do not need to JOIN two tables. They just rely on one table of the database at a time.

Questions

Using the seminar kernel, if you run this line in a cell by itself:

%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db

If your kernel dies, then you need to re-run the line above. You also need to re-run this line at the start of any new Jupyter Lab session.

After running the line above (in your session, just once), then you can make SQL queries in subsequent cells in Jupyter Lab, like this, for example, which shows all of the information on 5 lines of the Pitching table:

%%sql
SELECT * FROM Pitching LIMIT 5;

or like this, which shows 5 lines corresponding to the Teams table for which the number of wins is 110 or larger.

%%sql
SELECT * FROM Teams WHERE W >= 110 LIMIT 5;

It is really important to include LIMIT 5 or something similar, for instance, LIMIT 20, so that you do not try to print all of the results from a SQL table in your Jupyter Lab notebook.

The list of all of the tables in this database are:

AllstarFull
Appearances
AwardsManagers
AwardsPlayers
AwardsShareManagers
AwardsSharePlayers
Batting
BattingPost
CollegePlaying
Fielding
FieldingOF
FieldingOFsplit
FieldingPost
HallOfFame
HomeGames
Managers
ManagersHalf
Parks
People
Pitching
PitchingPost
Salaries
Schools
SeriesPost
Teams
TeamsFranchises
TeamsHalf

Please read the examples given here: the-examples-book.com/tools/SQL/lahman-examples-no-joins and then you are ready to start the questions for this project!

Question 1 (2 pts)

  1. From the Teams table, print the row corresponding to the 2023 data for the team with name = 'Chicago Cubs'. Your output will be just one row, showing the Cubs overall information for 2023.

  2. From the Batting table, print the 48 rows corresponding to the 2023 data for the players from the team with teamID = 'CHN' (this is the Chicago Cubs teamID).

For both 1a and 1b, since we only want to see the 2023 results, you need to use yearID = 2023 as a condition in your query.

Deliverables
  • Print the 2023 summary data from the Teams table for the team with name = 'Chicago Cubs' (just one row of output).

  • Print the 48 rows of table from the Batting table for the 2023 Chicago Cubs players.

Question 2 (2 pts)

Print the rows of the Teams table corresponding to the 4 rows for the winners of the 2020, 2021, 2022, 2023 World Series winning teams.

Deliverables
  • Print the rows of the Teams table corresponding to the 4 rows for the winners of the 2020, 2021, 2022, 2023 World Series winning teams.

Question 3 (2 pts)

  1. Considering the People table, find the playerID for Rickey Henderson.

  2. Using the playerID that you found in question 3a, now use the Batting table to print all of the rows corresponding to Rickey Henderson’s playerID.

  3. Finally, again using the Batting table, print only the SUM(R) and SUM(SB) for Rickey Henderson, which are his total number of runs in his career and his total number of stolen bases in his career.

He had 2295 runs scored altogether and 1406 stolen bases.

Deliverables
  • Use the People table to find Rickey Henderson’s playerID

  • Print all of the rows of the Batting table corresponding to Rickey Henderson.

  • Print only the sum of his number of runs in his career and also the sum of his number of stolen bases in his career.

Question 4 (2 pts)

  1. Use the Batting table to find the top 5 players of all time, in terms of their total number of hits, in other words, according to SUM(H). Please print only the top 5 players (their playerID) and the number of hits in each of their careers.

  2. Same question as 4a, but this time use home runs (according to SUM(HR)) instead of hits.

Deliverables
  • Print only the top 5 players' IDs and the number of hits in each of their careers.

  • Print only the top 5 players' IDs and the number of home runs in each of their careers.

Question 5 (2 pts)

Consider the Schools table, group together the schools in each state. Print the number of schools in each group, using COUNT(*) as mycounts, state so that you see how many schools are in each state, and the state abbreviation too. Order your results according to the values of mycounts in descending order (which is denoted by DESC), in other words, the states with the most schools are printed first in your list. In this way, by using LIMIT 5, you will display the states with the most schools.

Deliverables
  • Print a list of the top 5 states according to how many schools are located there, and the number of schools in each of those top 5 states.

Submitting your Work

We hope that you enjoyed learning about databases this week! Please let us know if we can assist, as you are learning these new ideas!

Items to submit
  • firstname-lastname-project8.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.