TDM 20100: Project 8 — SQL
Motivation: Starting with this project, we will learn about databases, which allow you to write queries about data.
Context: We will focus on SQLite. Once you understand how to make queries in SQLite, you will have a strong foundation that can be used to learn other database resources, such as MySQL.
Scope: SQLite queries that do not need JOIN
Dataset(s)
This project will use the following dataset:
-
/anvil/projects/tdm/data/lahman/lahman.db
(Lahman baseball database)
Our page in The Examples Book about SQL (in general) is given here: the-examples-book.com/tools/SQL/
Before you begin the project, try the examples from the Lahman baseball database found on this webpage of The Examples Book: the-examples-book.com/tools/SQL/lahman-examples-no-joins All of these examples are (relatively) simple and do not need to |
Questions
Using the seminar
kernel, if you run this line in a cell by itself:
%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db
If your kernel dies, then you need to re-run the line above. You also need to re-run this line at the start of any new Jupyter Lab session. |
After running the line above (in your session, just once), then you can make SQL queries in subsequent cells in Jupyter Lab, like this, for example, which shows all of the information on 5 lines of the Pitching
table:
%%sql
SELECT * FROM Pitching LIMIT 5;
or like this, which shows 5 lines corresponding to the Teams
table for which the number of wins is 110 or larger.
%%sql
SELECT * FROM Teams WHERE W >= 110 LIMIT 5;
It is really important to include |
The list of all of the tables in this database are:
AllstarFull
Appearances
AwardsManagers
AwardsPlayers
AwardsShareManagers
AwardsSharePlayers
Batting
BattingPost
CollegePlaying
Fielding
FieldingOF
FieldingOFsplit
FieldingPost
HallOfFame
HomeGames
Managers
ManagersHalf
Parks
People
Pitching
PitchingPost
Salaries
Schools
SeriesPost
Teams
TeamsFranchises
TeamsHalf
Please read the examples given here: the-examples-book.com/tools/SQL/lahman-examples-no-joins and then you are ready to start the questions for this project!
Question 1 (2 pts)
-
From the
Teams
table, print the row corresponding to the 2023 data for the team withname = 'Chicago Cubs'
. Your output will be just one row, showing the Cubs overall information for 2023. -
From the
Batting
table, print the 48 rows corresponding to the 2023 data for the players from the team withteamID = 'CHN'
(this is the Chicago CubsteamID
).
For both 1a and 1b, since we only want to see the 2023 results, you need to use |
-
Print the 2023 summary data from the
Teams
table for the team withname = 'Chicago Cubs'
(just one row of output). -
Print the 48 rows of table from the
Batting
table for the 2023 Chicago Cubs players.
Question 2 (2 pts)
Print the rows of the Teams
table corresponding to the 4 rows for the winners of the 2020, 2021, 2022, 2023 World Series winning teams.
-
Print the rows of the
Teams
table corresponding to the 4 rows for the winners of the 2020, 2021, 2022, 2023 World Series winning teams.
Question 3 (2 pts)
-
Considering the
People
table, find theplayerID
for Rickey Henderson. -
Using the
playerID
that you found in question 3a, now use theBatting
table to print all of the rows corresponding to Rickey Henderson’splayerID
. -
Finally, again using the
Batting
table, print only theSUM(R)
andSUM(SB)
for Rickey Henderson, which are his total number of runs in his career and his total number of stolen bases in his career.
He had 2295 runs scored altogether and 1406 stolen bases. |
-
Use the
People
table to find Rickey Henderson’splayerID
-
Print all of the rows of the
Batting
table corresponding to Rickey Henderson. -
Print only the sum of his number of runs in his career and also the sum of his number of stolen bases in his career.
Question 4 (2 pts)
-
Use the
Batting
table to find the top 5 players of all time, in terms of their total number of hits, in other words, according toSUM(H)
. Please print only the top 5 players (theirplayerID
) and the number of hits in each of their careers. -
Same question as 4a, but this time use home runs (according to
SUM(HR)
) instead of hits.
-
Print only the top 5 players' IDs and the number of hits in each of their careers.
-
Print only the top 5 players' IDs and the number of home runs in each of their careers.
Question 5 (2 pts)
Consider the Schools
table, group together the schools in each state. Print the number of schools in each group, using COUNT(*) as mycounts, state
so that you see how many schools are in each state, and the state abbreviation too. Order your results according to the values of mycounts
in descending order (which is denoted by DESC
), in other words, the states with the most schools are printed first in your list. In this way, by using LIMIT 5
, you will display the states with the most schools.
-
Print a list of the top 5 states according to how many schools are located there, and the number of schools in each of those top 5 states.
Submitting your Work
We hope that you enjoyed learning about databases this week! Please let us know if we can assist, as you are learning these new ideas!
-
firstname-lastname-project8.ipynb
You must double check your You will not receive full credit if your |