TDM 20100: Project 9 — 2023

Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a "bread-and-butter" skill for anyone working with data.

Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG, COUNT, and MAX.

Scope: SQL, sqlite

Learning Objectives
  • Describe basic database concepts like: RDBMs, tables, fields, query, clause,etc.

  • Basic clauses: select, order by, limit, desc, asc, count, where, from, group by, etc.

  • Utilize SQL functions like max, avg, sum, count, cast,round etc.

Make sure to read about, and use the template found here, and the important information about projects submissions here.


For this project, we will be using the lahman sqlite database. This database contains the data in the directory

  • /anvil/projects/tdm/data/lahman

You may get some more lahman database information from this youtube video 2023 SABR Analytics:Sean Lahman, "introduction to Baseball Databases"

To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.

%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db

For every following cell where you want to run a SQL query, prepend %%sql to the top of the cell — just like we do for R or bash cells.


In previous projects, we used awk to parse through and summarize data. Using SQL gives us more ways to analyze and summarize data.

Make sure all queries limit output to only 100 rows. You may refer to SQLite-Limit Syntax If you want the headers to be more descriptive, you can use aliases.You may refer to SQLite-aliases Syntax

Question 1 (2 pts)

  1. Write a query to find out who won the 2022 World Baseball Series from table seriespost?

  2. For this champion team, please find out the home runs (hr) rate and runs batted in (rbi) rate in year 2022 from table batting. Round the rates to 2 decimals. You may get rates by

    • hr_rate = sum of home runs / sum of hits

    • rbi_rate = sum of runs batted in / sum of hits

Use the sum aggregate function to calculate the totals, and division to figure out the percentages(rates).

cast is useful to convert integer to real number to do calculation, e.g.

select cast (HR AS REAL) from batting

Try to do the calculation without cast. What do you get? Also, round is useful to round to a decimal.

Question 2 (2 pts)

  1. For the champion team from question 1, please write a query that counts the number of RBIs for each athlete in the champion team during year 2022, using the batting table. Display your output in ascending order.

  2. Run the query again, but this time, display the output in descending order.

  3. Which athlete has the highest RBIs in this question? Please provide the player’s playerID, along with their first name and last name, from the people table

  • Use group by to group for each athlete

  • Use order by to sort the output

Question 3 (2 pts)

  1. Write a query that finds how many times the athlete from question 2 attended All Star Games.

  2. Write a query to find out who is the athlete that attended most All Star Games in the entire data set.

Question 4 (1 pt)

  1. Write a query that gets the average salary for each athlete in the database. Display your output in descending order. Limit the output to 100 rows (i.e., to the top 100 salaries).

Question 5 (1 pt)

Now create your own query about a topic that you are interested in. Use at least one of the aggregation functions, such as min, max, count, or sum. Be sure to use group by and display the results in order with order by.

Project 09 Assignment Checklist

  • Jupyter notebook with your code, comments and output for questions 1 to 5

    • firstname-lastname-project09.ipynb

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.