TDM 20100: Project 11 — 2023

Motivation: Databases are (usually) comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so, we perform "joins"! In this project we will explore, learn about, and practice using joins on our database. The database has many tables, so the benefit of using joins will become obvious.

Context: We’ve introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.

Scope: SQL, sqlite, joins

Learning Objectives
  • Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.

  • Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.

  • Showcase the ability to filter, alias, and write subqueries.

Make sure to read about, and use the template found here, and the important information about projects submissions here.


For this project, we will be using the lahman sqlite database. This database contains the data in the directory

  • /anvil/projects/tdm/data/lahman

You may get some more lahman database information from this youtube video

To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.

%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db

For every following cell where you want to run a SQL query, prepend %%sql to the top of the cell — just like we do for R or bash cells.


Question 1 (2 pts)

In the previous project, you already learned how to get data from a single table.

Now that we are learning about joins, so that we will have the ability to make much more interesting queries!

You may get more information on joins here:

Table batting contains a field H (hits) and a field AB (at-bats). We can calculate the batting average (BA) by the formula


A batting average is an indicator that shows a batter’s ability to produce offensively.

You may get more batting average information from Wikipedia:

  1. Please find the 10 players with the lowest batting average for the year 2022. Use the batting table and INNER JOIN with the people table to get players' first name and last name. The output will contain following fields: playerID, player’s first name, player’s last name, and their battingAverage.

Question 2 (2 pts)

When considering the batting average, pitchers often have a significantly lower bating average, because they are not trained hitters. To focus on regular batters, pitchers need to be excluded.

  1. Use the appearances table, to find out the players who are pitchers for the year 2022.

  2. Return to the query from Question 1, but this time, use a subquery to exclude the pitchers from Question 2a.

Pitchers have field G_p>0 in appearances table

Question 3 (2 pts)

In question 2, instead of using a sub query, we can use a left join to accomplish the same task.

  1. Modify your query from question 2, to use a left join (instead of a sub query). The goal is the same as question 2b, namely: to get the 10 players (who are not pitchers!) with lowest batting average.

Question 4 (2 pts)

  1. Write another query, to find out what is the average batting average for all players (exclude pitchers) in year 2022.

Project 11 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project11.ipynb

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.