TDM 20100: Project 8 — 2023

Motivation: Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. Learning SQL is well worth your time!

Context: There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we’ve spent much of this semester in the terminal, we will start in the terminal using SQLite.

Scope: SQL, SQlite

Learning Objectives
  • Explain the advantages and disadvantages of using a database

  • Describe basic database concepts like: RDBMS, tables, fields, query, join, clause.

  • Basic clauses: select, limit, where, from, etc.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

For this project, we will be using the lahman sqlite database. This database contains the data in the directory

  • /anvil/projects/tdm/data/lahman

You may get some more lahman database information from this youtube video youtube.com/watch?v=tS_-oTbsDzs

To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.

%sql sqlite:////anvil/projects/tdm/data/lahman/lahman.db

For every following cell where you want to run a SQL query, prepend %%sql to the top of the cell — just like we do for R or bash cells.

Questions

Question 1 (2 pts)

Get started by taking a look at the available tables in the Lahman database.

  1. What tables are available in the Lahman database?

You’ll want to prepend %%sql to the top of the cell — it should be the very first line of the cell (no comments or anything else before it).

%%sql

-- Query here

In SQLite, you can show the tables using the following query:

.tables

Unfortunately, SQLite-specific functions can’t be run in a Jupyter Lab cell like that. Instead, we need to use a different query.

SELECT tbl_name FROM sqlite_master WHERE type='table';

Question 2 (2 pts)

  1. It’s always a good idea to learn what your table(s) looks like. A good way to do this is to get the first 5 rows of data from the table(s). Write and run queries that return the first 5 rows of data for the people table, the batting table, the fielding table, the managers table, and 2 more tables of your choice (you can pick any 2 more tables to consider).

  2. To get a better idea of the size of the data, you can use the count clause to get the number of rows in each table. Write and run 6 queries that return the number of rows in each of these 6 tables.

Run each query in a separate cell, and remember to limit the query to return only 5 rows each.

You can use the limit clause to limit the number of rows returned.

Question 3 (1 pt)

Okay, let’s dig into the people table a little bit. Run the following query.

SELECT * FROM people LIMIT 5;

As you can see, every row has a playerID for each player. It is a unique identifier or key for the people table. In Question 2, you checked several tables, so you might already notice that a few tables contain this playerID such as in table batting, fielding, managers etc. The playerID relates data from those tables to the specific player.

  1. Let us find information about a famous baseball player named Mike Trout from the people table.

The WHERE clause can be used to filter the results of a query. Use table fields nameLast and nameFirst for the query.

Question 4 (1 pt)

Now you understand what the playerID means inside the database.

SELECT * FROM batting where playerID ='troutmi01'

The query will output all fields of data for Mike Trout from table batting

  1. First use Mike Trout’s playerID (from Question 3) to find the number of his home runs in each season.

  2. Now make a second query that only displays Mike Trout’s data for the year 2022 but includes the playerID, teamID, and number of home runs.

The HR field contains the number of home runs.

Question 5 (2 pts)

Now pick a different baseball player (your choice!) and find that baseball player’s information in the database.

  1. For this baseball player, please find the baseball player’s information from the people table

  2. Please use the playerID to get this player’s number of home runs in the year 2022.

  3. Please join the people table and the batting table, to display information from the fields of nameLast, nameFirst, weight, height, birthYear, and number of home runs in the year 2022, along with the teamID, and yearID.

You may refer to the following website for SQLite table join examples www.sqlitetutorial.net/sqlite-join/

Use yearID from the batting table for the Year.

Project 08 Assignment Checklist

  • Jupyter notebook with your code, comments and output for questions 1 to 5

    • firstname-lastname-project08.ipynb

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project