STAT 29000: Project 10 — Fall 2020

Motivation: Although SQL syntax may still feel unnatural and foreign, with more practice it will start to make more sense. The ability to read and write SQL queries is a bread-and-butter skill for anyone working with data.

Context: We are in the second of a series of projects that focus on learning the basics of SQL. In this project, we will continue to harden our understanding of SQL syntax, and introduce common SQL functions like AVG, MIN, and MAX.

Scope: SQL, sqlite

Learning objectives
  • Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.

  • Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.

  • Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.

  • Utilize SQL functions like min, max, avg, sum, and count to solve data-driven problems.

Dataset

The following questions will use the dataset similar to the one from Project 9, but this time we will use a MariaDB version of the database, which is also hosted on Scholar, at scholar-db.rcac.purdue.edu. As in Project 9, this is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.

Questions

Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to lose points.

For this project all solutions should be done using R code chunks, and the RMariaDB package. Run the following code to load the library:

library(RMariaDB)

Question 1

Connect to RStudio Server rstudio.scholar.rcac.purdue.edu, and, rather than navigating to the terminal like we did in the previous project, instead, create a connection to our MariaDB lahman database using the RMariaDB package in R, and the credentials below. Confirm the connection by running the following code chunk:

con <- dbConnect(RMariaDB::MariaDB(),
                 host="scholar-db.rcac.purdue.edu",
                 db="lahmandb",
                 user="lahman_user",
                 password="HitAH0merun")
head(dbGetQuery(con, "SHOW tables;"))

In the example provided, the variable con from the dbConnect function is the connection. Each query that you make, using the dbGetQuery, needs to use this connection con. You can change the name con if you want to (it is user defined), but if you change the name con, you need to change it on all of your connections. If your connection to the database dies while you are working on the project, you can always re-run the dbConnect line again, to reset your connection to the database.

Items to submit
  • R code used to solve the problem.

  • Output from running your (potentially modified) head(dbGetQuery(con, "SHOW tables;")).

Question 2

How many players are members of the 40/40 club? These are players that have stolen at least 40 bases (SB) and hit at least 40 home runs (HR) in one year.

Use the batting table.

You only need to run library(RMariaDB) and the dbConnect portion of the code a single time towards the top of your project. After that, you can simply reuse your connection con to run queries.

In our project template, for this project, make all of the SQL queries using the dbGetQuery function, which returns the results directly in R. Therefore, your RMarkdown blocks for this project should all be {r} blocks (as opposed to the {sql} blocks used in Project 9).

You can use dbGetQuery to run your queries from within R. Example:

dbGetQuery(con, "SELECT * FROM battings LIMIT 5;")

We already demonstrated the correct SQL query to use for the 40/40 club in the video below, but now we want you to use RMariaDB to solve this query.

Items to submit
  • R code used to solve the problem.

  • The result of running the R code.

Question 3

Find Corey Kluber’s lifetime across his career (i.e., use SUM from SQL to summarize his achievements) in two categories: strikeouts (SO) and walks (BB). Also display his Strikeouts to Walks ratio. A Strikeout to Walks ratio is calculated by this equation: $\frac{Strikeouts}{Walks}$.

Questions in this project need to be solved using SQL when possible. You will not receive credit for a question if you use sum in R rather than SUM in SQL.

Use the people table to find the playerID and use the pitching table to find the statistics.

Items to submit
  • R code used to solve the problem.

  • The result of running the R code.

Question 4

How many times in total has Giancarlo Stanton struck out in years in which he played for "MIA" or "FLO"?

Use the people table to find the playerID and use the batting table to find the statistics.

Items to submit
  • R code used to solve the problem.

  • The result of running the R code.

Question 5

The Batting Average is a metric for a batter’s performance. The Batting Average in a year is calculated by \$\frac{H}{AB}\$ (the number of hits divided by at-bats). Considering (only) the years between 2000 and 2010, calculate the (seasonal) Batting Average for each batter who had more than 300 at-bats in a season. List the top 5 batting averages next to playerID, teamID, and yearID.

Use the batting table.

Items to submit
  • R code used to solve the problem.

  • The result of running the R code.

Question 6

How many unique players have hit > 50 home runs (HR) in a season?

Use the batting table.

If you view DISTINCT as being paired with SELECT, instead, think of it as being paired with one of the fields you are selecting.

Items to submit
  • R code used to solve the problem.

  • The result of running the R code.

Question 7

Find the number of unique players that attended Purdue University. Start by finding the schoolID for Purdue and then find the number of players who played there. Do the same for IU. Who had more? Purdue or IU? Use the information you have in the database, and the power of R to create a misleading graphic that makes Purdue look better than IU, even if just at first glance. Make sure you label the graphic.

Use the schools table to get the schoolID`s, and the `collegeplaying table to get the statistics.

You can mess with the scale of the y-axis. You could (potentially) filter the data to start from a certain year or be between two dates.

To find IU’s id, try the following query: SELECT schoolID FROM schools WHERE name_full LIKE '%indiana%';. You can find more about the LIKE clause and % here.

Items to submit
  • R code used to solve the problem.

  • The result of running the R code.