STAT 39000: Project 9 — Fall 2020

Motivation: Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, cloudflare, a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per this thread on hackernews). Learning SQL is well worth your time!

Context: There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we’ve spent much of this semester in the terminal, we will start in the terminal using SQLite.

Scope: SQL, sqlite

Learning objectives
  • Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.

  • Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.

  • Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/lahman/lahman.db

This is the Lahman Baseball Database. You can find its documentation here, including the definitions of the tables and columns.

Questions

Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to lose points.

For this project all solutions should be done using SQL code chunks. To connect to the database, copy and paste the following before your solutions in your .Rmd

````markdown

library(RSQLite)
lahman <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/lahman/lahman.db")

``

Each solution should then be placed in a code chunk like this:

````markdown

SELECT * FROM batting LIMIT 1;

``

If you want to use a SQLite-specific function like .tables (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:

sqlite3 /class/datamine/data/lahman/lahman.db

From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F like this:

````markdown

SELECT * FROM batting LIMIT 1;

``

This will allow the code to be displayed without throwing an error.

Question 1

Connect to RStudio Server rstudio.scholar.rcac.purdue.edu, and navigate to the terminal and access the Lahman database. How many tables are available?

To connect to the database, do the following:

sqlite3 /class/datamine/data/lahman/lahman.db

This is a good resource.

Items to submit
  • How many tables are available in the Lahman database?

  • The sqlite3 commands used to figure out how many tables are available.

Question 2

Some people like to try to visit all 30 MLB ballparks in their lifetime. Use SQL commands to get a list of parks and the cities they’re located in. For your final answer, limit the output to 10 records/rows.

There may be more than 30 parks in your result, this is ok. For long results, you can limit the number of printed results using the LIMIT clause.

Make sure you take a look at the column names and get familiar with the data tables. If working from the Terminal, to see the header row as a part of each query result, run the following:

.headers on
Items to submit
  • SQL code used to solve the problem.

  • The first 10 results of the query.

Question 3

There is nothing more exciting to witness than a home run hit by a batter. It’s impressive if a player hits more than 40 in a season. Find the hitters who have hit 60 or more home runs (HR) in a season. List their playerID, yearID, home run total, and the teamID they played for.

There are 8 occurrences of home runs greater than or equal to 60.

The batting table is where you should look for this question.

Items to submit
  • SQL code used to solve the problem.

  • The first 10 results of the query.

Question 4

Make a list of players born on your birth day (don’t worry about the year). Display their first names, last names, and birth year. Order the list descending by their birth year.

The people table is where you should look for this question.

Examples that utilize the relevant topics in this problem can be found here.

Items to submit
  • SQL code used to solve the problem.

  • The first 10 results of the query.

Question 5

Get the Cleveland (CLE) Pitching Roster from the 2016 season (playerID, W, L, SO). Order the pitchers by number of Strikeouts (SO) in descending order.

The pitching table is where you should look for this question.

Examples that utilize the relevant topics in this problem can be found here.

Items to submit
  • SQL code used to solve the problem.

  • The first 10 results of the query.

Question 6

Find the 10 team and year pairs that have the most number of Errors (E) between 1960 and 1970. Display their Win and Loss counts too. What is the name of the team that appears in 3rd place in the ranking of the team and year pairs?

The teams table is where you should look for this question.

The BETWEEN clause is useful here.

It is OK to use multiple queries to answer the question.

Examples that utilize the relevant topics in this problem can be found here.

Items to submit
  • SQL code used to solve the problem.

  • The first 10 results of the query.

Question 7

Find the playerID for Bob Lemon. What year and team was he on when he got the most wins as a pitcher (use table pitching)? What year and team did he win the most games as a manager (use table managers)?

It is OK to use multiple queries to answer the question.

There was a tie among the two years in which Bob Lemon had the most wins as a pitcher.

Examples that utilize the relevant topics in this problem can be found here.

Items to submit
  • SQL code used to solve the problem.

  • The first 10 results of the query.

Question 8

For the AL West (use lgID and divID to specify this), find the home run (HR), walk (BB), and stolen base (SB) totals by team between 2000 and 2010. Which team and year combo led in each category in the decade?

The teams table is where you should look for this question.

It is OK to use multiple queries to answer the question.

Use divID == 'W' as one of the conditions. Please note using double quotes: divID == "W" will not work.

Examples that utilize the relevant topics in this problem can be found here.

Items to submit
  • SQL code used to solve the problem.

  • The team-year combination that ranked top in each category.

Question 9

Get a list of the following by year: wins (W), losses (L), Home Runs Hit (HR), homeruns allowed (HRA), and total home game attendance (attendance) for the Detroit Tigers when winning a World Series (WSWin is Y) or when winning league champion (LgWin is Y).

The teams table is where you should look for this question.

Be careful with the order of operations for AND and OR. Remember you can force order of operations using parentheses.

Examples that utilize the relevant topics in this problem can be found here.

Items to submit
  • SQL code used to solve the problem.

  • The first 10 results of the query.