STAT 29000: Project 11 — Fall 2020

Motivation: Being able to use results of queries as tables in new queries (also known as writing sub-queries), and calculating values like MIN, MAX, and AVG in aggregate are key skills to have in order to write more complex queries. In this project we will learn about aliasing, writing sub-queries, and calculating aggregate values.

Context: We are in the middle of a series of projects focused on working with databases and SQL. In this project we introduce aliasing, sub-queries, and calculating aggregate values using a much larger dataset!

Scope: SQL, SQL in R

Learning objectives
  • Demonstrate the ability to interact with popular database management systems within R.

  • Solve data-driven problems using a combination of SQL and R.

  • Basic clauses: SELECT, ORDER BY, LIMIT, DESC, ASC, COUNT, WHERE, FROM, etc.

  • Showcase the ability to filter, alias, and write subqueries.

  • Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING. Explain when to use having, and when to use where.

Dataset

The following questions will use the elections database. Similar to Project 10, this database is hosted on Scholar. Moreover, Question 1 also involves the following data files found in Scholar:

/class/datamine/data/election/itcontYYYY.txt (for example, data for year 1980 would be /class/datamine/data/election/itcont1980.txt)

A public sample of the data can be found here:

Questions

For this project you will need to connect to the database elections using the RMariaDB package in R. Include the following code chunk in the beginning of your RMarkdown file:

````markdown

library(RMariaDB)
con <- dbConnect(RMariaDB::MariaDB(),
                 host="scholar-db.rcac.purdue.edu",
                 db="elections",
                 user="elections_user",
                 password="Dataelect!98")

``

When a question involves SQL queries in this project, you may use a SQL code chunk (with {sql}), or an R code chunk (with {r}) and functions like dbGetQuery as you did in Project 10. Please refer to Question 5 in the project template for examples.

Question 1

Approximately how large was the lahman database (use the sqlite database in Scholar: /class/datamine/data/lahman/lahman.db)? Use UNIX utilities you’ve learned about this semester to write a line of code to return the size of that .db file (in MB).

The data we consider in this project are much larger. Use UNIX utilities (bash and awk) to write another line of code that calculates the total amount of data in the elections folder /class/datamine/data/election/. How much data (in MB) is there?

The data in that folder has been added to the elections database, all aggregated in the elections table. Write a SQL query that returns the number of rows of data are in the database. How many rows of data are in the table elections?

These are some examples of how to get the sizes of collections of files in UNIX:

The SQL query will take some time! Be patient.

You may use more than one code chunk in your RMarkdown file for the different tasks.

We will accept values that represent either apparent or allocated size, as well as estimated disk usage. To get the size from ls and du to match, use the --apparent-size option with du.

A Megabyte (MB) is actually 1000^2 bytes, not 1024^2. A Mebibyte (MiB) is 1024^2 bytes. See here for more information. For this question, either solution will be given full credit. This is a potentially useful example.

Items to submit
  • Line of code (bash/awk) to show the size (in MB) of the lahman database file.

  • Approximate size of the lahman database in MB.

  • Line of code (bash/awk) to calculate the size (in MB) of the entire elections dataset in /class/datamine/data/election.

  • The size of the elections data in MB.

  • SQL query used to find the number of rows of data in the elections table in the elections database.

  • The number of rows in the elections table in the elections database.

Question 2

Write a SQL query using the LIKE command to find a unique list of zip_code that start with "479".

Write another SQL query and answer: How many unique zip_code are there that begin with "479"?

Here are some examples about SQL that might be relevant for Questions 2 and 3 in this project.

The first query returns a list of zip codes, and the second returns a count.

Make sure you only select zip_code.

Items to submit
  • SQL queries used to answer the question.

  • The first 5 results from running the query.

Question 3

Write a SQL query that counts the number of donations (rows) that are from Indiana. How many donations are from Indiana? Rewrite the query and create an alias for our field so it doesn’t read COUNT(*) but rather Indiana Donations.

You may enclose an alias’s name in quotation marks (single or double) when the name contains space.

Items to submit
  • SQL query used to answer the question.

  • The result of the SQL query.

Question 4

Rewrite the query in (3) so the result is displayed like: IN: 1234567. Note, if instead of "IN" we wanted "OH", only the WHERE clause should be modified, and the display should automatically change to OH: 1234567. In other words, the state abbreviation should be dynamic, not static.

This video demonstrates how to use CONCAT in a MySQL query:

Use CONCAT and aliasing to accomplish this.

Remember, state contains the state abbreviation.

Items to submit
  • SQL query used to answer the question.

Question 5

In (2) we wrote a query that returns a unique list of zip codes that start with "479". In (3) we wrote a query that counts the number of donations that are from Indiana. Use our query from (2) as a sub-query to find how many donations come from areas with zip codes starting with "479". What percent of donations in Indiana come from said zip codes?

This video gives two examples of sub-queries:

You can simply manually calculate the percent using the count in (2) and (5).

Items to submit
  • SQL queries used to answer the question.

  • The percentage of donations from Indiana from `zip_code`s starting with "479".

Question 6

In (3) we wrote a query that counts the number of donations that are from Indiana. When running queries like this, a natural "next question" is to ask the same question about another state. SQL gives us the ability to calculate functions in aggregate when grouping by a certain column. Write a SQL query that returns the state, number of donations from each state, the sum of the donations (transaction_amt). Which 5 states gave the most donations (highest count)? Order you result from most to least.

In this video we demonstrate GROUP BY, ORDER BY, DESC, and other aspects of MySQL that might help with this question:

You may want to create an alias in order to sort.

Items to submit
  • SQL query used to answer the question.

  • Which 5 states gave the most donations?