# Think Summer: Project 2 — 2021

## Submission

Students need to submit the following 2 files by 10:00PM EST through Gradescope inside Brightspace.

1. A compiled PDF.

2. A `.asciidoc` file.

To create these files, the first step is to run every cell in the notebook. To do this click   Run All Cells. Next, to create the PDF, click on   Export Notebook As…​  PDF. Last, to create the `.asciidoc` file, click on   Export Notebook As…​  Asciidoc.

 When you export the `.asciidoc` file, it may download a `.zip` file. Make sure to extract and submit only the `.asciidoc` file contained therein. You are subject to lose credit if you submit the `.zip` file instead of the `.asciidoc` file.

## Project

Motivation: SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data — amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses.

Context: This project is specially designed for Purdue University’s Think Summer program, in conjunction with Purdue University’s integrative data science initiative, The Data Mine.

Scope: SQL, SQL in R

Learning Objectives
• Demonstrate the ability to interact with popular database management systems within R.

• Solve data-driven problems using a combination of SQL and R.

• Use basic SQL commands: select, order by, limit, desc, asc, count, where, from.

• Perform grouping and aggregate data using group by and the following functions: count, max, sum, avg, like, having.

## Dataset

The following questions will use the `imdb` database found in Scholar, our computing cluster.

This database has 6 tables, namely:

`akas`, `crew`, `episodes`, `people`, `ratings`, and `titles`.

You have a variety of options to connect with, and run queries on our database:

1. Run SQL queries directly within a Jupyter Lab cell.

2. Connect to and run queries from within R in a Jupyter Lab cell.

3. From a terminal in Scholar.

For consistency and simplicity, we will only cover how to do (1) and (2).

First, for both (1) and (2) you must launch a new Jupyter Lab instance. To do so, please follow the instructions below.

1. Open a browser and navigate to gateway.scholar.rcac.purdue.edu, and login using your Purdue Career Account credentials. You should be presented with a screen similar to figure (1).

Figure 1. OnDemand
2. Click on "My Interactive Sessions", and you should be presented with a screen similar to figure (2).

Figure 2. Your interactive Scholar sessions
3. Click on Jupyter Lab in the left-hand menu. You should be presented with a screen similar to figure (3). Select the following settings:

• Queue: scholar (Max 4.0 hours)

• Number of hours: 2.5

• Processor cores requested: 1

• Memory requested (in Mb): 5120

Figure 3. Jupyter Lab settings
4. When satisfied, click , and wait for a minute. In a few moments, you should be presented with a screen similar to figure (4).

Figure 4. Jupyter Lab ready to connect
5. When you are ready, click . A new browser tab will launch and you will be presented with a screen similar to figure (5).

6. Under the "Notebook" menu, please select the (look for the big "F"). Finally, you will be presented with a screen similar to figure (6).

Figure 6. Ready-to-use Jupyter Lab notebook

You now have a running Jupyter Lab notebook ready for you to use. This Jupyter Lab instance is running on the Scholar cluster in the basement of the MATH building. By using OnDemand, you’ve essentially carved out a small portion of the compute power to use. Congratulations! Now please follow along below depending on whether you’d like to do option (1) or option (2).

To run queries directly in a Jupyter Lab cell (1), please do the following.

1. In the first cell, run the following code. This code loads an extension that allows you to directly run SQL queries in a cell as long as that cell has `%%sql` at the top of the cell.

``````%load_ext sql
%sql sqlite:////class/datamine/data/movies_and_tv/imdb.db``````
2. After running that cell (for example, using Ctrl+Enter), you can directly run future queries in each cell by starting the cell with `%%sql` in the first line. For example.

``````%%sql

SELECT * FROM titles LIMIT 5;``````

While this method has its advantages, there are some advantages to having interop between R and SQL — for example, you could quickly create cool graphics using data in the database and R.

To run queries from within R (2), please do the following.

1. In the first cell, run the following code. This code loads an extension that allows you to directly run R code in a cell as long as that cell has `%%R` at the top of the cell.

``%load_ext rpy2.ipython``
2. After running that cell (for example, using Ctrl+Enter), you can directly run R code in any cell that starts with `%%R` in the first line. For example.

``````%%R

my_vec <- c(1,2,3)
my_vec``````

Now, because we are able to run R code, we can connect to the database, make queries, and build plots, all in a single cell. For example.

``````%%R

library(RSQLite)
library(ggplot2)

conn <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
myDF <- dbGetQuery(conn, "SELECT * FROM titles LIMIT 5;")

ggplot(myDF) +
geom_point(aes(x=primary_title, y=runtime_minutes)) +
labs(x = 'Title', y= 'Minutes')``````
Figure 7. R output
 It is perfectly acceptable to mix and match SQL cells and R cells in your project.

## Questions

### Question 1

How many actors have lived to be more than 115 years old? Find the names, birth years, and death years for all actors and actresses who lived more than 115 years.

 Simple arithmetic can be done directly in a SQL predicate. For example, you can use the following. ``WHERE died - born > 115``
 Use the `people` table.
 Remember more than 115 years implies 115 years doesn’t qualify but 116 does.

Relevant topics: SQL, queries

Items to submit
• SQL query used to solve this problem. (1 pt)

• Output from running the SQL query. (0.5 pts)

### Question 2

Use the `ratings` table to discover how many films have a rating of at least 8 and at least 50000 votes. Limit your output to the first 15 results.

Relevant topics: SQL, queries

Items to submit
• SQL query used to solve this problem. (1 pt)

• Output from running the SQL query. (0.5 pts)

### Question 3

In the `titles` table, the `genres` column specifies the genre of each movie. Use `COUNT` to find out how unique genres occur in the database. At this time, don’t consider the fact that multiple `genres` could be listed in a single row. Any unique combination/permutation of genres should be considered a unique genre. Use the keyword `COUNT` in combination with `DISTINCT`.

Relevant topics: SQL, queries

Items to submit
• SQL query used to solve this problem. (1 pt)

• Output from running the SQL query. (1 pt)

### Question 4

In question 2, we asked you to kind of count how many unique genres there are in the database. In this question, we are going to dig in a little bit more, and use a combination of SQL and R to figure out the actual number of unique genres in the database.

First, use a (slightly modified version of) the SQL query from question 3 to pull the data into an R data.frame. The result should be a data.frame with a single column named `genres` of length 2242. You can access the column of a data.frame like so.

``````myDF <- dbGetQuery(conn, "<some query>")
genres <- myDF\$genres

Our list, `genres`, contains strings with comma-separated values, which are the actual genres. If you combine each of the comma-separated string values into one giant comma-separated string, you could then easily split the string into individual values. How many unique genres are there?

 You can use the following R functions to solve this problem: `paste`, `unlist`, `strsplit`.
 Here is an example of splitting a string. ``````my_string <- "first;second;third" result <- unlist(strsplit(my_string, ";"))``````
 Before you answer the question, make sure all of the genres are actually genres and not erroneous data!

Relevant topics: paste

Items to submit
• All code used to solve this problem. (3 pts)

• A list of the unique genres. (1 pt)

• One sentence explaining why using SQL was valuable in this instance (rather than just using R — think speed). (1 pt)

 The following are challenge questions and are worth 0 points. If you get done early give them a try!

### Question 5

In the previous question, we were able to get the number of unique genres. In this question, let’s take this one step further. Use the `table` function in R to calculate how many times each genre appears in the database.

 You’ll need to modify your query from question 3.

Relevant topics: table

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 6

In the previous question, we got a count of the number of times each genre appeared in the database. Create a dotchart illustrating this data.

 Make sure to exclude the erroneous data!

Relevant topics: dotcharts

Items to submit
• Code used to solve this problem.

• Output from running the code.