# Think Summer: Project 4 — 2021

## Submission

Students need to submit the following 2 files by 10:00PM EST through Gradescope inside Brightspace.

1. A compiled PDF.

2. A `.asciidoc` file.

To create these files, the first step is to run every cell in the notebook. To do this, click Run ▸ Run All Cells. Next, to create the PDF, click File ▸ Export Notebook As… ▸ PDF. Last, to create the `.asciidoc` file, click File ▸ Export Notebook As… ▸ Asciidoc.

When you export the `.asciidoc` file, it may download as a `.zip` file. Make sure to extract and submit only the `.asciidoc` file contained therein. You may lose credit if you submit the `.zip` file instead of the `.asciidoc` file.

## Project

Motivation: SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data — amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses.

Context: This project is specially designed for Purdue University’s Think Summer program, in conjunction with Purdue University’s integrative data science initiative, The Data Mine.

Scope: SQL, SQL in R

Learning Objectives
• Demonstrate the ability to interact with popular database management systems within R.

• Solve data-driven problems using a combination of SQL and R.

• Use basic SQL commands: select, order by, limit, desc, asc, count, where, from.

• Perform grouping and aggregation using `GROUP BY`, the `HAVING` clause, the `LIKE` operator, and the aggregate functions `COUNT`, `MAX`, `SUM`, and `AVG`.

## Dataset

The following questions will use the `imdb` database found in Scholar, our computing cluster.

This database has 6 tables, namely:

`akas`, `crew`, `episodes`, `people`, `ratings`, and `titles`.

You have a variety of options to connect with, and run queries on our database:

1. Run SQL queries directly within a Jupyter Lab cell.

2. Connect to and run queries from within R in a Jupyter Lab cell.

3. From a terminal in Scholar.

For consistency and simplicity, we will only cover how to do (1) and (2).

First, for both (1) and (2) you must launch a new Jupyter Lab instance. To do so, please follow the instructions below.

1. Open a browser and navigate to gateway.scholar.rcac.purdue.edu, and login using your Purdue Career Account credentials. You should be presented with a screen similar to figure (1).

Figure 1. OnDemand
2. Click on "My Interactive Sessions", and you should be presented with a screen similar to figure (2).

Figure 2. Your interactive Scholar sessions
3. Click on Jupyter Lab in the left-hand menu. You should be presented with a screen similar to figure (3). Select the following settings:

• Queue: scholar (Max 4.0 hours)

• Number of hours: 2.5

• Processor cores requested: 1

• Memory requested (in Mb): 5120

Figure 3. Jupyter Lab settings
4. When satisfied, click Launch, and wait for a minute. In a few moments, you should be presented with a screen similar to figure (4).

Figure 4. Jupyter Lab ready to connect
5. When you are ready, click Connect to Jupyter. A new browser tab will launch and you will be presented with a screen similar to figure (5).

6. Under the "Notebook" menu, please select the kernel with the big "F" icon. Finally, you will be presented with a screen similar to figure (6).

Figure 6. Ready-to-use Jupyter Lab notebook

You now have a running Jupyter Lab notebook ready for you to use. This Jupyter Lab instance is running on the Scholar cluster in the basement of the MATH building. By using OnDemand, you’ve essentially carved out a small portion of the compute power to use. Congratulations! Now please follow along below depending on whether you’d like to do option (1) or option (2).

To run queries directly in a Jupyter Lab cell (1), please do the following.

1. In the first cell, run the following code. This code loads an extension that allows you to directly run SQL queries in a cell as long as that cell has `%%sql` at the top of the cell.

```
%load_ext sql
%sql sqlite:////class/datamine/data/movies_and_tv/imdb.db
```
2. After running that cell (for example, using Ctrl+Enter), you can directly run future queries in each cell by starting the cell with `%%sql` in the first line. For example.

```
%%sql

SELECT * FROM titles LIMIT 5;
```

While this method has its advantages, there are also advantages to interoperability between R and SQL. For example, you could quickly create cool graphics in R using data pulled from the database.

To run queries from within R (2), please do the following.

1. In the first cell, run the following code. This code loads an extension that allows you to directly run R code in a cell as long as that cell has `%%R` at the top of the cell.

```
%load_ext rpy2.ipython
```
2. After running that cell (for example, using Ctrl+Enter), you can directly run R code in any cell that starts with `%%R` in the first line. For example.

```
%%R

my_vec <- c(1,2,3)
my_vec
```

Now, because we are able to run R code, we can connect to the database, make queries, and build plots, all in a single cell. For example.

```
%%R

library(RSQLite)
library(ggplot2)

conn <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
myDF <- dbGetQuery(conn, "SELECT * FROM titles LIMIT 5;")

ggplot(myDF) +
    geom_point(aes(x = primary_title, y = runtime_minutes)) +
    labs(x = 'Title', y = 'Minutes')
```
Figure 7. R output
 It is perfectly acceptable to mix and match SQL cells and R cells in your project.

## Questions

### Question 1

Aggregate functions like `COUNT`, `AVG`, `SUM`, `MIN`, and `MAX` are very useful. In particular, queries like the following are great.

```
SELECT COUNT(*) FROM titles WHERE premiered = 2008;
```

However, in this scenario we only learn how many movies premiered in 2008. Often, we would rather see these counts for every year at once, rather than one year at a time. This is where aggregate functions, combined with `GROUP BY`, really start to have more power.

In the `titles` table, the `premiered` column specifies the year that a movie was premiered. Use `COUNT` to find how many movies premiered in each year in the database, in a single query.

 Use only SQL to answer this question.
 If you feel like practicing your R skills, try and solve this using R instead of SQL (for 0 points).
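As a sketch of the idea, here is a toy example using Python's built-in `sqlite3` module with a tiny, made-up `titles`-like table (not the real `imdb.db`, which lives on Scholar). It shows how `COUNT(*)` combined with `GROUP BY` produces one count per year in a single query.

```python
import sqlite3

# Toy data only: a fake `titles` table with a `premiered` column,
# standing in for the real imdb.db on Scholar.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (premiered INTEGER)")
conn.executemany(
    "INSERT INTO titles VALUES (?)",
    [(2007,), (2008,), (2008,), (2009,), (2009,), (2009,)],
)

# GROUP BY collapses the rows into one group per year, and COUNT(*)
# counts the rows in each group -- one query, every year at once.
rows = conn.execute(
    "SELECT premiered, COUNT(*) FROM titles GROUP BY premiered ORDER BY premiered"
).fetchall()
print(rows)  # [(2007, 1), (2008, 2), (2009, 3)]
```

The same `SELECT` statement works verbatim in a `%%sql` cell pointed at the real database.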
Items to submit
• SQL query used to solve the question. (1.5 pts)

• Output from running the code. (.5 pts)

### Question 2

In question (1), we have an example that starts to demonstrate how those simple aggregate functions are really quite powerful. The results, however, do have some ways that they could be improved. Improve your solution to question (1) in the following ways:

1. Use aliasing to rename the results of the `COUNT` function, so that rather than being labeled `COUNT(*)`, the column appears as `movies premiered`.

2. While it can be interesting to see the number of movies premiering long ago, perhaps we don’t need to see all of this information. Edit your query to only include movies premiering in 1970 or later.

 Use only SQL to answer this question.
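Both improvements can be sketched on a toy table via Python's `sqlite3` (hypothetical data, not the real `imdb.db`): `AS` renames the count column, and a filter on `premiered` drops the pre-1970 rows.

```python
import sqlite3

# Hypothetical data standing in for the titles table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (premiered INTEGER)")
conn.executemany(
    "INSERT INTO titles VALUES (?)",
    [(1965,), (1965,), (1972,), (1972,), (1972,), (1980,)],
)

# AS gives COUNT(*) a friendlier label; because the alias contains a
# space, it must be wrapped in double quotes.
query = """
SELECT premiered, COUNT(*) AS "movies premiered"
FROM titles
WHERE premiered >= 1970
GROUP BY premiered
ORDER BY premiered
"""
cur = conn.execute(query)
rows = cur.fetchall()
print(cur.description[1][0])  # the aliased column name: movies premiered
print(rows)                   # [(1972, 3), (1980, 1)]
```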
Items to submit
• SQL query used to solve the question. (1.5 pts)

• Output from running the code. (.5 pts)

### Question 3

Did you use `LIMIT` and `ORDER BY` to update your query from question (2)? While that is one way to solve that question, a more robust way is to use the `HAVING` clause. Use `HAVING` to limit the query to only include movies premiering in 1970 or later.
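The difference is when the filter is applied: `WHERE` filters individual rows before grouping, while `HAVING` filters the groups after `GROUP BY` has formed them. A toy sketch with made-up data, using Python's `sqlite3`:

```python
import sqlite3

# Toy stand-in for the titles table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (premiered INTEGER)")
conn.executemany(
    "INSERT INTO titles VALUES (?)",
    [(1960,), (1975,), (1975,), (1990,)],
)

# HAVING is evaluated per group, after GROUP BY, so it can reference
# grouped columns and aggregate results alike.
rows = conn.execute("""
    SELECT premiered, COUNT(*)
    FROM titles
    GROUP BY premiered
    HAVING premiered >= 1970
    ORDER BY premiered
""").fetchall()
print(rows)  # [(1975, 2), (1990, 1)]
```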

Items to submit
• SQL query used to solve the question. (.5 pts)

• Output from running the code. (.5 pts)

### Question 4

Let’s try to combine a little bit of everything we’ve learned so far. In the previous project, you picked a TV series to perform queries on. Use that same TV series (or, if you don’t want to choose a TV series, title_id "tt0413573" is a good one to use) for this question.

We want to get the `episode_number`, `season_number`, `primary_title`, `title_id`, and `rating` of every episode of your TV series for only seasons where the average `rating` was at least 8.0, in a single query.

This will be a large query with multiple joins, and sub-queries. For this reason, we will break this down into parts, each worth some points.

#### Part 1

First, write a query that gets the `episode_title_id` and `season_number` for every episode of our TV show.

#### Part 2

Next, use your query from part (1) as a sub-query, and get the `season_number`s for the seasons with an average `rating` of at least 8.0. The result should be a single column (`season_number`) with 11 values (if you are using title_id "tt0413573").

 Remember that a TV show may have an overall rating and individual episode ratings. For example, for Grey’s Anatomy, you can get the overall rating by running this query. ``SELECT rating FROM ratings WHERE title_id = 'tt0413573';`` But, we want you to get the average rating, by season.

#### Part 3

Write a query that gets the `episode_number`, `season_number`, `primary_title`, and `title_id` for the TV show with your title_id (for example, "tt0413573"). Make sure to order the results first by `season_number` and then by `episode_number`.

#### Part 4

At this stage there are only 2 missing components to our query from part (3). First is the fact that all episodes from all seasons are returned. To address this, use logical `AND` and the `IN` operator to limit the returned episodes from your part (3) query to only the `season_number`s returned in your part (2) query.

 This may sound difficult, but it isn’t! Start with your part (3) query, and tack on `AND ... IN (...)`. Of course, you need to fill in the `...` before `IN` with the correct column name, and the `...` inside the parentheses with your part (2) query.
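To illustrate the pattern with toy episode data invented for this sketch (queried through Python's `sqlite3`, not the real imdb tables): the sub-query picks out the qualifying seasons, and `IN` keeps only episodes from those seasons.

```python
import sqlite3

# Made-up episodes with per-episode ratings; not the imdb schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE episodes (season_number INTEGER, episode_number INTEGER, rating REAL)"
)
conn.executemany(
    "INSERT INTO episodes VALUES (?, ?, ?)",
    [(1, 1, 7.0), (1, 2, 7.5), (2, 1, 8.5), (2, 2, 9.0)],
)

# The sub-query returns the seasons averaging at least 8.0 (here, only
# season 2); IN filters the outer query down to those seasons.
rows = conn.execute("""
    SELECT season_number, episode_number
    FROM episodes
    WHERE season_number IN (
        SELECT season_number
        FROM episodes
        GROUP BY season_number
        HAVING AVG(rating) >= 8.0
    )
    ORDER BY season_number, episode_number
""").fetchall()
print(rows)  # [(2, 1), (2, 2)]
```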

#### Part 5

Finally, the last missing component is the individual `rating` for each episode, which isn’t available in our query from part (3). Simply start with your query from part (4), and perform a join with the `ratings` table to get the `rating` for each episode.
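A minimal join sketch with invented rows (Python's `sqlite3`, tiny stand-ins for the `episodes` and `ratings` tables): matching the episode's title id against `ratings.title_id` attaches each episode's rating.

```python
import sqlite3

# Invented stand-ins for the episodes and ratings tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE episodes (episode_title_id TEXT, season_number INTEGER, episode_number INTEGER)"
)
conn.execute("CREATE TABLE ratings (title_id TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO episodes VALUES (?, ?, ?)",
    [("e1", 1, 1), ("e2", 1, 2)],
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [("e1", 7.8), ("e2", 8.1)],
)

# INNER JOIN pairs each episode with the ratings row whose title_id
# matches the episode's title id.
rows = conn.execute("""
    SELECT e.season_number, e.episode_number, r.rating
    FROM episodes AS e
    INNER JOIN ratings AS r ON e.episode_title_id = r.title_id
    ORDER BY e.season_number, e.episode_number
""").fetchall()
print(rows)  # [(1, 1, 7.8), (1, 2, 8.1)]
```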

Items to submit
• SQL queries for each of parts 1 - 5. (.5 pts each)

• Output from running queries from each of parts 1 - 5. (.5 pts each)