Think Summer: Project 3 — 2021

Submission

Students need to submit the following 2 files **by 10:00P through Gradescope inside Brightspace.

  1. A compiled PDF.

  2. A .asciidoc file.

To create these files, the first step is to run every cell in the notebook. To do this click Run  Run All Cells. Next, to create the PDF, click on File  Export Notebook As…​  PDF. Last, to create the .asciidoc file, click on File  Export Notebook As…​  Asciidoc.

When you export the .asciidoc file, it may download a .zip file. Make sure to extract and submit only the .asciidoc file contained therein. You are subject to lose credit if you submit the .zip file instead of the .asciidoc file.

Project

Motivation: SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data — amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses.

Context: This project is specially designed for Purdue University’s Think Summer program, in conjunction with Purdue University’s integrative data science initiative, The Data Mine.

Scope: SQL, SQL in R

Learning Objectives
  • Demonstrate the ability to interact with popular database management systems within R.

  • Solve data-driven problems using a combination of SQL and R.

  • Use basic SQL commands: select, order by, limit, desc, asc, count, where, from.

  • Perform grouping and aggregate data using group by and the following functions: count, max, sum, avg, like, having.

Dataset

The following questions will use the imdb database found in Scholar, our computing cluster.

This database has 6 tables, namely:

akas, crew, episodes, people, ratings, and titles.

You have a variety of options to connect with, and run queries on our database:

  1. Run SQL queries directly within a Jupyter Lab cell.

  2. Connect to and run queries from within R in a Jupyter Lab cell.

  3. From a terminal in Scholar.

For consistency and simplicity, we will only cover how to do (1) and (2).

First, for both (1) and (2) you must launch a new Jupyter Lab instance. To do so, please follow the instructions below.

  1. Open a browser and navigate to gateway.scholar.rcac.purdue.edu, and login using your Purdue Career Account credentials. You should be presented with a screen similar to figure (1).

    OnDemand
    Figure 1. OnDemand
  2. Click on "My Interactive Sessions", and you should be presented with a screen similar to figure (2).

    Your interactive Scholar sessions
    Figure 2. Your interactive Scholar sessions
  3. Click on Jupyter Lab in the left-hand menu. You should be presented with a screen similar to figure (3). Select the following settings:

    • Queue: scholar (Max 4.0 hours)

    • Number of hours: 2.5

    • Processor cores requested: 1

    • Memory requested (in Mb): 5120

      Jupyter Lab settings
      Figure 3. Jupyter Lab settings
  4. When satisfied, click Launch, and wait for a minute. In a few moments, you should be presented with a screen similar to figure (4).

    Jupyter Lab ready to connect
    Figure 4. Jupyter Lab ready to connect
  5. When you are ready, click Connect to Jupyter. A new browser tab will launch and you will be presented with a screen similar to figure (5).

    Kernel menu
    Figure 5. Kernel menu
  6. Under the "Notebook" menu, please select the f2021-s2022 (look for the big "F"). Finally, you will be presented with a screen similar to figure (6).

    Ready Jupyter Lab notebook
    Figure 6. Ready-to-use Jupyter Lab notebook

    You now have a running Jupyter Lab notebook ready for you to use. This Jupyter Lab instance is running on the Scholar cluster in the basement of the MATH building. By using OnDemand, you’ve essentially carved out a small portion of the compute power to use. Congratulations! Now please follow along below depending on whether you’d like to do option (1) or option (2).

To run queries directly in a Jupyter Lab cell (1), please do the following.

  1. In the first cell, run the following code. This code loads an extension that allows you to directly run SQL queries in a cell as long as that cell has %%sql at the top of the cell.

    %load_ext sql
    %sql sqlite:////class/datamine/data/movies_and_tv/imdb.db
  2. After running that cell (for example, using Ctrl+Enter), you can directly run future queries in each cell by starting the cell with %%sql in the first line. For example.

    %%sql
    
    SELECT * FROM titles LIMIT 5;

    While this method has its advantages, there are some advantages to having interop between R and SQL — for example, you could quickly create cool graphics using data in the database and R.

To run queries from within R (2), please do the following.

  1. In the first cell, run the following code. This code loads an extension that allows you to directly run R code in a cell as long as that cell has %%R at the top of the cell.

    %load_ext rpy2.ipython
  2. After running that cell (for example, using Ctrl+Enter), you can directly run R code in any cell that starts with %%R in the first line. For example.

    %%R
    
    my_vec <- c(1,2,3)
    my_vec

    Now, because we are able to run R code, we can connect to the database, make queries, and build plots, all in a single cell. For example.

    %%R
    
    library(RSQLite)
    library(ggplot2)
    
    conn <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")
    myDF <- dbGetQuery(conn, "SELECT * FROM titles LIMIT 5;")
    
    ggplot(myDF) +
        geom_point(aes(x=primary_title, y=runtime_minutes)) +
        labs(x = 'Title', y= 'Minutes')
    R output
    Figure 7. R output
It is perfectly acceptable to mix and match SQL cells and R cells in your project.

Questions

Question 1

In this project, instead of connecting to our SQLite database, instead, let’s connect to the same database (the same imdb data), but using a different RDBMS (relational database management system), MariaDB. It is easy to make this change, and you only need to modify a single line of code. Just change the following from this

%sql sqlite:////class/datamine/data/movies_and_tv/imdb.db

to this

%sql mariadb+pymysql://imdb_user:[email protected]/imdb

Other than that single change, you most likely won’t notice a single change! When making a decision about what RDBMS to use, you should never limit yourself to what you are familiar with as it may be relatively easy to use something new!

A primary key is a field in a table which uniquely identifies a row in the table. Primary keys must be unique values. This is enforced at the database level.

A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys.

Examine the titles table. Do you think there are any primary keys? How about foreign keys? Now examine the episodes table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys?

Answer this solution in a markdown cell. Write the text in a code cell, in the menu, click menu:[Code > Markdown]. The appearance of the text in your cell may change. Run the cell, and the text should render neatly.

Relevant topics: primary and foreign keys

Items to submit
  • List any primary or foreign keys in the titles table. (.5 pt)

  • List any primary or foreign keys in the episodes table. (.5 pt)

  • Any code you used to answer this question.

Question 2

If you paste a title_id to the end of the following url, it will pull up the page for the title. For example, www.imdb.com/title/tt0413573 leads to the page for the TV series Grey’s Anatomy. Write a SQL query to confirm that the title_id "tt0413573" does indeed belong to Grey’s Anatomy. Then browse imdb.com and find your favorite TV show. Get the title_id from the url of your favorite TV show, and run the following query to confirm that the TV show is in our database.

SELECT * FROM titles WHERE title_id='<title id here>';
Make sure to replace "<title id here>" with the title_id of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database (that has multiple seasons).

Relevant topics: SQL, queries

Items to submit
  • SQL query used to confirm that title_id "tt0413573" does indeed belong to Grey’s Anatomy. (.5 pts)

  • The output of the query. (.5 pt)

  • The title_id of your favorite TV show. (.5 pts)

  • SQL query used to confirm the title_id for your favorite TV show. (.5 pts)

Question 3

The episode_title_id column in the episodes table references titles of individual episodes of a TV series. The show_title_id references the titles of the show itself. With that in mind, write a query that gets a list of all of the episodes_title_id’s (found in the `episodes table), with the associated primary_title (found in the titles table) for each episode of Grey’s Anatomy.

This video demonstrates how to extract titles of episodes in the imdb database.

Relevant topics: SQL, queries, joins

Items to submit
  • SQL query used to answer the question. (3 pts)

  • Output from running the SQL query. (2 pts)

Question 4

Joins are a critical concept to understand. They appear everywhere where relational data is found. In R, the merge function performs the same operations as joins. In python’s pandas package the merge method for the DataFrame object performs the same operations. Take some time to read this section.

In question 2 from the previous project, we asked you to use the ratings table to discover how many films have a rating of at least 8 and at least 50000 votes. You may have noticed, while you can easily do that, the end result is not human understandable. We see that there are films with those features but we don’t know what film title_id "tt0010323" is for. This is a great example where a simple join can answer this question for us.

Write a query that prints the primary_title, rating, and votes for all films with a rating of at least 8 and at least 50000 votes. Like in the previous version of this question, limit your output to 15 results.

Assume our left table is ratings and our right table is titles. What would conceptually change if instead of using an INNER JOIN we used a LEFT JOIN? Does it make a difference in this case? Why or why not?

Relevant topics: SQL, queries, joins

Items to submit
  • SQL query used to answer the question. (1 pts)

  • Output from running the SQL query.

  • 1-2 sentences explaining what the conceptual change would be if you used a LEFT JOIN instead of an INNER JOIN. (.5 pts)

  • A statement whether or not LEFT JOIN vs INNER JOIN makes a difference in this specific case or not, and why. (.5 pts)

The following are challenge questions and are worth 0 points. If you get done early give them a try!

Question 5

We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in question 2. In order to do so, we will break the task into two parts in (5) and (6). First, write a query that returns a list of just episode_title_ids (found in the episodes table), with the associated primary_title (found in the titles table) for each episode.

Relevant topics: SQL, queries, joins

Items to submit
  • SQL query used to answer the question.

  • Output from running the SQL query.

Question 6

Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (5) as a subquery. Which episode has the highest rating? Is it also your favorite episode?

Relevant topics: SQL, queries, joins

Items to submit
  • SQL query used to answer the question.

  • Output from running the SQL query.