Think Summer: Project 3 — 2021
Submission
Students need to submit the following 2 files **by 10:00P through Gradescope inside Brightspace.
-
A compiled PDF.
-
A
.asciidoc
file.
To create these files, the first step is to run every cell in the notebook. To do this click .asciidoc
file, click on .
When you export the .asciidoc file, it may download a .zip file. Make sure to extract and submit only the .asciidoc file contained therein. You are subject to lose credit if you submit the .zip file instead of the .asciidoc file.
|
Project
Motivation: SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data — amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses.
Context: This project is specially designed for Purdue University’s Think Summer program, in conjunction with Purdue University’s integrative data science initiative, The Data Mine.
Scope: SQL, SQL in R
Dataset
The following questions will use the imdb
database found in Scholar, our computing cluster.
This database has 6 tables, namely:
akas
, crew
, episodes
, people
, ratings
, and titles
.
You have a variety of options to connect with, and run queries on our database:
-
Run SQL queries directly within a Jupyter Lab cell.
-
Connect to and run queries from within R in a Jupyter Lab cell.
-
From a terminal in Scholar.
For consistency and simplicity, we will only cover how to do (1) and (2).
First, for both (1) and (2) you must launch a new Jupyter Lab instance. To do so, please follow the instructions below.
-
Open a browser and navigate to gateway.scholar.rcac.purdue.edu, and login using your Purdue Career Account credentials. You should be presented with a screen similar to figure (1).
Figure 1. OnDemand -
Click on "My Interactive Sessions", and you should be presented with a screen similar to figure (2).
Figure 2. Your interactive Scholar sessions -
Click on Jupyter Lab in the left-hand menu. You should be presented with a screen similar to figure (3). Select the following settings:
-
Queue: scholar (Max 4.0 hours)
-
Number of hours: 2.5
-
Processor cores requested: 1
-
Memory requested (in Mb): 5120
Figure 3. Jupyter Lab settings
-
-
When satisfied, click Launch, and wait for a minute. In a few moments, you should be presented with a screen similar to figure (4).
Figure 4. Jupyter Lab ready to connect -
When you are ready, click Connect to Jupyter. A new browser tab will launch and you will be presented with a screen similar to figure (5).
Figure 5. Kernel menu -
Under the "Notebook" menu, please select the f2021-s2022 (look for the big "F"). Finally, you will be presented with a screen similar to figure (6).
Figure 6. Ready-to-use Jupyter Lab notebookYou now have a running Jupyter Lab notebook ready for you to use. This Jupyter Lab instance is running on the Scholar cluster in the basement of the MATH building. By using OnDemand, you’ve essentially carved out a small portion of the compute power to use. Congratulations! Now please follow along below depending on whether you’d like to do option (1) or option (2).
To run queries directly in a Jupyter Lab cell (1), please do the following.
-
In the first cell, run the following code. This code loads an extension that allows you to directly run SQL queries in a cell as long as that cell has
%%sql
at the top of the cell.%load_ext sql %sql sqlite:////class/datamine/data/movies_and_tv/imdb.db
-
After running that cell (for example, using Ctrl+Enter), you can directly run future queries in each cell by starting the cell with
%%sql
in the first line. For example.%%sql SELECT * FROM titles LIMIT 5;
While this method has its advantages, there are some advantages to having interop between R and SQL — for example, you could quickly create cool graphics using data in the database and R.
To run queries from within R (2), please do the following.
-
In the first cell, run the following code. This code loads an extension that allows you to directly run R code in a cell as long as that cell has
%%R
at the top of the cell.%load_ext rpy2.ipython
-
After running that cell (for example, using Ctrl+Enter), you can directly run R code in any cell that starts with
%%R
in the first line. For example.%%R my_vec <- c(1,2,3) my_vec
Now, because we are able to run R code, we can connect to the database, make queries, and build plots, all in a single cell. For example.
%%R library(RSQLite) library(ggplot2) conn <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db") myDF <- dbGetQuery(conn, "SELECT * FROM titles LIMIT 5;") ggplot(myDF) + geom_point(aes(x=primary_title, y=runtime_minutes)) + labs(x = 'Title', y= 'Minutes')
Figure 7. R output
It is perfectly acceptable to mix and match SQL cells and R cells in your project. |
Questions
Question 1
In this project, instead of connecting to our SQLite database, instead, let’s connect to the same database (the same imdb data), but using a different RDBMS (relational database management system), MariaDB. It is easy to make this change, and you only need to modify a single line of code. Just change the following from this
%sql sqlite:////class/datamine/data/movies_and_tv/imdb.db
to this
%sql mariadb+pymysql://imdb_user:[email protected]/imdb
Other than that single change, you most likely won’t notice a single change! When making a decision about what RDBMS to use, you should never limit yourself to what you are familiar with as it may be relatively easy to use something new!
A primary key is a field in a table which uniquely identifies a row in the table. Primary keys must be unique values. This is enforced at the database level.
A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys.
Examine the titles table. Do you think there are any primary keys? How about foreign keys? Now examine the episodes
table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys?
Answer this solution in a markdown cell. Write the text in a code cell, in the menu, click menu:[Code > Markdown]. The appearance of the text in your cell may change. Run the cell, and the text should render neatly. |
Relevant topics: primary and foreign keys
-
List any primary or foreign keys in the
titles
table. (.5 pt) -
List any primary or foreign keys in the
episodes
table. (.5 pt) -
Any code you used to answer this question.
Question 2
If you paste a title_id
to the end of the following url, it will pull up the page for the title. For example, www.imdb.com/title/tt0413573 leads to the page for the TV series Grey’s Anatomy. Write a SQL query to confirm that the title_id
"tt0413573" does indeed belong to Grey’s Anatomy. Then browse imdb.com and find your favorite TV show. Get the title_id
from the url of your favorite TV show, and run the following query to confirm that the TV show is in our database.
SELECT * FROM titles WHERE title_id='<title id here>';
Make sure to replace "<title id here>" with the title_id of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database (that has multiple seasons).
|
-
SQL query used to confirm that
title_id
"tt0413573" does indeed belong to Grey’s Anatomy. (.5 pts) -
The output of the query. (.5 pt)
-
The
title_id
of your favorite TV show. (.5 pts) -
SQL query used to confirm the
title_id
for your favorite TV show. (.5 pts)
Question 3
The episode_title_id
column in the episodes
table references titles of individual episodes of a TV series. The show_title_id
references the titles of the show itself. With that in mind, write a query that gets a list of all of the episodes_title_id’s (found in the `episodes
table), with the associated primary_title
(found in the titles
table) for each episode of Grey’s Anatomy.
This video demonstrates how to extract titles of episodes in the imdb database.
|
-
SQL query used to answer the question. (3 pts)
-
Output from running the SQL query. (2 pts)
Question 4
Joins are a critical concept to understand. They appear everywhere where relational data is found. In R, the merge
function performs the same operations as joins. In python’s pandas
package the merge
method for the DataFrame
object performs the same operations. Take some time to read this section.
In question 2 from the previous project, we asked you to use the ratings
table to discover how many films have a rating of at least 8 and at least 50000 votes. You may have noticed, while you can easily do that, the end result is not human understandable. We see that there are films with those features but we don’t know what film title_id
"tt0010323" is for. This is a great example where a simple join can answer this question for us.
Write a query that prints the primary_title
, rating
, and votes
for all films with a rating of at least 8 and at least 50000 votes. Like in the previous version of this question, limit your output to 15 results.
Assume our left table is ratings
and our right table is titles
. What would conceptually change if instead of using an INNER JOIN
we used a LEFT JOIN
? Does it make a difference in this case? Why or why not?
-
SQL query used to answer the question. (1 pts)
-
Output from running the SQL query.
-
1-2 sentences explaining what the conceptual change would be if you used a
LEFT JOIN
instead of anINNER JOIN
. (.5 pts) -
A statement whether or not
LEFT JOIN
vsINNER JOIN
makes a difference in this specific case or not, and why. (.5 pts)
The following are challenge questions and are worth 0 points. If you get done early give them a try! |
Question 5
We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in question 2. In order to do so, we will break the task into two parts in (5) and (6). First, write a query that returns a list of just episode_title_ids
(found in the episodes
table), with the associated primary_title
(found in the titles
table) for each episode.
-
SQL query used to answer the question.
-
Output from running the SQL query.