Think Summer: Project 1 — 2021
Submission
Students need to submit the following 2 files by 10:00PM EST through Gradescope inside Brightspace.
-
A compiled PDF.
-
A
.asciidoc
file.
To create these files, the first step is to run every cell in the notebook. To do this click .asciidoc
file, click on .
When you export the .asciidoc file, it may download a .zip file. Make sure to extract and submit only the .asciidoc file contained therein. You are subject to lose credit if you submit the .zip file instead of the .asciidoc file.
|
Project
Motivation: SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data — amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses.
Context: This project is specially designed for Purdue University’s Think Summer program, in conjunction with Purdue University’s integrative data science initiative, The Data Mine.
Scope: SQL, SQL in R
Dataset
The following questions will use the imdb
database found in Scholar, our computing cluster.
This database has 6 tables, namely:
akas
, crew
, episodes
, people
, ratings
, and titles
.
You have a variety of options to connect with, and run queries on our database:
-
Run SQL queries directly within a Jupyter Lab cell.
-
Connect to and run queries from within R in a Jupyter Lab cell.
-
From a terminal in Scholar.
For consistency and simplicity, we will only cover how to do (1) and (2).
First, for both (1) and (2) you must launch a new Jupyter Lab instance. To do so, please follow the instructions below.
-
Open a browser and navigate to gateway.scholar.rcac.purdue.edu, and login using your Purdue Career Account credentials. You should be presented with a screen similar to figure (1).
Figure 1. OnDemand -
Click on "My Interactive Sessions", and you should be presented with a screen similar to figure (2).
Figure 2. Your interactive Scholar sessions -
Click on Jupyter Lab in the left-hand menu. You should be presented with a screen similar to figure (3). Select the following settings:
-
Queue: scholar (Max 4.0 hours)
-
Number of hours: 2.5
-
Processor cores requested: 1
-
Memory requested (in Mb): 5120
Figure 3. Jupyter Lab settings
-
-
When satisfied, click Launch, and wait for a minute. In a few moments, you should be presented with a screen similar to figure (4).
Figure 4. Jupyter Lab ready to connect -
When you are ready, click Connect to Jupyter. A new browser tab will launch and you will be presented with a screen similar to figure (5).
Figure 5. Kernel menu -
Under the "Notebook" menu, please select the f2021-s2022 (look for the big "F"). Finally, you will be presented with a screen similar to figure (6).
Figure 6. Ready-to-use Jupyter Lab notebookYou now have a running Jupyter Lab notebook ready for you to use. This Jupyter Lab instance is running on the Scholar cluster in the basement of the MATH building. By using OnDemand, you’ve essentially carved out a small portion of the compute power to use. Congratulations! Now please follow along below depending on whether you’d like to do option (1) or option (2).
To run queries directly in a Jupyter Lab cell (1), please do the following.
-
In the first cell, run the following code. This code loads an extension that allows you to directly run SQL queries in a cell as long as that cell has
%%sql
at the top of the cell.%load_ext sql %sql sqlite:////class/datamine/data/movies_and_tv/imdb.db
-
After running that cell (for example, using Ctrl+Enter), you can directly run future queries in each cell by starting the cell with
%%sql
in the first line. For example.%%sql SELECT * FROM titles LIMIT 5;
While this method has its advantages, there are some advantages to having interop between R and SQL — for example, you could quickly create cool graphics using data in the database and R.
To run queries from within R (2), please do the following.
-
In the first cell, run the following code. This code loads an extension that allows you to directly run R code in a cell as long as that cell has
%%R
at the top of the cell.%load_ext rpy2.ipython
-
After running that cell (for example, using Ctrl+Enter), you can directly run R code in any cell that starts with
%%R
in the first line. For example.%%R my_vec <- c(1,2,3) my_vec
Now, because we are able to run R code, we can connect to the database, make queries, and build plots, all in a single cell. For example.
%%R library(RSQLite) library(ggplot2) conn <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db") myDF <- dbGetQuery(conn, "SELECT * FROM titles LIMIT 5;") ggplot(myDF) + geom_point(aes(x=primary_title, y=runtime_minutes)) + labs(x = 'Title', y= 'Minutes')
Figure 7. R output
It is perfectly acceptable to mix and match SQL cells and R cells in your project. |
Questions
Question 1
Read some information on SQL here. Explore the 6 tables in the database, and state 1 interesting fact that you find about at least one of the tables.
Normally, when using SQLite, the easiest way to display the tables in the database is by running
To list the tables using a SQL cell, run the following.
|
-
Code used to explore the database. (1 pt)
-
A sentence describing at least 1 interesting fact about at least one of the tables. (1 pt)
Question 2
Find and print the title_id
, rating
, and number of votes (votes
) for all movies that received at least 2 million votes. In a second query (and new cell), use the information you found in the previous query to identify the primary_title
of these movies.
Use the ratings table for the first query, and the titles table for the second.
|
At least includes a movie having exactly 2,000,000 votes. |
-
SQL queries used to solve this problem. (2 pts)
-
Output from running the SQL query. (1 pt)
Question 3
Find the primary_title
of every movie that is over 2 hours long or that premiered after 1990. Order the result from newest premier year to oldest, and limit the output to 15 movies. Make sure premiered
and runtime_minutes
are not NULL
.
Relevant topics: SQL, queries, order of operations
Make sure the |
Be careful! Order of operations is important here. |
-
SQL query used to solve this problem. (3 pts)
-
Explain in 1-2 sentences why order of operations is important here. (1 pt)
-
Output from running the SQL query. (1 pt)
The following are challenge questions and are worth 0 points. If you get done early give them a try! |
Question 4
What movie has the longest primary title? Answer this question using just SQL.
Relevant topics: SQLite length
-
SQL query used to solve this problem.
-
Output from running this query.
Question 5
LIKE
is a very powerful tool. You can read about SQLite’s version of LIKE
here. Use LIKE
and/or R to get a count of how many movies (type='movie') that starts with each letter of the alphabet. Can you think of another way to do this? If so, show us, and explain what you did!
Relevant topics: SQLite LIKE
-
Code used to solve this problem.
-
1-2 sentences explaining what your code does and why you like your method of solving the problem.