STAT 29000: Project 13 — Fall 2020

Motivation: Databases you will work with won’t necessarily come organized in the way that you like. Getting really comfortable writing longer queries where you have to perform many joins, alias fields and tables, and aggregate results, is important. In addition, gaining some familiarity with terms like primary key, and foreign key will prove useful when you need to search for help online. In this project we will write some more complicated queries with a fun database. Proper preparation prevents poor performance, and that means practice!

Context: We are towards the end of a series of projects that give you an opportunity to practice using SQL. In this project, we will reinforce topics you’ve already learned, with a focus on subqueries and joins.

Scope: SQL, sqlite

Learning objectives
  • Write and run SQL queries in sqlite on real-world data.

  • Identify primary and foreign keys in a SQL table.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/movies_and_tv/imdb.db

For this project you will use SQLite to access the data. To connect to the database, copy and paste the following before your solutions in your .Rmd:

````markdown

library(RSQLite)
imdb <- dbConnect(RSQLite::SQLite(), "/class/datamine/data/movies_and_tv/imdb.db")

``

If you want to use a SQLite-specific function like .tables (or prefer to test things in the Terminal), you will need to use the Terminal to connect to the database and run queries. To do so, you can connect to RStudio Server at rstudio.scholar.rcac.purdue.edu, and navigate to the terminal. In the terminal execute the command:

sqlite3 /class/datamine/data/movies_and_tv/imdb.db

From there, the SQLite-specific commands will function properly. They will not function properly in an SQL code chunk. To display the SQLite-specific commands in a code chunk without running the code, use a code chunk with the option eval=F like this:

````markdown

SELECT * FROM titles LIMIT 5;

``

This will allow the code to be displayed without throwing an error.

Questions

Question 1

A primary key is a field in a table which uniquely identifies a row in the table. Primary keys must be unique values, and this is enforced at the database level. A foreign key is a field whose value matches a primary key in a different table. A table can have 0-1 primary key, but it can have 0+ foreign keys. Examine the titles table. Do you think there are any primary keys? How about foreign keys? Now examine the episodes table. Based on observation and the column names, do you think there are any primary keys? How about foreign keys?

A primary key can also be a foreign key.

Here are two videos. The first video will remind you how to find the names of all of the tables in the imdb database. The second video will introduce you to the titles and episodes tables in the imdb database.

Items to submit
  • List any primary or foreign keys in the titles table.

  • List any primary or foreign keys in the episodes table.

Question 2

If you paste a title_id to the end of the following url, it will pull up the page for the title. For example, www.imdb.com/title/tt0413573 leads to the page for the TV series Grey’s Anatomy. Write a SQL query to confirm that the title_id tt0413573 does indeed belong to Grey’s Anatomy. Then browse imdb.com and find your favorite TV show. Get the title_id from the url of your favorite TV show and run the following query, to confirm that the TV show is in our database:

SELECT * FROM titles WHERE title_id='<title id here>';

Make sure to replace "<title id here>" with the title_id of your favorite show. If your show does not appear, or has only a single season, pick another show until you find one we have in our database with multiple seasons.

Items to submit
  • SQL query used to confirm that title_id tt0413573 does indeed belong to Grey’s Anatomy.

  • The output of the query.

  • The title_id of your favorite TV show.

  • SQL query used to confirm the title_id for your favorite TV show.

  • The output of the query.

Question 3

The episode_title_id column in the episodes table references titles of individual episodes of a TV series. The show_title_id references the titles of the show itself. With that in mind, write a query that gets a list of all episodes_title_id (found in the episodes table), with the associated primary_title (found in the titles table) for each episode of Grey’s Anatomy.

This video shows how to extract titles of episodes in the imdb database.

Items to submit
  • SQL query used to solve the problem in a code chunk.

Question 4

We want to write a query that returns the title and rating of the highest rated episode of your favorite TV show, which you chose in (2). In order to do so, we will break the task into two parts in (4) and (5). First, write a query that returns a list of episode_title_id (found in the episodes table), with the associated primary_title (found in the titles table) for each episode.

This part is just like question (3) but this time with your favorite TV show, which you chose in (2).

This video shows how to use a subquery, to JOIN a total of three tables in the imdb database.

Items to submit
  • SQL query used to solve the problem in a code chunk.

  • The first 5 results from your query.

Question 5

Write a query that adds the rating to the end of each episode. To do so, use the query you wrote in (4) as a subquery. Which episode has the highest rating? Is it also your favorite episode?

Examples that utilize the relevant topics in this problem can be found here.

Items to submit
  • SQL query used to solve the problem in a code chunk.

  • The episode_title_id, primary_title, and rating of the top rated episode from your favorite TV series, in question (2).

  • A statement saying whether the highest rated episode is also your favorite episode.