STAT 29000: Project 15 — Fall 2020

Motivation: We’ve done a lot of work with SQL this semester. Let’s review concepts in this project and mix and match R and SQL to solve data-driven problems.

Context: In this project, we will reinforce topics you’ve already learned, with a focus on SQL.

Scope: SQL, sqlite, R

Learning objectives
  • Write and run SQL queries in sqlite on real-world data.

  • Use SQL from within R.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/movies_and_tv/imdb.db

Questions

Question 1

What is the first year where our database has > 1000 titles? Use the premiered column in the titles table as our year. What year has the most titles?

There could be missing values in premiered. We want to avoid them in SQL queries, for now. You can learn more about the missing values (or NULL) in SQL here.

Items to submit
  • SQL queries used to answer the questions.

  • What year is the first year to have > 1000 titles?

  • What year has the most titles?

Question 2

What, and how many, unique type are there from the titles table? For the year found in question (1) with the most titles, how many titles of each type are there?

Items to submit
  • SQL queries used to answer the questions.

  • How many and what are the unique types from the titles table?

  • A list of type and and count for the year (premiered) that had the most titles.

F.R.I.E.N.D.S is a popular tv show. They have an interesting naming convention for the names of their episodes. They all begin with the text "The One …​". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let’s use SQL and R to take a look at how many times each characters' names appear in the title of the episodes.

Question 3

Write a query that gets the episode_title_id, primary_title, rating, and votes, of all of the episodes of Friends (title_id is tt0108778).

You can slightly modify the solution to question (5) in project 13.

Items to submit
  • SQL query used to answer the question.

  • First 5 results of the query.

Question 4

Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to used regular expressions to search for text. For each character, how many episodes `primary_title`s contained their name?

Items to submit
  • R code in a code chunk that was used to find the solution.

  • The solution pasted below the code chunk.

Question 5

Create a graphic showing our results in (4) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.

Items to submit
  • The R code used to generate the graphic.

  • The graphic in a png or jpg/jpeg format.

Question 6

Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see type column from titles table): Romance, Comedy, Animation. In the titles table, you can find the genres in the genres column. There may be some overlap (i.e. a movie may have more than one genre), this is ok.

To query rows which have the genre Action as one of its genres:

SELECT * FROM titles WHERE genres LIKE '%action%';
Items to submit
  • Any code you used to solve the problem in a code chunk.

  • The average rating of each of the genres listed for movies.