STAT 39000: Project 15 — Fall 2020

Motivation: We’ve done a lot of work with SQL this semester. Let’s review concepts in this project and mix and match R and SQL to solve data-driven problems.

Context: In this project, we will reinforce topics you’ve already learned, with a focus on SQL.

Scope: SQL, sqlite, R

Learning objectives
  • Write and run SQL queries in sqlite on real-world data.

  • Use SQL from within R.


The following questions will use the dataset found in Scholar:


F.R.I.E.N.D.S is a popular tv show. They have an interesting naming convention for the names of their episodes. They all begin with the text "The One …​". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let’s use SQL and R to take a look at how many times each characters' names appear in the title of the episodes.


Question 1

Write a query that gets the episode_title_id, primary_title, rating, and votes, of all of the episodes of Friends (title_id is tt0108778).

You can slightly modify the solution to question (5) in project 13.

Items to submit
  • SQL query used to answer the question.

  • First 5 results of the query.

Question 2

Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to used regular expressions to search for text. For each character, how many episodes `primary_title`s contained their name?

Items to submit
  • R code in a code chunk that was used to find the solution.

  • The solution pasted below the code chunk.

Question 3

Create a graphic showing our results in (2) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.

Items to submit
  • The R code used to generate the graphic.

  • The graphic in a png or jpg/jpeg format.

Question 4

Now we will turn our focus to other information in the database. Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see type column from titles table): Romance, Comedy, Animation. In the titles table, you can find the genres in the genres column. There may be some overlap (i.e. a movie may have more than one genre), this is ok.

To query rows which have the genre Action as one of its genres:

SELECT * FROM titles WHERE genres LIKE '%action%';
Items to submit
  • Any code you used to solve the problem in a code chunk.

  • The average rating of each of the genres listed for movies.

Question 5

Write a function called top_episode in R which accepts the path to the imdb.db database, as well as the title_id of a tv series (for example, "tt0108778" or "tt1266020"), and returns the season_number, episode_number, primary_title, and rating of the highest rated episode in the series. Test it out on some of your favorite series, and share the results.

Items to submit
  • Any code you used to solve the problem in a code chunk.

  • The results for at least 3 of your favorite tv series.