# STAT 39000: Project 15 — Fall 2020

Motivation: We’ve done a lot of work with SQL this semester. Let’s review concepts in this project and mix and match R and SQL to solve data-driven problems.

Context: In this project, we will reinforce topics you’ve already learned, with a focus on SQL.

Scope: SQL, sqlite, R

Learning objectives
• Write and run SQL queries in `sqlite` on real-world data.

• Use SQL from within R.

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/movies_and_tv/imdb.db`

F.R.I.E.N.D.S is a popular tv show. They have an interesting naming convention for the names of their episodes. They all begin with the text "The One …​". There are 6 primary characters in the show: Chandler, Joey, Monica, Phoebe, Rachel, and Ross. Let’s use SQL and R to take a look at how many times each characters' names appear in the title of the episodes.

## Questions

### Question 1

Write a query that gets the `episode_title_id`, `primary_title`, `rating`, and `votes`, of all of the episodes of Friends (`title_id` is tt0108778).

 You can slightly modify the solution to question (5) in project 13.
Items to submit
• SQL query used to answer the question.

• First 5 results of the query.

### Question 2

Now that you have a working query, connect to the database and run the query to get the data into an R data frame. In previous projects, we learned how to used regular expressions to search for text. For each character, how many episodes `primary_title`s contained their name?

Items to submit
• R code in a code chunk that was used to find the solution.

• The solution pasted below the code chunk.

### Question 3

Create a graphic showing our results in (2) using your favorite package. Make sure the plot has a good title, x-label, y-label, and try to incorporate some of the following colors: #273c8b, #bd253a, #016f7c, #f56934, #016c5a, #9055b1, #eaab37.

Items to submit
• The R code used to generate the graphic.

• The graphic in a png or jpg/jpeg format.

### Question 4

Now we will turn our focus to other information in the database. Use a combination of SQL and R to find which of the following 3 genres has the highest average rating for movies (see `type` column from `titles` table): Romance, Comedy, Animation. In the `titles` table, you can find the genres in the `genres` column. There may be some overlap (i.e. a movie may have more than one genre), this is ok.

To query rows which have the genre Action as one of its genres:

``SELECT * FROM titles WHERE genres LIKE '%action%';``
Items to submit
• Any code you used to solve the problem in a code chunk.

• The average rating of each of the genres listed for movies.

### Question 5

Write a function called `top_episode` in R which accepts the path to the `imdb.db` database, as well as the `title_id` of a tv series (for example, "tt0108778" or "tt1266020"), and returns the `season_number`, `episode_number`, `primary_title`, and `rating` of the highest rated episode in the series. Test it out on some of your favorite series, and share the results.

Items to submit
• Any code you used to solve the problem in a code chunk.

• The results for at least 3 of your favorite tv series.