Think Summer: Project 2 — 2022

Submission

Students need to submit the following file by 10:00PM EST through Gradescope inside Brightspace.

  1. A Jupyter notebook (a .ipynb file).

We’ve provided you with a template notebook for you to use. Please carefully read this section to get started.

When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click Run  Run All Cells. Next, to export your notebook (your .ipynb file), click on File  Download, and download your .ipynb file.

Questions

Question 1

How many actors have lived to be more than 115 years old? Find the names, birth years, and death years for all actors and actresses who lived more than 115 years.

Simple arithmetic can be done directly in a SQL predicate. For example, you can use the following.

WHERE died - born > 115

Use the people table.

Remember more than 115 years implies 115 years doesn’t qualify but 116 does.

Relevant topics: SQL, queries

Items to submit
  • SQL query used to solve this problem. (1 pt)

  • Output from running the SQL query. (0.5 pts)

Question 2

Use the ratings table to discover how many films have a rating of at least 8 and at least 50000 votes. In a separate cell, show 15 rows with this property.

Relevant topics: SQL, queries

Items to submit
  • SQL query used to solve this problem. (1 pt)

  • Output from running the SQL query. (0.5 pts)

Question 3

In the titles table, the genres column specifies the genre of each movie. Use COUNT to find out how unique genres occur in the database. At this time, don’t consider the fact that multiple genres could be listed in a single row. Any unique combination/permutation of genres should be considered a unique genre. Use the keyword COUNT in combination with DISTINCT.

Relevant topics: SQL, queries

Items to submit
  • SQL query used to solve this problem. (1 pt)

  • Output from running the SQL query. (1 pt)

Question 4

In question 3, we asked you to kind of count how many unique genres there are in the database. In this question, we are going to dig in a little bit more, and use a combination of SQL and R to figure out the actual number of unique genres in the database.

First, use a (slightly modified version of) the SQL query from question 3 to pull the data into an R data.frame. The result should be a data.frame with a single column named genres of length 2283. You can access the column of a data.frame like so.

myDF <- dbGetQuery(conn, "<some query>")
genres <- myDF$genres
head(genres)

Our list, genres, contains strings with comma-separated values, which are the actual genres. If you combine each of the comma-separated string values into one giant comma-separated string, you could then easily split the string into individual values. How many unique genres are there?

You can use the following R functions to solve this problem: unlist, strsplit, and unique.

Here is an example of splitting a string.

my_string <- "first;second;third"
result <- unlist(strsplit(my_string, ";"))

Before you answer the question, make sure all of the genres are actually genres and not erroneous data!

Relevant topics: paste

Items to submit
  • All code used to solve this problem. (3 pts)

  • A list of the unique genres. (1 pt)

  • One sentence explaining why using SQL was valuable in this instance (rather than just using R — think speed). (1 pt)

The following are challenge questions and are worth 0 points. If you get done early give them a try!

Question 5

In the previous question, we were able to get the number of unique genres. In this question, let’s take this one step further. Use the table function in R to calculate how many times each genre appears in the database.

You’ll need to modify your query from question 3.

Relevant topics: table

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 6

In the previous question, we got a count of the number of times each genre appeared in the database. Create a dotchart illustrating this data.

Make sure to exclude the erroneous data!

Relevant topics: dotcharts

Items to submit
  • Code used to solve this problem.

  • Output from running the code.