Think Summer: Project 2 Solutions — 2022

%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db

1. How many actors have lived to be more than 115 years old? Find the names, birth years, and death years for all actors and actresses who lived more than 115 years.

The people who lived more than 115 years can be found as follows, by using the condition that died-born is bigger than 115

SELECT *, died-born FROM people WHERE died-born > 115 LIMIT 10;

Now we can use the COUNT function to see that there are 7 such actors who lived more than 115 years.

SELECT COUNT(died-born) FROM people WHERE died-born > 115 LIMIT 5;

2. Use the ratings table to discover how many films have a rating of at least 8 and at least 50000 votes. In a separate cell, show 15 rows with this property.

We can use conditions to ensure that rating and votes are large enough, and then we can display 15 such results.

SELECT * FROM ratings WHERE (rating >= 8) AND (votes >= 50000) LIMIT 15;

Then we can use the COUNT function to see that there are 670 such titles altogether.

SELECT COUNT(*) FROM ratings WHERE (rating >= 8) AND (votes >= 50000) LIMIT 15;

3. In the titles table, the genres column specifies the genre of each movie. Use COUNT to find out how unique genres occur in the database. At this time, don’t consider the fact that multiple genres could be listed in a single row. Any unique combination/permutation of genres should be considered a unique genre. Use the keyword COUNT in combination with DISTINCT.

At first, we view the genres as tuples, for instance, Action,Adult is a genre (separated by commas). We can do this the same as we did above, just changing the variable type to the variable genres.

SELECT COUNT(*), genres FROM titles GROUP BY genres LIMIT 100;

Now we see that there are 2283 such genres:


4. Use a combination of SQL and R to figure out the actual number of unique genres in the database.

We need to run this at the start, to load the R library for SQL, and to load the database.

conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/movies_and_tv/imdb.db")

We focus on separating the genres into their individual genres. Remember that they are combined, using commas, in the format that we originally have. Here are the first few genres:


Now we split them according to the commas in each:

strsplit(head(myDF$genres), ",")

This will be new for many/most of you, but we can unlist them in R, so that they are not listed separately anymore, but instead, they are in one big vector.

unlist(strsplit(head(myDF$genres), ","))

and now we can use unique to see a list of the genres, removing any duplications:

unique(unlist(strsplit(head(myDF$genres), ",")))

Since this works on the head, we can remove the head now, and see the 29 such genres. Notice that the 21st such genre is missing, i.e., it is empty, so we do not know the genres for some of the movies. So there are 28 different genres altogether.

unique(unlist(strsplit(myDF$genres, ",")))

5. In the previous question, we were able to get the number of unique genres. In this question, let’s take this one step further. Use the table function in R to calculate how many times each genre appears in the database.

Instead of using the unique function (which gives us one copy of each entry), we can use the table function, to discover how many times that each genre appears.

table(unlist(strsplit(myDF$genres, ",")))

6. In the previous question, we got a count of the number of times each genre appeared in the database. Create a dotchart illustrating this data.

Instead of making a table, we can make a dotchart.

plot(table(unlist(strsplit(myDF$genres, ","))))

As we noted above, the last value was erroneous, so we can remove it. A negative index will remove a value, so using -29 will remove the last value in the table.

plot(table(unlist(strsplit(myDF$genres, ",")))[-29])