Think Summer: Project 2 Solutions — 2022
1. How many actors have lived to be more than 115 years old? Find the names, birth years, and death years for all actors and actresses who lived more than 115 years.
The people who lived more than 115 years can be found as follows,
by using the condition that
died-born is bigger than 115
%%sql SELECT *, died-born FROM people WHERE died-born > 115 LIMIT 10;
Now we can use the
COUNT function to see that there are 7 such actors who lived more than 115 years.
%%sql SELECT COUNT(died-born) FROM people WHERE died-born > 115 LIMIT 5;
2. Use the ratings table to discover how many films have a rating of at least 8 and at least 50000 votes. In a separate cell, show 15 rows with this property.
We can use conditions to ensure that rating and votes are large enough, and then we can display 15 such results.
%%sql SELECT * FROM ratings WHERE (rating >= 8) AND (votes >= 50000) LIMIT 15;
Then we can use the
COUNT function to see that there are 670 such titles altogether.
%%sql SELECT COUNT(*) FROM ratings WHERE (rating >= 8) AND (votes >= 50000) LIMIT 15;
3. In the
titles table, the genres column specifies the genre of each movie. Use
COUNT to find out how unique genres occur in the database. At this time, don’t consider the fact that multiple
genres could be listed in a single row. Any unique combination/permutation of genres should be considered a unique genre. Use the keyword
COUNT in combination with
At first, we view the genres as tuples, for instance,
Action,Adult is a genre (separated by commas). We can do this the same as we did above, just changing the variable type to the variable genres.
%%sql SELECT COUNT(*), genres FROM titles GROUP BY genres LIMIT 100;
Now we see that there are 2283 such genres:
%%sql SELECT COUNT(DISTINCT genres) FROM titles LIMIT 5;
We need to run this at the start, to load the R library for SQL, and to load the database.
%%R library(RSQLite) conn <- dbConnect(RSQLite::SQLite(), "/anvil/projects/tdm/data/movies_and_tv/imdb.db")
We focus on separating the genres into their individual genres. Remember that they are combined, using commas, in the format that we originally have. Here are the first few genres:
Now we split them according to the commas in each:
%%R strsplit(head(myDF$genres), ",")
This will be new for many/most of you, but we can
unlist them in R, so that they are not listed separately anymore, but instead, they are in one big vector.
%%R unlist(strsplit(head(myDF$genres), ","))
and now we can use
unique to see a list of the genres, removing any duplications:
%%R unique(unlist(strsplit(head(myDF$genres), ",")))
Since this works on the
head, we can remove the
head now, and see the 29 such genres. Notice that the 21st such genre is missing, i.e., it is empty, so we do not know the genres for some of the movies. So there are 28 different genres altogether.
%%R unique(unlist(strsplit(myDF$genres, ",")))
5. In the previous question, we were able to get the number of unique genres. In this question, let’s take this one step further. Use the
table function in R to calculate how many times each genre appears in the database.
Instead of using the
unique function (which gives us one copy of each entry), we can use the
table function, to discover how many times that each genre appears.
%%R table(unlist(strsplit(myDF$genres, ",")))
6. In the previous question, we got a count of the number of times each genre appeared in the database. Create a dotchart illustrating this data.
Instead of making a
table, we can make a
%%R plot(table(unlist(strsplit(myDF$genres, ","))))
As we noted above, the last value was erroneous, so we can remove it. A negative index will remove a value, so using -29 will remove the last value in the table.
%%R plot(table(unlist(strsplit(myDF$genres, ",")))[-29])