Think Summer: Day 2 Notes — 2022

Loading the database

%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db

Avoiding NULL values, and making calculations within our SQL queries

We can start by loading the titles table.

%%sql
SELECT * FROM titles LIMIT 5;

and then making sure that we avoid rows in which premiered is NULL and the rows in which ended is NULL.

%%sql
SELECT * FROM titles WHERE (premiered IS NOT NULL)
   AND (ended IS NOT NULL) LIMIT 5;

Then we can calculate the difference between the year that the show ended and the year that the show premiered.

%%sql
SELECT *, ended-premiered FROM titles WHERE (premiered IS NOT NULL)
   AND (ended IS NOT NULL) LIMIT 5;

We can given this new variable a name, for instance, the length of the show’s run on TV (in years). Then we can order the results by the length in years, given in DESC (descending) order.

%%sql
SELECT *, ended-premiered AS length FROM titles WHERE (premiered IS NOT NULL)
   AND (ended IS NOT NULL) ORDER BY length DESC LIMIT 5;

How long was Friends on TV?

We can use the query above as a starting point, just looking up Friends as the title and seeing which shows with that title were on TV after 1993. We see that Friends was on TV for 10 years.

%%sql
SELECT *, ended-premiered AS length FROM titles WHERE (premiered IS NOT NULL)
   AND (ended IS NOT NULL) AND
(primary_title = 'Friends') AND (premiered > 1993) LIMIT 5;

How many types of titles are there?

Here are a few of the types of titles

%%sql
SELECT type FROM titles LIMIT 5;

There are lots of repeats, so we ask for DISTINCT types, i.e., removing the repetitions.

%%sql
SELECT DISTINCT type FROM titles LIMIT 5;

and now we can ask for a few more, i.e., we can increase the limit.

%%sql
SELECT DISTINCT type FROM titles LIMIT 100;

Looks like there are 12 types altogether:

%%sql
SELECT COUNT(DISTINCT type) FROM titles LIMIT 100;

How many times did each type occur?

We can group the types and count each of them.

%%sql
SELECT COUNT(*), type FROM titles GROUP BY type LIMIT 100;

How many times did each genre occur?

At first, we view the genres as tuples, for instance, Action,Adult is a genre (separated by commas). We can do this the same as we did above, just changing the variable type to the variable genres.

%%sql
SELECT COUNT(*), genres FROM titles GROUP BY genres LIMIT 100;

Now we see that there are 2283 such genres:

%%sql
SELECT COUNT(DISTINCT genres) FROM titles LIMIT 5;

We will come back to the question above, about the total number of genres, when we learn how to import SQL queries into R dataframes.

We can revisit Question 2 from Project 1, about the movies that received at least 2 million votes, and finding the titles of those movies.

We will need the titles table and the ratings table.

%%sql
SELECT * FROM titles LIMIT 5;
%%sql
SELECT * FROM ratings LIMIT 5;

Now we join these two tables, and restrict the results to those movies with at least 2000000 votes.

%%sql
SELECT * FROM titles AS t JOIN ratings AS r
ON t.title_id = r.title_id WHERE votes > 2000000 LIMIT 5;

Who are the actors and actresses in the TV show Friends?

We will need the people table and the crew table.

%%sql
SELECT * FROM people LIMIT 5;
%%sql
SELECT * FROM crew LIMIT 5;

Now we join these two tables together.

%%sql
SELECT * FROM crew AS c JOIN people AS p ON c.person_id = p.person_id LIMIT 5;

and now we also join with the titles table, and we focus on the title_id for Friends, which is tt0108778. There are 10 people listed, from the Friends TV show.

%%sql
SELECT * FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id
JOIN people AS p ON c.person_id = p.person_id
WHERE t.title_id = 'tt0108778' LIMIT 50;

and 8 of them are actors or actresses

%%sql
SELECT * FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id
JOIN people AS p ON c.person_id = p.person_id
WHERE (t.title_id = 'tt0108778')
AND ((c.category = 'actress') OR (c.category = 'actor')) LIMIT 50;

1. James Caan died last week. What was his highest rated movie?

He appeared in The Godfather, which has a rating of 9.2

%%sql
SELECT * FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id
 JOIN people AS p ON c.person_id = p.person_id
 JOIN ratings AS r ON t.title_id = r.title_id
 WHERE (p.name = 'James Caan') AND (t.type = 'movie') ORDER BY r.rating DESC LIMIT 5;

2. How many movies has Emma Watson appeared in?

She has appeared in a total of 18 movies.

%%sql
SELECT COUNT(*) FROM titles AS t JOIN crew AS c ON t.title_id = c.title_id
 JOIN people AS p ON c.person_id = p.person_id
 WHERE (p.name = 'Emma Watson') AND (t.type = 'movie');

The most popular movie that premiered in 1940 was The Great Dictator, with a rating of 8.4

%%sql
SELECT * FROM titles AS t JOIN ratings AS r ON t.title_id = r.title_id
 WHERE (t.premiered = 1940) AND (t.type = 'movie') ORDER BY r.rating DESC LIMIT 5;

4. How many times has The Awakening been used as a title?

The Awakening has been used 131 times as a title

%%sql
SELECT COUNT(*) FROM titles WHERE primary_title = 'The Awakening' LIMIT 5;

5. How many episodes of Friends were there?

We start by finding the title_id for Friends.

%%sql
SELECT * FROM titles WHERE (primary_title = 'Friends') AND (premiered > 1992) LIMIT 5;

Now we find the number of episodes per season

%%sql
SELECT COUNT(*), season_number FROM episodes WHERE show_title_id = 'tt0108778' GROUP BY season_number;

Season 10 differs from what I expected (I was guessing that there would be 18 episodes), so I checked further on this.

%%sql
SELECT * FROM episodes AS e JOIN titles AS t ON e.episode_title_id = t.title_id WHERE show_title_id = 'tt0108778' AND season_number = 10 ORDER BY episode_number;

OK so they combined The Last One, which is two episodes, into just one listing.

So there are 235 episodes listed, although there were actually 236 episodes in the show altogether!

%%sql
SELECT COUNT(*) FROM episodes WHERE show_title_id = 'tt0108778';