STAT 29000: Project 12 — Fall 2021

Motivation: Databases are (usually) comprised of many tables. It is imperative that we learn how to combine data from multiple tables using queries. To do so we perform "joins"! In this project we will explore learn about and practice using joins on our imdb database, as it has many tables where the benefit of joins is obvious.

Context: We’ve introduced a variety of SQL commands that let you filter and extract information from a database in an systematic way. In this project we will introduce joins, a powerful method to combine data from different tables.

Scope: SQL, sqlite, joins

Learning Objectives
  • Briefly explain the differences between left and inner join and demonstrate the ability to use the join statements to solve a data-driven problem.

  • Perform grouping and aggregate data using group by and the following functions: COUNT, MAX, SUM, AVG, LIKE, HAVING.

  • Showcase the ability to filter, alias, and write subqueries.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/movies_and_tv/imdb.db

In addition, the following is an illustration of the database to help you understand the data.

Database diagram from https://dbdiagram.io
Figure 1. Database diagram from dbdiagram.io

For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.

To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook.

%load_ext sql
%sql sqlite:////depot/datamine/data/movies_and_tv/imdb.db

The first command loads the sql extension. The second command connects to the database.

For every following cell where you want to run a SQL query, prepend %%sql to the top of the cell — just like we do for R or bash cells.

Questions

Question 1

In the previous project, we provided you with a query to get the number of MCU movies that premiered in each year.

Now that we are learning about joins, we have the ability to make much more interesting queries!

Use the provided list of title_id values to get a list of the MCU movie primary_title values, premiered values, and rating (from the provided list of MCU movies).

Which movie had the highest rating? Modify your query to return only the 5 highest and 5 lowest rated movies (again, from the MCU list).

List of MCU title_ids
('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286')
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Run the following query.

%%sql

SELECT * FROM titles WHERE title_id IN ('tt0371746', 'tt0800080', 'tt1228705', 'tt0800369', 'tt0458339', 'tt0848228', 'tt1300854', 'tt1981115', 'tt1843866', 'tt2015381', 'tt2395427', 'tt0478970', 'tt3498820', 'tt1211837', 'tt3896198', 'tt2250912', 'tt3501632', 'tt1825683', 'tt4154756', 'tt5095030', 'tt4154664', 'tt4154796', 'tt6320628', 'tt3480822', 'tt9032400', 'tt9376612', 'tt9419884', 'tt10648342', 'tt9114286');

Pay close attention to the movies in the output. You will notice there are movies presented in this query that are (likely) not in the query results you got for question (1).

Write a query that returns the primary_title of those movies not shown in the result of question (1) but that are shown in the result of the query above. You can use the query in question (1) as a subquery to answer this.

Can you notice a pattern to said movies?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

In the previous questions we explored what is actually the difference between an INNER JOIN, and a LEFT JOIN. It is likely you used an INNER JOIN/JOIN in your solution to question (1). As a result, the MCU movies that did not yet have a rating in IMDB are not shown in the output of question (1).

Modify your query from question (1) so that it returns a list of all MCU movies with their associated rating, regardless of whether or not the movie has a rating.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

In the previous project, question (5) asked you to write a query that returns the average number of words in the primary_title column, by year, and only for years where the average number of words in the primary_title is less than 3.

Okay, great. What would be more interesting would be to see the average number of words in the primary_title column for titles with a rating of 8.5 or higher. Write a query to do that. How many words on average does a title with 8.5 or higher rating have?

Write another query that does the same for titles with < 8.5 rating. Is the average title length notably different?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

We have a fun database, and you’ve learned a new trick (joins). Use your newfound knowledge to write a query that uses joins to accomplish a task you couldn’t previously (easily) tackle, and answers a question you are interested in.

Explain what your query does, and talk about the results. Explain why you chose either a LEFT join or INNER join.

Items to submit
  • A written question about the movies/tv shows in the database.

  • Code used to solve this problem.

  • Output from running the code.

  • Explanation of the results, what your query does, and why you chose either a LEFT or INNER join.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.