TDM 20100: Project 8 — 2022

Motivation: Structured Query Language (SQL) is a language used for querying and manipulating data in a database. SQL can handle much larger amounts of data than R and Python can alone. SQL is incredibly powerful. In fact, Cloudflare, a billion dollar company, had much of its starting infrastructure built on top of a Postgresql database (per this thread on hackernews). Learning SQL is well worth your time!

Context: There are a multitude of RDBMSs (relational database management systems). Among the most popular are: MySQL, MariaDB, Postgresql, and SQLite. As we’ve spent much of this semester in the terminal, we will start in the terminal using SQLite.

Scope: SQL, sqlite

Learning Objectives
  • Explain the advantages and disadvantages of using a database over a tool like a spreadsheet.

  • Describe basic database concepts like: rdbms, tables, indexes, fields, query, clause.

  • Basic clauses: select, order by, limit, desc, asc, count, where, from, etc.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/movies_and_tv/imdb.db

In addition, the following is an illustration of the database to help you understand the data.

Database diagram from https://dbdiagram.io
Figure 1. Database diagram from dbdiagram.io

For this project, we will be using the imdb sqlite database. This database contains the data in the directory listed above.

To run SQL queries in a Jupyter Lab notebook, first run the following in a cell at the top of your notebook to establish a connection with the database.

%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb.db

For every following cell where you want to run a SQL query, prepend %%sql to the top of the cell — just like we do for R or bash cells.

Questions

Question 1

Get started by taking a look at the available tables in the database. What tables are available?

You’ll want to prepend %%sql to the top of the cell — it should be the very first line of the cell (no comments or anything else before it).

%%sql

-- Query here

In sqlite, you can show the tables using the following query:

.tables

Unfortunately, sqlite-specific functions can’t be run in a Jupyter Lab cell like that. Instead, we need to use a different query.

SELECT tbl_name FROM sqlite_master WHERE type='table';
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

It’s always a good idea to get an idea what your table(s) looks like. A good way to do this is to get the first 5 rows of data from the table. Write and run 6 queries that return the first 5 rows of data of each table.

To get a better idea of the size of the data, you can use the count clause to get the number of rows in each table. Write an run 6 queries that returns the number of rows in each table.

Run each query in a separate cell, and remember to limit the query to return only 5 rows each.

You can use the limit clause to limit the number of rows returned.

Relevant topics: queries, useful example

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

This dataset contains movie data from imdb.com (an Amazon company). As you can probably guess, it would be difficult to load the data from those tables into a nice, neat dataframe — it would just take too much memory on most systems!

Okay, let’s dig into the titles table a little bit. Run the following query.

SELECT * FROM titles LIMIT 5;

As you can see, every row has a title_id for the associated title of a movie or tv show (or other). What is this title_id? Check out the following link:

At this point, you may suspect that it is the id imdb uses to identify a movie or tv show. Well, let’s see if that is true. Query our database to get any matching titles from the titles table matching the title_id provided in the link above.

The WHERE clause can be used to filter the results of a query.

Relevant topics: queries, useful example

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

That is pretty cool! Not only do you understand what the title_id means inside the database — but now you know that you can associate a web page with each title_id — for example, if you run the following query, you will get a title_id for a "short" called "Carmencita".

SELECT * FROM titles LIMIT 5;
Output
title_id, type, ...
tt0000001, short, ...

If you navigate to www.imdb.com/title/tt0000001/, sure enough, you’ll see a neatly formatted page with data about the movie!

Okay great. Now, if you take a look at the episodes table, you’ll see that there are both an episode_title_id and show_title_id associated with each row.

Let’s try and make sense of this the same way we did before. Write a query using the WHERE clause to find all rows in the episodes table where episode_title_id is tt0903747. What did you get?

Now, write a query using the WHERE clause to find all rows in the episodes table where show_title_id is tt0903747. What did you get?

Relevant topics: queries, useful example

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Very interesting! It looks like we didn’t get any results when we queried for episode_title_id with an id of tt0903747, but we did for show_title_id. This must mean these ids can represent both a show as well as the episode of a show. By that logic, we should be able to find the title of one of the Breaking Bad episodes, in the same way we found the title of the show itself, right?

Okay, take a look at the results of your second query from question (4). Choose one of the episode_title_id values, and query the titles table to find the title of that episode.

Finally, in a browser, verify that the title of the episode is correct. To verify this, take the episode_title_id and plug it into the following link.

So, I used tt1232248 for my query. I would check to make sure it matches this.

Relevant topics: queries, useful example

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.