Think Summer: Project 4 — 2022


Students need to submit the following file by 10:00PM EST through Gradescope inside Brightspace.

  1. A Jupyter notebook (a .ipynb file).

We’ve provided you with a template notebook for you to use. Please carefully read this section to get started.

When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click Run  Run All Cells. Next, to export your notebook (your .ipynb file), click on File  Download, and download your .ipynb file.


Question 1

Aggregate functions like COUNT, AVG, SUM, MIN, and MAX are very useful. In particular, running queries like the following are great.

SELECT COUNT(*) FROM titles WHERE premiered = 2008;

However, in this scenario we want to know how many movies premiered in 2008. How often would we rather just see these numbers for every year, rather than 1 year at a time? This is where aggregate functions really start to have more power.

In the titles table, the premiered column specifies the year that a movie was premiered. Use COUNT to find how many movies premiered in each year in the database, in a single query.

Use only SQL to answer this question.
If you feel like practicing your R skills, try and solve this using R instead of SQL (for 0 points).

Relevant topics: SQL, queries, aggregate functions

Items to submit
  • SQL query used to solve the question. (1.5 pts)

  • Output from running the code. (.5 pts)

Question 2

In question (1), we have an example that starts to demonstrate how those simple aggregate functions are really quite powerful. The results, however, do have some ways that they could be improved. Improve your solution to question (1) in the following ways:

  1. Use aliasing to rename the results of the COUNT function, so that rather than being labeled COUNT(*), the column appears as movies premiered.

  2. While it can be interesting to see the number of movies premiering long ago, perhaps we don’t need to see all of this information. Edit your query to only include movies from 1970+.

Use only SQL to answer this question.

Relevant topics: SQL, queries, aggregate functions

Items to submit
  • SQL query used to solve the question. (1.5 pts)

  • Output from running the code. (.5 pts)

Question 3

Let’s try to combine a little bit of everything we’ve learned so far. In the previous project, you picked a TV series to perform queries on. Use that same TV series (or, if you don’t want to choose a TV series, title_id "tt0413573" is a good one to use) for this question.

We want to get the episode_number, season_number, primary_title, title_id, and rating of every episode of your TV series for only seasons where the average rating was at least X, in a single query.

This will be a large query with multiple joins, and sub-queries. For this reason, we will break this down into parts, each worth some points.

Part 1

First, write a query that gets the episode_title_id and season_number for every episode of our TV show.

Part 2

Next, use your query from part (1) as a sub-query, and get the season_number`s for the seasons with an average `rating of at least 8.0. The result should be a single column (season_number) with 10 values (if you are using title_id "tt0413573").

Remember that a TV show may have an overall rating and individual episode ratings. For example, for Grey’s Anatomy, you can get the overall rating by running this query.

SELECT rating FROM ratings WHERE title_id = 'tt0413573';

But, we want you to get the average rating, by season.

Part 3

Write a query that gets the episode_number, season_number, primary_title, and title_id for the TV show with your title_id (for example, "tt0413573"). Make sure to order the results first by season_number and then by episode_number.

Part 4

At this stage there are only 2 missing components to our query from part (3). First is the fact that all episodes from all seasons are returned. To address this, use logical AND and the IN operator to limit the returned episodes from your part (3) query to only the `season_number`s returned in your part (2) query.

This may sound difficult, but it isn’t! Start with your part (3) query, and tack on AND <column name> IN (<sub query>). Of course, you need to fill in <column name> with the correct column name, and <sub query> with our part (2) query.

Part 5

Finally, the last missing component is the individual rating for each episode. Simply start with your query from part (4), and perform a join with the ratings table to get the rating for each episode.

In addition, the rating isn’t available in our query from part (3).

Relevant topics:

Items to submit
  • SQL queries for each of parts 1 - 5. (.5 pts each)

  • Output from running queries from each of parts 1 - 5. (.5 pts each)

Question 4

Use R to solve this question. (This question does not need a tapply.) What was the most popular day to travel in 2005, in terms of the total number of flights? What was the least popular day to travel?

Use the tapply function in R to solve the Questions 5, 6, 7 below.

Question 5

Which airplane (listed by TailNum) flew the most miles altogether in 2005?

Question 6

Among the three big New York City airports (JFK, LGA, EWR), which of these airports had the worst DepDelay (on average) in 2005? (Can you solve this with 1 line of R, using a tapply, rather than 3 lines of R? Hint: After you run the tapply, you can index your results using [c("JFK", "LGA", "EWR")] to lookup all 3 airports at once.)

Question 7

Which flight path (i.e., which Origin-to-Dest pair) has the longest average departure delay?