# Think Summer: Project 4 — 2022

## Submission

Students need to submit the following file by 10:00PM EST through Gradescope inside Brightspace.

1. A Jupyter notebook (a `.ipynb` file).

We’ve provided you with a template notebook for you to use. Please carefully read this section to get started.

 When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click Run  Run All Cells. Next, to export your notebook (your `.ipynb` file), click on File  Download, and download your `.ipynb` file.

## Questions

### Question 1

Aggregate functions like `COUNT`, `AVG`, `SUM`, `MIN`, and `MAX` are very useful. In particular, running queries like the following are great.

``SELECT COUNT(*) FROM titles WHERE premiered = 2008;``

However, in this scenario we want to know how many movies premiered in 2008. How often would we rather just see these numbers for every year, rather than 1 year at a time? This is where aggregate functions really start to have more power.

In the `titles` table, the `premiered` column specifies the year that a movie was premiered. Use `COUNT` to find how many movies premiered in each year in the database, in a single query.

 Use only SQL to answer this question.
 If you feel like practicing your R skills, try and solve this using R instead of SQL (for 0 points).

Relevant topics: SQL, queries, aggregate functions

Items to submit
• SQL query used to solve the question. (1.5 pts)

• Output from running the code. (.5 pts)

### Question 2

In question (1), we have an example that starts to demonstrate how those simple aggregate functions are really quite powerful. The results, however, do have some ways that they could be improved. Improve your solution to question (1) in the following ways:

1. Use aliasing to rename the results of the `COUNT` function, so that rather than being labeled `COUNT(*)`, the column appears as `movies premiered`.

2. While it can be interesting to see the number of movies premiering long ago, perhaps we don’t need to see all of this information. Edit your query to only include movies from 1970+.

 Use only SQL to answer this question.

Relevant topics: SQL, queries, aggregate functions

Items to submit
• SQL query used to solve the question. (1.5 pts)

• Output from running the code. (.5 pts)

### Question 3

Let’s try to combine a little bit of everything we’ve learned so far. In the previous project, you picked a TV series to perform queries on. Use that same TV series (or, if you don’t want to choose a TV series, title_id "tt0413573" is a good one to use) for this question.

We want to get the `episode_number`, `season_number`, `primary_title`, `title_id`, and `rating` of every episode of your TV series for only seasons where the average `rating` was at least X, in a single query.

This will be a large query with multiple joins, and sub-queries. For this reason, we will break this down into parts, each worth some points.

#### Part 1

First, write a query that gets the `episode_title_id` and `season_number` for every episode of our TV show.

#### Part 2

Next, use your query from part (1) as a sub-query, and get the `season_number`s for the seasons with an average `rating` of at least 8.0. The result should be a single column (`season_number`) with 10 values (if you are using title_id "tt0413573").

 Remember that a TV show may have an overall rating and individual episode ratings. For example, for Grey’s Anatomy, you can get the overall rating by running this query. ``SELECT rating FROM ratings WHERE title_id = 'tt0413573';`` But, we want you to get the average rating, by season.

#### Part 3

Write a query that gets the `episode_number`, `season_number`, `primary_title`, and `title_id` for the TV show with your title_id (for example, "tt0413573"). Make sure to order the results first by `season_number` and then by `episode_number`.

#### Part 4

At this stage there are only 2 missing components to our query from part (3). First is the fact that all episodes from all seasons are returned. To address this, use logical `AND` and the `IN` operator to limit the returned episodes from your part (3) query to only the `season_number`s returned in your part (2) query.

 This may sound difficult, but it isn’t! Start with your part (3) query, and tack on `AND IN ()`. Of course, you need to fill in `` with the correct column name, and `` with our part (2) query.

#### Part 5

Finally, the last missing component is the individual `rating` for each episode. Simply start with your query from part (4), and perform a join with the `ratings` table to get the `rating` for each episode.

In addition, the `rating` isn’t available in our query from part (3).

Relevant topics:

Items to submit
• SQL queries for each of parts 1 - 5. (.5 pts each)

• Output from running queries from each of parts 1 - 5. (.5 pts each)

### Question 4

Use R to solve this question. (This question does not need a `tapply`.) What was the most popular day to travel in 2005, in terms of the total number of flights? What was the least popular day to travel?

 Use the `tapply` function in R to solve the Questions 5, 6, 7 below.

### Question 5

Which airplane (listed by `TailNum`) flew the most miles altogether in 2005?

### Question 6

Among the three big New York City airports (`JFK`, `LGA`, `EWR`), which of these airports had the worst `DepDelay` (on average) in 2005? (Can you solve this with 1 line of R, using a `tapply`, rather than 3 lines of R? Hint: After you run the `tapply`, you can index your results using `[c("JFK", "LGA", "EWR")]` to lookup all 3 airports at once.)

### Question 7

Which flight path (i.e., which Origin-to-Dest pair) has the longest average departure delay?