Think Summer: Project 4 — 2022
Submission
Students need to submit the following file by 10:00PM EST through Gradescope inside Brightspace.
-
A Jupyter notebook (a
.ipynb
file).
We’ve provided you with a template notebook for you to use. Please carefully read this section to get started.
When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click |
Questions
Question 1
Aggregate functions like COUNT
, AVG
, SUM
, MIN
, and MAX
are very useful. In particular, running queries like the following are great.
SELECT COUNT(*) FROM titles WHERE premiered = 2008;
However, in this scenario we want to know how many movies premiered in 2008. How often would we rather just see these numbers for every year, rather than 1 year at a time? This is where aggregate functions really start to have more power.
In the titles
table, the premiered
column specifies the year that a movie was premiered. Use COUNT
to find how many movies premiered in each year in the database, in a single query.
Use only SQL to answer this question. |
If you feel like practicing your R skills, try and solve this using R instead of SQL (for 0 points). |
Relevant topics: SQL, queries, aggregate functions
-
SQL query used to solve the question. (1.5 pts)
-
Output from running the code. (.5 pts)
Question 2
In question (1), we have an example that starts to demonstrate how those simple aggregate functions are really quite powerful. The results, however, do have some ways that they could be improved. Improve your solution to question (1) in the following ways:
-
Use aliasing to rename the results of the
COUNT
function, so that rather than being labeledCOUNT(*)
, the column appears asmovies premiered
. -
While it can be interesting to see the number of movies premiering long ago, perhaps we don’t need to see all of this information. Edit your query to only include movies from 1970+.
Use only SQL to answer this question. |
Relevant topics: SQL, queries, aggregate functions
-
SQL query used to solve the question. (1.5 pts)
-
Output from running the code. (.5 pts)
Question 3
Let’s try to combine a little bit of everything we’ve learned so far. In the previous project, you picked a TV series to perform queries on. Use that same TV series (or, if you don’t want to choose a TV series, title_id "tt0413573" is a good one to use) for this question.
We want to get the episode_number
, season_number
, primary_title
, title_id
, and rating
of every episode of your TV series for only seasons where the average rating
was at least X, in a single query.
This will be a large query with multiple joins, and sub-queries. For this reason, we will break this down into parts, each worth some points.
Part 1
First, write a query that gets the episode_title_id
and season_number
for every episode of our TV show.
Part 2
Next, use your query from part (1) as a sub-query, and get the season_number`s for the seasons with an average `rating
of at least 8.0. The result should be a single column (season_number
) with 10 values (if you are using title_id "tt0413573").
Remember that a TV show may have an overall rating and individual episode ratings. For example, for Grey’s Anatomy, you can get the overall rating by running this query.
But, we want you to get the average rating, by season. |
Part 3
Write a query that gets the episode_number
, season_number
, primary_title
, and title_id
for the TV show with your title_id (for example, "tt0413573"). Make sure to order the results first by season_number
and then by episode_number
.
Part 4
At this stage there are only 2 missing components to our query from part (3). First is the fact that all episodes from all seasons are returned. To address this, use logical AND
and the IN
operator to limit the returned episodes from your part (3) query to only the `season_number`s returned in your part (2) query.
This may sound difficult, but it isn’t! Start with your part (3) query, and tack on |
Part 5
Finally, the last missing component is the individual rating
for each episode. Simply start with your query from part (4), and perform a join with the ratings
table to get the rating
for each episode.
In addition, the rating
isn’t available in our query from part (3).
Relevant topics:
-
SQL queries for each of parts 1 - 5. (.5 pts each)
-
Output from running queries from each of parts 1 - 5. (.5 pts each)
Question 4
Use R to solve this question.
(This question does not need a tapply
.) What was the most popular day to travel in 2005, in terms of the total number of flights? What was the least popular day to travel?
Use the |
Question 6
Among the three big New York City airports (JFK
, LGA
, EWR
), which of these airports had the worst DepDelay
(on average) in 2005? (Can you solve this with 1 line of R, using a tapply
, rather than 3 lines of R? Hint: After you run the tapply
, you can index your results using [c("JFK", "LGA", "EWR")]
to lookup all 3 airports at once.)