TDM 10100: Project 7 — 2022

Motivation: A couple of bread-and-butter functions that are a part of the base R are: subset, and merge. subset provides a more natural way to filter and select data from a data.frame. merge brings the principals of combining data that SQL uses, to R.

Context: We’ve been getting comfortable working with data in within the R environment. Now we are going to expand our toolset with these useful functions, all the while gaining experience and practice wrangling data!

Scope: r, subset, merge, tapply

Learning Objectives
  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Demonstrate how to use tapply to solve data-driven problems.

Make sure to read about, and use the template found here, and the important information about projects submissions here.


The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/movies_and_tv/titles.csv

  • /anvil/projects/tdm/data/movies_and_tv/episodes.csv

  • /anvil/projects/tdm/data/movies_and_tv/people.csv

  • /anvil/projects/tdm/data/movies_and_tv/ratings.csv


Question 1

Please select 6000 memory when launching Jupyter for this project.

Data can come in a lot of different formats and from a lot of different locations. It is not uncommon to have one or more files that need to be combined together before analysis is performed. merge is a popular function in most data wrangling libraries. It is extremely similar and essentially equivalent to a JOIN in SQL.

Read in each of the datasets into data.frames called: titles, episodes, people, and ratings.

Read the data in using the following code. fread is a very fast and efficient way to read in data.


titles <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/titles.csv"))
episodes <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/episodes.csv"))
people <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/people.csv"))
ratings <- data.frame(fread("/anvil/projects/tdm/data/movies_and_tv/ratings.csv"))
  • What are all the different listed genres (in the titles table)?

  • Look at the years column and the genres column. In which year did the most comedies debut?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Use the episode_title_id column and the title_id column from the episodes and titles data.frame’s (respectively) to merge the two data.frames.

Ultimately, we want to end up with a new data.frame that contains the primary_title for every episodes in the episodes table. Use the merge function to accomplish this.

The merge function in R allows two data frames to be combined by common columns. This function allows the user to combine data similar to the way SQL would using `JOIN`s. Visual representation of SQL Joins

This is also a really great explanation of merge in R.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Use merge (a few times) to create a new data.frame that contains at least the following information for only the show called "Friends". "Friends" (the show itself) has a title_id of tt0108778. Each episode of Friends, has its own title_id which contains the information for the specific episode as well.

  • The primary_title of the episode — call it episode_title.

  • The primary_title of the show itself — call it show_title.

  • The rating of the show itself — call it show_rating.

  • The rating of the episode — call it episode_rating.

Start by getting a subset of the episodes table that contains only information for the show Friends. That way, we aren’t working with as much data.

Show the top 5 rows of your final data.frame that contain the top 5 rated episodes.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Use regular old indexing to find all episodes of friends with an episode_rating greater than 9 and season_number of exactly 5.

Repeat the process, but this time use the subset function instead.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

subset is a sometimes useful function that allows you to index data.frame’s in a less verbose manner. Read this.

While it maybe appears to be a clean way to subset data, I’d suggest avoiding it over explicit long-form indexing. Read this fantastic article by Dr. Hadley Wickham on non-standard evaluation. Take for example, the following (a bit contrived) example using the dataframe we got in question (3).

season_number = 6
results[results$episode_rating > 9 & results$season_number == season_number,]
subset(results, episode_rating > 9 & season_number == season_number)

Read that provided article and do your best to explain why subset gets a different result than our example that uses regular indexing.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.