TDM 10100: Project 7 — 2023

Motivation: A couple of bread-and-butter functions that are a part of the base R are: subset, and merge. subset provides a more natural way to filter and select data from a data.frame. merge brings the principals of combining data that SQL uses, to R.

Context: We’ve been getting comfortable working with data in within the R environment. Now we are going to expand our tool set with these useful functions, all the while gaining experience and practice wrangling data!

Scope: r, subset, merge, tapply

Learning Objectives
  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Demonstrate how to use tapply to solve data-driven problems.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/icecream/combined/products.csv -` /anvil/projects/tdm/data/icecream/combined/reviews.csv`

  • /anvil/projects/tdm/data/movies_and_tv/titles.csv

  • /anvil/projects/tdm/data/movies_and_tv/episodes.csv

  • /anvil/projects/tdm/data/movies_and_tv/ratings.csv

Questions

Please select 3 cores when launching JupyterLab for this project.

Data can come in a lot of different formats and from a lot of different locations. It is common to have several files that need to be combined together, before analysis is performed. The merge function is helpful for this purpose. The way that we merge files is different in each language and data science too. With R, there is a built-in merge function that makes things easy! (Of course students in TDM 10100 have not yet learned about SQL databases, but many of you will learn SQL databases someday too. The merge function is very similar to the ability to merge tables in SQL databases.)

Read the data in using the following code. We used read.csv for this purpose in the past. The fread function is a much faster and more efficient way to read in data.

library(data.table)

products <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv")
reviews <- fread("/anvil/projects/tdm/data/icecream/combined/reviews.csv")
titles <- fread("/anvil/projects/tdm/data/movies_and_tv/titles.csv")
episodes <- fread("/anvil/projects/tdm/data/movies_and_tv/episodes.csv")
ratings <- fread("/anvil/projects/tdm/data/movies_and_tv/ratings.csv")

Please remember to run the library(data.table) line, before you use the fread function. Otherwise, you will get an error in a pink box in JupyterLab like this:

Error in fread: could not find function "fread"

Question 1 (1 pt)

We will use the products data.frame for this question.

  1. What are all the different ingredients in the first record of products?

  2. Consider the rating column and the ingredients column. Consider only the products in which "GUAR GUM" is one of the ingredients. (You will need to use either grep or grepl or something similar, to find these products. Hint: You should have 85 such products.) List the ratings for these 85 products, in decreasing order.

Please find out the distribution of ratings for the ice cream which ingredients include "GUAR GUM", display the result in descending order

Question 2 (1 pt)

We will use the products and reviews data.frames for this question.

  1. Use the brand and key columns from both the products data.frame and reviews data.frame to merge the two data frames. This will give a new data.frame that contains the product details and their associated reviews.

If you do not specify the brand and key columns for the merge, then you will get an error, because the ingredients function contains characters in the products data frame but contains numeric values in the reviews data frame.

Question 3 (3 pts)

We will use the episodes, titles and ratings data.frames for questions 3 through Question 5

  1. Use merge (a few times) to create a new data.frame that contains at least the following four columns for only the episodes of the show called "Stranger Things". The show itself called "Stranger Things" has a title_id of tt4574334. You can find this on IMDB here: www.imdb.com/title/tt4574334/ Each episode of Stranger Things has its own title_id that contains the information for the specific episode as well. For your output: Show the top 5 rows of your final data.frame, containing the top 5 rated episodes of Stranger Things.

    • The primary_title of the show itself — call it show_title.

    • The primary_title of the episode — call it episode_title.

    • The rating of the show itself — call it show_rating.

    • The rating of the episode — call it episode_rating.

Start by getting a subset of the episodes table that contains only information for the show Stranger Things. To do this, you will need to make a subset of the data frame that only has information for Stranger Things show. That way, we aren’t working with as much data.

Make sure to show the top 5 rows of your final data.frame, containing the top 5 rated episodes of Stranger Things!

In the videos, I did not rename the columns. You might want to rename them, because it might help you, but you do not need to rename them. It’s up to you. I’m trying to be a little flexible and to provide guidance without being too strict either.

Question 4 (1 pt)

For question 4, use the data frame that you built in Question 3.

  1. Use regular old indexing to find all episodes of "Stranger Things" with an episode_rating less than 8.5 and season_number of exactly 3.

  2. Repeat the process, but this time use the subset function instead.

Make sure that the dimensions of the data frames that you get in question 4a and 4b are the same sizes!

Question 5 (2 pts)

For question 5, use the data frame that you built in Question 3.

The subset function allows you to index data.frame’s in a less verbose manner. Read this.

While it maybe appears to be a clean way to subset data, I’d suggest avoiding it over explicit long-form indexing. Read this fantastic article by Dr. Hadley Wickham on non-standard evaluation. Take for example, the following (a bit contrived) example using the dataframe we got in question (3).

Note: You do not need to write much for your answer. It is OK if you try the example below, and you see that it fails (and it will fail for sure!), and then you say something like, "I will try hard to not use variable names that overlap with other variable names". Or something like that! We simply want to ensure that students are choosing to use good variable names.

season_number <- 3
subset(StrangerThingsBigMergedDF, (season_number == season_number) & (rating.y < 8.5))
  1. Read that provided article and do your best to explain why subset gets a different result than our example that uses regular indexing.

Project 07 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project07.ipynb.

  • R code and comments for the assignment

    • firstname-lastname-project07.R.

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.