TDM 10100: Project 8 — 2023

Motivation: Functions are an important part of writing efficient code.
Functions allow us to repeat and reuse code. If you find you using a set of coding steps over and over, a function may be a good way to reduce your lines of code!

Context: We’ve been learning about and using functions these last few weeks.
To learn how to write your own functions we need to learn some of the terminology and components.

Scope: r, functions

Learning Objectives
  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Comprehend what a function is, and the components of a function in R.

Dataset(s)

We will use the same dataset(s) as last week:

  • /anvil/projects/tdm/data/icecream/combined/products.csv

  • /anvil/projects/tdm/data/icecream/combined/reviews.csv

Please choose 3 cores when launching the JupyterLab for this project.

fread- is a fast and efficient way to read in data.

library(data.table)

products <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv")
reviews <- fread("/anvil/projects/tdm/data/icecream/combined/reviews.csv")

Please remember to run the library(data.table) line, before you use the fread function. Otherwise, you will get an error in a pink box in JupyterLab like this:

Error in fread: could not find function "fread"

We will see how to write our own function, so that we can make a repetitive operation easier, by turning it into a single command.

We need to take care to name the function something concise but meaningful, so that other users can understand what the function does.

Function parameters can also be called formal arguments.

A function contains multiple interrelated statements. We can "call" the function, which means that we run all of the statements from the function.

Functions can be built-in or can be created by the user (user-defined).

Some examples of built in functions are:
  • min(), max(), mean(), median()

  • print()

  • head()

Syntax of a function

what_you_name_the_function <- function (parameters) {
  statement(s) that are executed when the function runs
  the last line of the function is the returned value
}

Questions

Question 1 (2 pts)

To gain better insights into our data, let’s make two simple plots. The following are two examples. You can create your own plots.

  1. In project 07, you found the different ingredients for the first record in the products data frame. We may get all of the ingredients from the products data frame, and find the top 10 most frequently used ingredients. Then we can create a bar chart for the distribution of the number of times that each ingredient appears.

  2. A line plot to visualize the distribution of the reviews of the products.

  3. What information are you gaining from these graphs?

The table function can be useful to get the distribution of the number of times that each ingredient appears.

This is a good website for bar plot examples: www.statmethods.net/graphs/bar.html

This is a good website for line plot examples: www.sthda.com/english/wiki/line-plots-r-base-graphs

Making a dotchart for Question 1 is helpful and insightful, as demonstrated in the video. BUT we also want you to see how to make a bar plot and a line plot. Do not worry about the names of the ingredients too much. If only a few names of ingredients appear on the x-axis for Question 1, that is OK wiht us. We just want to show the distribution (in other words, the numbers) of times that items appear. We are less concerned about the item names themselves.

Question 2 (1 pt)

For practice, now that you have a basic understanding of how to make a function, we will use that knowledge, applied to our dataset.

Here are pieces of a function we will use on this dataset; products, reviews and products' rating put them in the correct order

* merge_results <- merge(products_df, reviews_df, by="key")
* }
* function(products_df, reviews_df, myrating)
* return(products_reviews_results)
* {
* products_reviews_results <- merge_results[merge_results$rating >= myrating, ]
* products_reviews_by_rating <-

Question 3 (1 pt)

Take the above function and add comments explaining what the function does at each step.

Question 4 (2 pts)

my_selection <- products_reviews_by_rating(products, reviews, 4.5)

Use the code above, to answer the following question. We want you to use the data frame my_selection when solving Question 4. (Do not use the full products data frame for Question 4.)

  1. How many products are there (altogether) that have rating at least 4.5? (This is supposed to be simple: You can just find the number of rows of the data frame my_selection.)

The function merged two data sets products and reviews. Both of them have an ingredients column, so we need to use the ingredients column from products by referring to`ingredients.x`.

Question 5 (2 pts)

For Question 5, go back to the full products data frame. (In other words, do not limit yourself to my_selection any more.) When you are constructing your function in part a, it should be helpful to review the videos from Question 1.

  1. Now create a function that takes 1 ingredient as the input, and finds the number of products that contain that ingredient.

  2. Use your function to determine how many products contain SALT as an ingredient.

(Note: If you test the function with "GUAR GUM", for instance, you will see that there are 85 products with "GUAR GUM" as an ingredient, as we learned in the previous project.)

Project 08 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project08.ipynb

  • R code and comments for the assignment

    • firstname-lastname-project08.R.

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.