STAT 19000: Project 8 — Fall 2021

Motivation: A key component of writing efficient code is writing functions. Functions let us reuse steps we have already written instead of repeating them. If you find yourself repeating the same code over and over, a function may be a good way to cut down on the number of lines!

Context: We’ve been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.

Scope: R, functions

Learning Objectives
  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Demonstrate how to use tapply to solve data-driven problems.

  • Comprehend what a function is, and the components of a function in R.

Make sure to read about and use the template found here, and to read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/goodreads/csv/interactions_subset.csv

Questions

Question 1

Read interactions_subset.csv into a data.frame called interactions. We have provided you with the function get_probability_of_review below.

After reading in the data, run the code below, and add comments explaining what the function is doing at each step.

# A function that, given a dataset (interactions_dataset), a user ID (userID), and a value (min_rating), returns a value (probability_of_reviewing).
get_probability_of_review <- function(interactions_dataset, userID, min_rating) {
        # FILL IN EXPLANATION HERE
        user_data <- subset(interactions_dataset, user_id == userID)

        # FILL IN EXPLANATION HERE
        read_user_data <- subset(user_data, is_read == 1)

        # FILL IN EXPLANATION HERE
        read_user_min_rating_data <- subset(read_user_data, rating >= min_rating)

        # FILL IN EXPLANATION HERE
        probability_of_reviewing <- mean(read_user_min_rating_data$is_reviewed)

        # Return the result
        return(probability_of_reviewing)
}

get_probability_of_review(interactions_dataset = interactions, userID = 5000, min_rating = 3)

Provide 1-2 sentences explaining overall what the function is doing and what arguments it requires.

You may want to use the fread function from the data.table package to read in the data.

library(data.table)
interactions <- fread("/path/to/dataset")
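
The following is a minimal sketch of the reading step, run on a tiny made-up CSV written to a temporary file rather than on the real dataset. By default fread returns a data.table; passing data.table = FALSE asks for a plain data.frame instead. The column names mirror those used inside get_probability_of_review, but the rows are invented purely for illustration, and the read.csv fallback is only there in case data.table is not installed.

```r
# Toy rows (invented for illustration); the real file is much larger.
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "user_id,book_id,is_read,rating,is_reviewed",
  "1,10,1,5,1",
  "1,11,1,3,0",
  "2,10,0,0,0"
), toy_path)

if (requireNamespace("data.table", quietly = TRUE)) {
  # data.table = FALSE returns a plain data.frame instead of a data.table
  toy_interactions <- data.table::fread(toy_path, data.table = FALSE)
} else {
  # Base R fallback; noticeably slower on large files
  toy_interactions <- read.csv(toy_path)
}

class(toy_interactions)
dim(toy_interactions)
```

Either branch should leave you with an object that subset and regular indexing both accept.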

Your kernel may crash! As it turns out, the subset function is not very memory efficient (never fully trust a function). When you launch your Jupyter Lab session, if you use 3072 MB of memory, your kernel is likely to crash on this example. If (instead) you use 5120 MB of memory when you launch your session, you should have sufficient memory to run these examples.
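
If you are curious how much memory an object takes up, base R can report it. The sketch below uses a hypothetical stand-in data.frame (toy_df), not the interactions data; the same calls should work on any object in your session.

```r
# A hypothetical stand-in object; replace with your own data.frame.
toy_df <- data.frame(x = seq_len(1e5), y = rnorm(1e5))

# Size of this one object, printed in megabytes
print(object.size(toy_df), units = "MB")

# Removing large intermediate objects you no longer need frees memory
toy_copy <- toy_df
rm(toy_copy)
```

Note that object.size only measures the object itself; temporary copies made while a function runs are not included.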

Relevant topics: function, subset

Items to submit
  • R code used to solve this problem.

  • Modified get_probability_of_review with comments explaining each step.

  • 1-2 sentences explaining overall what the function is doing.

  • Number and names of the arguments of the function get_probability_of_review.

Question 2

We want people that use our function to be able to get results even if they don’t provide a minimum rating value.

Modify the function get_probability_of_review so min_rating has the default value of 0. Test your function as follows.

get_probability_of_review(interactions_dataset = interactions, userID = 5000)

In R (and in many other languages), you can provide arguments out of order, as long as you give the argument name on the left of the equals sign and the value on the right. For example, the following still works.

get_probability_of_review(userID = 5000, interactions_dataset = interactions)

In addition, you don’t have to provide the argument names when you call the function. If you leave them out, however, you must supply the arguments in the order in which the function defines them.

get_probability_of_review(interactions, 5000)
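
The same rules can be seen on a much smaller function. The sketch below uses a hypothetical toy function, raise_to_power, which is not part of the project; it only illustrates default values, named arguments, and positional arguments.

```r
# Hypothetical toy function: 'exponent' has a default value of 2.
raise_to_power <- function(base_value, exponent = 2) {
  base_value^exponent
}

raise_to_power(base_value = 3)                # default exponent used: 9
raise_to_power(exponent = 3, base_value = 2)  # named, out of order: 8
raise_to_power(2, 3)                          # positional, in order: 8
raise_to_power(3, 2)                          # swapping positions changes the result: 9
```

Naming the arguments makes a call robust to reordering; relying on position is shorter but silently gives a different answer if the values are swapped.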

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Our function may not be the most efficient. However, we can reduce the code a little bit! Modify our function so we only use the subset function once, rather than 3 times.

Test your modified function on userID 5000. Do you get the same results as above?

Now, instead of using subset, just use regular old indexing in your function. Do your results agree with both versions above?
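
One thing to watch for when moving from subset to regular indexing is how missing values are handled. The sketch below uses a small invented data.frame (toy), not the interactions data, and is only meant to show the general pattern: subset drops rows where the condition is NA, while plain logical indexing keeps them as rows of NAs unless you wrap the condition in which.

```r
# Toy data (invented for illustration), including one missing rating.
toy <- data.frame(
  user_id = c(1, 1, 1, 2),
  is_read = c(1, 1, 0, 1),
  rating  = c(5, NA, 4, 2)
)

# One subset() call, with the conditions combined by &
via_subset <- subset(toy, user_id == 1 & is_read == 1 & rating >= 3)

# The same filter with logical indexing; which() drops positions where
# the condition is NA, matching what subset() does
keep <- toy$user_id == 1 & toy$is_read == 1 & toy$rating >= 3
via_index <- toy[which(keep), ]

# Without which(), an NA in the condition produces a row of NAs
with_na_row <- toy[keep, ]

all.equal(via_subset, via_index)
nrow(via_subset)
nrow(with_na_row)
```

If your indexing version disagrees with your subset version, an NA in one of the filtered columns is a likely place to look.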

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Run the code below. Explain what happens, and why it is happening.

head(read_user_min_rating_data)

Search for "Scoping in R", and read about how R decides where to look for a variable.
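
As a rough illustration of the idea, the sketch below uses a hypothetical toy function (add_one) that creates a variable inside its body. The names are made up for this example and are unrelated to the project code.

```r
# Hypothetical toy function that creates a variable inside its body.
add_one <- function(x) {
  toy_local_result <- x + 1  # exists only while add_one() is running
  return(toy_local_result)
}

returned_value <- add_one(1)

# The name used inside the function is not visible from outside it...
exists("toy_local_result")

# ...but the value it returned, once assigned to a name, is.
returned_value
```

Here exists reports whether a name can be found from where it is called; trying to print a name that cannot be found raises an error instead.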

Items to submit
  • The results of running the R code.

  • 1-2 sentences explaining what happened.

  • 1-2 sentences explaining why it is happening.

Question 5

Apply our function to the interactions dataset to get, for a sample of 10 users, the probability that each user reviews a book given that they liked it.

Save this probability to a vector called prob_review.

To do so, determine a minimum rating (min_rating) value when calculating that probability. Provide 1-2 sentences explaining why you chose this value.

You can use the function sample to get a random sample of 10 users.

You can pick any 10 users you want to compose your sample.
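
One possible pattern for "draw some users, then call a function once per user" is sample followed by sapply. The sketch below is only an illustration on invented data: toy, score, and mean_score_for_user are hypothetical stand-ins, not the project's dataset or function.

```r
set.seed(1)  # only so that the random sample is reproducible

# Toy data (invented for illustration): 20 users, 5 rows each.
toy <- data.frame(
  user_id = rep(1:20, each = 5),
  score   = runif(100)
)

# Hypothetical per-user summary; stands in for any function of one user ID.
mean_score_for_user <- function(dataset, userID) {
  mean(dataset$score[dataset$user_id == userID])
}

# Draw 10 distinct user IDs, then call the function once per ID.
sampled_users <- sample(unique(toy$user_id), size = 10)
toy_results <- sapply(sampled_users, function(id) {
  mean_score_for_user(dataset = toy, userID = id)
})

length(toy_results)
```

Keep in mind that mean of an empty vector is NaN, so a user with no rows left after filtering will show up as NaN in the result.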

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences explaining why you chose this particular minimum rating value.

Question 6

Change the minimum rating value, and re-calculate the probability for your selected 10 users.

Make 1 (or more) plot(s) to compare the results you got with the different minimum rating values. Write 1-2 sentences describing your findings.
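
A grouped bar plot is one of several reasonable ways to put two sets of per-user values side by side. The sketch below uses placeholder numbers that are made up for illustration (they are not real results), and the variable names are hypothetical.

```r
# Placeholder values (made up for illustration, not real results).
prob_low_threshold  <- c(0.10, 0.25, 0.40)
prob_high_threshold <- c(0.20, 0.30, 0.55)

# rbind() puts each threshold in its own row; beside = TRUE then draws
# one pair of bars per user instead of stacking them.
bar_heights <- rbind(prob_low_threshold, prob_high_threshold)
barplot(
  bar_heights,
  beside      = TRUE,
  names.arg   = c("user A", "user B", "user C"),
  ylab        = "Probability",
  legend.text = c("lower min_rating", "higher min_rating")
)
```

A scatter plot of one set of values against the other, made with plot, would be another option.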

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences comparing the results for questions 5 and 6.

Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.