# STAT 19000: Project 8 — Fall 2021

Motivation: A key component of writing efficient code is writing functions. Functions let us reuse steps we have already coded instead of repeating them. If you find yourself writing the same code over and over, a function may be a good way to cut down the number of lines!

Context: We’ve been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.

Scope: R, functions

Learning Objectives
• Gain proficiency using split, merge, and subset.

• Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• Demonstrate how to use tapply to solve data-driven problems.

• Comprehend what a function is, and the components of a function in R.

Make sure to read about and use the template found here, and read the important information about project submissions here.

## Dataset(s)

The following questions will use the following dataset(s):

• `/depot/datamine/data/goodreads/csv/interactions_subset.csv`

## Questions

### Question 1

Read the `interactions_subset.csv` into a data.frame called `interactions`. We have provided you with the function `get_probability_of_review` below.

```r
# A function that, given a string (userID) and a value (min_rating) returns a value (probability_of_reviewing).
get_probability_of_review <- function(interactions_dataset, userID, min_rating) {
    # FILL IN EXPLANATION HERE
    user_data <- subset(interactions_dataset, user_id == userID)

    # FILL IN EXPLANATION HERE

    # FILL IN EXPLANATION HERE

    # FILL IN EXPLANATION HERE

    # Return the result
    return(probability_of_reviewing)
}

get_probability_of_review(interactions_dataset = interactions, userID = 5000, min_rating = 3)
```

Provide 1-2 sentences explaining overall what the function is doing and what arguments it requires.
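When describing which arguments a function requires, it can help to inspect the function object directly. The following is a minimal sketch using a hypothetical toy function, `add_scaled`, which is not part of this project; `formals` and `args` are base R functions.

```r
# Hypothetical toy function, used only to illustrate how to inspect arguments.
add_scaled <- function(x, y, scale = 2) {
  (x + y) * scale
}

# names(formals(f)) lists the argument names in the order they are defined.
arg_names <- names(formals(add_scaled))
print(arg_names)
print(length(arg_names))

# args() shows the function signature, including any default values.
print(args(add_scaled))
```

The same calls should work on any function defined in your session, which makes it easy to confirm the number and names of its arguments.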

You may want to use the `fread` function from the `data.table` library to read in the data.

```r
library(data.table)
interactions <- fread("/depot/datamine/data/goodreads/csv/interactions_subset.csv")
```

Your kernel may crash! As it turns out, the `subset` function is not very memory efficient (never fully trust a function). If you request 3072 MB of memory when you launch your Jupyter Lab session, your kernel is likely to crash on this example. If you request 5120 MB instead, you should have enough memory to run these examples.

Relevant topics: function, subset

Items to submit
• R code used to solve this problem.

• Modified `get_probability_of_review` with comments explaining each step.

• 1-2 sentences explaining overall what the function is doing.

• Number and name of arguments for the function, `get_probability_of_review`.

### Question 2

We want people that use our function to be able to get results even if they don’t provide a minimum rating value.

Modify the function `get_probability_of_review` so `min_rating` has the default value of 0. Test your function as follows.

``get_probability_of_review(interactions_dataset = interactions, userID = 5000)``
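As a reminder of how default values work in general, here is a minimal sketch with a hypothetical toy function, `count_at_least`, and made-up ratings. A default is written as `name = value` in the function definition and is used only when the caller omits that argument.

```r
# Hypothetical toy function: `threshold` defaults to 0 when the caller omits it.
count_at_least <- function(values, threshold = 0) {
  sum(values >= threshold)
}

ratings <- c(0, 2, 3, 5, 5)  # made-up values for illustration

n_default <- count_at_least(ratings)                # threshold falls back to 0
n_custom <- count_at_least(ratings, threshold = 3)  # caller overrides the default

print(n_default)
print(n_custom)
```

With the default of 0, every value is counted; with a threshold of 3, only the values 3, 5, and 5 are counted.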

In R (and in many other languages that support named arguments), you can provide the arguments out of order, as long as you put the argument name on the left of the equals sign and the value on the right. For example, the following will still work.

``get_probability_of_review(userID = 5000, interactions_dataset = interactions)``
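The matching rules can be checked on a small example. This is a sketch with a hypothetical toy function, `describe_rating`; it compares named calls in different orders with unnamed calls, where the position of each value decides which argument it fills.

```r
# Hypothetical toy function for illustrating argument matching.
describe_rating <- function(user, rating, scale_max = 5) {
  paste0("user ", user, ": ", rating, "/", scale_max)
}

# Named arguments are matched by name, so their order does not matter.
by_name_in_order <- describe_rating(user = 42, rating = 4)
by_name_reordered <- describe_rating(rating = 4, user = 42)

# Unnamed arguments are matched by position, so their order does matter.
by_position <- describe_rating(42, 4)
swapped_by_position <- describe_rating(4, 42)

print(by_name_in_order)
print(by_name_reordered)
print(by_position)
print(swapped_by_position)
```

The first three calls give the same string, while the last one silently assigns 4 to `user` and 42 to `rating`, which is why naming arguments is often the safer habit.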

In addition, you don’t have to provide the argument names when you call the function. If you omit them, however, you must supply the arguments in the order in which they appear in the function definition.

``get_probability_of_review(interactions, 5000)``

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 3

Our function may not be the most efficient. However, we can reduce the code a little bit! Modify our function so we only use the `subset` function once, rather than 3 times.

Test your modified function on userID 5000. Do you get the same results as above?

Now, instead of using `subset`, just use regular old indexing in your function. Do your results agree with both versions above?
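The general idea can be tried on toy data first. The sketch below uses a made-up data.frame, `toy`, with hypothetical column names (`id`, `flag`, `score`) that are not taken from the project dataset. It compares three chained `subset` calls, one `subset` call with combined conditions, and plain logical indexing.

```r
# Hypothetical toy data; the column names are made up for illustration.
toy <- data.frame(
  id = c(1, 1, 1, 2, 2),
  flag = c(1, 0, 1, 1, 1),
  score = c(4, 5, 3, 3, 5)
)

# Three chained subset() calls, one condition each.
step1 <- subset(toy, id == 1)
step2 <- subset(step1, flag == 1)
chained <- subset(step2, score >= 3)

# One subset() call with the conditions combined using &.
combined <- subset(toy, id == 1 & flag == 1 & score >= 3)

# Logical indexing with the same combined condition.
keep <- toy$id == 1 & toy$flag == 1 & toy$score >= 3
indexed <- toy[keep, ]

print(isTRUE(all.equal(chained, combined)))
print(isTRUE(all.equal(combined, indexed)))
```

One difference to keep in mind: `subset` drops rows where the condition evaluates to `NA`, whereas logical indexing returns a row of `NA` values for each such row. The toy data above has no missing values, so the three approaches agree here.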

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 4

Run the code below. Explain what happens, and why it is happening.

``head(read_user_min_rating_data)``

Items to submit
• The results of running the R code.

• 1-2 sentences explaining what happened.

• 1-2 sentences explaining why it is happening.
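A related behavior can be explored on a small example. The following is a sketch with a hypothetical toy function, `sum_of_squares`, showing how R treats a variable that is created inside a function body.

```r
# Hypothetical toy function: `intermediate_total` is created inside the function body.
sum_of_squares <- function(values) {
  intermediate_total <- sum(values^2)
  return(intermediate_total)
}

result <- sum_of_squares(c(1, 2, 3))

# The returned value is available to the caller.
print(result)

# Check whether the variable created inside the function is visible out here.
print(exists("intermediate_total"))
```

Here `result` is 14, and the `exists` check reports whether a name defined inside the function can be seen from the calling environment.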

### Question 5

Apply our function to the `interactions` dataset to get, for a sample of 10 users, the probability of reviewing books given that they liked the book.

Save this probability to a vector called `prob_review`.

To do so, determine a minimum rating (`min_rating`) value when calculating that probability. Provide 1-2 sentences explaining why you chose this value.

You can use the function `sample` to get a random sample of 10 users.

You can pick any 10 users you want to compose your sample.

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences explaining why you chose this particular minimum rating value.
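One possible pattern for applying a function to a random sample of ids is sketched below. It uses made-up toy data and a hypothetical helper, `share_for_id`, as a stand-in for any function that takes a dataset and an id; the seed value is arbitrary.

```r
set.seed(19000)  # arbitrary seed so the sample is reproducible

# Hypothetical toy data; names and values are made up for illustration.
toy <- data.frame(
  id = rep(1:20, each = 5),
  value = rep(c(0, 1, 1, 0, 1), times = 20)
)

# Hypothetical per-id summary function standing in for any function of (data, id).
share_for_id <- function(dataset, id_value) {
  mean(dataset$value[dataset$id == id_value])
}

# Draw 10 distinct ids, then apply the function to each one.
chosen_ids <- sample(unique(toy$id), size = 10)
shares <- sapply(chosen_ids, function(one_id) share_for_id(toy, one_id))

print(chosen_ids)
print(shares)
```

`sapply` returns a plain numeric vector here, with one element per sampled id, which is a convenient shape for saving the results under a single name.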

### Question 6

Change the minimum rating value, and re-calculate the probability for your selected 10 users.

Make 1 (or more) plot(s) to compare the results you got with the different minimum rating values. Write 1-2 sentences describing your findings.
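A grouped bar plot is one of several reasonable ways to compare two sets of results for the same users. The sketch below uses base R's `barplot` with made-up probabilities for 5 hypothetical users; the vector names and numbers are for illustration only.

```r
# Hypothetical probabilities for 5 users under two thresholds (made-up numbers).
prob_low <- c(0.10, 0.25, 0.40, 0.05, 0.30)
prob_high <- c(0.20, 0.35, 0.45, 0.00, 0.50)

# rbind() stacks the two vectors into a 2 x 5 matrix: one row per threshold.
comparison <- rbind(low = prob_low, high = prob_high)
colnames(comparison) <- paste0("user", 1:5)

# beside = TRUE draws the two bars for each user next to each other.
barplot(
  comparison,
  beside = TRUE,
  ylim = c(0, 1),
  ylab = "Probability",
  legend.text = rownames(comparison),
  main = "Comparison across two thresholds"
)
```

A scatter plot of one vector against the other, with a reference line via `abline(0, 1)`, would be another option for the same comparison.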

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences comparing the results for Questions 5 and 6.

Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.