STAT 19000: Project 9 — Fall 2020

Motivation: A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!

Context: We’ve been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.

Scope: r, functions

Learning objectives
  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Demonstrate how to use tapply to solve data-driven problems.

  • Comprehend what a function is, and the components of a function in R.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/goodreads/csv

Questions

Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to lose points.

Question 1

We’ve provided you with a function below. How many arguments does the function have, and what are their names? You can get a book_id from the URL of a goodreads book’s webpage.

Two examples:

Find 2 or 3 book_id and test out the function until you get two successes. Explain in words, what the function is doing, and what options you have.

library(imager)
books <- read.csv("/class/datamine/data/goodreads/csv/goodreads_books.csv")
authors <- read.csv("/class/datamine/data/goodreads/csv/goodreads_book_authors.csv")
get_author_name <- function(my_authors_dataset, my_author_id){
        return(my_authors_dataset[my_authors_dataset$author_id==my_author_id,'name'])
}
fun_plot <- function(my_authors_dataset, my_books_dataset, my_book_id, display_cover=T) {
    book_info <- my_books_dataset[my_books_dataset$book_id==my_book_id,]
    all_books_by_author <- my_books_dataset[my_books_dataset$author_id==book_info$author_id,]
    author_name <- get_author_name(my_authors_dataset, book_info$author_id)

    img <- load.image(book_info$image_url)

    if(display_cover){
        par(mfrow=c(1,2))
        plot(img, axes=FALSE)
    }

    plot(all_books_by_author$num_pages, all_books_by_author$average_rating,
         ylim=c(0,5.1), pch=21, bg='grey80',
         xlab='Number of pages', ylab='Average rating',
         main=paste('Books by', author_name))

    points(book_info$num_pages, book_info$average_rating,pch=21, bg='orange', cex=1.5)
}
Items to submit
  • How many arguments does the function have, and what are their names?

  • The result of using the function on 2-3 `book_id`s.

  • 1-2 sentences explaining what the function does (generally), and what (if any) options the function provides you with.

Question 2

You may have encountered a situation where the my_book_id was not in our dataset, and hence, didn’t get plotted. When writing functions, it is usually best to try and foresee issues like this and have the function fail gracefully, instead of showing some ugly (and sometimes unclear) warning. Add some code at the beginning of our function that checks to see if my_book_id is within our dataset, and if it does not exist, prints "Book ID not found.", and exits the function. Test it out on book_id=123 and book_id=19063.

Run ?stop to see if that is a function that may be useful.

Items to submit
  • R code with your new and improved function.

  • The results from fun_plot(123).

  • The results from fun_plot(19063).

Question 3

We have this nice get_author_name function that accepts a dataset (in this case, our authors dataset), and a book_id and returns the name of the author. Write a new function called get_author_id that accepts an authors name and returns the author_id of the author.

You can test your function using some of these examples:

get_author_id(authors, "Brandon Sanderson") # 38550
get_author_id(authors, "J.K. Rowling") # 1077326
Items to submit
  • R code containing your new function.

  • The results of using your new function on a few authors.

Question 4

See the function below.

search_books_for_word <- function(word) {
    return(books[grepl(word, books$description, fixed=T),]$title)
}

Given a word, search_books_for_word returns the titles of books where the provided word is inside the book’s description. search_books_for_word utilizes the books dataset internally. It requires that the books dataset has been loaded into the environment prior to running (and with the correct name). By including and referencing objects defined outside of our function’s scope within our function (in this case the variable books), our search_books_for_word function will be more prone to errors, as any changes to those objects may break our function. For example:

our_function <- function(x) {
  print(paste("Our argument is:", x))
  print(paste("Our variable is:", my_variable))
}
# our variable outside the scope of our_function
my_variable <- "dog"
# run our_function
our_function("first")
# change the variable outside the scope of our function
my_variable <- "cat"
# run our_function again
our_function("second")
# imagine a scenario where "my_variable" doesn't exist, our_function would break!
rm(my_variable)
our_function("third")

Fix our search_books_for_word function to accept the books dataset as an argument called my_books_dataset and utilize my_books_dataset within the function instead of the global variable books.

Items to submit
  • R code with your new and improved function.

  • An example using the updated function.

Question 5

Write your own custom function. Make sure your function includes at least 2 arguments. If you access one of our datasets from within your function (which you definitely should do), use what you learned in (4), to avoid future errors dealing with scoping. Your function could output a cool plot, interesting tidbits of information, or anything else you can think of. Get creative and make a function that is fun to use!

Items to submit
  • R code used to solve the problem.

  • Examples using your function with included output.