STAT 19000: Project 13 — Fall 2020

Motivation: It’s important to be able to look up and understand the documentation of a new function. You may have looked up the documentation of functions like paste0 or sapply and noticed that, in the "Usage" section, one of the arguments is an ellipsis (...). Until you understand what the ellipsis does, that documentation is hard to follow. In this project, we will experiment with the ellipsis and write our own function that uses one.

Context: We’ve learned about, used, and written functions in many projects this semester. In this project, we will use some of the lesser-known features of functions.

Scope: R, functions

Learning objectives
  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Utilize apply functions in order to solve a data-driven problem.

  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to create basic graphs with default settings.

  • Demonstrate the ability to modify axes labels and titles.

  • Incorporate legends using legend().

  • Demonstrate the ability to customize a plot (color, shape/linetype).

  • Convert strings to dates, and format dates using the lubridate package.


The following questions will use the dataset found in Scholar:



Question 1

Read /class/datamine/data/beer/beers.csv into a data.frame named beers. Read /class/datamine/data/beer/breweries.csv into a data.frame named breweries. Read /class/datamine/data/beer/reviews.csv into a data.frame named reviews.

Notice that reviews.csv is a large file. Luckily, you can use a function called fread from the popular data.table package. The function fread is much faster than read.csv at reading large files. It reads the data into a class called data.table. We will learn more about this later on. For now, use fread to read in the reviews.csv data, then convert the result from the data.table class into a data.frame by wrapping the call to fread in the data.frame function.

Do not forget to load the data.table library before attempting to use the fread function.
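As a minimal sketch of the read-then-convert pattern described above, the example below writes a tiny made-up CSV to a temporary file, reads it with fread, and wraps the result in data.frame. The file contents and the names toy_dt and toy_df are hypothetical and only serve the illustration; the example assumes the data.table package is installed.

```r
library(data.table)

# Write a tiny, made-up CSV to a temporary file so the sketch is self-contained.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(4.0, 3.5, 4.5)), tmp, row.names = FALSE)

toy_dt <- fread(tmp)          # fread returns a data.table
toy_df <- data.frame(toy_dt)  # wrapping in data.frame() gives a plain data.frame

class(toy_dt)
class(toy_df)

unlink(tmp)  # remove the temporary file
```

The same wrapping should work for a real file path; only the argument to fread changes.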

Below we show you an example of how fast the fread function is compared to read.csv.

microbenchmark(read.csv("/class/datamine/data/beer/reviews.csv", nrows=100000), data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows=100000)), times=5)
Unit: milliseconds
                                                                      expr       min        lq      mean    median        uq       max neval
          read.csv("/class/datamine/data/beer/reviews.csv", nrows = 1e+05) 5948.6289 6482.3395 6746.8976 7040.5881 7086.6728 7176.2589     5
 data.frame(fread("/class/datamine/data/beer/reviews.csv", nrows = 1e+05))  120.7705  122.3812  127.9842  128.7794  133.7695  134.2205     5

This video demonstrates how to read the reviews data using fread.

Items to submit
  • R code used to solve the problem.

Question 2

Take some time to explore the datasets. Like many datasets, our data is broken into 3 "tables". What columns connect each table? How many breweries in breweries don’t have an associated beer in beers? How many beers in beers don’t have an associated brewery in breweries?

We compare lists of names using sum or intersect. Similar techniques can be used for Question 2.
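The comparison techniques mentioned above can be sketched on two small made-up vectors of names. The vectors left and right are hypothetical; the functions %in%, sum, intersect, and setdiff are all part of base R.

```r
# Two made-up vectors of names (hypothetical data for illustration only).
left  <- c("Ann", "Bob", "Cara", "Dev")
right <- c("Bob", "Dev", "Eli")

# Which values appear in both vectors?
in_both <- intersect(left, right)

# left %in% right is a logical vector; summing it counts the TRUE values.
n_left_found   <- sum(left %in% right)
n_left_missing <- sum(!(left %in% right))

# setdiff lists the values of left that never appear in right.
only_in_left <- setdiff(left, right)

in_both
n_left_missing
only_in_left
```

Counting with sum and a negated %in% answers "how many have no match", while setdiff answers "which ones have no match".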

Items to submit
  • R code used to solve the problem.

  • A description of columns which connect each of the files.

  • How many breweries don’t have an associated beer in beers.

  • How many beers don’t have an associated brewery in breweries.

Question 3

Run ?sapply and look at the usage section for sapply. If you look at the description for the ... argument, you’ll see it is "optional arguments to FUN". What this means is that you can specify additional inputs for the function you are passing to sapply. One example would be passing T to na.rm in the mean function: sapply(dat, mean, na.rm=T). Use sapply and the strsplit function to separate the types of breweries (types) by commas. Use another sapply to loop through your results and count the number of types for each brewery. Be sure to name your final results n_types. What is the average number of services (n_types) that breweries in IN and MI offer (we are looking for the average of IN and MI combined)? Does that surprise you?
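To make the na.rm example concrete, here is a small sketch on a made-up data.frame. The name dat matches the call quoted above, but its contents are hypothetical.

```r
# A made-up data.frame with one missing value in each column.
dat <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))

# Without extra arguments, mean() returns NA for any column containing NA.
without_na_rm <- sapply(dat, mean)

# na.rm = TRUE is not an argument of sapply itself; it travels through
# the ... argument and is handed to mean() on every call.
with_na_rm <- sapply(dat, mean, na.rm = TRUE)

without_na_rm
with_na_rm
```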

When you have one sapply inside of another, or one loop inside of another, or an if/else statement inside of another, this is commonly referred to as nesting. So when Googling, you can type "nested sapply" or "nested if statements", etc.
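The two-step pattern (split, then count) can be sketched on a made-up vector of comma-separated tags. The vector tags and its values are hypothetical and are not taken from the project data; the second version shows the same computation written as one nested call.

```r
# Made-up comma-separated strings (hypothetical, for illustration only).
tags <- c("red, green", "blue", "red, blue, yellow")

# Step 1: split each string on ", ". The split argument is passed to
# strsplit through sapply's ... argument.
pieces <- sapply(tags, strsplit, split = ", ", USE.NAMES = FALSE)

# Step 2: count the pieces for each original string.
n_pieces <- sapply(pieces, length)

# The same computation as a single nested call.
n_pieces_nested <- sapply(sapply(tags, strsplit, split = ", ", USE.NAMES = FALSE), length)

n_pieces
mean(n_pieces)
```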

We show, in this video, how to find the average number of parts in a midwesterner’s name. Perhaps surprisingly, this same technique will be useful in solving Question 3.

Items to submit
  • R code used to solve the question.

  • 1-2 sentences stating the average number of services breweries in Indiana and Michigan offer, and commenting on this answer.

Question 4

Write a function called compare_beers that accepts the reviews data.frame, a function that you will call FUN, and any number of vectors of beer ids. The function, compare_beers, should cycle through each vector (group) of beer_ids, compute the function, FUN, on the corresponding subset of reviews, and print "Group X: some_score", where X is the group number (1, 2, 3, and so on) and some_score is the result of applying FUN to that subset of the reviews data.

In the example below, the function FUN is the median function, and we pass two vectors (groups) of beer_ids, with c(271781) being group 1 and c(125646, 82352) being group 2. Note that even though our example passes only two vectors to our compare_beers function, we want to write the function so that we can pass as many vectors as we want.


compare_beers(reviews, median, c(271781), c(125646, 82352))

This example gives the output:

Group 1: 4
Group 2: 4.56

For your solution to this question, find the behavior of compare_beers in this example:

compare_beers(reviews, median, c(88,92,7971), c(74986,1904), c(34,102,104,355))

There are different approaches to this question. You can use for loops or sapply. It will probably help to start small and build slowly toward the solution.
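One possible way to start small is to first write a function that only handles the "any number of vectors" part, without touching the reviews data at all. The sketch below is not compare_beers: the function name describe_groups is hypothetical, and it applies FUN directly to each vector passed through ... instead of subsetting a data.frame.

```r
# A hypothetical warm-up function: apply FUN to each vector passed via ...
describe_groups <- function(FUN, ...) {
  groups <- list(...)                  # capture every extra argument in a list
  results <- numeric(length(groups))
  for (i in seq_along(groups)) {
    results[i] <- FUN(groups[[i]])     # apply FUN to the i-th vector
    cat("Group ", i, ": ", results[i], "\n", sep = "")
  }
  invisible(results)                   # return the values without auto-printing
}

out <- describe_groups(median, c(1, 2, 3), c(10, 20), c(5, 5, 5, 9))
```

Because list(...) collects however many vectors are supplied, the same function works for two groups or twenty. Extending the idea to the project would mean using each vector to select rows of reviews before applying FUN.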

This first video shows how to use ... in defining a function.

This second video basically walks students through how to build this function. If you use this video to learn how to build this function, please be sure to acknowledge this in your project solutions.

Items to submit
  • R code used to solve the problem.

  • The result from running the provided example.

Question 5

Beer wars! IN and MI against AZ and CO. Use the function you wrote in question (4) to compare the beers (by beer_id) from each group of states. Make a cool plot of some sort. Be sure to comment on your plot.

Create a vector of beer_ids per group before passing it to your function from (4).
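A rough sketch of building one vector of ids per group of states is shown below. Everything in it is hypothetical: the toy data.frames, the column names brewery_key, beer_key, and state, and the helper ids_for_states are made up for illustration and may not match the columns in the real beers and breweries files.

```r
# Made-up tables with hypothetical column names (not the real project columns).
toy_breweries <- data.frame(
  brewery_key = c(1, 2, 3, 4),
  state = c("IN", "MI", "AZ", "CO")
)
toy_beers <- data.frame(
  beer_key = c(10, 11, 12, 13, 14),
  brewery_key = c(1, 1, 2, 3, 4)
)

# Hypothetical helper: return the beer keys brewed in any of the given states.
ids_for_states <- function(states) {
  keep <- toy_breweries$brewery_key[toy_breweries$state %in% states]
  toy_beers$beer_key[toy_beers$brewery_key %in% keep]
}

group_a <- ids_for_states(c("IN", "MI"))
group_b <- ids_for_states(c("AZ", "CO"))

group_a
group_b
```

Each resulting vector can then be passed as one group to a function that accepts any number of id vectors.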

This video demonstrates an example of how to use the compare_beers function.

Items to submit
  • R code used to solve the problem.

  • The result from running your function.

  • The resulting plot.

  • 1-2 sentences commenting on your plot.