STAT 19000: Project 8 — Fall 2020

Motivation: A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!

Context: We’ve been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.

Scope: r, functions

Learning objectives
  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Demonstrate how to use tapply to solve data-driven problems.

  • Comprehend what a function is, and the components of a function in R.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/goodreads/csv

Questions

Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to lose points.

Question 1

Read in the same data, in the same way as the previous project (with the same names). We’ve provided you with the function below. How many arguments does the function have? Name all of the arguments. What is the name of the function? Replace the description column in our books data.frame with the same information, but with stripped punctuation using the function provided.

# A function that, given a string (myColumn), returns the string
# without any punctuation.
strip_punctuation <- function(myColumn) {
    # Use regular expressions to identify punctuation.
    # Replace identified punctuation with an empty string ''.
    desc_no_punc <- gsub('[[:punct:]]+', '', myColumn)

    # Return the result
    return(desc_no_punc)
}

Since gsub accepts a vector of values, you can pass an entire vector to strip_punctuation.

Items to submit
  • R code used to solve the problem.

  • How many arguments does the function have?

  • What are the name(s) of all of the arguments?

  • What is the name of the function?

Question 2

Use the strsplit function to split a string by spaces. Some examples would be:

strsplit("This will split by space.", " ")
strsplit("This. Will. Split. By. A. Period.", "\\.")

An example string is:

test_string <- "This is  a test string  with no punctuation"

Test out strsplit using the provided test_string. Make sure to copy and paste the code that declares test_string. If you counted the words shown in your results, would it be an accurate count? Why or why not?

Relevant topics: [strsplit](#r-strsplit), [functions](#r-writing-functions)

Items to submit
  • R code used to solve the problem.

  • 1-2 sentences explaining why or why not your count would be accurate.

Question 3

Fix the issue in (2), using which. You may need to unlist the strsplit result first. After you’ve accomplished this, you can count the remaining words!

Items to submit
  • R code used to solve the problem (including counting the words).

Question 4

We are finally to the point where we have code from questions (2) and (3) that we think we may want to use many times. Write a function called count_words which, given a string, description, returns the number of words in description. Test out count_words on the description from the second row of books. How many words are in the description?

Items to submit
  • R code used to solve the problem.

  • The result of using the function on the description from the second row of books.

Question 5

Practice makes perfect! Write a function of your own design that is intended on being used with one of our datasets. Test it out and share the results.

You could even pass (as an argument) one of our datasets to your function and calculate a cool statistic or something like that! Maybe your function makes a plot? Who knows?

Items to submit
  • R code used to solve the problem.

  • An example (with output) of using your newly created function.