STAT 19000: Project 4 — Fall 2020

Motivation: Control flow is (roughly) the order in which instructions are executed. We can execute certain tasks or code if certain requirements are met using if/else statements. In addition, we can perform operations many times in a loop using for loops. While these are important concepts to grasp, R differs from other programming languages in that operations are usually vectorized and there is little to no need to write loops.

Context: We are gaining familiarity working in RStudio and writing R code. In this project we introduce and practice using control flow in R.

Scope: r, data.frames, recycling, factors, if/else, for

Learning objectives
  • Explain what "recycling" is in R and predict behavior of provided statements.

  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

  • Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/disney

Questions

Question 1

Use read.csv to read in the /class/datamine/data/disney/splash_mountain.csv data into a data.frame called splash_mountain. In the previous project we calculated the mean and standard deviation of the SPOSTMIN (posted minimum wait time). These are vectorized operations (we will learn more about this next project). Instead of using the mean function, use a loop to calculate the mean (average), just like the previous project. Do not use sum either.

Remember, if a value is NA, we don’t want to include it.

Remember, if a value is -999, it means the ride is closed, we don’t want to include it.

This exercise should make you appreciate the variety of useful functions R has to offer!

Items to submit
  • R code used to solve the problem w/comments explaining what the code does.

  • The mean posted wait time.

Question 2

Choose one of the .csv files containing data for a ride. Use read.csv to load the file into a data.frame named ride_name where "ride_name" is the name of the ride you chose. Use a for loop to loop through the ride file and add a new column called status. status should contain a string whose value is either "open", or "closed". If SPOSTMIN or SACTMIN is -999, classify the row as "closed". Otherwise, classify the row as "open". After status is added to your data.frame, convert the column to a factor.

If you want to access two columns at once from a data.frame, you can do: splash_mountain[i, c("SPOSTMIN", "SACTMIN")].

For loops are often [much slower (here is a video to demonstrate)](#r-for-loops-versus-vectorized-functions) than vectorized functions, as we will see in (3) below.

Items to submit
  • R code used to solve the problem w/comments explaining what the code does.

  • The output from running str on ride_name.

In this video, we basically go all the way through Question 2 using a video:

Question 3

Typically you want to avoid using for loops (or even apply functions (we will learn more about these later on, don’t worry)) when they aren’t needed. Instead you can use vectorized operations and indexing. Repeat (2) without using any for loops or apply functions (instead use indexing and the which function). Which method was faster?

To have multiple conditions within the which statement, use | for logical OR and & for logical AND.

You can start by assigning every value in status as "open", and then change the correct values to "closed".

Here is a [complete example (very much like question 3) with another video](#r-example-safe-versus-contaminated) that shows how we can classify objects.

Here is a [complete example with a video](#r-example-for-loops-compared-to-vectorized-functions) that makes a comparison between the concept of a for loop versus the concept for a vectorized function.

Items to submit
  • R code used to solve the problem w/comments explaining what the code does.

  • The output from running str on ride_name.

Question 4

Create a pie chart for open vs. closed for splash_mountain.csv. First, use the table command to get a count of each status. Use the resulting table as input to the pie function. Make sure to give your pie chart a title that somehow indicates the ride to the audience.

Items to submit
  • R code used to solve the problem w/comments explaining what the code does.

  • The resulting plot displayed as output in the RMarkdown.

Question 5

Loop through the vector of files we’ve provided below, and create a pie chart of open vs closed for each ride. Place all 6 resulting pie charts on the same image. Make sure to give each pie chart a title that somehow indicates the ride.

ride_names <- c("splash_mountain", "soarin", "pirates_of_caribbean", "expedition_everest", "flight_of_passage", "rock_n_rollercoaster")
ride_files <- paste0("/class/datamine/data/disney/", ride_names, ".csv")

To place all of the resulting pie charts in the same image, prior to running the for loop, run par(mfrow=c(2,3)).

This is not exactly the same, but it is a similar example, using the campaign election data:

mypiechart <- function(x) {
  myDF <- read.csv( paste0("/class/datamine/data/election/itcont", x, ".txt"), sep="|")
  mystate <- rep("other", times=nrow(myDF))
  mystate[myDF$STATE == "CA"] <- "California"
  mystate[myDF$STATE == "TX"] <- "Texas"
  mystate[myDF$STATE == "NY"] <- "New York"
  myDF$stateclassification <- factor(mystate)
  pie(table(myDF$stateclassification))
}
myyears <- c("1980","1984","1988","1992","1996","2000")
par(mfrow=c(2,3))
for (i in myyears) {
  mypiechart(i)
}

Here is another video, which guides students even more closely through Question 5.

Items to submit
  • R code used to solve the problem w/comments explaining what the code does.

  • The resulting plot displayed as output in the RMarkdown.