STAT 19000: Project 4 — Fall 2020

Motivation: Control flow is (roughly) the order in which instructions are executed. We can execute certain tasks or code if certain requirements are met using if/else statements. In addition, we can perform operations many times in a loop using for loops. While these are important concepts to grasp, R differs from other programming languages in that operations are usually vectorized and there is little to no need to write loops.

Context: We are gaining familiarity working in RStudio and writing R code. In this project we introduce and practice using control flow in R.

Scope: R, data.frames, recycling, factors, if/else, for

Learning objectives
• Explain what "recycling" is in R and predict behavior of provided statements.

• Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

• Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• List the differences between lists, vectors, factors, and data.frames, and when to use each.

• Demonstrate a working knowledge of control flow in R: if/else statements, while loops, etc.
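
As a quick illustration of the recycling and missing-data objectives, here is a minimal sketch. All vectors are made up for illustration and are not part of the Disney dataset.

```r
# Recycling: the shorter vector is repeated to match the longer one.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(10, 20)
recycled <- x + y   # y is reused as 10, 20, 10, 20, 10, 20

# Missing data: NA propagates through arithmetic unless it is removed.
waits <- c(5, NA, 15)
mean_with_na    <- mean(waits)                # NA
mean_without_na <- mean(waits, na.rm = TRUE)  # 10

# NaN means "not a number"; is.na() is TRUE for both NA and NaN.
not_a_number <- 0 / 0

# NULL is the absence of an object and has length 0.
nothing <- NULL
```

Here `recycled` works out to 11, 22, 13, 24, 15, 26 because `y` is reused three times.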

Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/disney`

Questions

Question 1

Use `read.csv` to read the `/class/datamine/data/disney/splash_mountain.csv` data into a `data.frame` called `splash_mountain`. In the previous project we calculated the mean and standard deviation of `SPOSTMIN` (posted minimum wait time). Those calculations used vectorized operations (we will learn more about this in the next project). This time, calculate the same mean (average) with a loop instead of the `mean` function. Do not use `sum` either.

 Remember, if a value is NA, we don’t want to include it.
 Remember, if a value is -999, it means the ride is closed, so we don’t want to include it.
 This exercise should make you appreciate the variety of useful functions R has to offer!
Items to submit
• R code used to solve the problem w/comments explaining what the code does.

• The mean posted wait time.
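
The accumulate-and-count pattern behind a loop-based mean can be sketched as follows. This is only an illustration under stated assumptions: `posted` is a made-up stand-in for the `SPOSTMIN` column, and the variable names are hypothetical.

```r
# Toy stand-in for splash_mountain$SPOSTMIN (values are made up).
posted <- c(10, NA, 25, -999, 40, NA, 5)

total <- 0   # running sum of valid wait times
count <- 0   # number of valid wait times seen so far

for (value in posted) {
  # Skip missing values and the -999 "ride closed" code.
  # is.na() is checked first so that && never compares NA to -999.
  if (!is.na(value) && value != -999) {
    total <- total + value
    count <- count + 1
  }
}

loop_mean <- total / count
loop_mean
```

On this toy vector the four valid values (10, 25, 40, 5) sum to 80, so `loop_mean` is 20.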

Question 2

Choose one of the `.csv` files containing data for a ride. Use `read.csv` to load the file into a data.frame named `ride_name`, where "ride_name" is the name of the ride you chose. Use a for loop to go through the rows of the data.frame and add a new column called `status`. `status` should contain a string whose value is either "open" or "closed". If `SPOSTMIN` or `SACTMIN` is -999, classify the row as "closed". Otherwise, classify the row as "open". After `status` is added to your data.frame, convert the column to a `factor`.

 If you want to access two columns at once from a data.frame, you can do: `splash_mountain[i, c("SPOSTMIN", "SACTMIN")]`.
 For loops are often [much slower (here is a video to demonstrate)](#r-for-loops-versus-vectorized-functions) than vectorized functions, as we will see in (3) below.
Items to submit
• R code used to solve the problem w/comments explaining what the code does.

• The output from running `str` on `ride_name`.
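
One way the row-by-row classification could look is sketched below on a tiny made-up data.frame. The name `ride` and its values are assumptions for illustration; only the column names `SPOSTMIN` and `SACTMIN` come from the project.

```r
# Toy data.frame with the same two columns as the ride files (made-up values).
ride <- data.frame(
  SPOSTMIN = c(10, -999, NA, 30),
  SACTMIN  = c(NA, NA, 12, -999)
)

# Preallocate the result instead of growing it inside the loop.
status <- character(nrow(ride))

for (i in seq_len(nrow(ride))) {
  posted <- ride$SPOSTMIN[i]
  actual <- ride$SACTMIN[i]
  # %in% returns FALSE for NA, so a missing value never counts as "closed".
  if (posted %in% -999 || actual %in% -999) {
    status[i] <- "closed"
  } else {
    status[i] <- "open"
  }
}

ride$status <- factor(status)
str(ride)
```

The sketch uses `%in%` rather than `==` because `if (NA == -999)` stops with a "missing value where TRUE/FALSE needed" error, and these columns can contain `NA`.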

The video for this question walks all the way through Question 2.

Question 3

Typically you want to avoid for loops, and even the apply functions (we will learn more about these later on, don’t worry), when they aren’t needed. Instead, you can use vectorized operations and indexing. Repeat Question 2 without using any for loops or apply functions; instead, use indexing and the `which` function. Which method was faster?

 To combine multiple conditions inside the call to `which`, use `|` for logical OR and `&` for logical AND.
 You can start by assigning every value in `status` as "open", and then change the correct values to "closed".
 Here is a [complete example (very much like question 3) with another video](#r-example-safe-versus-contaminated) that shows how we can classify objects.
 Here is a [complete example with a video](#r-example-for-loops-compared-to-vectorized-functions) that compares the concept of a for loop with the concept of a vectorized function.
Items to submit
• R code used to solve the problem w/comments explaining what the code does.

• The output from running `str` on `ride_name`.
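
The "start with open, then overwrite" hint can be sketched with indexing and `which` as follows. As before, `ride` is a made-up data.frame used only for illustration.

```r
# Toy data.frame with the same two columns as the ride files (made-up values).
ride <- data.frame(
  SPOSTMIN = c(10, -999, NA, 30),
  SACTMIN  = c(NA, NA, 12, -999)
)

# Start by marking every row as "open".
status <- rep("open", times = nrow(ride))

# which() keeps only the positions where the condition is TRUE,
# silently dropping positions where the comparison is NA.
closed_rows <- which(ride$SPOSTMIN == -999 | ride$SACTMIN == -999)
status[closed_rows] <- "closed"

ride$status <- factor(status)
str(ride)
```

To compare speed, each approach can be wrapped in `system.time({ ... })`. On a data.frame this small the difference will be negligible, so the comparison is only meaningful on the full ride file.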

Question 4

Create a pie chart for open vs. closed for `splash_mountain.csv`. First, use the `table` command to get a count of each `status`. Use the resulting table as input to the `pie` function. Make sure to give your pie chart a title that somehow indicates the ride to the audience.

Items to submit
• R code used to solve the problem w/comments explaining what the code does.

• The resulting plot displayed as output in the RMarkdown.
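
A minimal sketch of the `table` then `pie` steps is shown below. The `status` factor is made up for illustration, and the title text is just one possible choice.

```r
# Made-up status values standing in for the status column of a ride.
status <- factor(c("open", "closed", "open", "open", "closed", "open"))

# table() counts how many times each level occurs.
status_counts <- table(status)
status_counts

# main= sets the title shown above the chart.
pie(status_counts, main = "Splash Mountain: open vs. closed (toy data)")
```

With this toy factor the table holds 2 "closed" and 4 "open" observations, so the pie chart has two slices.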

Question 5

Loop through the vector of files we’ve provided below, and create a pie chart of open vs. closed for each ride. Place all 6 resulting pie charts on the same image. Make sure to give each pie chart a title that somehow indicates the ride.

```r
ride_names <- c("splash_mountain", "soarin", "pirates_of_caribbean", "expedition_everest", "flight_of_passage", "rock_n_rollercoaster")
ride_files <- paste0("/class/datamine/data/disney/", ride_names, ".csv")
```
 To place all of the resulting pie charts in the same image, prior to running the for loop, run `par(mfrow=c(2,3))`.
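
The layout step can be sketched without reading any files. In this illustration the open and closed counts are made up, and saving and restoring the old `par` settings is an optional habit rather than a requirement of the project.

```r
ride_names <- c("splash_mountain", "soarin", "pirates_of_caribbean",
                "expedition_everest", "flight_of_passage", "rock_n_rollercoaster")

# Made-up counts per ride (hypothetical, for illustration only).
closed_counts <- c(3, 1, 2, 4, 5, 2)
open_counts   <- c(7, 9, 8, 6, 5, 8)

# par() returns the previous settings, so they can be restored afterwards.
old_par <- par(mfrow = c(2, 3))   # 2 rows x 3 columns of plots

for (i in seq_along(ride_names)) {
  counts <- c(closed = closed_counts[i], open = open_counts[i])
  # The names of the vector become the slice labels; main= sets the title.
  pie(counts, main = ride_names[i])
}

par(old_par)                      # back to the previous layout
```

In the actual question, the counts for each ride would come from `table` applied to that ride's `status` column.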

This is not exactly the same, but it is a similar example, using the campaign election data:

```r
# Read one election file and draw a pie chart of donations by state group.
mypiechart <- function(x) {
  myDF <- read.csv(paste0("/class/datamine/data/election/itcont", x, ".txt"), sep="|")
  # Start by labeling every row "other", then overwrite the states of interest.
  mystate <- rep("other", times=nrow(myDF))
  mystate[myDF$STATE == "CA"] <- "California"
  mystate[myDF$STATE == "TX"] <- "Texas"
  mystate[myDF$STATE == "NY"] <- "New York"
  myDF$stateclassification <- factor(mystate)
  pie(table(myDF$stateclassification))
}

myyears <- c("1980","1984","1988","1992","1996","2000")
par(mfrow=c(2,3))   # 2 rows x 3 columns of plots on one image
for (i in myyears) {
  mypiechart(i)
}
```

Here is another video, which guides students even more closely through Question 5.

Items to submit
• R code used to solve the problem w/comments explaining what the code does.

• The resulting plot displayed as output in the RMarkdown.