Control Flow

If/else statements

if, else if, and else are ways to control whether an operation will be performed. These statements allow specific actions to be triggered based on the result of a condition.

Examples

How do I print "Success!" if my expression is TRUE, and "Failure!" otherwise?

Click to see solution
# Randomly assign either TRUE or FALSE to t_or_f.
t_or_f <- sample(c(TRUE,FALSE),1)

if (t_or_f == TRUE) {
  # If t_or_f is TRUE, print success
  print("Success!")
} else {
  # Otherwise, print failure
  print("Failure!")
}
[1] "Failure!"
# t_or_f is already TRUE or FALSE, so comparing
# it to TRUE with == is redundant:
# TRUE == TRUE evaluates to TRUE and
# FALSE == TRUE evaluates to FALSE.
# The comparison simply returns the value
# t_or_f already had, so we can pass t_or_f
# to if directly.
# We also do not need another else statement
# for a "neither" case, because here t_or_f
# is always one or the other.

if (t_or_f) {
  # If t_or_f is TRUE, print success
  print("Success!")
} else {
  # Otherwise, print failure
  print("Failure!")
}
[1] "Failure!"
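
The examples above assume the value is always TRUE or FALSE. A logical value in R can also be NA, and `if (NA)` stops with an error. The following is a minimal sketch, assuming we want NA treated the same as FALSE; the variable names `maybe_missing` and `outcome` are hypothetical and only used for illustration.

```r
# Hypothetical value: NA stands in for a missing or unknown result.
maybe_missing <- NA

# isTRUE() returns TRUE only for a single, non-missing TRUE,
# so NA falls through to the else branch instead of raising an error.
if (isTRUE(maybe_missing)) {
  outcome <- "Success!"
} else {
  outcome <- "Failure!"
}
print(outcome)
```

With `maybe_missing` set to NA, this prints "Failure!". Whether NA should count as a failure depends on the situation, so treat this as one possible choice rather than a rule.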

How do I print "Success!" if my expression is TRUE, "Failure!" if my expression is FALSE, and "Huh?" if it is neither?

Click to see solution
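# Randomly assign TRUE, FALSE, or "Something else" to t_or_f.
# Because c() mixes logicals with a string, every element
# is converted to a character string.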
t_or_f <- sample(c(TRUE,FALSE, "Something else"),1)

if (t_or_f == TRUE) {
  # If t_or_f is TRUE, print success
  print("Success!")
} else if (t_or_f == FALSE) {
  # If t_or_f is FALSE, print failure
  print("Failure!")
} else {
  # Otherwise print huh
  print("Huh?")
}
[1] "Failure!"
# t_or_f is "TRUE", "FALSE", or "Something else"
# (character strings, because c() converted the logicals).
# Comparing with == TRUE still works: R converts TRUE
# to "TRUE" before comparing, so
# "TRUE" == TRUE evaluates to TRUE and
# "FALSE" == TRUE evaluates to FALSE.
# "Something else" matches neither TRUE nor FALSE,
# so both comparisons evaluate to FALSE.
# This is why we include a final else statement:
# it handles the case when t_or_f is neither
# TRUE nor FALSE.

if (t_or_f == TRUE) {
  # If t_or_f is TRUE, print success
  print("Success!")
} else if (t_or_f == FALSE) {
  # If t_or_f is FALSE, print failure
  print("Failure!")
} else {
  # Otherwise print huh
  print("Huh?")
}
[1] "Failure!"
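
To see why the comparisons in this example behave the way they do, it can help to inspect the vector being sampled from. The sketch below is illustrative only; the name `choices` is hypothetical and does not appear in the original example.

```r
# Combining logicals with a string coerces everything to character.
choices <- c(TRUE, FALSE, "Something else")
class(choices)

# Comparing with a logical converts the logical to a string first,
# so each comparison is really between two character values.
is_true  <- choices == TRUE
is_false <- choices == FALSE
is_true
is_false
```

Here `class(choices)` is "character", `is_true` is TRUE FALSE FALSE, and `is_false` is FALSE TRUE FALSE. The third element matches neither comparison, which is the case the final else branch catches.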

For loops

for loops allow us to execute the same code repeatedly, once for each element of a vector or list, until every element has been processed. They are useful for performing the same operation on an entire vector of input. For example, if we wanted to format the dates in a list, we could use a for loop to run through each element of the list and format it.
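
As a rough sketch of the date-formatting idea, the loop below reformats a few dates one at a time. The input values and the names `raw_dates` and `formatted` are made up for illustration, and the dates are assumed to be stored as year-month-day strings.

```r
# Hypothetical input: dates stored as year-month-day character strings.
raw_dates <- c("2016-01-03", "2016-07-05", "2016-12-25")

# Pre-allocate a character vector to hold the formatted results.
formatted <- character(length(raw_dates))

for (i in seq_along(raw_dates)) {
  # Convert the string to a Date, then format it as month/day/year.
  formatted[i] <- format(as.Date(raw_dates[i]), "%m/%d/%Y")
}

formatted
```

This gives "01/03/2016" "07/05/2016" "12/25/2016". Using seq_along gives the loop an index, which makes it easy to store each result in the matching position of the output vector.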

In R, there is another suite of functions known as the apply functions (for example, lapply and sapply). They accomplish essentially the same goal as loops and are often described as faster and more powerful. However, this is not always true: a poorly written call to an apply function can be much slower and less efficient than a well-written loop.
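
To make the comparison concrete, here is a small sketch that squares each value of a vector twice, once with a for loop and once with sapply. The data and variable names are hypothetical, and the example says nothing about speed; it only shows that both approaches produce the same result.

```r
# Hypothetical input values.
values <- c(2, 4, 6)

# Version 1: a for loop that fills a pre-allocated vector.
squares_loop <- numeric(length(values))
for (i in seq_along(values)) {
  squares_loop[i] <- values[i]^2
}

# Version 2: sapply applies the function to each element
# and collects the results into a vector.
squares_sapply <- sapply(values, function(x) x^2)

identical(squares_loop, squares_sapply)
```

Both versions give 4 16 36, so the final line evaluates to TRUE.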

Examples

How do I go through (loop) every value in a vector and print the value?

Click to see solution
for (i in 1:10) {
  # In the first iteration of the loop,
  # i will be 1. The next, i will be 2.
  # Etc.
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

How do I break out of a loop before it finishes?

Click to see solution
for (i in 1:10) {
  if (i==7) {
    # When i==7, we will exit the loop.
    break
  }
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6

How do I loop through a vector of names?

Click to see solution
friends <- c("Phoebe", "Ross", "Rachel", "Chandler", "Joey", "Monica")
my_string <- "So no one told you life was gonna be this way, "
for (friend in friends) {
  print(paste0(my_string, friend, "!"))
}
[1] "So no one told you life was gonna be this way, Phoebe!"
[1] "So no one told you life was gonna be this way, Ross!"
[1] "So no one told you life was gonna be this way, Rachel!"
[1] "So no one told you life was gonna be this way, Chandler!"
[1] "So no one told you life was gonna be this way, Joey!"
[1] "So no one told you life was gonna be this way, Monica!"

For more information on paste0, check out the paste and paste0 page.
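
As a side note, paste0 is itself vectorized, so in this particular case the same strings can be built without a loop. The sketch below reuses the vector from the example above; the name `greetings` is hypothetical.

```r
friends <- c("Phoebe", "Ross", "Rachel", "Chandler", "Joey", "Monica")
my_string <- "So no one told you life was gonna be this way, "

# paste0 recycles my_string and "!" across every element of friends,
# returning one string per friend.
greetings <- paste0(my_string, friends, "!")

length(greetings)
greetings[1]
```

The result has 6 elements, and the first one is "So no one told you life was gonna be this way, Phoebe!". The loop is still the better fit when each iteration needs to do more than build a string.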

How do I skip an iteration of a loop if some expression evaluates to TRUE?

Click to see solution
friends <- c("Phoebe", "Ross", "Mike", "Rachel", "Chandler", "Joey", "Monica")
my_string <- "So no one told you life was gonna be this way, "
for (friend in friends) {
  if (friend == "Mike") {
    # next skips the rest of the code for this iteration
    # and continues with the next element
    next
  }
  print(paste0(my_string, friend, "!"))
}
[1] "So no one told you life was gonna be this way, Phoebe!"
[1] "So no one told you life was gonna be this way, Ross!"
[1] "So no one told you life was gonna be this way, Rachel!"
[1] "So no one told you life was gonna be this way, Chandler!"
[1] "So no one told you life was gonna be this way, Joey!"
[1] "So no one told you life was gonna be this way, Monica!"

Are there examples in which for loops are not appropriate to use?

Click to see solution

This is usually how we write loops in other languages, e.g., C, C++, Java, Python, etc., if we want to add the first 10 billion integers.

mytotal <- 0
for (i in 1:10000000000) {
  mytotal <- mytotal + i
}
mytotal
[1] 5e+19

However, this takes a long time to evaluate. It is easier to write, and much faster to evaluate, if we use the sum function, which is vectorized, i.e., it works on an entire vector of data all at once.

Here, for instance, we add the first 10 billion integers, and the computation occurs almost immediately. The sum function simply takes every integer in the vector 1:10000000000 and adds them all together.

sum(1:10000000000)
[1] 5e+19
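
One way to check the difference on your own machine is to time both approaches with system.time. The sketch below uses a much smaller input (one million values, an arbitrary choice) so that the loop finishes quickly; the variable names are hypothetical, and the timings will vary from machine to machine.

```r
# One million values, stored as doubles to avoid integer overflow in the total.
x <- as.numeric(1:1000000)

# Time the explicit loop.
loop_time <- system.time({
  total_loop <- 0
  for (value in x) {
    total_loop <- total_loop + value
  }
})

# Time the vectorized version.
vec_time <- system.time(total_vec <- sum(x))

# Both approaches should give the same total.
total_loop == total_vec

# Elapsed time in seconds for each approach.
loop_time["elapsed"]
vec_time["elapsed"]
```

The comparison evaluates to TRUE, since both totals are 500000500000. The elapsed times are not shown here because they depend on the hardware and the R version.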

Can you show an example of how to do the same thing, with a for loop and without a for loop?

Click to see solution

Yes, here is an example of how to compute the average cost of a line of the grocery store data.

myDF <- read.csv("/class/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")
head(myDF)
  BASKET_NUM HSHD_NUM PURCHASE_ PRODUCT_NUM SPEND UNITS STORE_R WEEK_NUM YEAR
1         24     1809 03-JAN-16     5817389 -1.50    -1   SOUTH        1 2016
2         24     1809 03-JAN-16     5829886 -1.50    -1   SOUTH        1 2016
3         34     1253 03-JAN-16      539501  2.19     1    EAST        1 2016
4         60     1595 03-JAN-16     5260099  0.99     1    WEST        1 2016
5         60     1595 03-JAN-16     4535660  2.50     2    WEST        1 2016
6        168     3393 03-JAN-16     5602916  4.50     1   SOUTH        1 2016

This is how we find the average cost per line in other languages, for instance, C/C++, Python, Java, etc. The for loop used here visits each element of myDF$SPEND in turn, so it runs exactly once per row.

amountspent <- 0       # we initialize a variable to keep track of the entire price of the purchases
numberofitems <- 0     # and we initialize a variable to keep track of the number of purchases
for (myprice in myDF$SPEND) {
  amountspent <- amountspent + myprice     # we add the price of the current purchase
  numberofitems <- numberofitems + 1       # and we increment (by 1) the number of purchases processed so far
}
amountspent     # this is the total amount spent on all purchases
[1] 3584366
numberofitems   # this is the total number of purchases
[1] 1e+06
amountspent/numberofitems       # so this is the average
[1] 3.584366
amountspent/length(myDF$SPEND)  # this is an equivalent way to compute the average
[1] 3.584366

Now, that technically works, but it’s not efficient! Let’s try using the mean function instead to get an average:

mean(myDF$SPEND)
[1] 3.584366

As we can see, the vectorized mean function accomplishes the same purpose much more efficiently. The vector is the column myDF$SPEND (where myDF is a data frame and the $ allows us to specify the SPEND column in this data frame). We can just focus our attention on that column from the data frame, and take a mean.
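
The data file above lives at a specific path, so here is a small self-contained sketch of the same comparison that can be run anywhere. It uses a toy data frame, hypothetically named `toyDF`, built from the six SPEND values shown in the head of the data; it is not the full data set.

```r
# Toy data: the six SPEND values from the first rows shown above.
toyDF <- data.frame(SPEND = c(-1.50, -1.50, 2.19, 0.99, 2.50, 4.50))

# Loop version: accumulate the total and count the rows.
amountspent <- 0
numberofitems <- 0
for (myprice in toyDF$SPEND) {
  amountspent <- amountspent + myprice
  numberofitems <- numberofitems + 1
}
loop_average <- amountspent / numberofitems

# Vectorized version.
vectorized_average <- mean(toyDF$SPEND)

# all.equal allows for tiny floating-point differences.
isTRUE(all.equal(loop_average, vectorized_average))
```

The final line evaluates to TRUE. The comparison uses all.equal rather than == because the two computations may differ in the last few bits of floating-point precision.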

Can you show an example of how to make a new column in a data frame, which classifies things, based on another column?

Click to see solution

Yes, we can make a new column in the grocery store data set.

myDF <- read.csv("/class/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")
head(myDF)
  BASKET_NUM HSHD_NUM PURCHASE_ PRODUCT_NUM SPEND UNITS STORE_R WEEK_NUM YEAR
1         24     1809 03-JAN-16     5817389 -1.50    -1   SOUTH        1 2016
2         24     1809 03-JAN-16     5829886 -1.50    -1   SOUTH        1 2016
3         34     1253 03-JAN-16      539501  2.19     1    EAST        1 2016
4         60     1595 03-JAN-16     5260099  0.99     1    WEST        1 2016
5         60     1595 03-JAN-16     4535660  2.50     2    WEST        1 2016
6        168     3393 03-JAN-16     5602916  4.50     1   SOUTH        1 2016

Let's first make a new vector (the same length as a column of the data frame) in which all of the entries are "safe".

mystatus <- rep("safe", times=nrow(myDF))

The rep function repeats the value "safe" as many times as the times argument specifies. Here that is nrow(myDF), so the new vector has the same length as a column of the data frame. Then we can change the elements of mystatus that correspond to purchases on 05-JUL-16 or on 06-JUL-16 to "contaminated".

mystatus[(myDF$PURCHASE_ == "05-JUL-16")|(myDF$PURCHASE_ == "06-JUL-16")] <- "contaminated"

Finally, we change this into a factor (a categorical data type limited to pre-set values), and add it as a new column in the data frame.

myDF$safetystatus <- factor(mystatus)

Now the head of the data frame looks like this:

head(myDF)
  BASKET_NUM HSHD_NUM PURCHASE_ PRODUCT_NUM SPEND UNITS STORE_R WEEK_NUM YEAR
1         24     1809 03-JAN-16     5817389 -1.50    -1   SOUTH        1 2016
2         24     1809 03-JAN-16     5829886 -1.50    -1   SOUTH        1 2016
3         34     1253 03-JAN-16      539501  2.19     1    EAST        1 2016
4         60     1595 03-JAN-16     5260099  0.99     1    WEST        1 2016
5         60     1595 03-JAN-16     4535660  2.50     2    WEST        1 2016
6        168     3393 03-JAN-16     5602916  4.50     1   SOUTH        1 2016
  safetystatus
1         safe
2         safe
3         safe
4         safe
5         safe
6         safe

The number of contaminated rows versus safe rows is this:

table(myDF$safetystatus)
contaminated         safe
        2459       997541
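
The same classification can also be written in one step with the vectorized ifelse function and the %in% operator. This is an alternative sketch, not the approach used above, and it runs on a small made-up data frame (hypothetically named `toyDF`) rather than on the grocery store file.

```r
# Toy data with a few purchase dates in the same format as PURCHASE_.
toyDF <- data.frame(
  PURCHASE_ = c("03-JAN-16", "05-JUL-16", "06-JUL-16", "07-JUL-16"),
  SPEND = c(2.19, 0.99, 2.50, 4.50)
)

# The dates we want to flag.
bad_dates <- c("05-JUL-16", "06-JUL-16")

# ifelse checks every row at once: rows whose date is in bad_dates
# become "contaminated", and all other rows become "safe".
toyDF$safetystatus <- factor(
  ifelse(toyDF$PURCHASE_ %in% bad_dates, "contaminated", "safe")
)

table(toyDF$safetystatus)
```

On this toy data, the table shows 2 contaminated rows and 2 safe rows. Using %in% keeps the condition short when the list of flagged dates grows.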