# Apply Functions

## Abstract

In R, we often need to apply the same operation to several (or all) of the elements in a data structure such as a data frame. In languages like Python or Java, for-loops and while-loops are the usual tools for iteration. R has loops too, but its apply functions offer a more concise, and often faster, way of applying a function to a data set.

There’s a great online encyclopedia on R that inspired and informed many examples for this page. Check it out!

### `apply`

The basic `apply` function takes an array or matrix, a `MARGIN` parameter for the desired dimension (`1` for rows, `2` for columns), and the function you want to apply.

Our test matrix will be 5x5, where the numbers 1-25 will populate each column sequentially:

```r
cubed5 <- matrix(c(1:5, 6:10, 11:15, 16:20, 21:25), nrow = 5, ncol = 5)
cubed5
```

```
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
```
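
As a minimal sketch of how `MARGIN` changes the result, the calls below apply `sum` to each row, `mean` to each column, and an anonymous function to each row. The variable names `row_sums`, `col_means`, and `row_spread` are made up for illustration.

```r
# Rebuild the 5x5 matrix from above (1:25 fills it column by column).
cubed5 <- matrix(1:25, nrow = 5, ncol = 5)

# MARGIN = 1: apply the function to each row.
row_sums <- apply(cubed5, 1, sum)
row_sums
# [1] 55 60 65 70 75

# MARGIN = 2: apply the function to each column.
col_means <- apply(cubed5, 2, mean)
col_means
# [1]  3  8 13 18 23

# An anonymous function works too; here, max minus min of each row.
row_spread <- apply(cubed5, 1, function(x) max(x) - min(x))
row_spread
# [1] 20 20 20 20 20
```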

Ultimately, that’s all there is to `apply`. One important note is that `apply` will not work on a plain vector, because a vector has no dimensions for `MARGIN` to refer to. We discuss the `apply` functions for vectors and lists in a moment.

```r
squares <- c(1, 4, 9, 16, 25)
apply(squares, 1, function(x) x ^ 0.5)
```

```
Error in apply(squares, 1, function(x) x^0.5) :
dim(X) must have a positive length
```
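
For comparison, here is a short sketch of two ways to take the square roots that do work on a plain vector. The names `roots_sapply` and `roots_vectorized` are made up for illustration.

```r
squares <- c(1, 4, 9, 16, 25)

# sapply walks over the elements of a vector directly; no MARGIN is needed.
roots_sapply <- sapply(squares, function(x) x ^ 0.5)
roots_sapply
# [1] 1 2 3 4 5

# Many base functions are already vectorized, so no apply call is required.
roots_vectorized <- sqrt(squares)
roots_vectorized
# [1] 1 2 3 4 5
```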

### `lapply` & `sapply`

`lapply` applies a function to each element of a list, then returns a list containing the results. Since there is only one dimension in a list, the `MARGIN` parameter does not apply. Let’s use `sum` on the `squares` vector from before:

```r
lapply(squares, sum)
```

```
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25
```

This is obviously not what we wanted. The problem is that vectors and lists are not the same thing: `lapply` treats every element of the vector as its own list element, so `sum` is called once per number. Wrapping `squares` in a list with `list(squares)` makes the whole vector a single element, which fixes the issue:

```r
lapply(list(squares), sum)
```

```
[[1]]
[1] 55
```

`sapply` will function identically to `lapply` unless the output can be simplified, in which case `sapply` executes that simplification. The following occurs when we run `sapply` in place of `lapply` on our `squares` vector.

```r
sapply(squares, sum)
```

```
[1]  1  4  9 16 25
```

Unless you explicitly need a list, `sapply` can often be the more advantageous function because of its output simplification.
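
The simplification goes beyond vectors. As a small sketch, using a made-up list called `scores`, the same function yields a list under `lapply` and a matrix under `sapply`, because every result has the same length:

```r
# A named list of three numeric vectors (hypothetical example data).
scores <- list(a = c(1, 2, 3), b = c(10, 20, 30), c = c(100, 200, 300))

# lapply always returns a list, one element per input element.
as_list <- lapply(scores, range)
class(as_list)
# [1] "list"

# sapply simplifies equal-length results into a matrix, one column per element.
as_matrix <- sapply(scores, range)
as_matrix
#      a  b   c
# [1,] 1 10 100
# [2,] 3 30 300
```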

#### How do I get the mean values of the list `pamphlets` with attributes `pages`, `words`, and `letters`?

```r
lapply(pamphlets, mean)
```

```
$pages
[1] 3

$words
[1] 30

$letters
[1] 300
```
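
The contents of the list are not shown on this page, so the sketch below builds a hypothetical `pamphlets` list whose values were chosen only so that the means match the output above.

```r
# Hypothetical data; the real list is not shown in the text.
pamphlets <- list(
  pages   = c(1, 3, 5),
  words   = c(10, 30, 50),
  letters = c(100, 300, 500)
)

# One mean per list element, returned as a list.
pamphlet_means <- lapply(pamphlets, mean)
pamphlet_means$words
# [1] 30

# sapply returns the same means as a named numeric vector.
pamphlet_means_vec <- sapply(pamphlets, mean)
```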

### `tapply`

The documentation definition for `tapply` is a bit more specific than the others. The arguments are now `(X, INDEX, FUN)`, where `X` is an object to which the `split` function applies, `INDEX` is a factor by which `X` is grouped, and `FUN` is the function to apply, as before.

To simplify this definition, we can say `tapply` applies `FUN` to `X` when `X` is grouped by `INDEX`. Consider the following: we have a `grades` data.frame that contains information for grade, year, and sex for several students. We can use `tapply` to get the average grade by year in a simple way.

```r
grades
```

```
   grade      year  sex
1    100    junior    M
2     99 sophomore    F
3     75 sophomore    M
4     74 sophomore    M
5     44    senior    F
6     69    junior    M
7     88    junior    F
8     99    senior <NA>
9     90  freshman    M
10    92    junior    F
```

The solution begins below.

```r
tapply(grades$grade, grades$year, mean)
```

```
 freshman    junior    senior sophomore
 90.00000  87.25000  71.50000  82.66667
```
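
To make this reproducible, the sketch below rebuilds `grades` from the printed table and shows that `tapply` gives the same averages as splitting the grades by year and averaging each piece. The names `by_year` and `same_thing` are made up for illustration.

```r
# Rebuild the grades data frame from the printed table above.
grades <- data.frame(
  grade = c(100, 99, 75, 74, 44, 69, 88, 99, 90, 92),
  year  = c("junior", "sophomore", "sophomore", "sophomore", "senior",
            "junior", "junior", "senior", "freshman", "junior"),
  sex   = c("M", "F", "M", "M", "F", "M", "F", NA, "M", "F")
)

# split() makes one vector of grades per year; sapply() averages each one.
by_year <- sapply(split(grades$grade, grades$year), mean)
by_year

# tapply does the split and the apply in a single call.
same_thing <- tapply(grades$grade, grades$year, mean)
all.equal(as.numeric(by_year), as.numeric(same_thing))
# [1] TRUE
```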

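Any further arguments given to `tapply` are passed on to `FUN`. Here, `na.rm = TRUE` tells `mean` to drop missing values of `grade` within each year. No grade is missing in this data (the only `NA` is in `sex`, which `mean` never sees), so the result is unchanged.

```r
tapply(grades$grade, grades$year, mean, na.rm = TRUE)
```

```
 freshman    junior    senior sophomore
 90.00000  87.25000  71.50000  82.66667
```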
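
Note that `na.rm` only affects missing values in the vector handed to `mean`. To drop whole rows that have a missing value in any column (such as the `NA` in `sex`), one option is to filter with `complete.cases` first. This is a sketch on a rebuilt copy of `grades`; the names `complete_grades` and `clean_means` are made up for illustration.

```r
# Rebuild the grades data frame from the printed table.
grades <- data.frame(
  grade = c(100, 99, 75, 74, 44, 69, 88, 99, 90, 92),
  year  = c("junior", "sophomore", "sophomore", "sophomore", "senior",
            "junior", "junior", "senior", "freshman", "junior"),
  sex   = c("M", "F", "M", "M", "F", "M", "F", NA, "M", "F")
)

# Keep only rows with no missing value in any column (row 8 is dropped).
complete_grades <- grades[complete.cases(grades), ]

# The senior average now uses only the one remaining senior row.
clean_means <- tapply(complete_grades$grade, complete_grades$year, mean)
clean_means
#  freshman    junior    senior sophomore
#  90.00000  87.25000  44.00000  82.66667
```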

### Examples

#### How can I find the average of several variables in the `flight` dataset using 1 line of `lapply` code?

We can store the data for 2003 flights as follows:

```r
myDF <- read.csv("/depot/datamine/data/flights/subset/2003.csv")
```

We can categorize the flight distances into groups of up to 100 miles, 100-200 miles, 200-500 miles, 500-1000 miles, 1000-2000 miles, and 2000+ miles using the `cut` function:

```r
my_distance_categories <- cut(myDF$Distance, breaks = c(0, 100, 200, 500, 1000, 2000, Inf), include.lowest = TRUE)
```
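
To see what `cut` does with these breaks, here is a small sketch on made-up distances (the vector `distances` is hypothetical, not taken from the flights data). By default the intervals are closed on the right, so a flight of exactly 100 miles lands in the first category, and `include.lowest = TRUE` lets a distance of 0 fall into the first interval as well.

```r
# Hypothetical distances, one per category.
distances <- c(50, 150, 300, 750, 1500, 2500)
breaks <- c(0, 100, 200, 500, 1000, 2000, Inf)

# Each value is labelled with the interval it falls into.
categories <- cut(distances, breaks = breaks, include.lowest = TRUE)
as.integer(categories)
# [1] 1 2 3 4 5 6

# Boundary values go to the lower interval: 100 is category 1, 101 is category 2.
boundary <- as.integer(cut(c(0, 100, 101), breaks = breaks, include.lowest = TRUE))
boundary
# [1] 1 1 2

# table() counts how many values fall into each category.
category_counts <- table(categories)
```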

We can get the averages of all applicable flights for 4 variables, broken down by the distance categories we just defined.

```r
tapply(myDF$DepDelay, my_distance_categories, mean, na.rm = TRUE)  # the DepDelay in each category
tapply(myDF$ArrDelay, my_distance_categories, mean, na.rm = TRUE)  # the ArrDelay in each category
tapply(myDF$TaxiOut, my_distance_categories, mean, na.rm = TRUE)  # the time to TaxiOut in each category
tapply(myDF$TaxiIn, my_distance_categories, mean, na.rm = TRUE)  # the time to TaxiIn in each category
```

However, we can condense this to one line using `lapply` according to the prompt. To make it easier to read, we can make a temporary data frame `flights_by_distance` with these 4 variables. Then we split the data into 6 data.frames using the distance categories, yielding averages for `DepDelay`, `ArrDelay`, `TaxiOut`, and `TaxiIn`. This will agree exactly with the results of the 4 separate `tapply` functions, but it only takes us 1 call to `lapply`!

```r
flights_by_distance <- split(data.frame(myDF$DepDelay, myDF$ArrDelay, myDF$TaxiOut, myDF$TaxiIn), my_distance_categories)
lapply(flights_by_distance, colMeans, na.rm = TRUE)
```
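
The flights file is not available outside the course environment, so here is a sketch of the same split-then-`lapply` pattern on a tiny made-up data frame. The data frame `toy`, its values, and the shortened break points are all hypothetical. It checks that the `lapply` route and the `tapply` route agree.

```r
# Hypothetical stand-in for the flights data: two delay columns and a distance.
toy <- data.frame(
  DepDelay = c(5, NA, 15, 20, 30, 10),
  ArrDelay = c(0, 10, NA, 30, 40, 20),
  Distance = c(50, 80, 150, 180, 600, 700)
)
toy_categories <- cut(toy$Distance, breaks = c(0, 100, 200, Inf),
                      include.lowest = TRUE)

# One data frame per distance category, then column means within each.
pieces <- split(toy[, c("DepDelay", "ArrDelay")], toy_categories)
means_by_category <- lapply(pieces, colMeans, na.rm = TRUE)

# Pull out the DepDelay means and compare with the tapply result.
dep_from_lapply <- sapply(means_by_category, function(m) m[["DepDelay"]])
dep_from_tapply <- tapply(toy$DepDelay, toy_categories, mean, na.rm = TRUE)
all.equal(as.numeric(dep_from_lapply), as.numeric(dep_from_tapply))
# [1] TRUE
```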

#### How can I find the average of variables `DRUNK_DR`, `FATALS`, and `PERSONS` in the `fars` dataset using 1 line of `lapply` code?

This is a question that was asked in previous STAT19000 classes when the `apply` functions were introduced. We’ll start by reading in the dataset and adding state names.

There are more efficient ways to add the names, but this code mirrors the solution to the previous implementation of this question, which we’ll follow from here on out.

```r
dat <- read.csv("/depot/datamine/data/fars/7581.csv")
# `state_names` is assumed to be loaded already; it is not created on this page.
v <- state_names$state
names(v) <- state_names$code
dat$mystates <- v[as.character(dat$STATE)]
```
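
The lookup relies on a named vector acting like a dictionary. Below is a minimal sketch of that idea with a made-up three-row `state_names` table; the real table is assumed to have `code` and `state` columns like this one, and the vector `codes` is hypothetical.

```r
# Hypothetical lookup table standing in for the real `state_names`.
state_names <- data.frame(
  code  = c(1, 2, 4),
  state = c("Alabama", "Alaska", "Arizona"),
  stringsAsFactors = FALSE
)

# Names are the keys (state codes), values are the state names.
v <- state_names$state
names(v) <- state_names$code

# Indexing by character looks values up by name. Indexing by the number 4
# would instead ask for the 4th position, which does not exist here.
codes <- c(4, 1, 1, 2)
looked_up <- unname(v[as.character(codes)])
looked_up
# [1] "Arizona" "Alabama" "Alabama" "Alaska"
```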

If we want the averages for the 3 variables in question, we can call `tapply` on each one separately:

```r
tapply(dat$DRUNK_DR, dat$mystates, mean)
tapply(dat$FATALS, dat$mystates, mean)
tapply(dat$PERSONS, dat$mystates, mean)
```

However, there is an easier way that also fits the requirements of the prompt. We’ll create the data.frame `accidents_by_state` with only these 3 variables for readability:

```r
accidents_by_state <- split(data.frame(dat$DRUNK_DR, dat$FATALS, dat$PERSONS), dat$mystates)
lapply(accidents_by_state, colMeans)
```

The `split` function creates 51 different data.frames based on the values in `mystates`, and `lapply` then applies `colMeans` to each one to get the averages for our 3 variables. Awesome!

#### Use the provided code to create a new column `transformed` in the data.frame `example_df`. `transformed` should contain `TRUE` if the value in column `pre_transformed` is "t", `FALSE` if it is "f", and `NA` otherwise.

```r
string_to_bool <- function(value) {
  if (value == "t") {
    return(TRUE)
  } else if (value == "f") {
    return(FALSE)
  } else {
    return(NA)
  }
}

example_df <- data.frame(pre_transformed=c("f", "f", "t", "f", "something", "t", "else", ""), other=c(1,2,3,4,5,6,7,8))
```

The solution begins below.

```r
example_df$transformed <- sapply(example_df$pre_transformed, string_to_bool)
example_df
```

```
  pre_transformed other transformed
1               f     1       FALSE
2               f     2       FALSE
3               t     3        TRUE
4               f     4       FALSE
5       something     5          NA
6               t     6        TRUE
7            else     7          NA
8                     8          NA
```
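
As a hedged aside, two alternatives give the same result on character input. `vapply` is a stricter relative of `sapply` that requires you to declare the type of each result, and nested `ifelse` calls avoid the apply family altogether. The vector `values` is made up for illustration.

```r
string_to_bool <- function(value) {
  if (value == "t") {
    return(TRUE)
  } else if (value == "f") {
    return(FALSE)
  } else {
    return(NA)
  }
}

values <- c("f", "t", "something", "")

# vapply's third argument states the expected shape of each result (here one
# logical value), so an unexpected return type raises an error.
via_vapply <- vapply(values, string_to_bool, logical(1), USE.NAMES = FALSE)
via_vapply
# [1] FALSE  TRUE    NA    NA

# A vectorized alternative that needs no apply call at all.
via_ifelse <- ifelse(values == "t", TRUE, ifelse(values == "f", FALSE, NA))
via_ifelse
# [1] FALSE  TRUE    NA    NA
```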

#### Here we have not a question, but a demonstration. We use `tapply` in various ways on the Amazon Fine Food Reviews dataset.

The goal of our demonstration is to show the most consistently helpful users in this dataset. This is calculated using the `HelpfulnessNumerator` and `HelpfulnessDenominator` fields in the dataset. As an example, we find the user that wrote the most reviews.

```r
myDF <- read.csv("/depot/datamine/data/amazon/amazon_fine_food_reviews.csv")
tail(sort(table(myDF$UserId)))
```

The user in question is A3OXHLG6DIBRW8, referred to below as A3O. The code below computes two sums: the `HelpfulnessDenominator` sum is the total number of people who rated A3O’s reviews as helpful or not, while the `HelpfulnessNumerator` sum is the number of people who found those reviews helpful. Dividing the second sum by the first gives A3O’s helpfulness proportion.

```r
sum(myDF$HelpfulnessNumerator[myDF$UserId == "A3OXHLG6DIBRW8"])/sum(myDF$HelpfulnessDenominator[myDF$UserId == "A3OXHLG6DIBRW8"])
```

Instead of grabbing each user individually, we can use `tapply` to calculate these proportions for all users.

```r
tapply(myDF$HelpfulnessNumerator, myDF$UserId, sum)/tapply(myDF$HelpfulnessDenominator, myDF$UserId, sum)
```
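
The same calculation can be sketched on a tiny made-up data frame (`reviews` and its values are hypothetical). One thing it makes visible is that a user whose reviews received no votes produces `0/0`, which R reports as `NaN`, so those users may need to be dropped before ranking.

```r
# Hypothetical miniature of the reviews data.
reviews <- data.frame(
  UserId = c("u1", "u1", "u2", "u2", "u3"),
  HelpfulnessNumerator   = c(3, 1, 0, 2, 0),
  HelpfulnessDenominator = c(4, 1, 5, 5, 0)
)

# Per-user totals, then an element-wise quotient.
helpful <- tapply(reviews$HelpfulnessNumerator, reviews$UserId, sum)
votes   <- tapply(reviews$HelpfulnessDenominator, reviews$UserId, sum)
proportion <- helpful / votes
proportion
#  u1  u2  u3
# 0.8 0.2 NaN

# Drop users with no votes, then rank the rest from most to least helpful.
ranked <- sort(proportion[!is.nan(proportion)], decreasing = TRUE)
```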