Apply Functions

piping

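Much like the | operator in bash, the %>% operator in R pipes the output of the first expression into the second, where it is used as the first argument. For example, instead of:

sum(c(1,2,3))
[1] 6

you can write c(1,2,3) %>% sum(), which returns the same value.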

It is extremely common practice in the tidyverse to pipe output from one function to another. For example:

subset <- iris %>%
    subset(Sepal.Length > 5) %>%
    mutate(Sepal.Length.Sq = Sepal.Length^2)

head(subset)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length.Sq
 1          5.1         3.5          1.4         0.2  setosa           26.01
 2          5.4         3.9          1.7         0.4  setosa           29.16
 3          5.4         3.7          1.5         0.2  setosa           29.16
 4          5.8         4.0          1.2         0.2  setosa           33.64
 5          5.7         4.4          1.5         0.4  setosa           32.49
 6          5.4         3.9          1.3         0.4  setosa           29.16
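
As a rough sketch of what the pipe is doing, the chain above can also be written by nesting the calls by hand. The names piped and nested are illustrative only, and the example assumes dplyr is installed (it re-exports %>%).

```r
library(dplyr)

# Piped version: read top to bottom.
piped <- iris %>%
    subset(Sepal.Length > 5) %>%
    mutate(Sepal.Length.Sq = Sepal.Length^2)

# Nested version: the same calls, read from the inside out.
nested <- mutate(subset(iris, Sepal.Length > 5), Sepal.Length.Sq = Sepal.Length^2)

# Compare the two results.
all.equal(piped, nested)
```

The piped form tends to be easier to extend, since each additional step is just one more line at the end of the chain.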

select

select is a handy function used to select columns from a data.frame or tibble. For example:

iris %>% select(Sepal.Length, Species) %>% head()
   Sepal.Length Species
 1          5.1  setosa
 2          4.9  setosa
 3          4.7  setosa
 4          4.6  setosa
 5          5.0  setosa
 6          5.4  setosa

That alone is not that impressive, as we could easily do something like:

iris[, c("Sepal.Length", "Species")] %>% head()
    Sepal.Length Species
 1          5.1  setosa
 2          4.9  setosa
 3          4.7  setosa
 4          4.6  setosa
 5          5.0  setosa
 6          5.4  setosa

However, in the same way you can write 1:4 to represent a vector of numbers from 1-4, you can select columns from Sepal.Length to Petal.Length (and everything in between) by using Sepal.Length:Petal.Length.

iris %>% select(Sepal.Length:Petal.Length) %>% head()
   Sepal.Length Sepal.Width Petal.Length
 1          5.1         3.5          1.4
 2          4.9         3.0          1.4
 3          4.7         3.2          1.3
 4          4.6         3.1          1.5
 5          5.0         3.6          1.4
 6          5.4         3.9          1.7

select is particularly useful when paired with selection helpers, as you can select certain columns based on their names:

iris %>% select(contains("length")) %>% head()
   Sepal.Length Petal.Length
 1          5.1          1.4
 2          4.9          1.4
 3          4.7          1.3
 4          4.6          1.5
 5          5.0          1.4
 6          5.4          1.7
# or case sensitive

iris %>% select(contains("Length", ignore.case = FALSE)) %>% head()
   Sepal.Length Petal.Length
 1          5.1          1.4
 2          4.9          1.4
 3          4.7          1.3
 4          4.6          1.5
 5          5.0          1.4
 6          5.4          1.7

selection helpers

Selection helpers are functions that make selecting variables easier. They are particularly easy to use with select.

everything matches all variables. For example:

iris %>% select(everything()) %>% head()
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 1          5.1         3.5          1.4         0.2  setosa
 2          4.9         3.0          1.4         0.2  setosa
 3          4.7         3.2          1.3         0.2  setosa
 4          4.6         3.1          1.5         0.2  setosa
 5          5.0         3.6          1.4         0.2  setosa
 6          5.4         3.9          1.7         0.4  setosa

It is primarily useful when used in combination with functions like pivot_longer and pivot_wider.

last_col selects the last variable, possibly with an offset.

iris %>% select(last_col()) %>% head()
   Species
 1  setosa
 2  setosa
 3  setosa
 4  setosa
 5  setosa
 6  setosa

Or, with an offset:

iris %>% select(1:last_col(2)) %>% head()
   Sepal.Length Sepal.Width Petal.Length
 1          5.1         3.5          1.4
 2          4.9         3.0          1.4
 3          4.7         3.2          1.3
 4          4.6         3.1          1.5
 5          5.0         3.6          1.4
 6          5.4         3.9          1.7

contains selects columns whose names contain a given string. For example:

iris %>% select(contains("sepal")) %>% head()
   Sepal.Length Sepal.Width
 1          5.1         3.5
 2          4.9         3.0
 3          4.7         3.2
 4          4.6         3.1
 5          5.0         3.6
 6          5.4         3.9

contains is case insensitive by default.

In the same way that contains looks for a string within the column names of a data.frame, starts_with and ends_with select columns where column names either start with one or more values or end with one or more values (respectively). For example, to get the columns starting with "Sepal":

iris %>% select(starts_with("sepal")) %>% head()
   Sepal.Length Sepal.Width
 1          5.1         3.5
 2          4.9         3.0
 3          4.7         3.2
 4          4.6         3.1
 5          5.0         3.6
 6          5.4         3.9

Or to get columns that end in "width":

iris %>% select(ends_with("width")) %>% head()
   Sepal.Width Petal.Width
 1         3.5         0.2
 2         3.0         0.2
 3         3.2         0.2
 4         3.1         0.2
 5         3.6         0.2
 6         3.9         0.4

For more fine-grained control, matches behaves the same way, but takes a regular expression instead of a literal string. For example, we could get all columns whose names contain one or more ".":

iris %>% select(matches("\\.+")) %>% head()
   Sepal.Length Sepal.Width Petal.Length Petal.Width
 1          5.1         3.5          1.4         0.2
 2          4.9         3.0          1.4         0.2
 3          4.7         3.2          1.3         0.2
 4          4.6         3.1          1.5         0.2
 5          5.0         3.6          1.4         0.2
 6          5.4         3.9          1.7         0.4
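
As a further sketch (assuming dplyr is loaded, with illustrative variable names), anchoring the regular expression with ^ or $ restricts where in the column name the match may occur:

```r
library(dplyr)

# Columns whose names start with "Petal" followed by a literal "."
petal_cols <- iris %>% select(matches("^Petal\\.")) %>% names()

# Columns whose names end in "Length" or "Width"
measure_cols <- iris %>% select(matches("(Length|Width)$")) %>% names()

petal_cols
measure_cols
```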

Sometimes, you’ll have datasets with columns labeled sequentially, for example:

head(billboard)
 # A tibble: 6 x 79
   artist track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
   <chr>  <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2 Pac  Baby… 2000-02-26      87    82    72    77    87    94    99    NA
 2 2Ge+h… The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
 3 3 Doo… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
 4 3 Doo… Loser 2000-10-21      76    76    72    69    67    65    55    59
 5 504 B… Wobb… 2000-04-15      57    34    25    17    17    31    36    49
 6 98^0   Give… 2000-08-19      51    39    34    26    26    19     2     2
 # … with 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
 #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
 #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
 #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
 #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
 #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
 #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>,
 #   wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>, wk53 <dbl>, wk54 <dbl>,
 #   wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>, wk59 <dbl>, wk60 <dbl>,
 #   wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>, wk65 <dbl>, wk66 <lgl>,
 #   wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>, wk71 <lgl>, wk72 <lgl>,
 #   wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>

Here, we have columns labeled wk1 all the way until wk76. Using num_range and select we can get any number of those specific columns:

billboard %>% select(num_range("wk", 70:75)) %>% head()
 # A tibble: 6 x 6
   wk70  wk71  wk72  wk73  wk74  wk75
   <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 NA    NA    NA    NA    NA    NA
 2 NA    NA    NA    NA    NA    NA
 3 NA    NA    NA    NA    NA    NA
 4 NA    NA    NA    NA    NA    NA
 5 NA    NA    NA    NA    NA    NA
 6 NA    NA    NA    NA    NA    NA
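
When the numbers in the column names are zero-padded, the width argument of num_range may help. The sketch below uses a small hypothetical data frame (padded) made up purely for illustration:

```r
library(dplyr)

# A hypothetical data frame with zero-padded column names.
padded <- data.frame(id = 1:2, x01 = c(1, 2), x02 = c(3, 4), x03 = c(5, 6))

# width = 2 tells num_range to look for x01, x02, ... rather than x1, x2, ...
padded_subset <- padded %>% select(num_range("x", 1:2, width = 2))
names(padded_subset)
```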

all_of is a selection helper designed to select strictly the columns whose names are inside the provided vector.

my_values <- c("Sepal.Length", "Sepal.Width")
iris %>% select(all_of(my_values)) %>% head()
   Sepal.Length Sepal.Width
 1          5.1         3.5
 2          4.9         3.0
 3          4.7         3.2
 4          4.6         3.1
 5          5.0         3.6
 6          5.4         3.9

However, if any value in your vector isn’t present, an error is thrown.

my_values <- c("Sepal.Length", "Sepal.Width", "Sepal.Weight")
iris %>% select(all_of(my_values)) %>% head()
Error: Cannot subset columns that do not exist.
✖ Column `Sepal.Weight` does not exist.

For times you would like to select the values if they exist, any_of is more useful. It is similar to all_of, but silently ignores names that don’t exist.

my_values <- c("Sepal.Length", "Sepal.Width", "Sepal.Weight")
iris %>% select(any_of(my_values)) %>% head()
   Sepal.Length Sepal.Width
 1          5.1         3.5
 2          4.9         3.0
 3          4.7         3.2
 4          4.6         3.1
 5          5.0         3.6
 6          5.4         3.9
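
any_of can also be handy for dropping columns that may or may not be present. A minimal sketch, assuming dplyr is loaded (to_drop and without_dropped are illustrative names):

```r
library(dplyr)

# "Sepal.Weight" does not exist in iris; any_of skips it without an error.
to_drop <- c("Species", "Sepal.Weight")
without_dropped <- iris %>% select(-any_of(to_drop))
names(without_dropped)
```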

transmute

transmute is a useful function that adds new variables and drops all existing ones. If a variable already exists, it overwrites the variable. For example, let’s say we wanted to capitalize the values of Species in the iris dataset:

iris %>%
    transmute(Species = toupper(Species)) %>%
    head()
   Species
 1  SETOSA
 2  SETOSA
 3  SETOSA
 4  SETOSA
 5  SETOSA
 6  SETOSA

Here, the values in the Species column are overwritten with the fully capitalized version. All of the other columns are dropped. One way to keep other columns is to include them in the transmute call:

iris %>%
    transmute(Species = toupper(Species), Sepal.Length, Sepal.Width) %>%
    head()
   Species Sepal.Length Sepal.Width
 1  SETOSA          5.1         3.5
 2  SETOSA          4.9         3.0
 3  SETOSA          4.7         3.2
 4  SETOSA          4.6         3.1
 5  SETOSA          5.0         3.6
 6  SETOSA          5.4         3.9

Alternatively, you could use mutate, which has the same behavior, but preserves existing variables.

mutate

mutate is just like transmute, but the existing columns are preserved. For example:

iris %>%
    mutate(Species = toupper(Species)) %>%
    head()
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 1          5.1         3.5          1.4         0.2  SETOSA
 2          4.9         3.0          1.4         0.2  SETOSA
 3          4.7         3.2          1.3         0.2  SETOSA
 4          4.6         3.1          1.5         0.2  SETOSA
 5          5.0         3.6          1.4         0.2  SETOSA
 6          5.4         3.9          1.7         0.4  SETOSA

Here, since Species already exists as a column, the column is overwritten by our new capitalized values. If the name of the new column does not already exist, the original Species column will remain untouched. For example:

iris %>%
    mutate(Species_Cap = toupper(Species)) %>%
    head()
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_Cap
 1          5.1         3.5          1.4         0.2  setosa      SETOSA
 2          4.9         3.0          1.4         0.2  setosa      SETOSA
 3          4.7         3.2          1.3         0.2  setosa      SETOSA
 4          4.6         3.1          1.5         0.2  setosa      SETOSA
 5          5.0         3.6          1.4         0.2  setosa      SETOSA
 6          5.4         3.9          1.7         0.4  setosa      SETOSA

mutate is extremely useful, and is arguably more readable than the equivalent operations in pandas in Python.

case_when

case_when is a function that allows you to vectorize multiple if_else statements. For example, let’s say we want to create a new column in our iris dataset called size, where the value is Large if Sepal.Length is greater than 5, and Not Large otherwise.

new_iris <- iris %>%
    mutate(size = case_when(
        Sepal.Length > 5 ~ "Large",
        Sepal.Length <= 5 ~ "Not Large"
    ))

head(new_iris)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species      size
 1          5.1         3.5          1.4         0.2  setosa     Large
 2          4.9         3.0          1.4         0.2  setosa Not Large
 3          4.7         3.2          1.3         0.2  setosa Not Large
 4          4.6         3.1          1.5         0.2  setosa Not Large
 5          5.0         3.6          1.4         0.2  setosa Not Large
 6          5.4         3.9          1.7         0.4  setosa     Large

Here, mutate is responsible for creating a new column called size, and case_when assigns the value Large when Sepal.Length is greater than 5 and Not Large when Sepal.Length is less than or equal to 5. In this case the conditions are exhaustive: every possible value of Sepal.Length has an associated value of size (Large or Not Large). In reality, this is not always possible. For example, let’s remove the second case:

new_iris <- iris %>%
    mutate(size = case_when(
        Sepal.Length > 5 ~ "Large"
    ))

head(new_iris)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  size
 1          5.1         3.5          1.4         0.2  setosa Large
 2          4.9         3.0          1.4         0.2  setosa  <NA>
 3          4.7         3.2          1.3         0.2  setosa  <NA>
 4          4.6         3.1          1.5         0.2  setosa  <NA>
 5          5.0         3.6          1.4         0.2  setosa  <NA>
 6          5.4         3.9          1.7         0.4  setosa Large

As you can see, by default, if no cases match, NA is the resulting value. One common technique to handle "all other cases" is the following:

new_iris <- iris %>%
    mutate(size = case_when(
        Sepal.Length > 5 ~ "Large",
        TRUE ~ "Not Large"
    ))

head(new_iris)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species      size
 1          5.1         3.5          1.4         0.2  setosa     Large
 2          4.9         3.0          1.4         0.2  setosa Not Large
 3          4.7         3.2          1.3         0.2  setosa Not Large
 4          4.6         3.1          1.5         0.2  setosa Not Large
 5          5.0         3.6          1.4         0.2  setosa Not Large
 6          5.4         3.9          1.7         0.4  setosa     Large

Here, the cases are evaluated in order, and the first one that matches determines the result. If none of the earlier cases match, the final TRUE case always does, so the result is Not Large.
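
Because the first matching case wins, the order of the cases matters. The sketch below (with an illustrative third category, Very Large, that is not part of the examples above) places the most specific condition first:

```r
library(dplyr)

sizes <- iris %>%
    mutate(size = case_when(
        Sepal.Length > 7 ~ "Very Large",   # checked first
        Sepal.Length > 5 ~ "Large",        # only reached if the first test fails
        TRUE ~ "Not Large"                 # everything else
    ))
table(sizes$size)
```

If the two Sepal.Length conditions were swapped, no row would ever be labeled Very Large, since any value greater than 7 is also greater than 5.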

between

between is a simple function from dplyr that is an efficiently implemented shortcut for the following:

x <- 5

print(x >= 4 & x <= 10)
 [1] TRUE
# instead you can use between
between(x, 4, 10)
 [1] TRUE
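
between is vectorized, so it can be applied to a whole column at once. A minimal sketch, assuming dplyr is loaded (mid_sepals is an illustrative name):

```r
library(dplyr)

x <- c(3, 4, 7, 10, 11)
in_range <- between(x, 4, 10)   # both endpoints are inclusive
in_range

# Keep only the rows with Sepal.Length from 5 to 5.5 (inclusive).
mid_sepals <- iris %>% filter(between(Sepal.Length, 5, 5.5))
range(mid_sepals$Sepal.Length)
```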

glimpse

filter

arrange

group_by

group_by is a function commonly used in conjunction with mutate, transmute, and summarize. It is useful when you want to perform a tapply-like operation on a data.frame. For example, let’s say you wanted to get the average Petal.Length by Species. Using tapply, you would do something like:

tapply(iris$Petal.Length, iris$Species, mean)
     setosa versicolor  virginica
      1.462      4.260      5.552

While useful, tapply's end result isn’t in a format that is conducive to further analysis or wrangling. For example, what if we wanted to calculate and then plot (in ggplot) the difference between the mean Petal.Length and the mean Sepal.Length by Species? Using tapply, you would have to do something like:

diff <- tapply(iris$Petal.Length, iris$Species, mean) - tapply(iris$Sepal.Length, iris$Species, mean)
myDF <- data.frame(Species = names(diff), diff = unname(diff))
ggplot(myDF, aes(x=diff, y=Species)) + geom_bar(stat="identity")

Figure: tidyverse groupby example 1

This is a little more difficult to read than the following and, if you had more operations to complete, the approach would become harder still. In the following example, however, we can continue to use and build on myDF. Here summarize returns one row per Species, so the plot shows a single bar per group:

myDF <- iris %>%
    group_by(Species) %>%
    summarize(diff=mean(Petal.Length) - mean(Sepal.Length))

myDF %>% ggplot(aes(x=diff, y=Species)) + geom_bar(stat="identity")

Figure: tidyverse groupby example 2
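
To make the effect of group_by more concrete, the following sketch (assuming dplyr is loaded; the object names are illustrative) contrasts mutate, which keeps every row, with summarize, which returns one row per group:

```r
library(dplyr)

# mutate keeps all rows; each row receives its group's value.
with_mutate <- iris %>%
    group_by(Species) %>%
    mutate(mean_petal = mean(Petal.Length))

# summarize collapses each group to a single row.
with_summarize <- iris %>%
    group_by(Species) %>%
    summarize(mean_petal = mean(Petal.Length))

nrow(with_mutate)      # 150, one per original row
nrow(with_summarize)   # 3, one per Species
```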

summarize

summarize is a useful function to get a new, tidy, data frame that is a summary of some other data. It’s particularly useful in conjunction with group_by, when you want to compare groups.

For example, let’s say you wanted to do the following:

  • Create a new column called Sepal.Length.Cat with values small when Sepal.Length < 5.1, large when Sepal.Length >= 5.8, and medium otherwise.

  • Get a summary containing the average Sepal.Width by Sepal.Length.Cat and Species.

  • Get a summary containing the standard deviation of those averages for each Species.

iris %>%
    mutate(Sepal.Length.Cat = case_when(
        Sepal.Length < 5.1 ~ "small",
        Sepal.Length >= 5.8 ~ "large",
        TRUE ~ "medium"
    )) %>%
    group_by(Sepal.Length.Cat, Species) %>%
    summarize(avg_sepal_width_grouped = mean(Sepal.Width)) %>%
    group_by(Species) %>%
    summarize(std_of_avgs = sd(avg_sepal_width_grouped))
 `summarise()` regrouping output by 'Sepal.Length.Cat' (override with `.groups` argument)
 `summarise()` ungrouping output (override with `.groups` argument)
 # A tibble: 3 x 2
   Species    std_of_avgs
   <fct>            <dbl>
 1 setosa           0.402
 2 versicolor       0.329
 3 virginica        0.255

As you can see, it has some pretty powerful functionality that would be more difficult to replicate (and harder to read) using base R.
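
The regrouping messages in the output above mention a .groups argument. As a hedged sketch of how it might be used (assuming dplyr 1.0.0 or later; wide_sepal and avg_widths are illustrative names), .groups = "drop" asks summarize to return an ungrouped result:

```r
library(dplyr)

avg_widths <- iris %>%
    group_by(Species, wide_sepal = Sepal.Width > 3) %>%
    summarize(
        n = n(),                               # number of rows in each group
        avg_petal_length = mean(Petal.Length),
        .groups = "drop"                       # return an ungrouped tibble
    )
avg_widths
```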

str_extract and str_extract_all

str_extract and str_extract_all are useful functions from the stringr package. You can install the package by running:

install.packages("stringr")

str_extract extracts the text that matches the provided regular expression or pattern. Note that this differs from grep in a major way. In R, grep returns the index of each element in which a match was found (the command-line grep typically returns the entire matching line), whereas str_extract returns only the part of the text that matches the pattern. For example:

text <- c("cat", "mat", "spat", "spatula", "gnat")

# All 5 "lines" of text were a match.
grep(".*at", text)
 [1] 1 2 3 4 5
text <- c("cat", "mat", "spat", "spatula", "gnat")
stringr::str_extract(text, ".*at")
 [1] "cat"  "mat"  "spat" "spat" "gnat"

As you can see, although all 5 words match our pattern and would be returned by grep, str_extract only returns the actual text that matches the pattern. In this case "spatula" is not a "full" match: the pattern .*at only captures the "spat" part of "spatula". In order to capture the rest of the word you would need to add something like .* to the end of the pattern:

text <- c("cat", "mat", "spat", "spatula", "gnat")
stringr::str_extract(text, ".*at.*")
 [1] "cat"     "mat"     "spat"    "spatula" "gnat"
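
When an element has no match at all, str_extract returns NA for that element. A small sketch with a made-up vector (the word "dog" is added purely for illustration):

```r
library(stringr)

text <- c("cat", "dog", "spatula")

# "dog" does not contain "at", so its result is NA.
extracted <- str_extract(text, ".*at")
extracted

# str_detect reports which elements matched, similar in spirit to grepl.
has_match <- str_detect(text, ".*at")
has_match
```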

Examples

How can I extract the text between parentheses in a vector of texts?

text <- c("this is easy for (you)", "there (are) challenging ones", "text is (really awesome) (ok?)")

# Search for a literal "(", followed by any amount of any text other than more parentheses ([^()]*), followed by a literal ")".
stringr::str_extract(text, "\\([^()]*\\)")
 [1] "(you)"            "(are)"            "(really awesome)"

To get all matches, not just the first match:

text <- c("this is easy for (you)", "there (are) challenging ones", "text is (really awesome) more text (ok?)")

# Search for a literal "(", followed by any amount of any text other than more parentheses ([^()]*), followed by a literal ")".
stringr::str_extract_all(text, "\\([^()]*\\)")
 [[1]]
 [1] "(you)"

 [[2]]
 [1] "(are)"

 [[3]]
 [1] "(really awesome)" "(ok?)"
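
If only the text inside the parentheses is wanted, without the parentheses themselves, one possible approach is a capture group with stringr's str_match. This is a sketch of an alternative, not part of the examples above:

```r
library(stringr)

text <- c("this is easy for (you)", "there (are) challenging ones")

# str_match returns a matrix: column 1 holds the full match,
# column 2 holds the first capture group (the text inside the parentheses).
inside <- str_match(text, "\\(([^()]*)\\)")[, 2]
inside
```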

lubridate

lubridate is a fantastic package that makes the typical tasks one would perform on dates that much easier.

Examples

How do I convert a string "07/05/1990" to a Date?

library(lubridate)
 Attaching package: 'lubridate'
 The following objects are masked from 'package:data.table':

     hour, isoweek, mday, minute, month, quarter, second, wday, week,
     yday, year
 The following objects are masked from 'package:base':

     date, intersect, setdiff, union
dat <- "07/05/1990"
dat <- mdy(dat)
class(dat)
 [1] "Date"

How do I convert a string "31-12-1990" to a Date?

my_string <- "31-12-1990"
dat <- dmy(my_string)
dat
 [1] "1990-12-31"
class(dat)
 [1] "Date"

How do I convert a string "31121990" to a Date?

my_string <- "31121990"
my_date <- dmy(my_string)
my_date
 [1] "1990-12-31"
class(my_date)
 [1] "Date"

How do I extract the day, week, month, quarter, and year from a Date?

my_date <- dmy("31121990")
day(my_date)
 [1] 31
week(my_date)
 [1] 53
month(my_date)
 [1] 12
quarter(my_date)
 [1] 4
year(my_date)
 [1] 1990
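
Once a value is a Date, simple date arithmetic also becomes possible. A brief sketch, assuming lubridate is loaded (next_day and days_elapsed are illustrative names):

```r
library(lubridate)

my_date <- dmy("31121990")

# Add one day using lubridate's days() period helper.
next_day <- my_date + days(1)
next_day

# Subtracting two Dates gives a time difference; convert it to a plain number of days.
days_elapsed <- as.numeric(ymd("1991-03-01") - my_date)
days_elapsed
```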

strrep

strrep is a function that allows you to repeat a string a given number of times.

Examples

How can I repeat a string of the characters ABC three times?

strrep("ABC", 3)
 [1] "ABCABCABC"

How can I get a vector in which A is repeated twice, B three times, and C four times?

strrep(c("A", "B", "C"), c(2,3,4))
 [1] "AA"   "BBB"  "CCCC"
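
One possible everyday use, sketched here with a made-up title, is underlining a heading in console output. strrep is part of base R, so no packages are needed:

```r
title <- "Results"

# Repeat "-" once for each character in the title.
rule <- strrep("-", nchar(title))

# Print the title with the rule on the line below it.
banner <- paste(title, rule, sep = "\n")
cat(banner, "\n")
```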

nchar

nchar is a function which counts the number of characters and symbols in a word or a string. Punctuation and blank spaces are counted as well.

Examples

How can I find the number of characters and/or symbols in the word "Protozoa"?

nchar("Protozoa")
 [1] 8

How can I find the number of characters and/or symbols for the following strings all at once: "pneumonoultramicroscopicsilicovolcanoconiosis", "password: DatamineRocks#stat1900@"?

string_vector <- c("pneumonoultramicroscopicsilicovolcanoconiosis", "password: DatamineRocks#stat1900@")
nchar(string_vector)
 [1] 45 33

Fun Fact: pneumonoultramicroscopicsilicovolcanoconiosis is often cited as the longest word in major English dictionaries.
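
Since nchar is vectorized, its result can be used to filter a vector by length. A minimal sketch with an illustrative vector of words:

```r
words <- c("cat", "spatula", "pneumonoultramicroscopicsilicovolcanoconiosis")

# Keep only the words with more than 5 characters.
long_words <- words[nchar(words) > 5]
long_words
```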

Resources

A comprehensive cheatsheet for lubridate is linked below. It’s an excellent resource to begin learning and working with lubridate quickly.