grep and Other RegEx Functions

grep and grepl

grep stands for " globally search for a regular expression and print all matches," just as in UNIX. The function allows you to use regular expressions to search for a pattern in a vector of strings or characters, and returns the index (indices) of the match(es).

Additionally, the function grepl (derived from grep-logical) uses the same inputs, but returns a logical vector, where TRUE indicates a match at that index, and FALSE indicates the opposite.

Examples

Given a character vector, return the index of any words ending in 's'.

Solution
grep(".*s$", c("waffle", "waffles", "pancake", "pancakes"))
[1] 2 4

Given a character vector, return TRUE of any words ending in 's', FALSE otherwise.

Solution
grepl(".*s$", c("cats", "bats", "geese", "meese"))
# fun fact: meese is not the plural of moose
[1]  TRUE  TRUE FALSE FALSE

grep and grepl can be used on individual strings, though they match the entire string, not the index of the character that matches the regular expression. For grep, a hit would return [1] 1 and a miss would return integer(0). For grepl, you still get TRUE or FALSE.


sub and gsub

Oftentimes finding the indices of matches in your text isn’t what you want — your goal is to change the text into a format that’s better for parsing. For this, we have sub and gsub. These functions take a regular expression and a replacement expression, applying the replacement to a string or a vector of strings.

The key difference here is that sub applies only to the first match, while gsub applies to all matches (derived from global-substitution).

Examples

Given a string, replace the first 'l' with '?'

Solution
sub("l", "?", "The best part of waking up is Folgers in your cup")
# not sponsored or affiliated
[1] "The best part of waking up is Fo?gers in your cup"

Given a string, replace all vowels with '!'

Solution
gsub("[aeiou]", "!", "The best part of waking up is Folgers in your cup")
# globally not sponsored or affiliated
[1] "Th! b!st p!rt !f w!k!ng !p !s F!lg!rs !n y!!r c!p"

For these functions, it’s equally valid to apply vector-wise or individually. Applying on vectors will repeat the substitution process for each individual string, so naturally a single string would work.


Resources

Regular expressions are hard, even for some veteran programmers, as the rules and match characters are subtly different for each programming language. This is a great resource for RegEx basics — everything on the second page is useful — and string manipulations which encompass more than that of sub and gsub.