Variables

Overview

A variable is a named reference to a value. Variables have types that define what type of data they hold.

In R, there are 5 basic variable types: character, integer, double, logical, and complex. Additionally, you might run into the structures raw, list, closure, special, builtin, environment, and S4.
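
As a quick illustration, typeof() reports the internal type for most of the names listed above. This is a minimal sketch; the values passed in are arbitrary examples chosen here for illustration.

```r
# typeof() returns the internal storage type of any R object.
typeof("a")           # "character"
typeof(1L)            # "integer"
typeof(1)             # "double"
typeof(TRUE)          # "logical"
typeof(1i)            # "complex"
typeof(raw(1))        # "raw"
typeof(list(1, "a"))  # "list"
typeof(mean)          # "closure" (a function written in R)
typeof(`if`)          # "special"
typeof(sum)           # "builtin"
typeof(globalenv())   # "environment"
```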


Basic Variables

Character

In R, a string is stored as an element of a character vector, so a single string is technically a character vector of length 1. Strings start and end with double quotes (") or single quotes ('). Strings are less central in R than in many general-purpose languages, since its functionality focuses more on statistical analysis.

typeof('my_character_vector')
typeof("my_character_vector")
'character'
'character'
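
The following sketch, using made-up values, suggests how a string behaves as a one-element character vector. The variable names s and words are illustrative only.

```r
# A single string is a character vector with one element.
s <- "hello"
length(s)            # 1 (one element, not five)
nchar(s)             # 5 (number of characters in that element)

# Several strings form a longer character vector.
# Single and double quotes are interchangeable.
words <- c("stat", 'data')
length(words)        # 2
is.character(words)  # TRUE
```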


Integer

An integer is a numeric value without a fractional part (no decimal). In many dynamic languages, when you create a variable to hold a number without a decimal point, it will be an integer. For example, when using Python, the line variable = 5 will result in variable having type integer, while the line variable = 5.0 will result in variable having type float (Python’s version of a decimal number).

In R, however, this is not the case. In R, 5 and 5.0 will both be stored as doubles unless explicitly set as integers using as.integer().

my_variable <- 5
typeof(my_variable)

my_variable <- 5.0
typeof(my_variable)
'double'
'double'
my_variable <- as.integer(5)
typeof(my_variable)
'integer'

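A key exception to this rule is the colon operator: a sequence such as 1:5 is stored as an integer vector when its starting value is a whole number, and as a double vector otherwise.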

test <- 1:5
typeof(test)
test <- 1.2:4
typeof(test)
'integer'
'double'
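
Besides as.integer(), R also accepts an L suffix on a numeric literal to request an integer. The sketch below is an addition for illustration; it also shows that division returns a double even when both operands are integers.

```r
x <- 5L                 # the L suffix creates an integer literal
typeof(x)               # "integer"
is.integer(5)           # FALSE (a bare 5 is a double)
is.integer(5L)          # TRUE

# Integer arithmetic stays integer, except for division.
typeof(5L + 1L)         # "integer"
typeof(5L / 1L)         # "double"
```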


Double

A double is a numeric value that can hold a fractional part. As shown above, R stores numeric literals as doubles by default, and you can verify this using typeof. To convert a variable to a double, use the as.double() function. Notably, as.double() will convert a character vector to a double vector if possible.

char_to_double <- as.double("4.0")
typeof(char_to_double)
'double'
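
When a string cannot be read as a number, as.double() returns NA for that element and raises a warning. A small sketch with made-up input values:

```r
values <- c("4.0", "1e3", "abc")

# suppressWarnings() hides the coercion warning triggered by "abc".
converted <- suppressWarnings(as.double(values))

converted          # 4 1000 NA
typeof(converted)  # "double"
is.na(converted)   # FALSE FALSE TRUE
```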


Logical

Logical constants refer to the boolean values of TRUE, FALSE, and the bonus NA.

You might see T and F used in place of TRUE and FALSE, which is not incorrect, but the difference is that T and F are variables given by R that can be changed, while TRUE and FALSE are constants and cannot be changed. If you save some other value to T/F, they will no longer correspond to TRUE/FALSE.
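
The sketch below illustrates the risk. The names shadowed and restored are hypothetical helpers used only to record the results.

```r
T <- 0                  # legal: T is an ordinary variable
shadowed <- isTRUE(T)   # FALSE, because T now holds 0
shadowed

# Assigning to TRUE itself (for example TRUE <- 0) is an error,
# so it is left out of this runnable sketch.

rm(T)                   # remove our T so the built-in one is visible again
restored <- isTRUE(T)   # TRUE
restored
```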

Logical values are critical in R. They are constantly used when indexing vectors and data.frames. Remember that the result of a comparison is a vector of logical values, one for each element compared. For example, the following code will return all values in our vector that are greater than 5.

vec <- 1:10
vec[vec > 5]

The inner statement vec > 5 will only give you a logical vector, indicating TRUE or FALSE depending on the statement’s truth at each index:

vec <- 1:10
vec > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE

So, vec[vec > 5] is really equivalent to the following.

vec[c(FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE,TRUE)]
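
Logical vectors can also be combined before indexing. The following sketch, with the hypothetical name keep, selects values that are both greater than 5 and even.

```r
vec <- 1:10

# & is the element-wise AND of two logical vectors.
keep <- vec > 5 & vec %% 2 == 0

vec[keep]     # 6 8 10
which(keep)   # 6 8 10 (positions; they match the values because vec is 1:10)
```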

What if we wanted to count the number of even values in our vector? One may consider the following code:

vec <- 1:10
length(vec[vec %% 2 == 0])
5

However, a more succinct version of this code is as follows.

vec <- 1:10
sum(vec %% 2 == 0)
5

As you can see, because R coerces the logical values to be 1 for TRUE and 0 for FALSE, summing the logical vector counts the even values directly.
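
The coercion of TRUE to 1 and FALSE to 0 can be inspected directly. In this sketch, is_even is an illustrative name; mean() on a logical vector gives the proportion of TRUE values.

```r
vec <- 1:10
is_even <- vec %% 2 == 0

as.integer(is_even)   # 0 1 0 1 0 1 0 1 0 1
sum(is_even)          # 5   (count of even values)
mean(is_even)         # 0.5 (proportion of even values)
```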


Other Structures

Complex

While it is less likely that you will need to use complex numbers in R, they have first-class support. A complex number is a pair of real numbers, the real part and the imaginary part, written in R in a form such as 3+2i. For example, the following code shows a couple of expressions that produce complex numbers.

0i ^ (-3:3)
1i^2

R (like many other programming languages) treats any token that starts with a digit as a number, unless it is surrounded by quotes. A number followed directly by i, such as 1i, tells R that the value is imaginary.

Be careful, however, as code that you may expect to produce a complex number might not do that:

sqrt(-1)
[1] NaN
Warning message:
In sqrt(-1) : NaNs produced
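
To get a complex result, the input itself has to be complex. The sketch below is an illustration with arbitrary values; z and root are hypothetical names.

```r
# Build a complex number from its parts.
z <- complex(real = 3, imaginary = 2)
z            # 3+2i
Re(z)        # 3 (real part)
Im(z)        # 2 (imaginary part)
Mod(3+4i)    # 5 (modulus)

# Converting -1 to complex first lets sqrt() return a complex root.
root <- sqrt(as.complex(-1))
root         # 0+1i
```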

You can read more about complex numbers in R here.


Coercion

Coercion is the process of changing the type of a variable, either explicitly by using a special function or implicitly by performing an operation on a variable of one type, when the operation was meant for another type. The following is an example of coercion:

typeof(paste("test", 5.0))
'character'

Here, 5.0 is a double, and "test" is a character vector. paste is a function that expects character vector(s) as input and returns the concatenation of the input vectors. We instead passed a character vector and a double, so R coerced the double to a character vector so that the operation would complete successfully.

In general, R will coerce types from more to less specific. In the above example, the coercion of 5.0 makes sense: it’s easy to represent 5.0 as the string "5", while it’s hard to turn "test" into a double. Another example is the following:

my_integer <- as.integer(5)
my_double <- 7.0
my_result <- my_integer + my_double
typeof(my_result)
'double'

This logic is important for preventing the loss of data — the number 5.12345 cannot be stored as an integer without losing information.
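
Combining values with c() shows the same ordering, from logical to integer to double to character. This is a minimal sketch with arbitrary values; mixed is an illustrative name.

```r
typeof(c(TRUE, 2L))    # "integer"
typeof(c(2L, 3.5))     # "double"
typeof(c(3.5, "a"))    # "character"

# Everything is coerced to the least specific type present.
mixed <- c(TRUE, 2L, 3.5, "a")
mixed                  # "TRUE" "2" "3.5" "a"

# Going the other way loses information: the fraction is dropped.
as.integer(5.12345)    # 5
```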


Factors

A factor is R’s way of representing a categorical variable. There are elements in a factor (just like there are elements in a vector), but they are constrained to only be chosen from a specific set of values, called "levels". They are useful when a vector has only a few different values — "Male"/"Female" or "A"/"B"/"C".

The factor() function casts a variable to a factor, the is.factor() function tests whether a variable is a factor, and the levels() function lists all of the levels of a factor.
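
The sketch below, using a made-up grades vector, illustrates how a factor stores its values as integer codes that point into the levels, and what happens when a value outside the levels is assigned.

```r
grades <- factor(c("B", "A", "C", "A"))

levels(grades)       # "A" "B" "C" (sorted alphabetically by default)
as.integer(grades)   # 2 1 3 1     (position of each value in the levels)
table(grades)        # counts per level

# Assigning a value that is not a level produces NA (with a warning).
suppressWarnings(grades[1] <- "D")
is.na(grades[1])     # TRUE
```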


Examples

How do I test whether or not a vector is a factor?

test_factor <- factor("Male")
is.factor(test_factor)
[1] TRUE

List the levels we have in vec.

vec <- factor(c("Male", "Female", "Female"))
levels(vec)
[1] "Female" "Male"

How can I rename the levels of a factor?

vec <- factor(c("Male", "Female", "Female"))
levels(vec)
[1] "Female" "Male"
levels(vec) <- c("F", "M")
vec
[1] M F F
Levels: F M
# be careful! Order matters, this is wrong:
vec <- factor(c("Male", "Female", "Female"))
levels(vec)
[1] "Female" "Male"
# here we incorrectly rename "Female"'s to "M" instead of "F"
levels(vec) <- c("M", "F")
vec
[1] F M M
Levels: M F

How can I find the number of levels of a factor?

vec <- factor(c("Male", "Female", "Female"))
nlevels(vec)
[1] 2


Dates

Date is a class that supports special operations. Subtracting one Date from another returns the number of days between them, and adding a number to a Date returns a new Date: adding 30, for example, returns the Date 30 days in the future.
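
A short sketch of this arithmetic, with dates chosen here for illustration (start and end are hypothetical names):

```r
# Strings already in YYYY-MM-DD form need no format argument.
start <- as.Date("1990-07-05")
end   <- as.Date("1990-12-31")

end - start               # Time difference of 179 days
as.numeric(end - start)   # 179
start + 30                # "1990-08-04"
class(start + 30)         # "Date"
```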

When converting strings with as.Date(), you will usually need to specify the format argument to match the layout of your date strings.

For example, if you had a string "07/05/1990", the format would be %m/%d/%Y, where %m matches a zero-padded month value, each / matches a literal /, %d matches a zero-padded day value, and %Y matches a 4-digit year in the format YYYY. If your string was "31-12-90", the format string would be %d-%m-%y. Replace %d, %m, %Y, and %y according to your date strings. A full list of formats can be found here.
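
The two-digit year case described above can be sketched as follows. The same format codes also work in the other direction with format(), and a format that does not match the string yields NA rather than an error. The names d and bad are illustrative.

```r
# %y reads a two-digit year; 90 is interpreted as 1990.
d <- as.Date("31-12-90", format = "%d-%m-%y")
d                        # "1990-12-31"

# format() turns a Date back into a string using the same codes.
format(d, "%m/%d/%Y")    # "12/31/1990"

# A mismatched format gives NA.
bad <- as.Date("07/05/1990", format = "%d-%m-%Y")
is.na(bad)               # TRUE
```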

Working with dates can be difficult and confusing. See here for more information about a package called lubridate which provides a much easier interface to working with dates.

Examples

How do I convert a string "07/05/1990" to a Date?

my_string <- "07/05/1990"
my_date <- as.Date(my_string, format="%m/%d/%Y")
my_date
[1] "1990-07-05"

How do I convert a string "31-12-1990" to a Date?

my_string <- "31-12-1990"
my_date <- as.Date(my_string, format="%d-%m-%Y")
my_date
[1] "1990-12-31"

How do I convert a string "12-31-1990" to a Date?

my_string <- "12-31-1990"
my_date <- as.Date(my_string, format="%m-%d-%Y")
my_date
[1] "1990-12-31"

How do I convert a string "31121990" to a Date?

my_string <- "31121990"
my_date <- as.Date(my_string, format="%d%m%Y")
my_date
[1] "1990-12-31"


NA, NaN, and NULL

NA

NA stands for not available. In general, this represents a missing value or a lack of data. Technically, NA is a logical value. You can test this with the following code.

class(NA)
[1] "logical"


NaN

NaN stands for not a number. This is a special value that is used to indicate that there is a result, it just cannot be represented as a number (for example the result of 0/0). Technically, NaN is a double value. You can test this with the following code.

typeof(NaN)
[1] "double"


NULL

If you have an understanding of NULL from other programming languages, you can carry it over to R. Otherwise, it is safe to think of NULL as something that is neither TRUE nor FALSE. Technically, NULL is its own thing. It is not a logical value, double value, etc. NULL is commonly used to represent an empty object or something that exists but isn’t really defined. When trying to distinguish between NA and NULL, think of NA as a missing value, and NULL as an undefined value.
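
One practical way to see the difference is to place each inside a vector: NA occupies a position, while NULL contributes nothing. A minimal sketch with made-up values (with_na and with_null are illustrative names):

```r
with_na   <- c(1, NA, 3)
with_null <- c(1, NULL, 3)

length(with_na)               # 3 (the missing value still takes a slot)
length(with_null)             # 2 (NULL is dropped)

# Missing values propagate through calculations unless removed.
mean(with_na)                 # NA
mean(with_na, na.rm = TRUE)   # 2

typeof(NA)                    # "logical"
typeof(NULL)                  # "NULL"
```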


Examples

How do I tell if a value is NA?

# test if a value is NA.
value <- NA
is.na(value)
[1] TRUE
# does is.nan return TRUE for NA?
is.nan(value)
[1] FALSE

How do I tell if a value is NaN?

# test if a value is NaN.
value <- NaN
is.nan(value)
[1] TRUE
value <- 0/0
is.nan(value)
[1] TRUE
# does is.na return TRUE for NaN?
is.na(value)
[1] TRUE

How do I tell if a value is NULL?

# test if a value is NULL.
value <- NULL
is.null(value)
[1] TRUE
class(value)
[1] "NULL"
# does is.na return TRUE for NULL?
is.na(value)
logical(0)  # no


Resources

A good writeup on the differences between NA and NULL.