data.frames

Basics

Data.frames are one of the most frequently used data structures in R. A data.frame organizes data into a two-dimensional table of rows and columns, where each column represents a variable and each row holds one value for each column.


Bracket Subsetting/Indexing

Creating a data.frame is easily done by filling in the columns using vectors, which are declared using c() as follows.

Data Frame Creation

df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7),
                 ok=c(T, T, F), other=c("first", "second", "third"))

head(df)
   cat_1 cat_2    ok  other
 1     1     9  TRUE  first
 2     2     8  TRUE second
 3     3     7 FALSE  third

The argument names passed to data.frame() become the column names of the data.frame, while the number of rows is determined by the length of the vectors.
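As a quick check of how names and vector lengths shape the result, the sketch below rebuilds the example data.frame and inspects it with names(), nrow() and ncol(). The final call is not part of the original example: it passes vectors of lengths 3 and 2 on purpose, which data.frame() rejects because the shorter vector cannot be recycled evenly.

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"))

col_names <- names(df)   # "cat_1" "cat_2" "ok" "other"
n_rows <- nrow(df)       # 3, the length of each vector
n_cols <- ncol(df)       # 4, one per named argument
print(col_names)
print(c(rows = n_rows, cols = n_cols))

# Illustration only: lengths 3 and 2 do not fit together, so data.frame()
# raises an error, which tryCatch() turns into the string "error".
mismatch <- tryCatch(
  data.frame(a = c(1, 2, 3), b = c(1, 2)),
  error = function(e) "error"
)
print(mismatch)
```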

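The different columns of a data frame can contain different types of values, but the values within a single column must all have the same type. In this case, cat_1 and cat_2 contain numeric values (doubles, since c(1,2,3) without the L suffix does not create integers), ok contains logical values, and other contains character strings (in R versions before 4.0, data.frame() converts strings to factors by default).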
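To see which type each column actually holds, one option is to ask R directly. The sketch below applies class() to every column; stringsAsFactors = FALSE is set explicitly here (an addition to the original call) so that the result does not depend on the R version in use.

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"),
                 stringsAsFactors = FALSE)

# One class per column: "numeric", "numeric", "logical", "character"
col_classes <- sapply(df, class)
print(col_classes)

# c(1, 2, 3) is stored as double; the L suffix is what creates integers.
print(is.integer(df$cat_1))        # FALSE
print(is.integer(c(1L, 2L, 3L)))   # TRUE
```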


Indexing Rows Numerically

Regular indexing rules apply to R data frames. Pay close attention to the commas in the following examples:

df[1:2, ]
   cat_1 cat_2   ok  other
 1     1     9 TRUE  first
 2     2     8 TRUE second

This method uses the indices of the rows, which are independent of the row names. We can update the names of the rows and subsequently index those as well, if row names are appropriate for the situation.

Row Naming & Indexing on Row Names

row.names(df) <- c("row1", "row2", "row3")
df[c("row1", "row3"), ]
      cat_1  cat_2    ok  other
 row1     1      9  TRUE  first
 row3     3      7 FALSE  third

Though the row names replace the numerical indices in the output, we can still index using either. The same logic applies to columns, which can be selected by position or by name; every column has a name, because data.frame() generates one for any column you leave unnamed.
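A small sketch of that equivalence, using the same example data: selecting rows 1 and 3 by position and by row name should give the same result, and a column can likewise be reached by its position or its name. The stopifnot() calls are only there to make the claim checkable.

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"))
row.names(df) <- c("row1", "row2", "row3")

# Rows: position and name select the same rows.
by_position <- df[c(1, 3), ]
by_name     <- df[c("row1", "row3"), ]
stopifnot(isTRUE(all.equal(by_position, by_name)))

# Columns: position 3 and the name "ok" refer to the same column.
col_by_position <- df[, 3]
col_by_name     <- df[, "ok"]
stopifnot(identical(col_by_position, col_by_name))
```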

So far we’ve indexed in two ways, and their differences merit explanation:

  1. The : operator builds a sequence of consecutive indices. In R, the sequence includes both endpoints, meaning that 1:4 will select the first, second, third, and fourth entries.

  2. c() defines a vector, as explained in the Lists & Vectors page, and indexing with a vector selects exactly the rows/columns whose positions or names it lists, so the entries do not need to be consecutive.
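The difference between the two can be sketched on the example data.frame. The variable names below are illustrative only.

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"))

# 1:4 includes both endpoints.
print(1:4)
stopifnot(all(1:4 == c(1, 2, 3, 4)))

consecutive <- df[2:3, ]       # rows 2 and 3
picked      <- df[c(1, 3), ]   # rows 1 and 3, skipping row 2
reordered   <- df[c(3, 1), ]   # c() also controls the order of the result

print(consecutive$cat_1)   # 2 3
print(picked$cat_1)        # 1 3
print(reordered$cat_1)     # 3 1
```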


Logical Indexing

Indexing can also be done logically using a vector of logical (TRUE/FALSE) values, one per row:

# selection is TRUE for the first row,
# FALSE for the second, and TRUE for the third

df[c(T,F,T),]
      cat_1 cat_2    ok other
 row1     1     9  TRUE first
 row3     3     7 FALSE third
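The logical vector does not have to be typed by hand. As a small sketch, the ok column of the example data.frame is itself a logical vector with one entry per row, so it can be used directly as a row selector, and ! negates it.

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"))

keep <- df$ok            # TRUE TRUE FALSE
kept    <- df[keep, ]    # rows where ok is TRUE
dropped <- df[!keep, ]   # rows where ok is FALSE

print(kept)
print(dropped)
```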


Every example above contains a comma: anything before the comma defines row selection, and anything after the comma defines column selection. If you leave out the comma, R will default to column selection.

Column-Default Indexing

df[c("cat_1", "ok")]
      cat_1    ok
 row1     1  TRUE
 row2     2  TRUE
 row3     3 FALSE

This is equivalent to leaving the row position before the comma empty, shown here selecting the same columns by position:

Column-Specific Indexing

df[, c(1,3)]
      cat_1    ok
 row1     1  TRUE
 row2     2  TRUE
 row3     3 FALSE
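To make the role of the comma concrete, the sketch below compares the three spellings on the example data and then uses a single number with and without a comma. The variable names are illustrative.

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"))

no_comma   <- df[c("cat_1", "ok")]     # column selection by default
with_comma <- df[, c("cat_1", "ok")]   # empty row slot, columns by name
by_number  <- df[, c(1, 3)]            # empty row slot, columns by position
stopifnot(isTRUE(all.equal(no_comma, with_comma)))
stopifnot(isTRUE(all.equal(no_comma, by_number)))

# A single number shows the difference the comma makes.
second_column <- df[2]     # no comma: column 2 (cat_2), still a data.frame
second_row    <- df[2, ]   # with comma: row 2, all columns
print(dim(second_column))  # 3 1
print(dim(second_row))     # 1 4
```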


We can apply sequence indexing and logical indexing to columns in the same way. Indexing rows and indexing columns are nearly identical processes that are easy to get the hang of. We can combine any of the previous methods to index rows and columns simultaneously.

Putting It All Together

df[1:2, c(1,3)]
      cat_1   ok
 row1     1 TRUE
 row2     2 TRUE
# selecting a single column returns a plain vector, not a data.frame
df[c(T,F,T), c(T, F, F, F)]
 [1] 1 3
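The last result above is a plain vector because only one column was selected. If a one-column data.frame is wanted instead, the drop argument of the bracket operator can be set to FALSE, as in this minimal sketch:

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"))

as_vector <- df[c(TRUE, FALSE, TRUE), 1]                 # dropped to a vector
as_frame  <- df[c(TRUE, FALSE, TRUE), 1, drop = FALSE]   # stays a data.frame

print(is.data.frame(as_vector))   # FALSE
print(is.data.frame(as_frame))    # TRUE
print(as_frame)
```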


$ Subsetting/Indexing

A key feature of R is the $ operator, which extracts a single column from a data.frame by name. It is the more common indexing method in R when only one column is needed.

$ Column Indexing

df$cat_1
 [1] 1 2 3

You can extend this to index rows as well using df$column_name[].

Keep in mind that with $ the column comes first and the row index second, while df[ , ] indexing takes the row first, then the column.

Selecting Values from a Column

df$cat_1[c(F,T,F)]
[1] 2
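As a rough comparison of the two orderings, the sketch below fetches the same value three ways. The [[ ]] form is not covered in the text above; it is included only as another base-R way to extract one column.

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"))

via_dollar   <- df$cat_1[2]        # column first, then position within it
via_brackets <- df[2, "cat_1"]     # row first, then column
via_double   <- df[["cat_1"]][2]   # [[ ]] extracts the column, then index it

print(c(via_dollar, via_brackets, via_double))   # 2 2 2
```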


Examples

How can I get the first 2 rows of a data.frame named df?

df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7),
                 ok=c(T, T, F), other=c("first", "second", "third"))
df[1:2,]
   cat_1 cat_2   ok  other
 1     1     9 TRUE  first
 2     2     8 TRUE second


How can I get the first 2 columns of a data.frame named df?

df[,1:2]
   cat_1 cat_2
 1     1     9
 2     2     8
 3     3     7


How can I get the rows where values in the column named cat_1 are greater than 2?

# first example, using $
df[df$cat_1 > 2,]
   cat_1 cat_2    ok other
 3     3     7 FALSE third
# second example, using []
df[df[, c("cat_1")] > 2,]
   cat_1 cat_2    ok other
 3     3     7 FALSE third


How can I get the rows where values in the column named cat_1 are greater than 2 and the values in the column named cat_2 are less than 9?

df[df$cat_1 > 2 & df$cat_2 < 9,]
   cat_1 cat_2    ok other
 3     3     7 FALSE third


How can I get the rows where values in the column named cat_1 are greater than 2 or the values in the column named cat_2 are less than 9?

df[df$cat_1 > 2 | df$cat_2 < 9,]
   cat_1 cat_2    ok  other
 2     2     8  TRUE second
 3     3     7 FALSE  third
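It can help to look at the intermediate logical vectors that & and | combine element by element. A small sketch, with illustrative variable names:

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"))

gt2 <- df$cat_1 > 2    # FALSE FALSE  TRUE
lt9 <- df$cat_2 < 9    # FALSE  TRUE  TRUE

both   <- gt2 & lt9    # FALSE FALSE  TRUE  (row 3 only)
either <- gt2 | lt9    # FALSE  TRUE  TRUE  (rows 2 and 3)

# One row per condition, one column per data.frame row.
print(rbind(gt2, lt9, both, either))
print(df[either, ])
```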


How do I sample n rows randomly from a data.frame called df?

df[sample(nrow(df), n),]

You could also use the sample_n function from the package dplyr (superseded by slice_sample() in dplyr 1.0.0 and later, though it still works):

library(dplyr)
sample_n(df, n)
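For the base-R approach, a runnable sketch needs a concrete n and, if the draw should be reproducible, a seed. Both values below are arbitrary choices made for illustration.

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"))

set.seed(1)   # arbitrary seed, only so repeated runs pick the same rows
n <- 2        # must not exceed nrow(df) when sampling without replacement

# sample(nrow(df), n) draws n distinct row positions from 1..nrow(df).
sampled <- df[sample(nrow(df), n), ]
print(sampled)
```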


How can I get only columns whose names start with "cat_"?

df[, grep("^cat_", names(df))]
   cat_1 cat_2
 1     1     9
 2     2     8
 3     3     7
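An alternative sketch for the same task uses startsWith(), a base-R function that returns a logical vector and needs no regular expression. drop = FALSE is added here as a precaution so the result stays a data.frame even if only one column happens to match.

```r
df <- data.frame(cat_1 = c(1, 2, 3), cat_2 = c(9, 8, 7),
                 ok = c(TRUE, TRUE, FALSE),
                 other = c("first", "second", "third"))

is_cat <- startsWith(names(df), "cat_")   # TRUE TRUE FALSE FALSE
cat_cols <- df[, is_cat, drop = FALSE]

print(names(cat_cols))   # "cat_1" "cat_2"
```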