Data.frames

Data.frames are one of the primary data structure used very frequently when working in R. Data.frames are tables of same-sized, named columns, where each column has a single type.

You can create a data.frame easily:

df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7), ok=c(T, T, F), other=c("first", "second", "third"))
head(df)
  cat_1 cat_2    ok  other
1     1     9  TRUE  first
2     2     8  TRUE second
3     3     7 FALSE  third

Regular indexing rules apply as well. This is how you index rows. Pay close attention to the trailing comma:

= data.frames


== Abstract

Data.frames are one of the most frequently used data structures in R. Data.frames organize data into a 2D table consisting of rows & columns, where each column represents a variable and each row contains one value for each column.


=== Bracket Subsetting/Indexing

Creating a data.frame is easily done by filling in the columns using vectors, which are declared using c() as follows.

==== Data Frame Creation

df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7),
                 ok=c(T, T, F), other=c("first", "second", "third"))

head(df)
   cat_1 cat_2    ok  other
 1     1     9  TRUE  first
 2     2     8  TRUE second
 3     3     7 FALSE  third

The parameter names in the data.frame() function become the columns of the data.frame, while the number of rows are determined by the size of the vectors.

The different columns of a data frame can contain different types of values, but the variables within the column must have the same type. In this case, cat_1 and cat_2 contain integers, ok contains booleans, and other contains Strings.


Regular indexing rules apply to R data frames. Pay close attention to the commas in the following examples:

==== Indexing Rows Numerically

# Numeric indexing on rows:
df[1:2,]
  cat_1 cat_2   ok  other
1     1     9 TRUE  first
2     2     8 TRUE second
df[c(1,3),]
  cat_1 cat_2    ok other
1     1     9  TRUE first
3     3     7 FALSE third
# Logical indexing on rows:
df[c(T,F,T),]
  cat_1 cat_2    ok other
1     1     9  TRUE first
3     3     7 FALSE third
# Named indexing on rows only works
# if there are named rows:
row.names(df) <- c("row1", "row2", "row3")
df[c("row1", "row3"),]
     cat_1 cat_2    ok other
row1     1     9  TRUE first
row3     3     7 FALSE third

By default, if you don’t include the comma in the square brackets, you are indexing the column:

df[c("cat_1", "ok")]
     cat_1    ok
row1     1  TRUE
row2     2  TRUE
row3     3 FALSE

To index columns, place expressions after the first comma:

# Numeric indexing on columns:
df[, 1]
[1] 1 2 3
df[, c(1,3)]
     cat_1    ok
row1     1  TRUE
row2     2  TRUE
row3     3 FALSE
# Logical indexing on columns:
df[, c(T, F, F, F)]
[1] 1 2 3
# Named indexing on columns.
# This is the more typical method of
# column indexing:
df$cat_1
[1] 1 2 3
# Another way to do named indexing on columns:
df[,c("cat_1", "ok")]
     cat_1    ok
row1     1  TRUE
row2     2  TRUE
row3     3 FALSE

Of course, you can index on columns and rows:

# Numeric indexing on columns and rows:
df[1:2, 1]
[1] 1 2
df[1:2, c(1,3)]
     cat_1   ok
row1     1 TRUE
row2     2 TRUE
# Logical indexing on columns and rows:
df[c(T,F,T), c(T, F, F, F)]
[1] 1 3
# Named indexing on columns and rows.
# This is the more typical method of
# column indexing:
df$cat_1[c(T,F,T)]
[1] 1 3
# Another way to do named indexing on columns and rows:
row.names(df) <- c("row1", "row2", "row3")
df[c("row1", "row3"),c("cat_1", "ok")]
     cat_1    ok
row1     1  TRUE
row3     3 FALSE

== Examples

=== How can I get the first 2 rows of a data.frame named df?

Click to see solution
df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7), ok=c(T, T, F), other=c("first", "second", "third"))
df[1:2,]
  cat_1 cat_2   ok  other
1     1     9 TRUE  first
2     2     8 TRUE second

=== How can I get the first 2 columns of a data.frame named df?

Click to see solution
df[,1:2]
  cat_1 cat_2
1     1     9
2     2     8
3     3     7

=== How can I get the rows where values in the column named cat_1 are greater than 2?

Click to see solution
df[df$cat_1 > 2,]
  cat_1 cat_2    ok other
3     3     7 FALSE third
df[df[, c("cat_1")] > 2,]
  cat_1 cat_2    ok other
3     3     7 FALSE third

=== How can I get the rows where values in the column named cat_1 are greater than 2 and the values in the column named cat_2 are less than 9?

Click to see solution
df[df$cat_1 > 2 & df$cat_2 < 9,]
  cat_1 cat_2    ok other
3     3     7 FALSE third

=== How can I get the rows where values in the column named cat_1 are greater than 2 or the values in the column named cat_2 are less than 9?

Click to see solution
df[df$cat_1 > 2 | df$cat_2 < 9,]
  cat_1 cat_2    ok  other
2     2     8  TRUE second
3     3     7 FALSE  third

=== How do I sample n rows randomly from a data.frame called df?

Click to see solution
df[sample(nrow(df), n),]

Alternatively you could use the sample_n function from the package dplyr:

sample_n(df, n)

=== How can I get only columns whose names start with "cat_"?

Click to see solution
df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7), ok=c(T, T, F), other=c("first", "second", "third"))
df[, grep("^cat_", names(df))]
  cat_1 cat_2
1     1     9
2     2     8
3     3     7
   cat_1 cat_2   ok  other
 1     1     9 TRUE  first
 2     2     8 TRUE second

This method uses the indices of the rows, which are independent of the row names. We can update the names of the rows and subsequently index those as well, if row names are appropriate for the situation.

Row Naming & Indexing on Row Names

row.names(df) <- c("row1", "row2", "row3")
df[c("row1", "row3"),]
      cat_1  cat_2    ok  other
 row1     1      9  TRUE  first
 row3     3      7 FALSE  third

Though the row names replace the numerical indices in the output, we can still index using either. This same logic applies to columns, which also have intrinsic indices and are required to be named in order to be created.

So far we’ve indexed in two ways, and their differences merit explanation:

  1. : selects indices based on the given sequence. In R, this process is inclusive, meaning that 1:4 will select the first, second, third, and fourth entries.

  2. c() defines a vector, as explained in the Lists & Vectors page, and indexing on vectors will select all rows/columns shared between the vector and data frame.


Indexing can also be done logically using a vector of Boolean values:

Logical Indexing

# selection is True for the first line,
# False for the second, and True for the third

df[c(T,F,T),]
   cat_1 cat_2    ok other
 1     1     9  TRUE first
 3     3     7 FALSE third


For all of the above examples, there was at least one comma — anything before the comma defines row selection, and anything after the comma defines column selection. If you leave out the comma, R will default to column selection.

Column-Default Indexing

df[c("cat_1", "ok")]
      cat_1    ok
 row1     1  TRUE
 row2     2  TRUE
 row3     3 FALSE

This is equivalent to leaving a blank space before the comma:

Indexing Column-Specific

df[, c(1,3)]
      cat_1    ok
 row1     1  TRUE
 row2     2  TRUE
 row3     3 FALSE


We can apply sequence-indexing and logical indexing to columns in the same way. You’ll find that indexing rows and indexing columns is a nearly identical process that is easy to get hold of. We can combine any of the previous methods to index rows and columns simultaneously.

Putting It All Together

df[1:2, c(1,3)]
      cat_1   ok
 row1     1 TRUE
 row2     2 TRUE
df[c(T,F,T), c(T, F, F, F)]
 [1] 1 3


$ Subsetting/Indexing

A key feature of R is the $ operator on data.frames, which is the more common indexing method for R.

$ Column Indexing

df$cat_1
 [1] 1 2 3

You can extend this to index for row as well using df$column_name[].

It’s good to keep in mind that $ lists column and then row, while just df[ , ] indexing requires row, then column.

Selecting Values from a Column

df$cat_1[c(F,T,F)]
[1] 2


Examples

How can I get the first 2 rows of a data.frame named df?

df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7),
                 ok=c(T, T, F), other=c("first", "second", "third"))
df[1:2,]
   cat_1 cat_2   ok  other
 1     1     9 TRUE  first
 2     2     8 TRUE second


How can I get the first 2 columns of a data.frame named df?

df[,1:2]
   cat_1 cat_2
 1     1     9
 2     2     8
 3     3     7


How can I get the rows where values in the column named cat_1 are greater than 2?

# first example, using $
df[df$cat_1 > 2,]
   cat_1 cat_2    ok other
 3     3     7 FALSE third
# second example, using []
df[df[, c("cat_1")] > 2,]
   cat_1 cat_2    ok other
 3     3     7 FALSE third


How can I get the rows where values in the column named cat_1 are greater than 2 and the values in the column named cat_2 are less than 9?

df[df$cat_1 > 2 & df$cat_2 < 9,]
   cat_1 cat_2    ok other
 3     3     7 FALSE third


How can I get the rows where values in the column named cat_1 are greater than 2 or the values in the column named cat`_2 are less than 9?

df[df$cat_1 > 2 | df$cat_2 < 9,]
   cat_1 cat_2    ok  other
 2     2     8  TRUE second
 3     3     7 FALSE  third


How do I sample n rows randomly from a data.frame called df?

df[sample(nrow(df), n),]

You could also use the sample_n function from the package dplyr:

sample_n(df, n)


How can I get only columns whose names start with "cat_"?

df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7),
                 ok=c(T, T, F), other=c("first", "second", "third"))
df[, grep("^cat_", names(df))]
   cat_1 cat_2
 1     1     9
 2     2     8
 3     3     7