Data.frames
Data.frames are one of the primary data structure used very frequently when working in R. Data.frames are tables of same-sized, named columns, where each column has a single type.
You can create a data.frame easily:
df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7), ok=c(T, T, F), other=c("first", "second", "third"))
head(df)
cat_1 cat_2 ok other
1 1 9 TRUE first
2 2 8 TRUE second
3 3 7 FALSE third
Regular indexing rules apply as well. This is how you index rows. Pay close attention to the trailing comma:
= data.frames
== Abstract
Data.frames are one of the most frequently used data structures in R. Data.frames organize data into a 2D table consisting of rows & columns, where each column represents a variable and each row contains one value for each column.
=== Bracket Subsetting/Indexing
Creating a data.frame is easily done by filling in the columns using vectors, which are declared using c()
as follows.
==== Data Frame Creation
df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7),
ok=c(T, T, F), other=c("first", "second", "third"))
head(df)
cat_1 cat_2 ok other 1 1 9 TRUE first 2 2 8 TRUE second 3 3 7 FALSE third
The parameter names in the data.frame()
function become the columns of the data.frame, while the number of rows are determined by the size of the vectors.
The different columns of a data frame can contain different types of values, but the variables within the column must have the same type. In this case, cat_1
and cat_2
contain integers, ok
contains booleans, and other
contains Strings.
Regular indexing rules apply to R data frames. Pay close attention to the commas in the following examples:
==== Indexing Rows Numerically
# Numeric indexing on rows:
df[1:2,]
cat_1 cat_2 ok other
1 1 9 TRUE first
2 2 8 TRUE second
df[c(1,3),]
cat_1 cat_2 ok other
1 1 9 TRUE first
3 3 7 FALSE third
# Logical indexing on rows:
df[c(T,F,T),]
cat_1 cat_2 ok other
1 1 9 TRUE first
3 3 7 FALSE third
# Named indexing on rows only works
# if there are named rows:
row.names(df) <- c("row1", "row2", "row3")
df[c("row1", "row3"),]
cat_1 cat_2 ok other
row1 1 9 TRUE first
row3 3 7 FALSE third
By default, if you don’t include the comma in the square brackets, you are indexing the column:
df[c("cat_1", "ok")]
cat_1 ok
row1 1 TRUE
row2 2 TRUE
row3 3 FALSE
To index columns, place expressions after the first comma:
# Numeric indexing on columns:
df[, 1]
[1] 1 2 3
df[, c(1,3)]
cat_1 ok
row1 1 TRUE
row2 2 TRUE
row3 3 FALSE
# Logical indexing on columns:
df[, c(T, F, F, F)]
[1] 1 2 3
# Named indexing on columns.
# This is the more typical method of
# column indexing:
df$cat_1
[1] 1 2 3
# Another way to do named indexing on columns:
df[,c("cat_1", "ok")]
cat_1 ok
row1 1 TRUE
row2 2 TRUE
row3 3 FALSE
Of course, you can index on columns and rows:
# Numeric indexing on columns and rows:
df[1:2, 1]
[1] 1 2
df[1:2, c(1,3)]
cat_1 ok
row1 1 TRUE
row2 2 TRUE
# Logical indexing on columns and rows:
df[c(T,F,T), c(T, F, F, F)]
[1] 1 3
# Named indexing on columns and rows.
# This is the more typical method of
# column indexing:
df$cat_1[c(T,F,T)]
[1] 1 3
# Another way to do named indexing on columns and rows:
row.names(df) <- c("row1", "row2", "row3")
df[c("row1", "row3"),c("cat_1", "ok")]
cat_1 ok
row1 1 TRUE
row3 3 FALSE
== Examples
=== How can I get the first 2 rows of a data.frame named df
?
Click to see solution
df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7), ok=c(T, T, F), other=c("first", "second", "third"))
df[1:2,]
cat_1 cat_2 ok other
1 1 9 TRUE first
2 2 8 TRUE second
=== How can I get the first 2 columns of a data.frame named df
?
Click to see solution
df[,1:2]
cat_1 cat_2
1 1 9
2 2 8
3 3 7
=== How can I get the rows where values in the column named cat_1
are greater than 2?
Click to see solution
df[df$cat_1 > 2,]
cat_1 cat_2 ok other
3 3 7 FALSE third
df[df[, c("cat_1")] > 2,]
cat_1 cat_2 ok other
3 3 7 FALSE third
=== How can I get the rows where values in the column named cat_1
are greater than 2 and the values in the column named cat_2
are less than 9?
Click to see solution
df[df$cat_1 > 2 & df$cat_2 < 9,]
cat_1 cat_2 ok other
3 3 7 FALSE third
=== How can I get the rows where values in the column named cat_1
are greater than 2 or the values in the column named cat_2
are less than 9?
Click to see solution
df[df$cat_1 > 2 | df$cat_2 < 9,]
cat_1 cat_2 ok other
2 2 8 TRUE second
3 3 7 FALSE third
=== How do I sample n rows randomly from a data.frame called df
?
Click to see solution
df[sample(nrow(df), n),]
Alternatively you could use the sample_n
function from the package dplyr
:
sample_n(df, n)
=== How can I get only columns whose names start with "cat_"?
Click to see solution
df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7), ok=c(T, T, F), other=c("first", "second", "third"))
df[, grep("^cat_", names(df))]
cat_1 cat_2
1 1 9
2 2 8
3 3 7
cat_1 cat_2 ok other 1 1 9 TRUE first 2 2 8 TRUE second
This method uses the indices of the rows, which are independent of the row names. We can update the names of the rows and subsequently index those as well, if row names are appropriate for the situation.
Row Naming & Indexing on Row Names
row.names(df) <- c("row1", "row2", "row3")
df[c("row1", "row3"),]
cat_1 cat_2 ok other row1 1 9 TRUE first row3 3 7 FALSE third
Though the row names replace the numerical indices in the output, we can still index using either. This same logic applies to columns, which also have intrinsic indices and are required to be named in order to be created. |
So far we’ve indexed in two ways, and their differences merit explanation:
-
:
selects indices based on the given sequence. In R, this process is inclusive, meaning that1:4
will select the first, second, third, and fourth entries. -
c()
defines a vector, as explained in the Lists & Vectors page, and indexing on vectors will select all rows/columns shared between the vector and data frame.
Indexing can also be done logically using a vector of Boolean values:
Logical Indexing
# selection is True for the first line,
# False for the second, and True for the third
df[c(T,F,T),]
cat_1 cat_2 ok other 1 1 9 TRUE first 3 3 7 FALSE third
For all of the above examples, there was at least one comma — anything before the comma defines row selection, and anything after the comma defines column selection. If you leave out the comma, R will default to column selection.
Column-Default Indexing
df[c("cat_1", "ok")]
cat_1 ok row1 1 TRUE row2 2 TRUE row3 3 FALSE
This is equivalent to leaving a blank space before the comma:
Indexing Column-Specific
df[, c(1,3)]
cat_1 ok row1 1 TRUE row2 2 TRUE row3 3 FALSE
We can apply sequence-indexing and logical indexing to columns in the same way. You’ll find that indexing rows and indexing columns is a nearly identical process that is easy to get hold of. We can combine any of the previous methods to index rows and columns simultaneously.
Putting It All Together
df[1:2, c(1,3)]
cat_1 ok row1 1 TRUE row2 2 TRUE
df[c(T,F,T), c(T, F, F, F)]
[1] 1 3
$ Subsetting/Indexing
A key feature of R is the $
operator on data.frames, which is the more common indexing method for R.
Examples
How can I get the first 2 rows of a data.frame named df
?
df <- data.frame(cat_1=c(1,2,3), cat_2=c(9,8,7),
ok=c(T, T, F), other=c("first", "second", "third"))
df[1:2,]
cat_1 cat_2 ok other 1 1 9 TRUE first 2 2 8 TRUE second
How can I get the rows where values in the column named cat_1
are greater than 2?
# first example, using $
df[df$cat_1 > 2,]
cat_1 cat_2 ok other 3 3 7 FALSE third
# second example, using []
df[df[, c("cat_1")] > 2,]
cat_1 cat_2 ok other 3 3 7 FALSE third
How can I get the rows where values in the column named cat_1
are greater than 2 and the values in the column named cat_2
are less than 9?
df[df$cat_1 > 2 & df$cat_2 < 9,]
cat_1 cat_2 ok other 3 3 7 FALSE third
How can I get the rows where values in the column named cat_1
are greater than 2 or the values in the column named cat`_2 are less than 9?
df[df$cat_1 > 2 | df$cat_2 < 9,]
cat_1 cat_2 ok other 2 2 8 TRUE second 3 3 7 FALSE third