subset

subset is a function that helps you take subsets of data. By default, subset removes NA rows.

subset does not perform any operation that can’t be accomplished by indexing.

With indexing, we would write something like:

grades[grades$year=="junior" | grades$sex=="M",]$grade
[1] 100  75  74  69  88  99  90  92

Using subset, we can instead do this:

subset(grades, year=="junior" | sex=="M", select=grade)
   grade
1    100
3     75
4     74
6     69
7     88
8     99
9     90
10    92

But be careful, if we replace a grade with an NA, it will be removed by subset:

grades$sex[8] <- NA
subset(grades, year=="junior" | sex=="M", select=grade)
   grade
1    100
3     75
4     74
6     69
7     88
9     90
10    92

Whereas indexing will not unless you specify to:

grades[grades$year=="junior" | grades$sex=="M",]$grade
[1] 100  75  74  69  88  NA  90  92

Examples

How can I make a subset of the 8451 data using 1 line of code?

Click to see solution

In the 84.51 data set:

myDF <- read.csv("/depot/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")

We recall that these are the variables:

head(myDF)
BASKET_NUM	HSHD_NUM	PURCHASE_	PRODUCT_NUM	SPEND	UNITS	STORE_R	WEEK_NUM	YEAR
      <dbl>    <dbl>	     <chr>	      <dbl>	<dbl>	<int>	   <chr>	   <int>	<int>
         24	    1809	03-JAN-16	    5817389	-1.50	-1	      SOUTH	      1	2016
         24	    1809	03-JAN-16	    5829886	-1.50	-1	      SOUTH	      1	2016
         34	    1253	03-JAN-16	     539501	 2.19	 1	      EAST	      1	2016
         60	    1595	03-JAN-16	    5260099	 0.99	 1	      WEST	      1	2016
         60	    1595	03-JAN-16	    4535660	 2.50	 2	      WEST	      1	2016
         168	 3393	03-JAN-16	    5602916	 4.50	 1	      SOUTH	      1	2016
dim(myDF)
10625553
9

There are 10625553 rows and 9 columns.

We can use the subset function to focus on only the purchases from the CENTRAL store region, in the YEAR 2016. We can also pick which variables (columns) that we want to include in the new data frame.

We do not need to specify myDF on each variable, because the subset function will keep track of this for us. The subset function knows which data set that we are working with, because we specify it as the first parameter in the subset function. The structure of the subset function is as follows: subset(x, subset, select, drop=FALSE, …​). The subset parameter of the subset function describes the rows that we are interested in. (In particular, we specify the conditions that we want the rows to satisfy.) The select parameter of the subset function describes the columns that we are interested in. (We list the columns by their names, and we need to put each such column name in double quotes.)
myfocusedDF <- subset(myDF, subset=(STORE_R=="CENTRAL") & (YEAR==2016),

select=c("PURCHASE_","PRODUCT_NUM","SPEND","UNITS") )

INCLUDE OUTPUT HERE

This new data set has only 1246144 rows, i.e., about 12 percent of the purchases, as expected. It also has only the 4 columns that we specified in the subset function.

dim(myfocusedDF)
1246144
4

How can I make a subset of the election data, using 1 line of code?

Click to see solution

Here is an example of how to use the subset function with the data from the federal election campaign contributions from 2016:

library(data.table)
myDF <- fread("/depot/datamine/data/election/itcont2016.txt", sep="|")
dim(myDF)
20557796
21
mymidwestDF <- subset(myDF, subset=(STATE %in% c("IN","IL","OH","MI","WI")) & (TRANSACTION_AMT > 0),

select=c("NAME","CITY","STATE","TRANSACTION_AMT") )

We can use the subset command to focus on the donations made from Midwest states, and limit our results to those donations that had positive TRANSACTION_AMT values. We can extract interesting variables, e.g., the NAME, CITY, STATE, and TRANSACTION_AMT.

dim(mymidwestDF)
2435825
4

The resulting data frame has 2435825 rows.

tail(sort(tapply(mymidwestDF$TRANSACTION_AMT, mymidwestDF$NAME, sum)))
UIHLEIN, RICHARD E.                          JOBSOHIO ASSOCIATION OF REALTORS, NATIONAL
            9216700                          10000000                          10234334
PRITZKER, JAY ROBERT                  UIHLEIN, RICHARD                    EYCHANER, FRED
            10511348                          11866100                          37901658

From the data set, we can sum the TRANSACTION_AMT values, grouped according to the NAME of the donor, and we find that EYCHANER, FRED was the top donor living in the midwest, during the 2016 federal election campaigns.