cut

Basics

cut divides the values of a vector into intervals and returns a factor whose levels are those intervals. The interval boundaries are set by the breaks argument. cut is particularly useful for grouping Date data into quarters (Q1, Q2), years (1999, 2000, 2001), and so on.
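As a minimal sketch of this behavior, the snippet below cuts a handful of dates by year. The vector d and its values are invented for illustration and do not come from any data set used on this page.

```r
# Hypothetical dates, invented for illustration only
d <- as.Date(c("1999-02-14", "1999-11-30", "2000-05-01", "2001-08-20"))

# Each value is assigned to the year-long interval that contains it;
# every level is labeled with the first day of its interval
by_year <- cut(d, breaks = "year")

class(by_year)    # "factor"
levels(by_year)   # "1999-01-01" "2000-01-01" "2001-01-01"
table(by_year)    # counts per year: 2, 1, 1
```

Note that the levels are labeled by the start date of each interval rather than by a bare year such as 1999.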

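How useful cut is depends on the interval specifications that breaks accepts. For dates and date-times, breaks can be a vector of cut points, a number of intervals, or a string such as "day", "week", "month", "quarter", or "year". You can see the full list of options by running ?cut.POSIXt.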
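To give a feel for the string form of breaks, here is a small sketch using made-up dates (the vector d is hypothetical). It also shows that a unit can be preceded by an integer and a space to widen each interval.

```r
# Hypothetical dates, invented for illustration only
d <- as.Date(c("2021-01-15", "2021-02-10", "2021-03-05", "2021-04-20"))

# One interval per calendar month
by_month <- cut(d, breaks = "month")
levels(by_month)       # "2021-01-01" "2021-02-01" "2021-03-01" "2021-04-01"

# An integer and a space before the unit widens each interval
by_two_months <- cut(d, breaks = "2 months")
levels(by_two_months)  # "2021-01-01" "2021-03-01"
```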


Examples

How can I create a new column in a data.frame df that is a factor based on the year?

df$year <- cut(df$times, breaks="year")
str(df)
'data.frame':    24 obs. of  3 variables:
 $ times: POSIXct, format: "2020-06-01 06:00:00" "2020-07-01 06:00:00" ...
 $ value: int  48 62 55 4 83 77 5 53 68 46 ...
 $ year : Factor w/ 3 levels "2020-01-01","2021-01-01",..: 1 1 1 1 1 1 1 2 2 2 ...
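The output above does not show how df was built, so the sketch below uses a hypothetical stand-in: 24 monthly timestamps starting in June 2020, with random numbers in the value column. It also shows one possible way to get levels that read as plain years, by formatting the timestamps instead of using cut. The column name year_label is made up for this example.

```r
# Hypothetical stand-in for df: 24 monthly timestamps with random values
set.seed(1)
df <- data.frame(
  times = seq(as.POSIXct("2020-06-01 06:00:00", tz = "UTC"),
              by = "month", length.out = 24),
  value = sample(1:100, 24, replace = TRUE)
)

# Same call as in the solution above
df$year <- cut(df$times, breaks = "year")
nlevels(df$year)         # 3

# Alternative: build the factor from the four-digit year directly
df$year_label <- factor(format(df$times, "%Y"))
levels(df$year_label)    # "2020" "2021" "2022"
table(df$year_label)     # 7, 12, and 5 observations
```

Both columns group the rows identically; only the level labels differ.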

How can I create a new column in a data.frame df that is a factor based on the quarter?

df$quarter <- cut(df$times, breaks="quarter")
str(df)
'data.frame':    24 obs. of  4 variables:
 $ times  : POSIXct, format: "2020-06-01 06:00:00" "2020-07-01 06:00:00" ...
 $ value  : int  48 62 55 4 83 77 5 53 68 46 ...
 $ year   : Factor w/ 3 levels "2020-01-01","2021-01-01",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ quarter: Factor w/ 9 levels "2020-04-01","2020-07-01",..: 1 2 2 2 3 3 3 4 4 4 ...
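The quarter levels are labeled with the first day of each quarter. If labels such as "2020 Q2" are preferred, one option is base R's quarters function, sketched below. The times vector is a hypothetical stand-in for df$times, and the label format is just one possible choice.

```r
# Hypothetical stand-in for df$times: 24 monthly timestamps
times <- seq(as.POSIXct("2020-06-01 06:00:00", tz = "UTC"),
             by = "month", length.out = 24)

# Same call as in the solution above
q_cut <- cut(times, breaks = "quarter")
nlevels(q_cut)        # 9

# quarters() returns "Q1" through "Q4"; pasting on the year keeps
# the same quarter in different years apart
q_lab <- factor(paste(format(times, "%Y"), quarters(times)))
levels(q_lab)[1:3]    # "2020 Q2" "2020 Q3" "2020 Q4"
```

Because the labels start with the year, sorting them alphabetically also sorts them chronologically.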

Video Example: fars 6-hour intervals


Let’s load the 7581 data set and look at the HOUR column:

myDF <- read.csv("/depot/datamine/data/fars/7581.csv")
table(myDF$HOUR)
    0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
17704 18671 17262  9908  6438  5463  6749  7088  6308  6275  7311  8401  8929  9872 12066 14138

We can break these values into 6-hour intervals using cut:

table( cut(myDF$HOUR, breaks=c(0,6,12,18,24,99), include.lowest=TRUE) )
  [0,6]  (6,12] (12,18] (18,24] (24,99]
  82195   44312   85388   86567    1597

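This effectively gives us 5 categories: roughly midnight to 6 A.M., 6 A.M. to noon, noon to 6 P.M., 6 P.M. to midnight, and unknown (99 indicates the hour of day was not included in the entry). Because HOUR holds whole numbers and each interval is closed on the right, the bins contain the HOUR values 0 through 6, 7 through 12, 13 through 18, 19 through 24, and 25 through 99. The first bin includes 0 only because include.lowest=TRUE.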
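With whole-number hours it can help to check which values land in each interval. The sketch below uses a hypothetical vector of hour codes (not the FARS data) to compare the default right-closed intervals with left-closed ones, obtained by setting right = FALSE.

```r
# Hypothetical HOUR values: every hour code from 0 to 24, plus 99
hours <- c(0:24, 99)
b <- c(0, 6, 12, 18, 24, 99)

# Default (right = TRUE): intervals are closed on the right
bins_right <- cut(hours, breaks = b, include.lowest = TRUE)
table(bins_right)    # 7, 6, 6, 6, 1 values per bin

# right = FALSE: intervals are closed on the left instead
bins_left <- cut(hours, breaks = b, right = FALSE, include.lowest = TRUE)
levels(bins_left)    # "[0,6)" "[6,12)" "[12,18)" "[18,24)" "[24,99]"
table(bins_left)     # 6, 6, 6, 6, 2 values per bin
```

With right = FALSE, hour 6 moves from the first bin to the second, so the first four bins each hold exactly six hour values.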

With the help of tapply, we can find the total number of people (the PERSONS column) involved in accidents during each 6-hour interval:

tapply( myDF$PERSONS, cut(myDF$HOUR, breaks=c(0,6,12,18,24,99), include.lowest=TRUE), sum )
  [0,6]  (6,12] (12,18] (18,24] (24,99]
 187397  119261  238193  230289    2269
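To see the mechanics on a scale that can be checked by hand, here is a minimal sketch with a made-up data frame. The toy data and the bin names passed to labels are assumptions chosen for illustration, not part of the FARS data.

```r
# Hypothetical miniature of the HOUR and PERSONS columns
toy <- data.frame(
  HOUR    = c(2, 5, 9, 13, 17, 22, 99),
  PERSONS = c(1, 3, 2,  4,  1,  2,  1)
)

# The labels argument replaces the interval notation with readable names
period <- cut(toy$HOUR, breaks = c(0, 6, 12, 18, 24, 99),
              include.lowest = TRUE,
              labels = c("night", "morning", "afternoon", "evening", "unknown"))

# Sum PERSONS within each level of the factor
totals <- tapply(toy$PERSONS, period, sum)
totals    # night 4, morning 2, afternoon 5, evening 2, unknown 1
```

tapply splits the first argument by the factor in the second and applies sum to each group, which is the same pattern used on the full data set above.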