split
split is a function with arguments x, a vector or data.frame, and f, a factor vector that divides the data into smaller groups. It returns a list with one element per group. A useful optional argument is drop, which indicates whether factor levels that do not occur in the data should be dropped from the result. The default is drop = FALSE.
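As a minimal sketch of these arguments, the following uses small made-up data (the names scores, grp, and df are hypothetical and not part of the dataset used in the examples) to show how split groups values and how drop treats a factor level that never occurs.

```r
# Hypothetical toy data: five scores and the group each one belongs to.
# The level "c" is declared but never used.
scores <- c(10, 20, 30, 40, 50)
grp <- factor(c("a", "b", "a", "b", "a"), levels = c("a", "b", "c"))

# Default (drop = FALSE): the unused level "c" is kept as an empty group.
by_grp <- split(scores, grp)
names(by_grp)

# drop = TRUE: levels that do not occur in grp are left out of the result.
by_grp_drop <- split(scores, grp, drop = TRUE)
names(by_grp_drop)

# split also works on a data.frame, returning a list of smaller data.frames.
df <- data.frame(score = scores, grp = grp)
df_parts <- split(df, df$grp, drop = TRUE)
sapply(df_parts, nrow)
```

Each element of the returned list can then be processed on its own, for example with lapply or sapply.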
Examples
Read the first 10 rows of /anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv. Using the strsplit function, count how many times each individual genre occurs.
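library(data.table)  # provides fread
myDF <- fread("/anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv", nrows = 10)  # read only the first 10 rows
myDF$genres  # comma-separated genre strings, one per title
# Split on commas (giving a list), flatten to one vector, then count each genre
strsplit(myDF$genres, ',')
unlist(strsplit(myDF$genres, ','))
table(unlist(strsplit(myDF$genres, ',')))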
'Documentary,Short' 'Animation,Short' 'Animation,Comedy,Romance' 'Animation,Short' 'Comedy,Short' 'Short' 'Short,Sport' 'Documentary,Short' 'Romance' 'Documentary,Short'

'Documentary' 'Short'
'Animation' 'Short'
'Animation' 'Comedy' 'Romance'
'Animation' 'Short'
'Comedy' 'Short'
'Short'
'Short' 'Sport'
'Documentary' 'Short'
'Romance'
'Documentary' 'Short'

'Documentary' 'Short' 'Animation' 'Short' 'Animation' 'Comedy' 'Romance' 'Animation' 'Short' 'Comedy' 'Short' 'Short' 'Short' 'Sport' 'Documentary' 'Short' 'Romance' 'Documentary' 'Short'

  Animation      Comedy Documentary     Romance       Short       Sport
          3           2           3           2           8           1
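For readers without access to the data file, the same pipeline can be tried on the ten genre strings shown in the output above. This is only a sketch: the strings are typed in by hand, and the variable names genre_strings, parts, and counts are illustrative.

```r
# The ten genre strings from the output above, entered by hand so the
# sketch runs without the data file.
genre_strings <- c("Documentary,Short", "Animation,Short",
                   "Animation,Comedy,Romance", "Animation,Short",
                   "Comedy,Short", "Short", "Short,Sport",
                   "Documentary,Short", "Romance", "Documentary,Short")

# strsplit returns a list with one character vector per title.
parts <- strsplit(genre_strings, ",", fixed = TRUE)

# unlist flattens the list, table counts each genre,
# and sort puts the most common genre first.
counts <- sort(table(unlist(parts)), decreasing = TRUE)
counts

# lengths() reports how many genres are attached to each title.
lengths(parts)
```

Note that strsplit breaks strings apart on a delimiter, whereas split divides a vector or data.frame into groups defined by a factor.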
Using movies_and_tv/imdb2024/basics.tsv, for each of the genres, list how many times it occurs.
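library(data.table)  # provides fread
genres <- fread("/anvil/projects/tdm/data/movies_and_tv/imdb2024/basics.tsv", select = "genres", col.names = "genres")  # read only the genres column
# Count each genre across all titles, most common first
sort(table(unlist(strsplit(genres$genres, ","))), decreasing = TRUE)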
      Drama      Comedy   Talk-Show       Short Documentary        News
    3151064     2181847     1372500     1191319     1062294     1051399
    Romance      Family  Reality-TV   Animation      Action       Crime
    1045327      824607      624854      556566      462531      459412
  Adventure   Game-Show       Music       Adult       Sport     Fantasy
     425130      424919      418888      353525      271872      234269
    Mystery      Horror    Thriller     History   Biography      Sci-Fi
     225390      202434      184618      165528      119759      117541
    Musical         War     Western   Film-Noir
      92140       38662       30931         873