STAT 19000: Project 6 — Fall 2020
The tapply
function works like this:
tapply( somedata, thewaythedataisgrouped, myfunction)
myDF < read.csv("/class/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")
head(myDF)
We could do four computations to compute the mean
SPEND
amount in each STORE_R
…
mean(myDF$SPEND[myDF$STORE_R == "CENTRAL"])
mean(myDF$SPEND[myDF$STORE_R == "EAST "])
mean(myDF$SPEND[myDF$STORE_R == "SOUTH "])
mean(myDF$SPEND[myDF$STORE_R == "WEST "])
…but it is easier to do all four of these calculations with the tapply
function. We take a mean
of the SPEND
values, broken into groups according to the STORE_R
:
tapply( myDF$SPEND, myDF$STORE_R, mean)
We could find the total amount in the SPEND
column in 2016 and then again in 2017…
sum(myDF$SPEND[myDF$YEAR == "2016"])
sum(myDF$SPEND[myDF$YEAR == "2017"])
…or we could do both of these calculations at once, using the tapply
function. We take the sum
of all SPEND
amounts, broken into groups according to the YEAR
tapply(myDF$SPEND, myDF$YEAR, sum)
As a last example, we can calculate the amount spent on each day of purchases.
We take the sum
of all SPEND
amounts, broken into groups according to the PURCHASE_
day:
tapply(myDF$SPEND, myDF$PURCHASE_, sum)
tail(sort( tapply(myDF$SPEND, myDF$PURCHASE_, sum) ),n=20)
It makes sense to sort the results and then look at the 20 days on which the sum
of the SPEND
amounts were the highest.
tapply( mydata, mygroups, myfunction, na.rm=T )
Some generic uses to explain how this would look, if we made the calculations in a naive/verbose/painful way:
myfunction(mydata[mygroups == 1], na.rm=T)
myfunction(mydata[mygroups == 2], na.rm=T)
myfunction(mydata[mygroups == 3], na.rm=T) ....
myfunction(mydata[mygroups == "IN"], na.rm=T)
myfunction(mydata[mygroups == "OH"], na.rm=T)
myfunction(mydata[mygroups == "IL"], na.rm=T) ....
myDF < read.csv("/class/datamine/data/flights/subset/2005.csv")
head(myDF)
sum
all flight Distance
, split into groups according to the airline (UniqueCarrier
).
sort(tapply(myDF$Distance, myDF$UniqueCarrier, sum))
Find the mean
flight Distance
, grouped according to the city of Origin
.
sort(tapply(myDF$Distance, myDF$Origin, mean))
Calculate the mean
departure delay (DepDelay
), for each airplane (i.e., each TailNum
), using na.rm=T
because some of the values of the departure delays are NA
.
tail(sort(tapply(myDF$DepDelay, myDF$TailNum, mean, na.rm=T)),n=20)
library(data.table)
myDF < fread("/class/datamine/data/election/itcont2016.txt", sep="")
head(myDF)
sum
the amounts of all contributions made, grouped according to the STATE
where the people lived.
sort(tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum))
sum
the amounts of all contributions made, grouped according to the CITY
/STATE
where the people lived.
tail(sort(tapply(myDF$TRANSACTION_AMT, paste(myDF$CITY, myDF$STATE), sum)),n=20)
mylocations < paste(myDF$CITY, myDF$STATE)
tail(sort(tapply(myDF$TRANSACTION_AMT, mylocations, sum)),n=20)
sum
the amounts of all contributions made, grouped according to the EMPLOYER
where the people worked.
tail(sort(tapply(myDF$TRANSACTION_AMT, myDF$EMPLOYER, sum)), n=30)
Motivation: tapply
is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provide a fast way of performing operations that would normally require the use of loops. Typically, when writing R code, you will want to use an "apply suite" function rather than a for loop.
Context: The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called tapply
from the "apply suite" of functions in R.
Scope: r, for, tapply
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/fars/7581.csv
Questions
Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go. 
Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like 
Question 1
The dataset, /class/datamine/data/fars/7581.csv
contains the combined accident records from year 1975 to 1981. Load up the dataset into a data.frame named dat
. In the previous project’s question 4, we asked you to calculate the mean number of motorists involved in an accident (variable PERSON
) with i drunk drivers where i takes the values from 0 through 6. This time, solve this question using the tapply
function instead. Which method did you prefer and why?
Now that you’ve read the data into a dataframe named dat
, run the following code:
# Read in data that maps state codes to state names
state_names < read.csv("/class/datamine/data/fars/states.csv")
# Create a vector of state names called v
v < state_names$state
# Set the names of the new vector to the codes
names(v) < state_names$code
# Create a new column in the dat dataframe with the actual names of the states
dat$mystates < v[as.character(dat$STATE)]

R code used to solve the problem.

The output/solution.
Question 2
Make a statebystate classification of the average number of drunk drivers in an accident. Which state has the highest average number of drunk drivers per accident?

R code used to solve the problem.

The entire output.

Which state has the highest average number of drunk drivers per accident?
Question 3
Add up the total number of fatalities, according to the day of the week on which they occurred. Are the numbers surprising to you? What days of the week have a higher number of fatalities? If instead you calculate the proportion of fatalities over the total number of people in the accidents, what would you expect? Calculate it and see if your expectations match.
Sundays through Saturdays are days 1 through 7, respectively. Day 9 indicates that the day is unknown. 
This video example uses the Amazon fine food reviews dataset to make a similar calculation, in which we have two tapply statements, and we divide the results to get a ton of similar ratios all at once. Powerful stuff! It may guide you in your thinking about this question.

R code used to solve the problem.

What days have the highest number of fatalities?

What would you expect if you calculate the proportion of fatalities over the total number of people in the accidents?
Question 4
How many drunk drivers are involved, on average, in crashes that occur on straight roads? How many drunk drivers are involved, on average, in crashes that occur on curved roads? Solve the pair of questions in a single line of R code.
The 

R code used to solve the problem.

Results from running the R code.
Question 5
Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals.
This example demonstrates a comparable calculation. In the video, I used the total number of people in the accident, and your question is (instead) about the number of fatalities, but this is essentially the only difference. I hope it helps to explain the way that the cut function works, along with the analogous breaks.

R code used to solve the problem.

Results from running the R code.