STAT 19000: Project 6 — Fall 2020

The tapply function works like this:

tapply( somedata, thewaythedataisgrouped, myfunction)

myDF <- read.csv("/class/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")
head(myDF)

We could do four computations to compute the mean SPEND amount in each STORE_R…​

mean(myDF$SPEND[myDF$STORE_R == "CENTRAL"])
mean(myDF$SPEND[myDF$STORE_R == "EAST   "])
mean(myDF$SPEND[myDF$STORE_R == "SOUTH  "])
mean(myDF$SPEND[myDF$STORE_R == "WEST   "])

…​but it is easier to do all four of these calculations with the tapply function. We take a mean of the SPEND values, broken into groups according to the STORE_R:

tapply( myDF$SPEND, myDF$STORE_R, mean)

We could find the total amount in the SPEND column in 2016 and then again in 2017…​

sum(myDF$SPEND[myDF$YEAR == "2016"])
sum(myDF$SPEND[myDF$YEAR == "2017"])

…​or we could do both of these calculations at once, using the tapply function. We take the sum of all SPEND amounts, broken into groups according to the YEAR

tapply(myDF$SPEND, myDF$YEAR, sum)

As a last example, we can calculate the amount spent on each day of purchases. We take the sum of all SPEND amounts, broken into groups according to the PURCHASE_ day:

tapply(myDF$SPEND, myDF$PURCHASE_, sum)
tail(sort(  tapply(myDF$SPEND, myDF$PURCHASE_, sum)  ),n=20)

It makes sense to sort the results and then look at the 20 days on which the sum of the SPEND amounts were the highest.

tapply( mydata, mygroups, myfunction, na.rm=T )

Some generic uses to explain how this would look, if we made the calculations in a naive/verbose/painful way:

myfunction(mydata[mygroups == 1], na.rm=T)
myfunction(mydata[mygroups == 2], na.rm=T)
myfunction(mydata[mygroups == 3], na.rm=T) ....
myfunction(mydata[mygroups == "IN"], na.rm=T)
myfunction(mydata[mygroups == "OH"], na.rm=T)
myfunction(mydata[mygroups == "IL"], na.rm=T) ....
myDF <- read.csv("/class/datamine/data/flights/subset/2005.csv")
head(myDF)

sum all flight Distance, split into groups according to the airline (UniqueCarrier).

sort(tapply(myDF$Distance, myDF$UniqueCarrier, sum))

Find the mean flight Distance, grouped according to the city of Origin.

sort(tapply(myDF$Distance, myDF$Origin, mean))

Calculate the mean departure delay (DepDelay), for each airplane (i.e., each TailNum), using na.rm=T because some of the values of the departure delays are NA.

tail(sort(tapply(myDF$DepDelay, myDF$TailNum, mean, na.rm=T)),n=20)
library(data.table)
myDF <- fread("/class/datamine/data/election/itcont2016.txt", sep="|")
head(myDF)

sum the amounts of all contributions made, grouped according to the STATE where the people lived.

sort(tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum))

sum the amounts of all contributions made, grouped according to the CITY/STATE where the people lived.

tail(sort(tapply(myDF$TRANSACTION_AMT, paste(myDF$CITY, myDF$STATE), sum)),n=20)
mylocations <-  paste(myDF$CITY, myDF$STATE)
tail(sort(tapply(myDF$TRANSACTION_AMT, mylocations, sum)),n=20)

sum the amounts of all contributions made, grouped according to the EMPLOYER where the people worked.

tail(sort(tapply(myDF$TRANSACTION_AMT, myDF$EMPLOYER, sum)), n=30)

Motivation: tapply is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provide a fast way of performing operations that would normally require the use of loops. Typically, when writing R code, you will want to use an "apply suite" function rather than a for loop.

Context: The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called tapply from the "apply suite" of functions in R.

Scope: r, for, tapply

Learning objectives
  • Explain what "recycling" is in R and predict behavior of provided statements.

  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

  • Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.

  • Demonstrate how apply functions are generally faster than using loops.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/fars/7581.csv

Questions

Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to lose points.

Question 1

The dataset, /class/datamine/data/fars/7581.csv contains the combined accident records from year 1975 to 1981. Load up the dataset into a data.frame named dat. In the previous project’s question 4, we asked you to calculate the mean number of motorists involved in an accident (variable PERSON) with i drunk drivers where i takes the values from 0 through 6. This time, solve this question using the tapply function instead. Which method did you prefer and why?

Now that you’ve read the data into a dataframe named dat, run the following code:

# Read in data that maps state codes to state names
state_names <- read.csv("/class/datamine/data/fars/states.csv")
# Create a vector of state names called v
v <- state_names$state
# Set the names of the new vector to the codes
names(v) <- state_names$code
# Create a new column in the dat dataframe with the actual names of the states
dat$mystates <- v[as.character(dat$STATE)]
Items to submit
  • R code used to solve the problem.

  • The output/solution.

Question 2

Make a state-by-state classification of the average number of drunk drivers in an accident. Which state has the highest average number of drunk drivers per accident?

Items to submit
  • R code used to solve the problem.

  • The entire output.

  • Which state has the highest average number of drunk drivers per accident?

Question 3

Add up the total number of fatalities, according to the day of the week on which they occurred. Are the numbers surprising to you? What days of the week have a higher number of fatalities? If instead you calculate the proportion of fatalities over the total number of people in the accidents, what would you expect? Calculate it and see if your expectations match.

Sundays through Saturdays are days 1 through 7, respectively. Day 9 indicates that the day is unknown.

This video example uses the Amazon fine food reviews dataset to make a similar calculation, in which we have two tapply statements, and we divide the results to get a ton of similar ratios all at once. Powerful stuff! It may guide you in your thinking about this question.

Items to submit
  • R code used to solve the problem.

  • What days have the highest number of fatalities?

  • What would you expect if you calculate the proportion of fatalities over the total number of people in the accidents?

Question 4

How many drunk drivers are involved, on average, in crashes that occur on straight roads? How many drunk drivers are involved, on average, in crashes that occur on curved roads? Solve the pair of questions in a single line of R code.

The ALIGNMNT variable is 1 for straight, 2 for curved, and 9 for unknown.

Items to submit
  • R code used to solve the problem.

  • Results from running the R code.

Question 5

Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals.

This example demonstrates a comparable calculation. In the video, I used the total number of people in the accident, and your question is (instead) about the number of fatalities, but this is essentially the only difference. I hope it helps to explain the way that the cut function works, along with the analogous breaks.

Items to submit
  • R code used to solve the problem.

  • Results from running the R code.