# STAT 19000: Project 6 — Fall 2020

The `tapply` function works like this:

`tapply( somedata, thewaythedataisgrouped, myfunction)`

``````myDF <- read.csv("/class/datamine/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv")

We could do four computations to compute the `mean` `SPEND` amount in each `STORE_R`…​

``````mean(myDF\$SPEND[myDF\$STORE_R == "CENTRAL"])
mean(myDF\$SPEND[myDF\$STORE_R == "EAST   "])
mean(myDF\$SPEND[myDF\$STORE_R == "SOUTH  "])
mean(myDF\$SPEND[myDF\$STORE_R == "WEST   "])``````

…​but it is easier to do all four of these calculations with the `tapply` function. We take a `mean` of the `SPEND` values, broken into groups according to the `STORE_R`:

``tapply( myDF\$SPEND, myDF\$STORE_R, mean)``

We could find the total amount in the `SPEND` column in 2016 and then again in 2017…​

``````sum(myDF\$SPEND[myDF\$YEAR == "2016"])
sum(myDF\$SPEND[myDF\$YEAR == "2017"])``````

…​or we could do both of these calculations at once, using the `tapply` function. We take the `sum` of all `SPEND` amounts, broken into groups according to the `YEAR`

``tapply(myDF\$SPEND, myDF\$YEAR, sum)``

As a last example, we can calculate the amount spent on each day of purchases. We take the `sum` of all `SPEND` amounts, broken into groups according to the `PURCHASE_` day:

``tapply(myDF\$SPEND, myDF\$PURCHASE_, sum)``
``tail(sort(  tapply(myDF\$SPEND, myDF\$PURCHASE_, sum)  ),n=20)``

It makes sense to sort the results and then look at the 20 days on which the `sum` of the `SPEND` amounts were the highest.

``tapply( mydata, mygroups, myfunction, na.rm=T )``

Some generic uses to explain how this would look, if we made the calculations in a naive/verbose/painful way:

``````myfunction(mydata[mygroups == 1], na.rm=T)
myfunction(mydata[mygroups == 2], na.rm=T)
myfunction(mydata[mygroups == 3], na.rm=T) ....
myfunction(mydata[mygroups == "IN"], na.rm=T)
myfunction(mydata[mygroups == "OH"], na.rm=T)
myfunction(mydata[mygroups == "IL"], na.rm=T) ....``````
``````myDF <- read.csv("/class/datamine/data/flights/subset/2005.csv")

`sum` all flight `Distance`, split into groups according to the airline (`UniqueCarrier`).

``sort(tapply(myDF\$Distance, myDF\$UniqueCarrier, sum))``

Find the `mean` flight `Distance`, grouped according to the city of `Origin`.

``sort(tapply(myDF\$Distance, myDF\$Origin, mean))``

Calculate the `mean` departure delay (`DepDelay`), for each airplane (i.e., each `TailNum`), using `na.rm=T` because some of the values of the departure delays are `NA`.

``tail(sort(tapply(myDF\$DepDelay, myDF\$TailNum, mean, na.rm=T)),n=20)``
``````library(data.table)

`sum` the amounts of all contributions made, grouped according to the `STATE` where the people lived.

``sort(tapply(myDF\$TRANSACTION_AMT, myDF\$STATE, sum))``

`sum` the amounts of all contributions made, grouped according to the `CITY`/`STATE` where the people lived.

``````tail(sort(tapply(myDF\$TRANSACTION_AMT, paste(myDF\$CITY, myDF\$STATE), sum)),n=20)
mylocations <-  paste(myDF\$CITY, myDF\$STATE)
tail(sort(tapply(myDF\$TRANSACTION_AMT, mylocations, sum)),n=20)``````

`sum` the amounts of all contributions made, grouped according to the `EMPLOYER` where the people worked.

``tail(sort(tapply(myDF\$TRANSACTION_AMT, myDF\$EMPLOYER, sum)), n=30)``

Motivation: `tapply` is a powerful function that allows us to group data, and perform calculations on that data in bulk. The "apply suite" of functions provide a fast way of performing operations that would normally require the use of loops. Typically, when writing R code, you will want to use an "apply suite" function rather than a for loop.

Context: The past couple of projects have studied the use of loops and/or vectorized operations. In this project, we will introduce a function called `tapply` from the "apply suite" of functions in R.

Scope: r, for, tapply

Learning objectives
• Explain what "recycling" is in R and predict behavior of provided statements.

• Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

• Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• List the differences between lists, vectors, factors, and data.frames, and when to use each.

• Demonstrate a working knowledge of control flow in r: if/else statements, while loops, etc.

• Demonstrate how apply functions are generally faster than using loops.

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/fars/7581.csv`

## Questions

 Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs will be subject to lose points.

### Question 1

The dataset, `/class/datamine/data/fars/7581.csv` contains the combined accident records from year 1975 to 1981. Load up the dataset into a data.frame named `dat`. In the previous project’s question 4, we asked you to calculate the mean number of motorists involved in an accident (variable `PERSON`) with i drunk drivers where i takes the values from 0 through 6. This time, solve this question using the `tapply` function instead. Which method did you prefer and why?

Now that you’ve read the data into a dataframe named `dat`, run the following code:

``````# Read in data that maps state codes to state names
# Create a vector of state names called v
v <- state_names\$state
# Set the names of the new vector to the codes
names(v) <- state_names\$code
# Create a new column in the dat dataframe with the actual names of the states
dat\$mystates <- v[as.character(dat\$STATE)]``````
Items to submit
• R code used to solve the problem.

• The output/solution.

### Question 2

Make a state-by-state classification of the average number of drunk drivers in an accident. Which state has the highest average number of drunk drivers per accident?

Items to submit
• R code used to solve the problem.

• The entire output.

• Which state has the highest average number of drunk drivers per accident?

### Question 3

Add up the total number of fatalities, according to the day of the week on which they occurred. Are the numbers surprising to you? What days of the week have a higher number of fatalities? If instead you calculate the proportion of fatalities over the total number of people in the accidents, what would you expect? Calculate it and see if your expectations match.

 Sundays through Saturdays are days 1 through 7, respectively. Day 9 indicates that the day is unknown.

This video example uses the Amazon fine food reviews dataset to make a similar calculation, in which we have two tapply statements, and we divide the results to get a ton of similar ratios all at once. Powerful stuff! It may guide you in your thinking about this question.

Items to submit
• R code used to solve the problem.

• What days have the highest number of fatalities?

• What would you expect if you calculate the proportion of fatalities over the total number of people in the accidents?

### Question 4

How many drunk drivers are involved, on average, in crashes that occur on straight roads? How many drunk drivers are involved, on average, in crashes that occur on curved roads? Solve the pair of questions in a single line of R code.

 The `ALIGNMNT` variable is 1 for straight, 2 for curved, and 9 for unknown.
Items to submit
• R code used to solve the problem.

• Results from running the R code.

### Question 5

Break the day into portions, as follows: midnight to 6AM, 6AM to 12 noon, 12 noon to 6PM, 6PM to midnight, other. Find the total number of fatalities that occur during each of these time intervals. Also, find the average number of fatalities per crash that occurs during each of these time intervals.

This example demonstrates a comparable calculation. In the video, I used the total number of people in the accident, and your question is (instead) about the number of fatalities, but this is essentially the only difference. I hope it helps to explain the way that the cut function works, along with the analogous breaks.

Items to submit
• R code used to solve the problem.

• Results from running the R code.