TDM 10100: R Project 3 — 2024

Motivation: Now that we are comfortable with the table command in R, we can learn about the tapply command. The tapply will apply a function to values in one column, which are grouped according to another column. This sounds abstract, but once you see some examples, it makes a lot of sense.

Context: tapply takes two columns and a function. It applies the function to the first column of values, split into groups according to the second column.

Scope: tapply in R

Learning Objectives:
  • Learning about how to apply functions to data in groups

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/election/itcont1980.txt

  • /anvil/projects/tdm/data/icecream/combined/products.csv

  • /anvil/projects/tdm/data/flights/subset/1990.csv

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

  • /anvil/projects/tdm/data/olympics/athlete_events.csv

Questions

As before, please use the seminar-r kernel (not the seminar kernel). You do not need to use the %%R cell magic.

Three examples of the tapply function:

Example 1 Using the 1980 election data, we can find the amount of money donated in each state.

library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

We take the money for the donations (in the TRANSACTION_AMT column), group the data according to the state (in the STATE column), and sum up the donation amounts:

tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum)

Example 2 Using the ice cream products data, we can find the average rating for each brand.

library(data.table)
myDF <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv")

We take the rating (in the rating column), group the data according to the brand (in the brand column), and take an average of these reviews:

tapply(myDF$rating, myDF$brand, mean)

Example 3 Using the 1990 airport data, we can find the average departure delay for flights from each airport.

library(data.table)
myDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv")

We take the departure delays (in the DepDelay column), group the data according to airport where the flights depart (in the Origin column), and take an average of these departure delays:

tapply(myDF$DepDelay, myDF$Origin, mean)

The values show up as "NA" (not available) because some values are missing, so R cannot take an average. In such a case, we can give R the fourth parameter na.rm=TRUE so that it ignores missing values, and we try again:

tapply(myDF$DepDelay, myDF$Origin, mean, na.rm=TRUE)

For Dr Ward, using the Firefox browser, 1 core was enough for this entire project, but Dr Ward met one student who demonstrated that he needed 2 cores, even in Firefox. So if you cannot load the 1990 flights subset with 1 core, then you might want to try it with 2 cores. Please make sure that you are using Firefox.

Question 1 (2 pts)

Using the 1990 airport data, find the average arrival delay for flights arriving to each airport.

The arrival delays are in the ArrDelay column, and the planes arrive at the airports in the Dest (destination) column.

In the three examples at the start of the project (before Question 1), we used:

library(data.table)
myDF <- fread( put my file location here )

to load the data. I recommend that you use the fread function to load your data too (rather than read.csv).

Question 2 (2 pts)

In the grocery store file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

Find the sum of the amount spent (in the SPEND column) at each of the store regions (the STORE_R column).

Question 3 (2 pts)

In the grocery store file (same file from question 2):

Find the total amount of money spent in 2016 altogether, and the total amount of money spent in 2017 altogether. (You can use the tapply to do this with just one cell.)

Question 4 (2 pts)

In the Olympics file /anvil/projects/tdm/data/olympics/athlete_events.csv

Find the average height of the athletes in each country (the country is the NOC column).

Remember to use na.rm=TRUE because some of the athelete heights are missing.

Question 5 (2 pts)

In the Olympics file (same file from question 4):

Find the average height of the athletes in each sport (the sport is the Sport column, of course!). After finding these average heights, please sort your results. In which sport are the athletes the tallest (on average)? Does this make sense intuitively, i.e., is height an advantage in this sport?

Again, remember to use na.rm=TRUE because some of the athelete heights are missing.

Submitting your Work

We only learned about tapply in this project because it is a short week, but it is powerful! As always, please ask any questions you have, on Piazza, or in office hours. We hope you have a nice Labor Day weekend!

Items to submit
  • firstname_lastname_project3.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.