TDM 10100: R Project 3 — 2024
Motivation: Now that we are comfortable with the table command in R, we can learn about the tapply command. The tapply command applies a function to the values in one column, grouped according to another column. This sounds abstract, but it makes a lot of sense once you see some examples.
Context: tapply takes two columns and a function. It applies the function to the first column of values, split into groups according to the second column.
Scope: tapply in R
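Before turning to the real datasets, here is a minimal sketch of the grouping idea on a small made-up data frame. The name toyDF, its columns, and all of the values are invented for illustration and do not come from any project dataset. The second half shows one way to think about what tapply is doing: split the values into groups, then apply the function to each group.

```r
# Made-up toy data (not from any project dataset): six donations from three states
toyDF <- data.frame(
  state  = c("IN", "OH", "IN", "IL", "OH", "IN"),
  amount = c(100, 250, 50, 75, 150, 25)
)

# Sum the amounts (first argument), grouped by state (second argument)
totals <- tapply(toyDF$amount, toyDF$state, sum)
totals

# The same totals by hand: split the amounts into groups, then sum each group
byHand <- sapply(split(toyDF$amount, toyDF$state), sum)
byHand
```

Both approaches give one total per state, labeled by the state name.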
Dataset(s)
This project will use the following dataset(s):
- /anvil/projects/tdm/data/election/itcont1980.txt
- /anvil/projects/tdm/data/icecream/combined/products.csv
- /anvil/projects/tdm/data/flights/subset/1990.csv
- /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
- /anvil/projects/tdm/data/olympics/athlete_events.csv
Questions
|
As before, please use the |
Three examples of the tapply function:
Example 1 Using the 1980 election data, we can find the amount of money donated in each state.
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
We take the money for the donations (in the TRANSACTION_AMT column), group the data according to the state (in the STATE column), and sum up the donation amounts:
tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum)
Example 2 Using the ice cream products data, we can find the average rating for each brand.
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/icecream/combined/products.csv")
We take the rating (in the rating column), group the data according to the brand (in the brand column), and take an average of these ratings:
tapply(myDF$rating, myDF$brand, mean)
Example 3 Using the 1990 airport data, we can find the average departure delay for flights from each airport.
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv")
We take the departure delays (in the DepDelay column), group the data according to the airport where the flights depart (in the Origin column), and take an average of these departure delays:
tapply(myDF$DepDelay, myDF$Origin, mean)
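The result shows up as "NA" (not available) for every airport that has at least one missing departure delay, because R cannot take an average of a group that contains a missing value. In such a case, we can give tapply a fourth parameter, na.rm=TRUE, which is passed along to mean so that it ignores the missing values, and we try again: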
tapply(myDF$DepDelay, myDF$Origin, mean, na.rm=TRUE)
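To see the effect of na.rm=TRUE on a small scale, here is a minimal sketch with made-up data. The airport codes "AAA", "BBB", "CCC" and the delay values are invented for illustration. Only the group that contains a missing value is affected.

```r
# Made-up toy data: one delay for the hypothetical airport "BBB" is missing
toyDF <- data.frame(
  origin = c("AAA", "AAA", "BBB", "BBB", "CCC"),
  delay  = c(10, 20, 5, NA, 0)
)

# Without na.rm, the mean of the group that contains the NA is itself NA
withNA <- tapply(toyDF$delay, toyDF$origin, mean)
withNA

# Arguments given after the function are passed along to it,
# so mean() is called with na.rm=TRUE on each group
withoutNA <- tapply(toyDF$delay, toyDF$origin, mean, na.rm=TRUE)
withoutNA
```

In the first result, only "BBB" is NA. In the second, the average for "BBB" is computed from its one non-missing delay.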
For Dr Ward, using the Firefox browser, 1 core was enough for this entire project, but Dr Ward met one student who needed 2 cores, even in Firefox. If you cannot load the 1990 flights subset with 1 core, try it with 2 cores. Please make sure that you are using Firefox.
Question 1 (2 pts)
Using the 1990 airport data, find the average arrival delay for flights arriving at each airport.
|
The arrival delays are in the |
|
In the three examples at the start of the project (before Question 1), we used:
to load the data. I recommend that you use the |
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Find the sum of the amount spent (in the SPEND column) at each of the store regions (the STORE_R column).
Question 3 (2 pts)
In the grocery store file (same file from question 2):
Find the total amount of money spent in 2016 altogether, and the total amount of money spent in 2017 altogether. (You can use tapply to do this in just one cell.)
Question 4 (2 pts)
In the Olympics file /anvil/projects/tdm/data/olympics/athlete_events.csv
Find the average height of the athletes in each country (the country is the NOC column).
|
Remember to use |
Question 5 (2 pts)
In the Olympics file (same file from question 4):
Find the average height of the athletes in each sport (the sport is the Sport column, of course!). After finding these average heights, please sort your results. In which sport are the athletes the tallest (on average)? Does this make sense intuitively, i.e., is height an advantage in this sport?
|
Again, remember to use |
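For the sorting step in Question 5, here is a minimal sketch on made-up data showing how the output of tapply can be sorted. The group labels "A", "B", "C" and the heights are invented for illustration and are unrelated to the Olympics file.

```r
# Made-up toy data: heights (in cm) for athletes in three hypothetical groups
toyDF <- data.frame(
  group  = c("A", "B", "A", "C", "B", "C"),
  height = c(170, 190, 180, NA, 200, 160)
)

# Average height per group, ignoring the missing value
avgHeights <- tapply(toyDF$height, toyDF$group, mean, na.rm=TRUE)

# sort() orders from smallest to largest by default
sort(avgHeights)

# decreasing=TRUE puts the largest average first
sortedHeights <- sort(avgHeights, decreasing=TRUE)
sortedHeights

# The name of the group with the largest average
names(sortedHeights)[1]
```

Because the result of tapply keeps the group names, the sorted output still shows which group each average belongs to.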
Submitting your Work
We only learned about tapply in this project because it is a short week, but it is powerful! As always, please ask any questions you have on Piazza or in office hours. We hope you have a nice Labor Day weekend!
- firstname_lastname_project3.ipynb
|
You must double check your You will not receive full credit if your |