TDM 10100: R Project 4 — 2024

Motivation: We continue to practice using the tapply function.

Context: tapply takes two columns and a function. It applies the function to the first column of values, split into groups according to the second column.

Scope: tapply in R

Learning Objectives:
  • Learning about how to apply functions to data in groups

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

  • /anvil/projects/tdm/data/beer/reviews_sample.csv

  • /anvil/projects/tdm/data/election/itcont1980.txt

  • /anvil/projects/tdm/data/flights/subset/1990.csv

Questions

As before, please use the seminar-r kernel (not the seminar kernel). You do not need to use the %%R cell magic.

If you session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core.

Question 1 (2 pts)

In the death records file:

/anvil/projects/tdm/data/death_records/DeathRecords.csv

Find the mean Age of death for each Sex.

Deliverables
  • Show the mean Age of death for each Sex.

Question 2 (2 pts)

In the grocery store file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

Find the total amount of money spent on each PRODUCT_NUM. (You can just add up the values in the SPEND column, grouped according to the PRODUCT_NUM value. You can ignore the UNITS column.) Display the top 10 types of PRODUCT_NUM, according to the total amount of money spent on those products (i.e., according to the sum of the SPEND column for those 10 PRODUCT_NUM values).

Deliverables
  • Show the top 10 types of PRODUCT_NUM, according to the total amount of money spent on those products (i.e., according to the sum of the SPEND column for those 10 PRODUCT_NUM values).

Question 3 (2 pts)

In this file of beer reviews /anvil/projects/tdm/data/beer/reviews_sample.csv

Consider the mean beer scores on each date.

Find the three dates on which the mean score is a 5.

A mean score of "5" is a perfect score, so you can use the tapply function, taking the mean of the score values, grouped according to the date. You will need to sort the results and consider the tail.

Deliverables
  • Show the three dates on which the mean score is a 5.

Question 4 (2 pts)

Revisit the video at the very beginning of Project 3, Example 1, in which we found the amount of money donated in each state, during the 1980 federal election cycle.

/anvil/projects/tdm/data/election/itcont1980.txt

This time, instead of finding the amount of money donated in each state in 1980, please find the amount of money donated in each city-and-state pair.

To accomplish this, paste the city and state together like this:

paste(myDF$CITY, myDF$STATE, sep=", ")

We use a comma and a space for the separator in the paste function. Take the paste and use it as the second element in Project 3, Example 1, so that we group the data according to the CITY and the STATE. The goal is to show the top 20 city-and-state pairs, according to the amount of money donated.

Your answer will need to use the sort function and the tail function, like this:

tail(sort(tapply( …​, …​, …​ )), n=20)

Here are the top 6 city-and-state pairs (notice that the top result has a blank city-and-state pair, namely, many of the donations have a blank city-and-state):

WASHINGTON, DC
    4273606
LOS ANGELES, CA
    4569952
DALLAS, TX
    4748262
HOUSTON, TX
    7606806
NEW YORK, NY
    11345027
,
    17299729

In your solution, you need to show the top 20 of the top city-and-state pairs.

Deliverables
  • Show the top 20 city-and-state pairs, according to the amount of money donated.

Question 5 (2 pts)

Revisit the video at the very beginning of Project 3, Example 3, in which we studied the departure delays (DepDelay) in the 1990 flight data:

/anvil/projects/tdm/data/flights/subset/1990.csv

This time, instead of finding the mean departure delays according to where the flights depart (in the Origin column), please find mean departure delays on each Month / DayofMonth / Year triple

To accomplish this, paste these three columns together like this:

paste(myDF$Month, myDF$DayofMonth, myDF$Year, sep="/")

We use a slash for the separator in the paste function. Take the paste and use it as the second element in Project 3, Example 3, so that we group the data according to the Month / DayofMonth / Year triple. The goal is to show the worst 6 dates from 1990, according to the largest mean departure delay (DepDelay) values.

Your answer will need to use the sort function and the tail function, like this:

tail(sort(tapply( …​, …​, mean, na.rm=TRUE)))

Here are the worst two dates from 1990, according to the largest mean departure delay (DepDelay) values.

12/22/1990
    45.2222488995598
12/21/1990
    45.6617816091954

In your solution, you need to show the worst 6 dates from 1990, according to the largest mean departure delay (DepDelay) values.

Deliverables
  • Show the worst 6 dates from 1990, according to the largest mean departure delay (DepDelay) values.

Submitting your Work

We hope that you enjoyed this additional practice with the tapply function in this project!

Items to submit
  • firstname_lastname_project4.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.