TDM 10100: R Project 4 — 2024
Motivation: We continue to practice using the tapply
function.
Context: tapply
takes two columns and a function. It applies the function to the first column of values, split into groups according to the second column.
Scope: tapply
in R
Dataset(s)
This project will use the following dataset(s):
-
/anvil/projects/tdm/data/death_records/DeathRecords.csv
-
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
-
/anvil/projects/tdm/data/beer/reviews_sample.csv
-
/anvil/projects/tdm/data/election/itcont1980.txt
-
/anvil/projects/tdm/data/flights/subset/1990.csv
Questions
As before, please use the |
If you session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core. |
Question 1 (2 pts)
In the death records file:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
Find the mean Age
of death for each Sex
.
-
Show the mean
Age
of death for eachSex
.
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Find the total amount of money spent on each PRODUCT_NUM
. (You can just add up the values in the SPEND
column, grouped according to the PRODUCT_NUM
value. You can ignore the UNITS
column.) Display the top 10 types of PRODUCT_NUM
, according to the total amount of money spent on those products (i.e., according to the sum
of the SPEND
column for those 10 PRODUCT_NUM
values).
-
Show the top 10 types of
PRODUCT_NUM
, according to the total amount of money spent on those products (i.e., according to thesum
of theSPEND
column for those 10PRODUCT_NUM
values).
Question 3 (2 pts)
In this file of beer reviews /anvil/projects/tdm/data/beer/reviews_sample.csv
Consider the mean beer scores on each date.
Find the three dates on which the mean score is a 5.
A mean score
of "5" is a perfect score, so you can use the tapply
function, taking the mean of the score
values, grouped according to the date
. You will need to sort the results and consider the tail.
-
Show the three dates on which the mean
score
is a 5.
Question 4 (2 pts)
Revisit the video at the very beginning of Project 3, Example 1, in which we found the amount of money donated in each state, during the 1980 federal election cycle.
/anvil/projects/tdm/data/election/itcont1980.txt
This time, instead of finding the amount of money donated in each state in 1980, please find the amount of money donated in each city-and-state pair.
To accomplish this, paste the city and state together like this:
paste(myDF$CITY, myDF$STATE, sep=", ")
We use a comma and a space for the separator in the paste function. Take the paste and use it as the second element in Project 3, Example 1, so that we group the data according to the CITY
and the STATE
. The goal is to show the top 20 city-and-state pairs, according to the amount of money donated.
Your answer will need to use the sort
function and the tail function, like this:
tail(sort(tapply( …, …, … )), n=20)
Here are the top 6 city-and-state pairs (notice that the top result has a blank city-and-state pair, namely, many of the donations have a blank city-and-state):
WASHINGTON, DC
4273606
LOS ANGELES, CA
4569952
DALLAS, TX
4748262
HOUSTON, TX
7606806
NEW YORK, NY
11345027
,
17299729
In your solution, you need to show the top 20 of the top city-and-state pairs.
-
Show the top 20 city-and-state pairs, according to the amount of money donated.
Question 5 (2 pts)
Revisit the video at the very beginning of Project 3, Example 3, in which we studied the departure delays (DepDelay
) in the 1990 flight data:
/anvil/projects/tdm/data/flights/subset/1990.csv
This time, instead of finding the mean departure delays according to where the flights depart (in the Origin
column), please find mean departure delays on each Month / DayofMonth / Year triple
To accomplish this, paste these three columns together like this:
paste(myDF$Month, myDF$DayofMonth, myDF$Year, sep="/")
We use a slash for the separator in the paste function. Take the paste and use it as the second element in Project 3, Example 3, so that we group the data according to the Month / DayofMonth / Year triple. The goal is to show the worst 6 dates from 1990, according to the largest mean departure delay (DepDelay
) values.
Your answer will need to use the sort
function and the tail function, like this:
tail(sort(tapply( …, …, mean, na.rm=TRUE)))
Here are the worst two dates from 1990, according to the largest mean departure delay (DepDelay
) values.
12/22/1990
45.2222488995598
12/21/1990
45.6617816091954
In your solution, you need to show the worst 6 dates from 1990, according to the largest mean departure delay (DepDelay
) values.
-
Show the worst 6 dates from 1990, according to the largest mean departure delay (
DepDelay
) values.
Submitting your Work
We hope that you enjoyed this additional practice with the tapply
function in this project!
-
firstname_lastname_project4.ipynb
You must double check your You will not receive full credit if your |