TDM 10100: R Project 6 — 2024
Motivation: Indexing in R is powerful and easy, and can be performed in several ways.
Context: R indexes are often simply vectors of logical (TRUE/FALSE) values (but can also be positive or negative numbers, or can be some names).
Scope: We will get familiar with several types of indexes for data in R.
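As a minimal sketch of the index types mentioned above, the following uses a small made-up named vector. The names and values are illustrative only and do not come from any of the project datasets.

```r
# A small made-up named vector, for illustration only
ages <- c(alice = 34, bob = 71, carol = 58, dave = 19)

# Logical index: keep the elements where the condition is TRUE
over50 <- ages[ages > 50]

# Positive numeric index: keep the 1st and 3rd elements
firstAndThird <- ages[c(1, 3)]

# Negative numeric index: drop the 2nd element
withoutSecond <- ages[-2]

# Name index: look up elements by their names
byName <- ages[c("dave", "alice")]

print(over50)
print(firstAndThird)
print(withoutSecond)
print(byName)
```

In every case, the expression inside the square brackets decides which elements are kept; only the kind of index changes.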
Dataset(s)
This project will use the following dataset(s):
- /anvil/projects/tdm/data/death_records/DeathRecords.csv
- /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
- /anvil/projects/tdm/data/beer/reviews_sample.csv
- /anvil/projects/tdm/data/election/itcont1980.txt
- /anvil/projects/tdm/data/flights/subset/1990.csv
Questions
As before, please use the |
If your session crashes when you read in the data (for instance, on Question 2), you might want to try using 2 cores in your session instead of 1 core.
Question 1 (2 pts)
In the death records file:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
For context: We can revisit Question 1 from Project 5 without using the subset command. Instead, we can use indexing to plot the ages at death for women, as follows:
plot(table(myDF$Age[(myDF$Sex == "F") & (myDF$Age < 999)]))
and we can make a comparable plot for the deaths for men:
plot(table(myDF$Age[(myDF$Sex == "M") & (myDF$Age < 999)]))
(Notice that men tend to die at younger ages, and that there is a bump in the number of deaths for men in their twenties and thirties.)
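To see why these commands work, it can help to look at the logical index itself. The sketch below uses a tiny made-up data frame (toyDF is hypothetical, not the death records data) with the same column names, and it assumes, as the condition above suggests, that 999 is a placeholder rather than a real age.

```r
# Made-up data for illustration only
toyDF <- data.frame(
  Sex = c("F", "M", "F", "F", "M"),
  Age = c(82, 45, 999, 67, 82),
  stringsAsFactors = FALSE
)

# Each comparison gives one TRUE/FALSE value per row
isFemale <- toyDF$Sex == "F"
knownAge <- toyDF$Age < 999

# & combines the two logical vectors element by element,
# so a row is kept only when both conditions are TRUE
keep <- isFemale & knownAge
print(keep)

# Using the logical vector as an index keeps only the matching ages
femaleAges <- toyDF$Age[keep]
print(table(femaleAges))
```

Building the index in separate steps like this is optional, but it makes it easy to check each condition before combining them.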
OK, in this question, we want to make similar plots for the death age for a few different races.
You can see what races the data from the |
- Make a table of the values in the Race column, showing how many times each Race value occurs. How many of the people in the data set have Filipino race?
- Use indexing (not the subset function) to make a plot of the table of the Age values at the time of death for which the Race value is the number 7 (which stands for Filipino race) and for which the Age is not 999.
- a. A table of the values in the Race column, showing how many times each Race value occurs. Also, state how many of the people in the data set have Filipino race.
- b. Plot of the table of Age values for people with Filipino race, also with Age not equal to 999.
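As a small sketch of how to read one count out of a table, the example below uses a made-up vector of codes (toyRace is hypothetical, and the codes in it carry no real meaning). A table can be indexed by the character form of a value, or the same count can be found with a logical comparison.

```r
# Made-up codes for illustration only
toyRace <- c(1, 7, 1, 2, 7, 7, 1)

# Count how many times each value occurs
counts <- table(toyRace)
print(counts)

# Index the table by name: the names are the values as character strings
sevens <- counts["7"]
print(sevens)

# An equivalent count using a logical comparison
sevensAgain <- sum(toyRace == 7)
print(sevensAgain)
```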
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Let’s re-examine Project 5, Question 2, as follows: Make a table of the values in the myDF$STORE_R column that satisfy the index condition myDF$SPEND < 0. In this way, you can re-create your answer to Project 5, Question 2, without using the subset function.
- Show the number of refunds for each STORE_R value, just using indexing, in other words, without using the subset function. (For instance, CENTRAL stores had 2750 refunds.)
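The pattern described above (index one column by a condition on another column, then make a table) can be sketched on a tiny made-up data frame. Here toyDF and its values are hypothetical; only the column names STORE_R and SPEND are taken from the question.

```r
# Made-up data for illustration only
toyDF <- data.frame(
  STORE_R = c("CENTRAL", "EAST", "CENTRAL", "WEST", "EAST"),
  SPEND   = c(-2.50, 4.99, -1.25, -0.99, 3.00),
  stringsAsFactors = FALSE
)

# Keep the STORE_R values only for the rows where SPEND is negative
refundStores <- toyDF$STORE_R[toyDF$SPEND < 0]

# Count how many times each store region appears among those rows
refundCounts <- table(refundStores)
print(refundCounts)
```

Note that the condition uses the SPEND column, while the values being kept come from the STORE_R column. The two columns only need to have the same length.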
Question 3 (2 pts)
In this file of beer reviews: /anvil/projects/tdm/data/beer/reviews_sample.csv
- Make a table of the values in the column myDF$username but do not print all of the values. Please sort the values and show only the tail, so that you can see the most popular 6 username values, and the number of reviews that each of these 6 people wrote. Hint: The user named acurtis wrote the most reviews!
- In part 3b, consider only the reviews written by the user acurtis. What is the average score of the reviews that were written by the user acurtis?
- a. Print the most popular 6 username values, and the number of reviews that each of these 6 people wrote.
- b. Find the average score of the reviews that were written by the user acurtis.
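Here is a minimal sketch of the sort-then-tail idea, followed by a mean taken over an indexed column. The data frame toyDF and the usernames in it are made up; only the column names username and score match the question.

```r
# Made-up data for illustration only
toyDF <- data.frame(
  username = c("ann", "bo", "ann", "cy", "ann", "bo"),
  score    = c(4.0, 3.5, 4.5, 2.0, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# sort() puts the counts in increasing order,
# so the most frequent usernames end up at the end
userCounts <- sort(table(toyDF$username))

# tail() shows the last few entries; n = 2 here because the toy data is small
topTwo <- tail(userCounts, n = 2)
print(topTwo)

# Average score for one user, using a logical index on the score column
annMean <- mean(toyDF$score[toyDF$username == "ann"])
print(annMean)
```

By default, tail shows the last 6 entries, which matches the 6 values requested in the question.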
Question 4 (2 pts)
Read in the 1980 election data using:
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
Revisit Question 4 from Project 5, and find the 9 NAME values for which the TRANSACTION_DT value is missing, using indexing instead of the subset function.
- Give the 9 NAME values for which the TRANSACTION_DT value is missing, using indexing instead of the subset function.
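A short sketch of indexing by missing values, on a made-up data frame (toyDF and its contents are hypothetical). It assumes that missing dates are stored as NA; if a file stores them in some other way, such as empty strings, the condition would need to change.

```r
# Made-up data for illustration only
toyDF <- data.frame(
  NAME = c("SMITH, A", "JONES, B", "LEE, C", "KIM, D"),
  TRANSACTION_DT = c(7151980, NA, 10031980, NA),
  stringsAsFactors = FALSE
)

# is.na() gives TRUE for each missing value.
# (Comparing with == NA does not work: the result is NA, not TRUE.)
dateMissing <- is.na(toyDF$TRANSACTION_DT)
print(dateMissing)

# Keep the NAME values for the rows where the date is missing
missingNames <- toyDF$NAME[dateMissing]
print(missingNames)
```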
Question 5 (2 pts)
Consider the 1990 flight data:
/anvil/projects/tdm/data/flights/subset/1990.csv
Using indexing (not the subset function), find the mean of the DepDelay of all of the flights whose Origin airport is EWR or JFK or LGA.
- Give the mean of the DepDelay of all of the flights whose Origin airport is EWR or JFK or LGA.
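One way to express an "or" condition over several values is the %in% operator, sketched here on a made-up data frame (toyDF and its values are hypothetical). The na.rm = TRUE argument is an assumption: it is only needed if some delay values are missing.

```r
# Made-up data for illustration only
toyDF <- data.frame(
  Origin   = c("EWR", "ORD", "JFK", "LGA", "IND", "JFK"),
  DepDelay = c(10, 5, NA, 20, 0, 30),
  stringsAsFactors = FALSE
)

# TRUE for each row whose Origin is one of the three airports
nycFlights <- toyDF$Origin %in% c("EWR", "JFK", "LGA")

# The same index written with the | (or) operator
nycFlightsAlt <- (toyDF$Origin == "EWR") | (toyDF$Origin == "JFK") | (toyDF$Origin == "LGA")

# Mean of the delays for those rows, ignoring missing values
nycMean <- mean(toyDF$DepDelay[nycFlights], na.rm = TRUE)
print(nycMean)
```

Both forms of the index select the same rows; %in% is just shorter when the list of values grows.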
Submitting your Work
We are becoming very familiar with missing data and with subsets of data! These concepts take practice. Please continue to ask questions on Piazza, and/or in office hours.
- firstname_lastname_project6.ipynb
You must double check your You will not receive full credit if your |