TDM 10100: R Project 6 — 2024
Motivation: Indexing in R is powerful and easy, and can be performed in several ways.
Context: R indexes are often simply vectors of logical (TRUE/FALSE) values (but can also be positive or negative numbers, or can be some names).
Scope: We will get familiar with several types of indexes for data in R.
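As a small illustration of the four kinds of indexes mentioned above, here is a minimal sketch on a toy named vector. The vector v and its values are made up for illustration and are not part of the project data.

```r
# Toy named vector (hypothetical values, not from the project data)
v <- c(a = 10, b = 20, c = 30, d = 40)

# Logical index: keep the elements where the condition is TRUE
big <- v[v > 15]            # keeps b, c, d

# Positive numeric index: keep the elements in positions 1 and 3
picked <- v[c(1, 3)]        # keeps a, c

# Negative numeric index: drop the element in position 2
dropped <- v[-2]            # keeps a, c, d

# Name index: look elements up by their names
byname <- v[c("d", "a")]    # keeps d, a (in the order requested)
```

The same four styles work on a column of a data frame, such as myDF$Age, because a column is just a vector.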
Dataset(s)
This project will use the following dataset(s):
- /anvil/projects/tdm/data/death_records/DeathRecords.csv
- /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
- /anvil/projects/tdm/data/beer/reviews_sample.csv
- /anvil/projects/tdm/data/election/itcont1980.txt
- /anvil/projects/tdm/data/flights/subset/1990.csv
Questions
As before, please use the |
If your session crashes when you read in the data (for instance, on Question 2), you might want to try using 2 cores in your session instead of 1 core.
Question 1 (2 pts)
In the death records file:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
For context: We can revisit Question 1 from Project 5 without using the subset function. Instead, we can use indexing to illustrate the ages at death for women, as follows:
plot(table(myDF$Age[(myDF$Sex == "F") & (myDF$Age < 999)]))
and we can make a comparable plot for the deaths for men:
plot(table(myDF$Age[(myDF$Sex == "M") & (myDF$Age < 999)]))
(Notice how men die earlier, and also there is a bump in the number of deaths for men in their twenties and thirties.)
OK, in this question, we want to make similar plots for the death age for a few different races.
You can see what races the data from the |
- Make a table of the values in the Race column and how many times each Race value occurs. How many of the people in the data set have Filipino race?
- Use indexing (not the subset function) to make a plot of the table of the Age values at the time of death for which the Race value is the number 7 (which stands for Filipino race) and for which the Age is not 999.
- a. A table of the values in the Race column and how many times each Race value occurs. Also, state how many of the people in the data set have Filipino race.
- b. Plot of the table of Age values for people with Filipino race, also with Age not equal to 999.
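To see how two conditions combine into one index, here is a minimal sketch on a toy data frame. The data frame toyDF and its values are hypothetical; only the column names Race and Age and the code 999 for a missing age come from the question.

```r
# Toy data (hypothetical values; only the column names match the real file)
toyDF <- data.frame(
  Race = c(1, 7, 1, 7, 2, 7),
  Age  = c(80, 65, 999, 999, 45, 65)
)

# Count how many times each Race value occurs
raceCounts <- table(toyDF$Race)

# Each comparison gives a TRUE/FALSE vector; & keeps the rows
# where both conditions are TRUE
keep <- (toyDF$Race == 7) & (toyDF$Age != 999)

# Use the logical vector as an index into the Age column, then tabulate
ageTable <- table(toyDF$Age[keep])
```

Here keep is TRUE only for rows 2 and 6, so ageTable has a single entry: the age 65, occurring twice. Passing ageTable to plot would draw it in the same way as the plots for women and men above.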
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Let’s re-examine Project 5, Question 2, as follows: Make a table of the values in the myDF$STORE_R column that satisfy the index condition myDF$SPEND < 0. In this way, you can re-create your answer to Project 5, Question 2, without using the subset function.
- Show the number of refunds for each STORE_R value, just using indexing, in other words, without using the subset function. (For instance, CENTRAL stores had 2750 refunds.)
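The idea of indexing one column by a condition on another column can be sketched on toy data as follows. The data frame toyDF and its values are hypothetical; only the column names STORE_R and SPEND come from the question.

```r
# Toy data (hypothetical values; only the column names match the real file)
toyDF <- data.frame(
  STORE_R = c("EAST", "WEST", "EAST", "CENTRAL", "WEST", "EAST"),
  SPEND   = c(2.50, -1.00, -3.25, 4.00, -0.50, -2.00),
  stringsAsFactors = FALSE
)

# TRUE for the rows with negative spending (the refunds)
isRefund <- toyDF$SPEND < 0

# Index the STORE_R column by the condition on SPEND, then count
refundsByRegion <- table(toyDF$STORE_R[isRefund])
```

In this toy example, EAST and WEST each have 2 refunds. CENTRAL does not appear in the table at all, because its only row has positive SPEND and the column is stored as character data rather than as a factor.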
Question 3 (2 pts)
In this file of beer reviews /anvil/projects/tdm/data/beer/reviews_sample.csv
- Make a table of the values in the column myDF$username but do not print all of the values. Please sort the values and show only the tail, so that you can see the most popular 6 username values, and the number of reviews that each of these 6 people wrote. Hint: The user named acurtis wrote the most reviews!
- In part 3b, consider only the reviews written by the user acurtis. What is the average score of the reviews that were written by the user acurtis?
- a. Print the most popular 6 username values, and the number of reviews that each of these 6 people wrote.
- b. Find the average score of the reviews that were written by the user acurtis.
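Here is a minimal sketch of the sort and tail pattern, and of averaging one column over the rows picked out by an index. The data frame toyDF, the user names, and the scores are hypothetical; only the column names username and score come from the question.

```r
# Toy data (hypothetical users and scores)
toyDF <- data.frame(
  username = c("ann", "bob", "ann", "cat", "ann", "bob"),
  score    = c(4.0, 3.5, 4.5, 2.0, 5.0, 4.5),
  stringsAsFactors = FALSE
)

# sort puts the counts in increasing order, so the largest counts
# are at the end, which is exactly what tail shows
reviewCounts <- sort(table(toyDF$username))
top2 <- tail(reviewCounts, n = 2)

# Index the score column by a condition on the username column
annMean <- mean(toyDF$score[toyDF$username == "ann"])
```

In this toy example, top2 lists bob (2 reviews) and then ann (3 reviews), and annMean is the average of 4.0, 4.5, and 5.0, which is 4.5. The default for tail is n = 6, which matches the 6 values requested in the question.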
Question 4 (2 pts)
Read in the 1980 election data using:
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
Revisit Question 4 from Project 5, and find the 9 NAME values for which the TRANSACTION_DT value is missing, using indexing instead of the subset function.
- Give the 9 NAME values for which the TRANSACTION_DT value is missing, using indexing instead of the subset function.
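Indexing by missing values can be sketched on toy data as follows. The data frame toyDF, the names, and the dates are hypothetical, and the sketch assumes that the missing values are stored as NA (it is worth checking how the missing dates appear in the real data after reading it in).

```r
# Toy data (hypothetical names and dates); assumes missing dates are NA
toyDF <- data.frame(
  NAME           = c("SMITH, A", "JONES, B", "LEE, C", "DOE, D"),
  TRANSACTION_DT = c(1011980, NA, 3151980, NA),
  stringsAsFactors = FALSE
)

# is.na gives TRUE exactly where the value is missing
missingDate <- is.na(toyDF$TRANSACTION_DT)

# Use that logical vector to index the NAME column
missingNames <- toyDF$NAME[missingDate]
```

Here missingNames contains the two names from rows 2 and 4. Note that a comparison such as toyDF$TRANSACTION_DT == NA does not work for this purpose, because any comparison with NA gives NA rather than TRUE or FALSE.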
Question 5 (2 pts)
Consider the 1990 flight data:
/anvil/projects/tdm/data/flights/subset/1990.csv
Using indexing (not the subset function) find the mean of the DepDelay of all of the flights whose Origin airport is EWR or JFK or LGA.
- Give the mean of the DepDelay of all of the flights whose Origin airport is EWR or JFK or LGA.
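When a condition allows several possible values, the %in% operator is one convenient way to build the index. Here is a minimal sketch on toy data; the data frame toyDF and its values are hypothetical, and only the column names Origin and DepDelay and the three airport codes come from the question.

```r
# Toy data (hypothetical flights); one delay is deliberately missing
toyDF <- data.frame(
  Origin   = c("EWR", "ORD", "JFK", "LGA", "IND", "JFK"),
  DepDelay = c(10, 5, NA, -2, 30, 4),
  stringsAsFactors = FALSE
)

# TRUE for the rows whose Origin is any one of the three airports
nyc <- toyDF$Origin %in% c("EWR", "JFK", "LGA")

# Index the DepDelay column by that condition
nycDelays <- toyDF$DepDelay[nyc]

# A single NA makes the plain mean NA, so remove the NA values first
nycMean <- mean(nycDelays, na.rm = TRUE)
```

In this toy example, nycDelays is 10, NA, -2, 4, so mean(nycDelays) is NA, while nycMean is (10 - 2 + 4) / 3 = 4. The same index could be written with three == comparisons joined by the | operator; %in% is simply shorter.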
Submitting your Work
We are becoming very familiar with missing data and with subsets of data! These concepts take practice. Please continue to ask questions on Piazza, and/or in office hours.
- firstname_lastname_project6.ipynb
You must double check your You will not receive full credit if your |