TDM 10100: R Project 6 — 2024

Motivation: Indexing in R is powerful and easy, and can be performed in several ways.

Context: R indexes are often vectors of logical (TRUE/FALSE) values, but they can also be positive or negative numbers, or names.
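To make these index types concrete, here is a minimal sketch on a toy named vector. The vector `v` and its values are made up for illustration and do not come from any of the project datasets.

```r
# Toy named vector (hypothetical data, not from any project dataset)
v <- c(a = 10, b = 20, c = 30, d = 40)

v[c(TRUE, FALSE, TRUE, FALSE)]  # logical index: keeps the elements named a and c
v[v > 15]                       # logical index built from a condition: b, c, d
v[c(1, 4)]                      # positive index: the first and fourth elements
v[-2]                           # negative index: everything except the second element
v[c("b", "d")]                  # name index: the elements named b and d
```

In each case the thing inside the square brackets is itself a vector, which is why the same bracket notation covers all of these styles.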

Scope: We will get familiar with several types of indexes for data in R.

Learning Objectives:
  • Learn how to work with indexes in R.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

  • /anvil/projects/tdm/data/beer/reviews_sample.csv

  • /anvil/projects/tdm/data/election/itcont1980.txt

  • /anvil/projects/tdm/data/flights/subset/1990.csv

Questions

As before, please use the seminar-r kernel (not the seminar kernel). You do not need to use the %%R cell magic.

If your session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core.

Question 1 (2 pts)

In the death records file:

/anvil/projects/tdm/data/death_records/DeathRecords.csv

For context: We can revisit Question 1 from Project 5 without using the subset function. Instead, we can use indexing to plot the age at death for women, as follows:

plot(table(myDF$Age[(myDF$Sex == "F") & (myDF$Age < 999)]))

and we can make a comparable plot for the deaths for men:

plot(table(myDF$Age[(myDF$Sex == "M") & (myDF$Age < 999)]))

(Notice how men die earlier, and also there is a bump in the number of deaths for men in their twenties and thirties.)
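If it helps to see what the index inside the square brackets looks like, here is a small sketch on a toy data frame. The data frame `toyDF` and its five rows are hypothetical; only the column names Sex and Age mirror the ones used in the plots above.

```r
# Toy data (hypothetical values, only meant to mimic the Sex and Age columns)
toyDF <- data.frame(
  Sex = c("F", "M", "F", "M", "F"),
  Age = c(81, 45, 999, 67, 81),
  stringsAsFactors = FALSE
)

# Each comparison gives one TRUE/FALSE value per row;
# & keeps a row only when both conditions are TRUE
keep <- (toyDF$Sex == "F") & (toyDF$Age < 999)
keep

# Using the logical vector as an index keeps only the matching ages
toyDF$Age[keep]

# table() then counts how many times each remaining age occurs
table(toyDF$Age[keep])
```

Storing the index in a variable such as `keep` is optional, but it can make a long condition easier to read and to check.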

OK, in this question, we want to make similar plots for the death age for a few different races.

You can see which races the values in the Race column represent by looking at page 15 of the PDF source file: www.cdc.gov/nchs/data/dvs/Record_Layout_2014.pdf

  1. Make a table of the values in the Race column and how many times each Race value occurs. How many of the people in the data set have Filipino race?

  2. Use indexing (not the subset function) to make a plot of the table of the Age values at the time of death for which the Race value is the number 7 (which stands for Filipino race) and for which the Age is not 999.

Deliverables
  • a. A table of the values in the Race column and how many times each Race value occurs. Also, state how many of the people in the data set have Filipino race.

  • b. Plot of the table of Age values for people with Filipino race, also with Age not equal to 999.

Question 2 (2 pts)

In the grocery store file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

Let’s re-examine Project 5, Question 2, as follows: Make a table of the values in the myDF$STORE_R column that satisfy the index condition myDF$SPEND < 0. In this way, you can re-create your answer to Project 5, Question 2, without using the subset function.
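The pattern described here (index one column by a condition on another column, then tabulate) can be sketched on toy data as follows. The data frame `toyDF` and its values are hypothetical; only the column names STORE_R and SPEND are taken from the question.

```r
# Toy data (hypothetical values, not from the grocery store file)
toyDF <- data.frame(
  STORE_R = c("EAST", "WEST", "EAST", "SOUTH", "WEST"),
  SPEND   = c(-1.50, 3.25, -0.99, 4.00, -2.00),
  stringsAsFactors = FALSE
)

# Keep only the STORE_R values from rows where SPEND is negative, then count them
refunds <- table(toyDF$STORE_R[toyDF$SPEND < 0])
refunds
```

In this toy example, EAST appears twice and WEST once, because those are the rows with a negative SPEND value.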

Deliverables
  • Show the number of refunds for each STORE_R value, using only indexing (in other words, without using the subset function). For instance, CENTRAL stores had 2750 refunds.

Question 3 (2 pts)

In this file of beer reviews /anvil/projects/tdm/data/beer/reviews_sample.csv

  1. Make a table of the values in the column myDF$username but do not print all of the values. Please sort the values and show only the tail, so that you can see the most popular 6 username values, and the number of reviews that each of these 6 people wrote. Hint: The user named acurtis wrote the most reviews!

  2. In part 3b, consider only the reviews written by the user acurtis. What is the average score of the reviews that were written by the user acurtis?
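Both steps above can be sketched on toy data. The data frame `toyDF`, the usernames, and the scores are all made up, and the column name `score` is an assumption for illustration (please check the actual column names in the beer reviews file).

```r
# Toy data (hypothetical usernames and scores, not from the beer reviews file)
toyDF <- data.frame(
  username = c("ann", "bob", "ann", "cy", "ann", "bob"),
  score    = c(4.0, 3.5, 4.5, 2.0, 5.0, 3.0),
  stringsAsFactors = FALSE
)

# sort() puts the counts in increasing order, so the largest counts are at the end
counts <- sort(table(toyDF$username))

# tail() shows only the last few entries (here 2; its default is 6)
tail(counts, n = 2)

# Average score over the rows written by one particular user
mean(toyDF$score[toyDF$username == "ann"])
```

Because sort uses increasing order by default, tail is the natural way to see the most frequent values without printing the whole table.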

Deliverables
  • a. Print the most popular 6 username values, and the number of reviews that each of these 6 people wrote.

  • b. Find the average score of the reviews that were written by the user acurtis.

Question 4 (2 pts)

Read in the 1980 election data using:

library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

Revisit Question 4 from Project 5, and find the 9 NAME values for which the TRANSACTION_DT value is missing, using indexing instead of the subset function.
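As a hint about indexing on missing values, here is a minimal sketch on toy data. The data frame `toyDF`, the names, and the dates are hypothetical, and the sketch assumes the missing values are stored as NA, which is worth verifying in the real data.

```r
# Toy data (hypothetical names and dates, not from the election file)
toyDF <- data.frame(
  NAME           = c("SMITH, A", "JONES, B", "LEE, C"),
  TRANSACTION_DT = c(7041980, NA, 9151980),
  stringsAsFactors = FALSE
)

# is.na() returns TRUE exactly where a value is missing;
# a comparison such as == NA would give NA instead of TRUE/FALSE
missing_dt <- is.na(toyDF$TRANSACTION_DT)

# Index the NAME column by that logical vector
toyDF$NAME[missing_dt]
```

The same `$` and square bracket pattern also works on a data.table read in with fread, since a column extracted with `$` is an ordinary vector.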

Deliverables
  • Give the 9 NAME values for which the TRANSACTION_DT value is missing, using indexing instead of the subset function.

Question 5 (2 pts)

Consider the 1990 flight data:

/anvil/projects/tdm/data/flights/subset/1990.csv

Using indexing (not the subset function) find the mean of the DepDelay of all of the flights whose Origin airport is EWR or JFK or LGA.
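One way to build an index for "this value or that value or another value" is the %in% operator. The sketch below uses a toy data frame with made-up delays. It also assumes that some DepDelay values may be missing (NA), which is why na.rm = TRUE appears; please check whether that applies to the real file.

```r
# Toy data (hypothetical values, not from the 1990 flight file)
toyDF <- data.frame(
  Origin   = c("EWR", "ORD", "JFK", "LGA", "EWR"),
  DepDelay = c(10, 5, NA, 20, 0),
  stringsAsFactors = FALSE
)

# %in% is TRUE for each row whose Origin matches any of the listed airports
nyc <- toyDF$Origin %in% c("EWR", "JFK", "LGA")

# Average the selected delays, ignoring any missing values
mean(toyDF$DepDelay[nyc], na.rm = TRUE)
```

An equivalent index could be written with three == comparisons joined by the | operator; %in% is just shorter when there are several values to match.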

Deliverables
  • Give the mean of the DepDelay of all of the flights whose Origin airport is EWR or JFK or LGA.

Submitting your Work

We are becoming very familiar with missing data and with subsets of data! These concepts take practice. Please continue to ask questions on Piazza, and/or in office hours.

Items to submit
  • firstname_lastname_project6.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.