TDM 10100: R Project 5 — 2024

Motivation: Real-world data often contains many missing values. It is also helpful to be able to take a subset of data.

Context: It is worthwhile to be prepared to have missing data and to know how to work with it.

Scope: Dealing with missing data, and taking subsets of data.

Learning Objectives:
  • Learn how to work with missing data and how to take subsets of data.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

  • /anvil/projects/tdm/data/beer/reviews_sample.csv

  • /anvil/projects/tdm/data/election/itcont1980.txt

  • /anvil/projects/tdm/data/flights/subset/1990.csv

Questions

As before, please use the seminar-r kernel (not the seminar kernel). You do not need to use the %%R cell magic.

If your session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core.

Question 1 (2 pts)

In the death records file:

/anvil/projects/tdm/data/death_records/DeathRecords.csv

  1. Build a subset of the data for which Sex=='F' and check the head of the subset to make sure that you only have 'F' values in the Sex column of your subset.

  2. Make a table of the Age values from the subset of female data in question 1a, and plot the table of these Age values. (Notice that 999 is used when the Age value is missing in part 1b!)

  3. Now revise your subset from question 1a, so that you build a subset of the data for which Sex=='F' & Age!=999, and then make a table of the Age values from this revised subset of female data and plot the table of these Age values.

Deliverables
  • a. The head of the subset of data for which Sex=='F'

  • b. Plot of the table of Age values for the subset in 1a.

  • c. Revise questions 1a and 1b so that Sex=='F' & Age!=999
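
The subsetting pattern in this question can be tried first on a tiny data frame. The sketch below is only an illustration: toyDF and its values are made up and are not taken from DeathRecords.csv, and it assumes (as in the question) that 999 marks a missing Age.

```r
# Hypothetical toy data standing in for the death records (values are made up)
toyDF <- data.frame(
  Sex = c("F", "M", "F", "F", "M", "F"),
  Age = c(82, 75, 999, 82, 999, 67),
  stringsAsFactors = FALSE
)

# Keep only the rows where Sex is "F" and inspect the first rows
femaleDF <- subset(toyDF, Sex == "F")
head(femaleDF)

# Tabulate the Age values; the sentinel 999 appears as its own category
ageTable <- table(femaleDF$Age)
plot(ageTable)

# Also exclude the sentinel value, then tabulate and plot again
femaleKnownAgeDF <- subset(toyDF, Sex == "F" & Age != 999)
knownAgeTable <- table(femaleKnownAgeDF$Age)
plot(knownAgeTable)
```

With the sentinel still present, the plot is stretched out to 999 on the horizontal axis, which is one reason to remove those rows before plotting.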

Question 2 (2 pts)

In the grocery store file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

there are more than 10 million lines of data, as we can see if we check dim(myDF). Each line corresponds to the purchase of an item. The SPEND column is negative when a purchase is refunded, i.e., the item is returned and the money is given back to the customer.

Create a smaller data set called refundsDF that contains only the lines of data for which the SPEND column is negative. Make a table of the STORE_R values in this refundsDF subset, and show the number of times that each STORE_R value appears in the refundsDF subset.

Deliverables
  • Show the number of refunds for each STORE_R value in the refundsDF subset. (For instance, CENTRAL stores had 2750 refunds.)
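
A minimal sketch of the same idea on made-up data may help before working with the full file. The data frame toyDF and its values are hypothetical; only the column names SPEND and STORE_R come from the question.

```r
# Hypothetical toy transactions (values are made up for illustration)
toyDF <- data.frame(
  SPEND   = c(3.50, -1.25, 7.00, -4.10, -0.99, 2.00),
  STORE_R = c("EAST", "EAST", "WEST", "CENTRAL", "EAST", "WEST"),
  stringsAsFactors = FALSE
)

# Number of rows and columns
dim(toyDF)

# Keep only the refunded purchases, i.e., the rows with negative SPEND
refundsDF <- subset(toyDF, SPEND < 0)

# Count how many refunds occurred for each STORE_R value
refundCounts <- table(refundsDF$STORE_R)
refundCounts
```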

Question 3 (2 pts)

In this file of beer reviews /anvil/projects/tdm/data/beer/reviews_sample.csv

Make a subset of the beers that have (score != 5) & (overall == 5) (in other words the score value is not equal to 5 but the overall value is equal to 5). How many lines of data are in this subset?

Deliverables
  • How many lines of data are in the subset that has (score != 5) & (overall == 5) ?
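
When combining conditions, it is worth knowing how missing values behave. The sketch below uses a made-up data frame (toyDF is hypothetical, and the real file may or may not contain NA values in these columns). It shows that subset() drops rows where the condition is NA, while indexing with square brackets keeps such a row as a row of NA values.

```r
# Hypothetical toy reviews (values are made up; the last score is missing)
toyDF <- data.frame(
  score   = c(5, 4.5, 5, 3.75, NA),
  overall = c(5, 5, 4, 5, 5)
)

# The condition is NA wherever score is NA and overall == 5
cond <- (toyDF$score != 5) & (toyDF$overall == 5)

# subset() silently drops the rows where the condition is NA
mySubset <- subset(toyDF, (score != 5) & (overall == 5))
nrow(mySubset)

# Bracket indexing keeps an extra all-NA row for each NA in the condition
bracketSubset <- toyDF[cond, ]
nrow(bracketSubset)

# Counting TRUE values directly, ignoring the NA
sum(cond, na.rm = TRUE)
```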

Question 4 (2 pts)

Read in the 1980 election data using:

library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

There are only 9 entries in which the TRANSACTION_DT value is missing: one donation from CURCIO, BARBARA G, two donations from WOLFF, GARY W., and six donations from one other person. Find the name of the person who made six donations in 1980 with a missing TRANSACTION_DT.

Deliverables
  • Find the name of the person who made 6 donations in 1980 with a missing TRANSACTION_DT.
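
One way to approach this kind of question is to subset on the missing values and then tabulate the names. The sketch below uses made-up names and dates (toyDF is hypothetical), and it assumes that a missing TRANSACTION_DT is stored as NA. If the column is read in as text instead, a missing value might appear as an empty string, so that case is worth checking too.

```r
# Hypothetical toy donations (names and dates are made up)
toyDF <- data.frame(
  NAME = c("DOE, JANE", "ROE, RICHARD", "DOE, JANE", "POE, ALEX", "DOE, JANE"),
  TRANSACTION_DT = c(NA, 3151980, NA, NA, 7041980),
  stringsAsFactors = FALSE
)

# Keep only the rows where the date is missing
missingDF <- subset(toyDF, is.na(TRANSACTION_DT))

# Count the donations per name, largest count first
nameCounts <- sort(table(missingDF$NAME), decreasing = TRUE)
nameCounts

# The name with the most missing-date donations
topName <- names(nameCounts)[1]
topName
```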

Question 5 (2 pts)

Consider the 1990 flight data:

/anvil/projects/tdm/data/flights/subset/1990.csv

This data set has information about 5270893 flights.

  1. For how many flights are both the DepDelay and the ArrDelay missing?

  2. For how many flights is the DepDelay given but the ArrDelay is missing?

  3. For how many flights is the ArrDelay given but the DepDelay is missing?

Deliverables
  • a. Find the number of flights for which both the DepDelay and the ArrDelay are missing.

  • b. Find the number of flights for which the DepDelay is given but the ArrDelay is missing.

  • c. Find the number of flights for which the ArrDelay is given but the DepDelay is missing.
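
The three counts above can be sketched with is.na() on a small made-up data frame. Here toyDF is hypothetical, and the sketch assumes that missing delays are stored as NA after the file is read in.

```r
# Hypothetical toy flights (values are made up for illustration)
toyDF <- data.frame(
  DepDelay = c(NA, 5, NA, 0, 12, NA),
  ArrDelay = c(NA, NA, 3, -2, NA, NA)
)

# Both delays missing
bothMissing <- sum(is.na(toyDF$DepDelay) & is.na(toyDF$ArrDelay))

# DepDelay given but ArrDelay missing
onlyArrMissing <- sum(!is.na(toyDF$DepDelay) & is.na(toyDF$ArrDelay))

# ArrDelay given but DepDelay missing
onlyDepMissing <- sum(is.na(toyDF$DepDelay) & !is.na(toyDF$ArrDelay))

# A two-way table shows all four combinations at once
missingTable <- table(DepMissing = is.na(toyDF$DepDelay),
                      ArrMissing = is.na(toyDF$ArrDelay))
missingTable
```

Summing a logical vector counts its TRUE values, so each sum() gives the number of rows that satisfy the condition.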

Submitting your Work

We are becoming very familiar with missing data and with subsets of data! These concepts take practice. Please continue to ask questions on Piazza, and/or in office hours.

Items to submit
  • firstname_lastname_project5.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.