TDM 10100: R Project 5 — 2024
Motivation: Real-world data often has many missing values. It is also helpful to be able to take a subset of the data.
Context: It is worthwhile to be prepared to have missing data and to know how to work with it.
Scope: Dealing with missing data, and taking subsets of data.
Dataset(s)
This project will use the following dataset(s):
- /anvil/projects/tdm/data/death_records/DeathRecords.csv
- /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
- /anvil/projects/tdm/data/beer/reviews_sample.csv
- /anvil/projects/tdm/data/election/itcont1980.txt
- /anvil/projects/tdm/data/flights/subset/1990.csv
Questions
As before, please use the |
If your session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core.
Question 1 (2 pts)
In the death records file:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
- Build a subset of the data for which Sex=='F' and check the head of the subset to make sure that you only have 'F' values in the Sex column of your subset.
- Make a table of the Age values from the subset of female data in question 1a, and plot the table of these Age values. (Notice that 999 is used when the Age value is missing in part 1b!)
- Now revise your subset from question 1a, so that you build a subset of the data for which Sex=='F' & Age!=999 and then make a table of the Age values from this revised subset of female data and plot the table of these Age values.

- a. The head of the subset of data for which Sex=='F'
- b. Plot of the table of Age values for the subset in 1a.
- c. Revise questions 1a and 1b so that Sex=='F' & Age!=999
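
The pattern in this question (subset, tabulate, plot, then subset again without the placeholder value) can be sketched on a tiny made-up data frame. This is only an illustration: the six rows below are invented, and the names femaleDF, ageTable, and femaleKnownAgeDF are hypothetical. The real file has many more columns and rows.

```r
# Made-up stand-in for the death records data (values are invented).
myDF <- data.frame(
  Sex = c("F", "M", "F", "F", "M", "F"),
  Age = c(82, 45, 999, 82, 999, 67)
)

# Part a pattern: keep only the rows where Sex is "F", then inspect them.
femaleDF <- subset(myDF, Sex == "F")
head(femaleDF)

# Part b pattern: tabulate the Age values and plot the table.
# The placeholder 999 still shows up as its own category here.
ageTable <- table(femaleDF$Age)
plot(ageTable)

# Part c pattern: add a second condition to drop the 999 placeholder.
femaleKnownAgeDF <- subset(myDF, Sex == "F" & Age != 999)
plot(table(femaleKnownAgeDF$Age))
```

Because 999 is an ordinary number as far as R is concerned, it has to be filtered out explicitly; is.na() would not catch it.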
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
there are more than 10 million lines of data, as we can see if we check dim(myDF). Each line corresponds to the purchase of an item. The SPEND column is negative when a purchase is refunded, i.e., the item is returned and the money is given back to the customer.
Create a smaller data set called refundsDF that contains only the lines of data for which the SPEND column is negative. Make a table of the STORE_R values in this refundsDF subset, and show the number of times that each STORE_R value appears in the refundsDF subset.
- Show the number of refunds for each STORE_R value in the refundsDF subset. (For instance, CENTRAL stores had 2750 refunds.)
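
As a rough sketch of the idea, here is the same filter-then-tabulate step on a small invented data frame. The SPEND amounts and every STORE_R label other than CENTRAL are made up for illustration, and refundTable is a hypothetical name.

```r
# Made-up stand-in for the transactions data (values are invented).
myDF <- data.frame(
  SPEND   = c(3.50, -1.25, 7.00, -4.00, 2.10, -0.99),
  STORE_R = c("EAST", "CENTRAL", "WEST", "CENTRAL", "SOUTH", "EAST")
)

# Keep only the rows with a negative SPEND value (the refunds).
refundsDF <- myDF[myDF$SPEND < 0, ]

# Count how many times each STORE_R value appears among the refunds.
refundTable <- table(refundsDF$STORE_R)
refundTable
```

Note that table() only lists values that actually occur in refundsDF, so a region with no refunds would not appear in the output of this sketch.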
Question 3 (2 pts)
In this file of beer reviews:
/anvil/projects/tdm/data/beer/reviews_sample.csv
Make a subset of the beers that have (score != 5) & (overall == 5) (in other words, the score value is not equal to 5 but the overall value is equal to 5). How many lines of data are in this subset?
- How many lines of data are in the subset that has (score != 5) & (overall == 5)?
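
A minimal sketch of combining two conditions with & and counting the matching rows, using invented values. One row deliberately has a missing score, to show how a missing value can affect the count. The names cond, mySubset, and bracketSubset are hypothetical.

```r
# Made-up stand-in for the beer reviews (values are invented).
myDF <- data.frame(
  score   = c(5, 4.5, 4.8, 5, 3.9, NA),
  overall = c(5, 5, 4.5, 4, 5, 5)
)

# The condition is NA wherever it cannot be decided (here, the last row).
cond <- (myDF$score != 5) & (myDF$overall == 5)

# subset() silently drops rows where the condition is NA.
mySubset <- subset(myDF, (score != 5) & (overall == 5))
nrow(mySubset)

# Bracket indexing keeps an all-NA row for each NA in the condition,
# so the row count can differ from subset().
bracketSubset <- myDF[cond, ]
nrow(bracketSubset)

# Counting TRUE values directly, ignoring the undecided rows.
sum(cond, na.rm = TRUE)
```

Whether the real file has missing values in these two columns is not stated here, so it is worth checking before trusting either count.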
Question 4 (2 pts)
Read in the 1980 election data using:
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
There are only 9 entries in which the TRANSACTION_DT value is missing, namely: one donation from CURCIO, BARBARA G, two donations from WOLFF, GARY W., and six donations from a third person. Find the name of the person who made six donations in 1980 with a missing TRANSACTION_DT.
- Find the name of the person who made 6 donations in 1980 with a missing TRANSACTION_DT.
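
One possible approach, sketched on invented data: select the rows where the date is missing, then tabulate the names in those rows. This assumes the missing TRANSACTION_DT values are read in as NA, which is worth confirming on the real data. The donor names and dates below are made up, and missingDF and nameCounts are hypothetical names. The sketch uses a plain data.frame so that it runs without data.table.

```r
# Made-up stand-in for the election data (names and dates are invented).
myDF <- data.frame(
  NAME = c("DOE, JANE", "ROE, RICHARD", "DOE, JANE", "POE, PAT", "DOE, JANE"),
  TRANSACTION_DT = c(NA, 3151980, NA, NA, 7041980)
)

# Keep only the rows where TRANSACTION_DT is missing (assumed to be NA).
missingDF <- myDF[is.na(myDF$TRANSACTION_DT), ]

# Count donations per name among those rows, largest count first.
nameCounts <- sort(table(missingDF$NAME), decreasing = TRUE)
nameCounts
```

The first entry of nameCounts is then the name with the most missing-date donations in the toy data.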
Question 5 (2 pts)
Consider the 1990 flight data:
/anvil/projects/tdm/data/flights/subset/1990.csv
This data set has information about 5270893 flights.
- For how many flights is the DepDelay missing and also (simultaneously) the ArrDelay is missing too?
- For how many flights is the DepDelay given but the ArrDelay is missing?
- For how many flights is the ArrDelay given but the DepDelay is missing?

- a. Find the number of flights for which the DepDelay is missing and also (simultaneously) the ArrDelay is missing too.
- b. Find the number of flights for which the DepDelay is given but the ArrDelay is missing.
- c. Find the number of flights for which the ArrDelay is given but the DepDelay is missing.
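
The three counts differ only in how two is.na() tests are combined. The sketch below shows this on six invented rows, assuming that missing delays are read in as NA. The names depMissing, arrMissing, bothMissing, onlyArrMissing, and onlyDepMissing are hypothetical.

```r
# Made-up stand-in for the flight data (values are invented).
myDF <- data.frame(
  DepDelay = c(5, NA, 12, NA, 0, NA),
  ArrDelay = c(NA, NA, 20, 3, NA, NA)
)

# One logical vector per column: TRUE where the value is missing.
depMissing <- is.na(myDF$DepDelay)
arrMissing <- is.na(myDF$ArrDelay)

# sum() of a logical vector counts the TRUE values.
bothMissing    <- sum(depMissing & arrMissing)   # both missing
onlyArrMissing <- sum(!depMissing & arrMissing)  # DepDelay given, ArrDelay missing
onlyDepMissing <- sum(depMissing & !arrMissing)  # ArrDelay given, DepDelay missing

# A two-way table shows all four combinations at once.
table(depMissing, arrMissing)
```

The two-way table is a convenient cross-check, since its four cells must add up to the total number of rows.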
Submitting your Work
We are becoming very familiar with missing data and with subsets of data! These concepts take practice. Please continue to ask questions on Piazza, and/or in office hours.
- firstname_lastname_project5.ipynb
You must double check your You will not receive full credit if your |