TDM 10100: R Project 7 — 2024

Motivation: We continue to learn about vectorized operations in R.

Context: Many functions and methods of indexing in R are much more powerful and easier to use than those in other tools.

Scope: We will get familiar with several more types of vectorized operations in R.

Learning Objectives:
  • Vectorized operations in R.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

  • /anvil/projects/tdm/data/beer/reviews_sample.csv

  • /anvil/projects/tdm/data/election/itcont1980.txt

  • /anvil/projects/tdm/data/flights/subset/1990.csv

Questions

As before, please use the seminar-r kernel (not the seminar kernel). You do not need to use the %%R cell magic.

If your session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core.

Question 1 (2 pts)

In the death records file:

/anvil/projects/tdm/data/death_records/DeathRecords.csv

Use the cut command to classify people at their time of death into 5 categories:

    "youth": less than or equal to 18 years old

    "young adult": older than 18 but less than or equal to 25 years old

    "adult": older than 25 but less than or equal to 35 years old

    "middle age adult": older than 35 but less than or equal to 55 years old

    "senior adult": greater than 55 years old
  1. First wrap the results of your cut function into a table.

  2. Use the option useNA="always" in the table to find out how many people’s ages were unknown at the time of their death.

  3. In the cut function, add labels corresponding to the 5 categories above.

  4. Now wrap the table into a barplot that shows the number of people in each of the 5 categories and also the number of people whose age is unknown.
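The steps above can be tried first on a small made-up vector before running them on the full file. The following is only a sketch: the `ages` values are invented for illustration and are not taken from the death records data.

```r
# Toy ages (hypothetical values, not taken from DeathRecords.csv); NA marks an unknown age
ages <- c(5, 18, 19, 25, 30, 40, 55, 56, 90, NA)

# Intervals are right-closed by default, so 18 falls in the first group and 19 in the second
age_groups <- cut(ages,
                  breaks = c(-Inf, 18, 25, 35, 55, Inf),
                  labels = c("youth", "young adult", "adult",
                             "middle age adult", "senior adult"))

# useNA = "always" adds a final count for the missing values
age_counts <- table(age_groups, useNA = "always")
print(age_counts)

# barplot draws one bar per table entry, including the unlabeled NA bar
barplot(age_counts)
```

Leaving out the `labels` argument gives the default interval labels instead of the category names.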

Deliverables
  • a. A table showing how many people are in each of the 5 categories above at the time of their death. (The labels for part a should be the default labels, i.e., like this: (-Inf,18] (18,25] (25,35] (35,55] (55, Inf])

  • b. Same table output as in part a but adding the option useNA="always" to show how many people’s ages were unknown at the time of their death.

  • c. Same table output as in part b but now also adding labels corresponding to the 5 categories above. (It is not necessary to put a label on the unknown age category.)

  • d. A barplot that shows the number of people in each of the 5 categories and also the number of people whose age is unknown. (The 6th bar in the barplot, corresponding to the number of people with unknown age, does not need a label.)

Question 2 (2 pts)

In the grocery store file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

Use the tapply function to sum the values from the SPEND column, according to 8 categories, namely, according to whether the YEAR is 2016 or 2017, and according to whether the STORE_R value is CENTRAL, EAST, SOUTH, or WEST.
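As a small illustration of grouping by two variables at once, here is a sketch of tapply on a made-up data frame. The values are invented, and the data frame name `purchases` is hypothetical; only the column names SPEND, YEAR, and STORE_R come from the question.

```r
# Toy purchases (hypothetical values); column names mirror those named in the question
purchases <- data.frame(
  SPEND   = c(1.50, 2.25, 4.00, 3.75, 0.50, 6.00),
  YEAR    = c(2016, 2016, 2017, 2017, 2016, 2017),
  STORE_R = c("EAST", "WEST", "EAST", "EAST", "EAST", "WEST")
)

# Passing a list of two grouping vectors gives a matrix:
# rows follow the first grouping (YEAR), columns follow the second (STORE_R)
spend_totals <- tapply(purchases$SPEND,
                       list(purchases$YEAR, purchases$STORE_R),
                       sum)
print(spend_totals)
```

Swapping the order of the two vectors in the list swaps the rows and columns of the result.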

Deliverables
  • Show the sum of the values in the SPEND column according to the 8 possible pairs of YEAR and STORE_R values.

Question 3 (2 pts)

In this file of beer reviews /anvil/projects/tdm/data/beer/reviews_sample.csv

Use tapply to compute the mean score for each month and year pair. Your tapply should output a table with the years as the row labels and the months as the column labels.
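One possible way to get years as rows and months as columns is sketched below on made-up data. The column names `date` and `score` are assumptions for illustration; the real file may store its dates differently, so check the actual column names and date format before adapting this.

```r
# Toy reviews (hypothetical values and column names, not taken from reviews_sample.csv)
reviews <- data.frame(
  date  = as.Date(c("2010-01-15", "2010-01-20", "2010-02-03",
                    "2011-01-09", "2011-02-14", "2011-02-28")),
  score = c(4.0, 3.0, 4.5, 2.0, 5.0, 4.0)
)

# Pull the year and month out of each date as integers
review_year  <- as.integer(format(reviews$date, "%Y"))
review_month <- as.integer(format(reviews$date, "%m"))

# First grouping vector gives the rows (years); second gives the columns (months)
mean_scores <- tapply(reviews$score, list(review_year, review_month), mean)
print(mean_scores)
```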

Deliverables
  • Print a table displaying the mean score values for each month and year pair.

Question 4 (2 pts)

Read in the 1980 election data using:

library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

In this question, we do not care about the dollar amounts of the election donations. In other words, do not pay any attention to the TRANSACTION_AMT column. Only pay attention to the number of donations. There is one donation per row in the data set.

  1. Use the subset function to get a data frame that contains only the donations for which the STATE is IN. From the CITY column of this subset, make a table of the number of occurrences of each CITY. Sort the table and print the largest 41 entries.

  2. Same question as part a, but this time, do not use the subset function. Instead, consider the elements from the CITY column for which the STATE value is IN. Amongst these restricted CITY values, make a table of the number of occurrences of each CITY. Sort the table and print the largest 41 entries. (Your results from questions 4a and 4b should look the same, even though they come from two different methods.)

  3. Find at least one strange thing about the top 41 entries in your result.
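The two approaches in parts a and b can be compared on a tiny made-up data frame. This is a sketch only: the rows are invented, and the name `donations` is hypothetical (the question itself uses myDF).

```r
# Toy donations (hypothetical values); one row per donation
donations <- data.frame(
  CITY  = c("WEST LAFAYETTE", "INDIANAPOLIS", "CHICAGO",
            "INDIANAPOLIS", "FORT WAYNE", "INDIANAPOLIS"),
  STATE = c("IN", "IN", "IL", "IN", "IN", "IN")
)

# Approach 1: subset() keeps the matching rows, then we tabulate CITY
indiana_rows <- subset(donations, STATE == "IN")
counts_subset <- sort(table(indiana_rows$CITY), decreasing = TRUE)

# Approach 2: index the CITY vector directly with a logical condition
counts_index <- sort(table(donations$CITY[donations$STATE == "IN"]),
                     decreasing = TRUE)

# head(..., n = 2) prints the two largest entries of this toy table
print(head(counts_subset, n = 2))
print(head(counts_index, n = 2))
```

Both approaches count the same rows, so the two sorted tables agree.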

Deliverables
  • a. Using the subset function, give a table of the top 41 cities in Indiana, according to the number of donations from people in that city.

  • b. Using indexing (not a subset), give a table of the top 41 cities in Indiana, according to the number of donations from people in that city.

  • c. Find at least one strange thing about the top 41 entries in your result.

Question 5 (2 pts)

Consider the 1990 flight data:

/anvil/projects/tdm/data/flights/subset/1990.csv

The DepDelay values are given in minutes. We will classify the flights according to how many hours each flight was delayed.

Use the cut command to classify the number of flights in each of these categories:

Flight departed early or on time, i.e., DepDelay is negative or 0.

Flight departed more than 0 but less than or equal to 60 minutes late.

Flight departed more than 60 but less than or equal to 120 minutes late.

Flight departed more than 120 but less than or equal to 180 minutes late.

Flight departed more than 180 but less than or equal to 240 minutes late.

Flight departed more than 240 but less than or equal to 300 minutes late.

Etc., etc., and finally:

Flight departed more than 1380 but less than or equal to 1440 minutes late.

Make a table that shows the number of flights in each of these categories.

Use the useNA="always" option in the table, so that the number of flights without a known DepDelay is also given.

In the cut command, the output will look nicer if you use the option dig.lab = 4.
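Rather than typing every break point by hand, one option is to build the hourly breaks with seq. The sketch below uses a made-up vector of delays; the name `dep_delay` and its values are hypothetical and are not taken from the flight data.

```r
# Toy departure delays in minutes (hypothetical values); NA marks an unknown delay
dep_delay <- c(-5, 0, 30, 60, 61, 125, 1400, NA)

# -Inf catches early and on-time flights; seq() builds the hourly cut points 0, 60, ..., 1440
delay_breaks <- c(-Inf, seq(0, 1440, by = 60))

# dig.lab = 4 keeps four-digit break values such as 1380 and 1440 out of scientific notation
delay_groups <- cut(dep_delay, breaks = delay_breaks, dig.lab = 4)

# useNA = "always" adds a final count for the missing values
delay_counts <- table(delay_groups, useNA = "always")
print(delay_counts)
```

Note that cut returns NA for any value above the last break, so delays longer than 1440 minutes would be counted together with the unknown delays.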

Deliverables
  • Give the table described above, which classifies the number of flights according to the number of hours that the flights are delayed.

Submitting your Work

You are now knowledgeable about a wide range of R functions. Please continue to practice and to ask good questions.

Items to submit
  • firstname_lastname_project7.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.