TDM 10100: R Project 7 — 2024
Motivation: We continue to learn about vectorized operations in R.
Context: Many functions and indexing methods in R are more powerful and easier to use than their counterparts in other tools.
Scope: We will get familiar with several more types of vectorized operations in R.
Dataset(s)
This project will use the following dataset(s):
- /anvil/projects/tdm/data/death_records/DeathRecords.csv
- /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
- /anvil/projects/tdm/data/beer/reviews_sample.csv
- /anvil/projects/tdm/data/election/itcont1980.txt
- /anvil/projects/tdm/data/flights/subset/1990.csv
Questions
As before, please use the |
If your session crashes when you read in the data (for instance, on question 2), you might want to try using 2 cores in your session instead of 1 core.
Question 1 (2 pts)
In the death records file:
/anvil/projects/tdm/data/death_records/DeathRecords.csv
Use the cut command to classify people at their time of death into 6 categories:
"youth": less than or equal to 18 years old
"young adult": older than 18 but less than or equal to 25 years old
"adult": older than 25 but less than or equal to 35 years old
"middle age adult": older than 35 but less than or equal to 55 years old
"senior adult": greater than 55 years old but less than or equal to 150 years old (or any other upper threshold that you like)
"unknown": age of 999 (you could use, say, ages 150 to Inf for this category)
- First wrap the results of your cut function into a table.
- In the cut function, add labels corresponding to the 6 categories above.
- Now wrap the table into a barplot that shows the number of people in each of the 6 categories above.
- a. A table showing how many people are in each of the 6 categories above at the time of their death. (The labels for part a should be the default labels, i.e., like this: (-Inf,18] (18,25] (25,35] (35,55] (55,150] (150, Inf])
- b. Same table output as in part a but now also adding labels corresponding to the 6 categories above.
- c. A barplot that shows the number of people in each of the 6 categories above.
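The behavior of cut described above can be tried on a small vector first. The following is only a sketch on made-up ages (the vector ages and the names mybreaks and mylabels are hypothetical, not taken from DeathRecords.csv); it shows that intervals are closed on the right by default and that the number of labels must be one fewer than the number of breaks.

```r
# Toy ages, invented for illustration (not read from the death records file)
ages <- c(3, 18, 19, 25, 30, 40, 55, 56, 80, 999)

# Six intervals need seven break points; each interval is (lower, upper] by default
mybreaks <- c(-Inf, 18, 25, 35, 55, 150, Inf)

# Table with the default interval labels
default_tab <- table(cut(ages, breaks = mybreaks))
print(default_tab)

# Same classification with one custom label per interval
mylabels <- c("youth", "young adult", "adult",
              "middle age adult", "senior adult", "unknown")
labeled_tab <- table(cut(ages, breaks = mybreaks, labels = mylabels))
print(labeled_tab)

# A barplot of the labeled counts
barplot(labeled_tab)
```

Note that the age 18 lands in the first interval and the age 999 lands in the last one, which is why an upper break of Inf is a convenient way to catch the "unknown" code.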
Question 2 (2 pts)
In the grocery store file:
/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv
Use the tapply function to sum the values from the SPEND column, according to 8 categories, namely, according to whether the YEAR is 2016 or 2017, and according to whether the STORE_R value is CENTRAL, EAST, SOUTH, or WEST.
- Show the sum of the values in the SPEND column according to the 8 possible pairs of YEAR and STORE_R values.
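Grouping by two variables at once in tapply is done by passing a list of grouping vectors. Here is a minimal sketch on a made-up data frame (toyDF and all of its values are hypothetical; only the column names mirror the ones named in the question, and the column names in the real file may need to be checked after reading it in).

```r
# Toy data frame with invented values
toyDF <- data.frame(
  SPEND   = c(1.50, 2.25, 4.00, 3.75, 0.50, 6.00),
  YEAR    = c(2016, 2016, 2017, 2017, 2016, 2017),
  STORE_R = c("EAST", "WEST", "EAST", "EAST", "EAST", "WEST"),
  stringsAsFactors = FALSE
)

# First argument: values to summarize
# Second argument: list of grouping vectors (first gives rows, second gives columns)
# Third argument: function applied within each group
spend_by_group <- tapply(toyDF$SPEND, list(toyDF$YEAR, toyDF$STORE_R), sum)
print(spend_by_group)
```

The result is a matrix with one row per YEAR value and one column per STORE_R value, so each cell corresponds to one pair.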
Question 3 (2 pts)
In this file of beer reviews:
/anvil/projects/tdm/data/beer/reviews_sample.csv
Use tapply to compute the mean score values for each month and year pair. Your tapply should output a table with the years as the row labels and the months as the column labels.
- Print a table displaying the mean score values for each month and year pair.
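The order of the grouping vectors controls which one becomes the row labels. The sketch below uses three made-up vectors (score, year, and month are hypothetical and already separated; how the month and year are obtained from the real file depends on how its date column is stored, which is not shown here).

```r
# Toy vectors with invented values
score <- c(4.0, 3.0, 5.0, 2.0, 4.5)
year  <- c(2010, 2010, 2010, 2011, 2011)
month <- c(1, 1, 2, 1, 1)

# year is listed first, so years label the rows and months label the columns
mean_scores <- tapply(score, list(year, month), mean)
print(mean_scores)
```

A month and year pair with no reviews appears as NA in the output, as happens here for month 2 of 2011.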
Question 4 (2 pts)
Read in the 1980 election data using:
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
In this question, we do not care about the dollar amounts of the election donations. In other words, do not pay any attention to the TRANSACTION_AMT column. Only pay attention to the number of donations. There is one donation per row in the data set.
- Use the subset function to get a data frame that contains only the donations for which the STATE is IN. From the CITY column of this subset, make a table of the number of occurrences of each CITY. Sort the table and print the largest 41 entries.
- Same question as part a, but this time, do not use the subset function. Instead, consider the elements from the CITY column for which the STATE value is IN. Amongst these restricted CITY values, make a table of the number of occurrences of each CITY. Sort the table and print the largest 41 entries. (Your results from questions 4a and 4b should look the same, even though they use two different methods.)
- Find at least one strange thing about the top 41 entries in your result.
- a. Using the subset function, give a table of the top 41 cities in Indiana, according to the number of donations from people in that city.
- b. Using indexing (not a subset), give a table of the top 41 cities in Indiana, according to the number of donations from people in that city.
- c. Find at least one strange thing about the top 41 entries in your result.
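The two approaches in parts a and b can be compared side by side on a tiny example. This is only a sketch with a made-up data frame (toyDF and its rows are hypothetical, and it keeps the top 2 entries rather than 41 because the toy data is so small).

```r
# Toy data frame with invented rows; one row stands for one donation
toyDF <- data.frame(
  CITY  = c("INDIANAPOLIS", "WEST LAFAYETTE", "INDIANAPOLIS", "CHICAGO",
            "FORT WAYNE", "INDIANAPOLIS", "FORT WAYNE"),
  STATE = c("IN", "IN", "IN", "IL", "IN", "IN", "IN"),
  stringsAsFactors = FALSE
)

# Approach a: build a smaller data frame with subset, then tabulate its CITY column
indianaDF <- subset(toyDF, STATE == "IN")
by_subset <- tail(sort(table(indianaDF$CITY)), n = 2)

# Approach b: index the CITY column directly with a logical condition on STATE
by_index <- tail(sort(table(toyDF$CITY[toyDF$STATE == "IN"])), n = 2)

print(by_subset)
print(by_index)
```

Since sort puts the counts in increasing order, tail picks out the largest entries, and both approaches give the same counts.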
Question 5 (2 pts)
Consider the 1990 flight data:
/anvil/projects/tdm/data/flights/subset/1990.csv
The DepDelay values are given in minutes. We will classify the flights according to how many hours each flight was delayed.
Use the cut command to classify the number of flights in each of these categories:
Flight departed early or on time, i.e., DepDelay is negative or 0.
Flight departed more than 0 but less than or equal to 60 minutes late.
Flight departed more than 60 but less than or equal to 120 minutes late.
Flight departed more than 120 but less than or equal to 180 minutes late.
Flight departed more than 180 but less than or equal to 240 minutes late.
Flight departed more than 240 but less than or equal to 300 minutes late.
Etc., etc., and finally:
Flight departed more than 1380 but less than or equal to 1440 minutes late.
Make a table that shows the number of flights in each of these categories.
Use the useNA="always" option in the table, so that the number of flights without a known DepDelay is also given.
In the |
- Give the table described above, which classifies the number of flights according to the number of hours that the flights are delayed.
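Evenly spaced break points like the ones listed above do not need to be typed out by hand. The following sketch, on a made-up vector of delays (delays and mybreaks are hypothetical names and values), builds the breaks with seq and counts the missing values too. The dig.lab argument of cut is an optional choice made here so that four-digit break values are written out in full in the labels.

```r
# Toy delays in minutes, invented for illustration; NA marks an unknown delay
delays <- c(-5, 0, 1, 60, 61, 120, 185, 1440, NA)

# -Inf catches early and on-time departures; then one break every 60 minutes
mybreaks <- c(-Inf, seq(0, 1440, by = 60))

# useNA = "always" adds a final entry that counts the NA values
delay_tab <- table(cut(delays, breaks = mybreaks, dig.lab = 4), useNA = "always")
print(delay_tab)
```

With these breaks, any delay greater than 1440 minutes falls outside every interval and is counted together with the NA values, so it is worth checking the maximum delay in the real data.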
Submitting your Work
You are now knowledgeable about a wide range of R functions. Please continue to practice and to ask good questions.
- firstname_lastname_project7.ipynb
You must double check your You will not receive full credit if your |