TDM 10100: R Project 8 — 2024

Motivation: We will learn about how user-defined functions work in R.

Context: Although R has lots of built-in functions, we can design our own functions too!

Scope: We start with some basic functions, just one line functions, to demonstrate how powerful they are.

Learning Objectives:
  • User-defined functions in R.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv

  • /anvil/projects/tdm/data/beer/reviews_sample.csv

  • /anvil/projects/tdm/data/election/itcont1980.txt

  • /anvil/projects/tdm/data/flights/subset/1990.csv

  • /anvil/projects/tdm/data/olympics/athlete_events.csv

Example 1:

Finding the average weight of Olympic athletes in a given country.

avgweights <- function(x) {mean(myDF$Weight[myDF$NOC == x], na.rm = TRUE)}

Example 2:

Finding the percentages of school metro types in a given state.

myschoolpercentages <- function(x) {prop.table(table(myDF$"School Metro Type"[myDF$"School State" == x]))}

Example 3:

In the 1980 election data, finding the sum of the donations in a given state.

mystatesum <- function(x) {sum(myDF$TRANSACTION_AMT[myDF$STATE == x])}

Example 4:

Finding the average number of stars for a given author of reviews.

myauthoravgstars <- function(x) {mean(myDF$stars[myDF$author == x])}

Questions

As before, please use the seminar-r kernel (not the seminar kernel). You do not need to use the %%R cell magic.

Question 1 (2 pts)

Consider this user-defined function, which makes a table that shows the percentages of values in each category:

makeatable <- function(x) {prop.table(table(x, useNA="always"))}

If we do something like this, with a column from a data frame:

makeatable(myDF$mycolumn)

Then it is the same as running this:

prop.table(table(myDF$mycolumn, useNA="always"))

In other words, makeatable is a user-defined function that makes a table, including all NA values, and expresses the result as percentages. That is what the prop.table does here.

Now consider the DeathRecords data set:

/anvil/projects/tdm/data/death_records/DeathRecords.csv

  1. Try the function makeatable on the Sex column of the DeathRecords.

  2. Also try the function makeatable on the MaritalStatus column of the DeathRecords.

Deliverables
  • Use the makeatable function to display table of values from the Sex column of the DeathRecords.

  • Use the makeatable function to display table of values from the MaritalStatus column of the DeathRecords.

Question 2 (2 pts)

Define a function called teenagecount as follows:

teenagecount <- function(x) {length(x[(x >= 13) & (x <= 19) & (!is.na(x))])}
  1. Try this function on the Age column of the DeathRecords.

  2. Also try this function on the Age column of the file /anvil/projects/tdm/data/olympics/athlete_events.csv

Deliverables
  • Display the number of teenagers in the DeathRecords data.

  • Display the number of teenagers in the Olympics Athlete Events data.

Question 3 (2 pts)

The nchar function gives the number of characters in a string. The which.max function finds the position of the maximum value. Define the function:

longesttest <- function(x) {x[which.max(nchar(x))]}
  1. Use the function longesttest to find the longest review in the text column of the beer reviews data set /anvil/projects/tdm/data/beer/reviews_sample.csv

  2. Also use the function longesttest to find the longest name in the NAME column of the 1980 election data:

library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
Deliverables
  • Print the longest review in the text column of the beer reviews data set /anvil/projects/tdm/data/beer/reviews_sample.csv

  • Print the longest name in the NAME column of the 1980 election data.

Question 4 (2 pts)

  1. Create your own function called mostpopulardate that finds the most popular date in a column of dates, as well as the number of times that date occurs.

  2. Test your function mostpopulardate on the date column of the beer reviews data /anvil/projects/tdm/data/beer/reviews_sample.csv

  3. Also test your function mostpopulardate on the TRANSACTION_DT column of the 1980 election data.

Deliverables
  • a. Define your function called mostpopulardate

  • b. Use your function mostpopulardate to find the most popular date in the beer reviews data /anvil/projects/tdm/data/beer/reviews_sample.csv

  • c. Also use your function mostpopulardate to find the most popular transaction date from the 1980 election data.

Question 5 (2 pts)

Define a function called myaveragedelay that takes a 3-letter string (correspding to an airport code) and finds the average departure delays (after removing the NA values) from the DepDelay column of the 1990 flight data /anvil/projects/tdm/data/flights/subset/1990.csv for flights departing from that airport.

Try your function on the Indianapolis "IND" flights. In other words, myaveragedelay("IND") should print 5.96977225672878 because the flights with Origin airport "IND" have an average departure delay of 5.9 minutes.

Try your function on the New York City "JFK" flights. In other words, myaveragedelay("JFK") should print 11.8572741063607 because the flights with Origin airport "JFK" have an average departure delay of 11.8 minutes.

Deliverables
  • a. Define your function called myaveragedelay

  • b. Use myaveragedelay("IND") to print the average departure delays for flights with Origin airport "IND".

  • c. Use myaveragedelay("JFK") to print the average departure delays for flights with Origin airport "JFK".

Submitting your Work

Now you know how to write your own functions! Please let us know if you need assistance with this project.

Items to submit
  • firstname_lastname_project8.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.