TDM 10100: R Project 8 — 2024

Motivation: We will learn how user-defined functions work in R.

Context: Although R has lots of built-in functions, we can design our own functions too!

Scope: We start with some basic one-line functions to demonstrate how powerful they are.

Learning Objectives:
  • User-defined functions in R.

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv

  • /anvil/projects/tdm/data/beer/reviews_sample.csv

  • /anvil/projects/tdm/data/election/itcont1980.txt

  • /anvil/projects/tdm/data/flights/subset/1990.csv

Example 1:

Finding the average weight of Olympic athletes in a given country.

avgweights <- function(x) {mean(myDF$Weight[myDF$NOC == x], na.rm = TRUE)}

Example 2:

Finding the percentages of school metro types in a given state.

myschoolpercentages <- function(x) {prop.table(table(myDF$"School Metro Type"[myDF$"School State" == x]))}

Example 3:

In the 1980 election data, finding the sum of the donations in a given state.

mystatesum <- function(x) {sum(myDF$TRANSACTION_AMT[myDF$STATE == x])}

Example 4:

Finding the average number of stars for a given author of reviews.

myauthoravgstars <- function(x) {mean(myDF$stars[myDF$author == x])}
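
All four examples follow the same pattern: the argument x picks out rows of a data frame called myDF that already exists in the session, and a summary function is applied to one column of those rows. A minimal sketch of that pattern, using a small made-up data frame (the rows below are illustrative only, not taken from the real Olympics data):

```r
# Toy stand-in for the Olympics data; these rows are made up for illustration.
myDF <- data.frame(
  NOC    = c("USA", "USA", "CAN", "CAN", "CAN"),
  Weight = c(70, NA, 60, 80, 100)
)

# Same one-line function as in Example 1; it looks up myDF when it is called.
avgweights <- function(x) {mean(myDF$Weight[myDF$NOC == x], na.rm = TRUE)}

avgweights("USA")   # 70, because the NA is dropped by na.rm = TRUE
avgweights("CAN")   # 80, the mean of 60, 80, and 100
```

Because the function refers to myDF by name, the data frame must be loaded before the function is called.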


Question 1 (2 pts)

Consider this user-defined function, which makes a table that shows the proportion of values in each category:

makeatable <- function(x) {prop.table(table(x, useNA="always"))}

If we call it on a column from a data frame, like this:

makeatable(myDF$mycolumn)

then it is the same as running this:

prop.table(table(myDF$mycolumn, useNA="always"))

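In other words, makeatable is a user-defined function that makes a table, counts the NA values as their own category (that is what useNA="always" does), and expresses the result as proportions that sum to 1 (that is what prop.table does). Multiply the result by 100 to read the values as percentages.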
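
As a quick check of what makeatable returns, here is a small sketch on a made-up vector (the values in v are hypothetical and chosen only so the arithmetic is easy to follow):

```r
makeatable <- function(x) {prop.table(table(x, useNA="always"))}

# Made-up values, including one missing value.
v <- c("M", "F", "F", NA)

result <- makeatable(v)
result
# F is 0.50, M is 0.25, and <NA> is 0.25; the entries sum to 1.

# Multiply by 100 if percentages are preferred over proportions.
100 * result
```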

Now consider the DeathRecords data set, /anvil/projects/tdm/data/death_records/DeathRecords.csv:


  1. Try the function makeatable on the Sex column of the DeathRecords.

  2. Also try the function makeatable on the MaritalStatus column of the DeathRecords.

  • Use the makeatable function to display a table of values from the Sex column of the DeathRecords.

  • Use the makeatable function to display a table of values from the MaritalStatus column of the DeathRecords.

Question 2 (2 pts)

Define a function called teenagecount as follows:

teenagecount <- function(x) {length(x[(x >= 13) & (x <= 19) & (!is.na(x))])}

  1. Try this function on the Age column of the DeathRecords.

  2. Also try this function on the Age column of the file /anvil/projects/tdm/data/olympics/athlete_events.csv

  • Display the number of teenagers in the DeathRecords data.

  • Display the number of teenagers in the Olympics Athlete Events data.
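
To see why the missing-value check matters, here is a small sketch on a made-up vector of ages. It assumes the third condition in teenagecount is written as !is.na(x); the ages are hypothetical.

```r
teenagecount <- function(x) {length(x[(x >= 13) & (x <= 19) & (!is.na(x))])}

ages <- c(12, 13, 16, 19, 20, NA)   # made-up ages, one of them missing

teenagecount(ages)   # 3, counting 13, 16, and 19

# Without the NA check, the NA element is kept by the index and inflates the count.
length(ages[(ages >= 13) & (ages <= 19)])   # 4
```

Indexing with a logical NA returns an NA element rather than dropping it, which is why the extra condition is needed before taking the length.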

Question 3 (2 pts)

The nchar function gives the number of characters in a string. The which.max function finds the position of the maximum value. Define the function:

longesttest <- function(x) {x[which.max(nchar(x))]}

  1. Use the function longesttest to find the longest review in the text column of the beer reviews data set /anvil/projects/tdm/data/beer/reviews_sample.csv

  2. Also use the function longesttest to find the longest name in the NAME column of the 1980 election data:

library(data.table)   # provides fread
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")

  • Print the longest review in the text column of the beer reviews data set /anvil/projects/tdm/data/beer/reviews_sample.csv

  • Print the longest name in the NAME column of the 1980 election data.
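
The way nchar and which.max combine inside longesttest can be traced step by step on a short made-up vector (the strings below are hypothetical):

```r
longesttest <- function(x) {x[which.max(nchar(x))]}

words <- c("ale", "porter", "stout", "lambic")   # made-up strings

nchar(words)              # 3 6 5 6
which.max(nchar(words))   # 2, the first position holding the maximum
longesttest(words)        # "porter"
```

When several strings tie for the longest, which.max reports only the first of them, so longesttest returns a single value.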

Question 4 (2 pts)

  1. Create your own function called mostpopulardate that finds the most popular date in a column of dates, as well as the number of times that date occurs.

  2. Test your function mostpopulardate on the date column of the beer reviews data /anvil/projects/tdm/data/beer/reviews_sample.csv

  3. Also test your function mostpopulardate on the TRANSACTION_DT column of the 1980 election data.

  • a. Define your function called mostpopulardate

  • b. Use your function mostpopulardate to find the most popular date in the beer reviews data /anvil/projects/tdm/data/beer/reviews_sample.csv

  • c. Also use your function mostpopulardate to find the most popular transaction date from the 1980 election data.
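
As a hint only, and not the required approach, here is a sketch of some base R building blocks that may be useful for this question. The vector of dates is made up, and how to wrap these pieces into mostpopulardate is left to you.

```r
dates <- c("2010-01-02", "2010-01-03", "2010-01-02", "2010-01-05")  # made-up dates

counts <- table(dates)           # how many times each distinct value occurs
top <- counts[which.max(counts)] # the most frequent value together with its count
top
sort(counts, decreasing = TRUE)  # all values, most frequent first
```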

Question 5 (2 pts)

Define a function called myaveragedelay that takes a 3-letter string (corresponding to an airport code) and finds the average departure delay (after removing the NA values) from the DepDelay column of the 1990 flight data /anvil/projects/tdm/data/flights/subset/1990.csv for flights departing from that airport.

Try your function on the Indianapolis "IND" flights. In other words, myaveragedelay("IND") should print 5.96977225672878 because the flights with Origin airport "IND" have an average departure delay of about 5.97 minutes.

Try your function on the New York City "JFK" flights. In other words, myaveragedelay("JFK") should print 11.8572741063607 because the flights with Origin airport "JFK" have an average departure delay of about 11.86 minutes.

  • a. Define your function called myaveragedelay

  • b. Use myaveragedelay("IND") to print the average departure delay for flights with Origin airport "IND".

  • c. Use myaveragedelay("JFK") to print the average departure delay for flights with Origin airport "JFK".

