TDM 10100: R Project 11 — 2024

Motivation: We continue to learn how to extract information from several files in R.

Context: The apply functions in R allow us to gather and analyze data from many sources in a unified way.

Scope: Applying functions to data.

Learning Objectives:
  • Learn how to apply functions to large data sets and extract information.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/flights/subset/* (flights data)

  • /anvil/projects/tdm/data/election/itcont/* (election data)

  • /anvil/projects/tdm/data/icecream/talenti/* (ice cream data)

We demonstrate the power of the apply family of functions.

Questions

Question 1 (2 pts)

For this question, only use the data corresponding to flights with Origin at the Indianapolis IND airport.

  1. Write a function called monthlydepdelays that takes a year as the input and uses tapply to return a table of length 12 with the average DepDelay for flights starting at IND in each of the 12 months of that year.

  2. Test your function individually, one at a time, on the years 1990, 1998, and 2005. For instance, if you run monthlydepdelays(1990), the output should be something like: 1: 7.28277205677707 2: 9.49702660406886 3: 6.92484111633048 4: 4.94985835694051 5: 5.47148703956344 6: 6.01083547191332 7: 4.30737704918033 8: 5.63978201634877 9: 4.45558583106267 10: 4.47372488408037 11: 3.4083044982699 12: 9.76410531972058

Deliverables
  • Write a function called monthlydepdelays that takes a year as the input and returns a table of length 12 with the average DepDelay for flights starting at IND in each of the 12 months of that year.

  • Show the output of monthlydepdelays(1990) and monthlydepdelays(1998) and monthlydepdelays(2005).

Question 2 (2 pts)

First run this command:

par(mfrow=c(3,2))

which tells R that the next 6 plots should appear in 3 rows and 2 columns.

Then the sapply function to plot the results of monthlydepdelays for the years 1988 through 1993.

Note: JupyterLab might print NULL values if you just run your sapply function by itself, but if you run question 2 like this, things should turn out OK:

par(mfrow=c(3,2))
myresults <- sapply(1988:1993, function(x) plot(monthlydepdelays(x)))
Deliverables
  • Make a 3 by 2 frame of 6 plots, corresponding to the results of monthlydepdelays in the years 1988 through 1993.

Question 3 (2 pts)

For this question, only use the data corresponding to donations from the state of Indiana.

  1. Write a function called myindycities that takes a year as the input and uses tapply to make a table of length 10, containing the top 10 cities in Indiana according to the sum of the amount of donations (in dollars) given in each city.

  2. Test your function individually, one at a time, on the years 1980, 1986, and 1992. For instance, if you run myindycities(1984), the output should be something like: FT WAYNE: 44665 TERRE HAUTE: 52650 CARMEL: 53200 EVANSVILLE: 65250 SOUTH BEND: 68387 INDPLS: 76520 FORT WAYNE: 80882 ELKHART: 93171 MUNCIE: 104260 INDIANAPOLIS: 511935

Deliverables
  • Write a function called myindycities that takes a year as the input and uses tapply to make a table of length 10, containing the top 10 cities in Indiana according to the sum of the amount of donations (in dollars) given in each city.

  • Show the output of myindycities(1980) and myindycities(1986) and myindycities(1992).

Question 4 (2 pts)

  1. Use the list apply function (lapply) to run the function myindycities on each of the even-numbered election years 1984 to 1994 as follows:

myresults <- lapply( seq(1984,1994,by=2), myindycities )
names(myresults) <- seq(1984,1994,by=2)
myresults
  1. Now use par(mfrow=c(3,2)) and the sapply function too, but this time, make a dotchart for each entry in myresults.

Do not worry about the pink warning that appears above the plots.

Deliverables
  • Use lapply to show the results of myindycities for each even-numbered year from 1984 to 1994.

  • Make a dotchart for each of the 6 years in part a.

Question 5 (2 pts)

  1. Find the average number of stars in each of these four files:

/anvil/projects/tdm/data/icecream/bj/reviews.csv

/anvil/projects/tdm/data/icecream/breyers/reviews.csv

/anvil/projects/tdm/data/icecream/hd/reviews.csv

/anvil/projects/tdm/data/icecream/talenti/reviews.csv

  1. Write a function myavgstars that takes a company name (e.g., either "bj" or "breyers" or "hd" or "talenti") as input, and returns the average number of stars for that company.

  2. Define a vector of length 4, with all 4 of these company names:

mycompanies <- c("bj", "breyers", "hd", "talenti")

and now use the sapply function to run the function from part b that re-computes the values from part a, all at once, like this:

sapply(mycompanies, myavgstars)
Deliverables
  • Print the average number of stars for each of the 4 ice cream companies.

  • Write a function myavgstars that takes a company name (e.g., either "bj" or "breyers" or "hd" or "talenti") as input, and returns the average number of stars for that company.

  • Use sapply to run the function from part b on the vector mycompanies, which should give the same values as in part a.

Submitting your Work

This project further demonstrates how to use the powerful functions in R to perform data analysis.

Items to submit
  • firstname_lastname_project11.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.