TDM 10100: R Project 11 — 2024
Motivation: We continue to learn how to extract information from several files in R.
Context: The apply functions in R allow us to gather and analyze data from many sources in a unified way.
Scope: Applying functions to data.
Dataset(s)
This project will use the following dataset(s):
- /anvil/projects/tdm/data/flights/subset/* (flights data)
- /anvil/projects/tdm/data/election/itcont/* (election data)
- /anvil/projects/tdm/data/icecream/talenti/* (ice cream data)
We demonstrate the power of the apply family of functions.
Questions
Question 1 (2 pts)
For this question, only use the data corresponding to flights with Origin at the Indianapolis IND airport.
- Write a function called monthlydepdelays that takes a year as the input and uses tapply to return a table of length 12 with the average DepDelay for flights starting at IND in each of the 12 months of that year.
- Test your function individually, one at a time, on the years 1990, 1998, and 2005. For instance, if you run monthlydepdelays(1990), the output should be something like:
1: 7.28277205677707
2: 9.49702660406886
3: 6.92484111633048
4: 4.94985835694051
5: 5.47148703956344
6: 6.01083547191332
7: 4.30737704918033
8: 5.63978201634877
9: 4.45558583106267
10: 4.47372488408037
11: 3.4083044982699
12: 9.76410531972058
- Write a function called monthlydepdelays that takes a year as the input and returns a table of length 12 with the average DepDelay for flights starting at IND in each of the 12 months of that year.
- Show the output of monthlydepdelays(1990) and monthlydepdelays(1998) and monthlydepdelays(2005).
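The tapply pattern behind Question 1 can be sketched on a tiny made-up data frame. This is only an illustration under stated assumptions: toyflights and its six rows are invented here, and only the column names Origin and DepDelay come from the question text (Month is an assumed column name). The real data has many more rows and columns.

```r
# Toy stand-in for the flights data (invented values, not the real files).
toyflights <- data.frame(
  Month    = c(1, 1, 2, 2, 3, 3),
  Origin   = c("IND", "IND", "IND", "ORD", "IND", "IND"),
  DepDelay = c(10, 20, 5, 99, NA, 7)
)

# Keep only the flights that start at IND.
indflights <- subset(toyflights, Origin == "IND")

# Split DepDelay by Month and average each group;
# na.rm = TRUE skips flights whose delay is missing.
avgdelays <- tapply(indflights$DepDelay, indflights$Month, mean, na.rm = TRUE)
avgdelays
```

The result has one entry per month that appears in the data, named by the month number. Without na.rm = TRUE, any month containing a missing delay would come back as NA.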
Question 2 (2 pts)
First run this command:
par(mfrow=c(3,2))
which tells R that the next 6 plots should appear in 3 rows and 2 columns.
Then use the sapply function to plot the results of monthlydepdelays for the years 1988 through 1993.
Note: JupyterLab might print NULL values if you just run your sapply function by itself, but if you run question 2 like this, things should turn out OK:
par(mfrow=c(3,2))
myresults <- sapply(1988:1993, function(x) plot(monthlydepdelays(x)))
- Make a 3 by 2 frame of 6 plots, corresponding to the results of monthlydepdelays in the years 1988 through 1993.
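To see why the NULL values show up, here is a small sketch that does not need the flights data. The function toymonthly is a hypothetical stand-in that just returns 12 made-up numbers for a year. Since plot returns NULL, sapply collects one NULL per year, and storing that result in a variable keeps it from being printed.

```r
# Hypothetical stand-in: 12 made-up monthly values for a given year.
toymonthly <- function(year) {
  values <- (year %% 10) + sin(1:12)
  names(values) <- 1:12
  values
}

par(mfrow = c(3, 2))  # the next 6 plots fill a 3-row, 2-column grid

# plot() draws as a side effect and returns NULL, so myplots is
# just a list of 6 NULLs; assigning it avoids printing them.
myplots <- sapply(1988:1993, function(yr) plot(toymonthly(yr), main = yr))
```

Wrapping the sapply call in invisible() is another way to suppress the printed NULL values.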
Question 3 (2 pts)
For this question, only use the data corresponding to donations from the state of Indiana.
- Write a function called myindycities that takes a year as the input and uses tapply to make a table of length 10, containing the top 10 cities in Indiana according to the sum of the amount of donations (in dollars) given in each city.
- Test your function individually, one at a time, on the years 1980, 1986, and 1992. For instance, if you run myindycities(1984), the output should be something like:
FT WAYNE: 44665
TERRE HAUTE: 52650
CARMEL: 53200
EVANSVILLE: 65250
SOUTH BEND: 68387
INDPLS: 76520
FORT WAYNE: 80882
ELKHART: 93171
MUNCIE: 104260
INDIANAPOLIS: 511935
- Write a function called myindycities that takes a year as the input and uses tapply to make a table of length 10, containing the top 10 cities in Indiana according to the sum of the amount of donations (in dollars) given in each city.
- Show the output of myindycities(1980) and myindycities(1986) and myindycities(1992).
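The "sum within each city, then keep the largest totals" idea from Question 3 can be sketched on invented data. The data frame toydonations and its column names (state, city, amount) are hypothetical and chosen only for this illustration; the real election files are laid out differently.

```r
# Toy stand-in for the donation data (invented values and column names).
toydonations <- data.frame(
  state  = c("IN", "IN", "IN", "OH", "IN", "IN"),
  city   = c("MUNCIE", "CARMEL", "MUNCIE", "DAYTON", "ELKHART", "CARMEL"),
  amount = c(100, 250, 50, 999, 75, 25)
)

# Keep only the donations from Indiana.
indiana <- subset(toydonations, state == "IN")

# Add up the donation amounts within each city.
citytotals <- tapply(indiana$amount, indiana$city, sum)

# sort() puts the totals in increasing order, so tail() keeps the largest;
# here we keep the top 2 (the question asks for the top 10).
topcities <- tail(sort(citytotals), n = 2)
topcities
```

Because sort works in increasing order, the largest total appears last, which matches the ordering in the sample output for myindycities(1984).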
Question 4 (2 pts)
- Use the list apply function (lapply) to run the function myindycities on each of the even-numbered election years 1984 to 1994 as follows:
myresults <- lapply( seq(1984,1994,by=2), myindycities )
names(myresults) <- seq(1984,1994,by=2)
myresults
- Now use par(mfrow=c(3,2)) and the sapply function again, but this time make a dotchart for each entry in myresults.
Do not worry about the pink warning that appears above the plots.
- Use lapply to show the results of myindycities for each even-numbered year from 1984 to 1994.
- Make a dotchart for each of the 6 years in part a.
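As a rough sketch of how the named list from lapply can feed the dotcharts, the example below swaps in a hypothetical function toytotals that returns two made-up city totals per year. Looping over names(myresults) is one possible choice (an assumption here, not the required approach); it makes the year available for each plot title.

```r
# Hypothetical stand-in returning a small named vector for a given year.
toytotals <- function(year) {
  c(CARMEL = year - 1900, MUNCIE = 2 * (year - 1900))
}

years <- seq(1984, 1994, by = 2)

# lapply always returns a list, with one element per year.
myresults <- lapply(years, toytotals)
names(myresults) <- years  # label each element by its year

par(mfrow = c(3, 2))  # 6 dotcharts in a 3-row, 2-column grid

# Look up each list element by name so the year can be used as the title.
invisible(sapply(names(myresults), function(yr) {
  dotchart(myresults[[yr]], main = yr)
}))
```

After the names are assigned, a single year can be pulled out with myresults[["1984"]].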
Question 5 (2 pts)
- Find the average number of stars in each of these four files:
/anvil/projects/tdm/data/icecream/bj/reviews.csv
/anvil/projects/tdm/data/icecream/breyers/reviews.csv
/anvil/projects/tdm/data/icecream/hd/reviews.csv
/anvil/projects/tdm/data/icecream/talenti/reviews.csv
- Write a function myavgstars that takes a company name (e.g., either "bj" or "breyers" or "hd" or "talenti") as input, and returns the average number of stars for that company.
- Define a vector of length 4, with all 4 of these company names:
mycompanies <- c("bj", "breyers", "hd", "talenti")
and now use the sapply function to run the function from part b on this vector, which re-computes the values from part a all at once, like this:
sapply(mycompanies, myavgstars)
- Print the average number of stars for each of the 4 ice cream companies.
- Write a function myavgstars that takes a company name (e.g., either "bj" or "breyers" or "hd" or "talenti") as input, and returns the average number of stars for that company.
- Use sapply to run the function from part b on the vector mycompanies, which should give the same values as in part a.
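The pattern in Question 5 (build a file path from a name, read the file, summarize one column, then repeat with sapply) can be tried on throwaway files first. Everything in this sketch is made up for illustration: the folder names alpha and beta, the temporary directory, and the assumption that the rating column is called stars.

```r
# Create two tiny review files in a temporary folder (toy data only).
toyroot <- file.path(tempdir(), "toyicecream")
toyratings <- list(alpha = c(5, 4, 3), beta = c(2, 4))
for (nm in names(toyratings)) {
  dir.create(file.path(toyroot, nm), recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(stars = toyratings[[nm]]),
            file.path(toyroot, nm, "reviews.csv"), row.names = FALSE)
}

# Build the path from the folder name, read the file, average one column.
toyavg <- function(name) {
  mydf <- read.csv(file.path(toyroot, name, "reviews.csv"))
  mean(mydf$stars)
}

# sapply over a character vector labels each result with the input string.
toynames <- c("alpha", "beta")
toyaverages <- sapply(toynames, toyavg)
toyaverages
```

Since the input is a character vector, sapply uses its values as the names of the result, so each average is labeled by the name that produced it.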
Submitting your Work
This project further demonstrates how to use the powerful functions in R to perform data analysis.
- firstname_lastname_project11.ipynb
You must double check your You will not receive full credit if your