TDM 10100: R Project 10 — 2024

Motivation: Using R enables us to applying functions to many data sets in an efficient way. We can extract information in a straightforward way.

Context: There are several types of apply functions in R. In this project, we learn about the sapply function, which is a "simplified" apply function.

Scope: Applying functions to data.

Learning Objectives:
  • Learn how to apply functions to large data sets and extract information.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/flights/subset/* (flights data)

  • /anvil/projects/tdm/data/election/itcont/* (election data)

  • /anvil/projects/tdm/data/taxi/yellow/* (yellow taxi cab data)

We demonstrate the power of the apply family of functions.

In this project, we walk students through these powerful techniques.

Questions

Question 1 (2 pts)

We can calculate the number of flights starting from Indianapolis airport in 1990 as follows:

library(data.table)
myDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv")
myvalue <- table(myDF$Origin)['IND']
myvalue
rm(myDF)

(We use the rm at the end, so that we do not keep this data frame in memory, during the remainder of the project.)

Now we can replicate this work, using a function, as follows:

myindyflights <- function(myyear) {
    myDF <- fread(paste0("/anvil/projects/tdm/data/flights/subset/", myyear, ".csv"))
    myvalue <- table(myDF$Origin)['IND']
    names(myvalue) <- myyear
    return(myvalue)
}

and we can test that we get the same results:

library(data.table)
myindyflights(1990)

Finally, we can use the sapply function to run this function on each year from 1987 to 2008.

library(data.table)
myresults <- sapply(1987:2008, myindyflights)

which yields the number of flights starting from Indianapolis airport in each year from 1987 to 2008.

The total number of flights starting from Indianapolis altogether, during 1987 to 2008, is:

sum(myresults)

and the number of flights per year is:

plot(names(myresults), myresults)

The data sets cover October 1987 through April 2008. So the data sets for 1987 and 2008 are smaller than you might expect, and that is OK.

Deliverables
  • Plot the total number of flights starting from the Indianapolis airport during 1987 to 2008.

Question 2 (2 pts)

We replicate the work from Question 1, but this time, we keep track of all of the flights originating at every airport in the data set.

We make a function, very much like Question 1, but this time, we keep track of the full table of the counts of Origin airports, for all airports (not just for Indianapolis):

myflights <- function(myyear) {
    myDF <- fread(paste0("/anvil/projects/tdm/data/flights/subset/", myyear, ".csv"))
    myvalue <- table(myDF$Origin)
    return(myvalue)
}

and we can test that this function works for the 1990 flights:

library(data.table)
myflights(1990)

Finally, we can use the sapply function to run this function on each year from 1987 to 2008.

library(data.table)
myresults <- sapply(1987:2008, myflights)

which yields the number of flights starting from each airport in each year, from 1987 to 2008.

Now we can add up the number of flights across all of the years, as follows:

v <- unlist(myresults)
tapply(v, names(v), sum)

and the number of flights starting at each of the top 10 airports during the years 1987 to 2008 is:

dotchart(tail(sort(tapply(v, names(v), sum)), n=10))
Deliverables
  • Plot the total number of flights starting from each of the top 10 airports during 1987 to 2008.

Question 3 (2 pts)

Now we follow the methodology of Question 1, but this time we obtain the total amount of the donations from Indiana during federal election campaigns.

We can extract the total amount of the donations from Indiana during an election year as follows:

myindydonations <- function(myyear) {
    myDF <- fread(paste0("/anvil/projects/tdm/data/election/itcont", myyear, ".txt"), quote="", select = c(10,15))
    names(myDF) <- c("state", "donation")
    myvalue <- tapply(myDF$donation, myDF$state, sum)['IN']
    names(myvalue) <- myyear
    return(myvalue)
}

and we can test this function by discovering how much money was donated from Indiana during the 1990 election cycle:

library(data.table)
myindydonations(1990)

Finally, we can use the sapply function to run this function on each election year (in other words, the even numbered years) from 1980 to 2018.

library(data.table)
myresults <- sapply( seq(1980,2018,by=2), myindydonations )

which yields the total amount of money donated from Indiana during each election cycle from 1980 to 2018.

The amount of money donated from Indiana per election cycle is:

plot(names(myresults), myresults)
Deliverables
  • Plot amount of money donated from Indiana per election cycle from 1980 to 2018.

Question 4 (2 pts)

Now we find the top 10 states according to the total amount of the donations from each state during the elections from 1980 to 2018.

We can extract the total amount of all the donations from all of the states during an election year as follows:

mydonations <- function(myyear) {
    myDF <- fread(paste0("/anvil/projects/tdm/data/election/itcont", myyear, ".txt"), quote="", select = c(10,15))
    names(myDF) <- c("state", "donation")
    myvalue <- tapply(myDF$donation, myDF$state, sum)
    return(myvalue)
}

and we can test this function by discovering how much money was donated from each state during the 1990 election cycle:

library(data.table)
mydonations(1990)

Finally, we can use the sapply function to run this function on each election year (in other words, the even numbered years) from 1980 to 2018.

library(data.table)
myresults <- sapply( seq(1980,2018,by=2), mydonations )

which yields the total amount of money donated from each state during each election cycle from 1980 to 2018.

Now we can add up the amount of donations in each state, across all of the years, as follows:

v <- unlist(myresults)
tapply(v, names(v), sum)

and the total amount of donations from each of the top 10 states across all election years 1980 to 2018 is:

dotchart(tail(sort(tapply(v, names(v), sum)), n=10))
Deliverables
  • Plot the amount of money donated from each of the top 10 states altogether during 1980 to 2018.

Question 5 (2 pts)

In this last question, we find the total amount of money spent on taxi cab rides in New York City on each day of 2018.

We first extract the total amount of the taxi cab rides per day of a given month as follows:

myfares <- function(mymonth) {
    myDF <- fread(paste0("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2018-", mymonth, ".csv"), select=c(2,17))
    mytable <- tapply(myDF$total_amount, as.Date(myDF$tpep_pickup_datetime), sum)
    return(mytable)
}

and we can test this function by discovering how much money was spent on each day in January:

library(data.table)
myfares("01")

Finally, we can use the sapply function to run this function on each month from "01" to "12".

library(data.table)
myresults <- sapply( sprintf("%02d", 1:12), myfares )

which yields the total amount of money spent on taxi cab rides each day.

Now we can add up the amounts spent per day (sometimes there is overlap from month to month), as follows:

names(myresults) <- NULL
v <- do.call(c, myresults)
mytotals <- tapply(v, names(v), sum)
betterdates <- mytotals[year(as.Date(names(mytotals))) == 2018]

and the total amount of money spent on taxi cab rides during each day in 2018 is can be plotted as:

plot( as.Date(names(betterdates)), betterdates )
Deliverables
  • Plot the total amount of money spent on taxi cab rides during each day in 2018.

Submitting your Work

Now you are familiar with the method of merging data from multiple data frames.

Items to submit
  • firstname_lastname_project10.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.