TDM 10100: R Project 10 — 2024
Motivation: Using R enables us to apply functions to many data sets in an efficient way. We can extract information in a straightforward way.
Context: There are several types of apply functions in R. In this project, we learn about the sapply function, which is a "simplified" apply function.
Scope: Applying functions to data.
Dataset(s)
This project will use the following dataset(s):
- /anvil/projects/tdm/data/flights/subset/* (flights data)
- /anvil/projects/tdm/data/election/itcont/* (election data)
- /anvil/projects/tdm/data/taxi/yellow/* (yellow taxi cab data)
We demonstrate the power of the apply family of functions.
In this project, we walk students through these powerful techniques.
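As a small illustration of what "simplified" means for sapply, the following sketch uses made-up toy values (not the project data) to compare lapply, which always returns a list, with sapply, which collapses the results into a vector when every call returns a single value.

```r
# Toy input: three small numeric vectors (hypothetical values)
mylist <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply always returns a list, with one element per input
aslist <- lapply(mylist, sum)

# sapply simplifies to a named numeric vector when each result has length 1
asvector <- sapply(mylist, sum)

class(aslist)
class(asvector)
asvector
```

When the individual results have different lengths, sapply cannot simplify and returns a list instead.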
Questions
Question 1 (2 pts)
We can calculate the number of flights starting from Indianapolis airport in 1990 as follows:
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv")
myvalue <- table(myDF$Origin)['IND']
myvalue
rm(myDF)
(We use rm at the end so that we do not keep this data frame in memory during the remainder of the project.)
Now we can replicate this work, using a function, as follows:
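myindyflights <- function(myyear) {
  # read the flights file for the given year
  myDF <- fread(paste0("/anvil/projects/tdm/data/flights/subset/", myyear, ".csv"))
  # count the flights whose origin is Indianapolis (IND)
  myvalue <- table(myDF$Origin)['IND']
  # label the count with the year, so that the sapply output is named by year
  names(myvalue) <- myyear
  return(myvalue)
}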
and we can test that we get the same results:
library(data.table)
myindyflights(1990)
Finally, we can use the sapply function to run this function on each year from 1987 to 2008.
library(data.table)
myresults <- sapply(1987:2008, myindyflights)
which yields the number of flights starting from Indianapolis airport in each year from 1987 to 2008.
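The same pattern can be tried without access to the Anvil files. The sketch below is only an illustration: `toyflights` and `toyindyflights` are hypothetical stand-ins, with a named list of tiny data frames playing the role of the yearly CSV files.

```r
# Hypothetical stand-in for the yearly files: a named list of small data frames
toyflights <- list(
  "1990" = data.frame(Origin = c("IND", "ORD", "IND", "LAX")),
  "1991" = data.frame(Origin = c("ORD", "IND", "IND", "IND"))
)

# Same shape as myindyflights, but looks up the toy list instead of calling fread
toyindyflights <- function(myyear) {
  myDF <- toyflights[[as.character(myyear)]]
  myvalue <- table(myDF$Origin)['IND']
  names(myvalue) <- myyear
  return(myvalue)
}

# Each call returns one named count, so sapply simplifies to a named vector
toyresults <- sapply(1990:1991, toyindyflights)
toyresults
sum(toyresults)
```

Because each call returns a single value labeled with its year, the result is a vector whose names are the years, which is what makes `plot(names(myresults), myresults)` possible.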
The total number of flights starting from Indianapolis altogether, during 1987 to 2008, is:
sum(myresults)
and the number of flights per year is:
plot(names(myresults), myresults)
The data sets cover October 1987 through April 2008. So the data sets for 1987 and 2008 are smaller than you might expect, and that is OK.
- Plot the number of flights starting from the Indianapolis airport in each year from 1987 to 2008.
Question 2 (2 pts)
We replicate the work from Question 1, but this time, we keep track of all of the flights originating at every airport in the data set.
We make a function, very much like Question 1, but this time, we keep track of the full table of the counts of Origin airports, for all airports (not just for Indianapolis):
myflights <- function(myyear) {
  # read the flights file for the given year
  myDF <- fread(paste0("/anvil/projects/tdm/data/flights/subset/", myyear, ".csv"))
  # count the flights from every origin airport
  myvalue <- table(myDF$Origin)
  return(myvalue)
}
and we can test that this function works for the 1990 flights:
library(data.table)
myflights(1990)
Finally, we can use the sapply function to run this function on each year from 1987 to 2008.
library(data.table)
myresults <- sapply(1987:2008, myflights)
which yields the number of flights starting from each airport in each year, from 1987 to 2008.
Now we can add up the number of flights across all of the years, as follows:
v <- unlist(myresults)
tapply(v, names(v), sum)
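To see why the unlist and tapply steps work, here is a minimal sketch on made-up counts (the list `toyresults` is hypothetical). When the yearly tables have different lengths, sapply returns a list; unlist flattens it while keeping the airport codes as names, and tapply then sums the values that share a name.

```r
# Hypothetical per-year tables; the years do not contain the same airports
toyresults <- list(
  table(c("IND", "ORD", "ORD")),
  table(c("IND", "LAX"))
)

# unlist flattens the list into one vector, keeping airport codes as names
v <- unlist(toyresults)

# tapply groups the values by name and sums each group
toytotals <- tapply(v, names(v), sum)
toytotals
```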
and the number of flights starting at each of the top 10 airports during the years 1987 to 2008 is:
dotchart(tail(sort(tapply(v, names(v), sum)), n=10))
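The combination of sort and tail is what selects the top airports. A short sketch on made-up totals (the values in `toytotals` are hypothetical) shows the idea with the top 3 instead of the top 10.

```r
# Hypothetical totals for six airports
toytotals <- c(IND = 40, ORD = 300, LAX = 250, ATL = 320, SBN = 5, DFW = 280)

# sort puts the values in increasing order, so the largest sit at the end;
# tail(..., n = 3) keeps the last three, which are the three largest
top3 <- tail(sort(toytotals), n = 3)
top3
```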
- Plot the total number of flights starting from each of the top 10 airports during 1987 to 2008.
Question 3 (2 pts)
Now we follow the methodology of Question 1, but this time we obtain the total amount of the donations from Indiana during federal election campaigns.
We can extract the total amount of the donations from Indiana during an election year as follows:
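myindydonations <- function(myyear) {
  # read only columns 10 and 15, which are renamed to state and donation below
  myDF <- fread(paste0("/anvil/projects/tdm/data/election/itcont", myyear, ".txt"), quote="", select = c(10,15))
  names(myDF) <- c("state", "donation")
  # sum the donations within each state, and keep only the Indiana total
  myvalue <- tapply(myDF$donation, myDF$state, sum)['IN']
  # label the total with the year, so that the sapply output is named by year
  names(myvalue) <- myyear
  return(myvalue)
}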
and we can test this function by discovering how much money was donated from Indiana during the 1990 election cycle:
library(data.table)
myindydonations(1990)
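The key step inside this function is the grouped sum. As a rough sketch on a made-up data frame (`toyDF` and its values are hypothetical), tapply splits the donations by state, sums each group, and the `['IN']` index then picks out the Indiana total.

```r
# Hypothetical donations from three states
toyDF <- data.frame(
  state = c("IN", "OH", "IN", "IL", "OH"),
  donation = c(100, 250, 50, 75, 25)
)

# One total per state
bystate <- tapply(toyDF$donation, toyDF$state, sum)
bystate

# Only the Indiana total
bystate['IN']
```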
Finally, we can use the sapply function to run this function on each election year (in other words, the even numbered years) from 1980 to 2018.
library(data.table)
myresults <- sapply( seq(1980,2018,by=2), myindydonations )
which yields the total amount of money donated from Indiana during each election cycle from 1980 to 2018.
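For reference, the sequence of election years passed to sapply can be inspected on its own. This sketch only uses seq, with the same arguments as in the call to sapply; the name `electionyears` is introduced here for illustration.

```r
# Every second year from 1980 through 2018
electionyears <- seq(1980, 2018, by = 2)
electionyears
length(electionyears)
```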
The amount of money donated from Indiana per election cycle is:
plot(names(myresults), myresults)
- Plot the amount of money donated from Indiana per election cycle from 1980 to 2018.
Question 4 (2 pts)
Now we find the top 10 states according to the total amount of the donations from each state during the elections from 1980 to 2018.
We can extract the total amount of all the donations from all of the states during an election year as follows:
mydonations <- function(myyear) {
  # read only columns 10 and 15, which are renamed to state and donation below
  myDF <- fread(paste0("/anvil/projects/tdm/data/election/itcont", myyear, ".txt"), quote="", select = c(10,15))
  names(myDF) <- c("state", "donation")
  # sum the donations within each state
  myvalue <- tapply(myDF$donation, myDF$state, sum)
  return(myvalue)
}
and we can test this function by discovering how much money was donated from each state during the 1990 election cycle:
library(data.table)
mydonations(1990)
Finally, we can use the sapply function to run this function on each election year (in other words, the even numbered years) from 1980 to 2018.
library(data.table)
myresults <- sapply( seq(1980,2018,by=2), mydonations )
which yields the total amount of money donated from each state during each election cycle from 1980 to 2018.
Now we can add up the amount of donations in each state, across all of the years, as follows:
v <- unlist(myresults)
tapply(v, names(v), sum)
and the total amount of donations from each of the top 10 states across all election years 1980 to 2018 is:
dotchart(tail(sort(tapply(v, names(v), sum)), n=10))
- Plot the amount of money donated from each of the top 10 states altogether during 1980 to 2018.
Question 5 (2 pts)
In this last question, we find the total amount of money spent on taxi cab rides in New York City on each day of 2018.
We first extract the total amount of the taxi cab rides per day of a given month as follows:
myfares <- function(mymonth) {
  # read only columns 2 and 17 of the file for the given month of 2018
  myDF <- fread(paste0("/anvil/projects/tdm/data/taxi/yellow/yellow_tripdata_2018-", mymonth, ".csv"), select=c(2,17))
  # sum total_amount within each pickup date
  mytable <- tapply(myDF$total_amount, as.Date(myDF$tpep_pickup_datetime), sum)
  return(mytable)
}
and we can test this function by discovering how much money was spent on each day in January:
library(data.table)
myfares("01")
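The grouping by day relies on as.Date dropping the time of day from each pickup timestamp. The sketch below illustrates this on a made-up data frame (`toyDF` is hypothetical, and its timestamps are stored as character strings, which may differ from how fread reads the real column).

```r
# Hypothetical rides: two on January 1 and one on January 2
toyDF <- data.frame(
  tpep_pickup_datetime = c("2018-01-01 00:21:05",
                           "2018-01-01 23:59:59",
                           "2018-01-02 08:15:00"),
  total_amount = c(10.5, 7.25, 20)
)

# as.Date keeps only the calendar date, so rides on the same day share a group
toydaily <- tapply(toyDF$total_amount, as.Date(toyDF$tpep_pickup_datetime), sum)
toydaily
```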
Finally, we can use the sapply function to run this function on each month from "01" to "12".
library(data.table)
myresults <- sapply( sprintf("%02d", 1:12), myfares )
which yields the total amount of money spent on taxi cab rides each day.
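The file names use two-digit months, which is why the months are built with sprintf rather than passed as plain numbers. A quick sketch of what that call produces (the name `mymonths` is introduced here for illustration):

```r
# "%02d" formats each integer with at least two digits, padding with a zero
mymonths <- sprintf("%02d", 1:12)
mymonths
```

Since the input to sapply is a character vector, sapply also uses these month strings as the names of its result.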
Now we can add up the amounts spent per day (sometimes there is overlap from month to month), as follows:
names(myresults) <- NULL
v <- do.call(c, myresults)
mytotals <- tapply(v, names(v), sum)
betterdates <- mytotals[year(as.Date(names(mytotals))) == 2018]
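The four steps above can be sketched on made-up values (the list `toyresults` is hypothetical). Removing the month names first means that c() keeps only the date names; tapply then merges dates that show up in more than one monthly file. The sketch uses format(..., "%Y") as a base R stand-in for the year function from data.table.

```r
# Hypothetical per-month results, named by month as sapply would name them
toyresults <- list(
  "01" = c("2017-12-31" = 5, "2018-01-01" = 10, "2018-02-01" = 2),
  "02" = c("2018-02-01" = 30, "2018-02-02" = 40)
)

# Drop the month names so that c() keeps only the date names
names(toyresults) <- NULL
v <- do.call(c, toyresults)

# Dates that appear in more than one month are summed together
toytotals <- tapply(v, names(v), sum)

# Keep only the dates that fall in 2018
toy2018 <- toytotals[format(as.Date(names(toytotals)), "%Y") == "2018"]
toy2018
```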
and the total amount of money spent on taxi cab rides during each day in 2018 can be plotted as:
plot( as.Date(names(betterdates)), betterdates )
- Plot the total amount of money spent on taxi cab rides during each day in 2018.
Submitting your Work
Now you are familiar with the method of merging data from multiple data frames.
- firstname_lastname_project10.ipynb
You must double check your You will not receive full credit if your