Think Summer: Day 3 Notes — 2024

Loading the R data.table library and loading the data from the 2005 airline data set

%%R
library(data.table)
%%R
myDF <- fread("/anvil/projects/tdm/data/flights/subset/2005.csv")

A dataframe in R, by the way, is a lot like an Excel spreadsheet or a SQL table. A dataframe has columns of data, with one type of data in each column. The columns are usually long. In other words, there are usually many rows in the dataframe.

These are the first few lines of the 2005 airline data set

%%R
head(myDF)

There are 7 million rows and 29 columns

%%R
dim(myDF)

The first few flights are departing from Boston or O’Hare

%%R
head(myDF$Origin)

The first few flights are arriving at Boston or O’Hare

%%R
head(myDF$Dest)

The last few flights are departing and arriving as follows:

%%R
tail(myDF$Origin)
%%R
tail(myDF$Dest)

We can use n=50 to get the destinations of the first 50 flights and the destinations of the last 50 flights.

%%R
head(myDF$Dest, n=50)
%%R
tail(myDF$Dest, n=50)

If we find out how many times an airplane departed from each airport, we get these counts:

%%R
head(table(myDF$Origin), n=10)

Now we can sort those counts, in descending order (i.e., with the largest ones given first), and display the largest such 10 counts.

%%R
head(sort(table(myDF$Origin), decreasing=T), n=10)

Now we can display how many flights departed from each of the 10 most popular airports.

%%R
dotchart(head(sort(table(myDF$Origin), decreasing=T), n=10))

We can extract the number of flights from specific airports, by looking the data up by the airports as indices. Note that we are only selecting from the 10 most popular airports here.

%%R
head(sort(table(myDF$Origin), decreasing=T), n=10)[c("ATL","CVG","ORD")]

Here is another example, in which we extract the number of flights from airports which may or may not be among the most popular 10 airports.

%%R
sort(table(myDF$Origin), decreasing=T)[c("EWR","IND","JFK","ORD")]

We can paste together the first 300 origin airports and the first 300 destination airports.

%%R
paste(head(myDF$Origin, n=300), head(myDF$Dest, n=300), sep="-")

Then we can tabulate how many times each such flight path was flown.

%%R
table(paste(head(myDF$Origin, n=300), head(myDF$Dest, n=300), sep="-"))

Now that this works, we can remove the heads on each of those data sets. Then we can tabulate the number of times that every flight path was used, and sort those results, and finally we can display the 100 most popular flight paths overall.

%%R
head(sort(table(paste(myDF$Origin, myDF$Dest, sep="-")), decreasing=T), n=100)

Now we can take these results and make a dotchart:

%%R
dotchart(head(sort(table(paste(myDF$Origin, myDF$Dest, sep="-")), decreasing=T), n=10))

or alternatively:

%%R
dotchart(tail(sort(table(paste(myDF$Origin, myDF$Dest, sep="-"))), n=10))

We can extract the number of specific flights as well, for instance:

%%R
tail(sort(table(paste(myDF$Origin, myDF$Dest, sep="-"))), n=20)["LAX-PHX"]
%%R
tail(sort(table(paste(myDF$Origin, myDF$Dest, sep="-"))), n=20)["SAN-LAX"]

In the example below, if we are still using the tail then we will not see the IND-ORD results, because that flight path is not among the most popular 20 flight paths. So we had to (instead) look at all of the flight paths. Note that, in such a case, we want to keep an index at the end (otherwise, we will be asking to see the tens of thousands of flight paths all over the whole country).

%%R
sort(table(paste(myDF$Origin, myDF$Dest, sep="-")))[c("SAN-LAX", "LAX-PHX", "IND-ORD")]

When we use the table function in R, in the result, we have a row of names followed by a row of data. Then we have another row of names followed by a row of data, etc., etc. R always displays data from a table in this way, namely, by alternating a row of names and a row of data. You can think about how things would look different (and easier) if your screen was really, really wide, and there were only two rows displayed, namely, the names and the data.

These are the airline carriers for the first 6 flights.

%%R
head(myDF$UniqueCarrier)

We can see how many flights were flown with each carrier, in either increasing or decreasing order:

%%R
sort(table(myDF$UniqueCarrier), decreasing=T)
%%R
sort(table(myDF$UniqueCarrier), decreasing=F)

The first few departure delays are:

%%R
head(myDF$DepDelay, n=100)

The overall average departure delay, across all flights, is 8.67 minutes:

%%R
mean(myDF$DepDelay, na.rm=T)

We can just restrict attention to the average departure delay for flights departing from IND or from JFK.

%%R
mean(myDF$DepDelay[myDF$Origin=="IND"], na.rm=T)
%%R
mean(myDF$DepDelay[myDF$Origin=="JFK"], na.rm=T)

These are the first 100 departure delays for flights from Indianapolis to Chicago.

%%R
head(myDF$DepDelay[(myDF$Origin=="IND") & (myDF$Dest=="ORD")], n=100)

and the mean (or average) of those departure delays are:

%%R
mean(myDF$DepDelay[(myDF$Origin=="IND") & (myDF$Dest=="ORD")])

The average departure delay on Christmas Day is:

%%R
mean( myDF$DepDelay[myDF$Year == 2005 & myDF$Month == 12 & myDF$DayofMonth == 25], na.rm=T)

The first 6 department delays for flights from Boston or flights from Indianapolis are:

%%R
head(myDF$DepDelay[myDF$Origin == "BOS"])
%%R
head(myDF$DepDelay[myDF$Origin == "IND"])

We could make a table of departure delays for flights from Indianapolis:

%%R
table(myDF$DepDelay[myDF$Origin == "IND"])

and we can plot the distribution of departure delays:

%%R
plot(table(myDF$DepDelay[myDF$Origin == "IND"]))

and we can add conditions to this. For instance, if we only want to see the distribution of delays that are less than 1 hour:

%%R
plot(table(myDF$DepDelay[(myDF$Origin == "IND") & (myDF$DepDelay < 60)]))

If we look at the raw data (without plotting it), we need to be mindful that the output shows two rows: the departure delay, and then (below) that how many flights with that delay.

%%R
table(myDF$DepDelay[(myDF$Origin == "IND") & (myDF$DepDelay < 60)])

Now we switch gears and load the donation data from federal election campaigns in 2000. This data is described here: Contributions by individuals file description

%%R
library(data.table)
%%R
myDF <- fread("/anvil/projects/tdm/data/election/itcont2000.txt", quote="")

We need to provide the names of the columns in the data frame:

%%R
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

The first several rows of election data are:

%%R
head(myDF)

There are 1.6 million rows and 21 columns

%%R
dim(myDF)

The states where the last 100 donations were made are:

%%R
tail(myDF$STATE, n=100)

Altogether, there were 1.8 billion dollars in contributions

%%R
sum(myDF$TRANSACTION_AMT)

The largest number of contributions (regardless of the size of the contributions) were made by residents of CA, NY, TX, etc.

%%R
sort(table(myDF$STATE), decreasing=T)

We can paste the first 6 cities and the first 6 states together, using the paste function:

%%R
head(myDF$CITY)
%%R
head(myDF$STATE)
%%R
paste(head(myDF$CITY), head(myDF$STATE))

Then we can tabulate how many times those 6 city-state pairs occur, and sort the results, and display the head.

%%R
head(sort(table(paste(head(myDF$CITY), head(myDF$STATE))), decreasing=T))

Now that this works for the first 6 city-state pairs, we can do this again for the entire data set. We see that the most donations were made from some typically large cities. There are also a lot of donations from unknown locations.

%%R
head(sort(table(paste(myDF$CITY, myDF$STATE)), decreasing=T), n=20)

or in the opposite order:

%%R
tail(sort(table(paste(myDF$CITY, myDF$STATE)), decreasing=F), n=20)

Here are the names of the people who made the largest number of contributions (regardless of the size of the contributions themselves)

%%R
head(sort(table(myDF$NAME), decreasing=T))

Now we can learn how to use the tapply function.

The tapply function takes three things, namely, some data, some groups to sort the data, and a function to run on the data.

For instance, we can take the data about the election transaction amounts, and split the data according the state where the donation was made, and sum the dollar amounts of those election donations within each state.

%%R
head(sort(tapply(myDF$TRANSACTION_AMT, myDF$STATE, sum), decreasing=T))

We can do something similar, now summing the amounts of the transactions in dollars, splitting the data according to the name of the donor:

%%R
head(sort(tapply(myDF$TRANSACTION_AMT, myDF$NAME, sum), decreasing=T), n=20)

Now we return to the airline data set from 2005:

%%R
myDF <- fread("/anvil/projects/tdm/data/flights/subset/2005.csv")

We can take an average of the departure delays, split according to the airline for the flights:

%%R
tapply( myDF$DepDelay, myDF$UniqueCarrier, mean, na.rm=T )

We can sum the distances of the flights according to the airports where the flights departed:

%%R
head(sort( tapply( myDF$Distance, myDF$Origin, sum ), decreasing=T ))

We can take an average of the arrival delays according to the destination where the flights landed.

%%R
head(sort( tapply( myDF$ArrDelay, myDF$Dest, mean, na.rm=T ), decreasing=T ))