R base functions

head

head is a function that shows the first n elements of an object (for a data frame, the first n rows), with default n = 6. If n is negative, head instead returns the whole object except its last |n| elements.
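As a minimal sketch of both behaviors, the snippet below uses a small made-up vector. The name x and its values are assumptions for illustration only and are not part of the airports data.

```r
# A small illustrative vector; `x` is a hypothetical name, not from the data set.
x <- 1:10

first_six   <- head(x)          # default n = 6: the first six values
first_three <- head(x, n = 3)   # the first three values
all_but_two <- head(x, n = -2)  # everything except the last two values

print(first_six)
print(first_three)
print(all_but_two)
```

The same calls work on a data frame, where n counts rows rather than individual values.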

Examples

Using the airports data file, show the first six rows of the airports data frame.

Click to see solution
myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/airports.csv")
head(myDF)
  iata              airport             city state country      lat       long
1  00M              Thigpen      Bay Springs    MS     USA 31.95376  -89.23450
2  00R Livingston Municipal       Livingston    TX     USA 30.68586  -95.01793
3  00V          Meadow Lake Colorado Springs    CO     USA 38.94575 -104.56989
4  01G         Perry-Warsaw            Perry    NY     USA 42.74135  -78.05208
5  01J     Hilliard Airpark         Hilliard    FL     USA 30.68801  -81.90594
6  01M    Tishomingo County          Belmont    MS     USA 34.49167  -88.20111

table

table is a function that builds a contingency table: it counts how often each value, or each combination of values, occurs in one or more categorical variables. prop.table accepts the output of table and returns the counts as proportions of the total.
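A minimal sketch of the two functions together, using a made-up vector of state codes (the name state and its values are assumptions for illustration, not taken from the airports file):

```r
# Hypothetical toy data: the state of each of six airports.
state <- c("IN", "TX", "IN", "CA", "TX", "IN")

counts <- table(state)        # counts per state, levels in alphabetical order
props  <- prop.table(counts)  # the same table, rescaled so the entries sum to 1

print(counts)
print(props)
```

Because prop.table works on the table itself, it can be applied to any result of table, including the sorted table in the example below.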

Examples

Using the airports data file, show how many airports are found in each state, first in alphabetical order by state (which is the default), and then sorted by count.

Click to see solution
# default
table(myDF$state)

# sorted
sort(table(myDF$state))
# default
 AK  AL  AR  AS  AZ  CA  CO  CQ  CT  DC  DE  FL  GA  GU  HI  IA  ID  IL  IN  KS
263  73  74   3  59 205  49   4  15   1   5 100  97   1  16  78  37  88  65  78
 KY  LA  MA  MD  ME  MI  MN  MO  MS  MT  NC  ND  NE  NH  NJ  NM  NV  NY  OH  OK
 50  55  30  18  34  94  89  74  72  71  72  52  73  14  35  51  32  97 100 102
 OR  PA  PR  RI  SC  SD  TN  TX  UT  VA  VI  VT  WA  WI  WV  WY
 57  71  11   6  52  57  70 209  35  47   5  13  65  84  24  32

# sorted
 DC  GU  AS  CQ  DE  VI  RI  PR  VT  NH  CT  HI  MD  WV  MA  NV  WY  ME  NJ  UT
  1   1   3   4   5   5   6  11  13  14  15  16  18  24  30  32  32  34  35  35
 ID  VA  CO  KY  NM  ND  SC  LA  OR  SD  AZ  IN  WA  TN  MT  PA  MS  NC  AL  NE
 37  47  49  50  51  52  52  55  57  57  59  65  65  70  71  71  72  72  73  73
 AR  MO  IA  KS  WI  IL  MN  MI  GA  NY  FL  OH  OK  CA  TX  AK
 74  74  78  78  84  88  89  94  97  97 100 100 102 205 209 263

subset

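subset is a function that returns the rows of a data frame (or the elements of a vector) that satisfy a condition. Rows for which the condition evaluates to NA are treated as FALSE and are dropped.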

subset does not perform any operation that can’t be accomplished by indexing.
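To illustrate the relationship between subset and indexing, here is a small sketch on a made-up data frame (df, id, and delay are hypothetical names). It also shows the one place where the two differ in practice: how an NA in the condition is handled.

```r
# Hypothetical toy data frame with one missing delay value.
df <- data.frame(id = 1:4, delay = c(5, NA, -3, 20))

# subset() treats an NA condition as FALSE, so row 2 is dropped.
with_subset <- subset(df, delay > 0)

# Plain logical indexing returns an all-NA row where the condition is NA ...
with_index <- df[df$delay > 0, ]

# ... unless the NA is handled explicitly, for example with which().
with_which <- df[which(df$delay > 0), ]

print(with_subset)
print(with_index)
print(with_which)
```

Here with_subset and with_which contain the same two rows, while with_index has an extra row of NA values.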

Examples

Using the 1990.csv file, make a subset of the flights for which DepDelay and ArrDelay are missing.

Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

bothMissing <- subset(flightDF, is.na(DepDelay) & is.na(ArrDelay))

Using the 1990.csv file, make a subset of the flights for which the DepDelay is given but the ArrDelay is missing.

Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

arrMissing <- subset(flightDF, !is.na(DepDelay) & is.na(ArrDelay))

Using the 1990.csv file, make a subset of the flights for which the ArrDelay is given but the DepDelay is missing.

Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

depMissing <- subset(flightDF, is.na(DepDelay) & !is.na(ArrDelay))

nrow

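nrow is a function that returns the number of rows of a matrix, array, or data.frame. For a plain vector, nrow returns NULL; the related function NROW treats a vector as a one-column matrix and returns its length.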
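A short sketch of how nrow behaves on different kinds of objects; df, m, and v are made-up objects used only for illustration. Note that a plain vector has no dimensions, so nrow returns NULL for it, whereas NROW returns its length.

```r
# Hypothetical toy objects.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))
m  <- matrix(1:8, nrow = 4)
v  <- c(10, 20, 30)

print(nrow(df))  # 3
print(nrow(m))   # 4
print(nrow(v))   # NULL: a plain vector has no dim attribute
print(NROW(v))   # 3: NROW treats a vector as a one-column matrix
```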

Examples

Using the 1990.csv file, for how many flights are both the DepDelay and the ArrDelay missing?

Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

bothMissing <- nrow(subset(flightDF, is.na(DepDelay) & is.na(ArrDelay)))
print(bothMissing)
[1] 52458

Using the 1990.csv file, for how many flights is the DepDelay given but the ArrDelay is missing?

Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

arrMissing <- nrow(subset(flightDF, !is.na(DepDelay) & is.na(ArrDelay)))
print(arrMissing)
[1] 15954

Using the 1990.csv file, for how many flights is the ArrDelay given but the DepDelay is missing?

Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

depMissing <- nrow(subset(flightDF, is.na(DepDelay) & !is.na(ArrDelay)))
print(depMissing)
[1] 0

mean

mean is a function that calculates the average of a vector of values.

You will often find yourself using the na.rm argument, short for NA value removal. Most real-life data will contain missing values somewhere, and na.rm = TRUE will automatically remove those values from consideration during a function call or computation. na.rm = FALSE is the default, so make sure to include na.rm = TRUE if you’re unsure of your data’s composition.
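The effect of na.rm can be seen on a small made-up vector (delays is a hypothetical name, and its values are not taken from the flight data):

```r
# Hypothetical delays in minutes, with one missing value.
delays <- c(10, NA, -2, 4)

default_mean <- mean(delays)                # NA, because na.rm = FALSE by default
clean_mean   <- mean(delays, na.rm = TRUE)  # 4, the mean of 10, -2 and 4

print(default_mean)
print(clean_mean)
```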

Examples

Using the 1990.csv file, give the mean of the DepDelay of all of the flights whose Origin airport is EWR or JFK or LGA.

Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")
mean(flightDF$DepDelay[flightDF$Origin %in% c("EWR", "JFK", "LGA")], na.rm = TRUE)
 9.40006351697211

cut

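cut divides the range of a numeric vector into intervals and returns a factor that records which interval each value falls in; the interval boundaries are given by the argument breaks. cut is also particularly useful for breaking Date data into quarters (Q1, Q2), years (1999, 2000, 2001), and so on.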

For dates and date-times, the usefulness of cut depends on the interval specifications that breaks accepts. You can see a list of your options by running ?cut.POSIXt.
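For the numeric case, a minimal sketch on made-up values (delays and bins are hypothetical names) shows how breaks and right = TRUE determine which interval each value lands in:

```r
# Hypothetical delays in minutes.
delays <- c(-5, 0, 30, 60, 61, 150)

# With right = TRUE each interval is closed on the right,
# so 0 falls in (-Inf,0] and 60 falls in (0,60].
bins <- cut(delays, breaks = c(-Inf, 0, 60, 120, 180), right = TRUE)

print(levels(bins))  # "(-Inf,0]" "(0,60]" "(60,120]" "(120,180]"
print(table(bins))   # counts per interval: 2, 2, 1, 1
```

Passing the resulting factor to table, as above, is what turns the classification into counts per interval.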

Examples

Using the 1990.csv file, build the table described below, which classifies the number of flights according to the number of hours that the flights are delayed.

The DepDelay values are given in minutes. We will classify the flights according to how many hours each flight was delayed.

Use the cut command to classify the number of flights in each of these categories:

Flight departed early or on time, i.e., DepDelay is negative or 0.

Flight departed more than 0 but less than or equal to 60 minutes late.

Flight departed more than 60 but less than or equal to 120 minutes late.

Flight departed more than 120 but less than or equal to 180 minutes late.

Flight departed more than 180 but less than or equal to 240 minutes late.

Flight departed more than 240 but less than or equal to 300 minutes late.

Etc., etc., and finally:

Flight departed more than 1380 but less than or equal to 1440 minutes late.

Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")

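# Bin edges: (-Inf, 0], (0, 60], (60, 120], ..., (1380, 1440]
breaks <- c(-Inf, 0, seq(60, 1440, by = 60))

# right = TRUE closes each interval on the right; dig.lab = 4 keeps
# four-digit edges such as 1020 out of scientific notation in the labels
categories <- cut(flightDF$DepDelay,
                  breaks = breaks,
                  right = TRUE,
                  dig.lab = 4)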

flight_delays <- table(categories, useNA = "always")

print(flight_delays)
categories
   (-Inf,0]      (0,60]    (60,120]   (120,180]   (180,240]   (240,300]
    2966433     2111783      104240       24000        7517        2630
  (300,360]   (360,420]   (420,480]   (480,540]   (540,600]   (600,660]
       1001         366         125          65          35          19
  (660,720]   (720,780]   (780,840]   (840,900]   (900,960]  (960,1020]
         24          20          24          13           8           6
(1020,1080] (1080,1140] (1140,1200] (1200,1260] (1260,1320] (1320,1380]
          1           4           3           3          11          28
(1380,1440]        <NA>
         76       52458

merge

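merge is a function that combines two data.frames by matching values in common columns or in row names. Its behavior mirrors the join operations in SQL: by default it performs an inner join, and the arguments all, all.x, and all.y request full, left, and right outer joins.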
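The join behavior can be sketched with two tiny made-up tables. The names airports and flights and all of their values are hypothetical; only the column names iata, Origin, state, and DepDelay mirror the real files.

```r
# Hypothetical toy tables; the values are made up for illustration.
airports <- data.frame(iata  = c("IND", "LAF", "ORD"),
                       state = c("IN", "IN", "IL"))
flights  <- data.frame(Origin   = c("IND", "ORD", "ORD", "XXX"),
                       DepDelay = c(5, 12, NA, 3))

# Inner join (the default): only keys present in both tables survive.
inner <- merge(airports, flights, by.x = "iata", by.y = "Origin")

# Left join: keep every airport, filling unmatched columns with NA.
left <- merge(airports, flights, by.x = "iata", by.y = "Origin", all.x = TRUE)

print(inner)
print(left)
```

In both results the key column keeps the name from the first table (iata). The airport LAF has no flights, so it appears only in the left join, with NA for DepDelay.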

Examples

Using airports.csv and 1990.csv files, merge the data according to the iata and Origin values. Then print the average departure delay for all flights that have Origin airport in Indiana.

Click to see solution
library(data.table)
airportsDF <- fread("/anvil/projects/tdm/data/flights/subset/airports.csv")
flightdataDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv")

mybigDF <- merge(airportsDF, flightdataDF, by.x = "iata", by.y = "Origin")
flights <- mybigDF[state == "IN"]
mean(flights$DepDelay, na.rm = TRUE)
 5.37069064914101

Using airports.csv and 1990.csv files, merge the data according to the iata and Origin values. Then print the average departure delay for all flights that have Origin airport in Houston, Texas.

Click to see solution
library(data.table)
airportsDF <- fread("/anvil/projects/tdm/data/flights/subset/airports.csv")
flightdataDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv")

mybigDF <- merge(airportsDF, flightdataDF, by.x = "iata", by.y = "Origin")

houston_flights <- mybigDF[state == "TX" & city == "Houston"]
mean(houston_flights$DepDelay, na.rm = TRUE)

# Other approach:
table(houston_flights$iata)
#  EFD   HOU   IAH
#  730 58534 91175
houston_flights2 <- mybigDF[iata %in% c("EFD", "HOU", "IAH")]
mean(houston_flights2$DepDelay, na.rm = TRUE)
# 7.22768727677225