R base
functions
head
head
is a simple function that shows the first n
values in an object, with default n = 6
. Additionally, if you include a negative sign in front of your n
integer, it will display the object without the last n items.
Examples
Using the airports data file, show the output with the first five rows of the airports data frame.
Click to see solution
myDF <- read.csv("/anvil/projects/tdm/data/flights/subset/airports.csv")
head(myDF)
iata airport city state country lat long 1 00M Thigpen Bay Springs MS USA 31.95376 -89.23450 2 00R Livingston Municipal Livingston TX USA 30.68586 -95.01793 3 00V Meadow Lake Colorado Springs CO USA 38.94575 -104.56989 4 01G Perry-Warsaw Perry NY USA 42.74135 -78.05208 5 01J Hilliard Airpark Hilliard FL USA 30.68801 -81.90594 6 01M Tishomingo country Belmont MS USA 34.49167 -88.20111
table
table
is a function used to build a contingency table, which is a table that shows counts for categorical data, from one or more categories. prop.table
is a function that accepts table
output, returning proportions of the counts.
Examples
Using the airports data file, show how many airports are found in each state, first in alphabetical order (which is the default), and then in sorted order.
Click to see solution
# default
table(myDF$state)
# sorted
sort(table(myDF$state))
# default AK AL AR AS AZ CA CO CQ CT DC DE FL GA GU HI IA ID IL IN KS 263 73 74 3 59 205 49 4 15 1 5 100 97 1 16 78 37 88 65 78 KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK 50 55 30 18 34 94 89 74 72 71 72 52 73 14 35 51 32 97 100 102 OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY 57 71 11 6 52 57 70 209 35 47 5 13 65 84 24 32 # numerically ordered DC GU AS CQ DE VI RI PR VT NH CT HI MD WV MA NV WY ME NJ UT 1 1 3 4 5 5 6 11 13 14 15 16 18 24 30 32 32 34 35 35 ID VA CO KY NM ND SC LA OR SD AZ IN WA TN MT PA MS NC AL NE 37 47 49 50 51 52 52 55 57 57 59 65 65 70 71 71 72 72 73 73 AR MO IA KS WI IL MN MI GA NY FL OH OK CA TX AK 74 74 78 78 84 88 89 94 97 97 100 100 102 205 209 263
subset
subset
is a function that helps you take subsets of data. By default, subset
removes NA rows.
subset does not perform any operation that can’t be accomplished by indexing.
|
Examples
Using the 1990.csv file, make a subset of the flights for which DepDelay and ArrDelay are missing.
Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")
bothMissing <- subset(flightDF, is.na(DepDelay) & is.na(ArrDelay))
nrow
nrow
is a function that returns the number of rows of a matrix, vector, array or data.frame.
Examples
Using the 1990.csv file, for how many flights is the DepDelay missing and also (simultaneously) the ArrDelay is missing too?
Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")
bothMissing <- nrow(subset(flightDF, is.na(DepDelay) & is.na(ArrDelay)))
print(bothMissing)
[1] 52458
mean
mean
is a function that calculates the average of a vector of values.
You will often find yourself using the na.rm
argument, short for NA value removal. Most real-life data will contain missing values somewhere, and na.rm = TRUE
will automatically remove those values from consideration during a function call or computation. na.rm = FALSE
is the default, so make sure to include na.rm = TRUE
if you’re unsure of your data’s composition.
Examples
Using the 1990 csv file, give the mean of the DepDelay of all of the flights whose Origin airport is EWR or JFK or LGA.
Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")
mean(flightDF$DepDelay[(flightDF$Origin == "EWR") | (flightDF$Origin == "JFK") | (flightDF$Origin == "LGA")], na.rm = TRUE)
9.40006351697211
cut
cut
breaks a vector into factors specified by the argument breaks
. cut
is particularly useful to break Date data into quarters (Q1, Q2), years (1999, 2000, 2001), and so on.
The utility of this function is tied to the possible factors offered by breaks
. You can see a list of your options by running ?cut.POSIXt
.
Examples
Using the 1990 csv file, give the table described above, which classifies the number of flights according to the number of hours that the flights are delayed.
The DepDelay values are given in minutes. We will classify the number of flights according to how many hours that the flight was delayed.
Use the cut command to classify the number of flights in each of these categories:
Flight departed early or on time, i.e., DepDelay is negative or 0.
Flight departed more than 0 but less than or equal to 60 minutes late.
Flight departed more than 60 but less than or equal to 120 minutes late.
Flight departed more than 120 but less than or equal to 180 minutes late.
Flight departed more than 180 but less than or equal to 240 minutes late.
Flight departed more than 240 but less than or equal to 300 minutes late.
Etc., etc., and finally:
Flight departed more than 1380 but less than or equal to 1440 minutes late.
Click to see solution
flightDF <- read.csv("/anvil/projects/tdm/data/flights/subset/1990.csv")
breaks <- c(-Inf, 0)
for (i in seq(60, 1440, by = 60)) {
breaks <- c(breaks, i)
}
categories <- cut(flight_data$DepDelay,
breaks = breaks,
right = TRUE,
dig.lab = 4)
flight_delays <- table(categories, useNA = "always")
print(flight_delays)
categories (-Inf,0] (0,60] (60,120] (120,180] (180,240] (240,300] 2966433 2111783 104240 24000 7517 2630 (300,360] (360,420] (420,480] (480,540] (540,600] (600,660] 1001 366 125 65 35 19 (660,720] (720,780] (780,840] (840,900] (900,960] (960,1020] 24 20 24 13 8 6 (1020,1080] (1080,1140] (1140,1200] (1200,1260] (1260,1320] (1320,1380] 1 4 3 3 11 28 (1380,1440] <NA> 76 52458
merge
merge
is a function that can be used to combine data.frames by row or column names. The effects of the function replicate the join operations in SQL.
Examples
Using airports.csv and 1990.csv files, merge the data according to the iata and Origin values. Then print the average departure delay for all flights that have Origin airport in Indiana.
Click to see solution
library(data.table)
airportsDF <- fread("/anvil/projects/tdm/data/flights/subset/airports.csv")
flightdataDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv")
mybigDF <- merge(airportsDF, flightdataDF, by.x = "iata", by.y = "Origin")
flights <- mybigDF[state == "IN"]
mean(flights$DepDelay, na.rm = TRUE)
5.37069064914101
Using airports.csv and 1990.csv files, merge the data according to the iata and Origin values. Then print the average departure delay for all flights that have Origin airport in Houston, Texas.
Click to see solution
library(data.table)
airportsDF <- fread("/anvil/projects/tdm/data/flights/subset/airports.csv")
flightdataDF <- fread("/anvil/projects/tdm/data/flights/subset/1990.csv")
mybigDF <- merge(airportsDF, flightdataDF, by.x = "iata", by.y = "Origin")
houston_flights <- mybigDF[state == "TX" & city == "Houston"]
mean(houston_flights$DepDelay, na.rm = TRUE)
# Other approach:
table(houston_flights$iata)
# EFD HOU IAH
# 730 58534 91175
houston_flights2 <- mybigDF[iata=="EFD" | iata == "HOU" | iata == "IAH"]
mean(houston_flights2$DepDelay, na.rm = TRUE)
# 7.22768727677225
7.22768727677225