R base
functions
table
table
is a function used to build a contingency table, which is a table that shows counts for categorical data, from one or more categories. prop.table
is a function that accepts table
output, returning proportions of the counts.
Examples
Which value appears in the "STATE" column the most times, in itcont1980.txt?
Click to see solution
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
head(sort(table(myDF$STATE), decreasing=TRUE), n=1)
CA 3706
paste
and paste0
paste
is a function that converts vector elements to character strings and then concatenates them. It has a sep
argument (default sep = " "
) where the user can include a phrase/string to separate the strings being pasted together
paste0
is a version of paste
where its sep
argument is "", meaning the strings will be linked with no characters in between.
Examples
Use the paste command to join the "CITY" and "STATE" columns, with the goal of determining the top 5 city-and-state locations where donations were made.
Click to see solution
head(sort(table(paste(myDF$"CITY", myDF$"STATE", sep=", ")), decreasing=TRUE), n=6)
NEW YORK, NY , HOUSTON, TX DALLAS, TX WASHINGTON, DC 13862 11582 10146 6438 5890 LOS ANGELES, CA 5866
subset
subset
is a function that helps you take subsets of data. By default, subset
removes NA rows.
subset does not perform any operation that can’t be accomplished by indexing.
|
Examples
There are only 9 entries in which the TRANSACTION_DT value is missing, namely: one donation from CURCIO, BARBARA G and two donations from WOLFF, GARY W. and six donations from who?? (find their identity)! Find the name of the person who made six donations in 1980 with a missing TRANSACTION_DT.
Click to see solution
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
missingDF <- subset(myDF, is.na(TRANSACTION_DT))
donorCounts <- table(missingDF$NAME)
name <- names(donorCounts[donorCounts == 6])
print(donorCounts)
print(name)
CURCIO, BARBARA G SCHECK, RICHARD HERBERT WOLFF, GARY W. 1 6 2 [1] "SCHECK, RICHARD HERBERT"
Using itcont1980.txt, give the 9 NAME values for which the TRANSACTION_DT value is missing, using indexing instead of the subset function.
Click to see solution
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
myDF$NAME[is.na(myDF$TRANSACTION_DT)]
'SCHECK, RICHARD HERBERT' 'SCHECK, RICHARD HERBERT' 'SCHECK, RICHARD HERBERT' 'SCHECK, RICHARD HERBERT' 'SCHECK, RICHARD HERBERT' 'SCHECK, RICHARD HERBERT' 'CURCIO, BARBARA G' 'WOLFF, GARY W.' 'WOLFF, GARY W.'
Using itcont1980.txt, using the subset function to get a data frame that contains only the donations for which the STATE is IN. From the CITY column of this subset, make a table of the number of occurrences of each CITY. Sort the table and print the largest 41 entries.
Click to see solution
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
IN_donations <- subset(myDF, STATE == "IN")
cities <- table(IN_donations$CITY)
sorted_cities <- sort(cities, decreasing = TRUE)
print(head(sorted_cities, n=41))
INDIANAPOLIS MUNCIE CARMEL FORT WAYNE EVANSVILLE 1443 232 167 157 151 ELKHART FT WAYNE INDPLS SOUTH BEND LAFAYETTE 132 132 123 123 85 TERRE HAUTE LAPORTE ZIONSVILLE GARY MERRILLVILLE 75 53 52 47 46 COLUMBUS KOKOMO RICHMOND BLOOMINGTON CROWN POINT 45 45 43 38 35 GREENWOOD ANDERSON VALPARAISO MUNICE MUNSTER 35 34 34 31 30 WABASH SYRACUSE WEST LAFAYETTE HAMMOND NOBLESVILLE 27 24 24 22 22 W LAFAYETTE WARSAW VINCENNES HOBART HIGHLAND 22 21 20 17 16 MICHIGAN CITY MISHAWAKA CHESTERTON INDIANPOLIS PLAINFIELD 16 16 15 15 15 WALKERTON 15
Using itcont1980.txt, using indexing (not a subset), give a table of the top 41 cities in Indiana, according to the number of donations from people in that city.
Click to see solution
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")
IN_donations_2 <- myDF[myDF$STATE == "IN"]
cities_2 <- table(IN_donations_2$CITY)
sorted_cities_2 <- sort(cities_2, decreasing = TRUE)
print(head(sort(sorted_cities_2, decreasing = TRUE), n=41))
INDIANAPOLIS MUNCIE CARMEL FORT WAYNE EVANSVILLE 1443 232 167 157 151 ELKHART FT WAYNE INDPLS SOUTH BEND LAFAYETTE 132 132 123 123 85 TERRE HAUTE LAPORTE ZIONSVILLE GARY MERRILLVILLE 75 53 52 47 46 COLUMBUS KOKOMO RICHMOND BLOOMINGTON CROWN POINT 45 45 43 38 35 GREENWOOD ANDERSON VALPARAISO MUNICE MUNSTER 35 34 34 31 30 WABASH SYRACUSE WEST LAFAYETTE HAMMOND NOBLESVILLE 27 24 24 22 22 W LAFAYETTE WARSAW VINCENNES HOBART HIGHLAND 22 21 20 17 16 MICHIGAN CITY MISHAWAKA CHESTERTON INDIANPOLIS PLAINFIELD 16 16 15 15 15 WALKERTON 15