R base functions

table

table is a function used to build a contingency table, which is a table that shows counts for categorical data, from one or more categories. prop.table is a function that accepts table output, returning proportions of the counts.

Examples

Which value appears in the "STATE" column the most times, in itcont1980.txt?

Click to see solution
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

head(sort(table(myDF$STATE), decreasing=TRUE), n=1)
   CA
3706

Which value appears in the "NAME" column the most times?

Click to see solution
head(sort(table(myDF$NAME), decreasing=TRUE), n=1)
AMERICAN MEDICAL POLITICAL ACTION COMMITTEE
                                        769

paste and paste0

paste is a function that converts vector elements to character strings and then concatenates them. It has a sep argument (default sep = " ") where the user can include a phrase/string to separate the strings being pasted together

paste0 is a version of paste where its sep argument is "", meaning the strings will be linked with no characters in between.

Examples

Use the paste command to join the "CITY" and "STATE" columns, with the goal of determining the top 5 city-and-state locations where donations were made.

Click to see solution
head(sort(table(paste(myDF$"CITY", myDF$"STATE", sep=", ")), decreasing=TRUE), n=6)
   NEW YORK, NY              ,      HOUSTON, TX      DALLAS, TX  WASHINGTON, DC
          13862           11582           10146            6438            5890
LOS ANGELES, CA
           5866

subset

subset is a function that helps you take subsets of data. By default, subset removes NA rows.

subset does not perform any operation that can’t be accomplished by indexing.

Examples

There are only 9 entries in which the TRANSACTION_DT value is missing, namely: one donation from CURCIO, BARBARA G and two donations from WOLFF, GARY W. and six donations from who?? (find their identity)! Find the name of the person who made six donations in 1980 with a missing TRANSACTION_DT.

Click to see solution
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

missingDF <- subset(myDF, is.na(TRANSACTION_DT))

donorCounts <- table(missingDF$NAME)

name <- names(donorCounts[donorCounts == 6])

print(donorCounts)
print(name)
      CURCIO, BARBARA G SCHECK, RICHARD HERBERT          WOLFF, GARY W.
                      1                       6                       2
[1] "SCHECK, RICHARD HERBERT"

Using itcont1980.txt, give the 9 NAME values for which the TRANSACTION_DT value is missing, using indexing instead of the subset function.

Click to see solution
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

myDF$NAME[is.na(myDF$TRANSACTION_DT)]
    'SCHECK, RICHARD HERBERT'
    'SCHECK, RICHARD HERBERT'
    'SCHECK, RICHARD HERBERT'
    'SCHECK, RICHARD HERBERT'
    'SCHECK, RICHARD HERBERT'
    'SCHECK, RICHARD HERBERT'
    'CURCIO, BARBARA G'
    'WOLFF, GARY W.'
    'WOLFF, GARY W.'

Using itcont1980.txt, using the subset function to get a data frame that contains only the donations for which the STATE is IN. From the CITY column of this subset, make a table of the number of occurrences of each CITY. Sort the table and print the largest 41 entries.

Click to see solution
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

IN_donations <- subset(myDF, STATE == "IN")

cities <- table(IN_donations$CITY)

sorted_cities <- sort(cities, decreasing = TRUE)

print(head(sorted_cities, n=41))
  INDIANAPOLIS         MUNCIE         CARMEL     FORT WAYNE     EVANSVILLE
          1443            232            167            157            151
       ELKHART       FT WAYNE         INDPLS     SOUTH BEND      LAFAYETTE
           132            132            123            123             85
   TERRE HAUTE        LAPORTE     ZIONSVILLE           GARY   MERRILLVILLE
            75             53             52             47             46
      COLUMBUS         KOKOMO       RICHMOND    BLOOMINGTON    CROWN POINT
            45             45             43             38             35
     GREENWOOD       ANDERSON     VALPARAISO         MUNICE        MUNSTER
            35             34             34             31             30
        WABASH       SYRACUSE WEST LAFAYETTE        HAMMOND    NOBLESVILLE
            27             24             24             22             22
   W LAFAYETTE         WARSAW      VINCENNES         HOBART       HIGHLAND
            22             21             20             17             16
 MICHIGAN CITY      MISHAWAKA     CHESTERTON    INDIANPOLIS     PLAINFIELD
            16             16             15             15             15
     WALKERTON
            15

Using itcont1980.txt, using indexing (not a subset), give a table of the top 41 cities in Indiana, according to the number of donations from people in that city.

Click to see solution
library(data.table)
myDF <- fread("/anvil/projects/tdm/data/election/itcont1980.txt", quote="")
names(myDF) <- c("CMTE_ID", "AMNDT_IND", "RPT_TP", "TRANSACTION_PGI", "IMAGE_NUM", "TRANSACTION_TP", "ENTITY_TP", "NAME", "CITY", "STATE", "ZIP_CODE", "EMPLOYER", "OCCUPATION", "TRANSACTION_DT", "TRANSACTION_AMT", "OTHER_ID", "TRAN_ID", "FILE_NUM", "MEMO_CD", "MEMO_TEXT", "SUB_ID")

IN_donations_2 <- myDF[myDF$STATE == "IN"]

cities_2 <- table(IN_donations_2$CITY)

sorted_cities_2 <- sort(cities_2, decreasing = TRUE)

print(head(sort(sorted_cities_2, decreasing = TRUE), n=41))
  INDIANAPOLIS         MUNCIE         CARMEL     FORT WAYNE     EVANSVILLE
          1443            232            167            157            151
       ELKHART       FT WAYNE         INDPLS     SOUTH BEND      LAFAYETTE
           132            132            123            123             85
   TERRE HAUTE        LAPORTE     ZIONSVILLE           GARY   MERRILLVILLE
            75             53             52             47             46
      COLUMBUS         KOKOMO       RICHMOND    BLOOMINGTON    CROWN POINT
            45             45             43             38             35
     GREENWOOD       ANDERSON     VALPARAISO         MUNICE        MUNSTER
            35             34             34             31             30
        WABASH       SYRACUSE WEST LAFAYETTE        HAMMOND    NOBLESVILLE
            27             24             24             22             22
   W LAFAYETTE         WARSAW      VINCENNES         HOBART       HIGHLAND
            22             21             20             17             16
 MICHIGAN CITY      MISHAWAKA     CHESTERTON    INDIANPOLIS     PLAINFIELD
            16             16             15             15             15
     WALKERTON
            15