TDM 10100: Project 5 — Fall 2023

Motivation: R differs from other programming languages in that it works especially well with vectorized functions and the apply suite of functions (instead of loops).

The apply family of functions provides an alternative to loops. You can use apply() and its variants (e.g., mapply(), sapply(), lapply(), vapply(), rapply(), and tapply()) to manipulate pieces of data from data frames, lists, arrays, and matrices in a repetitive way.

Context: We will focus in this project on efficient ways of processing data in R.

Scope: tapply function

Learning Objectives
  • Demonstrate the ability to use the tapply function.

Make sure to read about, and use, the template found here, and the important information about project submissions here.

Dataset(s)

The questions below use the following dataset on Anvil:

/anvil/projects/tdm/data/election/escaped2020sample.txt

A txt and csv file both store information in plain text. Data in csv files are almost always separated by commas. In txt files, the fields can be separated by commas, semicolons, pipe symbols, tabs, or other separators.

To read in a txt file in which the data fields are separated by the pipe symbol, add sep="|" (see the code below):

 myDF <- read.csv("/anvil/projects/tdm/data/election/escaped2020sample.txt", sep="|")

You might want to use 3 cores in this project when you set up your Jupyter Lab session.

Data Understanding

The file uses '|' (instead of commas) to separate the data fields. The reason is that one column of data contains full names, which sometimes include commas.
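To see why the pipe separator matters, here is a minimal sketch on a tiny made-up sample. The text in raw_text and the column names NAME and STATE are assumptions for illustration only (they are not taken from the real file); only the sep="|" argument mirrors the call above.

```r
# Hypothetical three-line sample: the NAME values contain commas,
# so a comma could not safely be used as the field separator.
raw_text <- "NAME|STATE|TRANSACTION_AMT
SMITH, JOHN|IN|100
DOE, JANE A.|OH|250"

# read.csv accepts inline text through the text argument;
# sep = "|" tells it to split fields on the pipe symbol.
toyDF <- read.csv(text = raw_text, sep = "|")

print(toyDF)
print(ncol(toyDF))  # number of columns found
```

With sep="|", each full name stays together in a single column even though it contains a comma.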

head(myDF)

When looking at the head of the data frame, notice that the entries in the TRANSACTION_DT column have the month, day, and year all crammed together without any slashes between them.

lubridate

The lubridate package can be used to put a column into a date format. In general, data that contains information about dates can sometimes be hard to put into a date format, but the lubridate package makes this easier.

library(lubridate, warn.conflicts = FALSE)
myDF$newdates <- mdy(myDF$TRANSACTION_DT)

A new column newdates is created, with the same data as the TRANSACTION_DT column but now stored in date format.
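As a small illustration of what mdy() does, the sketch below parses a few made-up date strings written as month, day, and year with no separators. The values in raw_dates are hypothetical and are assumed to be 8-character strings; the real TRANSACTION_DT column may be stored differently (for instance as numbers), so check it with head() or class() first.

```r
library(lubridate, warn.conflicts = FALSE)

# Hypothetical dates written as MMDDYYYY with no slashes
raw_dates <- c("01152020", "11032020", "12312019")

# mdy() reads each string as month, day, year and returns Date values
parsed <- mdy(raw_dates)

print(parsed)
print(class(parsed))
```

Once the values are stored as dates, other lubridate functions can pull out individual parts such as the year or the month.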

Feel free to check out the official cheatsheet to learn more about the lubridate package.

tapply

tapply() helps us apply functions (for instance: mean, median, min, max, or sum) to data, one group at a time. The tapply() function is most helpful when we need to break data into groups and apply a function to each group.

The tapply function takes three inputs:

Some data to work on; a way to break the data into groups; and a function to apply to each group of data.

tapply(myDF$TRANSACTION_AMT, myDF$newdates, sum)
  • This call sums the myDF$TRANSACTION_AMT data, grouped according to myDF$newdates

  • Three inputs for tapply

    • myDF$TRANSACTION_AMT: the data vector to work on

    • myDF$newdates: the way to break the data into groups

    • sum: the function to apply to each group of data
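The three inputs can be seen in action on a tiny made-up example. The vectors amounts and states below are hypothetical and are not part of the election data; they only show the shape of the result that tapply returns.

```r
# Hypothetical toy data: five amounts and the group each one belongs to
amounts <- c(100, 250, 50, 300, 75)
states  <- c("IN", "OH", "IN", "IL", "OH")

# 1st input: the data; 2nd input: the grouping; 3rd input: the function
totals <- tapply(amounts, states, sum)

# The result has one entry per group, labeled with the group name
print(totals)
print(names(totals))
```

Here the two IN amounts are added together (100 + 50), the two OH amounts are added together (250 + 75), and IL has a single amount. Swapping sum for another function such as mean or max changes what is computed for each group, while the grouping stays the same.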

Questions

Question 1 (1.5 pts)

  1. Use the year function (from the lubridate library) on the column newdates, to create a new column named TRANSACTION_YR.

  2. Using tapply, add the values in the TRANSACTION_AMT column, according to the values in the TRANSACTION_YR column.

  3. Plot the years on the x-axis and the total amount of the transactions by year on the y-axis.

Question 2 (1.5 pts)

  1. From Question 1, you may notice that the majority of the data collected is found in the years 2019-2020. Please create a new dataframe that only contains data for the dates in the range 01/01/2020-12/31/2020.

  2. Using tapply, get the sum of the money in the TRANSACTION_AMT column, grouped according to the months January through December (in 2020 only).

  3. Plot the months on the x-axis and the total amount of the transactions (for each month) on the y-axis.

Question 3 (1.5 pts)

Let’s go back to using the full set of data across all of the years (from Question 1). We can continue to experiment with the tapply function.

  1. Please find the donor who gave the most money (altogether) in the whole data set.

  2. Find the total amount of money given (altogether) in each state. Then sort the states, according to the total amount of money given altogether. In which 5 states was the most money given?

  3. What are the ten zipcodes in which the most money is donated (altogether)?

Question 4 (2 pts)

  1. Using a barplot or dotchart, plot the total amount of money given in each of the top five states.

  2. Using a barplot or dotchart, plot the total amount of money given in each of the top ten zipcodes.

Question 5 (1.5 pts)

  1. Analyze something that you find interesting about the election data, make a plot to demonstrate your insight, and then explain your finding in a few sentences.

Project 05 Assignment Checklist

  • Jupyter Lab notebook with your code and comments for the assignment

    • firstname-lastname-project05.ipynb.

  • R code and comments for the assignment

    • firstname-lastname-project05.R.

  • Submit files through Gradescope

Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, we recommend downloading your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.