TDM 10100: Project 5 — Fall 2022

Motivation: R differs from other programming languages in that it typically works best with vectorized functions and the apply suite instead of loops.

Insider Knowledge

Apply Functions: an alternative to loops. You can use apply() and its variants (e.g., mapply(), sapply(), lapply(), vapply(), rapply(), and tapply()) to manipulate pieces of data from data.frames, lists, arrays, and matrices in a repetitive way. The apply() functions allow for flexibility in crossing data in multiple ways that a loop does not.
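As a rough illustration of the idea, the sketch below computes the mean of each element of a list twice: once with a for loop and once with sapply(). The list `scores` and the variable names are made up for this example and are not part of the project dataset.

```r
# Toy data: a hypothetical list of numeric vectors (not from the project dataset)
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# Loop version: preallocate the result, then fill it one element at a time
loop_means <- numeric(length(scores))
names(loop_means) <- names(scores)
for (i in seq_along(scores)) {
  loop_means[i] <- mean(scores[[i]])
}

# Apply version: sapply() calls mean() on each element of the list
# and simplifies the results to a named numeric vector
apply_means <- sapply(scores, mean)

# Both approaches produce the same values
print(apply_means)
print(all.equal(loop_means, apply_means))
```

The apply version states what is computed (the mean of each element) without the bookkeeping of an index and a preallocated result.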

Context: We will focus in this project on efficient ways of processing data in R.

Scope: R, data.frames, recycling, factors, if/else, for loops, apply suite

Learning Objectives
  • Demonstrate the ability to use the tapply function.

Make sure to read about and use the template found here, and to read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s) on Anvil:

/anvil/projects/tdm/data/election/escaped2020sample.txt

Helpful Hint

A txt file and a csv file both store information in plain text. In a csv file, the fields are separated by commas. In a txt file, the fields can be separated by commas, semicolons, tabs, or another delimiter such as the pipe character (|).

To read a pipe-delimited txt file with read.csv, we simply add sep="|" (see code below).

 myDF <- read.csv("/anvil/projects/tdm/data/election/escaped2020sample.txt", sep="|")
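To see what the sep argument does without touching the full file, here is a minimal sketch that reads a few pipe-delimited lines from a string. The text in `raw_text`, the STATE column name, and the data.frame `toyDF` are hypothetical stand-ins, not the real dataset.

```r
# Hypothetical pipe-delimited text standing in for a small txt file
raw_text <- "NAME|STATE|TRANSACTION_AMT
DOE, JANE|IN|50
ROE, RICHARD|OH|25"

# read.csv() uses sep = "," by default; here we override it with sep = "|".
# The text argument lets read.csv() parse a string instead of a file.
toyDF <- read.csv(text = raw_text, sep = "|")

# Inspect the column names and types that were detected
str(toyDF)
```

Because the separator is "|", the commas inside the NAME values stay part of the field instead of splitting it.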

Questions

ONE

Read the dataset escaped2020sample.txt into a data.frame called myDF. The dataset contains contribution information for the 2020 election year.

The dataset has a column named TRANSACTION_DT which is set up in the [month].[day].[year] format. We want to organize the dates in chronological order.

When working with dates, it is important to use tools specifically for this purpose (rather than using string manipulation, for example). We’ve provided you with the code below. The provided code uses the lubridate package, an excellent package which hides away many common issues that occur when working with dates. Feel free to check out the official cheatsheet in case you’d like to learn more about the package.

library(lubridate, warn.conflicts = FALSE)

  1. Use the mdy function (from the lubridate library) on the column TRANSACTION_DT, to create a new column named newdates.

  2. Using tapply, add the values in the TRANSACTION_AMT column, according to the values in the newdates column.

  3. Plot the dates on the x-axis and the information we found in step 2 on the y-axis.
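The following sketch shows why date tools matter, using a tiny made-up data.frame. It assumes the lubridate package is installed, and the date strings are hypothetical values written as month, day, year; the layout of the real TRANSACTION_DT column may differ.

```r
library(lubridate, warn.conflicts = FALSE)

# Toy data: hypothetical date strings in month-day-year order
toyDF <- data.frame(
  TRANSACTION_DT  = c("03152019", "01022020", "11302018"),
  TRANSACTION_AMT = c(100, 250, 40)
)

# Sorting the raw strings compares characters, so the month comes first
# and the result is NOT chronological
sort(toyDF$TRANSACTION_DT)

# mdy() parses the strings into Date values
toyDF$newdates <- mdy(toyDF$TRANSACTION_DT)
class(toyDF$newdates)

# order() on Date values sorts the rows chronologically
toyDF[order(toyDF$newdates), ]
```

Once the column holds Date values, comparisons, sorting, and plotting all respect the calendar rather than the characters.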

Helpful Hint

tapply() helps us to compute statistical measures such as the mean, median, minimum, maximum, sum, etc. for data that is split into groups. tapply() is most helpful when we need to break up a vector into groups, and compute a function on each of the groups.

If your tapply in Question 1, step 2 will not finish (e.g., it is still running after a few minutes), then the fix described below will likely help. Please note that, after you run this fix, you need to reset your memory back to 5000 MB at time 4:16 in the video.
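A minimal sketch of the pattern tapply(values, groups, function), using made-up vectors rather than the project data:

```r
# Toy data: five hypothetical amounts, each belonging to group "A" or "B"
amounts <- c(10, 20, 5, 15, 50)
groups  <- c("A", "B", "A", "B", "A")

# Split amounts by group, then apply sum() to each piece
totals <- tapply(amounts, groups, sum)
print(totals)

# The same split with a different function
means <- tapply(amounts, groups, mean)
print(means)

# The group labels are stored as the names of the result
names(totals)
```

The first argument is the vector to summarize, the second says which group each value belongs to, and the third is the function applied to each group.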

You do not need to run this "fix" unless you have a cell like this, which should be running, but you are "stuck" on it:

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

TWO

The plot that we just created in question one shows us that the majority of the data collected is found in the years 2018-2020. So we will focus on the year 2019.

  1. Create a new data.frame that only contains data for the dates in the range 01/01/2019-05/15/2019.

  2. Plot the new data.frame.

  3. What do you notice about the data?
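One possible way to keep only the rows that fall inside a date range is logical indexing on a Date column. The sketch below uses a made-up data.frame, and `start_date`, `end_date`, and `subDF` are names chosen for illustration.

```r
# Toy data: hypothetical dates around the edges of the range
toyDF <- data.frame(
  newdates = as.Date(c("2018-12-31", "2019-01-01", "2019-03-10",
                       "2019-05-15", "2019-05-16")),
  TRANSACTION_AMT = c(10, 20, 30, 40, 50)
)

# Endpoints of the range, stored as Date values so comparisons are chronological
start_date <- as.Date("2019-01-01")
end_date   <- as.Date("2019-05-15")

# TRUE for rows whose date lies in the range (both endpoints included)
in_range <- toyDF$newdates >= start_date & toyDF$newdates <= end_date

# Keep only those rows
subDF <- toyDF[in_range, ]
print(subDF)
```

Note that >= and <= include both endpoints; using > or < instead would drop the rows dated exactly on the boundary.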

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Answers to the questions above.

THREE

Let's look at the donations by city and state.

  1. Find the sum of the total donations contributed in each state.

  2. Create a new column that pastes together the city and state.

  3. Find the total donation amount for each city/state location. In the output, do you notice anything suspicious in the result? How do you think that occurred?
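The sketch below illustrates pasting two columns together and grouping by the combined label. The column names CITY and STATE, the new column `citystate`, and all of the values are assumptions made for this toy example.

```r
# Toy data: the same city name appears in two different states
toyDF <- data.frame(
  CITY  = c("SPRINGFIELD", "SPRINGFIELD", "COLUMBUS", "SPRINGFIELD"),
  STATE = c("IL", "MO", "OH", "IL"),
  TRANSACTION_AMT = c(10, 20, 30, 40)
)

# paste() is vectorized: it combines the two columns row by row
toyDF$citystate <- paste(toyDF$CITY, toyDF$STATE, sep = ", ")

# Total amount for each distinct city/state label
location_totals <- tapply(toyDF$TRANSACTION_AMT, toyDF$citystate, sum)
print(location_totals)
```

Grouping on the combined label keeps the two Springfields apart, which grouping on the city name alone would not.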

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Answers to the questions above.

FOUR

Let's take a look at who is donating.

  1. Find the type of data that is in the NAME column.

  2. Split up the names in the NAME column, to extract the first names of the donors. (This will not be perfect, but it is our first attempt.)

  3. How much money is donated (altogether) by people named Mary?
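As a first attempt at splitting names, the sketch below assumes each name is written as "LAST, FIRST MIDDLE". That layout is an assumption made for illustration, so check how the real NAME column is formatted before relying on it. The vectors `toy_names` and `toy_amts` are made up.

```r
# Toy data: hypothetical names in "LAST, FIRST MIDDLE" form, with amounts
toy_names <- c("DOE, JANE A", "ROE, RICHARD", "SMITH, MARY ANN")
toy_amts  <- c(25, 40, 10)

# Check what kind of data we are working with
class(toy_names)

# strsplit() returns a list with one character vector of pieces per name
parts <- strsplit(toy_names, ", ", fixed = TRUE)

# Take the second piece (everything after the comma) from each element
after_comma <- sapply(parts, function(p) p[2])

# Split that piece on spaces and keep the first word
first_names <- sapply(strsplit(after_comma, " ", fixed = TRUE),
                      function(p) p[1])
print(first_names)

# Logical indexing then selects the amounts for one first name
sum(toy_amts[first_names == "MARY"])
```

Names that do not follow the assumed layout (no comma, titles, extra spaces) will not split cleanly, which is why this is only a first attempt.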

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Answers to the questions above.

FIVE

Employment status

  1. Using a barplot or dotchart, show the total amount of donations made by EMPLOYED vs NOT EMPLOYED individuals.

  2. What is the category of occupation that donates the most money?

  3. Plot something that you find interesting about the employment and/or occupation columns
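Grouped totals from tapply can be passed straight to barplot(). Here is a small sketch with made-up values, where the vectors `status` and `amounts` are hypothetical and do not come from the dataset.

```r
# Toy data: hypothetical employment labels and donation amounts
status  <- c("EMPLOYED", "NOT EMPLOYED", "EMPLOYED", "NOT EMPLOYED", "EMPLOYED")
amounts <- c(100, 20, 50, 30, 25)

# Total amount per label
totals <- tapply(amounts, status, sum)
print(totals)

# barplot() uses the names of the totals as the bar labels
barplot(totals, main = "Total donations by status (toy data)",
        ylab = "Total amount")

# which.max() gives the position of the largest total; names() gives its label
names(which.max(totals))
```

The same two steps (tapply for the totals, then a plot of the result) apply to any grouping column, such as occupation.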

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences explaining what it was you chose to plot and why.

  • Answers to the questions above.

Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.