STAT 19000: Project 10 — Fall 2021

Motivation: Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called functional programming. In this project, we will learn to apply functions to entire vectors of data using sapply.

Context: We’ve just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. sapply is one of the best ways to do this in R.

Scope: r, sapply, functions

Learning Objectives
  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Utilize apply functions in order to solve a data-driven problem.

  • Gain proficiency using split, merge, and subset.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/election/*.txt

Questions

Question 1

Read the elections dataset from 2014 (itcont2014.txt) into a data.frame called elections using the fread function from the data.table package.

Make sure to use the correct argument sep='|' from the fread function.

Create a vector called transactions_starting_digit that gets the starting digit for each transaction value (use the TRANSACTION_AMT column). Be sure to use get_starting_digit function from the previous project.

Take a look at the starting digits of the unique transaction amounts. Can we directly compare the results to the Benford’s law to look for anomalies? Explain why or why not, and if not, what do we need to do to be able to make the comparisons?

Pay close attention to the results — if you were able to directly compare, the numbers you were testing would need to be valid for the benfords law function.

What are the possible digits a number can start with?

Relevant topics: fread, unique, table

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences explaining if any changes are needed in our dataset to analyze it using Benford’s Law, why or why not? If so what changes are necessary?

Question 2

Be sure to watch the video from Question 1. It covers Question 2 too.

If in question (1) you answered that there are modifications needed in the data, make the necessary modifications.

You should need to make a modification.

Make a barplot showing the percentage of times each digit was the starting digit.

Include in your barplot a line indicating expected percentage based on Benford’s law.

If we compared our results to Benford’s Law would we consider the findings anomalous? Explain why or why not.

Relevant topics: barplot, lines, points, table, prop.table, indexing

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences explaining why or why not you think the results for this dataset are anomalous based on Benford’s law.

Question 3

Lets explore things a bit more. How does a different grouping look? To facilitate our analysis, lets create a function to replicate the steps from questions (1) and (2).

Create a function called compare_to_benfords that accepts two arguments, values and title. values represents a vector of values to analyze using Benford’s Law, and title will provide the title of our resulting plot.

Make sure the title argument has a default value, so we if we don’t pass an argument to it, it will still be able to run the function.

The function should get the starting digits in values, perform any necessary clean up, and compare the results with the Benford’s Law, graphically, by producing a plot we did in question (2).

Note that we are simplifying things by wrapping what we did in questions (1) and (2) into a function so we can do the analysis more efficiently.

Test your function on the TRANSACTION_AMT column from the elections dataset. Note that the results should be the same as question (2) — even the title of your plot.

For fair comparison, set the y-axis limits to be between 0 and 50%.

If you called either of the benfords_law or get_starting_digit functions within your compare_to_benfords function, consider the following.

What if you shared this function with your friend, who didn’t have access to your benfords_law or get_starting_digit functions? It wouldn’t work!

Instead, it is perfectly acceptable to declare your functions inside your compare_to_benfords function. These types of functions are called helper functions.

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • The results of running compare_to_benfords(elections$TRANSACTION_AMT).

Question 4

Let’s dig into data a bit more. Using the compare_to_benfords function, analyze the transactions from the following entities (ENTITY_TP):

  • Candidate ('CAN'),

  • Individual - a person - ('IND'),

  • and Organization - not a committee and not a person - ('ORG').

Use a loop, or one of the functions in the apply suite to solve this problem.

Write 1-2 sentences comparing the transactions for each type of ENTITY_TP.

Before running your code, run the following code to create a 2x2 grid for our plots.

par(mfrow=c(1,3))

There are many ways to solve this problem.

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • The results of running compare_to_benfords(elections$TRANSACTION_AMT).

  • Optional: Include the name or abbreviation of the entity in its title.

Question 5

Use the elections datasets and what you learned from the Benford’s Law to explore the dataset more.

You can compare specific states, donations to other entities, or even use datasets from other years.

Explain what and why you are doing, and what are your conclusions. Be creative!

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences explaining what and why you are doing.

  • 1-2 sentences explaining your conclusions.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.