# STAT 19000: Project 10 — Fall 2021

Motivation: Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called functional programming. In this project, we will learn to apply functions to entire vectors of data using `sapply`.

Context: We’ve just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. `sapply` is one of the best ways to do this in R.
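As a minimal sketch of that idea, the snippet below applies a one-value-at-a-time function to every element of a vector with `sapply`. The function name `size_label` and the amounts are made up for illustration and are not part of the project.

```r
# A function written for a single value: `if` only handles one element at a time.
size_label <- function(amount) {
  if (amount >= 1000) "large" else "small"
}

# Made-up amounts, for illustration only.
amounts <- c(5, 42, 1250, 300)

# sapply() calls size_label() once per element and simplifies the results
# into a character vector of the same length as `amounts`.
labels <- sapply(amounts, size_label)
print(labels)
```

Because `size_label` uses `if`, calling it directly on the whole vector would not work as intended, which is the situation where `sapply` is most helpful.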

Scope: r, sapply, functions

Learning Objectives
• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• Utilize apply functions in order to solve a data-driven problem.

• Gain proficiency using split, merge, and subset.

Make sure to read about and use the template found here, and read the important information about project submissions here.

## Dataset(s)

The following questions will use the following dataset(s):

• `/depot/datamine/data/election/*.txt`

## Questions

### Question 1

Read the elections dataset from 2014 (`itcont2014.txt`) into a data.frame called `elections` using the `fread` function from the `data.table` package.

Make sure to pass the argument `sep='|'` to the `fread` function.
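The sketch below shows how `sep='|'` behaves on a tiny, made-up pipe-delimited sample passed through the `text` argument of `fread`. It assumes the `data.table` package is installed. The rows are invented, and only the column names `ENTITY_TP` and `TRANSACTION_AMT` come from this project. On the course cluster, the call would point at the 2014 file under `/depot/datamine/data/election/` instead.

```r
library(data.table)

# Two invented records in a pipe-delimited layout (not real election data).
toy_text <- "NAME|ENTITY_TP|TRANSACTION_AMT
JANE DOE|IND|250
ACME GROUP|ORG|-1000"

# sep = "|" tells fread() that fields are separated by the pipe character.
toy <- fread(text = toy_text, sep = "|")

# fread() returns a data.table, which is also a data.frame.
print(class(toy))
print(toy$TRANSACTION_AMT)
```

If a plain data.frame is preferred, `fread` also accepts `data.table = FALSE`.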

Create a vector called `transactions_starting_digit` that gets the starting digit for each transaction value (use the `TRANSACTION_AMT` column). Be sure to use the `get_starting_digit` function from the previous project.

Take a look at the starting digits of the unique transaction amounts. Can we directly compare the results to Benford’s law to look for anomalies? Explain why or why not, and if not, what do we need to do to be able to make the comparisons?

Pay close attention to the results. To compare them directly, the numbers you test would need to be valid inputs for the `benfords_law` function.
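For reference, Benford’s law gives the expected proportion of leading digit d as log10(1 + 1/d), and it is defined only for the digits 1 through 9. The sketch below computes those expected proportions. The function name `benford_expected` is hypothetical and is not necessarily how the `benfords_law` function from the previous project was written.

```r
# Expected proportion of each leading digit d under Benford's law.
# Hypothetical helper; the version from the previous project may differ.
benford_expected <- function(d) {
  log10(1 + 1 / d)
}

expected <- benford_expected(1:9)

# Proportions for the digits 1 through 9, rounded for display.
print(round(expected, 3))

# The nine proportions add up to 1.
print(sum(expected))
```

Any starting "digit" outside 1 through 9 has no counterpart in this vector, which is worth keeping in mind when looking at the results.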

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences explaining whether any changes to our dataset are needed before analyzing it with Benford’s Law, and why or why not. If changes are needed, state what they are.

### Question 2

 Be sure to watch the video from Question 1. It covers Question 2 too.

If in question (1) you answered that there are modifications needed in the data, make the necessary modifications.

 You should need to make a modification.

Make a barplot showing the percentage of times each digit was the starting digit.

Include in your barplot a line indicating expected percentage based on Benford’s law.

If we compared our results to Benford’s Law, would we consider the findings anomalous? Explain why or why not.

Relevant topics: barplot, lines, points, table, prop.table, indexing
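As a rough sketch of how the listed functions fit together, the example below builds a percentage barplot from made-up amounts and overlays the Benford’s law expectation. The amounts are invented and already clean (positive, no zeros), so this does not show the modification the question asks about.

```r
# Made-up, already-clean amounts (not from the elections data).
amounts <- c(120, 15, 1900, 23, 310, 28, 47, 250, 560, 99, 105, 73, 810, 64, 35, 1300)

# First character of each amount, converted back to a number.
digits <- as.numeric(substr(as.character(amounts), 1, 1))

# Percentage of each digit 1-9; factor() keeps digits that never occur as 0%.
observed <- prop.table(table(factor(digits, levels = 1:9))) * 100

# Benford's law expectation, as a percentage.
expected <- log10(1 + 1 / (1:9)) * 100

# barplot() returns the x-position of each bar's midpoint,
# which lines() and points() can reuse to align the overlay.
midpoints <- barplot(observed, ylim = c(0, 50),
                     xlab = "Starting digit", ylab = "Percent",
                     main = "Toy example")
lines(midpoints, expected, col = "red")
points(midpoints, expected, col = "red", pch = 19)
```

Using the midpoints returned by `barplot` matters because the bars are not centered at x = 1, 2, ..., 9.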

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences explaining why you do or do not think the results for this dataset are anomalous based on Benford’s law.

### Question 3

Let’s explore things a bit more. How does a different grouping look? To facilitate our analysis, let’s create a function to replicate the steps from questions (1) and (2).

Create a function called `compare_to_benfords` that accepts two arguments, `values` and `title`. `values` represents a vector of values to analyze using Benford’s Law, and `title` will provide the title of our resulting plot.

Make sure the `title` argument has a default value, so that if we don’t pass an argument to it, the function will still be able to run.

The function should get the starting digits in `values`, perform any necessary cleanup, and compare the results with Benford’s Law graphically, by producing the same kind of plot we made in question (2).

Note that we are simplifying things by wrapping what we did in questions (1) and (2) into a function so we can do the analysis more efficiently.

Test your function on the `TRANSACTION_AMT` column from the `elections` dataset. Note that the results should be the same as question (2) — even the title of your plot.

For a fair comparison, set the y-axis limits to be between 0 and 50%.

If you called either the `benfords_law` or the `get_starting_digit` function within your `compare_to_benfords` function, consider the following. What if you shared this function with a friend who didn’t have access to your `benfords_law` or `get_starting_digit` functions? It wouldn’t work! Instead, it is perfectly acceptable to declare those functions inside your `compare_to_benfords` function. Functions declared this way are called helper functions.
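The pattern can be sketched with a deliberately different toy function. Here `describe_amounts` and `leading_char` are hypothetical names, and the function only counts leading characters rather than doing the full comparison. The points being illustrated are the default value for `title` and the helper declared inside the outer function.

```r
describe_amounts <- function(values, title = "Amounts") {
  # Helper declared inside the function, so nothing external is required.
  leading_char <- function(x) substr(as.character(x), 1, 1)

  chars <- sapply(values, leading_char)
  list(title = title, counts = table(chars))
}

# No title supplied: the default "Amounts" is used.
result <- describe_amounts(c(120, 45, 19))
print(result$title)

# Title supplied: the default is overridden.
custom <- describe_amounts(c(120, 45, 19), "Toy")
print(custom$title)
```

Because `leading_char` lives inside `describe_amounts`, it is not visible in the global environment, and anyone who receives `describe_amounts` gets the helper along with it.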

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• The results of running `compare_to_benfords(elections$TRANSACTION_AMT)`.

### Question 4

Let’s dig into the data a bit more. Using the `compare_to_benfords` function, analyze the transactions from the following entities (`ENTITY_TP`):

• Candidate ('CAN'),

• Individual - a person - ('IND'),

• and Organization - not a committee and not a person - ('ORG').

Use a loop, or one of the functions in the `apply` suite to solve this problem.

Write 1-2 sentences comparing the transactions for each type of `ENTITY_TP`.

Before running your code, run the following code to create a 1x3 grid (one row, three columns) for our plots.

`par(mfrow=c(1,3))`

There are many ways to solve this problem.
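One possible shape for this, sketched on made-up data: set up the grid, then use `sapply` over the entity codes and draw one panel per code. The data frame `toy` and the function `plot_values` are hypothetical stand-ins, and `plot_values` simply takes the place of any function that draws one plot from a vector of values and a title.

```r
# Made-up stand-in for the elections data.
toy <- data.frame(
  ENTITY_TP = c("CAN", "IND", "IND", "ORG", "CAN", "IND", "ORG", "ORG"),
  TRANSACTION_AMT = c(250, 1000, 35, 500, 120, 75, 2000, 150)
)

# Hypothetical stand-in for a plotting function that takes (values, title).
plot_values <- function(values, title = "Values") {
  barplot(sort(values), main = title)
}

entities <- c("CAN", "IND", "ORG")

par(mfrow = c(1, 3))  # 1 row, 3 columns: one panel per entity type
group_sizes <- sapply(entities, function(entity) {
  # Logical indexing: keep only the amounts for this entity type.
  amounts <- toy$TRANSACTION_AMT[toy$ENTITY_TP == entity]
  plot_values(amounts, title = entity)
  length(amounts)  # returned so sapply() has something to collect
})
par(mfrow = c(1, 1))  # restore the single-panel layout

print(group_sizes)
```

A `for` loop over `entities` would work equally well. `sapply` is used here only because it also collects a small summary per group.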

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• The results of running `compare_to_benfords(elections$TRANSACTION_AMT)`.

• Optional: Include the name or abbreviation of the entity in its title.

### Question 5

Use the elections datasets and what you learned about Benford’s Law to explore the data further.

You can compare specific states, donations to other entities, or even use datasets from other years.

Explain what you are doing and why, and state your conclusions. Be creative!
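If you choose to compare groups such as states, `split` is one convenient starting point. The sketch below uses a made-up data frame and assumes a column named `STATE`. Check the column names in your own `elections` data before relying on it.

```r
# Made-up data; the STATE column name is an assumption for illustration.
toy <- data.frame(
  STATE = c("IN", "IL", "IN", "OH", "IL", "IN"),
  TRANSACTION_AMT = c(250, 40, 1200, 75, 310, 18)
)

# split() returns a named list with one vector of amounts per state.
by_state <- split(toy$TRANSACTION_AMT, toy$STATE)

# Any function can then be applied to each group, for example length().
group_sizes <- sapply(by_state, length)
print(group_sizes)
```

Each element of `by_state` is an ordinary numeric vector, so it can be passed to any function that expects a vector of values.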

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences explaining what you are doing and why.

• 1-2 sentences explaining your conclusions.

Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.