STAT 19000: Project 9 — Fall 2021

Motivation: A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!

Context: We’ve been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.

Scope: r, functions

Learning Objectives
  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Demonstrate how to use tapply to solve data-driven problems.

  • Comprehend what a function is, and the components of a function in R.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/election/*.txt

Questions

Question 1

Benford’s law has many applications, the most famous probably being fraud detection.

Benford’s law, also called the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small. In sets that obey the law, the number 1 appears as the leading significant digit about 30 % of the time, while 9 appears as the leading significant digit less than 5 % of the time. If the digits were distributed uniformly, they would each occur about 11.1 % of the time. Benford’s law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on.

Benford’s law is given by the equation below.

$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$

$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$)

Create a function called benfords_law that takes the argument digit, and calculates the probability of digit being the starting digit of a random number based on Benford’s law above.

Consider digit to be a single value. Test your function on digit 7.

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • The results of running benfords_law(7).

Question 2

Let’s make our function more user friendly. When creating functions, it is important to think where you are going to use it, and if other people may use it as well.

Adding error catching statements can help make sure your function is not used out of context.

Add the following error catching by creating an if statement that checks if digit is between 1 and 9. If not, use the stop function to stop the function, and return a message explaining the error or how the user could avoid it.

Consider digit to be a single value. Test your new benfords_law function on digit 0.

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • The results of running benfords_law(0).

Question 3

Our benfords_law function was created to calculate the value of a single digit. We have discussed in the past the advantages of having a vectorized function.

Modify benfords_law to accept a vector of leading digits. Make sure benfords_law stops if any value in the vector digit is not between 1 and 9.

Test your vectorized benfords_law using the following code.

benfords_law(0:5)
benfords_law(1:6)

There are many ways to solve this problem. You can use for loops, use the functions sapply or Vectorize. However, the simplest way may be to take a look at our if statement as the function log is already vectorized.

Relevant topics: any

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Calculate the Benford’s law for all possible digits (1 to 9). Create a graph to illustrate the results. You can use a barplot, a lineplot, or a combination of both.

Make sure you add a title to your plot, play with the colors and aesthetics of your plot. Have fun!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Now that we have and understand the theoretical probabilities of Benford’s Law, how about we use it to try to find anomalies in the elections dataset?

As we mentioned previously, Benford’s Law is very commonly used in fraud detection. Fraud detection algorithms looks for anomalies in datasets based on certain criteria and flag it for audit or further exploration.

Not every anomaly is a fraud, but it is a good start.

We will continue this in our next project, but we can start to set things up.

Create a function called get_starting_digit that has one argument, transaction_vector.

The function should return a vector containing the starting digit for each value in the transaction_vector.

For example, get_starting_digit(c(10, 2, 500)) should return c(1, 2, 5). Make sure that the the results of get_starting_digit is a numeric vector.

Test your code running the following code.

str(get_starting_digit(c(100,2,50,689,1)))

There are many ways to solve this question.

*Relevant topics: as.numeric, substr

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.