STAT 19000: Project 9 — Fall 2021
Motivation: A key component to writing efficient code is writing functions. Functions allow us to repeat and reuse coding steps that we used previously, over and over again. If you find you are repeating code over and over, a function may be a good way to reduce lots of lines of code!
Context: We’ve been learning about and using functions all year! Now we are going to learn more about some of the terminology and components of a function, as you will certainly need to be able to write your own functions soon.
Scope: r, functions
Dataset(s)
The following questions will use the following dataset(s):

/depot/datamine/data/election/*.txt
Questions
Question 1
Benford’s law has many applications, the most famous probably being fraud detection.
Benford’s law, also called the Newcomb–Benford law, the law of anomalous numbers, or the firstdigit law, is an observation that in many reallife sets of numerical data, the leading digit is likely to be small. In sets that obey the law, the number 1 appears as the leading significant digit about 30 % of the time, while 9 appears as the leading significant digit less than 5 % of the time. If the digits were distributed uniformly, they would each occur about 11.1 % of the time. Benford’s law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on.
en.wikipedia.org/wiki/Benford%27s_law
Benford’s law is given by the equation below.
$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$
$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$)
Create a function called benfords_law
that takes the argument digit
, and calculates the probability of digit
being the starting digit of a random number based on Benford’s law above.
Consider digit
to be a single value. Test your function on digit 7.

R code used to solve this problem.

The results of running the R code.

The results of running
benfords_law(7)
.
Question 2
Let’s make our function more user friendly. When creating functions, it is important to think where you are going to use it, and if other people may use it as well.
Adding error catching statements can help make sure your function is not used out of context.
Add the following error catching by creating an if statement that checks if digit
is between 1 and 9. If not, use the stop
function to stop the function, and return a message explaining the error or how the user could avoid it.
Consider digit
to be a single value. Test your new benfords_law
function on digit 0.

R code used to solve this problem.

The results of running the R code.

The results of running
benfords_law(0)
.
Question 3
Our benfords_law
function was created to calculate the value of a single digit. We have discussed in the past the advantages of having a vectorized function.
Modify benfords_law
to accept a vector of leading digits. Make sure benfords_law
stops if any value in the vector digit
is not between 1 and 9.
Test your vectorized benfords_law
using the following code.
benfords_law(0:5)
benfords_law(1:6)
There are many ways to solve this problem. You can use for loops, use the functions 
Relevant topics: any

Code used to solve this problem.

Output from running the code.
Question 4
Calculate the Benford’s law for all possible digits (1 to 9). Create a graph to illustrate the results. You can use a barplot, a lineplot, or a combination of both.
Make sure you add a title to your plot, play with the colors and aesthetics of your plot. Have fun!

Code used to solve this problem.

Output from running the code.
Question 5
Now that we have and understand the theoretical probabilities of Benford’s Law, how about we use it to try to find anomalies in the elections dataset?
As we mentioned previously, Benford’s Law is very commonly used in fraud detection. Fraud detection algorithms looks for anomalies in datasets based on certain criteria and flag it for audit or further exploration.
Not every anomaly is a fraud, but it is a good start.
We will continue this in our next project, but we can start to set things up.
Create a function called get_starting_digit
that has one argument, transaction_vector
.
The function should return a vector containing the starting digit for each value in the transaction_vector
.
For example, get_starting_digit(c(10, 2, 500))
should return c(1, 2, 5)
. Make sure that the the results of get_starting_digit
is a numeric vector.
Test your code running the following code.
str(get_starting_digit(c(100,2,50,689,1)))
There are many ways to solve this question. 
*Relevant topics: as.numeric, substr

Code used to solve this problem.

Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. 