TDM 10100: Project 9 — 2022

Benford’s Law

Motivation: Benford’s law has many applications, including its infamous use in fraud detection. It also helps detect anomolies in naturally occurring datasets.

Scope: 'R' and functions

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/election/escaped2020sample.txt

Helpful Hint

A txt and csv file both store information in plain text. csv files are always separated by commas. In txt files the fields can be separated with commas, semicolons, or tab.

To read in a txt file as a csv we simply add sep="|" (see code below)

myDF <- read.csv("/anvil/projects/tdm/data/election/escaped/escaped2020sample.txt", sep="|")

Questions

Benford’s law (also known as the first digit law) states that the leading digits in a collection of datasets will most likely be small.
It is basically a probability distribution that gives the likelihood of the first digit occurring, in a set of numbers.

Another way to understand Benford’s law is to know that it helps us assess the relative frequency distribution for the leading digits of numbers in a dataset. It states that leading digits with smaller values occur more frequently.

Insider Knowledge

A probability distrubution helps definte what the probability of an event happening is. It can be simple events like a coin toss, or it can be applied to complex events such as the outcome of drug treatments etc.

  • Basic probability distributions which can be shown on a probability distribution table.

  • Binomial distributions, which have “Successes” and “Failures.”

  • Normal distributions, sometimes called a Bell Curve.

Remember that the sum of all the probablities in a distrubution is always 100% or 1 as a decimal.

Helpful Hint

This law only works for numbers that are significand S(x) which means any number that is set into a standard format.

To do this you must

  • Find the first non-zero digit

  • Move the decimal point to the right of that digit

  • Ignore the sign

An example would be 9087 and -.9087 both have the S(x) as 9.087

It can also work to find the second, third and succeeding numbers. It can also find the probability of certian combinations of numbers.

Typically does not apply to data sets that have a minimum and maximum (restricted). And to datasets if the numbers are assigned (i.e. social security numbers, phone numbers etc.) and not naturally occurring numbers.

Larger datasets and data that ranges over multiple orders of magnitudes from low to high work well using Bedford’s law.

Benford’s law is given by the equation below.

$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$

$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$)

An example the probability of the first digit being a 1 is

$P(1) = \dfrac{\ln((1+1)/1)}{\ln(10)} = 0.301$

ONE

  1. Create a function called benfords_law that takes the argument digit, and calculates the probability of digit being the starting figure of a random number based on Benford’s law.

  2. Create a vector named digits with numbers 1-9

  3. Now use the benfords_law function to create a plot (could be a bar plot, line plot, dot plot, etc., anything is OK) that shows the likelihood of digits occurring

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

TWO

  1. Read in the elections data (we have used this previously) into a dataset named myDF.

  2. Create a vector called firstdigit with the first digit from the TRANSACTION_AMT and then plot it (again, could be a bar plot, line plot, dot plot, etc., anything is OK).

  3. Does it look like it follows Bedford’s law? Why or why not?

Helpful Hint

use this to help plot

firstdigit <- as.numeric(firstdigit)
hist(firstdigit)
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

THREE

Create a function that will look at both the EMPLOYER and the OCCUPATION columns and return a new data frame with an added column named Employed that is FALSE if EMPLOYER is "NOT EMPLOYED", and is FALSE if OCCUPATION is "NOT EMPLOYED", and is TRUE otherwise.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

FOUR

How many arguments does the above function have? What does each line do? Use #comment to explain your function.

Using a graph, can you show the percentage of individuals employed vs not employed?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

FIVE

Write your own custom function! Make sure your function has at least two arguments and get creative. Your function could output a plot, or search and find information within the data.frame. Use what you have learned in Project 8 and 9 to help guide you.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Resources

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.