TDM 10100: Project 9 — 2023

Benford’s Law

Motivation: Benford’s law has many applications, including its well known use in fraud detection. It also helps detect anomalies in naturally occurring datasets.

You may get more information about Benford’s law from the following link "What is Benford’s Law and Why is it Important for Data Science"

Scope: 'R' and functions

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

/anvil/projects/tdm/data/restaurant/orders.csv

A txt and csv file both store information in plain text. csv files are always separated by commas. In txt files the fields can be separated with commas, semicolons, or tabs.

myDF <- read.csv("/anvil/projects/tdm/data/restaurant/orders.csv")

Questions

Benford’s law (also known as the first digit law) states that the leading digits in a collection of datasets will most likely be small.
It is basically a probability distribution that gives the likelihood of the first digit occurring, in a set of numbers.

Another way to understand Benford’s law is to know that it helps us assess the relative frequency distribution for the leading digits of numbers in a dataset. It states that leading digits with smaller values occur more frequently.

A probability distribution helps define what the probability of an event happening is. It can be simple events like a coin toss, or it can be applied to complex events such as the outcome of drug treatments etc.

Basic probability distributions which can be shown on a probability distribution table.
Binomial distributions, which have “Successes” and “Failures.”
Normal distributions, sometimes called a Bell Curve.

Remember that the sum of all the probabilities in a distribution is always 100% or 1 as a decimal.

This law only works for numbers that are significand S(x) which means any number that is set into a standard format.

To do this you must

Find the first non-zero digit
Move the decimal point to the right of that digit
Ignore the sign

An example would be 9087 and -.9087 both have the S(x) as 9.087

It can also work to find the second, third and succeeding numbers. It can also find the probability of certain combinations of numbers.

Typically this law does not apply to data sets that have a minimum and maximum (restricted). This law does not apply to datasets if the numbers are assigned (i.e. social security numbers, phone numbers etc.) and are not naturally occurring numbers.

Larger datasets and data that ranges over multiple orders of magnitudes from low to high work well using Bedford’s law.

Benford’s law is given by the equation below.

$P(d) = \dfrac{\ln((d+1)/d)}{\ln(10)}$

$d$ is the leading digit of a number (and $d \in \{1, \cdots, 9\}$)

An example the probability of the first digit being a 1 is

$P(1) = \dfrac{\ln((1+1)/1)}{\ln(10)} = 0.301$

The following is a function implementing Benford’s law

benfords_law <- function(d) log10(1+1/d)

To show Benfords_law in a line plot

digits <-1:9
bf_val<-benfords_law(digits)
plot(digits, bf_val, xlab = "digits", ylab="probabilities", main="Benfords Law Plot Line")

Question 1 (1 pt)

Create a plot (could be a bar plot, line plot, scatter plot, etc., any type of plot is OK) to show Benfords’s Law for probabilities of digits from 1 to 9.

Question 2 (1 pt)

Create a function called first_digit that takes an argument number, and extracts the first non-zero digit from the number

Question 3 (2 pts)

Read in the restaurant orders data /anvil/projects/tdm/data/restaurant/orders.csv into a dataset named myDF.
Create a vector fd_grand_total by using sapply with your function first_digit from question 2 on the grand_total column in your myDF dataframe

Question 4 (2 pts)

Calculate the actual distribution of digits in fd_grand_total
Plot the output actual distribution (again, could be a bar plot, line plot, dot plot, etc., anything is OK). Does it look like it follows Benford’s law? Explain briefly.

use table to get summary times of digits then divide by length of the vector fd_grand_total

Question 5 (2 pts)

Create a function that will return a new data frame orders_by_dates from the myDF that looks at the delivery_date column to compare with two arguments start_date and end_date. If the delivery_date is in between, then add record to the new data frame.
Run the function for a certain period, and display some orders with the head function

as.Date will be useful to do conversion in order to compare dates

Project 09 Assignment Checklist

Jupyter Lab notebook with your code, comments and output for the assignment
- firstname-lastname-project09.ipynb.
R code and comments for the assignment
- firstname-lastname-project09.R.
Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.