# STAT 19000: Project 5 — Fall 2020

Motivation: As briefly mentioned in project 4, R differs from other programming languages in that typically you will want to avoid using for loops, and instead use vectorized functions and the apply suite. In this project we will demonstrate some basic vectorized operations, and how they are better to use than loops.

Context: While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things.

Scope: r, data.frames, recycling, factors, if/else, for

Learning objectives
• Explain what "recycling" is in R and predict behavior of provided statements.

• Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

• Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• List the differences between lists, vectors, factors, and data.frames, and when to use each.

• Demonstrate a working knowledge of control flow in r: for loops .

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/fars`

## Questions

### Question 1

The `fars` dataset contains a series of folders labeled by year. In each year folder there is (at least) the files `ACCIDENT.CSV`, `PERSON.CSV`, and `VEHICLE.CSV`. If you take a peek at any `ACCIDENT.CSV` file in any year, you’ll notice that the column `YEAR` only contains the last two digits of the year. Add a new `YEAR` column that contains the full year. Use the `rbind` function to create a data.frame called `accidents` that combines the `ACCIDENT.CSV` files from the years 1975 through 1981 (inclusive) into one big dataset. After creating that `accidents` data frame, change the values in the `YEAR` column from two digits to four digits (i.e., paste a 19 onto each year value).

Here is a video to walk you through the method of solving Question 1.

Here is another video, using two functions you have not (yet) learned, namely, `lapply` and `do.call`. You do not need to understand these yet. It is just a glimpse of some powerful functions to come later in the course!

Items to submit
• R code used to solve the problem/comments explaining what the code does.

• The result of `unique(accidents\$YEAR)`.

## Question 2

Using the new `accidents` data frame that you created in (1), how many accidents are there in which 1 or more drunk drivers were involved in an accident with a school bus?

 Look at the variables `DRUNK_DR` and `SCH_BUS`.

Here is a video about a related problem with 3 fatalities (instead of considering drunk drivers).

Items to submit
• R code used to solve the problem/comments explaining what the code does.

### Question 3

Again using the `accidents` data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents?

Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers), tabulated according to year.

Items to submit
• R code used to solve the problem/comments explaining what the code does.

• The results.

• Which year had the most qualifying accidents.

### Question 4

Again using the `accidents` data frame: Calculate the mean number of motorists involved in an accident (variable `PERSON`) with i drunk drivers, where i takes the values from 0 through 6.

 It is OK that there are no accidents involving just 5 drunk drivers.
 You can use either a `for` loop or a `tapply` function to accomplish this question.

Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers). We calculate the mean number of fatalities for accidents with `i` drunk drivers, where `i` takes the values from 0 through 6.

Items to submit
• R code used to solve the problem/comments explaining what the code does.

• The output from running your code.

### Question 5

Again using the `accidents` data frame: We have a theory that there are more accidents in cold weather months for Indiana and states around Indiana. For this question, only consider the data for which `STATE` is one of these: Indiana (18), Illinois (17), Ohio (39), or Michigan (26). Create a barplot that shows the number of accidents by `STATE` and by month (`MONTH`) simultanously. What months have the most accidents? Are you surprised by these results? Explain why or why not?

We guide students through the methodology for Question 5 in this video. We also add a legend, in case students want to distinguish which stacked barplot goes with each of the four States.

Items to submit
• R code used to solve the problem/comments explaining what the code does.

• The output (plot) from running your code.

• 1-2 sentences explaining which month(s) have the most accidents and whether or not this surprises you.

### OPTIONAL QUESTION

Spruce up your plot from (5). Do any of the following: