STAT 19000: Project 5 — Fall 2020

Motivation: As briefly mentioned in project 4, R differs from other programming languages in that typically you will want to avoid using for loops, and instead use vectorized functions and the apply suite. In this project we will demonstrate some basic vectorized operations, and how they are better to use than loops.

Context: While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things.

Scope: r, data.frames, recycling, factors, if/else, for

Learning objectives
  • Explain what "recycling" is in R and predict behavior of provided statements.

  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

  • Demonstrate a working knowledge of control flow in r: for loops .

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/fars

To get more information on the dataset, see here.

Questions

Question 1

The fars dataset contains a series of folders labeled by year. In each year folder there is (at least) the files ACCIDENT.CSV, PERSON.CSV, and VEHICLE.CSV. If you take a peek at any ACCIDENT.CSV file in any year, you’ll notice that the column YEAR only contains the last two digits of the year. Add a new YEAR column that contains the full year. Use the rbind function to create a data.frame called accidents that combines the ACCIDENT.CSV files from the years 1975 through 1981 (inclusive) into one big dataset. After creating that accidents data frame, change the values in the YEAR column from two digits to four digits (i.e., paste a 19 onto each year value).

Here is a video to walk you through the method of solving Question 1.

Here is another video, using two functions you have not (yet) learned, namely, lapply and do.call. You do not need to understand these yet. It is just a glimpse of some powerful functions to come later in the course!

Items to submit
  • R code used to solve the problem/comments explaining what the code does.

  • The result of unique(accidents$YEAR).

Question 2

Using the new accidents data frame that you created in (1), how many accidents are there in which 1 or more drunk drivers were involved in an accident with a school bus?

Look at the variables DRUNK_DR and SCH_BUS.

Here is a video about a related problem with 3 fatalities (instead of considering drunk drivers).

Items to submit
  • R code used to solve the problem/comments explaining what the code does.

  • The result/answer itself.

Question 3

Again using the accidents data frame: For accidents involving 1 or more drunk drivers and a school bus, how many happened in each of the 7 years? Which year had the largest number of these types of accidents?

Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers), tabulated according to year.

Items to submit
  • R code used to solve the problem/comments explaining what the code does.

  • The results.

  • Which year had the most qualifying accidents.

Question 4

Again using the accidents data frame: Calculate the mean number of motorists involved in an accident (variable PERSON) with i drunk drivers, where i takes the values from 0 through 6.

It is OK that there are no accidents involving just 5 drunk drivers.

You can use either a for loop or a tapply function to accomplish this question.

Here is a video about the related problem with 3 fatalities (instead of considering drunk drivers). We calculate the mean number of fatalities for accidents with i drunk drivers, where i takes the values from 0 through 6.

Items to submit
  • R code used to solve the problem/comments explaining what the code does.

  • The output from running your code.

Question 5

Again using the accidents data frame: We have a theory that there are more accidents in cold weather months for Indiana and states around Indiana. For this question, only consider the data for which STATE is one of these: Indiana (18), Illinois (17), Ohio (39), or Michigan (26). Create a barplot that shows the number of accidents by STATE and by month (MONTH) simultanously. What months have the most accidents? Are you surprised by these results? Explain why or why not?

We guide students through the methodology for Question 5 in this video. We also add a legend, in case students want to distinguish which stacked barplot goes with each of the four States.

Items to submit
  • R code used to solve the problem/comments explaining what the code does.

  • The output (plot) from running your code.

  • 1-2 sentences explaining which month(s) have the most accidents and whether or not this surprises you.

OPTIONAL QUESTION

Spruce up your plot from (5). Do any of the following:

  • Add vibrant (and preferably colorblind friendly) colors to your plot

  • Add a title

  • Add a legend

  • Add month names or abbreviations instead of numbers

Here is a resource to get you started.

Items to submit
  • R code used to solve the problem/comments explaining what the code does.

  • The output (plot) from running your code.