# STAT 19000: Project 3 — Fall 2020

Motivation: `data.frame`s are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`.

Context: In the previous project we got our feet wet: we ran our first R code and learned how to access data inside vectors. In this project we will continue to reinforce what we’ve already learned and introduce a new, flexible data structure: the `data.frame`.

Scope: R, data.frames, recycling, factors

Learning Objectives
• Explain what "recycling" is in R and predict behavior of provided statements.

• Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

• Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• List the differences between lists, vectors, factors, and data.frames, and when to use each.
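As a quick illustration of the first two objectives, here is a minimal sketch of recycling and of the different kinds of "missing" values. The vectors are made up for illustration and are not part of the project data.

```r
# Recycling: the shorter vector is repeated to match the longer one.
x <- c(1, 2, 3, 4)
y <- c(10, 20)
recycled <- x + y            # same as c(1, 2, 3, 4) + c(10, 20, 10, 20)
print(recycled)

# NA marks a missing value inside a vector.
values <- c(1, NA, 3)
na_flags <- is.na(values)    # TRUE only where the value is missing
print(na_flags)

# NaN ("not a number") results from undefined arithmetic such as 0 / 0.
nan_value <- 0 / 0
print(is.nan(nan_value))

# NULL is the absence of an object; it has length 0.
null_length <- length(NULL)
print(null_length)
```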

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/disney`

## Questions

### Question 1

Read the dataset `/class/datamine/data/disney/splash_mountain.csv` into a data.frame called `splash_mountain`. How many columns, or features, are in the dataset? How many rows, or observations?

Items to submit
• R code used to solve the problem.

• The number of columns (features) and rows (observations) in the dataset.
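The general pattern of reading a csv file and inspecting its size can be sketched as follows. This is only an illustration on a small, made-up data.frame written to a temporary file; the names `toy`, `toy_path`, and `toy_read` are hypothetical and are not the course dataset.

```r
# Hypothetical toy data written to a temporary csv file (not the course dataset).
toy <- data.frame(ride = c("a", "b", "c"), wait = c(5, 10, 15))
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

# Read the file back into a data.frame.
toy_read <- read.csv(toy_path)

# dim() gives the number of rows followed by the number of columns;
# nrow() and ncol() give each one separately.
print(dim(toy_read))
print(nrow(toy_read))
print(ncol(toy_read))

# str() shows each column's name and type at a glance.
str(toy_read)
```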

### Question 2

Splash Mountain is a fan favorite ride at Disney World’s Magic Kingdom theme park. `splash_mountain` contains a series of dates and datetimes. For each datetime, `splash_mountain` contains a posted minimum wait time, `SPOSTMIN`, and an actual minimum wait time, `SACTMIN`. What is the average posted minimum wait time for Splash Mountain? What is the standard deviation? Based on the fact that `SPOSTMIN` represents the posted minimum wait time for our ride, do our mean and standard deviation make sense? Explain. (You might look ahead to Question 3 before writing the answer to Question 2.)

Note: If you got `NA` or `NaN` as a result, the column likely contains missing values; see the `na.rm` argument of `mean` and `sd`.

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences explaining why or why not the results make sense.
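To see why a summary statistic can come back as `NA`, here is a minimal sketch on a made-up vector (the name `wait` and its values are assumptions for illustration, not values from `splash_mountain`).

```r
# A made-up vector with one missing value.
wait <- c(5, 10, NA, 15)

# By default, a single NA makes the result NA.
print(mean(wait))

# na.rm = TRUE drops missing values before computing the statistic.
mean_wait <- mean(wait, na.rm = TRUE)
sd_wait <- sd(wait, na.rm = TRUE)
print(mean_wait)
print(sd_wait)
```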

### Question 3

In Question 2, we got some peculiar values for the mean and standard deviation. If you read the "attractions" tab in the file `/class/datamine/data/disney/touringplans_data_dictionary.xlsx`, you will find that -999 is used as a value in `SPOSTMIN` and `SACTMIN` to indicate that the ride was closed. Recalculate the mean and standard deviation of `SPOSTMIN`, excluding values that are -999. Does this seem to have fixed our problem?

Items to submit
• R code used to solve this problem.

• The result of running the R code.

• A statement indicating whether or not the values look reasonable now.
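Excluding a sentinel value such as -999 is a natural use of logical indexing. The sketch below uses a made-up vector; the names `wait`, `keep`, and `open_wait` are hypothetical, and this is one possible approach rather than the only one.

```r
# A made-up vector where -999 is a sentinel meaning "closed" and NA is missing.
wait <- c(5, -999, 10, NA, 15, -999)

# Build a logical vector that is TRUE only for real, non-missing values.
keep <- !is.na(wait) & wait != -999
print(keep)

# Logical indexing keeps just the elements where keep is TRUE.
open_wait <- wait[keep]
print(open_wait)

# Summary statistics are now computed on the remaining values only.
print(mean(open_wait))
print(sd(open_wait))
```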

### Question 4

`SPOSTMIN` and `SACTMIN` aren’t the greatest feature/column names. An outsider looking at the data.frame wouldn’t be able to immediately get the gist of what they represent. Change `SPOSTMIN` to `posted_min_wait_time` and `SACTMIN` to `actual_wait_time`.

Hint: You can always use hard-coded integers to change names manually. However, if you use `which`, you can get the index of the column name that you would like to change. For data.frames like `splash_mountain`, this is a lot more efficient than manually counting which column is the one with a certain name.

Items to submit
• R code used to solve the problem.

• The output from executing `names(splash_mountain)` or `colnames(splash_mountain)`.
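The `which` approach from the hint can be sketched on a small, made-up data.frame. The column names `OLDNAME` and `new_name` are hypothetical placeholders.

```r
# A made-up data.frame with one poorly named column.
toy <- data.frame(date = c("01/01/2020", "01/02/2020"), OLDNAME = c(5, 10))

# which() returns the position(s) where the comparison is TRUE,
# so there is no need to count columns by hand.
idx <- which(names(toy) == "OLDNAME")
print(idx)

# Assign the new name at that position.
names(toy)[idx] <- "new_name"
print(names(toy))
```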

### Question 5

Use the `cut` function to create a new vector called `quarter` that breaks the `date` column up by quarter. Use the `labels` argument in the `factor` function to label the quarters "q1", "q2", …, "qX" where `X` is the last quarter. Add `quarter` as a column named `quarter` in `splash_mountain`. How many quarters are there?

Note: If you have 2 years of data, this will result in 8 quarters: "q1", …, "q8".

Hint: We can generate sequential data using `seq` and `paste0`: `paste0("item", seq(1, 5))` or `paste0("item", 1:5)`.

Items to submit
• R code used to solve the problem.

• The `head` and `tail` of `splash_mountain`.

• The number of quarters in the new `quarter` column.

Question 5 is intended to be a little more challenging, so we worked through the exact same steps with two other datasets. If you work through those examples, all you need to do to solve Question 5 is follow them and change two things: the dataset itself (the file passed to `read.csv`) and the format of the date. The examples step you through everything in Question 5.

We hope that these are helpful resources for you! We appreciate you very much and we are here to support you. Because we are just getting started, you are not expected to know how to solve this question on your own. We sometimes like to include a question like this, which introduces several new things, and we will dive deeper into these ideas as we push ahead.
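As a further illustration, the `cut`-by-quarter technique can be sketched on a handful of made-up dates. The data.frame `toy`, its dates, and the `"%m/%d/%Y"` date format are all assumptions for this sketch; check the actual format of the `date` column before adapting it.

```r
# Made-up dates, one in each of five consecutive quarters.
# The "%m/%d/%Y" format is an assumption for this toy example.
toy <- data.frame(
  date = c("01/15/2019", "05/20/2019", "08/03/2019", "11/11/2019", "02/10/2020")
)
toy_dates <- as.Date(toy$date, format = "%m/%d/%Y")

# cut() on Date values with breaks = "quarter" returns a factor whose
# levels are the starting dates of each quarter.
quarter <- cut(toy_dates, breaks = "quarter")
print(levels(quarter))

# Relabel the levels as "q1", "q2", ... using the labels argument of factor().
# Passing levels explicitly keeps levels and labels the same length.
quarter <- factor(
  quarter,
  levels = levels(quarter),
  labels = paste0("q", seq_len(nlevels(quarter)))
)

# Add the factor as a new column and count the quarters.
toy$quarter <- quarter
print(toy)
print(nlevels(toy$quarter))
```

Passing `levels = levels(quarter)` to `factor` is a defensive choice in this sketch: it keeps every quarter that `cut` produced, even one that happens to contain no rows.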

### Question 6

Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." If you disagree with this statement, please consult with us at [email protected] for an alternative plan.