TDM 10100: Project 3 — Fall 2022

Motivation: data.frames are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a data.frame.

Context: In the previous project, we ran our first R code and learned about accessing data inside vectors. In this project we will continue to reinforce what we’ve already learned and introduce a new, flexible data structure called the `data.frame`.

Scope: R, data.frames, recycling, factors

Learning Objectives
  • Explain what "recycling" is in R and predict behavior of provided statements.

  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

Make sure to read about and use the template found here, and read the important information about project submissions here.

As described in the video above, Dr. Ward is using:

options(jupyter.rich_display = FALSE)

so that the work using the kernel f2022-s2023-r looks similar to the work using the kernel f2022-s2023. We will probably make this option permanent in the future, but I just wanted to point this out. You do not have to do this, but I like the way the output looks with this option.

Using the f2022-s2023-r kernel, let's first see all of the files that are in the Disney folder:

list.files("/anvil/projects/tdm/data/disney")

After looking at several of the files, we will go ahead and read in the data frame on the 7 Dwarfs Train.

myDF <- read.csv("/anvil/projects/tdm/data/disney/7_dwarfs_train.csv", stringsAsFactors = TRUE)
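
To see what stringsAsFactors = TRUE changes, here is a small sketch that writes a made-up CSV to a temporary file and reads it back both ways. The data.frame toy and its columns ride and wait are hypothetical and are not part of the Disney data.

```r
# Hypothetical toy data, written to a temporary CSV file
toy <- data.frame(ride = c("train", "boat", "train"), wait = c(10, 25, 40))
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Read it back with and without converting strings to factors
asFactor <- read.csv(tmp, stringsAsFactors = TRUE)
asChar   <- read.csv(tmp, stringsAsFactors = FALSE)

class(asFactor$ride)   # "factor"
class(asChar$ride)     # "character"
levels(asFactor$ride)  # "boat" "train"
```

A factor stores each distinct string once as a level, which is convenient for columns with a limited set of repeated values.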

To see the file size of the CSV (how large it is, in bytes), we can use file.info:

file.info("/anvil/projects/tdm/data/disney/7_dwarfs_train.csv")$size

You can also use file.info to see other information about the file.

Insider Knowledge

size- double: File size in bytes.
isdir- logical: Is the file a directory?
mode- integer of class "octmode". The file permissions, printed in octal, for example 644.
mtime, ctime, atime- integer of class "POSIXct": file modification, ‘last status change’ and last access times.
uid- integer: the user ID of the file’s owner.
gid- integer: the group ID of the file’s group.
uname- character: uid interpreted as a user name.
grname- character: gid interpreted as a group name. Unknown user and group names will be NA.
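
As a minimal sketch of reading these fields, the example below creates a small temporary file (a stand-in, not the Disney CSV) and pulls a few values out of the data.frame that file.info returns.

```r
# Create a small temporary file so the example runs anywhere
tmp <- tempfile(fileext = ".txt")
writeLines("hello", tmp)

info <- file.info(tmp)
info$size    # file size in bytes
info$isdir   # FALSE, since tmp is a regular file
info$mtime   # last modification time (class POSIXct)
names(info)  # all of the available fields
```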

ONE

Familiarizing yourself with the data.

Helpful Hint

You can look at the first 6 rows (head) and the last 6 rows (tail) of the dataset, as well as its structure (str) and its dimensions (dim).

"SACTMIN" is the actual minutes that a person waited in line
"SPOSTMIN" is the wait time, in minutes, posted at the ride as an estimate of the wait. (Any value that is -999 means that the ride was not in service)
"datetime" is the date and time the information was recorded
"date" is the date of the event

In the last project we learned about how to look at the data.frame. Based on that, write 1-2 sentences describing the dataset (how many rows, how many columns, the type of data, etc.) and what it holds. Use the head command to look at the first 21 rows.
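
The inspection functions from the hint can be tried first on a small made-up data.frame. This is only a sketch: toy and its columns are hypothetical, and the same calls apply to myDF.

```r
# Hypothetical toy data.frame with 10 rows and 2 columns
toy <- data.frame(id = 1:10, wait = seq(5, 50, by = 5))

head(toy)         # first 6 rows (the default)
head(toy, n = 3)  # first 3 rows
tail(toy)         # last 6 rows
dim(toy)          # 10 2 (rows, then columns)
str(toy)          # column names, types, and a preview of the values
```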

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences explaining the dataset.

TWO

Now that we have a better understanding of the content and structure of our data, we can dive a bit deeper and make connections within the data.

  1. If we are looking at the column "SPOSTMIN" what do you notice about the increments of time? I.e., is there anything special about the types of values that appear? How many different wait time options do you see in "SPOSTMIN"?

  2. How many NA values do you see in "SPOSTMIN"?

  3. Create a new data frame with the name newDF in which the "SPOSTMIN" column has all NA values removed. In other words, select the rows of myDF for which "SPOSTMIN" is not NA and call the resulting data.frame by the name newDF.

Insider Knowledge

na.omit and na.exclude both return the object with any observations (rows) that contain missing values removed. They differ only in how some modeling functions treat the removed rows afterwards: with na.exclude, residuals and predictions are padded with NA back to the original length.
na.rm = TRUE is an argument to functions such as mean and sum; it removes the NA values first and then does the calculation.
na.pass returns the object unchanged
It is also possible to use the subset function and the is.na function.
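
The options above can be compared on a tiny example. This is a sketch on a hypothetical data.frame called toy, not on myDF; note that na.omit looks at every column, while the is.na approaches look only at the column you name.

```r
# Hypothetical toy data.frame with missing values in column x
toy <- data.frame(x = c(5, NA, 15, NA), y = c("a", "b", "c", "d"))

sum(is.na(toy$x))          # 2: count of NA values in x
mean(toy$x)                # NA: a missing value makes the result NA
mean(toy$x, na.rm = TRUE)  # 10: NA values dropped before averaging

kept1 <- toy[!is.na(toy$x), ]    # logical indexing on rows
kept2 <- subset(toy, !is.na(x))  # same rows via subset
kept3 <- na.omit(toy)            # drops rows with NA in ANY column
nrow(kept1)                      # 2
```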

Helpful Hint

Use the code below

table(myDF$SPOSTMIN)
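
One thing to keep in mind is that table leaves NA values out of its counts by default. A short sketch on a hypothetical vector named waits shows the difference that the useNA argument makes:

```r
# Hypothetical toy vector of posted wait times
waits <- c(5, 10, 10, NA, 15, NA)

table(waits)                   # counts for 5, 10, 15; NA is left out
table(waits, useNA = "ifany")  # adds a count for NA when any are present
length(unique(waits))          # 4: unique() keeps NA as its own value
```
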
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the 3 questions above.

THREE

Use the myDF data.frame for this question.

  1. On Christmas Day, what was the average wait time? On July 26th, what was the average wait time?

  2. Is there a difference between the wait times in the summer and the holidays?

  3. On which date do the most entries occur in the data set?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the 3 questions above.

FOUR

Recycling in R

Insider Knowledge

Recycling happens automatically in R when you perform operations such as addition or subtraction on two vectors of unequal length.
The shorter vector is repeated, element by element, until it matches the length of the longer vector.
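
A minimal sketch of recycling on made-up vectors (the names long, short, and odd are only for illustration):

```r
long  <- 1:6
short <- c(10, 20)

# short is reused as 10, 20, 10, 20, 10, 20
long + short          # 11 22 13 24 15 26

# When the longer length is not a multiple of the shorter one,
# R still recycles but also issues a warning
odd <- 1:5
result <- suppressWarnings(odd + short)
result                # 11 22 13 24 15
length(result)        # 5
```

The result always has the length of the longer vector, and suppressWarnings is used here only to keep the output quiet.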

  1. Find the lengths of the column "SPOSTMIN" in myDF and in newDF.

  2. Create a new vector called myhours by adding together "SPOSTMIN" columns from myDF and newDF with each divided by 60. What is the length of that new vector myhours?

  3. What happened in row 313997? Why?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the 3 questions above.

FIVE

Indexing and Expanding data.frames in R

library(lubridate)
myDF$weekday <- wday(myDF$datetime, label=TRUE)
  1. Consider the average wait times. What day of the week in myDF has the longest average wait times?

  2. Make a plot and a dotchart that illustrate the data for the average wait times. Which one conveys the information better and why?

  3. We created a new column in myDF that shows the weekdays. Repeat parts 1 and 2, but this time use the months instead of the days of the week.
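
One common way to compute a per-group average is tapply. The sketch below uses a hypothetical data.frame toy with a ready-made weekday column, so it does not need lubridate; the same pattern can be applied to a real grouping column.

```r
# Hypothetical toy data: wait times with a day label already attached
toy <- data.frame(
  wait    = c(10, 20, 30, 40, 50, NA),
  weekday = factor(c("Mon", "Mon", "Tue", "Tue", "Wed", "Wed"),
                   levels = c("Mon", "Tue", "Wed"))
)

# Average wait per group; na.rm = TRUE is passed through to mean
avgWait <- tapply(toy$wait, toy$weekday, mean, na.rm = TRUE)

# Convert the 1-d array from tapply into a plain named vector
avgVec <- setNames(as.numeric(avgWait), names(avgWait))
avgVec                    # Mon 15, Tue 35, Wed 50
names(which.max(avgVec))  # "Wed"

# Two ways to draw the same averages
plot(avgVec, type = "b", xaxt = "n", xlab = "weekday", ylab = "average wait")
axis(1, at = seq_along(avgVec), labels = names(avgVec))
dotchart(avgVec, xlab = "average wait")
```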

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The answers to the 3 questions above.

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.