STAT 19000: Project 2 — Fall 2020

Introduction to R using 84.51 examples

Introduction to R using NYC Yellow Taxi Cab examples

Motivation: The R environment is a powerful tool to perform data analysis. R is a tool that is often compared to Python. Both have their advantages and disadvantages, and both are worth learning. In this project we will dive in head first and learn the basics while solving data-driven problems.

Context: Last project we set the stage for the rest of the semester. We got some familiarity with our project templates, and modified and ran some R code. In this project, we will continue to use R within RStudio to solve problems. Soon you will see how powerful R is and why it is often a more effective tool to use than spreadsheets.

Scope: r, vectors, indexing, recycling

Learning Objectives
  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Read and write basic (csv) data.

  • Explain what "recycling" is in R and predict behavior of provided statements.

  • Identify good and bad aspects of simple plots.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/disney/metadata.csv

Questions

Question 1

Use the read.csv function to load /class/datamine/data/disney/metadata.csvinto a data.frame called myDF. Note that read.csv by default loads data into a data.frame. (We will learn more about the idea of a data.frame, but for now, just think of it like a spreadsheet, in which each column has the same type of data.) Print the first few rows of myDF using the head function (as in Project 1, Question 7).

Items to submit
  • R code used to solve the problem in an R code chunk.

Question 2

We’ve provided you with R code below that will extract the column WDWMAXTEMP of myDF into a vector. What is the 1st value in the vector? What is the 50th value in the vector? What type of data is in the vector? (For this last question, use the typeof function to find the type of data.)

our_vec <- myDF$WDWMAXTEMP
Items to submit
  • R code used to solve the problem in an R code chunk.

  • The values of the first, and 50th element in the vector.

  • The type of data in the vector (using the typeof function).

Question 3

Use the head function to create a vector called first50 that contains the first 50 values of the vector our_vec. Use the tail function to create a vector called last50 that contains the last 50 values of the vector our_vec.

You can access many elements in a vector at the same time. To demonstrate this, create a vector called mymix that contain the sum of each element of first50 being added to the analogous element of last50.

Items to submit
  • R code used to solve this problem.

  • The contents of each of the three vectors.

Question 4

In (3), we were able to rapidly add values together from two different vectors. Both vectors were the same size, hence, it was obvious which elements in each vector were added together.

Create a new vector called hot which contains only the values of myDF$WDWMAXTEMP which are greater than or equal to 80 (our vector contains max temperatures for days at Disney World). How many elements are in hot?

Calculate the sum of hot and first50. Do we get a warning? Read this and then explain what is going on.

Items to submit
  • R code used to solve this problem.

  • 1-2 sentences explaining what is happening when we are adding two vectors of different lengths.

Question 5

Plot the WDWMAXTEMP vector from myDF.

Items to submit
  • R code used to solve this problem.

  • Plot of the WDWMAXTEMP vector from myDF.

Question 6

The following three pieces of code each create a graphic. The first two graphics are created using only core R functions. The third graphic is created using a package called ggplot. We will learn more about all of these things later on. For now, pick your favorite graphic, and write 1-2 sentences explaining why it is your favorite, what could be improved, and include any interesting observations (if any).

dat <- table(myDF$SEASON)
dotchart(dat, main="Seasons", xlab="Number of Days in Each Season")

A plot resembling an abacus, where holiday is listed in a vertical list and the corresponding number of days are on that horizontal line, the further right indicating more days for that holiday classification. The clear winner is Spring.

dat <- tapply(myDF$WDWMEANTEMP, myDF$DAYOFYEAR, mean, na.rm=T)
seasons <- tapply(myDF$SEASON, myDF$DAYOFYEAR, function(x) unique(x)[1])
pal <- c("#4E79A7", "#F28E2B", "#A0CBE8",  "#FFBE7D", "#59A14F", "#8CD17D", "#B6992D", "#F1CE63", "#499894", "#86BCB6", "#E15759", "#FF9D9A", "#79706E", "#BAB0AC", "#1170aa", "#B07AA1")
colors <- factor(seasons)
levels(colors) <- pal
par(oma=c(7,0,0,0), xpd=NA)
barplot(dat, main="Average Temperature", xlab="Jan 1 (Day 0) - Dec 31 (Day 365)", ylab="Degrees in Fahrenheit", col=as.factor(colors), border = NA, space=0)
legend(0, -30, legend=levels(factor(seasons)), lwd=5, col=pal, ncol=3, cex=0.8, box.col=NA)

A filled line plot with colors corresponding to the predominant holiday at the time.

library(ggplot2)
library(tidyverse)
summary_temperatures <- myDF %>%
    select(MONTHOFYEAR,WDWMAXTEMP:WDWMEANTEMP) %>%
    group_by(MONTHOFYEAR) %>%
    summarise_all(mean, na.rm=T)
ggplot(summary_temperatures, aes(x=MONTHOFYEAR)) +
    geom_ribbon(aes(ymin = WDWMINTEMP, ymax = WDWMAXTEMP), fill = "#ceb888", alpha=.5) +
    geom_line(aes(y = WDWMEANTEMP), col="#5D8AA8") +
    geom_point(aes(y = WDWMEANTEMP), pch=21,fill = "#5D8AA8", size=2) +
    theme_classic() +
    labs(x = 'Month', y = 'Temperature', title = 'Average temperature range' ) +
    scale_x_continuous(breaks=1:12, labels=month.abb)

Line plot of temperatures over months including the range. Displays a very clear arch, highs in July-August at an average of 82 degrees Fahrenheit and lows at an average of 52 degrees Fahrenheit in January.