STAT 19000: Project 12 — Fall 2020

Motivation: In the previous project you were forced to do a little bit of date manipulation. Dates can be very difficult to work with, regardless of the language you are using. lubridate is a package within the famous tidyverse, that greatly simplifies some of the most common tasks one needs to perform with date data.

Context: We’ve been reviewing topics learned this semester. In this project we will continue solving data-driven problems, wrangling data, and creating graphics. We will introduce a tidyverse package that adds great stand-alone value when working with dates.

Scope: r

Learning objectives
  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Utilize apply functions in order to solve a data-driven problem.

  • Gain proficiency using split, merge, and subset.

  • Demostrate the ability to create basic graphs with default settings.

  • Demonstratre the ability to modify axes labels and titles.

  • Incorporate legends using legend().

  • Demonstrate the ability to customize a plot (color, shape/linetype).

  • Convert strings to dates, and format dates using the lubridate package.

Questions

Question 1

Let’s continue our exploration of the Zillow time series data. A useful package for dealing with dates is called lubridate. This is part of the famous tidyverse suite of packages. Run the code below to load it. Read the /class/datamine/data/zillow/State_time_series.csv dataset into a data.frame named states. What class and type is the column Date?

library(lubridate)
Items to submit
  • R code used to solve the question.

  • class and typeof column Date.

Question 2

Convert column Date to a corresponding date format using lubridate. Check that you correctly transformed it by checking its class like we did in question (1). Compare and contrast this method of conversion with the solution you came up with for question (3) in the previous project. Which method do you prefer?

Take a look at the following functions from lubridate: ymd, mdy, dym.

Here is a video about ymd, mdy, dym

Items to submit
  • R code used to solve the question.

  • class of modified column Date.

  • 1-2 sentences stating which method you prefer (if any) and why.

Question 3

Create 3 new columns in states called year, month, day_of_week (Sun-Sat) using lubridate. Get the frequency table for your newly created columns. Do we have the same amount of data for all years, for all months, and for all days of the week? We did something similar in question (3) in the previous project — specifically, we broke each date down by year. Which method do you prefer and why?

Take a look at functions month, year, day, wday.

You may find the argument of label in wday useful.

Here is a video about month, year, day, wday

Items to submit
  • R code used to solve the question.

  • Frequency table for newly created columns.

  • 1-2 sentences answering whether or not we have the same amount of data for all years, months, and days of the week.

  • 1-2 sentences stating which method you prefer (if any) and why.

Question 4

Is there a better month or set of months to put your house on the market? Use tapply to compare the average DaysOnZillow_AllHomes for all months. Make a barplot showing our results. Make sure your barplot includes "all of the fixings" (title, labeled axes, legend if necessary, etc. Make it look good.).

If you want to have the month’s abbreviation in your plot, you may find both the month.abb object and the argument names.arg in barplot useful.

This video might help with Question 4.

Items to submit
  • R code used to solve the question.

  • The barplot of the average DaysOnZillow_AllHomes for all months.

  • 1-2 sentences answering the question "Is there a better time to put your house on the market?" based on your results.

Question 5

Filter the states data to contain only years from 2010+ and call it states2010plus. Make a lineplot showing the average DaysOnZillow_AllHomes by Date using states2010plus data. Can you spot any trends? Write 1-2 sentences explaining what (if any) trends you see.

Items to submit
  • R code used to solve the question.

  • The time series lineplot for the average DaysOnZillow_AllHomes per date.

  • 1-2 sentences commenting on the patterns found in the plot, and your impressions of it.

Question 6

Do homes sell faster in certain states? For the following states: 'California', 'Indiana', 'NewYork' and 'Florida', make a lineplot for DaysOnZillow_AllHomes by Date with one line per state. Use the states2010plus dataset for this question. Make sure to have each state line colored differently, and to add a legend to your plot. Examine the plot and write 1-2 sentences about any observations you have.

You may want to use the lines function to add the lines for different state.

Make sure to fix the y-axis limits using the ylim argument in plot to properly show all four lines.

You may find the argument col useful to change the color of your line.

To make your legend fit, consider using the states abbreviation, and the arguments ncol and cex of the legend function.

Items to submit
  • R code used to solve the question.

  • The time series lineplot for DaysOnZillow_AllHomes per date for the 4 states.

  • 1-2 sentences commenting on the patterns found in the plot, and your answer to the question "Do homes sell faster in certain states rather than others?".