STAT 19000: Project 11 — Fall 2020

Motivation: The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done, takes practice. In this project we will use what you’ve learned so far this semester to solve data-driven problems. In previous projects, we’ve directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you’d like.

Context: You’ve learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of economic data from Zillow.

Scope: R

Learning objectives
  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Utilize apply functions in order to solve a data-driven problem.

  • Gain proficiency using split, merge, and subset.

  • Comprehend what a function is, and the components of a function in R.

  • Demonstrate the ability to use nested apply functions to solve a data-driven problem.


The following questions will use the dataset found in Scholar:



Question 1

Read /class/datamine/data/zillow/Zip_time_series.csv into a data.frame called zipc. Look at the RegionName column. It is supposed to be a 5-digit zip code. Either fix the column by writing a function and applying it to the column, or take the time to read the read.csv documentation by running ?read.csv and use an argument to make sure that column is not read in as an integer (which is why zip codes starting with 0 lose the leading 0 when being read in).

This video demonstrates how to read in data and respect the leading zeroes.

Items to submit
  • R code used to solve the problem.

  • head of the RegionName column.

Question 2

One might assume that the owner of a house tends to value that house more than the buyer. If that was the case, perhaps the median listing price (the price which the seller puts the house on the market, or ask price) would be higher than the ZHVI (Zillow Home Value Index — essentially an estimate of the home value). For those rows where both MedianListingPrice_AllHomes and ZHVI_AllHomes have non-NA values, on average how much higher or lower is the median listing price? Can you think of any other reasons why this may be?

Items to submit
  • R code used to solve the problem.

  • The result itself and 1-2 sentences talking about whether or not you can think of any other reasons that may explain the result.

Question 3

Convert the Date column to a date using as.Date. How many years of data do we have in this dataset? Create a line plot with lines for the average MedianListingPrice_AllHomes and average ZHVI_AllHomes by year. The result should be a single plot with multiple lines on it.

Here we give two videos to help you with this question. The first video gives some examples about working with dates in R.

This second video gives an example about how to plot two line graphs at the same time in R.

For a nice addition, add a dotted vertical line on year 2008 near the housing crisis:

abline(v="2008", lty="dotted")
Items to submit
  • R code used to solve the problem.

  • The results of running the code.

Question 4

Read /class/datamine/data/zillow/State_time_series.csv into a data.frame called states. Calculate the average median listing price by state, and create a map using plot_usmap from the usmap package that shows the average median price by state.

We give a full example about how to plot values, by State, on a map.

In order for plot_usmap to work, you must name the column containing states' names to "state".

To split words like "OhSoCool" into "Oh So Cool", try this: trimws(gsub('()', ' \\1', "OhSoCool")). This will be useful as you’ll need to correct the RegionName column at some point in time. Notice that this will not completely fix "DistrictofColumbia". You will need to fix that one manually.

Items to submit
  • R code used to solve the problem.

  • The resulting map.

Question 5

Read /class/datamine/data/zillow/County_time_series.csv into a data.frame named counties. Choose a state (or states) that you would like to "dig down" into county-level data for, and create a plot (or plots) like in (4) that show some interesting statistic by county. You can choose average median listing price if you so desire, however, you don’t need to! There are other cool data!

Make sure that you remember to aggregate your data by RegionName so the plot renders correctly.

plot_usmap looks for a column named fips. Make sure to rename the RegionName column to fips prior to passing the data.frame to plot_usmap.

If you get Question 4 working correctly, here are the main differences for Question 5. You need the regions to be "counties" instead of "states", and you need the data.frame to have a column called fips instead of state. These are the main differences between Question 4 and Question 5.

Items to submit
  • R code used to solve the problem.

  • The resulting map.