STAT 19000: Project 11 — Fall 2020

Motivation: The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done takes practice. In this project we will use what you’ve learned so far this semester to solve data-driven problems. In previous projects, we’ve directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you’d like.

Context: You’ve learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of economic data from Zillow.

Scope: R

Learning objectives
• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• Utilize apply functions in order to solve a data-driven problem.

• Gain proficiency using split, merge, and subset.

• Comprehend what a function is, and the components of a function in R.

• Demonstrate the ability to use nested apply functions to solve a data-driven problem.

Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/zillow`

Questions

Question 1

Read `/class/datamine/data/zillow/Zip_time_series.csv` into a data.frame called `zipc`. Look at the `RegionName` column. It is supposed to be a 5-digit zip code. Either fix the column by writing a function and applying it to the column, or take the time to read the `read.csv` documentation by running `?read.csv` and use an argument to make sure that column is not read in as an integer (which is why zip codes starting with `0` lose the leading `0` when being read in).
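
As a rough illustration of the two approaches described above, here is a minimal sketch on a tiny made-up CSV (the values and the helper name `pad_zip` are assumptions for illustration, not part of the Zillow data). It shows the leading zero being lost by default, kept with the `colClasses` argument of `read.csv`, and restored afterwards with a small function.

```r
# Toy CSV standing in for Zip_time_series.csv (values are made up).
csv_text <- "Date,RegionName,ZHVI_AllHomes
2017-01-31,01001,210000
2017-01-31,47906,150000"

# Default behavior: RegionName is parsed as an integer, so "01001" becomes 1001.
as_int <- read.csv(text = csv_text)

# Option 1: ask read.csv to keep RegionName as character via colClasses.
as_chr <- read.csv(text = csv_text, colClasses = c(RegionName = "character"))

# Option 2: repair an integer column after the fact by left-padding to width 5.
# pad_zip is a hypothetical helper name.
pad_zip <- function(z) sprintf("%05d", as.integer(z))
fixed <- pad_zip(as_int$RegionName)

head(as_chr$RegionName)
head(fixed)
```

With the real file, the same `colClasses` argument could be passed alongside the file path instead of `text = csv_text`.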

This video demonstrates how to read in data and respect the leading zeroes.

Items to submit
• R code used to solve the problem.

• `head` of the `RegionName` column.

Question 2

One might assume that the owner of a house tends to value that house more than the buyer does. If that were the case, perhaps the median listing price (the price at which the seller puts the house on the market, or asking price) would be higher than the ZHVI (Zillow Home Value Index, essentially an estimate of the home value). For those rows where both `MedianListingPrice_AllHomes` and `ZHVI_AllHomes` have non-NA values, on average how much higher or lower is the median listing price? Can you think of any other reasons why this may be?
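
One possible way to restrict the comparison to rows where both columns are present is logical indexing. The sketch below uses a small made-up data.frame (`zipc_toy` and its values are assumptions for illustration), so the number it produces says nothing about the real data.

```r
# Toy stand-in for zipc; the prices are made up.
zipc_toy <- data.frame(
  MedianListingPrice_AllHomes = c(200000, NA, 310000, 150000),
  ZHVI_AllHomes               = c(190000, 250000, NA, 155000)
)

# TRUE only for rows where neither column is NA.
both <- !is.na(zipc_toy$MedianListingPrice_AllHomes) &
  !is.na(zipc_toy$ZHVI_AllHomes)

# Listing price minus ZHVI, row by row, for the complete rows only.
diffs <- zipc_toy$MedianListingPrice_AllHomes[both] - zipc_toy$ZHVI_AllHomes[both]

# A positive value means listing prices are higher on average.
avg_diff <- mean(diffs)
avg_diff
```

`complete.cases` applied to the two columns would give the same logical vector; which form to use is a matter of taste.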

Items to submit
• R code used to solve the problem.

• The result itself and 1-2 sentences talking about whether or not you can think of any other reasons that may explain the result.

Question 3

Convert the `Date` column to a date using `as.Date`. How many years of data do we have in this dataset? Create a line plot with lines for the average `MedianListingPrice_AllHomes` and average `ZHVI_AllHomes` by year. The result should be a single plot with multiple lines on it.
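
The steps in this question (convert dates, extract the year, average by year, draw two lines) can be sketched as follows. The data.frame `toy`, its values, and the colors are assumptions for illustration. `tapply` is only one of several ways to compute the yearly averages.

```r
# Toy stand-in for zipc; dates and prices are made up.
toy <- data.frame(
  Date = c("2007-01-31", "2007-06-30", "2008-01-31", "2008-06-30", "2009-01-31"),
  MedianListingPrice_AllHomes = c(200000, 220000, 180000, NA, 170000),
  ZHVI_AllHomes = c(190000, 210000, 185000, 175000, 160000)
)

# Convert the text dates to Date objects, then pull out the year as a number.
toy$Date <- as.Date(toy$Date)
toy$year <- as.numeric(format(toy$Date, "%Y"))

# How many distinct years appear in the data.
n_years <- length(unique(toy$year))

# Average of each column by year, ignoring missing values.
avg_listing <- tapply(toy$MedianListingPrice_AllHomes, toy$year, mean, na.rm = TRUE)
avg_zhvi <- tapply(toy$ZHVI_AllHomes, toy$year, mean, na.rm = TRUE)
years <- as.numeric(names(avg_listing))

# First line sets up the plot; the second is added with lines().
plot(years, avg_listing, type = "l", col = "blue",
     ylim = range(c(avg_listing, avg_zhvi)),
     xlab = "Year", ylab = "Average price")
lines(years, avg_zhvi, col = "red")
abline(v = 2008, lty = "dotted")
legend("topright", legend = c("Median listing price", "ZHVI"),
       col = c("blue", "red"), lty = 1)
```

Setting `ylim` from both series keeps the second line from being clipped when its range differs from the first.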

Here we give two videos to help you with this question. The first video gives some examples of working with dates in R.
The second video gives an example of how to plot two line graphs at the same time in R.
For a nice addition, add a dotted vertical line at the year 2008, near the housing crisis:
`abline(v=2008, lty="dotted")`

Items to submit
• R code used to solve the problem.

• The results of running the code.

Question 4

Read `/class/datamine/data/zillow/State_time_series.csv` into a data.frame called `states`. Calculate the average median listing price by state, and create a map using `plot_usmap` from the `usmap` package that shows the average median price by state.
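
A minimal sketch of the data preparation for this map, on made-up rows (`states_toy` and its prices are assumptions for illustration): split run-together state names, average by state, and pass the result to `plot_usmap`. The plotting call is guarded because it assumes the `usmap` package is installed.

```r
# Toy stand-in for states; names mimic run-together RegionName values.
states_toy <- data.frame(
  RegionName = c("NewYork", "NewYork", "Indiana", "Indiana", "DistrictofColumbia"),
  MedianListingPrice_AllHomes = c(300000, 320000, 150000, NA, 500000)
)

# Insert a space before every uppercase letter, then trim the leading space.
states_toy$state <- trimws(gsub("([[:upper:]])", " \\1", states_toy$RegionName))

# The lowercase "of" is not split, so this one name is fixed by hand.
states_toy$state[states_toy$state == "Districtof Columbia"] <- "District of Columbia"

# Average listing price per state; the formula interface drops NA rows.
avg_by_state <- aggregate(MedianListingPrice_AllHomes ~ state,
                          data = states_toy, FUN = mean)

# Draw the map only if usmap is available.
if (requireNamespace("usmap", quietly = TRUE)) {
  print(usmap::plot_usmap(data = avg_by_state,
                          values = "MedianListingPrice_AllHomes",
                          regions = "states"))
}
```

States missing from the data.frame are simply drawn without a fill value, so a toy example like this one colors only three regions.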

We give a full example of how to plot values, by state, on a map.
In order for `plot_usmap` to work, you must rename the column containing the state names to `state`.
To split words like "OhSoCool" into "Oh So Cool", try this: `trimws(gsub('([[:upper:]])', ' \\1', "OhSoCool"))`. This will be useful, as you’ll need to correct the `RegionName` column at some point. Notice that this will not completely fix "DistrictofColumbia". You will need to fix that one manually.

Items to submit
• R code used to solve the problem.

• The resulting map.

Question 5

Read `/class/datamine/data/zillow/County_time_series.csv` into a data.frame named `counties`. Choose a state (or states) whose county-level data you would like to "dig down" into, and create a plot (or plots) like the one in Question 4 that shows some interesting statistic by county. You can choose average median listing price if you so desire; however, you don’t need to! There are other cool data!
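
A county-level version might look like the sketch below, assuming `RegionName` holds county FIPS codes. The rows in `counties_toy` are made up, Indiana (state FIPS `18`) is an arbitrary choice, and padding the codes to five characters is a precaution rather than something the source requires. The plotting call again assumes `usmap` is installed.

```r
# Toy stand-in for counties; RegionName holds county FIPS codes, prices are made up.
counties_toy <- data.frame(
  RegionName = c(18157, 18157, 18097, 18097, 17031),
  MedianListingPrice_AllHomes = c(140000, 160000, 130000, NA, 250000)
)

# Store FIPS codes as 5-character strings so leading zeros are never lost.
counties_toy$fips <- sprintf("%05d", as.integer(counties_toy$RegionName))

# Keep only Indiana counties: the first two digits are the state FIPS code.
indiana <- counties_toy[substr(counties_toy$fips, 1, 2) == "18", ]

# One row per county, so each region on the map gets a single value.
avg_by_county <- aggregate(MedianListingPrice_AllHomes ~ fips,
                           data = indiana, FUN = mean)

# Draw the county map only if usmap is available.
if (requireNamespace("usmap", quietly = TRUE)) {
  print(usmap::plot_usmap(regions = "counties", include = "IN",
                          data = avg_by_county,
                          values = "MedianListingPrice_AllHomes"))
}
```

Any other numeric column could be swapped in for `MedianListingPrice_AllHomes` in both the `aggregate` formula and the `values` argument.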

Make sure that you remember to aggregate your data by `RegionName` so the plot renders correctly.
`plot_usmap` looks for a column named `fips`. Make sure to rename the `RegionName` column to `fips` prior to passing the data.frame to `plot_usmap`.
If you get Question 4 working correctly, there are two main differences for Question 5: you need `regions` to be `"counties"` instead of `"states"`, and you need the data.frame to have a column called `fips` instead of `state`.

Items to submit
• R code used to solve the problem.

• The resulting map.