# STAT 39000: Project 1 — Fall 2020

Motivation: In this project we will jump right into an R review by breaking one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. At the end of this project we will have greatly simplified a dataset, making it easy to dig into.

Context: We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we’ve previously learned.

Scope: data wrangling in R, functions

Learning objectives
• Comprehend what a function is, and the components of a function in R.

• Read and write basic (csv) data.

• Utilize apply functions in order to solve a data-driven problem.

Make sure to read about and use the template found here, and read the important information about project submissions here.

You can find useful examples that walk you through relevant material in The Examples Book.

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

It is highly recommended that you use rstudio.scholar.rcac.purdue.edu/. Simply click on the link and log in using your Purdue account credentials.

We decided to move away from ThinLinc and away from the version of RStudio used last year (desktop.scholar.rcac.purdue.edu). That version of RStudio is known to have some strange issues when running code chunks.

Remember the very useful documentation shortcut `?`. To use, simply type `?` in the console, followed by the name of the function you are interested in.

You can also look for package documentation by using `help(package=PACKAGENAME)`, so for example, to see the documentation for the package `ggplot2`, we could run:

``help(package=ggplot2)``

Sometimes it can be helpful to see the source code of a defined function. A function is any chunk of organized code that is used to perform an operation. Source code is the underlying `R`, `C`, or `C++` code that is used to create the function. To see the source code of a defined function, type the function’s name without the `()`. For example, if we were curious about what the function `Reduce` does, we could run:

``Reduce``

Occasionally this will be less useful, as the resulting code will simply call `C` code that we can’t see. Other times it will allow you to understand the function better.
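Both cases can be illustrated briefly. In this sketch, `Reduce` prints as ordinary R code, while `sum` (our own choice of second example, not one named above) is implemented internally, so printing it shows only a short stub.

```r
# Reduce is written in R, so printing it shows the full definition.
Reduce

# sum is a primitive implemented in C; printing it shows only a short stub.
sum

# is.primitive() reports whether a function is implemented internally.
reduce_is_primitive <- is.primitive(Reduce)
sum_is_primitive <- is.primitive(sum)
```

Here `reduce_is_primitive` is `FALSE` and `sum_is_primitive` is `TRUE`, which matches what the printed output suggests.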

## Dataset:

`/class/datamine/data/airbnb`

Often (maybe even the majority of the time), data doesn’t come in one nice file or database. Explore the datasets in `/class/datamine/data/airbnb`.

## Questions

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.

### Question 1

You may have noted that, for each country, city, and date we can find 3 files: `calendar.csv.gz`, `listings.csv.gz`, and `reviews.csv.gz` (for now, we will ignore all files in the "visualisations" folders).

Let’s take a look at the data in each of the three types of files. Pick a country, city and date, and read the first 50 rows of each of the 3 datasets (`calendar.csv.gz`, `listings.csv.gz`, and `reviews.csv.gz`). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them.

`read.csv` has an argument to select the number of rows we want to read.
Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown, so you do not need to display the first 50 rows of the listings and/or reviews in your RMarkdown document. It is OK to just display the first 50 rows of the calendar entries.
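The row-limit hint can be sketched on a small throwaway file. The example below uses the `nrows` argument of `read.csv` and writes its own temporary gzip-compressed CSV so that it runs anywhere; the file name `toy.csv.gz` and the columns `id` and `value` are made up for illustration and do not come from the Airbnb data.

```r
# Build a small gzip-compressed CSV in a temporary location (toy data).
toy <- data.frame(id = 1:200, value = rnorm(200))
toy_path <- file.path(tempdir(), "toy.csv.gz")
con <- gzfile(toy_path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv reads the compressed file directly; nrows limits the rows read.
first_rows <- read.csv(toy_path, nrows = 50)
dim(first_rows)
```

Limiting the rows this way avoids loading a large file just to inspect its structure.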

To read a compressed csv, simply use the `read.csv` function:

```
dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz")
```

Let’s work towards getting this data into an easier format to analyze. From now on, we will focus on the `listings.csv.gz` datasets.

Items to submit
• Chunk of code used to read the first 50 rows of each dataset.

• 1-2 sentences briefly describing the information contained in each dataset.

• Name(s) of variable(s) that could be used to join them.

### Question 2

Write a function called `get_paths_for_country` that, given a string with the country name, returns a vector with the full paths for all of that country's `listings.csv.gz` files, starting with `/class/datamine/data/airbnb/…`.

For example, the output from `get_paths_for_country("united-states")` should have 28 entries. Here are the first 5 entries in the output:

```
[1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz"
[2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz"
[3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz"
[4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz"
[5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz"
```
`list.files` is useful with the `recursive=TRUE` option.
Use `grep` to search for the pattern `listings.csv.gz` (within the results from the first hint), and use the option `value=TRUE` to display the values found by the `grep` function.

Items to submit
• Chunk of code for your `get_paths_for_country` function.
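
The two hints for Question 2 can be tried out on a throwaway directory tree before pointing them at the real data. This is only a sketch: the directory names (`toy-tree`, `a/x`, and so on) are invented, and the `full.names` and `fixed` arguments are additions of ours that the hints do not mention.

```r
# Create a throwaway directory tree that mimics a nested data layout.
root <- file.path(tempdir(), "toy-tree")
dirs <- file.path(root, c("a/x/data", "a/y/data", "b/z/data"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(dirs, "listings.csv.gz")))
invisible(file.create(file.path(dirs, "reviews.csv.gz")))

# recursive = TRUE descends into subdirectories;
# full.names = TRUE prepends the starting directory to each result.
all_files <- list.files(file.path(root, "a"), recursive = TRUE, full.names = TRUE)

# value = TRUE makes grep return the matching strings instead of their indices.
# fixed = TRUE treats the pattern literally, so "." is not a regex wildcard.
listing_files <- grep("listings.csv.gz", all_files, value = TRUE, fixed = TRUE)
listing_files
```

Only the files under `a` are listed, and only the two `listings.csv.gz` paths survive the `grep` step.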

### Question 3

Write a function called `get_data_for_country` that, given a string with the country name, returns a data.frame containing all of the listings data for that country. Use your previously written function to help you.

Use `stringsAsFactors=FALSE` in the `read.csv` function.
Use `do.call(rbind, )` to combine a list of dataframes into a single dataframe.

Items to submit
• Chunk of code for your `get_data_for_country` function.
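
The `do.call(rbind, )` hint for Question 3 can be sketched on two tiny files. The file names `part1.csv` and `part2.csv` and the columns `id` and `name` are hypothetical, and using `lapply` to build the list is one possible approach rather than a required one.

```r
# Write two small CSV files with identical columns (toy data).
part_paths <- file.path(tempdir(), c("part1.csv", "part2.csv"))
write.csv(data.frame(id = 1:2, name = c("a", "b")), part_paths[1], row.names = FALSE)
write.csv(data.frame(id = 3:5, name = c("c", "d", "e")), part_paths[2], row.names = FALSE)

# lapply reads each file and returns a list of data.frames;
# extra arguments are passed on to read.csv.
parts <- lapply(part_paths, read.csv, stringsAsFactors = FALSE)

# do.call(rbind, parts) is equivalent to rbind(parts[[1]], parts[[2]]).
combined <- do.call(rbind, parts)
str(combined)
```

Note that `rbind` expects every data.frame in the list to have the same column names.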

### Question 4

Use your `get_data_for_country` function to get the data for a country of your choice, and make sure to name the data.frame `listings`. Take a look at the following columns: `host_is_superhost`, `host_has_profile_pic`, `host_identity_verified`, and `is_location_exact`. What is the data type for each column? (You can use `class` or `typeof` or `str` to see the data type.)

These columns would make more sense as logical values (TRUE/FALSE/NA).

Write a function called `transform_column` that, given a column containing lowercase "t"s and "f"s, transforms it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (`""`), so we need to be careful when transforming the data. Test your function on column `host_is_superhost`.
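
To see why the blank values need care, consider a toy vector (not taken from the dataset). The sketch below shows the pitfall of a plain comparison and one possible way around it; it is an illustration of the issue, not the only or intended solution.

```r
# Toy column: "t", "f", and a blank string standing in for a missing value.
x <- c("t", "f", "", "t")

# Pitfall: a plain comparison turns the blank into FALSE, not NA.
naive <- x == "t"

# One option: start from all-NA and fill in only the recognized values.
careful <- rep(NA, length(x))
careful[x == "t"] <- TRUE
careful[x == "f"] <- FALSE

naive
careful
class(careful)
```

The third element is `FALSE` in `naive` but `NA` in `careful`, and `careful` is still a logical vector.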

Items to submit
• Chunk of code for your `transform_column` function.

• Type of `transform_column(listings$host_is_superhost)`.

### Question 5

Create a histogram for response rates (`host_response_rate`) for super hosts (where `host_is_superhost` is `TRUE`). If your listings do not contain any super hosts, load data from a different country. Note that we first need to convert `host_response_rate` from a character vector containing "%" signs to a numeric variable.
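
The conversion step can be sketched on made-up values. The vectors `rates_chr` and `is_super` below are toy data, and the `"N/A"` entry is an assumed stand-in for a missing rate; check how missing values actually appear in your own data.

```r
# Toy response rates as character strings; "N/A" is a made-up missing marker.
rates_chr <- c("100%", "95%", "80%", "N/A", "100%")
is_super <- c(TRUE, TRUE, FALSE, TRUE, NA)

# Strip the "%" sign, then convert; non-numeric strings become NA
# (suppressWarnings hides the coercion warning for those entries).
rates_num <- suppressWarnings(as.numeric(sub("%", "", rates_chr, fixed = TRUE)))

# which() drops NA entries of the logical filter when subsetting.
super_rates <- rates_num[which(is_super)]

# plot = FALSE computes the bins without drawing; omit it to draw the histogram.
h <- hist(super_rates, plot = FALSE)
h$counts
```

`hist` ignores `NA` values, so only the two finite rates among the filtered entries are counted.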

Items to submit
• Chunk of code used to answer the question.

• Histogram of response rates for super hosts.