# STAT 19000: Project 7 — Fall 2021

Motivation: Two bread-and-butter functions in base R are `subset` and `merge`. `subset` provides a more natural way to filter rows and select columns from a data.frame. `merge` brings the principles that SQL uses for combining data to R.

Context: We’ve been getting comfortable working with data within the R environment. Now we are going to expand our toolset with these useful functions, all the while gaining experience and practice wrangling data!

Scope: r, subset, merge, tapply

Learning Objectives
• Gain proficiency using split, merge, and subset.

• Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• Demonstrate how to use tapply to solve data-driven problems.

## Dataset(s)

The following questions will use the following dataset(s):

• `/depot/datamine/data/goodreads/csv/*.csv`

## Questions

### Question 1

Read the `goodreads_books.csv` into a data.frame called `books`. Let’s say Dr. Ward is working on a book and new content. He is looking for advice and wants some insight from us.

A friend told him that he should pick a month in the summer to publish his book.

Based on our `books` dataset, is there any evidence that books published in certain months get higher-than-average ratings? What month would you suggest for Dr. Ward to publish his new book?

 Use columns `average_rating` and `publication_month` to solve this question.
 To read the data in faster and more efficiently, try the following:

```r
library(data.table)
books <- fread("/path/to/data")
```
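
As a rough illustration of the `tapply` pattern this question relies on, here is a minimal sketch on a made-up data.frame. The name `toy_books` and all of its values are assumptions for illustration only; the real `books` data will give different numbers and may contain missing months.

```r
# Toy data standing in for the real books data.frame (values are made up).
toy_books <- data.frame(
  average_rating    = c(4.0, 3.0, 5.0, 4.0, 2.0, 4.0),
  publication_month = c(1, 1, 6, 6, 7, 7)
)

# Mean rating within each month: tapply(values, groups, function).
monthly_mean <- tapply(toy_books$average_rating,
                       toy_books$publication_month,
                       mean, na.rm = TRUE)

# Overall mean, as a baseline for "higher than average".
overall_mean <- mean(toy_books$average_rating, na.rm = TRUE)

# Months sorted from highest to lowest mean rating.
sort(monthly_mean, decreasing = TRUE)

# Months whose mean rating exceeds the overall mean.
names(monthly_mean)[monthly_mean > overall_mean]
```

`tapply` names its result by group, so the month labels travel with the means when you sort or filter them.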

Relevant topics: tapply, mean

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences comparing the publication month based on average rating.

• 1-2 sentences with your suggestion to Dr. Ward and reasoning.

### Question 2

Create a new column called `book_size_cat` that is a categorical variable based on the number of pages a book has.

`book_size_cat` should have 3 levels: `small`, `medium`, `large`.

Run the code below to get different summaries and visualizations of the number of pages books have in our datasets.

```r
summary(books$num_pages)
hist(books$num_pages)
hist(books$num_pages[books$num_pages <= 1000])
boxplot(books$num_pages[books$num_pages < 4000])
```

Pick the cutoff values that separate these levels. Write 1-2 sentences explaining why you picked those values.

 You can make other visualizations to help you decide. Have fun, there is no right or wrong answer. What would you consider a small, medium, and large book?
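
One way to turn a numeric column into a three-level factor is `cut`. The sketch below uses made-up page counts, and the cutoffs 200 and 400 are purely hypothetical placeholders, not suggested answers.

```r
# Made-up page counts; the real values live in books$num_pages.
num_pages <- c(45, 120, 250, 310, 480, 900, NA)

# Hypothetical cutoffs (NOT a recommendation):
# 0-200 pages is "small", 201-400 is "medium", above 400 is "large".
book_size_cat <- cut(num_pages,
                     breaks = c(0, 200, 400, Inf),
                     labels = c("small", "medium", "large"),
                     include.lowest = TRUE)

# Count the books in each category, showing missing values if there are any.
table(book_size_cat, useNA = "ifany")
```

By default the intervals are closed on the right, so a 200-page book lands in `small` here. `include.lowest = TRUE` also keeps a value of exactly 0 in the first bin, and using `Inf` as the last break means no large book is left uncategorized.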

Relevant topics: cut

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences explaining the values you picked to create your categorical data and why.

• The results of running `table(books$book_size_cat)`.

### Question 3

Dr. Ward is a firm believer in constructive feedback, and would like people to provide feedback for his book.

What recommendation would you make to Dr. Ward when it comes to book size?

 Use the column `text_reviews_count` and compare, on average, how many text reviews the various book sizes get.
 Association is not causation, and there are many factors that lead to people providing reviews. Your recommendation can be based on anecdotal evidence, no worries.

Relevant topics: tapply, mean

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences with your recommendation and reasoning.

### Question 4

Sometimes (oftentimes), looking at a single summary of our data does not provide the full picture.

Make a side-by-side boxplot for the `text_reviews_count` by `book_size_cat`.

 Take a look at the first example when you run `?boxplot`.
 You can make three separate boxplots if you prefer, but make sure that they all have the same y-axis limits so that they can be compared.
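
The formula interface to `boxplot` draws one box per group on a shared axis. This is a minimal sketch on made-up data; the data.frame `toy` and its values are assumptions, and only the column names mirror the project.

```r
# Made-up data; only the column names mirror those used in the project.
toy <- data.frame(
  text_reviews_count = c(1, 3, 5, 2, 8, 40, 4, 6, 300),
  book_size_cat = factor(rep(c("small", "medium", "large"), each = 3),
                         levels = c("small", "medium", "large"))
)

# One box per level of book_size_cat, all sharing the same y-axis.
bp <- boxplot(text_reviews_count ~ book_size_cat, data = toy,
              xlab = "Book size", ylab = "Text reviews",
              main = "Text reviews by book size (toy data)")

# boxplot() invisibly returns what it drew:
# bp$names holds the group labels, bp$stats one column of box statistics per group.
bp$names
bp$stats
```

Setting the factor levels explicitly controls the left-to-right order of the boxes. If you draw three separate plots instead, passing the same `ylim` to each call keeps the axes comparable.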

Relevant topics: boxplot

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences with your recommendation and reasoning.

### Question 5

Repeat question (4), but this time use the `subset` function to reduce your data to books with a `text_reviews_count` less than 200. How does this change your plot? Is it a little easier to read?
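
For reference, a minimal sketch of how `subset` filters rows, using a made-up data.frame. The name `toy` and its values are assumptions for illustration.

```r
# Made-up data; only the column names mirror those used in the project.
toy <- data.frame(
  title              = c("A", "B", "C", "D"),
  text_reviews_count = c(12, 250, 199, NA),
  book_size_cat      = c("small", "large", "medium", "small")
)

# Keep rows where the condition is TRUE; rows where it is NA are dropped.
few_reviews <- subset(toy, text_reviews_count < 200)
nrow(few_reviews)

# The optional select argument keeps only some columns.
subset(toy, text_reviews_count < 200,
       select = c(title, text_reviews_count))
```

Inside `subset`, column names can be used directly without the `toy$` prefix. Unlike logical indexing with `[`, rows where the condition evaluates to `NA` are left out rather than returned as rows of `NA`.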

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences with your recommendation and reasoning.

### Question 6

Read the `goodreads_book_authors.csv` into a new data.frame called `authors`.

Use the `merge` function to combine the `books` data.frame with the `authors` data.frame. Call your new data.frame `books_authors`.

Now, use the `subset` function to get a subset of your data for your favorite authors. Include at least 5 authors that appear in the dataset.
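
The merge and subset steps above can be sketched on toy data as follows. The key column name `author_id`, the `name` column, and all values are hypothetical; check `names(books)` and `names(authors)` to find the actual columns the two files share.

```r
# Toy stand-ins; the key name author_id and all values are assumptions.
toy_books <- data.frame(
  author_id          = c(1, 2, 2, 3),
  title              = c("W", "X", "Y", "Z"),
  text_reviews_count = c(10, 20, 30, 40)
)
toy_authors <- data.frame(
  author_id          = c(1, 2, 4),
  name               = c("Ann", "Bo", "Cy"),
  text_reviews_count = c(100, 500, 7)
)

# Inner join on author_id (the default): unmatched rows are dropped.
toy_merged <- merge(toy_books, toy_authors, by = "author_id")

# The shared non-key column now appears twice, with .x and .y suffixes.
names(toy_merged)

# all.x = TRUE keeps every row of the first data.frame (like a SQL left join).
toy_left <- merge(toy_books, toy_authors, by = "author_id", all.x = TRUE)
nrow(toy_left)

# Keep only the rows for a chosen set of authors.
favorites <- subset(toy_merged, name %in% c("Ann", "Bo"))
```

The `.x` suffix marks the column that came from the first data.frame passed to `merge` and `.y` the one from the second, so which suffix you need depends on the argument order.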

Redo question (4) using this new subset of data. Does your recommendation change at all?

 Make sure you pay close attention to the resulting `books_authors` data.frame. Columns that appear in both data.frames, other than the columns you merge on, are renamed with suffixes. Instead of `text_reviews_count` you may need to use `text_reviews_count.x` or `text_reviews_count.y`, depending on how you merged.

Items to submit
• R code used to solve this problem.

• The results of running the R code.

• 1-2 sentences with your recommendation and reasoning.

