STAT 19000: Project 7 — Fall 2021

Motivation: A couple of bread-and-butter functions that are a part of the base R are: subset, and merge. subset provides a more natural way to filter and select data from a data.frame. merge brings the principals of combining data that SQL uses, to R.

Context: We’ve been getting comfortable working with data in within the R environment. Now we are going to expand our toolset with these useful functions, all the while gaining experience and practice wrangling data!

Scope: r, subset, merge, tapply

Learning Objectives
  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Demonstrate how to use tapply to solve data-driven problems.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/goodreads/csv/*.csv

Questions

Question 1

Read the goodreads_books.csv into a data.frame called books. Let’s say Dr. Ward is working on a book and new content. He is looking for advice and wants some insight from us.

A friend told him that he should pick a month in the Summer to publish his book.

Based on our books dataset, is there any evidence that certain months get higher than average rating? What month would you suggest for Dr. Ward to publish his new book?

Use columns average_rating and publication_month to solve this question.

To read the data in faster and more efficiently, try the following:

library(data.table)
books <- fread("/path/to/data")
r

Relevant topics: tapply, mean

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences comparing the publication month based on average rating.

  • 1-2 sentences with your suggestion to Dr. Ward and reasoning.

Question 2

Create a new column called book_size_cat that is a categorical variable based on the number of pages a book has.

book_size_cat should have 3 levels: small, medium, large.

Run the code below to get different summaries and visualizations of the number of pages books have in our datasets.

summary(books$num_pages)
hist(books$num_pages)
hist(books$num_pages[books$num_pages <= 1000])
boxplot(books$num_pages[books$num_pages < 4000])
r

Pick the values from which to separate these levels by. Write 1-2 sentences explaining why you pick those values.

You can do other visualizations to determine. Have fun, there is no right or wrong. What would you consider a small, medium, and large book?

Relevant topics: cut

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences explaining the values you picked to create your categorical data and why.

  • The results of running table(books$book_size_cat).

Question 3

Dr. Ward is a firm believer in constructive feedback, and would like people to provide feedback for his book.

What recommendation would you make to Dr. Ward when it comes to book size?

Use the column text_reviews_count and compare, on average, how many text reviews the various book sizes get.

Association is not causation, and there are many factors that lead to people providing reviews. Your recommendation can be based on anecdotal evidence, no worries.

Relevant topics: tapply, mean

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences with your recommendation and reasoning.

Question 4

Sometimes (often times) looking at a single summary of our data may not provide the full picture.

Make a side-by-side boxplot for the text_reviews_count by book_size_cat.

Does your answer to question (3) change based on your plot?

Take a look at the first example when you run ?boxplot.

You can make three boxplots if you prefer, but make sure that they all have the same y-axis limit to make the comparisons.

Relevant topics: boxplot

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences with your recommendation and reasoning.

Question 5

Repeat question (4), this time, use the subset function to reduce your data to books with a text_reviews_count less than 200. How does this change your plot? Is it a little easier to read?

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences with your recommendation and reasoning.

Question 6

Read the goodreads_book_authors.csv into a new data.frame called authors.

Use the merge function to combine the books data.frame with the authors data.frame. Call your new data.frame books_authors.

Now, use the subset function to create get a subset of your data for your favorite authors. Include at least 5 authors that appear in the dataset.

Redo question (4) using this new subset of data. Does your recommendation change at all?

Make sure you pay close attention to the resulting books_authors data.frame. The column names will be changed to reflect the merge. Instead of text_reviews_count you may need to use text_reviews_count.x, or text_reviews_count.y, depending on how you merged.

Items to submit
  • R code used to solve this problem.

  • The results of running the R code.

  • 1-2 sentences with your recommendation and reasoning.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.