# STAT 19000: Project 7 — Fall 2020

Motivation: Three bread-and-butter functions that are part of base R are `subset`, `merge`, and `split`. `subset` provides a more natural way to filter and select data from a data.frame. `split` is a useful function that splits a dataset based on one or more factors. `merge` brings the principles that SQL uses for combining data to R.

Context: We’ve been getting comfortable working with data within the R environment. Now we are going to expand our toolset with three useful functions, all the while gaining experience and practice wrangling data!

Scope: r, subset, merge, split, tapply

Learning objectives
• Gain proficiency using split, merge, and subset.

• Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• Demonstrate how to use tapply to solve data-driven problems.

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/goodreads/csv`

## Questions

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like `head` to print a sample of the data or output. Extremely large PDFs may lose points.

### Question 1

Load the following two datasets, `goodreads_books.csv` and `goodreads_book_authors.csv`, into the data.frames `books` and `authors`, respectively. How many columns and rows are in each of these two datasets?
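As a minimal sketch of the reading and measuring steps, the example below writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the names `toy_path` and `toy` are illustrative assumptions, not part of the goodreads data.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("book_id,title,num_pages",
             "1,Alpha,120",
             "2,Beta,340",
             "3,Gamma,980"), toy_path)

# read.csv returns a data.frame; the first line is used as the header.
toy <- read.csv(toy_path)

dim(toy)    # number of rows, then number of columns
nrow(toy)   # rows only
ncol(toy)   # columns only
```

The same calls apply to the real files once the full path to each CSV is supplied to `read.csv`.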

Items to submit
• R code used to solve the problem.

• The result of running the R code.

### Question 2

We want to figure out how book size (`num_pages`) is associated with various metrics. First, let’s create a vector called `book_size` that categorizes books into 4 categories based on `num_pages`: `small` (up to 250 pages), `medium` (250-500 pages), `large` (500-1000 pages), `huge` (1000+ pages).
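One common way to bin a numeric vector is `cut`. The sketch below uses made-up page counts and deliberately different break points and labels from the ones in this question, so treat `pages`, `size`, and the bins as illustrative assumptions.

```r
# Hypothetical page counts, including one missing value.
pages <- c(40, 100, 101, 250, 720, NA)

# Three illustrative bins: (0,100], (100,300], (300,Inf].
# With the default right = TRUE, each interval includes its upper bound,
# so a value of exactly 100 lands in "short".
size <- cut(pages,
            breaks = c(0, 100, 300, Inf),
            labels = c("short", "mid", "long"))

# Count the values in each bin, showing NA as its own category if present.
table(size, useNA = "ifany")
```

Because boundary values such as 250 appear in two of the categories listed above, it is worth deciding deliberately whether to use `right = TRUE` or `right = FALSE`.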

This [video and code](#r-lapply-flight-example) might be helpful.

Items to submit
• R code used to solve the problem.

• The result of `table(book_size)`.

### Question 3

Use `tapply` to calculate the mean `average_rating`, `text_reviews_count`, and `publication_year` by `book_size`. Did any of the results surprise you? Why or why not?
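For reference, here is a small sketch of how `tapply` computes a grouped mean. The data frame `toy` and its columns are made up for illustration.

```r
toy <- data.frame(
  rating = c(4.0, 3.0, 5.0, 2.0, NA),
  group  = c("a", "a", "b", "b", "b")
)

# tapply(values, grouping, function) returns one result per group.
# Group "b" contains an NA, so its mean is NA here.
means_all <- tapply(toy$rating, toy$group, mean)
means_all

# Extra arguments are passed on to the function; na.rm = TRUE drops
# missing ratings before averaging.
means_clean <- tapply(toy$rating, toy$group, mean, na.rm = TRUE)
means_clean
```

If a group mean comes back as `NA`, missing values in that column are a likely cause.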

Items to submit
• R code used to solve the problem.

• The output from running the R code.

### Question 4

Notice how we used `tapply` 3 times in (3). This would get burdensome if we decided to summarize 4, 5, or 6 columns instead. Instead of using `tapply`, we can use `split`, `lapply`, and `colMeans` to perform the same calculations.

Use `split` to partition the data containing only the following 3 columns: `average_rating`, `text_reviews_count`, and `publication_year`, by `book_size`. Save the result as `books_by_size`. What class is the result? `lapply` is a function that allows you to loop over each item in a list and apply a function. Use `lapply` and `colMeans` to perform the same calculation as in (3).
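The pattern described above can be sketched on a toy data frame. The names `toy`, `parts`, and `group_means` are assumptions made for this illustration only.

```r
toy <- data.frame(
  rating  = c(4, 3, 5, 2),
  reviews = c(10, 30, 5, 15),
  group   = c("a", "a", "b", "b")
)

# split produces one piece per value of the grouping variable;
# here only the two numeric columns are kept in each piece.
parts <- split(toy[, c("rating", "reviews")], toy$group)
names(parts)

# lapply applies colMeans to each piece and collects the results,
# giving one named vector of column means per group.
group_means <- lapply(parts, colMeans)
group_means$a
```

Additional arguments can be forwarded through `lapply`, for example `lapply(parts, colMeans, na.rm = TRUE)` when the columns contain missing values.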

This [video and code](#r-lapply-flight-example) and also this [video and code](#r-lapply-fars-example) might be helpful.

Items to submit
• R code used to solve the problem.

• The output from running the code.

### Question 5

We are working with a lot more data than we really want right now. We’ve provided you with the following code to filter out non-English books and only keep columns of interest. This will create a data frame called `en_books`.

``en_books <- books[books$language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & books$publication_year > 2000, c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year")]``

Now create an equivalent data frame of your own, by using the `subset` function (instead of indexing). Use `res` as the name of the data frame that you create. Do the dimensions (using `dim`) of `en_books` and `res` agree? Why or why not? (They should both have 8 columns, but a different number of rows.)
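To see how the two approaches can differ, the sketch below applies the same condition to a toy data frame using both logical indexing and `subset`. The data and the names `by_index` and `by_subset` are made up for illustration.

```r
toy <- data.frame(
  id   = 1:4,
  year = c(1999, 2005, NA, 2010),
  lang = c("en", "en", "en", "fr")
)

# Logical indexing: where the condition evaluates to NA (row 3),
# the result contains a row filled with NA.
by_index <- toy[toy$lang == "en" & toy$year > 2000, c("id", "year")]

# subset: rows where the condition is NA are treated as FALSE and dropped.
# Column names can be used directly, without the toy$ prefix.
by_subset <- subset(toy, lang == "en" & year > 2000, select = c(id, year))

dim(by_index)
dim(by_subset)
```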

Since the dimensions don’t match, take a look at NA values for the variables used to subset our data.

This [video and code](#r-subset-8451-example) and also this [video and code](#r-subset-election-example) might be helpful.

Items to submit
• R code used to solve the problem.

• Do the dimensions match?

• 1-2 sentences explaining why or why not.

### Question 6

We now have a nice and tidy subset of data, called `res`. It would be really nice to get some information on the authors. We can find that information in the `authors` dataset loaded in question 1! In question 2 of the previous project, we had a similar issue with the state names. There is a much better and easier way to solve these types of problems. Use the `merge` function to combine `res` and `authors` so that all information from `authors` is appended wherever there is a match in `res`. Use the condition `by="author_id"` in the merge. This is all you need to do:

``mymergedDF <- merge(res, authors, by="author_id")``

The resulting data frame will have all of the columns that are found in either `res` or `authors`. When we perform the merge, we only insist that the `author_id` should match. We do not expect that the `ratings_count` or `average_rating` should agree in `res` versus `authors`. Why? In the `res` data frame, the `ratings_count` and `average_rating` refer to the specific book, but in the `authors` data frame, the `ratings_count` and `average_rating` refer to the total works by the author. Therefore, in `mymergedDF`, there are columns `ratings_count.x` and `average_rating.x` from `res`, and there are columns `ratings_count.y` and `average_rating.y` from `authors`.
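The matching and suffix behavior can be seen on two small made-up data frames. The names `books_toy`, `authors_toy`, and `merged_toy`, and all of their values, are assumptions for illustration.

```r
books_toy <- data.frame(
  author_id     = c(1, 1, 2, 9),
  title         = c("A", "B", "C", "D"),
  ratings_count = c(10, 20, 30, 40)
)
authors_toy <- data.frame(
  author_id     = c(1, 2, 3),
  name          = c("Ann", "Bo", "Cy"),
  ratings_count = c(500, 600, 700)
)

# By default merge keeps only author_id values present in both data
# frames, so the book by author 9 and the author with id 3 are dropped.
merged_toy <- merge(books_toy, authors_toy, by = "author_id")

# Columns that share a name but are not used for matching get the
# suffixes .x (first argument) and .y (second argument).
names(merged_toy)
dim(merged_toy)
```

To keep unmatched rows from the first data frame as well, `merge` accepts `all.x = TRUE`; the columns coming from the second data frame are then filled with `NA` for those rows.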
Although we provided the necessary code for this example, you might want to know more about the merge function. This [video and code](#r-merge-fars-example) and also this [video and code](#r-merge-flights-example) might be helpful.

Items to submit
• The given R code used to solve the problem.

• The `dim` of the newly merged data.frame.

### Question 7

For an author of your choice (that is in the dataset), find the author’s highest rated book. Do you agree that it is that author’s best book?
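One possible approach, sketched on made-up data, is to filter to a single author and then pick the row with the largest rating. The data frame `toy` and its column names (`name`, `title`, `rating`) are hypothetical and will differ from the columns in the merged data.

```r
toy <- data.frame(
  name   = c("Ann", "Ann", "Bo", "Ann"),
  title  = c("A", "B", "C", "D"),
  rating = c(3.9, 4.6, 4.8, 4.1)
)

# Keep one author's rows, then pick the row with the largest rating.
# which.max returns the position of the first maximum if there are ties.
ann  <- toy[toy$name == "Ann", ]
best <- ann[which.max(ann$rating), ]
best$title

# Alternative: sort by rating, highest first, to see the full ranking.
ann[order(ann$rating, decreasing = TRUE), c("title", "rating")]
```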

Items to submit
• R code used to solve the problem.

• The title of the highest rated book (from your author).

• 1-2 sentences explaining why or why not you agree with it being the highest rated book from that author.

### OPTIONAL QUESTION

Look at the column names of the new data frame created in question 6. Notice that there are two columns for `ratings_count` and two columns for `average_rating`. The names with an appended `.x` hold the values from the first argument to `merge`, and the names with an appended `.y` hold the values from the second argument to `merge`. Rename these columns to indicate whether they refer to a book or an author.
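Renaming by name rather than by position can be sketched as follows. The one-row data frame `toy` is made up, and it assumes (as in question 6) that the first argument to `merge` held the book data.

```r
toy <- data.frame(author_id = 1, ratings_count.x = 10, ratings_count.y = 500)

# Replace only the names that match, leaving the other columns untouched.
names(toy)[names(toy) == "ratings_count.x"] <- "ratings_count_book"
names(toy)[names(toy) == "ratings_count.y"] <- "ratings_count_author"

names(toy)
```

Matching on the old name avoids depending on column order, which can change if the merge is rerun with different inputs.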

For example, `ratings_count.x` could become `ratings_count_book`, and `ratings_count.y` could become `ratings_count_author`.

Items to submit
• R code used to solve the problem.

• The `names` of the new data.frame.