STAT 19000: Project 7 — Fall 2020

Motivation: Three bread-and-butter functions that are a part of the base R are: subset, merge, and split. subset provides a more natural way to filter and select data from a data.frame. split is a useful function that splits a dataset based on one or more factors. merge brings the principals of combining data that SQL uses, to R.

Context: We’ve been getting comfortable working with data in within the R environment. Now we are going to expand our toolset with three useful functions, all the while gaining experience and practice wrangling data!

Scope: r, subset, merge, split, tapply

Learning objectives
  • Gain proficiency using split, merge, and subset.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Demonstrate how to use tapply to solve data-driven problems.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/goodreads/csv

Questions

Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to lose points.

Question 1

Load up the following two datasets goodreads_books.csv and goodreads_book_authors.csv into the data.frames books and authors, respectively. How many columns and rows are in each of these two datasets?

Items to submit
  • R code used to solve the problem.

  • The result of running the R code.

Question 2

We want to figure out how book size (num_pages) is associated with various metrics. First, let’s create a vector called book_size, that categorizes books into 4 categories based on num_pages: small (up to 250 pages), medium (250-500 pages), large (500-1000 pages), huge (1000+ pages).

This [video and code](#r-lapply-flight-example) might be helpful.

Items to submit
  • R code used to solve the problem.

  • The result of table(book_size).

Question 3

Use tapply to calculate the mean average_rating, text_reviews_count, and publication_year by book_size. Did any of the result surprise you? Why or why not?

Items to submit
  • R code used to solve the problem.

  • The output from running the R code.

Question 4

Notice in (3) how we used tapply 3 times. This would get burdensome if we decided to calculate 4 or 5 or 6 columns instead. Instead of using tapply, we can use split, lapply, and colMeans to perform the same calculations.

Use split to partition the data containing only the following 3 columns: average_rating, text_reviews_count, and publication_year, by book_size. Save the result as books_by_size. What class is the result? lapply is a function that allows you to loop over each item in a list and apply a function. Use lapply and colMeans to perform the same calculation as in (3).

This [video and code](#r-lapply-flight-example) and also this [video and code](#r-lapply-fars-example) might be helpful.

Items to submit
  • R code used to solve the problem.

  • The output from running the code.

Question 5

We are working with a lot more data than we really want right now. We’ve provided you with the following code to filter out non-English books and only keep columns of interest. This will create a data frame called en_books.

en_books <- books[books$language_code %in% c("en-US", "en-CA", "en-GB", "eng", "en", "en-IN") & books$publication_year > 2000, c("author_id", "book_id", "average_rating", "description", "title", "ratings_count", "language_code", "publication_year")]

Now create an equivalent data frame of your own, by using the subset function (instead of indexing). Use res as the name of the data frame that you create. Do the dimensions (using dim) of en_books and res agree? Why or why not? (They should both have 8 columns, but a different number of rows.)

Since the dimensions don’t match, take a look at NA values for the variables used to subset our data.

This [video and code](#r-subset-8451-example) and also this [video and code](#r-subset-election-example) might be helpful.

Items to submit
  • R code used to solve the problem.

  • Do the dimensions match?

  • 1-2 sentences explaining why or why not.

Question 6

We now have a nice and tidy subset of data, called res. It would be really nice to get some information on the authors. We can find that information in authors dataset loaded in question 1! In question 2 of the previous project, we had a similar issue with the states names. There is a much better and easier way to solve these types of problems. Use the merge function to combine res and authors in a way which appends all information from author when there is a match in res. Use the condition by="author_id" in the merge. This is all you need to do:

mymergedDF <- merge(res, authors, by="author_id")

The resulting data frame will have all of the columns that are found in either res or authors. When we perform the merge, we only insist that the author_id should match. We do not expect that the ratings_count or average_rating should agree in res versus authors. Why? In the res data frame, the ratings_count and average_rating refer to the specific book, but in the authors data frame, the ratings_count and average_rating refer to the total works by the author. Therefore, in mymergedDF, there are columns ratings_count.x and average_rating.x from res, and there are columns ratings_count.y and average_rating.y from authors.

Although we provided the necessary code for this example, you might want to know more about the merge function. This [video and code](#r-merge-fars-example) and also this [video and code](#r-merge-flights-example) might be helpful.

Items to submit
  • the given R code used to solve the problem.

  • The dim of the newly merged data.frame.

Question 7

For an author of your choice (that is in the dataset), find the author’s highest rated book. Do you agree?

Items to submit
  • R code used to solve the problem.

  • The title of the highest rated book (from your author).

  • 1-2 sentences explaining why or why not you agree with it being the highest rated book from that author.

OPTIONAL QUESTION

Look at the column names of the new dataframe created in question 6. Notice that there are two values for ratings_count and two values for average_rating. The names that have an appended x are those values from the first argument to merge, and the names that have an appended y, are those values from the second argument to merge. Rename these columns to indicate if they refer to a book, or an author.

For example, ratings_count.x could be ratings_count_book or ratings_count_author.

Items to submit
  • R code used to solve the problem.

  • The names of the new data.frame.