STAT 29000: Project 10 — Spring 2022

Motivation: The use of a suite of packages referred to as the tidyverse is popular with many R users. It is apparent just by looking at tidyverse R code, that it varies greatly in style from typical R code. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed — you may even find that you enjoy using them!

Context: We’ve covered a lot of ground so far this semester, and almost completely using Python. In this next series of projects we are going to switch back to R with a strong focus on the tidyverse (including ggplot) and data wrangling tasks.

Scope: R, tidyverse, ggplot

Learning Objectives

Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems.
Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.
Group data and calculate aggregated statistics using group_by, mutate, summarize, and transform functions.
Demonstrate the ability to create basic graphs with default settings, in ggplot.
Demonstrate the ability to modify axes labels and titles.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

The "tidyverse" consists of a variety of packages, including, but not limited to: ggplot2, dplyr, tidyr, readr, magrittr, purrr, tibble, stringr, and lubridate.

One of the underlying premises of the tidyverse is getting the data to be tidy. You can read a lot more about this in Hadley Wickham’s book, R for Data Science.

There is an excellent graphic here that illustrates a general workflow for data science projects:

Import
Tidy
Iterate on, to gain understanding:
1. Transform
2. Visualize
3. Model
Communicate

This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change.

Dataset(s)

The following questions will use the following dataset(s):

/depot/datamine/data/beer/beers.csv
/depot/datamine/data/beer/reviews_sample.csv

Questions

Question 1

The first step in our workflow is to read the data.

Read the datasets beers.csv and reviews_sample.csv using the read_csv function from tidyverse into tibbles called beers and reviews, respectively.

"Tibble" are essentially the tidyverse equivalent to data.frames. They function slightly differently, but are so similar (today) that we won’t go into detail until we need to.

In projects 10 and 11, we want to analyze and compare different beers. Note, that in reviews each row corresponds to a review by a certain user on a certain date. As reviews likely vary by individuals, we may want to summarize our reviews tibble.

To do that, let’s start by deciding how we are going to summarize the reviews. Start by picking one of the variables (columns) from the reviews dataset to be our "beer goodness indicator". For example, maybe you believe that the taste is important in beverages (seems reasonable).

Now, determine a summary statistic that we will use to compare beers based on your beer goodnees indicator variable. Examples include mean, median, std, max, min, etc. Write 1-2 sentences describing why you chose the statistic you chose for your variable(s). You can use annectodal evidence (some reasoning why you think that summary statistics would be appropriate/useful here), or look at the distribution based on plots, or summary statistics to pick your preferred summary statistics for this case.

If you are making a plot, please be sure to use the ggplot package.

If you wanted to have some fun, you could decide to combine different variables into a single one. For instance, maybe you want to take into consideration both taste and smell, but you want a smaller weight for smell. Then, you create a plot of taste + .5*smell, and you notice the data is skewed, so you decide to go with the median, namely, with median(taste+.5*smell).

Items to submit

Code used to solve this problem.
Output from running the code.
1-2 sentences describing what is your beer_goodness_indicator (variable and summary statistics), and why.

Question 2

Now that we have decided how to compare beers, let’s create a new variable called beer_goodness_indicator in the reviews dataset. For each beer_id, summarize the reviews data to get a single beer_goodness_indicator based on your answer from question 1. Call this summarized dataset reviews_summary.

reviews_summary should be 41822x2 (rows x columns).

summarize is good when you want to keep your data grouped — it will result in a data.frame with a different number of rows and columns. mutate is very similar except it will maintain the original columns, add a new column where the grouped/summarized values are repeated based on the variable the data was grouped by. This may be confusing, but run the following two examples and this will be made clear.

mtcars %>%
    group_by(cyl) %>%
    summarize(mpg_mean = mean(mpg))

mtcars %>%
    group_by(cyl) %>%
    mutate(mpg_mean = mean(mpg))

You may be wondering what the heck the %>% part of the code from the previous tip is. These are pipes from the magrittr package. This is used to together functions. For example, group_by and summarize are two functions that can be chained together. You are passing the output from the previous function as the input to the next function. You’ll find this is a very clean and convenient way to express a lot of very common data wrangling tasks!

It could be as simple as getting the head of a dataframe.

head(mtcars)

You could instead use pipes:

mtcars %>%
    head()

Why? This second version is arguably easier to read, and it is easier to edit. You could easily want to add a column to the dataframe first.

mtcars %>%
    mutate(my_new_column = mean(cyl)) %>%
    head()

Now, if we had the non-piped version it would be something like:

mtcars <- mtcars %>%
    mutate(my_new_column = mean(cyl))

head(mtcars)

Or an even better example would be:

mtcars %>%
    round() %>%
    head()

Versus:

head(round(mtcars))

mutate in particular is extremely useful. Try to perform the same operation using pandas and you will quickly realize how nice some of the tidyverse functionality is.

Items to submit

Code used to solve this problem.
Output from running the code.
Head of reviews_summary dataset.

Question 3

Let’s combine our beers dataset with reviews_summary into a new dataset called beers_reviews that contains only beers that appears in both datasets. Use the appropriate join function from tidyverse (inner_join, left_join, right_join, or full_join) to solve this problem. Since you saw some examples using pipes in the previous question (%>%) — use pipes from here on out.

What are the dimensions of the resulting beers_reviews dataset? How many beers did not appear in both datasets?

Items to submit

Code used to solve this problem.
Output from running the code.
Result of running dim(beers_reviews)

Question 4

Ok, now we have the dataset ready to analyze! For beers that are available during the entire year (see availability), is there a difference between retired and not retired beers in terms of beer_goodness_indicator?

Start by subsetting the dataset using filter.
Create some data-driven method to answer this question. You can make a plot, get summary statistics (average beer_goodness_indicator, table comparing # of beers with beer_goodness_indicator > 4 for each category, etc). You can use multiple methods to answer this question! Have fun!

Items to submit

Code used to solve this problem.
Output from running the code.
1-2 sentences answering the comparing retired and not retired beers in terms of beer_goodness_indicator based on your chosen method(s). Did the results surprise you?
1-2 sentences explaining what data-driven method(s) you decided to use and why.

Question 5

Let’s compare different styles of beer based on our beer_goodness_indicator average. Create a Cleveland dotplot (using ggplot) comparing the average beer_goodness_indicator for each style in beers_reviews. Make sure to use the tidyverse functions to answer this question and to use ggplot.

The code below creates a Cleveland dotplot comparing Sepal.Length variation per Species using the iris dataset.

iris %>%
  group_by(Species) %>%
  summarize(petal_length_var = sd(Petal.Length)) %>%
  arrange(desc(petal_length_var)) %>%
ggplot() +
  geom_point(aes(x = Species, y = petal_length_var)) +
  coord_flip() +
  theme_classic() +
  labs(x = "Petal length variation")

You can use the function top_n(x) in combination with arrange to subset to show only the top x styles.

Items to submit

Code used to solve this problem.
Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.