STAT 39000: Project 10 — Spring 2021

Motivation: The use of a suite of packages referred to as the tidyverse is popular with many R users. It is apparent just by looking at tidyverse R code, that it varies greatly in style from typical R code. It is useful to gain some familiarity with this collection of packages, in case you run into a situation where these packages are needed — you may even find that you enjoy using them!

Context: We’ve covered a lot of ground so far this semester, and almost completely using Python. In this next series of projects we are going to switch back to R with a strong focus on the tidyverse (including ggplot) and data wrangling tasks.

Scope: R, tidyverse, ggplot

Learning objectives
  • Explain the differences between regular data frames and tibbles.

  • Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems.

  • Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.

  • Group data and calculate aggregated statistics using group_by, mutate, and transform functions.

  • Demonstrate the ability to create basic graphs with default settings, in ggplot.

  • Demonstrate the ability to modify axes labels and titles.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

The tidyverse consists of a variety of packages, including, but not limited to: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and lubridate.

One of the underlying premises of the tidyverse is getting the data to be tidy. You can read a lot more about this in Hadley Wickham’s excellent book, R for Data Science.

There is an excellent graphic here that illustrates a general workflow for data science projects:

  1. Import

  2. Tidy

  3. Iterate on, to gain understanding:

    1. Transform

    2. Visualize

    3. Model

  4. Communicate

This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/okcupid/filtered/*.csv

Questions

Question 1

Let’s (more or less) follow the guidelines given above. The first step is to import the data. There are two files: questions.csv, and users.csv. Read this section, and use what you learn to read in the two files into questions and users, respectively. Which functions from the tidyverse did you use and why?

Its easy to load up the tidyverse packages:

library(tidyverse)

Just because a file has the .csv extension does not mean that is it comma separated.

Make sure to print all tibble after reading them in to ensure that they were read in correctly. If they were not, use a different function (from tidyverse) to read in the data.

questions should be 2281 x 10 and users should be 68371 x 2284

Items to submit
  • R code used to solve the problem.

  • head of each dataset, users and questions.

  • 1 sentence explaining which functions you used (from tidyverse) and why.

Question 2

You may recall that the function read.csv from base R reads data into a data.frame by default. In the tidyverse, readr functions read the data into a tibble instead. Read this section. To summarize, some important features that are true for tibbles but not necessarily for data.frames are:

  • Non-syntactic variable names (surrounded by backticks `` ` `` )

  • Never changes the type of the inputs (for example converting strings to factors)

  • More informative output from printing

  • No partial matching

  • Simple subsetting

Great, the next step in our outline is to make the data "tidy". Read this section. Okay, let’s say, for instance, that we wanted to create a tibble with the following columns: user, question, question_text, selected_option, race, gender2, gender_orientation, n, and keywords. As you can imagine, the "tidy" format, while great for analysis, would not be great for storage as there would be a row for each question for each user, at least. Columns like gender2 and race don’t change for a user, so we end up with a lot of repeated values.

Okay, we don’t need to analyze all 68000 users at once, let’s instead, take a random sample of 2200 users, and create a "tidy" tibble as described above. After all, we want to see why this format is useful! While trying to figure out how to do this may seem daunting at first, it is actually not so bad:

First, we convert the users tibble to long form, so each row represents 1 answer to 1 questions from 1 user:

# Add an "id" columns to the users data
users$id <- 1:nrow(users)
# To ensure we get the same random sample, run the set.seed line
# before every time you run the following line
set.seed(12345)
columns_to_pivot <- 1:2278
users_sample_long <- users[sample(nrow(users), 2200),] %>%
    mutate_at(columns_to_pivot, as.character) %>%  # This converts all of our columns in columns_to_pivot to strings
    pivot_longer(cols = columns_to_pivot, names_to="question", values_to = "selected_option") # The old qXXXX columns are now values in the "question" column.

Next, we want to merge our data from the questions tibble with our users_sample_long tibble, into a new table we will call myDF. How many rows and columns are in myDF?

myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X")
Items to submit
  • R code used to solve the problem.

  • The number of rows and columns in myDF.

  • The head of myDF.

Question 3

Excellent! Now, we have a nice tidy dataset that we can work with. You may have noticed some odd syntax %>% in the code provided in the previous question. %>% is the piping operator in R added by the magittr package. It works pretty much just like | does in bash. It "feeds" the output from the previous bit of code to the next bit of code. It is extremely common practice to use this operator in the tidyverse.

Observe the head of myDF. Notice how our question column has the value d_age, text has the content "Age", and selected_option (the column that shows the "answer" the user gave), has the actual age of the user. Wouldn’t it be better if our myDF had a new column called age instead of age being an answer to a question?

Modify the code provided in question 2 so age ends up being a column in myDF with the value being the actual age of the user.

Pay close attention to pivot_longer. You will need to understand what this function is doing to fix this.

You can make a single modification to 1 line to accomplish this. Pay close attention to the cols option in pivot_longer. If you include a column in cols what happens? If you exclude a columns from cols what happens? Experiment on the following tibble, using different values for cols, as well as names_to, and values_to:

myDF <- tibble(
    x=1:3,
    y=1,
    question1=c("How", "What", "Why"),
    question2=c("Really", "You sure", "When"),
    question3=c("Who", "Seriously", "Right now")
)
Items to submit
  • R code used to solve the problem.

  • The number of rows and columns in myDF.

  • The head of myDF.

Question 4

Wow! That is pretty powerful! Okay, it is clear that there are question questions, where the column starts with "q", and other questions, where the column starts with something else. Modify question (3) so all of the questions that don’t start with "q" have their own column in myDF. Like before, show the number of rows and columns for the new myDF, as well as print the head.

Items to submit
  • R code used to solve the problem.

  • The number of rows and columns in myDF.

  • The head of myDF.

Question 5

It seems like we’ve spent the majority of the project just wrangling our dataset — that is normal! You’d be incredibly lucky to work in an environment where you recieve data in a nice, neat, perfect format. Let’s do a couple basic operations now, to practice.

mutate is a powerful function in dplyr, that is not easy to mimic in Python’s pandas package. mutate adds new columns to your tibble, while preserving your existing columns. It doesn’t sound very powerful, but it is.

Use mutate to create a new column called generation. generation should contain "Gen Z" for ages [0, 24], "Millenial" for ages [25-40], "Gen X" for ages [41-56], and "Boomers II" for ages [57-66], and "Older" for all other ages.

Items to submit
  • R code used to solve the problem.

  • The number of rows and columns in myDF.

  • The head of myDF.

Question 6

Use ggplot to create a scatterplot showing d_age on the x-axis, and lf_min_age on the y-axis. lf_min_age is the minimum age a user is okay dating. Color the points based on gender2. Add a proper title, and labels for the X and Y axes. Use alpha=.6.

This may take quite a few minutes to create. Before creating a plot with the entire myDF, use myDF[1:10,]. If you are in a time crunch, the minimum number of points to plot to get full credit is 100, but if you wait, the plot is a bit more telling.

Items to submit
  • R code used to solve the problem.

  • Output from running your code.

  • The plot produced.