# STAT 39000: Project 10 — Spring 2021

Motivation: The suite of packages known as the `tidyverse` is popular with many R users. A glance at `tidyverse` code shows that its style differs greatly from typical R code. It is useful to gain some familiarity with this collection of packages in case you run into a situation where they are needed. You may even find that you enjoy using them!

Context: We’ve covered a lot of ground so far this semester, almost entirely in Python. In this next series of projects we are going to switch back to R, with a strong focus on the `tidyverse` (including `ggplot`) and data wrangling tasks.

Scope: R, tidyverse, ggplot

Learning objectives
• Explain the differences between regular data frames and tibbles.

• Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems.

• Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.

• Group data and calculate aggregated statistics using group_by, mutate, and transform functions.

• Demonstrate the ability to create basic graphs with default settings, in `ggplot`.

• Demonstrate the ability to modify axes labels and titles.

Make sure to read about and use the template found here, and read the important information about project submissions here.

The `tidyverse` consists of a variety of packages, including, but not limited to: `ggplot2`, `dplyr`, `tidyr`, `readr`, `purrr`, `tibble`, `stringr`, and `lubridate`.

One of the underlying premises of the `tidyverse` is getting the data to be tidy. You can read a lot more about this in Hadley Wickham’s excellent book, R for Data Science.

There is an excellent graphic here that illustrates a general workflow for data science projects:

1. Import

2. Tidy

3. Iterate on, to gain understanding:

   1. Transform

   2. Visualize

   3. Model

4. Communicate

This is a good general outline of how a project could be organized, but depending on the project or company, this could vary greatly and change as the goals of a project change.

## Dataset

The following questions will use the dataset found in Scholar:

`/class/datamine/data/okcupid/filtered/*.csv`

## Questions

### Question 1

Let’s (more or less) follow the guidelines given above. The first step is to import the data. There are two files: `questions.csv` and `users.csv`. Read this section, and use what you learn to read the two files into `questions` and `users`, respectively. Which functions from the `tidyverse` did you use and why?

 It’s easy to load up the `tidyverse` packages: ``library(tidyverse)``
 Just because a file has the `.csv` extension does not mean that it is comma separated.
 Make sure to print each `tibble` after reading it in to ensure that it was read in correctly. If it was not, use a different function (from the `tidyverse`) to read in the data.
 `questions` should be 2281 x 10 and `users` should be 68371 x 2284.
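
The delimiter hint above can be tried out on a throwaway file before touching the real data. The following is a minimal sketch, assuming a hypothetical semicolon-separated toy file written to a temporary path; it is not the project data, and the real files may use a different delimiter.

```r
library(readr)

# Hypothetical toy file: semicolon-separated despite the .csv extension.
path <- tempfile(fileext = ".csv")
writeLines(c("id;answer", "1;yes", "2;no"), path)

# read_csv() assumes commas, so each line ends up in a single column.
wrong <- read_csv(path, col_types = cols())
dim(wrong)

# read_delim() lets you state the delimiter explicitly.
right <- read_delim(path, delim = ";", col_types = cols())
dim(right)
```

Comparing `dim()` against the expected dimensions is a quick way to spot a wrong delimiter.
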
Items to submit
• R code used to solve the problem.

• `head` of each dataset, `users` and `questions`.

• 1 sentence explaining which functions you used (from `tidyverse`) and why.

### Question 2

You may recall that the function `read.csv` from base R reads data into a data.frame by default. In the `tidyverse`, `readr` functions read the data into a `tibble` instead. Read this section. To summarize, some important features that are true for `tibbles` but not necessarily for data.frames are:

• Non-syntactic variable names (surrounded by backticks `` ` `` )

• Never changes the type of the inputs (for example converting strings to factors)

• No partial matching

• Simple subsetting
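
Some of the differences listed above are easy to see on a small example. This is a sketch on made-up data; the names `df`, `tbl`, and `my col` are purely illustrative.

```r
library(tibble)

df  <- data.frame(value = 1:3, label = c("a", "b", "c"))
tbl <- tibble(value = 1:3, label = c("a", "b", "c"), `my col` = 4:6)

# Partial matching: a data.frame quietly resolves df$val to df$value ...
df$val
# ... while a tibble returns NULL (with a warning) for an unknown column.
suppressWarnings(tbl$val)

# Simple subsetting: `[` on a tibble always returns a tibble, while a
# data.frame drops to a plain vector when a single column is selected.
class(df[, "value"])
class(tbl[, "value"])

# Non-syntactic names are kept as-is and referenced with backticks.
tbl$`my col`
```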

Great, the next step in our outline is to make the data "tidy". Read this section. Okay, let’s say, for instance, that we wanted to create a `tibble` with the following columns: `user`, `question`, `question_text`, `selected_option`, `race`, `gender2`, `gender_orientation`, `n`, and `keywords`. As you can imagine, the "tidy" format, while great for analysis, would not be great for storage as there would be a row for each question for each user, at least. Columns like `gender2` and `race` don’t change for a user, so we end up with a lot of repeated values.

Okay, we don’t need to analyze all 68000 users at once. Let’s instead take a random sample of 2200 users and create a "tidy" `tibble` as described above. After all, we want to see why this format is useful! While trying to figure out how to do this may seem daunting at first, it is actually not so bad:

First, we convert the `users` tibble to long form, so each row represents 1 answer to 1 question from 1 user:

```r
# Add an "id" column to the users data
users$id <- 1:nrow(users)

# To ensure we get the same random sample, run the set.seed line
# every time before you run the sampling line below
set.seed(12345)
columns_to_pivot <- 1:2278
users_sample_long <- users[sample(nrow(users), 2200), ] %>%
  # Convert every column in columns_to_pivot to character
  mutate_at(columns_to_pivot, as.character) %>%
  # The old qXXXX columns become values in the "question" column
  pivot_longer(
    cols = columns_to_pivot,
    names_to = "question",
    values_to = "selected_option"
  )
```

Next, we want to merge our data from the `questions` tibble with our `users_sample_long` tibble, into a new table we will call `myDF`. How many rows and columns are in `myDF`?

``myDF <- merge(users_sample_long, questions, by.x = "question", by.y = "X")``
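
To see what `by.x` and `by.y` do in the `merge` call, here is a small sketch on two hypothetical toy tables (`answers` and `lookup` stand in for `users_sample_long` and `questions`; their contents are made up).

```r
# Hypothetical toy tables standing in for users_sample_long and questions.
answers <- data.frame(
  id = c(1, 1, 2),
  question = c("q1", "q2", "q1"),
  selected_option = c("Yes", "No", "No")
)
lookup <- data.frame(
  X = c("q1", "q2", "q3"),
  text = c("Like cats?", "Like dogs?", "Like fish?")
)

# by.x / by.y name the key column in the left and right table respectively.
# merge() performs an inner join by default, so keys without a match on
# both sides ("q3" here) are dropped from the result.
merged <- merge(answers, lookup, by.x = "question", by.y = "X")
dim(merged)
```

The key column appears once in the result, under the name it had in the left table.
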
Items to submit
• R code used to solve the problem.

• The number of rows and columns in `myDF`.

• The `head` of `myDF`.

### Question 3

Excellent! Now we have a nice tidy dataset that we can work with. You may have noticed some odd syntax, `%>%`, in the code provided in the previous question. `%>%` is the pipe operator in R, provided by the `magrittr` package. It works much like `|` does in bash: it "feeds" the output of the previous bit of code into the next bit of code as its first argument. Using this operator is extremely common practice in the `tidyverse`.
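
As a small illustration of the pipe, the sketch below computes the same value twice, once with nested calls and once with `%>%`. The vector `values` is made up for the example.

```r
library(magrittr)

values <- c(4, 9, 16)

# Nested calls read inside-out ...
nested <- round(mean(sqrt(values)), 1)

# ... while the pipe reads left to right: x %>% f(y) is f(x, y).
piped <- values %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```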

Observe the `head` of `myDF`. Notice how our `question` column has the value `d_age`, `text` has the content "Age", and `selected_option` (the column that shows the "answer" the user gave), has the actual age of the user. Wouldn’t it be better if our `myDF` had a new column called `age` instead of `age` being an answer to a question?

Modify the code provided in question 2 so `age` ends up being a column in `myDF` with the value being the actual age of the user.

 Pay close attention to `pivot_longer`. You will need to understand what this function is doing to fix this.
 You can make a single modification to 1 line to accomplish this. Pay close attention to the `cols` option in `pivot_longer`. If you include a column in `cols`, what happens? If you exclude a column from `cols`, what happens? Experiment on the following `tibble`, using different values for `cols`, as well as `names_to` and `values_to`:

```r
myDF <- tibble(
  x = 1:3,
  y = 1,
  question1 = c("How", "What", "Why"),
  question2 = c("Really", "You sure", "When"),
  question3 = c("Who", "Seriously", "Right now")
)
```
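
One possible starting point for the experiment is sketched below. It rebuilds the toy `tibble` from the hint under the hypothetical name `toy` (so it does not overwrite your real `myDF`) and tries a single choice of `cols`; other choices are worth trying too.

```r
library(tibble)
library(tidyr)

toy <- tibble(
  x = 1:3,
  y = 1,
  question1 = c("How", "What", "Why"),
  question2 = c("Really", "You sure", "When"),
  question3 = c("Who", "Seriously", "Right now")
)

# Columns listed in `cols` are stacked into name/value pairs;
# columns left out (x and y here) are repeated on every new row.
long <- pivot_longer(
  toy,
  cols = c(question1, question2, question3),
  names_to = "question",
  values_to = "selected_option"
)
dim(long)
names(long)
```
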
Items to submit
• R code used to solve the problem.

• The number of rows and columns in `myDF`.

• The `head` of `myDF`.

### Question 4

Wow! That is pretty powerful! Okay, it is clear that there are "question" questions, where the column name starts with "q", and other questions, where the column name starts with something else. Modify your code from question 3 so that all of the questions that don’t start with "q" have their own column in `myDF`. As before, show the number of rows and columns of the new `myDF`, and print the `head`.
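
Picking columns by name rather than by position can make this kind of change less fragile. The sketch below uses the tidyselect helper `starts_with` on a hypothetical miniature table called `mini`; the column names only imitate those in `users`. The same helpers are also accepted by the `cols` argument of `pivot_longer`.

```r
library(tibble)
library(dplyr)

# Hypothetical miniature of `users`: two "q" columns and three others.
mini <- tibble(
  id = 1:2,
  q1 = c("Yes", "No"),
  q2 = c("No", "No"),
  d_age = c(31, 45),
  gender2 = c("Woman", "Man")
)

# Columns whose names start with "q" ...
q_cols <- select(mini, starts_with("q"))
names(q_cols)

# ... and, by negating the helper, everything else.
other_cols <- select(mini, !starts_with("q"))
names(other_cols)
```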

Items to submit
• R code used to solve the problem.

• The number of rows and columns in `myDF`.

• The `head` of `myDF`.

### Question 5

It seems like we’ve spent the majority of the project just wrangling our dataset, and that is normal! You’d be incredibly lucky to work in an environment where you receive data in a nice, neat, perfect format. Let’s do a couple of basic operations now, to practice.

`mutate` is a powerful function in `dplyr` that is not easy to mimic in Python’s `pandas` package. `mutate` adds new columns to your tibble while preserving your existing columns. It doesn’t sound very powerful, but it is.

Use `mutate` to create a new column called `generation`. `generation` should contain "Gen Z" for ages [0, 24], "Millennial" for ages [25, 40], "Gen X" for ages [41, 56], "Boomers II" for ages [57, 66], and "Older" for all other ages.
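
One common way to build a banded column with `mutate` is `dplyr::case_when`, which evaluates its conditions in order and uses the first one that is true. The sketch below shows the pattern on unrelated, made-up data (a hypothetical `scores` table with arbitrary cutoffs), so the bands and labels are not the ones asked for above.

```r
library(tibble)
library(dplyr)

# Hypothetical data, unrelated to the project dataset.
scores <- tibble(
  student = c("a", "b", "c", "d"),
  score = c(95, 82, 67, NA)
)

graded <- scores %>%
  mutate(
    band = case_when(
      is.na(score) ~ "Missing",  # handle missing values first
      score >= 90 ~ "High",
      score >= 70 ~ "Middle",
      TRUE ~ "Low"               # fallback for everything else
    )
  )
graded$band
```

Note that values read in as character need converting (for example with `as.numeric`) before numeric comparisons behave as expected.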

Items to submit
• R code used to solve the problem.

• The number of rows and columns in `myDF`.

• The `head` of `myDF`.

### Question 6

Use `ggplot` to create a scatterplot showing `d_age` on the x-axis, and `lf_min_age` on the y-axis. `lf_min_age` is the minimum age a user is okay dating. Color the points based on `gender2`. Add a proper title, and labels for the X and Y axes. Use `alpha=.6`.

 This may take quite a few minutes to create. Before creating a plot with the entire `myDF`, test your code on a small subset such as `myDF[1:10,]`. If you are in a time crunch, the minimum number of points to plot to get full credit is 100, but if you wait, the plot is a bit more telling.
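
The general shape of such a plot can be sketched on the `mpg` dataset that ships with `ggplot2`. The dataset, column names, and label text below are stand-ins for illustration only and would need to be swapped for the `myDF` columns named in the question.

```r
library(ggplot2)

# `mpg` ships with ggplot2; it stands in for myDF here.
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.6) +  # semi-transparent points reveal overlap
  labs(
    title = "Highway mileage by engine displacement",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    color = "Drive train"
  )
p
```
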
Items to submit
• R code used to solve the problem.

• Output from running your code.

• The plot produced.