STAT 39000: Project 13 — Spring 2021

Motivation: Data wrangling tasks can vary between projects. Examples include joining multiple data sources, removing data that is irrelevant to the project, handling outliers, etc. Although we’ve practiced some of these skills, it is always worth it to spend some extra time to master tidying up our data.

Context: We will continue to gain familiarity with the tidyverse suite of packages (including ggplot), and data wrangling tasks.

Scope: r, tidyverse

Learning objectives
  • Explain the differences between regular data frames and tibbles.

  • Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems.

  • Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.

  • Group data and calculate aggregated statistics using group_by, mutate, and transmute functions.

  • Demonstrate the ability to create basic graphs with default settings, in ggplot.

  • Demonstrate the ability to modify axes labels and titles.

Make sure to read about, and use the template found here, and the important information about projects submissions here.


The following questions will use the dataset found in Scholar:



Question 1

Read the dataset into a tibble named complaintsDF. This dataset contains consumer complaints for over 5,000 companies. Our goal is to create a tibble called companyDF containing the following summary information for each company:

  • Company: The company name (Company)

  • State: The state (State)

  • percent_timely_response: Percentage of timely complaints (Timely response?)

  • percent_consumer_disputed: Percentage of complaints that were disputed by the consumer (Consumer disputed?)

  • percent_submitted_online: Percentage of complaints that were submitted online (use column Submitted via, and consider a submission to be an online submission if it was submitted via Web or Email)

  • total_n_complaints: Total number of complaints

There are various ways to create companyDF. Let’s practice using the pipes (%>%) to get companyDF. The idea is that our code at the end of question 2 will look something like this:

companyDF <- complaintsDF %>%
        insert_here_code_to_change_variables %>% # (question 1)
        insert_here_code_to_group_and_get_summaries_per_group  # (question 2)

First, create logical columns (columns containing TRUE or FALSE) for Timely response?, Consumer disputed? and Submitted via named timely_response_log, consumer_disputed_log and submitted_online, respectively.

timely_response_log and consumer_disputed_log will have value TRUE if Timely response? and Consumer disputed? have values Yes respectively, and FALSE if the value for the original column is No. submitted_online will have value TRUE if the the complaint was submitted via Web or Email.

You can double check your results for each column by getting a table with the original and modified column, as shown below. In this case, we would want all TRUE values to be in row Yes, and all FALSE to be in row No.

table(companyDF$`Timely response?`, companyDF$timely_response_log)
Items to submit
  • R code used to solve the problem.

  • Output from running your code.

Question 2

Continue the pipeline we started in question (1). Get the summary information for each company. Note that you will need to include more pipes in the pseudo-code from question (1) as we want the summary for each company in each state. If a company is present in 4 states, companyDF should have 4 rows for that company — one for each state. For the rest of the project, we will refer to a company as its unique combination of Company and State.

The function n() from dplyr counts the number of observations in the current group. It can only by used within mutate/transmute, filter, and the summarize functions.

Items to submit
  • R code used to solve the problem.

  • Output from running your code.

Question 3

Using ggplot2, create a scatterplot showing the relationship between percent_timely_response and percent_consumer_disputed for companies with at least 500 complaints. Based on your results, do you believe there is an association between how timely the company’s response is, and whether the consumer disputes? Why or why not?

Remember, here we consider each row of companyDF a unique company.

Items to submit
  • R code used to solve the problem.

  • Output from running your code.

Question 4

Which company, with at least 250 complaints, has the highest percent of consumer dispute?

We are learning tidyverse, so use tidyverse functions to solve this problem.

Items to submit
  • R code used to solve the problem.

  • Output from running your code.

Question 5

Create a graph using ggplot2 that compares States based on any columns from companyDF or complaintsDF. You may need to summarize the data, filter, or even create new variables depending on what your metric of comparison is. Below are some examples of graphs that can be created. Do not feel limited by them. Make sure to change the labels for each axis, add a title, and change the theme.

  • Cleveland’s dotplot for the top 10 states with the highest ratio between percent of disputed complaints and timely response.

  • Bar graph showing the total number of complaints in each state.

  • Scatterplot comparing the percentage of timely responses in the state and average number of complaints per state.

  • Line plot, where each line is a state, showing the total number of complaints per year.

Items to submit
  • R code used to solve the problem.

  • Output from running your code.

  • The plot produced.

  • 1-2 sentences commenting on your plot.