STAT 29000: Project 12 — Spring 2022

Motivation: As we mentioned before, data wrangling is a big part in any data driven project. "Data Scientists spend up to 80% of the time on data cleaning and 20 percent of their time on actual data analysis." Therefore, it is worth to spend some time mastering how to best tidy up our data.

Context: We are continuing to practice using various tidyverse packages, in order to wrangle data.

Scope: R, tidyverse, ggplot

Learning Objectives
  • Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems.

  • Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.

  • Group data and calculate aggregated statistics using group_by, mutate, summarize, and transform functions.

  • Demonstrate the ability to create basic graphs with default settings, in ggplot.

  • Demonstrate the ability to modify axes labels and titles.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/consumer_complaints/complaints.csv

Questions

Question 1

As usual, we will start by loading tidyverse package (using the read_csv function) and reading the processed data complaints.csv into a tibble called dat.

The first step in many projects is to define the problem statement and goal. That is, to understand why the project is important and what are our desired delivarables. For this project and the next, we will assume that we want to improve consumer satisfaction, and to achieve that we will provide them with data-driven tips and plots to make informed decisions.

What is the type of data in the date_sent_to_company and date_received columns? Do you think that these columns are in a good format to calculate the wait time between receiving and sending the complaint to the company? No need to overthink anything — from your perspective, how simple/complicated would be the steps to calculate the number of days between received and sent to the company, if the data remaining in the current format?

The glimpse function is a good function to get a sample of many columns of data at once. Althought it may be more difficult to read at first (than say, head), since it lists a single row for each column, it is better when you have many columns in a dataset.

Also, try to keep using the pipes (%>%) for the entire project.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1 sentence answering the type of columns date_sent_to_company and date_received.

  • 1-2 sentences commenting on if you think it would be easy to calculate calculate the wait time between receiving and sending the complaint to the company if the type of data remained as is.

Question 2

Tidyverse has a few "data type"-specific packages. For example, the package lubridate is fantastic when dealing with dates, and the package stringr was made to help when dealing with text (chr). Although lubridate is a part of the tidyverse universe, it does not get loaded when we run library(tidyverse).

Begin this question by loading the lubridate package. Use the appropriate function to convert the columns refering to dates to a date format.

There are lots of really great (and "official") cheat sheets for tidyverse packages. I find these immensely useful and almost always pull the cheat sheet up when I use a tidyverse package.

Here are the cheat sheets.

Here is the lubridate cheat sheet, which I think is particularly good.

Try to solve this question using the mutate_at function.

You will notice a pattern within the tidyverse of functions named *_at, *_if, and *_all. For example, for the mutate and summarize functions there are versions like mutate_at or summarize_if, etc. These variants of the functions are useful for applying the functions to relevant columns without having to specify individual columns by name.

Take a look at the functions: ydm, ymd, dym, mdy, myd, dmy.

If you are using mutate_at, run the followings command and see what happens.

dat %>%
    select(contains('product')) %>%
    head()

dat %>%
    summarize_at(vars(contains('product')), function(x) length(unique(x)))

dat %>%
    group_by(product) %>%
    summarize_if(is.numeric, mean)
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Result from running glimpse(dat).

Question 3

Add a new column called wait_time that is the difference between date_sent_to_company and date_received in days.

You may want to use the argument units in the difftime function.

Remember that mutate is the function you want to use to add a new column based on other columns.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Frequently, we need to perform many data wrangling tasks: changing data types, performing operations more easily, summarizing the data, and pivoting to be enable plotting certain types of plots.

So far, we spent most of the project doing just that! Now that we have the wait_time in a format that allows us to plot it, let’s start creating some tips and plots to increase costumer satisfaction. One thing we may want to do is give the user information on the wait_time based on how the complaint was submitted (submitted_via). Compare wait_time values by how they are submitted.

Keep in mind that we want to present this information in a way that would be helpful to costumers. For example, if you summarized the data, you could present the information as a tip and include the summarized wait_time values for the fastest and slowest methods. If you are making a plot and the plot has tons of outliers, maybe we want to consider cutting our axis (or filtering) the data to include just the certain values.

Be sure to explain your reasoning for each step of your analysis. If you are summarizing, why did you pick this method, and why are you summarizing the way you are (for example, are you using the average time, the median time, the maximum time, the mean(wait_time) + 3*std(wait_time))? You may also want to create 3 categories of wait_time (small, medium, high) and do a table between the categorical wait time and submission types. Why are you presenting the information the way you are?

Figuring out how to present the information to help someone make a decision is an important step in any project! You may very well be presenting to someone that is not as familiar with data science/statistics/computer science as you are.

If you are creating categorical wait time, take a look at the case_when function.

One example could be:

The plot below shows the average time it takes for the company to receive your complaint after you sent it based on _how_ you sent it. Note that, on average, it takes XX days to get a response if you submitted via YY. Alternatively, it takes, on avaerage, twice as long to receive a response if you submit a complain via ZZ. Be sure to keep this in mind when submitting a complaint.
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences explaning your reasoning for how you presented the information.

  • Information for costumer to make decision (plot, tip, etc).

Question 5

Note that we have a column called timely_response in our dat. It may or may not (in reality) be related to wait_time, however, we would expect it to be. What would you expect to see? Compare wait_time to timely_response using any technique you’d like. You can use the same idea/technique from question 4, or you can pick something else entirely different. It is completely up to you!

Would this information be relevant to include in a tip or dashboard for a costumer to make their decision? Why or why not? Would you combine this information with the one for wait_time? If so, how?

Sometimes there are many ways to present similar pieces of information, and we must decide what we believe makes most sense, and what will be most helpful when making a decision.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences comparing wait_time for timely and not timely responses.

  • 1-2 sentences explaining whether you would include this information for costumers, and why or why not? If so, how would you include it?

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.