STAT 39000: Project 12 — Spring 2021

Motivation: As we mentioned before, data wrangling is a big part in any data driven project. "Data Scientists spend up to 80% of the time on data cleaning and 20 percent of their time on actual data analysis." Therefore, it is worth to spend some time mastering how to best tidy up our data.

Context: We are continuing to practice using various tidyverse packages, in order to wrangle data.

Scope: python

Learning objectives

Explain the differences between regular data frames and tibbles.
Use mutate, pivot, unite, filter, and arrange to wrangle data and solve data-driven problems.
Combine different data using joins (left_join, right_join, semi_join, anti_join), and bind_rows.
Group data and calculate aggregated statistics using group_by, mutate, and transform functions.
Demonstrate the ability to create basic graphs with default settings, in ggplot.
Demonstrate the ability to modify axes labels and titles.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

The first step in any data science project is to define our problem statement. In this project, our goal is to gain insights into customers' behaviours with regards to online orders and restaurant ratings.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/restaurant_recommendation/*.csv

Questions

Question 1

Load the tidyverse suite a packages, and read the data from files orders.csv, train_customers.csv, and vendors.csv into tibble`s named `orders, customers, and vendors respectively.

Take a look the tibbles and describe in a few sentences the type of information contained in each dataset. Although the name can be self-explanatory, it is important to get an idea of what exactly we are looking at. For each combination of 2 datasets, which column would you use to join them?

Items to submit

R code used to solve the problem.
Output from running your code.
1-2 sentences explaining each dataset (orders, customers, and vendors).
1-2 sentences for each combination of 2 datasets describing if we could combine the datasets or not, and which column you would you use to join them.

Question 2

Let’s tidy up our datasets a bit prior to joining them. For each dataset, complete the tasks below.

orders: remove columns from and between preparationtime to delivered_time (inclusive).
customers: take a look at the column dob. Based on its values, what do you believe it was supposed to contain? Can we rely on the numbers selected? Why or why not? Based on your answer, keep the columns akeed_customer_id, gender, and dob, OR just akeed_customer_id and gender.
vendors: take a look at columns country_id and city_id. Would they be useful to compare the vendors in our dataset? Why or why not? If not, remove the columns from the dataset.

Items to submit

R code used to solve the problem.
Output from running your code.
1-2 sentences describing what columns you kept for vendors and customers and why.

Question 3

Use your solutions from questions (1) and (2), and the join functions from tidyverse (inner_join, left_join, right_join, and full_join) to create a single tibble called myDF containing information only where all 3 tibbles intersect.

For example, we do not want myDF to contain orders from customers that are not in customers tibble. Which function(s) from the tidyverse did you use to merge the datasets and why?

myDF should have 132,226 rows.

When combining two datasets, you may want to change the argument suffix in the join function to specify from which dataset it came from. For example, when joining customers and orders: *_join(customers, orders, suffix = c('_customers', '_orders')).

Items to submit

R code used to solve the problem.
Output from running your code.
1-2 sentences describing which function you used, and why.

Question 4

Great, now we have a single, tidy dataset to work with. There are 2 vendor categories in myDF, Restaurants and Sweets & Bakes. We would expect there to be some differences. Let’s compare them using the following variables: deliverydistance, item_count, grand_total, and vendor_discount_amount. Our end goal (by the end of question 5) is to create a histogram colored by the vendor’s category (vendor_category_en), for each variable.

To accomplish this easily using ggplot, we will take advantage of pivot_longer. Pivot columns deliverydistance, item_count, grand_total, and vendor_discount_amount in myDF. The end result should be a tibble with columns variable and values, which contain the name of the pivoted column (variable), and values of those columns (values) Call this modified dataset myDF_long.

Items to submit

R code used to solve the problem.
Output from running your code.

Question 5

Now that we have the data in the ideal format for our plot, create a histogram for each variable. Make sure to color them by vendor category (vendor_category_en). How do the two types of vendors compare in these 4 variables?

Use the argument fill instead of color in geom_histogram.

You may want to add some transparency to your plot. Add it using alpha argument in geom_histogram.

You may want to change the argument scales in facet_*.

Items to submit

R code used to solve the problem.
Output from running your code.
2-3 sentences comparing Restaurants and Sweets & Bakes for deliverydistance, item_count, grand_total and vendor_discount_amount.