STAT 39000: Project 5 — Fall 2020

Motivation: Becoming comfortable stringing together commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc.

Context: We’ve been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.

Scope: grep, regular expression basics, UNIX utilities, redirection, piping

Learning objectives
  • Use cut to section off and slice up data from the command line.

  • Use piping to string UNIX commands together.

  • Use sort and it’s options to sort data in different ways.

  • Use head to isolate n lines of output.

  • Use wc to summarize the number of lines in a file or in output.

  • Use uniq to filter out non-unique lines.

  • Use grep to search files effectively.

You can find useful examples that walk you through relevant material in The Examples Book:

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

Don’t forget the very useful documentation shortcut ? for R code. To use, simply type ? in the console, followed by the name of the function you are interested in. In the Terminal, you can use the man command to check the documentation of bash code.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/amazon/amazon_fine_food_reviews.csv

A public sample of the data can be found here: amazon_fine_food_reviews.csv

Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.

Here are three videos that might also be useful, as you work on Project 5:

Questions

Question 1

What is the Id of the most helpful review, according to the highest HelpfulnessNumerator?

You can always pipe output to head in case you want the first few values of a lot of output. Note that if you used sort before head, you may see the following error messages:

sort: write failed: standard output: Broken pipe
sort: write error

This is because head would truncate the output from sort. This is okay. See this discussion for more details.

Items to submit
  • Line of UNIX commands used to solve the problem.

  • The Id of the most helpful review.

Question 2

Some entries under the Summary column appear more than once. Calculate the proportion of unique summaries over the total number of summaries. Use two lines of UNIX commands to find the numerator and the denominator, and manually calculate the proportion.

To further clarify what we mean by unique, if we had the following vector in R, c("a", "b", "a", "c"), its unique values are c("a", "b", "c").

Items to submit
  • Two lines of UNIX commands used to solve the problem.

  • The ratio of unique `Summary’s.

Question 3

Use a chain of UNIX commands, piped in a sequence, to create a frequency table of Score.

Items to submit
  • The line of UNIX commands used to solve the problem.

  • The frequency table.

Question 4

Who is the user with the highest number of reviews? There are two columns you could use to answer this question, but which column do you think would be most appropriate and why?

You may need to pipe the output to sort multiple times.

To create the frequency table, read through the man pages for uniq. Man pages are the "manual" pages for UNIX commands. You can read through the man pages for uniq by running the following:

man uniq
Items to submit
  • The line of UNIX commands used to solve the problem.

  • The frequency table.

Question 5

Anecdotally, there seems to be a tendency to leave reviews when we feel strongly (either positive or negative) about a product. For the user with the highest number of reviews (i.e., the user identified in question 4), would you say that they follow this pattern of extremes? Let’s consider 5 star reviews to be strongly positive and 1 star reviews to be strongly negative. Let’s consider anything in between neither strongly positive nor negative.

You may find the solution to problem (3) useful.

Items to submit
  • The line of UNIX commands used to solve the problem.

Question 6

Find the most helpful review with a Score of 5. Then (separately) find the most helpful review with a Score of 1. As before, we are considering the most helpful review to be the review with the highest HelpfulnessNumerator.

You can use multiple lines to solve this problem.

Items to submit
  • The lines of UNIX commands used to solve the problem.

  • `ProductId’s of both requested reviews.

Question 7

For only the two ProductId from the previous question, create a new dataset called scores.csv that contains all ProductId and Score from all reviews for these two items.

Items to submit
  • The line of UNIX commands used to solve the problem.

OPTIONAL QUESTION

Use R to load up scores.csv into a new data.frame called dat. Create a histogram for each products' Score. Compare the most helpful review Score with those given in the histogram. Based on this comparison, point out some curiosities about the product that may be worth exploring. For example, if a product receives many high scores, but has a super helpful review that gives the product 1 star, I may tend to wonder if the product is not as great as it seems to be.

Items to submit
  • R code used to create the histograms.

  • 3 histograms, 1 for each ProductId.

  • 1-2 sentences describing the curious pattern that you would like to further explore.