STAT 29000: Project 4 — Fall 2021

Extracting and summarizing data in bash

Motivation: Becoming comfortable chaining commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc. While it is always fair to whip together a script using your favorite language, you may find that these UNIX tools are a better fit for your needs.

Context: We’ve been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.

Scope: grep, regular expression basics, UNIX utilities, redirection, piping

Learning Objectives
  • Use cut to section off and slice up data from the command line.

  • Use piping to string UNIX commands together.

  • Use sort and it’s options to sort data in different ways.

  • Use head to isolate n lines of output.

  • Use wc to summarize the number of lines in a file or in output.

  • Use uniq to filter out non-unique lines.

  • Use grep to search files effectively.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/stackoverflow/unprocessed/*

  • /depot/datamine/data/stackoverflow/processed/*

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

Questions

Question 1

One of the first things to do when first looking at a dataset is reading the first few lines of data in the file. Typically, there will be some headers which describe the data, and you get to see what some of the data looks like. Use the UNIX head command to read the first few lines of the data in unprocessed/2011.csv.

As you will quickly see, this dataset is just too wide — there are too many columns — to be useful. Let’s try and count the number of columns using head, tr, and wc. If we can get the first row, replace ,’s with newlines, then use `wc -l to count the number of lines, this should work, right? What happens?

The newline character in UNIX is \n.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

As you can see, csv files are not always so straightforward to parse. For this particular set of questions, we want to focus on using other UNIX tools that are more useful on semi-clean datasets. Take a look at the first few lines of the data in processed/2011.csv. How many columns are there?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Let’s switch gears, and look at a larger dataset with more data to analyze. Check out iowa_liquor_sales_cleaner.txt. What are the 5 largest orders by number of bottles sold?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

What are the different sizes (in ml) that a bottle of liquor comes in?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Which store has the most invoices? There are 2 columns you could potentially use to solve this problem, which should you use and why? For this dataset, does it end up making a difference?

This may take a few minutes to run. Grab a coffee. To prevent wasting time, try practicing on the head of the data instead of the entire data.

Be very careful when using uniq. Read the man pages for uniq, otherwise, you may not get the correct solution.

man uniq
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 6

sort is a particularly powerful function, albeit not always the most user friendly when compared to other tools.

For the largest sale (in USD), what was the volume sold in liters?

For the largest sale (in liters of liquor sold), what was the total cost (in USD)?

Use the -k option with sort to solve these questions.

To remove a dollar sign from text using tr, do the following.

tr -d '$'
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 7

Use head, grep, sort, uniq, wc, and any other UNIX utilities you feel comfortable using to answer a data-driven question about the iowa_liquor_sales_cleaner.txt dataset.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.