STAT 29000: Project 5 — Fall 2021

Motivation: awk is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like awk.

Context: This is the first project where we introduce awk. awk is a powerful tool that can be used to perform a variety of the tasks that we’ve previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.

Scope: awk, UNIX utilities

Learning Objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

Questions

Question 1

While the UNIX tools we’ve used up to this point are very useful, awk enables many new capabilities, and can even replace major functionality of other tools.

In a previous question, we asked you to write a command that printed the number of columns in the dataset. Perform the same operation using awk.

Similarly, we’ve used head to print the header line. Use awk to do the same.

Similarly, we’ve used wc to count the number of lines in the dataset. Use awk to do the same.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

In a previous question, we used sort in combination with uniq to find the stores with the most number of sales.

Use awk to find the 10 stores with the most number of sales. In a previous solution, our output was minimal — we had a count and a store number. This time, take some time to format the output nicely, and use the store number to find the count (not store name).

Sorting an array by values in awk can be confusing. Check out this excellent stackoverflow post to see a couple of ways to do this. "Edit 2" is the easiest one to follow.

You can even use the store number to count the number of sales and save the most recent store name for the store number as you go to print the store names with the output.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Calculate the total number of sales (in USD) by county. Do this using any UNIX commands you have available. Then, do this using only awk.

gsub is a powerful awk utility that allows you to replace a string with another string. For example, you could replace all `$’s in field 2 with nothing by:

gsub(/\$/, "", $2)

The gsub operation happens in-place. In a nutshell, what this means is that the original field, $2 is replaced with the result of the gsub operation.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Use awk and piping to create a new dataset with the following columns, for every store, by month:

  • month_number: the month number (01-12)

  • year: the year (4-digit year, e.g., 2015)

  • store_name: store name

  • volume_sold: total volume sold

  • sold_usd: total amount sold in USD

Call the new dataset sales_by_store.csv.

Feel free to use the store name as a key for simplicity.

split is another powerful function in awk that allows you to split a string into multiple fields. You could, for example, extract the year from the date field as follows.

split($2, dates, "/", seps);

Then, you can access the year using dates[3].

You can use multiple values as a key in awk. This is a cool trick to count or calculate something by year, for example.

myarray[$4dates[3]]++

Here, $4 is the 4th field, dates[3] is the year. The resulting key would be something like "My Store Name2014", and we would have a new key (and associated value) for each store/year combination. In the provided code (below), Dr Ward suggests the use of a triple key, which includes the store name, the month, and the year.

Dr Ward walks you through a method of solution for this problem, in the video

cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt |
  awk -F\; 'BEGIN{ print "store_name;month_number;year;sold_usd;volume_sold" }
            {gsub(/\$/, "", $22); split($2, dates, "/", seps);
             mysales[$4";"dates[1]";"dates[3]] += $22;
             myvolumes[$4";"dates[1]";"dates[3]] += $24;
            }
            END{ for (mytriple in mysales) {print mytriple";"mysales[mytriple]";"myvolumes[mytriple]}}' >sales_by_store.csv
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Use awk to count how many times each store has sold more than $500,000 in a month. Output should be similar to the following. Sort the output from highest count to lowest.

store_name,count
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.