TDM 20100: Project 5 — 2022

Motivation: awk is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like awk.

Context: This is the first of three projects where we introduce awk. awk is a powerful tool that can be used to perform a variety of the tasks that we’ve previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.

Scope: awk, UNIX utilities

Learning Objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

Questions

Question 1

While the UNIX tools we’ve used up to this point are very useful, awk enables many new capabilities, and can even replace major functionality of other tools.

In a previous question, we asked you to write a command that printed the number of columns in the dataset. Perform the same operation using awk.

Similarly, we’ve used head to print the header line. Use awk to do the same.

Similarly, we’ve used wc to count the number of lines in the dataset. Use awk to do the same.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

In a previous question, we used sort in combination with uniq to find the stores with the most number of sales.

Use awk to find the 10 stores with the most number of sales. In a previous solution, our output was minimal — we had a count and a store number. This time, take some time to format the output nicely, and use the store number to find the count (not store name).

Sorting an array by values in awk can be confusing. Check out this excellent stackoverflow post to see a couple of ways to do this. "Edit 2" is the easiest one to follow.

You can even use the store number to count the number of sales and save the most recent store name for the store number as you go to print the store names with the output.

You can pipe output to the column unix command to get neatly formatted output!

man column
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Calculate the total sales (in USD). Do this using only awk.

gsub is a powerful awk utility that allows you to replace a string with another string. For example, you could replace all `$’s in field 2 with nothing by:

gsub(/\$/, "", $2)

The gsub operation happens in-place. In a nutshell, what this means is that the original field, $2 is replaced with the result of the gsub operation (which removes the dollar signs).

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Calculate the total sales (in USD) by county. Do this using only awk. Format your output so it looks like the following.

output
FRANKLIN: $386729.06
HARRISON: $401811.83
Franklin: $2102880.14
Harrison: $2109578.24

Notice anything odd about the result? Look carefully at the dataset and suggest an alternative method that would clean up the issue.

You can see the issue in our tiny sample of output.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

awk is extremely powerful, and this liquor dataset is pretty interesting! We haven’t covered everything awk (and we won’t).

Look at the dataset and ask yourself an interesting question about the data. Use awk to solve your problem (or, at least, get you closer to answering the question). Explore various stackoverflow questions about awk and awk guides online. Try to incorporate an awk function you haven’t used, or a awk trick you haven’t seen. While this last part is not required, it is highly encouraged and can be a fun way to learn something new.

You do not need to limit yourself to just use awk, but try to do as much using just awk as you are able.

Items to submit
  • A markdown cell containing the question you are trying to answer about the dataset.

  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.