TDM 20100: Project 6 — Awk

Motivation: As you've seen so far, bash provides a wide variety of commands, and we can use pipes to connect those commands and perform a great deal of data processing in one big step. However, conciseness is a virtue. In this project we'll start learning about awk. By the end of the next few weeks, you will be able to do an entire pipeline's worth of work with just one call to awk!

Context: This project relates awk concepts back to previously learned commands; at the very least, a basic knowledge of filesystem navigation and regular expressions will be needed.

Scope: awk, data processing, Bash, GNU, CLI

Learning Objectives:
  • Learn the general structure of a call to awk

  • Construct your first basic awk program

  • Use awk to print common file information

  • Use awk to print specific parts of files and accomplish multiple commands' worth of work at once

Make sure to read about and use the template found here, along with the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt (Iowa liquor sales data)

  • /anvil/projects/tdm/data/election (election data)

  • /anvil/projects/tdm/data/beer/reviews.csv (beer reviews data)

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv (grocery store data)

  • /anvil/projects/tdm/data/flights/subset/ (airplane data)


Questions

Question 1 (2 pts)

To begin learning about awk, it is important to conceptualize the two fundamental units that awk operates on: records and fields.

You can think of a record as one row of data in a file.

A field can be thought of as a singular element of data (within a row of data).

Simple awk programs that run on comma-separated data look like this:

awk -F, 'BEGIN{action to run before processing the data} {action to perform on each row of data} END{action to run after processing the data}' mydatafile.txt

but many awk programs do not have the BEGIN or END section, i.e., they just look like this:

awk -F, '{action to perform on each row of data}' mydatafile.txt

The action to run before processing the data (the BEGIN section) runs only once, before awk ever touches the data. It is helpful, for example, if you want to print a header line before the rest of your output. (The BEGIN section is often omitted.)

The action to run after processing the data (the END section) also runs only once, after all of the data has been processed. It is helpful if you want to print out some calculations that you ran on the data set. (The END section is often omitted.)

The main action of an awk program runs on each and every line of the data.

If the data is not comma-separated, but (instead) is tab-separated, then we use -F'\t' instead of -F, (the quotes prevent bash from stripping the backslash before awk sees it). Or if the data has | between the pieces of data, then we use -F'|' instead. Or if the data has ; between the pieces of data, then we use -F';' instead.
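
For example, here is a minimal, self-contained sketch (run on made-up comma-separated data, not on one of the project files) showing all three sections in action:

printf 'pears,3\napples,5\ngrapes,2\n' | awk -F, 'BEGIN{print "fruit counts:"} {print $1": "$2; total += $2} END{print "total: "total}'

The BEGIN action prints the header once, the main action runs on each of the three records, and the END action prints the accumulated total (10) after the last record.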

The Sales values (given in Dollars) are available in field 22 of this data set:

head /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | cut -d';' -f22

Use awk (without using cut!) to add up all of the Sales values from the entire file.

The total of the Sales values is almost 4 billion dollars. The e+09 that shows up in the answer is scientific notation: it means multiply the number shown by 1,000,000,000 (i.e., by 1 billion).

This question might take 2 or 3 minutes to run.
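
Incidentally, awk prints large non-integer totals in scientific notation by default (this is where the e+09 comes from); if you would rather see all of the digits, one option is to use printf with an explicit format in the END section. A minimal sketch on made-up numbers:

printf '1500000000.25\n2300000000.50\n' | awk '{total += $1} END{printf "%.2f\n", total}'

This prints 3800000000.75, whereas print total would show 3.8e+09.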

Deliverables
  • Print the total of the Sales values from the entire file.

Question 2 (2 pts)

2a. Very similarly to Question 1, use awk (without using cut!) to sum the dollar amounts of all of the donations given in column 15 of this file: /anvil/projects/tdm/data/election/itcont1980.txt

2b. Now sum the dollar amounts of all of the donations given in column 15 of all of the itcont*.txt files (altogether).

2a. The total for your answer should be a little more than 200 million dollars.

2b. The total for your answer should be a little more than 62 billion dollars.

Question 2b might take 30 to 60 minutes to run.
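
For part 2b, it may help to know that awk accepts multiple filenames (including a glob such as itcont*.txt) in a single call, and the END section runs only once, after all of the files have been processed. Here is a minimal sketch using two throwaway files (the /tmp paths and values are made up for illustration):

printf 'a,10\nb,20\n' > /tmp/demo1.txt
printf 'c,30\n' > /tmp/demo2.txt
awk -F, '{total += $2} END{print total}' /tmp/demo*.txt

This prints 60, the total across both files.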

Deliverables
  • a. Print the sum of the dollar amounts of the donations in the 1980 election data.

  • b. Print the sum of the dollar amounts of the donations in all of the election data files of the form itcont*.txt.

Question 3 (2 pts)

Consider the data in the file /anvil/projects/tdm/data/beer/reviews.csv

Notice that the number of columns varies from line to line, because each line contains a varying number of commas. Also note that awk stores the number of fields on the current line in NF, so the last field on each line is $NF. Use this information to add up all of the values in the score column (the last field) in a variable called totalscores. At the same time, count the number of lines in a variable called totallines. Finally, at the end, print the ratio of totalscores to totallines, so that we have the overall average score across the entire data set.
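
Here is a generic sketch of this pattern, run on made-up lines whose last field plays the role of the score (not on the real reviews file):

printf 'a,b,4.0\nc,d,e,3.0\nf,3.5\n' | awk -F, '{totalscores += $NF; totallines += 1} END{print totalscores/totallines}'

This prints 3.5, the average of the three last-field values, even though the lines have different numbers of commas.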

Deliverables
  • Print the overall average score across the entire data set.

Question 4 (2 pts)

Consider the data in the file /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

Use grep SOUTH and also awk (not cut) to sum the values in the SPEND column (corresponding to lines with SOUTH for the STORE_R value). Then do this again (in a separate bash pipeline) for the EAST stores, then again (in a third bash pipeline) for the WEST stores, and finally (in a fourth bash pipeline) for the CENTRAL stores.

In the future, we will learn how to do all of this with one line of awk, but for now it is OK to do this in four separate bash pipelines.
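
Here is a generic sketch of the grep-into-awk pattern, run on made-up region/spend pairs (not on the real transactions file, where you will first need to determine which field number holds the SPEND values, e.g., by examining the header with head -n1):

printf 'SOUTH,1.50\nEAST,2.00\nSOUTH,0.25\n' | grep SOUTH | awk -F, '{total += $2} END{print total}'

This prints 1.75, the total of the two SOUTH values.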

Deliverables
  • Print the sum of the SPEND column values corresponding to each of the four store regions. This will take four separate bash pipelines, one Jupyter Lab cell each.

Question 5 (2 pts)

Consider the data in the file /anvil/projects/tdm/data/flights/subset/1990.csv

Use awk for formatted output, like this:

awk -F, '{print "flights from "$17" to "$18;}'

incorporated into a pipeline (with sort | uniq -c | sort -n | tail, as in the previous projects) to find the 10 most popular flight paths in 1990 and the number of flights on those paths. Hint: The top two flight paths should be:

  25779 flights from LAX to SFO
  26134 flights from SFO to LAX
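
As a generic sketch of this counting pattern, run on made-up origin/destination pairs:

printf 'ORD,JFK\nLAX,SFO\nLAX,SFO\n' | awk -F, '{print "flights from "$1" to "$2}' | sort | uniq -c | sort -n | tail

This prints each distinct path with its count, least common first.
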
Deliverables
  • Print the 10 most popular flight paths in 1990 and the number of flights on those paths, with the nice formatting described above.

Submitting your Work

We are just starting to get familiar with awk, so please feel welcome to ask for clarifications and help anytime. This is a powerful tool that will enable you to (pre-)process and analyze data very, very quickly. It is also a wonderful tool to incorporate in bash pipelines.

Items to submit
  • firstname-lastname-project6.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.