TDM 20100: Project 4 - AWK

Project Objectives

This project introduces you to awk, a powerful and flexible text processing tool in Unix. Through this assignment, you’ll gain experience in using awk to analyze, manipulate, and summarize structured data from CSV files. You’ll also practice formatting outputs, filtering data with patterns, and combining awk with other common shell utilities like sort, uniq, head, and cut.

Learning Objectives
  • Use awk to extract and process fields from CSV and text files

  • Define and use variables in awk to track counts, sums, and averages

  • Format output using awk to create custom reports

  • Use conditionals and pattern matching to filter rows

  • Combine awk with sort, uniq, and other Unix tools for advanced data processing

  • Write concise awk scripts directly in the shell for real-world data tasks

Dataset

  • /anvil/projects/tdm/data/restaurant/

  • /anvil/projects/tdm/data/flights/subset/ (airplane data used in the video examples)

  • /anvil/projects/tdm/data/beer/reviews.csv (beer reviews data used in the video example)

Introduction

Recall the previous project, where you had the opportunity to manipulate the formatting of the output.

head /anvil/projects/tdm/data/amazon/Reviews.csv | cut -d "," -f4,7,9

The line above, which uses cut, can be rewritten with awk as shown below. Both produce the same output.

head /anvil/projects/tdm/data/amazon/Reviews.csv | awk -F"," '{print $4,$7,$9}'

awk often gives you greater flexibility in how your output looks. It’s a handy command to know! To learn more about awk, you can always run man awk in your terminal.

In short, awk is a text-processing tool in Unix: it scans and processes text based on patterns. awk is a versatile tool, ideally suited for data that is organized in columns. It provides pattern matching, field-based calculations, and formatting, all performed efficiently from the command line.

To begin learning about awk, it is important to conceptualize the two fundamental units that awk operates on: records and fields.

You can think of a record as one row of data in a file.

A field can be thought of as a singular element of data (within a row of data).
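For example, awk exposes the current record number as NR, the number of fields in the current record as NF, and the individual fields as $1, $2, and so on. Here is a minimal sketch on a hypothetical comma-separated file people.csv (the file name and contents are made up for illustration):

# Suppose people.csv (hypothetical) has lines like: Alice,25,Chicago
# NR is the record (line) number; NF is the number of fields in that record
awk -F, '{print "record", NR, "has", NF, "fields; the first field is", $1}' people.csv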

Simple awk commands that run on comma-separated data look like this:

awk -F, 'BEGIN{action to run before processing the data} {action to perform on each row of data} END{action to run after processing the data}' mydatafile.txt

but many awk programs do not have the beginning or ending section, i.e., they just look like this:

awk -F, '{action to perform on each row of data}' mydatafile.txt

The action to run before processing the data runs only once, before awk ever touches the data. It is helpful, for example, if you want to print a header before your output. (The BEGIN section is often omitted.)

The action to run after processing the data runs only once, after all of the data has been processed. It is helpful if you want to print some calculations that you ran on the data set. (The END section is often omitted.)

The main action of an awk program will run on each and every line of the data.

If the data is not comma-separated but (instead) is tab-separated, then we use -F'\t' instead of -F, (the quotes keep the shell from stripping the backslash). Similarly, if the data has | between the pieces of data, we use -F'|' instead, and if it has ; between the pieces of data, we use -F';' instead.
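Putting these pieces together, here is a minimal sketch (on a hypothetical tab-separated file scores.tsv) that prints a header in the BEGIN section, accumulates a running total on every row, and prints the total in the END section:

# scores.tsv (hypothetical) is tab-separated, with a name in field 1 and a number in field 2
awk -F'\t' 'BEGIN{print "name score"} {total += $2; print $1, $2} END{print "total:", total}' scores.tsv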

Questions

Question 1 (2 points)

Let’s practice converting some lines we wrote in previous projects into awk commands.

  1. Navigate to the /anvil/projects/tdm/data/restaurant/ directory.

  2. Use awk (instead of cut) to print the second column of the head of orders.csv.

  3. Recall how we used grep to list files in the previous project. Try using ls and awk to print all .csv files in the restaurant directory (hint: use regex; a generic pattern-matching sketch follows this list).

  4. What is the default field separator for awk?

  5. Use awk to print the second column of the first five lines in /anvil/projects/tdm/data/amazon/tracks.csv.
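As a generic illustration of the regex hint in step 3 (with a made-up extension, not the answer itself), awk prints any record that matches a /pattern/, so you can filter a directory listing like this:

# Print only the lines of ls output that end in .txt (hypothetical extension)
ls | awk '/\.txt$/ {print}'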

Relevant topics: awk, head, ls, pipeline

Deliverables

1a. Code used to solve all the steps above
1b. Written solution to Question 1.4 in Markdown
1c. Output from 1.2, 1.3, 1.5

Please refer to the following videos by Dr. Ward, which demonstrate the general structure of an awk command using airplane data:

Example 1 - awk

Example 2 - awk

Question 2 (2 points)

You can create temporary variables within awk commands!

Suppose I want to know the average score for the first 10 lines. I can run something like this:

head /anvil/projects/tdm/data/amazon/Reviews.csv | awk -F',' 'BEGIN{count=0}{count+=1;score+=$7}END{print score/count}'

Suppose I want to know the average score of all reviews given by a specific user. I can run something like this:

cat /anvil/projects/tdm/data/amazon/Reviews.csv | grep "AQQLWCMRNDFGI" | awk -F',' 'BEGIN{count=0} {count+=1;score+=$7} END{print score/count}'

# OR

cat /anvil/projects/tdm/data/amazon/Reviews.csv | awk -F',' 'BEGIN{count=0} {if ($3 ~ /AQQLWCMRNDFGI/) {count+=1;score+=$7} } END{print score/count}'

  1. Stay in /anvil/projects/tdm/data/restaurant/ and use orders.csv to answer the following:

  2. Find the total number of items sold (hint: this is not asking for the line count).

  3. Find the total revenue in orders.csv. To keep this question simple, don’t consider any discounts.

  4. Modify the print component above so that the total revenue is not printed in scientific notation (hint: printf; a generic sketch follows this list).

  5. Find the total count of orders that are $100 or more.

  6. Using cut, uniq, and sort, which customer placed orders most frequently (find their customer ID)?

  7. How much did the customer above spend in total?
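As a generic illustration of the printf hint in step 4 (on a made-up number, not the dataset), printf gives you control over the numeric format, so a large total is shown in fixed-point rather than scientific notation:

# %.2f prints a fixed-point number with two decimal places
echo "123456789.5" | awk '{printf "%.2f\n", $1}'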

Deliverables

2a. Code used to solve all the steps above
2b. Written solution to Question 2.6 and 2.7 in Markdown
2c. Output from 2.2, 2.3, 2.4, 2.5, 2.6, 2.7

The following videos by Dr. Ward show more awk examples with beer reviews and airplane data, respectively:

Example 3 - awk

Example 4 - awk

Question 3 (2 points)

You can create arrays in awk, known as associative arrays.

Suppose I want to find the five most popular reviewers. I can run something like this:

# Add up all helpfulness numerator scores for each reviewer ID
# Print every element in the array
# Sort the output, then print the first five

cat /anvil/projects/tdm/data/amazon/Reviews.csv | awk -F',' '{reviewer[$3] += $5} END {for (r in reviewer) print reviewer[r], r}' | sort -nr | head -n 5

  1. Stay in /anvil/projects/tdm/data/restaurant/

  2. Using test_full.csv, what is the total number of orders (a row is one order) for each gender group (hint: add NR > 1 to ignore the header; a generic sketch follows this list)?

  3. Using orders.csv, how many orders were placed from a customer’s favorite vendor?

  4. Using orders.csv, which vendor is marked as a favorite by the highest number of customers?
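As a generic illustration of the NR > 1 hint in step 2 (the file groups.csv and the column choice are hypothetical), this pattern skips the header row and counts rows per group with an associative array:

# Skip the header line (NR == 1), then count rows per value of field 2
awk -F, 'NR > 1 {count[$2]++} END {for (g in count) print g, count[g]}' groups.csv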

Deliverables

3a. Code used to solve all the steps above
3b. Written solution to Question 3.3 in Markdown
3c. Output from 3.2, 3.3, 3.4

Question 4 (2 points)

Let’s ask some complex questions that can be answered using awk commands.

  1. Stay in the /anvil/projects/tdm/data/restaurant/ directory.

  2. Using test_full.csv, for each gender group (female, male, blank), which vendor that they ordered from has the highest rating (field 30)?

  3. Using test_full.csv, for each gender group (female, male, blank), what is the average rating of the vendors they ordered from?

  4. Using orders.csv, which vendor (vendor_id) offered the most discounts?

  5. Using orders.csv, how many orders did the vendor (from the previous question) receive, and what percentage of those had a discount applied?

  6. Using orders.csv, which vendor received the most orders delivered to work?

Deliverables

4a. Code used to solve all the steps above.
4b. Written solution to Questions 4.2, 4.3, 4.4, 4.5, and 4.6 in Markdown.
4c. Output from Questions 4.2, 4.3, 4.4, 4.5, and 4.6

Question 5 (2 points)

You can use awk to format your output.

**************************************************
*              Our Best Customers                *
**************************************************
30 4Y0K8NQ
26 NETYQ1C
20 5EP9F62
19 YYKZN0F
19 4UZKXW2
**************************************************
* Total lines processed: 59504                   *
**************************************************
  1. Write an awk command that generates the exact same output as shown above using the file train_locations.csv.

    1. Your command must calculate the top 5 most frequent customer_id values and the total number of lines processed. Neither of these values may be hardcoded.

    2. You are free to copy and paste the formatting, but the actual values must be generated programmatically.

There are many ways to do this, and full credit will be given if your command prints the exact same output format. If you need a hint: two awk commands can be used in one pipeline (a hedged sketch follows).
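As a hedged sketch of that hint (the file items.csv, the field choice, and the banner text are all made up, and you would still need to work in the line count), the first awk tallies occurrences, sort and head keep the top entries, and a second awk wraps the result with a header and footer:

# First awk counts occurrences of field 1; sort -nr | head -n 3 keeps the top three;
# the second awk prints a banner in BEGIN and a footer in END
awk -F, '{count[$1]++} END {for (k in count) print count[k], k}' items.csv | sort -nr | head -n 3 | awk 'BEGIN{print "=== Top items ==="} {print} END{print "================="}'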

Relevant topics: awk, pipeline, cut, sort

Deliverables

5a. Code to answer Question 5.1
5b. Output from 5.1

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project4.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.