TDM 20100: Project 4 - AWK
Project Objectives
This project introduces you to `awk`, a powerful and flexible text-processing tool in Unix. Through this assignment, you'll gain experience using `awk` to analyze, manipulate, and summarize structured data from CSV files. You'll also practice formatting output, filtering data with patterns, and combining `awk` with other common shell utilities like `sort`, `uniq`, `head`, and `cut`.
Dataset
- `/anvil/projects/tdm/data/restaurant/`
- `/anvil/projects/tdm/data/flights/subset/` (airplane data used in the video examples)
- `/anvil/projects/tdm/data/beer/reviews.csv` (beer reviews data used in the video examples)
Introduction
Recall the previous project, where you had the opportunity to manipulate the formatting of the output.
head /anvil/projects/tdm/data/amazon/Reviews.csv | cut -d "," -f4,7,9
The line above using `cut` can be rewritten as the line below using `awk` instead. Both produce the same output.
head /anvil/projects/tdm/data/amazon/Reviews.csv | awk -F"," '{print $4,$7,$9}'
`awk` often gives you greater flexibility in how your output looks. It's a handy command to know! To learn more about `awk`, you can always run `man awk` in your terminal.
In short, `awk` is a text-processing tool in Unix that scans and processes text based on patterns. It is a versatile tool, ideally suited to data organized in columns. It handles pattern matching, field-based calculations, and file-formatting tasks, all efficiently from the command line.
To begin learning `awk`, it is important to understand the two fundamental units that `awk` operates on: records and fields.
You can think of a record as one row of data in a file.
A field can be thought of as a singular element of data (within a row of data).
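As a small illustration (using inline sample data rather than the course files), `NR` holds the current record number and `NF` holds the number of fields in that record:

```shell
# Three comma-separated sample records piped into awk.
# NR is the record (row) number, NF is the field count for that record,
# and $1 is the record's first field.
printf 'a,b,c\nd,e\nf,g,h,i\n' | awk -F',' '{print "record", NR, "has", NF, "fields; first is", $1}'
```

This prints one summary line per record, showing that `awk` recomputes `NF` for every row it reads.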
Simple `awk` programs that run on comma-separated data look like this:
awk -F, 'BEGIN{action to run before processing the data} {action to perform on each row of data} END{action to run after processing the data}' mydatafile.txt
but many awk programs do not have the beginning or ending section, i.e., they just look like this:
awk -F, '{action to perform on each row of data}' mydatafile.txt
The action to run before processing the data only runs once, before `awk` ever touches the data. It is helpful, for example, if you want to print a header before your output. (The `BEGIN` section is often omitted.)
The action to run after processing the data runs once, after all the data has been processed. It is helpful if you want to print out some calculations that you ran on the data set. (The `END` section is often omitted.)
The main action of an awk program will run on each and every line of the data.
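The three sections can be seen together in a small sketch (on made-up numbers, not the course data): `BEGIN` prints a header once, the main block runs per record, and `END` prints a summary at the end:

```shell
# BEGIN runs once before the data, the middle block runs on every
# record, and END runs once after the last record.
printf '3\n5\n10\n' | awk 'BEGIN{print "values:"; total=0} {print $1; total+=$1} END{print "sum =", total}'
```

The running total accumulated in the main block is only printed in `END`, after all three records have been processed.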
If the data is not comma-separated but (instead) is tab-separated, then we use `-F'\t'` instead of `-F,` (as an example). Or if the data has `|` between the pieces of data, then we use `-F'|'` instead. Or if the data has `;` between the pieces of data, then we use `-F';'` instead.
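A quick sketch on inline sample data shows each separator in action; quoting the separator keeps the shell from interpreting `\t`, `|`, or `;` itself:

```shell
# The same two-field record with three different delimiters.
printf 'x\ty\n' | awk -F'\t' '{print $2}'   # tab-separated
printf 'x|y\n'  | awk -F'|'  '{print $2}'   # pipe-separated
printf 'x;y\n'  | awk -F';'  '{print $2}'   # semicolon-separated
```

All three lines print `y`, the second field.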
Questions
Question 1 (2 points)
Let's practice converting some lines we wrote in previous projects into `awk` commands.
- Navigate to the `/anvil/projects/tdm/data/restaurant/` directory.
- Use `awk` (instead of `cut`) to print the second column of the head of `orders.csv`.
- Recall how we used `grep` to list files in the previous project. Try using `ls` and `awk` to print all `.csv` files in the `restaurant` directory (hint: use regex).
- What is the default field separator for `awk`?
- Use `awk` to print the second column of the first five lines in `/anvil/projects/tdm/data/amazon/tracks.csv`.
Relevant topics: awk, head, ls, pipeline
1a. Code used to solve all the steps above
1b. Written solution to Question 1.4 in Markdown
1c. Output from 1.2, 1.3, 1.5
Please refer to the following videos by Dr. Ward, which demonstrate the general structure of an `awk` command: Example 1 - awk, Example 2 - awk
Question 2 (2 points)
You can create temporary variables within `awk` commands!
Suppose I want to know the average score for the first 10 lines. I can run something like this:
head /anvil/projects/tdm/data/amazon/Reviews.csv | awk -F',' 'BEGIN{count=0}{count+=1;score+=$7}END{print score/count}'
Suppose I want to know the average score of all reviews given by a specific user. I can run something like this:
cat /anvil/projects/tdm/data/amazon/Reviews.csv | grep "AQQLWCMRNDFGI" | awk -F',' 'BEGIN{count=0} {count+=1;score+=$7} END{print score/count}'
# OR
cat /anvil/projects/tdm/data/amazon/Reviews.csv | awk -F',' 'BEGIN{count=0} {if ($3 ~ /AQQLWCMRNDFGI/) {count+=1;score+=$7} } END{print score/count}'
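One formatting detail worth knowing before the questions below: `print` uses `awk`'s default numeric output format (`OFMT`, `"%.6g"`), which renders large computed totals in scientific notation, while `printf` lets you choose a fixed-point format. A minimal sketch on a made-up number:

```shell
# print uses OFMT ("%.6g") for computed numbers, so a large total comes
# out in scientific notation; printf "%.2f" forces fixed-point output.
echo "123456789.5" | awk '{
  total = $1 + 0
  print total             # scientific notation
  printf "%.2f\n", total  # fixed-point, two decimals
}'
```

The `$1 + 0` forces `awk` to treat the field as a number rather than a string, so the default format actually applies.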
- Stay in `/anvil/projects/tdm/data/restaurant/` and use `orders.csv` to answer the following:
- Find the total item count sold. (hint: this is not asking for line count)
- Find the total revenue in `orders.csv`. To keep this question simple, don't consider any discounts.
- Modify the print component above so the total revenue is not printed in scientific notation. (hint: `printf`)
- Find the total count of orders that are $100 or more.
- Using `cut`, `uniq`, and `sort`, which customer ordered most frequently (find their customer ID)?
- How much did the customer above spend in total?
2a. Code used to solve all the steps above
2b. Written solution to Question 2.6 and 2.7 in Markdown
2c. Output from 2.2, 2.3, 2.4, 2.5, 2.6, 2.7
The following videos by Dr. Ward show more examples: Example 3 - awk, Example 4 - awk
Question 3 (2 points)
You can create arrays in `awk`, known as associative arrays.
Suppose I want to find the five most popular reviewers. I can run something like this:
# Add up all helpfulness numerator scores for each reviewer ID,
# print every element in the array,
# then sort the output and print the first five
cat /anvil/projects/tdm/data/amazon/Reviews.csv | awk -F',' '{reviewer[$3] += $5} END {for (r in reviewer) print reviewer[r], r}' | sort -nr | head -n 5
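A closely related pattern, shown here on inline sample data rather than the course files, counts occurrences per key instead of summing a score, and is likewise combined with `sort` to rank the results:

```shell
# Count how many times each value appears in column 1, then print
# "count key" pairs sorted with the largest count first.
printf 'a,1\nb,2\na,3\na,4\nb,5\n' | awk -F',' '{count[$1]++} END {for (k in count) print count[k], k}' | sort -nr
```

Here `count` is an associative array: its keys are the distinct values seen in column 1, created on first use.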
- Stay in `/anvil/projects/tdm/data/restaurant/`
- Using `test_full.csv`, what is the total number of orders (a row is one order) for each gender group (hint: add `NR > 1` to ignore the header)?
- Using `orders.csv`, how many orders were placed from a customer's favorite vendor?
- Using `orders.csv`, which vendor is marked as a favorite by the highest number of customers?
3a. Code used to solve all the steps above
3b. Written solution to Question 3.3 in Markdown
3c. Output from 3.2, 3.3, 3.4
Question 4 (2 points)
Let's ask some complex questions that can be answered using `awk` commands.
- Stay in the `/anvil/projects/tdm/data/restaurant/` directory.
- Using `test_full.csv`, for each gender (female, male, blank), which vendor that they ordered from has the highest rating (field 30)?
- Using `test_full.csv`, for each gender (female, male, blank), what is the average rating of the vendors they ordered from?
- Using `orders.csv`, which vendor (`vendor_id`) offered the most discounts?
- Using `orders.csv`, how many orders did the vendor (from the previous question) receive, and what percentage of those had a discount applied?
- Using `orders.csv`, which vendor received the most orders delivered to work?
4a. Code used to solve all the steps above.
4b. Written solution to Questions 4.2, 4.3, 4.4, 4.5, and 4.6 in Markdown.
4c. Output from Questions 4.2, 4.3, 4.4, 4.5, and 4.6
Question 5 (2 points)
You can use `awk` to format your output, for example:
**************************************************
* Our Best Customers *
**************************************************
30 4Y0K8NQ
26 NETYQ1C
20 5EP9F62
19 YYKZN0F
19 4UZKXW2
**************************************************
* Total lines processed: 59504 *
**************************************************
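As a generic sketch of the idea (on made-up data, not `train_locations.csv`), `printf` width specifiers can pad text inside a fixed-width banner so the border characters line up:

```shell
# %-46s left-justifies the text in a 46-character field, so together
# with the leading "* " and trailing " *" every line is exactly
# 50 characters wide and the closing asterisks line up.
printf 'apple\nbanana\n' | awk '
  BEGIN { for (i = 0; i < 50; i++) printf "*"; print ""
          printf "* %-46s *\n", "Fruit List" }
        { printf "* %-46s *\n", $0 }
  END   { for (i = 0; i < 50; i++) printf "*"; print "" }'
```

The `for` loops in `BEGIN` and `END` draw the borders, so nothing about the banner width is repeated by hand on every line.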
- Write an `awk` command that generates the exact same output as shown above, using the file `train_locations.csv`.
  - Your command must calculate the top 5 most frequent `customer_id` values and the total number of lines processed. None of these values may be hardcoded.
  - You are free to copy and paste the formatting, but the actual values must be generated programmatically.
- There are many ways to do this, and full credit will be given if your command prints the exact same output format. If you need a hint: two `awk` commands can be used in one pipeline.
Relevant topics: awk, pipeline, cut, sort
5a. Code to answer Question 5.1
5b. Output from 5.1
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- `firstname_lastname_project4.ipynb`
You must double check your notebook after submitting it in Gradescope. You will not receive full credit if your submitted notebook does not show all of the expected work and output.