TDM 20100: Project 11 - More Practice in Awk

Project Objectives

This project focuses exclusively on associative arrays in awk, which allow you to use strings as array indices instead of just integers. This is a powerful feature that enables counting, grouping, and aggregating data in ways that would be very difficult with regular arrays.

Learning Objectives
  • Understand what associative arrays are and how they work

  • Learn how to use string indices in awk arrays

  • Practice counting occurrences and tracking values

  • Learn how to loop through associative arrays

  • Apply associative arrays to real-world data analysis problems

Dataset

  • /anvil/projects/tdm/data/restaurant/orders.csv

  • /anvil/projects/tdm/data/restaurant/test_full.csv

  • /anvil/projects/tdm/data/flights/subset/1995.csv

  • /anvil/projects/tdm/data/flights/subset/ (airplane data, used in the video)

  • /anvil/projects/tdm/data/death_records/DeathRecords.csv (used in the video)

If AI is used in any capacity, such as for debugging or research, we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click on “Create Link” and add the shareable link as part of your citation.

The project template in the Examples Book now has a “Link to AI Chat History” section; please include it in all your projects. If you did not use any AI tools, you may write “None”.

We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be directly applied or copy-pasted into your projects. Please refer to the-examples-book.com/projects/fall2025/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.

Questions

Question 1 (2 points)

Associative arrays in awk are like dictionaries in Python or hash maps in other languages. Instead of using numbers as indices (like array[1], array[2]), you can use strings as indices (like array["apple"], array["banana"]).

Here’s a simple example (remember to add %%bash at the beginning of each of your cells):

# Basic associative array example
echo "apple banana apple orange banana apple" | awk '
{
    # Loop through each word in the input
    for (i = 1; i <= NF; i++) {
        # Increment the count for the word
        # $i is the current word
        # count[$i] is the count for the word
        # ++ is the increment operator
        count[$i]++
    }
}
END {
    print "Word counts:"
    for (word in count) {
        print word ": " count[word]
    }
}'

This script counts how many times each word appears in the input.
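
For the input above, "apple" appears 3 times, "banana" 2 times, and "orange" once, so the output should contain those counts. Note that the order in which for (word in count) visits the keys is not specified in awk, so the lines may print in any order; one possible output is:

Word counts:
apple: 3
banana: 2
orange: 1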

Let’s start with a simple counting example. We’ll count how many orders each customer made in the orders.csv file.

First, let’s look at the structure of the orders.csv file:

head -n 3 /anvil/projects/tdm/data/restaurant/orders.csv

This will show us the first 3 rows of the file, where the first row contains the header/column names.

From that, we can see that the customer ID is in column 2. Let’s count orders per customer:

cat /anvil/projects/tdm/data/restaurant/orders.csv | awk -F',' '
NR > 1 {  # Skip header row
    customer_id = $2
    if (customer_id != "") {
        order_count[customer_id]++
    }
}
END {
    print "Customer ID, Number of Orders"
    for (customer in order_count) {
        print customer "," order_count[customer]
    }
}' | sort -t, -k2,2nr | head -n 10

The sort -t, -k2,2nr command sorts the results numerically by the second column in descending order; since we print just two columns (customer ID, order count), the second column is the number of orders.
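
If those sort flags are new to you, here is a minimal sketch on made-up data: -t, sets the field separator to a comma, -k2,2 restricts the sort key to the second field only, n compares numerically, and r reverses the comparison so the largest values come first.

printf 'a,3\nb,10\nc,2\n' | sort -t, -k2,2nr
# b,10
# a,3
# c,2

The n flag matters here: compared as text rather than numbers, "10" would sort before "3".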

Now that you see how to count orders per customer, please modify the command to count orders per vendor.

Hint: the vendor ID is in column $22.

Deliverables

1.1. Run the above command and show the results.
1.2. Which customer has the most orders? How many orders do they have?
1.3. Which vendor has the most orders? How many orders do they have?

For more practice, please refer to Dr. Ward’s videos below, which use the airplane data and the death records data, respectively.

Question 2 (2 points)

Now let’s work with the test_full.csv file and count restaurants by category. The vendor category is in column 15.

head -n 3 /anvil/projects/tdm/data/restaurant/test_full.csv

Let’s count how many restaurants are in each category:

cat /anvil/projects/tdm/data/restaurant/test_full.csv | awk -F',' '
NR > 1 {  # Skip header row
    category = $15
    if (category != "") {
        category_count[category]++
    }
}
END {
    print "Category, Number of Restaurants"
    for (cat in category_count) {
        print cat "," category_count[cat]
    }
}' | sort -t, -k2,2nr | head -n 10

Now, instead of counting the number of restaurants in each category, can you count the number of restaurants with each rating (4.0, 4.1, 4.2, etc.)? Sort your output by the rating in descending order.

Hint: the vendor rating is in column $30.

Deliverables

2.1. Run the above command and show the results.
2.2. What category has the most restaurants? How many restaurants are in that category?
2.3. How many different categories are there in total?
2.4. How many restaurants have a rating of 4.0? Do any restaurants have a 5.0 rating?

Question 3 (2 points)

Now, let’s learn how to compute averages. To calculate an average, we need a total and a count, so we can use two associative arrays to track this information.

We need to track:
- Total amount spent by each customer (sum),
- Number of orders by each customer (count),
- Then calculate average = sum / count.

In this example, we are tracking the total amount spent by each customer, and the number of orders made by each customer. This way, at the end, we can divide the total amount spent by the number of orders to get the average amount spent per order, on a per-customer basis.
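
Before looking at the full command below, here is a minimal sketch of this two-array pattern on made-up input, so you can see the sum and count arrays working together:

echo "alice 10
alice 20
bob 5" | awk '
{
    total[$1] += $2   # add this amount to the running total for this name
    count[$1]++       # count one more record for this name
}
END {
    for (name in count) {
        print name, total[name] / count[name]
    }
}'

This prints an average of 15 for alice ((10 + 20) / 2) and 5 for bob; as before, the order of the output lines is not guaranteed.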

cat /anvil/projects/tdm/data/restaurant/orders.csv | awk -F',' '
NR > 1 {  # Skip header row
    customer_id = $2
    order_amount = $4 + 0  # Convert string to number by adding 0
    if (customer_id != "" && order_amount > 0) { # ensure that the customer and order actually exist
        total_spent[customer_id] += order_amount
        order_count[customer_id]++
    }
}
END {
    print "Customer ID, Total Spent, Number of Orders, Average Order"
    for (customer in total_spent) {
        avg = total_spent[customer] / order_count[customer]
        printf "%s,%.2f,%d,%.2f\n", customer, total_spent[customer], order_count[customer], avg
    }
}' | sort -t, -k4,4nr | head -n 10

With -k4,4nr we are sorting numerically by the fourth column in descending order, to see the customers with the highest average order amount. Please modify this command to show the customers with the highest total spent, and run it.

Additionally, let’s look at some different information. Can you write a command to find the average number of items per order for each vendor?

The vendor ID is in column $22, and the number of items ordered is in column $3; remember to convert it to a number by adding 0.
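
If you are curious what adding 0 actually does, here is a small sketch with made-up values: adding zero forces awk to treat the field as a number, and anything that does not look like a number (including an empty field) becomes 0.

echo "42 3.5 abc" | awk '{ print $1 + 0, $2 + 0, $3 + 0 }'
# prints: 42 3.5 0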

Deliverables

3.1. Run the above command and show the results.
3.2. Which customer has the highest average order amount?
3.3. What is the highest average order amount?
3.4. Modify the command to show the customers with the highest total spent. Which customer spent the most money, and how much did they spend?
3.5. Which vendor has the highest average number of items ordered per order? How many items are ordered on average per order for this vendor?

Question 4 (2 points)

Now that you have seen how to compute averages on the restaurant orders dataset, let’s apply the same technique to a different dataset. Using the 1995 flight dataset, find the average distance traveled per flight for each airline.

First, let’s look at the structure of the flights dataset:

head -n 3 /anvil/projects/tdm/data/flights/subset/1995.csv

The unique carrier (airline) is in column $9, and the distance is in column $18. Remember to convert the distance to a number by adding 0, and skip the header row.

Now, write an awk command to find the average distance traveled per flight for each airline.

Sort your output by the average distance in descending order to see the carriers with the highest average distance per flight.

Deliverables

4.1. Write and run a command to find the average distance traveled per flight for each airline. Show the results.
4.2. Which carrier has the highest average distance per flight?
4.3. What is the highest average distance per flight?

Question 5 (2 points)

Now, it’s up to you. Please choose a few columns from the orders.csv file and use associative arrays to track some form of statistics. Explain what you are tracking and why this information may be useful to have. Also explain how you use associative arrays to accomplish this.

Deliverables

5.1. Select columns to gather/aggregate some statistics
5.2. Explain why these statistics may be useful.
5.3. Explain how you use associative arrays to accomplish this.

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_awk2_project12.ipynb

You must double-check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, even though it may not. Please take the time to double-check your work. See here for instructions on how to double-check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.