TDM 20100: Project 7 — 2023

Motivation: awk is a programming language designed for text processing. It can be a quick and efficient way to quickly parse through and process textual data. While Python and R definitely have their place in the data science world, it can be extremely satisfying to perform an operation extremely quickly using something like awk.

Context: This is the third of three projects where we introduce awk. awk is a powerful tool that can be used to perform a variety of the tasks that we’ve previously used other UNIX utilities for. After this project, we will continue to utilize all of the utilities, and bash scripts, to perform tasks in a repeatable manner.

Scope: awk, awk arrays, UNIX utilities

Learning Objectives
  • Use awk arrays to efficiently store sets of data

  • Use awk and functions to process and manipulate data.

  • Use piping and redirection within the terminal to pass around data between utilities.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

  • /anvil/projects/tdm/data/election/escaped2020sample.txt

  • /anvil/projects/tdm/data/flights/1990.csv

In awk, arrays are associative, meaning you can store data with key-value pairs. This makes it very efficient to manage data, especially for large files. We can manage to index our data in easier ways than in other programming languages. Awk arrays are versatile and powerful tools: you can index them using strings and numbers.

Question 1 (2 pts)

Consider the dataset iowa_liquor_sales_cleaner.txt.

In Project 05’s Question 2, you were asked to find out which Store Name had the most purchases, and once you obtained the store name, you calculated the total cost from that store. Now let’s use awk associative arrays to list each of the top 10 stores' total cost of purchases (per store).

(In Project 5 we used column 19, which was the State Bottle Cost. This time, instead, let’s focus on column 22.)

For this question, please use column 22, which is the Sale (Dollars).

  1. Find the total cost of the purchases for each store. For the output, show only the top 10 stores, in terms of the total cost of the purchases. List each of these top 10 stores and the total cost of the purchases of each.

Use column 4, which is the Store Name, for the index in the associate array. Use column 22, which is the Sale (Dollars), for the values to add up. When you print your results, it would help to print mytotal[i] first, and then print i second, so that you can sort your results numerically, using sort -n.

You might want to use sort -g or sort -gr so that your numbers are sorted with the scientific notation allowed. The -g specifically tells bash that some of your numbers are in scientific notation.

Question 2 (2 pts)

Let’s look at the dataset escaped2020sample.txt.

  1. This dataset contains the CITY(column 9), STATE (column 10), and TRANSACTION_AMT (column 15) for each donation. Please calculate the total transaction amounts for each city/state pair from this dataset (for instance, West Lafayette, IN). For your answer, list the top 10 city/state pairs with the largest total transaction amounts. Please use awk associative arrays for this question.

The values in columns 9, 10, and 15, all have double quotes on them. This will make it hard to add the values in column 15. You might want to add this short sequence into your pipeline, to remove the double quotes:

sed 's/"//g' (the "s" is for substitute and the "g" is for global, and we are removing the double quotes and replacing it with nothing!)

Compare these two lines, for instance:

cat /anvil/projects/tdm/data/election/escaped2020sample.txt | awk -F"|" '{print $9, $10, $15}' | head

versus this line:

cat /anvil/projects/tdm/data/election/escaped2020sample.txt | sed 's/"//g' | awk -F"|" '{print $9, $10, $15}' | head

I hope that helps! By removing the double quotes, you will be able to add the values in the 15th column. (If you have double quotes present, then awk will not like to add up the values in the 15th column, so you need to remove the double quotes.)

  • Since there have are some cities that share the same city names but different states, we need to combine the city and state into a city/state pair, to differentiate the locations. This is demonstrated with the year,month pair, in the fourth introductory video, at the start of the project, and is demonstrated again with the year,month,day triple near the end of the fourth video. You can do this in a similar way, using the city,state pair just like you used the year,month pair.

Question 3 (2 pts)

Now let us take a look at the dataset 1990.csv.

  1. You may have noticed that the "FlightDate" column (6th column) contains dates formatted as "1990-01-31". Please write an awk command to extract the year and month (not the day) from this column and then reformat them as (for instance) "01/1990". For your output, print each of the twelve months from 1990, and the number of flights that occur during each of those months.

You do NOT need associative arrays for this question. You just need to use cut or awk to extract the year,month pair.

Question 4 (2 pts)

  1. Use awk to create a new dataset from 1990.csv called 1990_flight_info.csv. This new file should include the following columns: the flight_month_year(MM/YYYY) just the same as you created in question 3, and also the total number of flights during that month_year. Order the results according to the number of flights in the month_year, from smallest to largest. The header of this file called 1990_flight_info.csv should look like:

columns
flight_month_year;total_number_of_flights

You do NOT need associative arrays for this question. We are just learning how to store the results of a bash pipeline into a file, using the > symbol.

Use > to redirect. You can output from the awk command to a new file with this operator. If you were to replace > by >> it would append instead of replace. In other words, if you use a single > it will first erase the output file before adding the results of the awk command to the file. If you use >>, it will append the results.

Make sure to submit the file 1990_flight_info.csv when you upload your files to Gradescope.

Project 07 Assignment Checklist

  • Jupyter notebook with your code, comments and output for questions 1 to 4

    • firstname-lastname-project07.ipynb.

  • A .sh text file with all of your bash code and comments written inside of it

    • bash code and comments used to solve questions 1 through 4

  • The output file from question 4, called: 1990_flight_info.csv

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.