TDM 20100: Project 6 — 2023

Motivation: awk is a programming language designed for text processing. It can be a quick and efficient way to parse and process textual data. While Python and R definitely have their place in the data science world, it can be very satisfying to perform an operation extremely quickly using something like awk.

Context: This is the second of three projects where we introduce awk. awk is a powerful tool that can perform many of the tasks for which we previously used other UNIX utilities. After this project, we will continue to use all of these utilities, together with bash scripts, to perform tasks in a repeatable manner.

Scope: awk, UNIX utilities

Learning Objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

Make sure to read about and use the template found here, and to read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/restaurant/orders.csv

  • /anvil/projects/tdm/data/whin/observations.csv

Questions

Question 1 (1 pt)

  1. How many columns and rows are in the following dataset: /anvil/projects/tdm/data/restaurant/orders.csv?

The following is example output

output
rows: 12345
columns: 12345
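
As a rough illustration of the built-in variables involved, the sketch below counts lines and fields on a tiny, made-up CSV. The sample data and the function name `count_rows_cols` are hypothetical and are not taken from the project files. It assumes a plain comma-separated file with no quoted commas inside fields, and it counts the header line as a row.

```shell
# Hypothetical sample data standing in for a real CSV file.
sample_csv='id,name,amount
1,alpha,10.5
2,beta,20.0'

# NR is the number of records (lines) read so far; NF is the number of
# fields in the current record. In the END block, NR is the total line count.
# Assumption: fields never contain quoted commas.
count_rows_cols() {
  awk -F',' 'NR == 1 { cols = NF }
             END { print "rows: " NR; print "columns: " cols }'
}

printf '%s\n' "$sample_csv" | count_rows_cols
# rows: 3
# columns: 3
```

To run the same idea against a file, the function could read from a redirect instead of a pipe, for example `count_rows_cols < some_file.csv`.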

Question 2 (1 pt)

  1. Please list all possible values of "Location Type" in the file

/anvil/projects/tdm/data/restaurant/orders.csv

and how many times each value occurs.

Your output should give each location type, followed by the number of orders for that Location Type. Use awk to answer this question. Make sure to format the output as follows:

output
Location Type       Number of Orders
--------------      ----------------
AAA                 12345
bb                  99999
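
One possible way to build a table like this is to count values in an awk associative array and then print fixed-width columns with `printf`. The sketch below uses made-up data in which the category happens to sit in the second field; the column position, the sample values, and the function name `count_by_kind` are all assumptions for illustration, so check the real header before adapting it.

```shell
# Hypothetical sample data; the category is assumed to be field 2.
sample_csv='order_id,kind,amount
1,AAA,5.00
2,bb,7.50
3,AAA,1.25'

count_by_kind() {
  # Step 1: skip the header (NR > 1) and count each distinct value.
  # Step 2: sort, because "for (k in n)" visits keys in no fixed order.
  # Step 3: print a header, then left-justify each key in a 20-character column.
  awk -F',' 'NR > 1 { n[$2]++ } END { for (k in n) print k "," n[k] }' |
    sort |
    awk -F',' 'BEGIN { printf "%-20s%s\n", "Kind", "Count" }
               { printf "%-20s%d\n", $1, $2 }'
}

printf '%s\n' "$sample_csv" | count_by_kind
```

The format `%-20s` pads a string on the right to 20 characters, which is what keeps the second column aligned.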

Question 3 (2 pts)

  1. What is the year range for the data in the dataset:

/anvil/projects/tdm/data/restaurant/orders.csv?
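
A range can be found by tracking a running minimum and maximum while awk reads the file. The sketch below assumes, purely for illustration, that the second field holds a timestamp that begins with a four-digit year. The sample data and the function name `year_range` are hypothetical, and the real file may store dates differently.

```shell
# Hypothetical sample data; timestamps are assumed to start with YYYY.
sample_csv='order_id,created_at
1,2018-11-02 09:15:00
2,2020-01-30 18:40:00
3,2019-06-12 12:00:00'

year_range() {
  awk -F',' 'NR > 1 {
      y = substr($2, 1, 4) + 0            # first four characters, coerced to a number
      if (min == "" || y < min) min = y   # min is empty until the first data row
      if (max == "" || y > max) max = y
    }
    END { print min "-" max }'
}

printf '%s\n' "$sample_csv" | year_range
# 2018-2020
```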

Question 4 (2 pts)

  1. What is the sum of the order amounts for each year in the data set

/anvil/projects/tdm/data/restaurant/orders.csv?

Please make sure the output format is the following:

output
Year Summary of Orders in dollars
2019 $PUT THE TOTAL DOLLAR AMOUNT HERE

It is totally OK if you put the dollar amount in scientific notation. That will probably happen by default when you add up the dollar amounts, because there were a lot of restaurant orders!

ANOTHER NOTE: There is only 1 year (namely, 2019) in this data set.
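
For reference, grouped sums follow the same associative-array pattern as counting, with `+=` in place of `++`. The sketch below runs on made-up data; the field positions, the sample values, and the function name `sum_by_year` are assumptions rather than the layout of the real file. It also shows that `printf` with `%.2f` prints a fixed two-decimal amount, whereas plain `print` may switch a large non-integer total to scientific notation.

```shell
# Hypothetical sample data; year is assumed to lead field 2, amount is field 3.
sample_csv='order_id,created_at,amount
1,2019-01-05 10:00:00,1500000.25
2,2019-02-07 11:30:00,2500000.50
3,2020-03-09 12:45:00,10.00'

sum_by_year() {
  # Accumulate the amount under its year, then print one line per year.
  # sort fixes the order, since "for (y in total)" has no guaranteed order.
  awk -F',' 'NR > 1 { total[substr($2, 1, 4)] += $3 }
             END { for (y in total) printf "%s $%.2f\n", y, total[y] }' |
    sort
}

printf '%s\n' "$sample_csv" | sum_by_year
# 2019 $4000000.75
# 2020 $10.00
```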

Question 5 (2 pts)

  1. Please extract both the years and months from the file

/anvil/projects/tdm/data/whin/observations.csv

and report how many times each year-and-month pair occurs.

Your output should give each year-and-month value, followed by the number of times that this year-and-month appears. Use awk to answer this question. You likely will need to use awk twice in a pipeline. Make sure to format the output as follows:

output
Month and Year      Number of Occurrences
--------------      ---------------------
2020-06             12345
2020-07             99999
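
A two-stage awk pipeline of the kind mentioned above might look like the sketch below: the first awk pulls out the year-and-month prefix, and the second counts how often each prefix appears. The sample data, the timestamp format (assumed to start with `YYYY-MM`), the field position, and the function name `count_year_months` are all hypothetical.

```shell
# Hypothetical sample data; timestamps are assumed to start with YYYY-MM.
sample_csv='station,observed_at
A,2020-06-01T00:05:00Z
B,2020-06-15T13:20:00Z
A,2020-07-02T08:00:00Z'

count_year_months() {
  # First awk: skip the header and keep the first 7 characters of field 2.
  # Second awk: count identical lines, then print them in aligned columns.
  awk -F',' 'NR > 1 { print substr($2, 1, 7) }' |
    awk '{ n[$0]++ } END { for (ym in n) printf "%-20s%d\n", ym, n[ym] }' |
    sort
}

printf '%s\n' "$sample_csv" | count_year_months
```

Splitting the work this way keeps each awk program short: the first stage only deals with the input layout, and the second stage only deals with counting and formatting.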

Project 06 Assignment Checklist

  • Jupyter notebook with your code, comments and output for questions 1 to 5

    • firstname-lastname-project06.ipynb.

  • A .sh text file with all of your bash code and comments written inside of it

    • bash code and comments used to solve questions 1 through 5

  • Submit files through Gradescope

Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.