TDM 20100: Project 4 — 2023

Motivation: Becoming comfortable piping commands in a chain, and getting used to navigating files in a terminal, are important skills for every data scientist to learn. These skills will give you the ability to quickly understand and manipulate files in a way which is not possible using tools like Microsoft Office, Google Sheets, etc. You may find that these UNIX tools are really useful for analyzing data.

Context: We’ve been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called 'piping'.

Scope: grep, regular expression basics, UNIX utilities, redirection, piping

Learning Objectives
  • Use cut to section off and slice up data from the command line.

  • Use | piping to string UNIX commands together.

  • Use sort and it’s options to sort data in different ways.

  • Use head to isolate n lines of output.

  • Use wc to summarize the number of lines in a file or in output.

  • Use uniq to filter out non-unique lines.

  • Use grep to search files effectively.

Make sure to read about, and use the template found here, and the important information about projects submissions here.


The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/stackoverflow/unprocessed/*

  • /anvil/projects/tdm/data/stackoverflow/processed/*

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt


For this project, please submit a .sh text file with all of you bash code written inside of it. This should be submitted in addition to your notebook (the .ipynb file). Failing to submit the accompanying .sh file may result and points being removed from your final submission. Thanks!

Question 1 (2 pts)

The following statement will check how many columns are found in this csv file:

cat /anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv | tr ',' '\n' | wc -l

BUT this file is a little bit strange, because it only has 1 large line. (In fact, there is no line ending at the end of the line, so wc says that the file has 0 lines!)

wc /anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv

In the question below, we can to turn the commas in this file into newline characters, and then count the number of words in the file.

  • In a csv file, the number of columns is usually 1 larger than the number of commas.

  • cat prints the file

  • head -n10 prints the first 10 lines of the file

  • tr ',' '\n' replaces all commas with the newline character \n

  • wc counts the number of lines, words, and characters.

  1. Please use commands head, tr and wc to find out how many words occur in the first 10 lines of the file /anvil/projects/tdm/data/stackoverflow/unprocessed/2011.csv

Question 2 (2 pts)

As you can see, csv files are not always so straightforward to parse. For this particular set of questions, we want to focus on using some other UNIX tools that are more useful on semi-clean datasets, e.g. awk

The following statement outputs the number of columns in each of the first 10 lines of the file:

head /anvil/projects/tdm/data/stackoverflow/processed/2011.csv | awk -F";" '{print NF}'
  • awk can be used for simple data manipulation tasks that involve pattern matching, field extraction, arithmetic, and string operations

    • -F";": Set the field separator to ;.

    • {print NF}: Print the number of fields in each line.

We are just starting to introduce awk, a little utility that allows us to analyze each line of the data. The main part of an awk command runs once on each line of the data set.

  1. Let’s turn our attention to a different file. Use awk to find out how many columns appear in the fifth row of the file /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

Question 3 (2 pts)

With appropriate commands, the following statement use finds the 5 largest orders, in terms of the number of Bottles Sold

cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | cut -d';' -f21 | sort -nr | head -n 5
  • cat is used to display the entire content of the file

  • cut is an UNIX command used to remove or "cut out" certain sections of each line from a file or the output of a command.

    • -d ';' specifies that the delimiter (or separator) between fields is the semicolon (;).

    • -f21 tells cut to only retrieve the 21st field/column (Bottles Sold column) based on the semicolon delimiter. So, after this command, you’ll get only the Bottles Sold values from the 21st column of the file iowa_liquor_sales_cleaner.txt.

  • sort arranges lines of text alphabetically or numerically.

    • -n means "numeric sort", so the values are treated as numbers and not as strings.

    • -r means "reverse", so the output will be in descending order

  • head is used to display only first 5 lines

  1. Use UNIX commands to find out what are the 6 highest 'state bottle retail' prices from the file /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt and what are the analogous item descriptions for these 6 items? (Some are repeated, and that is OK.)

  • column 16 is for 'item description' and column 20 is for 'state bottle retail' price

Question 4 (2 pts)

Here is another example. We can pipeline cat, cut,sort and uniq to display how many times each unique bottle volume appears in the file

cat /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt | cut -d';' -f18 | sort -n | uniq -c
  • column 18 (-f18) is for 'Bottle Volume (ml)'

  • uniq with the -c option, finds the number of occurrences of each outcome

  1. Please find out how many times each bottle volume appears in the file

  • column 18 indicates the bottle volume

Project 04 Assignment Checklist

  • Jupyter Lab notebook with your code and comments for the assignment

    • firstname-lastname-project04.ipynb.

  • A .sh text file with all of your bash code and comments written inside of it

    • bash code and comments used to solve questions 1 through 4

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.