TDM 20100: Project 4 — 2022

Motivation: Becoming comfortable chaining commands and getting used to navigating files in a terminal is important for every data scientist to do. By learning the basics of a few useful tools, you will have the ability to quickly understand and manipulate files in a way which is just not possible using tools like Microsoft Office, Google Sheets, etc. While it is always fair to whip together a script using your favorite language, you may find that these UNIX tools are a better fit for your needs.

Context: We’ve been using UNIX tools in a terminal to solve a variety of problems. In this project we will continue to solve problems by combining a variety of tools using a form of redirection called piping.

Scope: grep, regular expression basics, UNIX utilities, redirection, piping

Learning Objectives
  • Use cut to section off and slice up data from the command line.

  • Use piping to string UNIX commands together.

  • Use sort and it’s options to sort data in different ways.

  • Use head to isolate n lines of output.

  • Use wc to summarize the number of lines in a file or in output.

  • Use uniq to filter out non-unique lines.

  • Use grep to search files effectively.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/stackoverflow/unprocessed/*

  • /anvil/projects/tdm/data/stackoverflow/processed/*

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

Questions

For this project, please submit a .sh text file with all of you bash code written inside of it. This should be submitted in addition to your notebook (the .ipynb file). Failing to submit the accompanying .sh file may result and points being removed from your final submission. Thanks!

Question 1

In a csv file, there are n+1 columns where n is the number of commas (in theory). Take the first line of unprocessed/2011.csv, replace all commas with the newline character, \n, and use wc to count the resulting number of lines. This should approximate how many columns are in the dataset. What is the value?

This can’t be right, can it? Print the first 100 lines after using tr to replace commas with newlines. What do you notice?

The newline character in UNIX is \n.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

As you can see, csv files are not always so straightforward to parse. For this particular set of questions, we want to focus on using other UNIX tools that are more useful on semi-clean datasets. Take a look at the first few lines of the data in processed/2011.csv. How many columns are there?

Take a look at iowa_liquor_sales_cleaner.txt — how many columns does that file have?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Continuing to look at the iowa_liquor_sales_cleaner.txt dataset, what are the 5 largest orders by number of bottles sold? How about by Gallons sold?

cat, cut, sort, and head will be useful.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

What are the different sizes (in ml) that a bottle of liquor comes in?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Benford’s law states that the leading digit in real-life sets of numerical data, the leading digit is likely to follow a distinct distribution (see the plot in the provided link). By this logic, the dollar amount in the orders should roughly match this, right?

Use any available bash tools you’d like to get a good idea of the count or percentage of the sales (in dollars) by starting digit. Are the results expected? Could there be some "funny business" going on? Write 1-2 sentences explaining what you think.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.