STAT 29000: Project 8 — Fall 2021

Motivation:

Context: This is the third and final part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, bash scripts, R, Python

Learning Objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

  • Write bash scripts to automate potential repeated tasks.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/taxi/*

Questions

This is the last project based on bash and awk — the rest are SQL. If you struggled or did not like the bash projects, you are not alone! This is frequently the most intimidating for students. Students tend to really like the SQL projects, so relief is soon to come.

Question 1

Take some time to explore /depot/datamine/data/taxi/**, and answer the following questions using UNIX utilities.

  • In which two directories is the bulk of the data (except fhv — we don’t care about that data for now)?

  • What is the total size in Gb of the data in those two directories?

So for example do all the files in dir1 have the same number of columns for every row? Do the files in dir2 have the same number of columns for every row?

Check out the PDFs in the directory to learn more about the dataset.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

To start, let’s focus on the yellow taxi data. The Total_Amt column is the total cost of the taxi ride. It is broken down into 4 categories: Fare_Amt, surcharge, Tip_Amt, and Tolls_Amt.

Write a bash script called question2.sh that accepts a path to a yellow taxi data file as an argument, and returns a breakdown of the overall percentage each of the 4 categories make up of the total.

Example output
fares: 5.0%
surcharges: 2.5%
tips: 2.5%
tolls: 90.0%

To help get you started, here is some skeleton code.

#!/bin/bash

awk -F',' '{
    # calculate stuff
    fares+=$13;
} END {
    # print stuff
}' $1

Make sure your output format matches this example exactly. Every value should be with 1 decimal place followed by a percentage sign.

It may take a minute to run. You are processing 2.5G of data!

This link may be useful.

The result of the following.

%%bash

chmod +x ./question2.sh
./question2.sh /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-01.csv

Should be:

fares: 92.6%
surcharges: 1.7%
tips: 4.5%
tolls: 1.1%
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Did you know awk has the ability to process multiple files at once? Pass multiple files to your script from question (2) to test it out.

%%bash

chmod +x ./question3.sh
./question3.sh /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-01.csv /depot/datamine/data/taxi/yellow/yellow_tripdata_2009-02.csv

Now, modify your script from question (2). Return the summary values from question (2), but for each month instead of for the overall data. Use Trip_Pickup_dateTime to determine the month.

Example output
January
----
fares: 5.0%
surcharges: 2.5%
tips: 2.5%
tolls: 90.0%
----

February
----
fares: 5.0%
surcharges: 2.5%
tips: 2.5%
tolls: 90.0%
----

etc..

You may will need to pass more than 1 file to your script in order to get more than 1 month of output.

To help get you started, you can find some skeleton code below.

#!/bin/bash

awk -F',' 'BEGIN{
    months[1] = "January"
    months[2] = "February"
    months[3] = "March"
    months[4] = "April"
    months[5] = "May"
    months[6] = "June"
    months[7] = "July"
    months[8] = "August"
    months[9] = "September"
    months[10] = "October"
    months[11] = "November"
    months[12] = "December"
} NR > 1 {
    # use split to parse out the month

    # convert the month to int
    month = int();

    # sum values by month using awk array

} END {
    for (m in total) {
        if (m != 0) {
            # print stuff
        }
    }
}' $@
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Pick 1 of the 2 following questions to answer. If you would like to answer both, your instructors and graders will be wow’d and happy (no pressure)!

To be clear, however, you only need to answer 1 of the following 2 questions in order to get full credit.

Question 4 (Option 1)

There are a lot of interesting questions that you could ask for this dataset. Here are some questions that could be interesting:

  • Does time of day, day of week, or month of year appear to have an effect on tips?

  • Are people indeed more generous (with tips) near Christmas?

  • How many trips are there, by hour of day? What are the rush hours?

  • Do different vendors charge more or less than other vendors?

Either choose a provided question, or write your own. Use your newfound knowledges of UNIX utilities and bash scripts to answer the question. Include the question you want answered, what, if any, hypotheses you have, what the data told you, and what you conclude (anecdotally).

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4 (Option 2)

Standard UNIX utilities are not the end-all be-all to terminal tools. this repository has a lot of really useful tools that tend to have an opinionated take on a classic UNIX tool.

ripgrep is the poster child of this new generation of tools. It is a text search utility that is empirically superior in the majority of metrics (to grep). Additionally, it has subjectively better defaults. You can read (in great detail) about ripgrep here.

In addition to those tools, there is xsv from the same developer as ripgrep. xsv is a utility designed to perform operations on delimited separated value files. Many of the questions that have been asked about in the previous few projects could have been quickly and easily answered using xsv.

Most of these utilities are available to you in a bash cell in Jupyter Lab. Choose 2 questions from previous projects and re-answer them using these modern tools. Which did you prefer, and why?

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.