STAT 29000: Project 6 — Fall 2021

The anatomy of a bash script

Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, bash scripts, UNIX utilities

Learning Objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

  • Write bash scripts to automate potential repeated tasks.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/election/*

Questions

Question 1

Originally, this project was a bit more involved than intended. For this reason, I have provided you the solution to the question below the last "note" in this question. Instead of writing this script, I would like you to study it and try and understand what is going on.

We now have a grip on a variety of useful tools, that are often used together using pipes and redirections. As you start to see, "one-liners" can start to become a bit unwieldy. In these cases, wrapping everything into a bash script can be a good solution.

Imagine for a minute, that you have a single file that is continually appended to by another system. Let’s say this file is /depot/datamine/data/election/itcont1990.txt. Every so often, your manager asks you to generate a summary of the data in this file. Every time you do this, you have to dig through old notes to remember how you did this previously. Instead of constantly doing this manual process, you decide to write a script to handle this for you!

Write a bash script to generate a summary of the data in /depot/datamine/data/election/itcont1990.txt. The summary should include the following information, in the following format.

120 RECORDS READ
----------------
File: /depot/datamine/data/election/itcont1990.txt
Largest donor:
Most common donor state: NY
Total donations in USD by state:
- NY: 100000
- CA: 50000
...
----------------

For this question, assume that the data file will always be in the same location.

#!/bin/bash

FILE=/depot/datamine/data/election/itcont1990.txt

RECORDS_READ=`wc -l $FILE | awk '{print $1}'`

awk -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{
    print RECORDS_READ" RECORDS READ\n----------------";
}{
    donor_total_by_name[$8] += $15;
    most_common_donor_by_state[$10]++;
    donor_total_by_state[$10] += $15;
}END{
    PROCINFO["sorted_in"] = "@val_num_desc";
    print "File: "FILENAME;

    ct=0;

    for (i in donor_total_by_name) {
        if (ct < 1) {
            print "Largest donor: " i;
            ct++;
        }
    };

    ct=0;

    for (i in most_common_donor_by_state) {
        if (ct < 1) {
            print "Most common donor state: " i;
            ct++;
        }
    }

    print "Total donations in USD by state:";

    for (i in donor_total_by_state) {
        if (i != "STATE" && i != "") {
            print "\t- " i ": " donor_total_by_state[i];
        }
    }

    print "----------------";

}' "$FILE"

In order to run this script, you will need to paste the contents into a new file called firstname-lastname-q1.sh in your $HOME directory. In a new bash cell, run it as follows.

%%bash

chmod +x $HOME/firstname-lastname-q1.sh
$HOME/firstname-lastname-q1.sh

That chmod command is necessary to ensure that you can execute the script.

Create the script and run the script in a bash cell.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Your manager loves your script, but wants you to modify it so it works with any file formatted the same way. A new system is being installed that saves new data into new files rather than appending to the same file.

Modify the script from question (1) to accept an argument that specifies the file to process.

Start by copying the cold script from question (1) into a new file called firstname-lastname-q2.sh.

%%bash

cp $HOME/firstname-lastname-q1.sh $HOME/firstname-lastname-q2.sh

Then, test the updated script out on /depot/datamine/data/election/itcont2000.txt.

%%bash

$HOME/firstname-lastname-q2.sh /depot/datamine/data/election/itcont2000.txt

You can edit your scripts directly within Jupyter Lab by right clicking the files and opening in the editor.

The only difference between the two scripts are the new script you will be able to replace the $FILE argument to the wc command with something else.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Modify your script once again to accept n arguments, each a path to another file to generate a summary for.

Start by copying the cold script from question (2) into a new file called firstname-lastname-q3.sh.

%%bash

cp $HOME/firstname-lastname-q2.sh $HOME/firstname-lastname-q3.sh

You should be able to run the script as follows.

%%bash

$HOME/firstname-lastname-q3.sh /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
155 RECORDS READ
----------------
File: /depot/datamine/data/election/itcont2000.txt
Largest donor:
Most common donor state: NY
Total donations in USD by state:
- NY: 100000
- CA: 50000
...
----------------

120 RECORDS READ
----------------
File: /depot/datamine/data/election/itcont1990.txt
Largest donor:
Most common donor state: NY
Total donations in USD by state:
- NY: 100000
- CA: 50000
...
----------------

Again, the modification that will need to be made here aren’t so bad at all! If you just wrap the entirety of question (2)'s solution in a for loop where you loop through each argument, you’ll just need to make sure you change the $FILE argument to the wc command to be the argument you are setting in each loop.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Originally, this project was a bit more involved than intended. For this reason, I have provided you the solution to the question below the last "tip" in this question. Instead of writing this script, I would like you to study it and try and understand what is going on, and run the example we provide.

You are particularly interested in donors from your alma mater, Purdue University. Modify your script from question (3) yet again. This time, add a flag, that, when present, will include the name and amount for each donor where the word "purdue" (case insensitive) is present in the EMPLOYER column.

%%bash

$HOME/firstname-lastname-q4.sh -p /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
155 RECORDS READ
----------------
File: /depot/datamine/data/election/itcont2000.txt
Largest donor: ASARO, SALVATORE
Most common donor state: NY
Purdue donors:
- John Smith: 500
- Alice Bob: 1000
Total donations in USD by state:
- NY: 100000
- CA: 50000
...
----------------

120 RECORDS READ
----------------
File: /depot/datamine/data/election/itcont1990.txt
Largest donor: ASARO, SALVATORE
Most common donor state: NY
Purdue donors:
- John Smith: 500
- Alice Bob: 1000
Total donations in USD by state:
- NY: 100000
- CA: 50000
...
----------------

This stackoverflow response has an excellent template using getopt to parse your flags. Use this as a "start".

You may want to comment out or delete the part of the template that limits your non-flag arguments to one.

#!/bin/bash

# More safety, by turning some bugs into errors.
# Without `errexit` you don’t need ! and can replace
# PIPESTATUS with a simple $?, but I don’t do that.
set -o errexit -o pipefail -o noclobber -o nounset

# -allow a command to fail with !’s side effect on errexit
# -use return value from ${PIPESTATUS[0]}, because ! hosed $?
! getopt --test > /dev/null
if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
    echo 'I’m sorry, `getopt --test` failed in this environment.'
    exit 1
fi

OPTIONS=p
LONGOPTS=purdue

# -regarding ! and PIPESTATUS see above
# -temporarily store output to be able to check for errors
# -activate quoting/enhanced mode (e.g. by writing out “--options”)
# -pass arguments only via   -- "[email protected]"   to separate them correctly
! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "[email protected]")
if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
    # e.g. return value is 1
    #  then getopt has complained about wrong arguments to stdout
    exit 2
fi
# read getopt’s output this way to handle the quoting right:
eval set -- "$PARSED"

p=n
# now enjoy the options in order and nicely split until we see --
while true; do
    case "$1" in
        -p|--purdue)
            p=y
            shift
            ;;
        --)
            shift
            break
            ;;
        *)
            echo "Programming error"
            exit 3
            ;;
    esac
done

# handle non-option arguments
# if [[ $# -ne 1 ]]; then
#     echo "$0: A single input file is required."
#     exit 4
# fi

for file in "[email protected]"
do
    RECORDS_READ=`wc -l $file | awk '{print $1}'`

    awk -v PFLAG="$p" -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{
        print RECORDS_READ" RECORDS READ\n----------------";
    }{

        if ($8 != "") {
            donor_total_by_name[$8] += $15;
        }
        most_common_donor_by_state[$10]++;
        donor_total_by_state[$10] += $15;

        # see if "purdue" appears in line
        if (PFLAG == "y") {
            has_purdue = match(tolower($0), /purdue/)
            if (has_purdue != 0) {
                purdue_total_by_name[$8] += $15;
            }
        }

    }END{
        PROCINFO["sorted_in"] = "@val_num_desc";
        print "File: "FILENAME;

        ct=0;

        for (i in donor_total_by_name) {
            if (ct < 1) {
                print "Largest donor: " i;
                ct++;
            }
        };

        ct=0;

        for (i in most_common_donor_by_state) {
            if (ct < 1) {
                print "Most common donor state: " i;
                ct++;
            }
        }

        if (PFLAG == "y") {
            print "Purdue donors:";
            for (i in purdue_total_by_name) {
                print "\t- " i ": " purdue_total_by_name[i];
            }
        }

        print "Total donations in USD by state:";

        for (i in donor_total_by_state) {
            if (i != "STATE" && i != "") {
                print "\t- " i ": " donor_total_by_state[i];
            }
        }

        print "----------------\n";

    }' $file
done

Please copy and paste this code into a new script called firstname-lastname-q4.sh and run it.

%%bash

$HOME/firstname-lastname-q4.sh -p /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Originally, this project was a bit more involved than intended. Instead of writing this script from scratch, I would like you to fill in the parts of the script with the text FIXME, and then test out the script with the commands provided.

Your manager liked that new feature, however, she thinks the tool would be better suited to search the EMPLOYER column for a specific string, and then handle this generically, rather than just handling the specific case of Purdue.

Modify your script from question (4). Accept one and only one flag -e or --employer. This flag should take a string as an argument, and then search the EMPLOYER column for that string. Then, the script will print out the results. Only include the top 5 donors from an employer. The following is an example if we chose to search for "ford".

$HOME/firstname-lastname-q5.sh -e'ford' /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
155 RECORDS READ
----------------
File: /depot/datamine/data/election/itcont1990.txt
Largest donor: ASARO, SALVATORE
Most common donor state: NY
ford donors:
- John Smith: 500
- Alice Bob: 1000
Total donations in USD by state:
- NY: 100000
- CA: 50000
...
----------------

120 RECORDS READ
----------------
File: /depot/datamine/data/election/itcont2000.txt
Largest donor: ASARO, SALVATORE
Most common donor state: NY
ford donors:
- John Smith: 500
- Alice Bob: 1000
Total donations in USD by state:
- NY: 100000
- CA: 50000
...
----------------
#!/bin/bash

# More safety, by turning some bugs into errors.
# Without `errexit` you don’t need ! and can replace
# PIPESTATUS with a simple $?, but I don’t do that.
set -o errexit -o pipefail -o noclobber -o nounset

# -allow a command to fail with !’s side effect on errexit
# -use return value from ${PIPESTATUS[0]}, because ! hosed $?
! getopt --test > /dev/null
if [[ ${PIPESTATUS[0]} -ne 4 ]]; then
    echo 'I’m sorry, `getopt --test` failed in this environment.'
    exit 1
fi

OPTIONS=e:
LONGOPTS=employer:

# -regarding ! and PIPESTATUS see above
# -temporarily store output to be able to check for errors
# -activate quoting/enhanced mode (e.g. by writing out “--options”)
# -pass arguments only via   -- "[email protected]"   to separate them correctly
! PARSED=$(getopt --options=$OPTIONS --longoptions=$LONGOPTS --name "$0" -- "[email protected]")
if [[ ${PIPESTATUS[0]} -ne 0 ]]; then
    # e.g. return value is 1
    #  then getopt has complained about wrong arguments to stdout
    exit 2
fi
# read getopt’s output this way to handle the quoting right:
eval set -- "$PARSED"

e=-
# now enjoy the options in order and nicely split until we see --
while true; do
    case "$1" in
        -e|--employer)
            e="$2"
            shift 2
            ;;
        --)
            shift
            break
            ;;
        *)
            echo "Programming error"
            exit 3
            ;;
    esac
done

# handle non-option arguments
# if [[ $# -ne 1 ]]; then
#     echo "$0: A single input file is required."
#     exit 4
# fi

for file in "[email protected]"
do
    RECORDS_READ=`wc -l $file | awk '{print $1}'`

    awk -v EFLAG="$FIXME" -v RECORDS_READ="$RECORDS_READ" -F'|' 'BEGIN{ (1)
        print RECORDS_READ" RECORDS READ\n----------------";
    }
    {

        if ($8 != "") {
            donor_total_by_name[$8] += $15;
        }
        most_common_donor_by_state[$10]++;
        donor_total_by_state[$10] += $15;

        # see if search string appears in line
        if (EFLAG != "") {
            has_string = match(tolower($12), EFLAG)
            if (has_string != 0) {
                employer_total_by_name[$8] += $15;
            }
        }

    }END{
        PROCINFO["sorted_in"] = "@val_num_desc";
        print "File: "FILENAME;

        ct=0;

        for (i in donor_total_by_name) {
            if (ct < 1) {
                print "Largest donor: " i;
                ct++;
            }
        };

        ct=0;

        for (i in most_common_donor_by_state) {
            if (ct < 1) {
                print "Most common donor state: " i;
                ct++;
            }
        }

        ct=0;

        if (EFLAG != "") {
            print EFLAG" donors:";
            for (i in FIXME) { (2)
                if (ct < 5) {
                    print "\t- " i ": " FIXME[i]; (3)
                    FIXME; (4)
                }
            }
        }

        print "Total donations in USD by state:";

        for (i in donor_total_by_state) {
            if (i != "STATE" && i != "") {
                print "\t- " i ": " donor_total_by_state[i];
            }
        }

        print "----------------\n";

    }' $file
done
1 We should put "$something" here — check out how we handle this is question (4) and look at the changes it question (5) to help isolate what goes here.
2 What are we looping through here? All you need to do is change it to the only remaining awk array we haven’t looped through in the rest of the code.
3 Now we want to access the value of the array — it would make sense if it were the same array as the previous FIXME, right?!
4 Without this code, we will print ALL of the donors — not just the first 5.

Then test it out!

%%bash

$HOME/firstname-lastname-q5.sh -e'ford' /depot/datamine/data/election/itcont2000.txt /depot/datamine/data/election/itcont1990.txt
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.