STAT 29000: Project 7 — Fall 2021

Bashing out liquor sales data

Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the second project in a series of projects focused on bash and awk. Here, we take a deeper dive and create some more complicated awk scripts, as well as utilize the bash skills learned in previous projects.

Scope: bash, awk, bash scripts, R, Python

Learning Objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

  • Write bash scripts to automate potential repeated tasks.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt

Questions

Question 1

You may have noticed that the "Store Location" column (8th column) contains latitude and longitude coordinates. That is some rich data that could be fun and useful.

The data will look something like the following:

1013 MAINKEOKUK 52632(40.39978, -91.387531)

What this means is that you can’t just parse out the latitude and longitude coordinates and call it a day — you need to use awk functions like gsub and split to extract the latitude and longitude coordinates.

Use awk to print out the latitude and longitude for each line in the original dataset. Output should resemble the following.

lat,lon
1.23,4.56

Make sure to take care of rows that don’t have latitude and longitude coordinates — just skip them. So if your results look like this, you need to add logic to skip the "empty" rows:

40.39978, -91.387531
40.739238, -95.02756
40.624226, -91.373211
,
41.985887, -92.579244

To do this, just go ahead and wrap your print in an if statement similar to:

if (length(coords[1]) > ) {
    print coords[1]";"coords[2]
}

split and gsub will be useful awk functions to use for this question.

If we have a bunch of data formatted like the following:

1013 MAINKEOKUK 52632(40.39978, -91.387531)

If we first used split to split on "(", for example like:

split($8, coords, "(", seps);

coords[2] would be:

40.39978, -91.387531)

Then, you could use gsub to remove any ")" characters from coords[2] like:

gsub(/\)/, "", coords[2]);

coords[2] would be:

40.39978, -91.387531

At this point I’m sure you can see how to use awk to extract and print the rest!

Don’t forget any lingering space after the first comma! We don’t want that.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Redo question (4) (and reproduce sales_by_store.csv) from project (5), but this time add 2 additional columns to the dataset — lat and lon.

  • 'lat': latitude

  • 'lon': longitude

Before you panic (this was a tough question), we’ve provided the solution below as a starting point for you.

%%bash

awk -F';' 'BEGIN{ print "store_name;month_number;year;sold_usd;volume_sold" }
    {
        gsub(/\$/, "", $22); split($2, dates, "/", seps);
        mysales[$4";"dates[1]";"dates[3]] += $22;
        myvolumes[$4";"dates[1]";"dates[3]] += $24;
    }
    END{
        for (mytriple in mysales)
        {
            print mytriple";"mysales[mytriple]";"myvolumes[mytriple]
        }
    }' /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt > sales_by_store.csv

It may take a few minutes to run this script. Grab a coffee, tea, or something else to keep you going.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Believe it or not, awk even supports geometric calculations like sin and cos. Write a bash script that, given a pair of latitude and pair of longitude, calculates the distance between the two points.

Okay, so how to get started? To calculate this, we can use the Haversine formula. The formula is:

$2*r*arcsin(\sqrt{sin^2(\frac{\phi_2 - \phi_1}{2}) + cos(\phi_1)*cos(\phi_2)*sin^2(\frac{\lambda_2 - \lambda_1}{2})})$

Where:

  • $r$ is the radius of the Earth in kilometers, we can use: 6367.4447 kilometers

  • $\phi_1$ and $\phi_2$ are the latitude coordinates of the two points

  • $\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points

In awk, sin is sin, cos is cos, and sqrt is sqrt.

To get the arcsin use the following awk function:

function arcsin(x) { return atan2(x, sqrt(1-x*x)) }

To convert from degrees to radians, use the following awk function:

function dtor(x) { return x*atan2(0, -1)/180 }

The following is how the script should work (with a real example you can test):

./question3.sh 40.39978 -91.387531 40.739238 -95.02756
Results
309.57

To include functions in your awk command, do as follows:

awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{
    lat1 = dtor(lat1);
    print lat1;
    # rest of your code here!
}'

We want you to create a bash script called question3.sh. After you have your bash script, we want you to run it in a bash cell to see the output.

The following is some skeleton code that you can use to get started.

#!/bin/bash

lat1=$1
lat2=$3
lon1=$2
lon2=$4

awk -v lat1=$1 -v lat2=$3 -v lon1=$2 -v lon2=$4 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }BEGIN{
    lat1 = dtor(lat1);
    print lat1;
    # rest of your code here!
}'

You may need to give your script execute permissions like this.

chmod +x /path/to/question3.sh
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

Create a new bash script called question4.sh that accepts a latitude, longitude, filename, and n.

The latitude and longitude are a point that we want to calculate the distance from.

The filename is sales_by_store.csv — our resulting dataset from question 3.

Finally, n is the number of stores from our sales_by_store.csv file that we want to calculate the distance from the provided longitude and latitude.

./question4.sh 40.39978 -91.387531 sales_by_store.csv 3
Output
Distance from (40.39978,-91.387531)
store_name,distance
The Music Station,253.915
KUM & GO #4 / LAMONI,213.455
KUM & GO #4 / LAMONI,213.447

To get you started, you can use the following "starter" code. Fix the code to work:

#!/bin/bash

lat_from=$1
lon_from=$2
file=$3
n=$4

awk -F';' -v n=$n -v lat_from=$lat_from -v lon_from=$lon_from 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }function distance(lat1, lon1, lat2, lon2) {
    # question 2 code here (1)
    return dist;
}BEGIN {
    print "Distance from ("lat_from","lon_from")"
    print "store_name,distance";
} NR>1 && NR <= n+1 {
    lat2 = FIXME; (2)
    lon2 = FIXME; (3)
    dist = distance(lat_from, lon_from, FIXME, FIXME); (4)
    print $1","dist
}' $file
1 Add your code from question 2 here and make sure your distance is stored in a variable called dist (which we return).
2 Which value goes here?
3 Which value goes here?
4 Which values go here?
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5 (optional, 0 pts)

Use your choice of Python or R, with our sales_by_store.csv to create a beautiful graphic mapping the latitudes and longitudes of the stores. If you want to, get creative and increase the size of the points on the map based on the number of sales. You could create a graphic for each month to see how sales change month-to-month. The options are limitless, get creative!

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.