STAT 39000: Project 6 — Fall 2020

Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, UNIX utilities, bash scripts

Learning objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

  • Use output created from the terminal to create a plot using R.

Dataset

The following questions will use the dataset found here or in Scholar:

/class/datamine/data/flights/subset/YYYY.csv

An example from 1987 data can be found here or in Scholar:

/class/datamine/data/flights/subset/1987.csv

Questions

Question 1

In previous projects we learned how to get a single column of data from a csv file. Write 1 line of UNIX commands to print the 17th column, the Origin, from 1987.csv. Write another line, this time using awk to do the same thing. Which one do you prefer, and why?

Here is an example, from a different data set, to illustrate some differences and similarities between cut and awk:

Items to submit
  • One line of UNIX commands to solve the problem without using awk.

  • One line of UNIX commands to solve the problem using awk.

  • 1-2 sentences describing which method you prefer and why.

Question 2

Write a bash script that accepts a year (1987, 1988, etc.) and a column n and returns the nth column of the associated year of data.

Here are two examples to illustrate how to write a bash script:

In this example, you only need to turn in the content of your bash script (starting with #!/bin/bash) without evaluation in a code chunk. However, you should test your script before submission to make sure it works. To actually test out your bash script, take the following example. The script is simple and just prints out the first two arguments given to it:

#!/bin/bash
echo "First argument: $1"
echo "Second argument: $2"

If you simply drop that text into a file called my_script.sh, located here: /home/$USER/my_script.sh, and if you run the following:

# Setup bash to run; this only needs to be run one time per session.
# It makes bash behave a little more naturally in RStudio.
exec bash
# Navigate to the location of my_script.sh
cd /home/$USER
# Make sure that the script is runable.
# This only needs to be done one time for each new script that you write.
chmod 755 my_script.sh
# Execute my_script.sh
./my_script.sh okay cool

then it will print:

First argument: okay
Second argument: cool

In this example, if we were to turn in the "content of your bash script (starting with #!/bin/bash) in a code chunk, our solution would look like this:

#!/bin/bash
echo "First argument: $1"
echo "Second argument: $2"

And although we aren’t running the code chunk above, we know that it works because we tested it in the terminal.

Using awk you could have a script with just two lines: 1 with the "hash-bang" (#!/bin/bash), and 1 with a single awk command.

Items to submit
  • The content of your bash script (starting with #!/bin/bash) in a code chunk.

Question 3

How many flights arrived at Indianapolis (IND) in 2008? First solve this problem without using awk, then solve this problem using only awk.

Here is a similar example, using the election data set:

Items to submit
  • One line of UNIX commands to solve the problem without using awk.

  • One line of UNIX commands to solve the problem using awk.

  • The number of flights that arrived at Indianapolis (IND) in 2008.

Question 4

Do you expect the number of unique origins and destinations to be the same based on flight data in the year 2008? Find out, using any command line tool you’d like. Are they indeed the same? How many unique values do we have per category (Origin, Dest)?

Here is an example to help you with the last part of the question, about Origin-to-Destination pairs. We analyze the city-state pairs from the election data:

Items to submit
  • 1-2 sentences explaining whether or not you expect the number of unique origins and destinations to be the same.

  • The UNIX command(s) used to figure out if the number of unique origins and destinations are the same.

  • The number of unique values per category (Origin, Dest).

Question 5

In (4) we found that there are not the same number of unique Origin as Dest. Find the IATA airport code for all Origin that don’t appear in a Dest and all Dest that don’t appear in an Origin in the 2008 data.

The examples on this page should help. Note that these examples are based on Process Substitution, which basically allows you to specify commands whose output would be used as the input of comm. There should be no space between the open bracket and open parenthesis, otherwise your bash will not work as intended.

Items to submit
  • The line(s) of UNIX command(s) used to answer the question.

  • The list of all Origin that don’t appear in Dest.

  • The list of all Dest that don’t appear in Origin.

Question 6

What was the percentage of flights in 2008 per unique Origin with the Dest of "IND"? What percentage of flights had "PHX" as Origin (among all flights with Dest of "IND")?

Here is an example using the percentages of donations contributed from CEOs from various States:

You can do the mean calculation in awk by dividing the result from (3) by the number of unique Origin that have a Dest of "IND".

Items to submit
  • The percentage of flights in 2008 per unique Origin with the Dest of "IND".

  • 1-2 sentences explaining how "PHX" compares (as a unique ORIGIN) to the other Origin`s (all with the `Dest of "IND")?

Question 7

Write a bash script that takes a year and IATA airport code and returns the year, and the total number of flights to and from the given airport. Example rows may look like:

1987, 12345
1988, 44

Run the script with inputs: 1991 and ORD. Include the output in your submission.

Items to submit
  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.

  • The output of the script given 1991 and ORD as inputs.

OPTIONAL QUESTION 1

Pick your favorite airport and get its IATA airport code. Write a bash script that, given the first year, last year, and airport code, runs the bash script from (7) for all years in the provided range for your given airport, or loops through all of the files for the given airport, appending all of the data to a new file called my_airport.csv.

Items to submit
  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.

OPTIONAL QUESTION 2

In R, load my_airport.csv and create a line plot showing the year-by-year change. Label your x-axis "Year", your y-axis "Num Flights", and your title the name of the IATA airport code. Write 1-2 sentences with your observations.

Items to submit
  • Line chart showing year-by-year change in flights into and out of the chosen airport.

  • R code used to create the chart.

  • 1-2 sentences with your observations.