STAT 29000: Project 8 — Fall 2020

Motivation: A bash script is a powerful tool for performing repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming; however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, UNIX utilities, bash scripts

Learning objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

Dataset

The following questions will use the dataset found in Scholar: /class/datamine/data/flights/subset/YYYY.csv

An example of the data for the year 1987 can be found here.

Questions

Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs may lose points.

Question 1

Let’s say we have a theory that there are more flights on the weekend days (Friday, Saturday, Sunday) than the rest of the days, on average. We can use awk to quickly check it out and see if maybe this looks like something that is true!

Write a line of awk code that prints the total number of flights that occur on weekend days, followed by the total number of flights that occur on the weekdays. Complete this calculation for 2008 using the 2008.csv file.

Under the column DayOfWeek, Monday through Sunday are represented by 1-7, respectively.
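As a rough illustration of the counting pattern described above, here is a minimal sketch on a made-up file. The toy file, its values, and the position of the day-of-week code (column 2) are assumptions for illustration only; check the header of the real file with head to find where DayOfWeek actually sits.

```shell
# Toy data: a header plus six rows; the day-of-week code is assumed to be in column 2.
toy_days=$(mktemp)
printf '%s\n' \
  'FlightNum,DayOfWeek' \
  '101,1' \
  '102,5' \
  '103,6' \
  '104,7' \
  '105,3' \
  '106,5' > "$toy_days"

# Skip the header (NR > 1), bump one of two counters per row,
# and print both totals once all input has been read.
# Adding 0 makes a never-incremented counter print as 0 instead of an empty string.
counts=$(awk -F, 'NR > 1 { if ($2 >= 5) weekend++; else weekday++ }
                  END { print weekend + 0, weekday + 0 }' "$toy_days")
echo "$counts"   # prints "4 2" for the toy rows above
rm -f "$toy_days"
```

The END block runs only after the last line is read, which is why it is a natural place to print totals.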

Items to submit
  • Line of awk code that solves the problem.

  • The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008.

Question 2

Note that in (1), we are comparing 3 days to 4! Write a line of awk code that prints the average number of flights on a weekend day, followed by the average number of flights on the weekdays. Continue to use data for 2008.

You don’t need a large if statement to do this; you can use the ~ (regular expression match) operator instead.
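To show what the ~ operator looks like in practice, here is a small sketch on made-up rows. The column position (2) and the data are assumptions; the pattern /^[567]$/ matches a field that is exactly 5, 6, or 7.

```shell
# Five toy rows after the header; column 2 is assumed to hold the day-of-week code.
averages=$(printf '%s\n' \
  'FlightNum,DayOfWeek' \
  '101,1' \
  '102,5' \
  '103,6' \
  '104,7' \
  '105,3' |
  awk -F, 'NR > 1 { if ($2 ~ /^[567]$/) weekend++; else weekday++ }
           END { print weekend / 3, weekday / 4 }')
echo "$averages"   # prints "1 0.5": 3 weekend rows over 3 days, 2 weekday rows over 4 days
```

Anchoring the pattern with ^ and $ keeps it from matching longer values that merely contain one of those digits.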

Items to submit
  • Line of awk code that solves the problem.

  • The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008.

Question 3

We want to see if there may be some truth to the whole "snow bird" concept, where people travel to warmer states like Florida and Arizona during the winter. Let’s use the tools we’ve learned to explore this a little bit.

Take a look at airports.csv. In particular run the following:

head airports.csv

Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within awk. This is messy and we would prefer to create a new file called new_airports.csv without any quotes. Write a line of code to do this.

You may be wondering why we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It’s important to practice and learn ways to fix these things.

You could use gsub within awk to replace '"' with ''. You can find how to use gsub here.

If you leave out the third (target) argument to gsub, it applies the substitution to the entire line ($0), and therefore to every field on that line.
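A minimal sketch of gsub with no target argument, run on two made-up records written in the same quoted style. The records are invented for illustration and are not rows from airports.csv.

```shell
# gsub(/"/, "") with no third argument operates on the whole record ($0),
# so every double quote on the line is removed before the line is printed.
cleaned=$(printf '%s\n' '"AAA","Toy Field","AZ"' '"BBB","Sample Intl","FL"' |
  awk '{ gsub(/"/, ""); print }')
echo "$cleaned"
# AAA,Toy Field,AZ
# BBB,Sample Intl,FL
```

To write the result to a file rather than the terminal, redirect awk's output with > followed by the file name.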

cat new_airports.csv | wc -l
# should be 159 without header
Items to submit
  • Line of awk code used to create the new dataset.

Question 4

Write a line of commands that creates a new dataset called az_fl_airports.txt. az_fl_airports.txt should only contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (3), new_airports.csv, as a starting point.

How many airports are there? Did you expect this? Use a line of bash code to count this.

Create a new dataset called az_fl_flights.txt that contains all of the data for flights into or out of Florida and Arizona using the 2008.csv file. Use the newly created dataset, az_fl_airports.txt, to accomplish this.

cat az_fl_flights.txt | wc -l # should be 484705
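One common way to filter a large file using a list of codes stored in another file is awk's two-file idiom, sketched below on toy data. The file contents and the column positions (origin in column 2, destination in column 3) are assumptions for illustration; the real flights file has many more columns, so check its header first. This is only one possible approach, not necessarily the intended one.

```shell
# Toy lookup list (one code per line) and toy flights file; all values are made up.
codes=$(mktemp)
flights=$(mktemp)
printf '%s\n' PHX MIA > "$codes"
printf '%s\n' \
  'Year,Origin,Dest' \
  '2008,PHX,ORD' \
  '2008,ORD,IND' \
  '2008,IND,MIA' \
  '2008,MIA,PHX' > "$flights"

# While reading the first file, NR == FNR is true: store each code as an array key.
# For the second file, print only rows whose origin or destination is a stored key.
matched=$(awk -F, 'NR == FNR { keep[$1]; next }
                   ($2 in keep) || ($3 in keep)' "$codes" "$flights")
echo "$matched"
rm -f "$codes" "$flights"
```

Because the lookup is an exact match on a whole field, a code cannot accidentally match part of a longer value elsewhere on the line.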
Items to submit
  • All UNIX commands used to answer the questions.

  • The number of airports.

  • 1-2 sentences explaining whether you expected this number of airports.

Question 5

Write a bash script that accepts the year as an argument and performs the same operations as in question 4, returning the number of flights into and out of both AZ and FL for any given year.
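As a sketch of how a bash script can take its input from command-line arguments, the example below writes a tiny script to a temporary directory and runs it on a made-up file. The script name count_rows.sh, the DATA_DIR argument, and the toy 1999.csv file are all hypothetical; the script only counts data rows and does not do the filtering from question 4.

```shell
# Build a throwaway directory holding one toy "year" file (contents are made up).
workdir=$(mktemp -d)
printf '%s\n' 'Year,Origin,Dest' '1999,AAA,BBB' '1999,BBB,CCC' > "$workdir/1999.csv"

# Write a tiny script; the quoted 'EOF' keeps $1 and $2 literal until the script runs.
cat > "$workdir/count_rows.sh" <<'EOF'
#!/bin/bash
# Usage: count_rows.sh DATA_DIR YEAR
if [ "$#" -ne 2 ]; then
  echo "usage: $0 DATA_DIR YEAR" >&2
  exit 1
fi
data_dir=$1
year=$2
# Count data rows (everything after the header) in DATA_DIR/YEAR.csv.
awk 'NR > 1 { n++ } END { print n + 0 }' "$data_dir/$year.csv"
EOF

# Run the script with two arguments and capture what it prints.
rows=$(bash "$workdir/count_rows.sh" "$workdir" 1999)
echo "$rows"   # prints 2 for the toy file
rm -rf "$workdir"
```

Inside the script, $1 and $2 hold the first and second arguments, and $# holds how many were given, which makes a quick usage check possible before any work is done.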

Items to submit
  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.

  • The line of UNIX code you used to execute the script and create the new dataset.