STAT 29000: Project 7 — Fall 2020

Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential isues with Firefox, etc. awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.

Context: This is the first part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming, however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.

Scope: awk, UNIX utilities, bash scripts

Learning objectives
  • Use awk to process and manipulate textual data.

  • Use piping and redirection within the terminal to pass around data between utilities.

Dataset

The following questions will use the dataset found in Scholar: /class/datamine/data/flights/subset/YYYY.csv

An example of the data for the year 1987 can be found here.

Sometimes if you are about to dig into a dataset, it is good to quickly do some sanity checks early on to make sure the data is what you expect it to be.

Questions

Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to lose points.

Question 1

Write a line of code that prints a list of the unique values in the DayOfWeek column. Write a line of code that prints a list of the unique values in the DayOfMonth column. Write a line of code that prints a list of the unique values in the Month column. Use the 1987.csv dataset. Are the results what you expected?

Items to submit
  • 3 lines of code used to get a list of unique values for the chosen columns.

  • 1-2 sentences explaining whether or not the results are what you expected.

Question 2

Our files should have 29 columns. For a given file, write a line of code that prints any lines that do not have 29 columns. Test it on 1987.csv, were there any rows without 29 columns?

See [here](#unix-awk-built-in-variables). NF looks like it may be useful!

Items to submit
  • Line of code used to solve the problem.

  • 1-2 sentences explaining whether or not there were any rows without 29 columns.

Question 3

Write a bash script that, given a "begin" year and "end" year, cycles through the associated files and prints any lines that do not have 29 columns.

Items to submit
  • The content of your bash script (starting with "#!/bin/bash") in a code chunk.

  • The results of running your bash scripts from year 1987 to 2008.

Question 4

awk is a really good tool to quickly get some data and manipulate it a little bit. The column Distance contains the distances of the flights in miles. Use awk to calculate the total distance traveled by the flights in 1990, and show the results in both miles and kilometers. To convert from miles to kilometers, simply multiply by 1.609344.

The following is an output example:

Miles: 12345
Kilometers: 19867.35168
Items to submit
  • The code used to solve the problem.

  • The results of running the code.

Question 5

Use awk to calculate the sum of the number of DepDelay minutes, grouped according to DayOfWeek. Use 2007.csv.

The following is an output example:

DayOfWeek:  0
1:  1234567
2:  1234567
3:  1234567
4:  1234567
5:  1234567
6:  1234567
7:  1234567

1 is Monday.

Items to submit
  • The code used to solve the problem.

  • The output from running the code.

Question 6

It wouldn’t be fair to compare the total DepDelay minutes by DayOfWeek as the number of flights may vary. One way to take this into account is to instead calculate an average. Modify (5) to calculate the average number of DepDelay minutes by the number of flights per DayOfWeek. Use 2007.csv.

The following is an output example:

DayOfWeek:  0
1:  1.234567
2:  1.234567
3:  1.234567
4:  1.234567
5:  1.234567
6:  1.234567
7:  1.234567
Items to submit
  • The code used to solve the problem.

  • The output from running the code.