STAT 29000: Project 8 — Fall 2020
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc.
awk is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the last part in a series of projects that are designed to exercise skills around UNIX utilities, with a focus on writing bash scripts and awk. You will get the opportunity to manipulate data without leaving the terminal. At first it may seem overwhelming; however, with just a little practice you will be able to accomplish data wrangling tasks really efficiently.
Scope: awk, UNIX utilities, bash scripts
The following questions will use the dataset found in Scholar:
An example of the data for the year 1987 can be found here.
Please make sure to double check that your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.
Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like
Let’s say we have a theory that, on average, there are more flights on the weekend days (Friday, Saturday, Sunday) than on the rest of the days. We can use awk to quickly check whether this looks like it is true!
Write a line of awk code that prints the total number of flights that occur on weekend days, followed by the total number of flights that occur on the weekdays. Complete this calculation for 2008 using the
Under the column
awk code that solves the problem.
The result: the number of flights on the weekend days, followed by the number of flights on the weekdays for the flights during 2008.
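To illustrate the general pattern (not the full solution), here is a minimal sketch that counts rows in two groups with awk. It runs on a few made-up rows rather than the real data, and it assumes the day of the week sits in column 4 and is coded 1 (Monday) through 7 (Sunday). Check the header of your file before relying on either assumption.

```shell
# Toy data: a header row plus six made-up flights.
# ASSUMPTION: column 4 holds the day of the week, coded 1 = Monday ... 7 = Sunday.
sample='Year,Month,DayofMonth,DayOfWeek
2008,1,3,4
2008,1,4,5
2008,1,5,6
2008,1,6,7
2008,1,7,1
2008,1,8,2'

counts=$(printf '%s\n' "$sample" | awk -F, '
  NR > 1 {                     # skip the header row
    if ($4 >= 5) weekend++     # Friday (5), Saturday (6), Sunday (7)
    else         weekday++     # Monday (1) through Thursday (4)
  }
  END { print weekend + 0, weekday + 0 }   # "+ 0" prints 0 instead of an empty string
')
echo "$counts"
```

On the toy rows this prints the weekend count followed by the weekday count. To run the same program on a real file, pass the file name to awk in place of the `printf` pipe.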
Note that in (1), we are comparing 3 days to 4! Write a line of awk code that prints the average number of flights on a weekend day, followed by the average number of flights on the weekdays. Continue to use data for 2008.
You don’t need a large if statement to do this, you can use the
awk code that solves the problem.
The result: the average number of flights on the weekend days, followed by the average number of flights on the weekdays for the flights during 2008.
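One possible way to avoid a long if statement is a regular-expression match on the day column, sketched below on made-up rows. This is only one option and may not be the approach the hint above has in mind. The sketch again assumes the day of the week is in column 4 with codes 1 through 7, and it interprets "average" as the group total divided by the number of days of the week in the group (3 and 4).

```shell
# Toy data: a header row plus nine made-up flights.
# ASSUMPTION: column 4 holds the day of the week, coded 1 = Monday ... 7 = Sunday.
sample='Year,Month,DayofMonth,DayOfWeek
2008,1,3,4
2008,1,4,5
2008,1,5,6
2008,1,6,7
2008,1,7,1
2008,1,8,2
2008,1,9,3
2008,1,10,4
2008,1,11,5'

averages=$(printf '%s\n' "$sample" | awk -F, '
  NR == 1        { next }                 # skip the header row
  $4 ~ /^[567]$/ { weekend++; next }      # Friday, Saturday, Sunday
                 { weekday++ }            # every remaining row is a weekday
  END { printf "%.2f %.2f\n", weekend / 3, weekday / 4 }
')
echo "$averages"
```

Here 4 weekend rows are divided by 3 days and 5 weekday rows by 4 days, so the two numbers are directly comparable.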
We want to see whether there may be some truth to the whole "snowbird" concept, where people travel to warmer states like Florida and Arizona during the winter. Let’s use the tools we’ve learned to explore this a little bit.
Take a look at airports.csv. In particular, run the following:
Notice how all of the non-numeric text is surrounded by quotes. The surrounding quotes would need to be escaped for any comparison within awk. This is messy and we would prefer to create a new file called new_airports.csv without any quotes. Write a line of code to do this.
You may be wondering why we are asking you to do this. This sort of situation (where you need to deal with quotes) happens a lot! It’s important to practice and learn ways to fix these things.
You could use
If you leave out the column number argument to
awk code used to create the new dataset.
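As a rough illustration of quote stripping, the sketch below builds a tiny stand-in file and removes every double quote with awk's `gsub`. The file names ending in `_sample` and the rows themselves are made up for this example; only the idea carries over to the real file.

```shell
workdir=$(mktemp -d)

# Made-up rows that mimic a quoted CSV; these are not taken from the real file.
cat > "$workdir/airports_sample.csv" <<'EOF'
"iata","airport","city","state","country","lat","long"
"PHX","Toy Airport One","Phoenix","AZ","USA",33.43,-112.01
"MIA","Toy Airport Two","Miami","FL","USA",25.79,-80.29
EOF

# With only two arguments, gsub works on the whole record ($0),
# so every double quote on the line is replaced by nothing.
awk '{ gsub(/"/, ""); print }' "$workdir/airports_sample.csv" \
  > "$workdir/new_airports_sample.csv"

cat "$workdir/new_airports_sample.csv"
```

One caveat: if a quoted field contains a comma, removing the quotes changes how the line splits into columns, so it is worth spot-checking the output.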
Write a line of commands that creates a new dataset called az_fl_airports.txt. az_fl_airports.txt should only contain a list of airport codes for all airports from both Arizona (AZ) and Florida (FL). Use the file we created in (3), new_airports.csv, as a starting point.
How many airports are there? Did you expect this? Use a line of bash code to count this.
Create a new dataset called az_fl_flights.txt that contains all of the data for flights into or out of Florida and Arizona using the 2008.csv file. Use the newly created dataset, az_fl_airports.txt, to accomplish this.
All UNIX commands used to answer the questions.
The number of airports.
1-2 sentences explaining whether you expected this number of airports.
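The three steps above (filter airports by state, count them, then use the list to filter flights) can be sketched on toy files as follows. All rows are made up. The airport file is assumed to have the code in column 1 and the state in column 4, and the toy flight file has only three columns, with origin and destination in columns 2 and 3. In the real flight file those columns sit elsewhere, so read the header and change `o` and `d` accordingly.

```shell
workdir=$(mktemp -d)

# Made-up, already unquoted airport rows.
# ASSUMPTION: column 1 = airport code, column 4 = state.
cat > "$workdir/new_airports_sample.csv" <<'EOF'
iata,airport,city,state,country,lat,long
PHX,Toy Airport One,Phoenix,AZ,USA,33.43,-112.01
MIA,Toy Airport Two,Miami,FL,USA,25.79,-80.29
IND,Toy Airport Three,Indianapolis,IN,USA,39.72,-86.29
EOF

# Step 1: keep only the airport code when the state is AZ or FL.
awk -F, '$4 == "AZ" || $4 == "FL" { print $1 }' \
  "$workdir/new_airports_sample.csv" > "$workdir/az_fl_airports.txt"

# Step 2: count the airports (tr strips the padding some wc versions add).
n_airports=$(wc -l < "$workdir/az_fl_airports.txt" | tr -d ' ')
echo "$n_airports"

# Made-up flights; origin and destination are columns 2 and 3 in this toy file only.
cat > "$workdir/flights_sample.csv" <<'EOF'
Year,Origin,Dest
2008,PHX,IND
2008,IND,MIA
2008,IND,ORD
2008,MIA,PHX
EOF

# Step 3: read the code list first, then print flights whose origin or
# destination appears in that list.
awk -F, -v o=2 -v d=3 '
  NR == FNR { codes[$1]; next }      # first file: remember every airport code
  ($o in codes) || ($d in codes)     # second file: default action prints the row
' "$workdir/az_fl_airports.txt" "$workdir/flights_sample.csv" \
  > "$workdir/az_fl_flights.txt"

n_flights=$(wc -l < "$workdir/az_fl_flights.txt" | tr -d ' ')
echo "$n_flights"
```

Comparing specific columns against the list is stricter than a plain `grep -f` on the code file, which would also match a code that happens to appear in any other column.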
Write a bash script that accepts the year as an argument and performs the same operations as in question 4, returning the number of flights into and out of both AZ and FL for any given year.
The content of your bash script (starting with "#!/bin/bash") in a code chunk.
The line of UNIX code you used to execute the script and create the new dataset.
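For reference, here is a minimal sketch of a script that takes the year as its first argument. The script name `count_flights.sh` and the settings `DATA_DIR`, `ORIGIN_COL` and `DEST_COL` are hypothetical names introduced for this example, and the input files are toy stand-ins, so treat this as a pattern rather than a finished answer.

```shell
workdir=$(mktemp -d)

# Toy inputs standing in for the real airport list and yearly flight file.
printf '%s\n' PHX MIA > "$workdir/az_fl_airports.txt"
cat > "$workdir/2008.csv" <<'EOF'
Year,Origin,Dest
2008,PHX,IND
2008,IND,ORD
2008,MIA,PHX
EOF

# Write the (hypothetical) script to a file.
cat > "$workdir/count_flights.sh" <<'EOF'
#!/bin/bash
# Usage: count_flights.sh YEAR
# DATA_DIR, ORIGIN_COL and DEST_COL are hypothetical settings for this sketch.
if [ "$#" -ne 1 ]; then
  echo "usage: $0 YEAR" >&2
  exit 1
fi
year="$1"                      # first positional argument
data_dir="${DATA_DIR:-.}"      # where the yearly files live (default: current directory)

awk -F, -v o="${ORIGIN_COL:-2}" -v d="${DEST_COL:-3}" '
  NR == FNR { codes[$1]; next }              # first file: airport codes
  ($o in codes) || ($d in codes) { n++ }     # second file: count matching flights
  END { print n + 0 }
' "$data_dir/az_fl_airports.txt" "$data_dir/$year.csv"
EOF

# Run the script for one year and capture the count.
n_flights=$(DATA_DIR="$workdir" bash "$workdir/count_flights.sh" 2008)
echo "$n_flights"
```

The script can also be made executable with `chmod +x` and run directly. Checking `$#` up front gives a clear usage message when the year is missing instead of a confusing awk error.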