TDM 20100: Project 3 — Pipelines

Motivation: In the previous project, at each stage in our analysis, we saved the output to a file. A more efficient method is to take the output from one command and use it as the input to the next command. This is called a pipeline of bash commands.

Context: Once we learn how to write bash commands in a pipeline, we can (more easily) use several bash commands in tandem.

Scope: Pipelines in Bash

Learning Objectives:
  • Learn about the concept of bash pipelines, to use several bash commands in sequence.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/flights/subset/ (airplane data)

  • /anvil/projects/tdm/data/election (election data)

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv (grocery store data)

Questions

Question 1 (2 pts)

In the previous project, on this collection of files:

/anvil/projects/tdm/data/flights/subset/[12]*.csv

we ran the following commands in bash:

`cat` (to print the files; we did this in Project 2, Question 1)
`cut -d, -f17,18` (to extract the 17th and 18th fields, for the Origin and Destination columns)
`sort` (to get all of the same flight paths next to each other in the file)
`uniq -c` (to discover how many times that each flight path occurs)
`sort -n` (to numerically sort the number of times that the flight paths occur)
`tail` (to get the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths)

Now we can do all of this together, in one long line:

cat /anvil/projects/tdm/data/flights/subset/[12]*.csv | cut -d, -f17,18 | sort | uniq -c | sort -n | tail

(To simplify things, we are not removing the head of each file.)

Please use 3 or 4 cores when working on this question.

Deliverables
  • Show the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths.

Question 2 (2 pts)

In the previous project, from this file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

we discovered how many times that each of the STORE_R values appear in the file, using the following commands in bash:

`cut -d, -f7` (to extract the `STORE_R` values from this file)
`sort` (to get all of the same `STORE_R` values next to each other in the file)
`uniq -c` (to discover how many times that each `STORE_R` value occurs)
`sort -n` (to numerically sort the number of times that the `STORE_R` values occur)

Now we can do all of this together, in one long line:

cut -d, -f7 /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv | sort | uniq -c | sort -n
Deliverables
  • List the number of times that each of the STORE_R values appear in the file.

Question 3 (2 pts)

Using a pipeline to discover the 10 states in which the largest number of donations have been made (and the number of donations from each of these states), using the data stored in:

/anvil/projects/tdm/data/election/itcont*.txt

The data can be extracted from the 10th field of the files. The symbol "|" is the delimiter. So the cut command should look like: cut -d'|' -f10

Deliverables
  • The 10 states in which the largest number of donations have been made (and the number of donations from each of these states)

Question 4 (2 pts)

Modify your solution to Question 3 so that you extract both the city and the state (simultaneously) for each donation. In this way, you can discover the 10 city-and-state pairs in which the largest number of donations have been made (and the number of donations from each of these city-and-state pairs).

Deliverables
  • The 10 city-and-state pairs in which the largest number of donations have been made (and the number of donations from each of these city-and-state pairs).

Question 5 (2 pts)

Return to the analysis of the airline data. Modify your solution to Question 1 so that, instead of extracting the Origin and Destination airports, this time you can extract three columns: the year, month, and day of the flights. In this way, you can discover the 10 most popular days to fly from 1987 to 2008, i.e., the 10 dates on which the most flights occurred (and the number of flights on each of those 10 dates).

Deliverables
  • The 10 dates on which the most flights occurred (and the number of flights on each of those 10 dates).

Submitting your Work

Congratulations, with your understanding of pipelines, you are ready to leverage the strength of many bash commands in a sequence! Please feel encouraged to refer back to previous projects and ask questions (anytime) about how you can use bash for powerful and easy data analysis!

Items to submit
  • firstname-lastname-project3.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.