TDM 20100: Project 5 — More practice with pipelines

Motivation: We continue to practice how to use pipelines.

Context: Pipelines enable us to work on data across many files.

Scope: Pipelines and pattern matching in Bash

Learning Objectives:
  • Learn about using pattern matching in bash pipelines.

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/flights/subset/ (airplane data)

  • /anvil/projects/tdm/data/election (election data)

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv (grocery store data)

  • /anvil/projects/tdm/data/beer/breweries.csv (breweries data)

  • /anvil/projects/tdm/data/icecream/combined/reviews.csv (ice cream review data)

Questions

Question 1 (2 pts)

For this question, use the flight files found here:

/anvil/projects/tdm/data/flights/subset/[12]*.csv

  1. Make a list of all of the values that appear in column 9, namely, the UniqueCarrier values. (These are abbreviations for the airlines.) For each such airline, list the number of flights on each airline. The list is short enough that you can display the full list of airlines and the number of flights on each airline.

  2. Make a list of all of the TailNum values that appear in column 11 BUT do not print this list! Just print the number of (distinct) TailNum values.

You can just use uniq instead of uniq -c for part b, since you do not need to keep track of the counts.

There are more than 10,000 TailNum values, so please do not print the full list of TailNum values. You only need to print the total number of (unique) TailNum values that occur.

Note: If you ever looked at an airplane (or a picture of an airplane), you might have noticed that there is one TailNum painted onto the tail of each airplane, which uniquely identifies that airplane. So, in part b, you are printing the number of airplanes that have flown in the United States!

Deliverables
  • a. Print a list of all of all UniqueCarrier values and how many times that each UniqueCarrier appears.

  • b. Print the number of (distinct) TailNum values that occur (do not print the list itself; there are more than 10,000 values!)

Question 2 (2 pts)

In the grocery store data:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

how many (distinct) PRODUCT_NUM values appear?

You only need to print the number of distinct PRODUCT_NUM values. Please do NOT print the entire list. There are more than 100,000 such values!

Deliverables
  • Print the number of (distinct) PRODUCT_NUM values that appear. (Do not print the list itself!)

Question 3 (2 pts)

Consider the election files:

/anvil/projects/tdm/data/election/itcont*.txt

There are more than 200 million donations altogether (i.e., 200 million lines of data; one donation per line). But the number of committee IDs (which are in column 1) is much smaller.

  1. How many (distinct) committee IDs appear altogether?

  2. Which committee ID received the largest number of donations? How many donations did this committee ID receive? (Do not worry about the monetary amounts. Just keep track of the number of donations. There is one donation per line.)

Deliverables
  • a. The number of (distinct) committee IDs that appear.

  • b. The committee ID that received the largest number of donations, and how many donations that this committee ID received.

Question 4 (2 pts)

In the breweries data in this file:

/anvil/projects/tdm/data/beer/breweries.csv

use the grep command to print all 22 lines of data (everything on each line) corresponding to Lafayette or West Lafayette (Indiana).

Deliverables
  • The full contents of the lines of data corresponding to Lafayette or West Lafayette (Indiana). (Please print the whole line of data each time, so you are printing 22 lines of data!)

Question 5 (2 pts)

In this ice cream reviews file:

/anvil/projects/tdm/data/icecream/combined/reviews.csv

change each space into a newline, for instance, using any of the methods here:

It might be easiest, for instance, to embed:

tr ' ' '\n'

into your pipeline, which turns each space into a newline.

Then sort the resulting lines, which will now have only 1 word per line, and find the 25 words that occur most often in the file, along with the number of occurrences of each such word. Hint: The 5 most popular words are:

  18688 ice
  19480 a
  26848 and
  29090 I
  34937 the
Deliverables
  • The 25 words that occur most often in the file, along with the number of occurrences of each such word.

Submitting your Work

You are now very familiar with bash pipelines! BUT you can still ask us questions anytime, if you need advice or help!

Items to submit
  • firstname-lastname-project5.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.