TDM 20100: Project 2 — Working with the bash shell

Motivation: In the previous project we became (re-)familiarized with working on Anvil, before diving straight into using the bash shell (the command line interface). By learning to create, destroy, and move files and directories, along with some basic commands to begin to analyze files, we will be well on our way to performing some primitive forms of data analysis, using nothing but the terminal!

Context: The ability to use bash shell commands such as cat, cd, du, ls, mv, pwd, rm, sort, uniq, wc, to get familiar with the bash shell, and get a basic understanding of the man (manual) pages, will enable you to see some of the power and speed of using the bash shell.

Scope: Anvil, Jupyter Labs, CLI, Bash, GNU, filesystem manipulation

Learning Objectives:
  • Learn how to create and destroy files from the CLI

  • Learn how to create and destroy directories from the CLI

  • Learn how to move files and directories around the filesystem

  • Learn about basic file analysis commands

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/flights/subset/ (airplane data)

  • /anvil/projects/tdm/data/election (election data)

  • /anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv (grocery store data)

Questions

Question 1 (2 pts)

In your $HOME directory, you can store only 25GB of data, but in your $SCRATCH directory, you can store up to 200TB of data.

Your $SCRATCH directory is not intended for long-term storage, and it can be erased by the system administrators at regular points in time. Nonetheless, it can be very helpful for working on data sets that do not need to be stored for a long time.

Your project templates (and all of your Jupyter Lab files) should be stored in your $HOME directory, but it is OK to put temporary data files into your $SCRATCH directory.

Make a new file called myflights.csv in the $SCRATCH directory that has only the first line of the 1987.csv file.

Now take all of the csv data files called 1987.csv through 2008.csv from the /anvil/projects/tdm/data/flights/subset/ directory and add their rows of data, one at a time, to the myflights.csv file. Be sure to not add the headers of these files. To accomplish this, use the grep command with the -h and -v options. (The -h option is used to hide the name of the file in the results, and the -v option is used to avoid any lines of the files that have the word "Year".) To append data to the end of a file, use ">>".

mycommand myfile1.txt >myfile2.txt will run mycommand on myfile1.txt and will save the results as myfile2.txt, destroying whatever was previously in myfile2.txt.

In contrast, mycommand myfile1.txt >>myfile2.txt will run mycommand on myfile1.txt and will append the results to the end of myfile2.txt, without destroying whatever was previously in myfile2.txt.

Now check that the resulting file has the correct number of lines.

The original files 1987.csv through 2008.csv have a total of 118914480 lines.

The file myflights.csv has all of these lines, except for the 22 header lines from the 22 respective files, plus it has the header from the 1987.csv file. So it should have a total of 118914480 - 22 + 1 = 118914459 lines.

Note: wc, which stands for word count, is actually capable of much more than simply counting the words in a file! Take a look at some of the below examples, along with this man page, for some ideas about the power of wc. The wc command gives the number of lines, bytes, or characters within a file.

%%bash

# prints line count, then word count, then byte count for `2012.csv`
wc /anvil/projects/tdm/data/stackoverflow/processed/2012.csv

# prints just the line count for `2012.csv`
wc -l /anvil/projects/tdm/data/stackoverflow/processed/2012.csv

# prints just the word count for `2012.csv`
wc -w /anvil/projects/tdm/data/stackoverflow/processed/2012.csv

# prints just the byte count for `2012.csv`
wc -c /anvil/projects/tdm/data/stackoverflow/processed/2012.csv

Another note: The du command (which stands for disk usage) measures the total disc space occupied by files and directories. Again, review the man page for du and the below examples, and then move onto the tasks for the final set of tasks for this project.

%%bash

# print the number of bytes that all of the processed directory is taking up
du -b /anvil/projects/tdm/data/stackoverflow/processed

# prints the number of kilobytes that the processed directory is taking up
du --block-size=KB /anvil/projects/tdm/data/stackoverflow/processed

# prints the number of kilobytes that each file in the processed directory is taking up
du --block-size=KB -a /anvil/projects/tdm/data/stackoverflow/processed
Deliverables
  • Show the output from running wc $SCRATCH myflights.csv (which will demonstrate that you produced a file with 118914459 lines).

  • Show the head of the file, namely: head $SCRATCH myflights.csv (which should have the header and the data about 9 flights from 1987).

  • As always, be sure to document your work from Question 1 (and from all of the questions!), using some comments and insights about your work. We will stop adding this note to document your work, but please remember, we always assume that you will document every single question with your comments and your insights.

Question 2 (2 pts)

Sometimes we want to copy files directly. Let’s create a new directory in our $SCRATCH folder and copy all of those files with flight data (1987.csv through 2008.csv) into that directory. Call the directory myfolder. Inside that folder, after those files are copied, build another file (like in Question 1) called myflightsremix.csv. Finally, compare these two files, using cmp $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv (if the files are exactly the same, there should be no output because the files have no differences). Also compare them by running: ls -la $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv which should demonstrate that they are the same size. Check wc $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv to ensure that they have the same number of lines, words, and bytes.

Now go back to the scratch directory and remove this folder and its contents, using: cd $SCRATCH and then rm -r $SCRATCH/myfolder

Deliverables
  • Show the output of: cmp $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv (which should be empty output, i.e., it should not do anything, because these files should have no differences)

  • Show the output of: ls -la $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv (which should demonstrate that they are the same size)

  • Show the output of: wc $SCRATCH/myflights.csv $SCRATCH/myfolder/myflightsremix.csv (to ensure that they have the same number of lines, words, and bytes)

  • Then throw away the folder $SCRATCH/myfolder and finally show ls -la $SCRATCH to demonstrate that the folder $SCRATCH/myfolder is gone!

Question 3 (2 pts)

Copy the files itcont1980.txt through itcont2024.txt from the directory /anvil/projects/tdm/data/election into your $SCRATCH directory. Then create a new directory called mytemporarydirectory in your $SCRATCH directory and move all of these election files into that new directory. Finally, put the content from all of these election files into a new file called myelectiondata.txt. Check the size of this new file using the wc command. When you are finished, it is OK to remove the directory myelectiondata from the $SCRATCH directory.

Deliverables
  • Show the output of: wc mytemporarydirectory/myelectiondata.txt (which should show that the file has 229169299 lines and 1385963208 words and 42790681570 bytes).

Question 4 (2 pts)

Extract the Origin and Destination columns from all of the files 1987.csv to 2008.csv in the directory /anvil/projects/tdm/data/flights/subset. Save these origins and destinations into a file called $SCRATCH/myoriginsanddestinations.txt

Then sort this data and save the results to: $SCRATCH/mysortedoriginsanddestinations.txt

Then use the uniq -c command to get the counts corresponding to the number of times that each flight path occurred: $SCRATCH/mycounts.txt Note: you need to sort the file before using uniq -c

Now sort the file again, this time in numerical order, using sort -n and save the results to $SCRATCH/mysortedcounts.txt

Finally display the tail of the file, which contains the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths.

Deliverables
  • Show the 10 most popular flight paths from the years 1987 to 2008 and the number of times that airplanes flew on each of these flight paths.

Question 5 (2 pts)

Use the cut command with the flags -d, -f7 to extract the STORE_R values from this file:

/anvil/projects/tdm/data/8451/The_Complete_Journey_2_Master/5000_transactions.csv

Then use the techniques that you learned in Question 4, to discover how many times that each of the STORE_R values appear in the file.

Deliverables
  • List the number of times that each of the STORE_R values appear in the file.

Submitting your Work

Congratulations! With this project complete, you’re now familiar with many of the basic uses of the command line! With these tools in your belt, you can now explore, analyze, and manipulate a large part of Anvil at your whims! Please don’t use your newfound powers for evil!

In the next project, we’ll be building on these more primal analysis tools by introducing some more complex commands that allow us to perform specific search-and-return processes on data. From there, the sky is the limit, and we will be ready to dive into one of the most useful and important concepts in all of code: pipelines. More to come!

Items to submit
  • firstname-lastname-project2.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.