TDM 20100: Project 2 — 2023

Motivation: The ability to navigate a shell, like bash, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful UNIX tools, help you navigate a filesystem, and even run UNIX tools from within your Jupyter Lab notebook.

Context: At this point in time, our Jupyter Lab system, using ondemand.anvil.rcac.purdue.edu, is new to some of you, and maybe familiar to others. The comfort with which you each navigate this UNIX-like operating system will vary. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within Jupyter Lab.

Scope: bash, Jupyter Lab

Learning Objectives
  • Distinguish differences in /home, /anvil/scratch, and /anvil/projects/tdm.

  • Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.

  • Analyzing files in a UNIX filesystem: wc, du, cat, head, tail, etc.

  • Creating and destroying files and folders in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.

  • Use man to read and learn about UNIX utilities.

  • Run bash commands from within Jupyter Lab.

Make sure to read about and use the template found here, and the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data

Questions

If you are not a bash user and you use an alternative shell like zsh or tcsh, you will want to switch to bash for the remainder of the semester, for consistency. Of course, if you plan on just using Jupyter Lab cells, the %%bash magic will use /bin/bash rather than your default shell, so you will not need to do anything.

While it is not super common for us to push a lot of external reading at you (other than the occasional blog post or article), this is an excellent and very short resource to get you started using a UNIX-like system. We strongly recommend reading chapters 1, 3, 4, 5, and 7. It is safe to skip chapters 2, 6, and 8.

Question 1 (1 pt)

  1. A list of length >=2 of modifications you made to your environment, in a markdown cell.

Let’s ease into this project by taking some time to adjust the environment you will be using the entire semester, to your liking. Begin by launching your Jupyter Lab session from ondemand.anvil.rcac.purdue.edu.

Open your settings by navigating to Settings > Advanced Settings Editor.

Explore the settings, and make at least 2 modifications to your environment, and list what you’ve changed.

Here are some settings Kevin likes:

  • Theme > Selected Theme > JupyterLab Dark

  • Document Manager > Autosave Interval > 30

  • File Browser > Show hidden files > true

  • Notebook > Line Wrap > on

  • Notebook > Show Line Numbers > true

  • Notebook > Shut down kernel > true

Dr. Ward does not like to customize his own environment, but he does use the Emacs key bindings. Jackson loves to customize his own environment, but he despises Emacs bindings. Feel free to choose whatever is most comfortable to you.

  • Settings > Text Editor Key Map > emacs

Only modify your keybindings if you know what you are doing, and like to use Emacs/Vi/etc.

Items to submit
  • List (using a markdown cell) of the modifications you made to your environment.

Question 2 (1 pt)

  1. In a markdown cell, what is the absolute path of your home directory in Jupyter Lab?

In the previous project’s question 3, we used a tool called awk to parse through a dataset. This was an example of running bash code using the seminar kernel. Aside from using the %%bash magic from the previous project, there are 2 other straightforward ways to run bash code from within Jupyter Lab.

The first method allows you to run a bash command from within the same cell as a cell containing Python code. For example, using ls can be done like so:

!ls

import pandas as pd
myDF = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
myDF.head()

This does not require you to have any Python code in the cell. The following is perfectly valid.

!ls
!ls -la /anvil/projects/tdm/

With that being said, using this method, each line must start with an exclamation point.

The second method is to open up a new terminal session. To do this, go to File > New > Terminal. This should open a new tab and a shell for you to use. You can make sure the shell is working by typing your first command, man.

# man is short for manual, to quit, press "q"
# use "k" or the up arrow to scroll up, or "j" or the down arrow to scroll down.
man man

Great! Now that you’ve learned 2 new ways to run bash code from within Jupyter Lab, please answer the following question:

What is the absolute path of the default directory of your bash shell? When we say "default directory" we mean the folder that you are "in" when you first run bash code in a Jupyter cell or when you first open a Terminal. This is also referred to as the home directory.

Relevant topics: pwd

Items to submit
  • bash code to print the full filepath of the default directory (home directory), and the output of running that code. Ex: Kevin’s is /home/x-kamstut and Dr. Ward’s is /home/x-mdw.

Question 3 (1 pt)

  1. bash to navigate to /anvil/projects/tdm/data

  2. bash to print the current working directory

  3. bash to list the files in the current working directory

  4. bash to list all of the files in /anvil/projects/tdm/data/movies_and_tv, including hidden files

  5. bash to return to your home directory

  6. bash to confirm that you are back in your home directory (print your current working directory)

It is a critical skill to be able to navigate a UNIX-like operating system, and you will very likely need to use UNIX or Linux (or something similar) at some point in your career. For this question, write bash code to perform the following tasks in order. In your final submission, please ensure that all of your steps and their outputs are included.

For the sake of consistency, please run your bash code using the %%bash magic. This ensures that we are all using the correct shell (there are many shells), and that your work is displayed properly for your grader.

  1. Navigate to the directory containing the datasets used in this course: /anvil/projects/tdm/data.

  2. Print the current working directory. Is the result what you expected?

  3. Output the $PWD variable, using the echo command.

  4. List the files within the current working directory (without listing the contents of its subdirectories).

  5. Without navigating out of /anvil/projects/tdm/data, list all of the files within the movies_and_tv directory, including hidden files.

  6. Return to your home directory.

  7. Write a command to confirm that you are back in your home directory.

/ is commonly referred to as the root directory in a UNIX-like system. Think of it as a folder that contains every other folder in the computer. /home is a folder within the root directory. /home/x-kamstut is the absolute path of Kevin’s home directory.
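
As a minimal sketch of the navigation commands this question relies on, the following works in a throwaway directory created with mktemp instead of the course data. The names demo_root, visible.txt, and .hidden.txt are made up for illustration; on Anvil you would substitute the real paths from the task list.

```shell
# Create a throwaway directory so the example does not depend on Anvil paths.
# demo_root and the file names below are hypothetical.
demo_root="$(mktemp -d)"
mkdir "$demo_root/data"
touch "$demo_root/data/visible.txt" "$demo_root/data/.hidden.txt"

# cd changes the current working directory; pwd prints it.
cd "$demo_root/data"
pwd

# The shell also keeps the current directory in the PWD variable.
echo "$PWD"

# ls lists visible entries; -a also shows names that start with a dot.
ls
ls -a

# ls accepts a path, so you can list another directory without leaving this one.
ls -a "$demo_root"
```

Note that the absolute path printed by pwd starts with /, the root directory, and that .hidden.txt only appears when -a is given.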

Relevant topics:

pwd, cd, ls

Items to submit
  • bash to navigate to /anvil/projects/tdm/data, and print the current working directory

  • bash to list the primary files in the current working directory

  • bash to list all of the files in /anvil/projects/tdm/data/movies_and_tv, including hidden files

  • bash to return to your home directory and confirm you are there.

Question 4 (1 pt)

  1. Write a single command to navigate to the modulefiles directory: /anvil/projects/tdm/opt/lmod, then confirm that you are in the correct directory using the echo command. (0.5 pts)

  2. Write a single command to navigate back to your home directory, using relative paths, then confirm that you are in the correct directory using the 'echo' command. (0.5 pts)

When running the ls command (specifically the ls command that showed hidden files and folders), you may have noticed two oddities that appeared in the output: . and .. (a single dot and a double dot). The single dot, ., represents the directory you are currently in, or, if it is part of a path, it means "this directory". For example, if you are in the /anvil/projects/tdm/data directory, the . refers to the /anvil/projects/tdm/data directory. If you are running the following bash command, the . is redundant and refers to the /anvil/projects/tdm/data/yelp directory.

ls -la /anvil/projects/tdm/data/yelp/.

.. represents the parent directory, relative to the rest of the path. For example, if you are in the /anvil/projects/tdm/data directory, the .. refers to the parent directory, /anvil/projects/tdm.

Any path that does not start from the root directory, /, is called a relative path, because it is interpreted relative to the directory you are currently in; paths that begin with . or .. are the most recognizable examples. Any path that spells out the entire location, starting from the root directory, /, is called an absolute path.
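
To make the behavior of . and .. concrete, here is a small sketch that builds a hypothetical directory tree (demo_root/projects/tdm/data, all invented names) and moves around it using only relative paths after the first cd.

```shell
# Build a small tree: demo_root/projects/tdm/data (all names are hypothetical).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/projects/tdm/data"

# Start from an absolute path.
cd "$demo_root/projects/tdm/data"
pwd

# "." is the current directory, so this does not move us anywhere.
cd .
pwd

# ".." is the parent directory: data -> tdm.
cd ..
pwd

# Relative paths can chain: up one level to projects, then back down into tdm/data.
cd ../tdm/data
pwd

# Two levels up from data is projects.
cd ../..
pwd
```

Each pwd shows where the previous cd landed, which is a handy habit while you are still getting used to relative paths.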

For this question, perform the following operations in order. Each operation should be a single command. In your final submission, please ensure that all of your steps and their outputs are included.

  1. Write a single command to navigate to our modulefiles directory: /anvil/projects/tdm/opt/lmod.

  2. Confirm that you are in the correct directory using the echo command.

  3. Write a single command to navigate back to your home directory. However, rather than using cd without a path argument, cd ~, or cd $HOME, use cd with a relative path.

  4. Confirm that you are in the correct directory using the echo command.

If you don’t fully understand the text above, please take the time to understand it. It will be incredibly helpful to you, not only in this class, but in your career. You can also come to seminar or visit TA office hours to get assistance. We love to talk to students, and everyone benefits when we all collaborate.

Relevant topics: pwd, cd, special-symbols

Items to submit
  • Single command to navigate to the modulefiles directory.

  • Single command to navigate back to your home directory using relative paths.

  • Commands confirming your navigation steps were successful.

Question 5 (1 pt)

  1. Navigate to your scratch directory using environment variables.

  2. Run tokei on your home directory (use an environment variable).

  3. Output the first 5 lines and last 5 lines of /anvil/datasets/training/anvil-101/batch-test/batch-test-README. Make sure it is clear which lines are the first 5 and which are the last 5.

  4. Output the number of lines in /anvil/datasets/training/anvil-101/batch-test/batch-test-README

  5. Output the size, in bytes, of /anvil/datasets/training/anvil-101/batch-test/batch-test-README

  6. Output the location of the tokei program we used earlier.

$SCRATCH and $USER are referred to as environment variables. You can see what they are by typing echo $SCRATCH and echo $USER. $SCRATCH contains the absolute path to your scratch directory, and $USER contains the username of the current user. We will learn more about these in the rest of this question.

Your $HOME directory is your default directory. You can navigate to your $HOME directory using any of the following commands.

cd
cd ~
cd $HOME
cd /home/$USER

This is typically where you will work, and where you will store your work (for instance, your completed projects).
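
The sketch below shows how the shell expands environment variables before a command runs. It assumes nothing about Anvil: $SCRATCH is only defined there, so the example prints a fallback if it is missing, and DEMO_DIR is a made-up variable that stands in for it.

```shell
# The shell replaces $NAME with the variable's value before running the command.
# ${NAME:-fallback} prints the fallback text if the variable is unset or empty.
echo "HOME:    ${HOME:-<not set>}"
echo "USER:    ${USER:-<not set>}"
echo "SCRATCH: ${SCRATCH:-<not set on this system>}"

# DEMO_DIR is a made-up variable used here in place of $SCRATCH.
DEMO_DIR="$(mktemp -d)"

# A variable that holds a path can be used anywhere a path is expected.
cd "$DEMO_DIR"
pwd
```

Quoting the variable ("$DEMO_DIR") keeps the command working even if the path contains spaces.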

The /anvil/projects/tdm space is a directory created for The Data Mine. It holds our datasets (in the data directory), as well as data for many of our corporate partners’ projects.

There exists 1 more important location on each cluster, scratch. Your scratch directory is located at /anvil/scratch/$USER, or, even shorter, $SCRATCH. scratch is meant for use with really large chunks of data. The quota on Anvil is currently 100TB and 1 million files. You can see your quota and usage on Anvil by running the following command.

myquota

Doug Crabill is one of the Data Mine’s extraordinarily wise computer wizards, and he has kindly collated a variety of useful scripts to be publicly available to students. These can be found in /anvil/projects/tdm/bin. Feel free to explore this directory and learn about these scripts in your free time.

One of the helpful scripts we have at our disposal is tokei, a code analysis tool. We can use this tool to quickly determine the language makeup of a project. An in-depth explanation of tokei can be found here, but for now, you can use it like so:

tokei /path/to/project

Sometimes, you may want to know what the first or last few lines of your file look like. head and tail can help us do that. Take a look at their documentation to learn more.
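
As a quick illustration of head and tail, the sketch below writes the numbers 1 through 20 to a temporary file (demo_file is a hypothetical name) and prints each end of it with a label, so it is clear which lines are which.

```shell
# Write the numbers 1 through 20, one per line, to a temporary file.
# demo_file is a hypothetical name used only for this example.
demo_file="$(mktemp)"
seq 1 20 > "$demo_file"

# -n sets how many lines to print; both commands default to 10.
echo "First 5 lines:"
head -n 5 "$demo_file"

echo "Last 5 lines:"
tail -n 5 "$demo_file"
```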

One goal of our programs is often to be size-efficient. If we have a very simple program, but it is enormous, it may not be worth our time to download and use. The wc tool can help us determine the size of our file. Take a look at its documentation for more information.

Be careful. We want the size of the script, not the disk usage.
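
The difference between file size and disk usage can be seen with a tiny file. This is only a sketch on a hypothetical temporary file; the exact number du reports depends on the filesystem.

```shell
# A 6-byte file: the five letters of "hello" plus a newline.
# demo_file is a hypothetical name used only for this example.
demo_file="$(mktemp)"
printf 'hello\n' > "$demo_file"

# wc -c counts the bytes stored in the file.
wc -c "$demo_file"

# du reports the disk space allocated to the file (in 1 KiB units with -k).
# Filesystems allocate space in blocks, so this typically does not match
# the byte count.
du -k "$demo_file"
```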

Finally, we often may know that a program exists, but we don’t know where it is. which can help us find the location of a program. Take a look at its documentation for more information, and use it to solve the last part of this question.
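
A brief sketch of which, alongside the shell's own type command, using ls and cd as stand-ins for the program you are looking for:

```shell
# which searches the directories listed in $PATH and prints the first match.
which ls

# type is built into the shell. It also recognizes shell builtins such as cd,
# which are part of the shell itself rather than a separate program on disk.
type ls
type cd
```

The path printed for ls will vary from system to system.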

Commands often have options. Options are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION section of the man pages.

man wc

You can see -m, -l, and -w are all options for wc. Then, to test the options out, you can try the following examples.

# using the default wc command. "/anvil/projects/tdm/data/flights/1987.csv" is the first "argument" given to the command.
wc /anvil/projects/tdm/data/flights/1987.csv

# to count the lines, use the -l option
wc -l /anvil/projects/tdm/data/flights/1987.csv

# to count the words, use the -w option
wc -w /anvil/projects/tdm/data/flights/1987.csv

# you can combine options as well
wc -w -l /anvil/projects/tdm/data/flights/1987.csv

# some people like to use a single "tack" `-`
wc -wl /anvil/projects/tdm/data/flights/1987.csv

# order doesn't matter
wc -lw /anvil/projects/tdm/data/flights/1987.csv
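
If you would like to check the options against counts you can verify by hand, the following sketch uses a small hypothetical file with two lines, five words, and 24 bytes.

```shell
# demo_file is a hypothetical name; the file holds 2 lines, 5 words, 24 bytes.
demo_file="$(mktemp)"
printf 'one two\nthree four five\n' > "$demo_file"

# With no options, wc prints lines, words, and bytes, followed by the file name.
wc "$demo_file"

# Each option selects one count.
wc -l "$demo_file"   # lines: 2
wc -w "$demo_file"   # words: 5
wc -c "$demo_file"   # bytes: 24

# Reading from standard input (with <) omits the file name from the output.
wc -l < "$demo_file"
```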

Relevant topics: pwd, cd, head, tail, wc, which, type

Items to submit
  • Navigate to your scratch directory, and run tokei on your home directory, using only environment variables.

  • Print out the first 5 lines and last 5 lines of the /anvil/datasets/training/anvil-101/batch-test/batch-test-README file.

  • Print out the number of lines in the /anvil/datasets/training/anvil-101/batch-test/batch-test-README file.

  • Print out the size in bytes of the /anvil/datasets/training/anvil-101/batch-test/batch-test-README file.

  • Print out the location of the tokei program we used earlier in this question.

Question 6 (2 pts)

  1. Navigate to your scratch directory.

  2. Copy the file /anvil/projects/tdm/data/movies_and_tv/imdb.db to your current working directory.

  3. Create a new directory called movies_and_tv in your current working directory.

  4. Move the file, imdb.db, from your scratch directory to the newly created movies_and_tv directory (inside of scratch).

  5. Use touch to create a new, empty file called im_empty.txt in your scratch directory.

  6. Remove the directory, movies_and_tv, from your scratch directory, including all of the contents.

  7. Remove the file, im_empty.txt, from your scratch directory.

Now that we know how to navigate a UNIX-like system, let’s learn how to create, move, and delete files and folders. For this question, perform the following operations in order. Each operation should be a single command. In your final submission, please ensure that all of your steps and their outputs are included.

First, let’s review the cp command. cp is short for copy, and it is used to copy files and folders. The syntax is as follows:

cp <source> <destination>

Next let’s take a look at the rm command. rm is short for remove, and it is used to remove files and folders. The syntax is as follows:

rm <file>
rm -r <directory>

Be very careful when using this command. If you use rm on a file or directory, you very likely will not be able to recover it. There is no "taking it out of the recycling bin". It is gone. Forever. If you are unsure, please ask for help.

Finally, let’s learn about touch and mkdir. touch is used to create new files, whereas mkdir creates new directories. The basic syntax for these is as follows:

touch <file>
mkdir <directory>

With that, you should have all of the knowledge you need to work on this question! Remember, each command has its own unique flags and syntax. When in doubt, use man to learn more about a command and its flags before using it haphazardly.
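
Here is a minimal sketch of these commands working together, along with mv and rmdir from the learning objectives. Everything happens inside a throwaway directory, and every file and directory name is hypothetical, so nothing of yours is at risk.

```shell
# Work inside a throwaway directory; every name below is hypothetical.
demo_root="$(mktemp -d)"
cd "$demo_root"

# touch creates an empty file; mkdir creates a directory.
touch notes.txt
mkdir archive

# cp leaves the original in place and writes a copy.
cp notes.txt notes_copy.txt

# mv moves (or renames) a file; here it moves the copy into archive/.
mv notes_copy.txt archive/

# Show the tree we have built so far.
ls -R

# rm removes a file; rm -r removes a directory and everything inside it.
rm notes.txt
rm -r archive

# The directory is empty again, so rmdir can remove it once we step out.
cd ..
rmdir "$demo_root"
```

rmdir only succeeds on an empty directory, which makes it a safer choice than rm -r when you expect nothing to be inside.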

Relevant topics: cp, mv, rm, touch, mkdir, cd

Question 7 (1 pt)

  1. Use terminal autocompletion to print the contents of hello_there.txt, and put the contents in a markdown cell in your notebook.

This question should be performed by opening a terminal window (File > New > Terminal). Enter the result/content in a markdown cell in your notebook.

Tab completion is a feature in shells that allows you to tab through options when providing an argument to a command. It is a really useful feature, that you may not know is there unless you are told!

Here is the way it works, in the most common case — using cd. Have a destination in mind, for example /anvil/projects/tdm/data/flights/. Type cd /anvil/, and press tab. You should be presented with a small list of options — the folders in the anvil directory. Type p, then press tab, and it will complete the word for you. Type t, then press tab. Finally, press tab, but this time, press tab repeatedly until you’ve selected data. You can then continue to type and press tab as needed.

Below is an image of the absolute path of a file in Anvil. Use cat and tab completion to print the contents of that file.

Figure 1. Tab completion

Items to submit
  • The contents of the file, hello_there.txt, in a markdown cell in your notebook.

Submitting your Work

Congratulations, you’ve finished Project 2! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions.

Items to submit
  • firstname-lastname-project02.ipynb.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output, when in fact it does not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or it does not render properly in gradescope. Please ask a TA if you need help with this.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.

In addition, please review our submission guidelines before submitting your project.

Here is the Zoom recording of the 4:30 PM discussion with students from 28 August 2023: