STAT 29000: Project 2 — Fall 2021

Motivation: The ability to navigate a shell, like bash, and use some of its powerful tools, is very useful. The number of disciplines utilizing data in new ways is ever-growing, and as such, it is very likely that many of you will eventually encounter a scenario where knowing your way around a terminal will be useful. We want to expose you to some of the most useful UNIX tools, help you navigate a filesystem, and even run UNIX tools from within your Jupyter Lab notebook.

Context: At this point in time, our new Jupyter Lab system, using gateway.scholar.rcac.purdue.edu and gateway.brown.rcac.purdue.edu, is very new to everyone. The comfort with which you each navigate this UNIX-like operating system will vary. In this project we will learn how to use the terminal to navigate a UNIX-like system, experiment with various useful commands, and learn how to execute bash commands from within Jupyter Lab.

Scope: bash, Jupyter Lab

Learning Objectives
  • Distinguish differences in /home, /scratch, /class, and /depot.

  • Navigating UNIX via a terminal: ls, pwd, cd, ., .., ~, etc.

  • Analyzing file in a UNIX filesystem: wc, du, cat, head, tail, etc.

  • Creating and destroying files and folder in UNIX: scp, rm, touch, cp, mv, mkdir, rmdir, etc.

  • Use man to read and learn about UNIX utilities.

  • Run bash commands from within Jupyter Lab.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

/depot/datamine/data/

Questions

If you are not a bash user and you use an alternative shell like zsh or tcsh, you will want to switch to bash for the remainder of the semester, for consistency. Of course, if you plan on just using Jupyter Lab cells, the %%bash magic will use /bin/bash rather than your default shell, so you will not need to do anything.

While it is not super common for us to push a lot of external reading at you (other than the occasional blog post or article), this is an excellent, and very short resource to get you started using a UNIX-like system. We strongly recommend readings chapters: 1, 3, 4, 5, & 7. It is safe to skip chapters 2, 6, and 8.

Question 1

Let’s ease into this project by taking some time to adjust the environment you will be using the entire semester, to your liking. Begin by launching your Jupyter Lab session from either gateway.scholar.rcac.purdue.edu or gateway.brown.rcac.purdue.edu.

Explore the settings, and make at least 2 modifications to your environment, and list what you’ve changed.

Here are some settings Kevin likes:

  • Settings  JupyterLab Theme  JupyterLab Dark

  • Settings  Text Editor Theme  material

  • Settings  Text Editor Key Map  vim

  • Settings  Terminal Theme  Dark

  • Settings  Advanced Settings Editor  Notebook  codeCellConfig  lineNumbers  true

  • Settings  Advanced Settings Editor  Notebook  kernelShutdown  true

  • Settings  Advanced Settings Editor  Notebook  codeCellConfig  fontSize  16

Dr. Ward does not like to customize his own environment, but he does use the Emacs key bindings.

  • Settings  Text Editor Key Map  emacs

Only modify your keybindings if you know what you are doing, and like to use Emacs/Vi/etc.

Items to submit
  • List (using a markdown cell) of the modifications you made to your environment.

Question 2

In the previous project, we used the ls command to list the contents of a directory as an example of running bash code using the f2021-s2022 kernel. Aside from use the %%bash magic from the previous project, there are 2 more straightforward ways to run bash code from within Jupyter Lab.

The first method allows you to run a bash command from within the same cell as a cell containing Python code. For example.

!ls

import pandas as pd
myDF = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
myDF.head()

The second, is to open up a new terminal session. To do this, go to File  New  Terminal. This should open a new tab and a shell for you to use. You can make sure the shell is working by typing your first command, man.

# man is short for manual
# use "k" or the up arrow to scroll up, or "j" or the down arrow to scroll down.
man man

What is the absolute path of the default directory of your bash shell?

Relevant topics: pwd

Items to submit
  • The full filepath of the default directory (home directory). Ex: Kevin’s is: /home/kamstut.

  • The bash code used to show your home directory or current directory (also known as the working directory) when the bash shell is first launched.

Question 3

It is critical to be able to navigate a UNIX-like operating system. It is more likely than not that you will need to use a UNIX-like system at some point in your career. Perform the following actions, in order, using the bash shell.

I would recommend using a code cell with the magic %%bash to make sure that you are using the correct shell, and so your work is automatically saved.

  1. Write a command to navigate to the directory containing the datasets used in this course: /depot/datamine/data/.

  2. Print the current working directory, is the result what you expected? Output the $PWD variable, using the echo command.

  3. List the files within the current working directory (excluding subfiles).

  4. Without navigating out of /depot/datamine/data/, list all of the files within the the movies_and_tv directory, including hidden files.

  5. Return to your home directory.

  6. Write a command to confirm that you are back in the appropriate directory.

/ is commonly referred to as the root directory in a UNIX-like system. Think of it as a folder that contains every other folder in the computer. /home is a folder within the root directory. /home/kamstut is the absolute path of Kevin’s home directory. There is a folder called home inside the root / directory. Inside home is another folder named kamstut, which is Kevin’s home directory.

Relevant topics: pwd, cd, echo, ls

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

When running the ls command, you may have noticed two oddities that appeared in the output: "." and "..". . represents the directory you are currently in, or, if it is a part of a path, it means "this directory". For example, if you are in the /depot/datamine/data directory, the . refers to the /depot/datamine/data directory. If you are running the following bash command, the . is redundant and refers to the /depot/datamine/data/yelp directory.

ls -la /depot/datamine/data/yelp/.

.. represents the parent directory, relative to the rest of the path. For example, if you are in the /depot/datamine/data directory, the .. refers to the parent directory, /depot/datamine.

Any path that contains either . or .. is called a relative path. Any path that contains the entire path, starting from the root directory, /, is called an absolute path.

  1. Write a single command to navigate to our modulefiles directory: /depot/datamine/opt/modulefiles

  2. Write a single command to navigate back to your home directory, however, rather than using cd, cd ~, or cd $HOME without the path argument, use cd and a relative path.

Relevant topics: pwd, cd, . & .. & ~

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Your $HOME directory is your default directory. You can navigate to your $HOME directory using any of the following commands.

cd
cd ~
cd $HOME
cd /home/$USER

This is typically where you will work, and where you will store your work (for instance, your completed projects). At the time of writing this, the $HOME directories on Brown and Scholar are not synced. What this means is, files you create on one cluster will not be available on the other cluster. To move files between clusters, you will need to copy them using scp or rsync.

$HOME and $USER are environment variables. You can see what they are by typing echo $HOME and echo $USER. Environment variables are variables that are set by the system, or by the user. To get a list of your terminal session’s environment variables, type env.

The depot space is a network file system (as is the home space, albeit on a different system). It is attached to the root directory on all of the nodes in the cluster. One convenience that this provides is files in this space exist everywhere the filesystem is mounted! In summary, files added anywhere in /depot/datamine will be available on both Scholar and Brown. Although you will not utilize this space very often (other than to access project datasets), this is good information to know.

There exists 1 more important location on each cluster, scratch. Your scratch directory is located in the same place on either cluster: /scratch/$RCAC_CLUSTER/$USER. scratch is meant for use with really large chunks of data. The quota on Brown is 200TB and 2 million files. The quota on Scholar is 1TB and 2 million files. You can see your quota and usage on each system by running the following command.

myquota

$RCAC_CLUSTER and $USER are environment variables. You can see what they are by typing echo $RCAC_CLUSTER and echo $USER. $RCAC_CLUSTER contains the name of the cluster (for this course, "scholar" or "brown"), and $USER contains the username of the current user.

  1. Navigate to your scratch directory.

  2. Confirm you are in the correct location using a command.

  3. Execute the tokei command, with input ~dgc/bin.

    Doug Crabill is a the compute wizard for the Statistics department here at Purdue. ~dgc/bin is a directory he has made publicly available with a variety of useful scripts.

  4. Output the first 5 lines and last 5 lines of ~dgc/bin/union.

  5. Count the number of lines in the bash script ~dgc/bin/union (using a UNIX command).

  6. How many bytes is the script?

    Be careful. We want the size of the script, not the disk usage.

  7. Find the location of the tokei command.

When you type myquota on Scholar or Brown there are sometimes warnings about xauth. If you get a warning that says something like the following warning, you can safely ignore it.

Warning: untrusted X11 forwarding setup failed: xauth key data not generated

— Scholar/Brown

Commands often have options. Options are features of the program that you can trigger specifically. You can see the options of a command in the DESCRIPTION section of the man pages.

man wc

You can see -m, -l, and -w are all options for wc. Then, to test the options out, you can try the following examples.

# using the default wc command. "/depot/datamine/data/flights/1987.csv" is the first "argument" given to the command.
wc /depot/datamine/data/flights/1987.csv

# to count the lines, use the -l option
wc -l /depot/datamine/data/flights/1987.csv

# to count the words, use the -w option
wc -w /depot/datamine/data/flights/1987.csv

# you can combine options as well
wc -w -l /depot/datamine/data/flights/1987.csv

# some people like to use a single tack `-`
wc -wl /depot/datamine/data/flights/1987.csv

# order doesn't matter
wc -lw /depot/datamine/data/flights/1987.csv

Relevant topics: pwd, cd, head, tail, wc, du, which, type

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 6

Perform the following operations.

  1. Navigate to your scratch directory.

  2. Copy the following file to your current working directory: /depot/datamine/data/movies_and_tv/imdb.db.

  3. Create a new directory called movies_and_tv in your current working directory.

  4. Move the file, imdb.db, from your scratch directory to the newly created movies_and_tv directory (inside of scratch).

  5. Use touch to create a new, empty file called im_empty.txt in your scratch directory.

  6. Remove the directory, movies_and_tv, from your scratch directory, including all of the contents.

  7. Remove the file, im_empty.txt, from your scratch directory.

Relevant topics: cp, rm, touch, cd

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 7

This question should be performed by opening a terminal window. File  New  Terminal. Enter the result/content in a markdown cell in your notebook.

Tab completion is a feature in shells that allows you to tab through options when providing an argument to a command. It is a really useful feature, that you may not know is there unless you are told!

Here is the way it works, in the most common case — using cd. Have a destination in mind, for example /depot/datamine/data/flights/. Type cd /depot/d, and press tab. You should be presented with a large list of options starting with d. Type a, then press tab, and you will be presented with an even smaller list. This time, press tab repeatedly until you’ve selected datamine. You can then continue to type and press tab as needed.

Below is an image of the absolute path of a file in the Data Depot. Use cat and tab completion to print the contents of that file.

Tab completion
Figure 1. Tab completion
Items to submit
  • The content of the file, hello_there.txt, in a markdown cell in your notebook.

For this question, you will most likely want to launch a terminal. To launch a terminal click on File  New  Terminal. No need to input this question in your notebook.

  1. Use vim, emacs, or nano to create a new file in your scratch directory called im_still_here.sh. Add the following contents to the file, save, and close it.

    #!/bin/bash
    
    i=0
    
    while true
    do
        echo "I'm still here! Count: $i"
        sleep 1
        ((i+=1))
    done
  2. Confirm the contents of the file using cat.

  3. Try and run the program by typing im_still_here.sh.

    As you can see, simply typing im_still_here.sh will not work. You need to run the program with ./im_still_here.sh. The reason is, by default, the operating system looks at the locations in your $PATH environment variable for executables to execute. im_still_here.sh is not in your $PATH environment variable, so it will not be found. In order to make it clear where the program is, you need to run it with ./.

  4. Instead, try and run the program by typing ./im_still_here.sh.

    Uh oh, another warning. This time, you get a warning that says something like "permission denied". In order to execute a program, you need to grant the program execute permissions. To grant execute permissions for your program, run the following command.

    chmod +x im_still_here.sh
  5. Try and run the program by typing ./im_still_here.sh.

  6. The program should begin running, printing out a count every second.

  7. Suspend the program by typing Ctrl+Z.

  8. Run the program again by typing ./im_still_here.sh, then suspend it again.

  9. Run the command, jobs, to see the jobs you have running.

  10. To continue running a job, use either the fg command or bg command.

    fg stands for foreground and bg stands for background.

    fg %1 will continue to run job 1 in the foreground. During this time you will not have the shell available for you to use. To re-suspend the program, you can press Ctrl+Z again.

    bg %1 will run job 1 in the background. During this time the shell will be available to use. Try running ls to demonstrate. Note that the program, although running in the background, will still be printing to your screen. Although annoying, you can still run and use the shell. In this case, however, you will most likely want to stop running this program in the background due to its disruptive behavior. kdb:[Ctrl+Z] will will no longer suspend the program, because this program is running in the background, not foreground. To suspend the program, first send it to the foreground with fg %1, then use Ctrl+Z to suspend it.

Experiment moving the jobs to the foreground, background, and suspended until you feel comfortable with it. It is a handy trick to learn!

By default, a program is launched in the foreground. To run a program in the background at the start, and the command with a &, like in the following example.

./im_still_here.sh &
Items to submit
  • Code used to solve this problem. Since you will need to use Ctrl+Z, and things of that nature, when what you are doing isn’t "code", just describe what you are did. For example, if I press Ctrl+Z, I would say "I pressed Ctrl+Z".

  • Output from running the code.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.