TDM 20100: Project 1 — Welcome to the CLI!

Motivation: The Command Line Interface, often referred to simply as the CLI, is the bread-and-butter of working with computers. With it, we can navigate through files, search for patterns, create, modify, and delete thousands of files with a single line of code, and more! In the next few projects we’ll be learning all about the CLI and what it is capable of. In just a few weeks, you’ll be well on your way to mastery of the command line!

Context: Experience working in Anvil will make this project easier to start but is not a prerequisite.

Scope: Anvil, Jupyter Labs, CLI, Bash, GNU, filesystem navigation

Learning Objectives:
  • Remember how to work in Anvil

  • Learn how to navigate the CLI

  • Learn how to navigate a filesystem in the CLI

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset(s):

  • /anvil/projects/tdm/data/flights/

  • /anvil/projects/tdm/data/flights/subset/

  • /anvil/projects/tdm/data/icecream/combined/reviews.csv

Questions

Question 1 (2 pts)

It’s been a long summer, so let’s start our first project this semester off with a quick review of Anvil. In case you haven’t already, visit ondemand.anvil.rcac.purdue.edu and log in using your ACCESS account credentials. If you don’t already have an account, follow these instructions to set one up. If you’ve forgotten your account credentials or are having other issues related to Anvil, please reach out to [email protected] with as much information as possible about your issue.

Your ACCESS account credentials may not necessarily be the same as your Purdue Career account.

Once logged in, start a new Anvil session for a few hours (try not to use more than 3 for the moment) and 1 CPU core.

For a reminder on how to start a new session on Anvil:

In the upper-middle part of your screen, you should see a dropdown button labeled The Data Mine. Click on it, then select Jupyter Notebook from the options that appear. If you followed all the previous steps correctly, you should now be on a screen that looks like this:

OnDemand Jupyter Lab
Figure 1. OnDemand Jupyter Lab

If your screen doesn’t look like this, please try and select the correct dropdown option again or visit seminar for more assistance.

For this first question, we’re going to have you get used to working with Jupyter. To start, upload this project template to Jupyter and fill in your name and the project number as needed, selecting seminar (not seminar-r) as your kernel.

Then, run the following Python in a code cell. For a more in-depth reminder on working in Jupyter, feel free to take a look at this year’s 10100 project 1 or check out this guide on Jupyter.

print("Hello World!")

Then, in a new code cell, run the following:

%%bash

echo Hello World!

The first line, %%bash, is cell magic, which tells our kernel to expect a different language than the default. (In this case, the default is Python and we are telling it to use bash instead.) When using cell magic, it is necessary to have the cell magic as the first line in the cell. If (for instance) a comment is the first thing in the cell, then the cell magic will fail.

The second line consists of echo Hello World!. echo is a Bash command similar to print() in Python, and we have it print "Hello World!"

As for Bash (short for Bourne-Again-SHell), you can think of it as a programming language for the command line. Bash has a lot of handy tools and commands to learn, and the rest of this project will be spent beginning to learn about Bash.

The terminal is what we call the area we typically work with the CLI in. While we can run Bash in our Jupyter notebook (as we did above), you will typically work directly in a terminal and throughout this semester we would recommend that you first run your Bash code in a terminal before copying the finished code over to your Jupyter notebook. To open a terminal on Anvil, open a new tab and select Terminal, where you’ll be greeted with a window that looks somewhat like the following (albeit jaxmattfair will be replaced by your access username.)

OnDemand Terminal
Figure 2. OnDemand Terminal

Try typing echo Hello World! and hitting enter. You should see the terminal print "Hello World!" before waiting for another command.

To get credit for this question, write a command using echo that prints "Hello X!" where "X" is replaced with your name. Be sure to copy your finished command into your Jupyter notebook and run it using cell magic to get credit for your work.

Deliverables
  • A command to print "Hello X!" (where "X" is replaced with your name) and the results of running it

  • Be sure to document your work from Question 1, using some comments and insights about your work.

Question 2 (2 pts)

Knowing how to navigate in the shell is helpful. A few notes:

Absolute paths start with a '/', like this:

/anvil/projects/tdm/data/flights/subset/

Relative paths do not start with a '/', like this:

subset

The 'cd' command is used to change directories.

By default, 'cd' just changes your location back to your home directory.

You can type 'cd' with absolute paths or relative paths, for instance:

%%bash
cd /anvil/projects/tdm/data/flights/subset/

or like this:

%%bash
cd /anvil/projects/tdm/data/flights/
cd subset

If you want to go back to a directory one level higher, type 'cd ..'

For instance, try this, which first moves our location to the flight subset directory, and then back to the flights directory, and then back to the data directory.

%%bash
cd /anvil/projects/tdm/data/flights/subset/
cd ..
cd ..
pwd

The pwd command prints the working directory.

The ls command prints the contents of the working directory, with only the file names.

Dr Ward likes to run ls -la (those are lowercase letter L’s, not number 1’s), which shows information about the files in the directories.

Dr Ward also uses pwd a lot, to make sure that he is working in the directory that he intended to be working in.

Each bash cell in Jupyter Lab is executed independently, starting from your home directory, as if nothing had been previously run. In other words, bash cells in Jupyter Lab ignore anything that you did in earlier cells.

Which years of flight data are in the directory:

/anvil/projects/tdm/data/flights/subset/?

Which years of flight data are in the directory:

/anvil/projects/tdm/data/flights/?

In which of the two directories are the files bigger in size?

Deliverables
  • The year range of flight data in the two directories indicated above, and which directory has bigger file sizes.

  • Be sure to document your work from Question 2, using some comments and insights about your work.

Question 3 (2 pts)

We can use the head and the tail commands to see the top lines and the bottom lines of a file. By default, we see 10 lines of output, in each case. We can use the -n flag to change the number of lines of output that we see. For instance:

%%bash

head -n6 /anvil/projects/tdm/data/flights/subset/1987.csv

shows the first 6 lines of the 1987.csv file in the flights subset directory. This includes the header line and also the information about the first 5 flights.

The cut command usually takes two flags, namely:

the -d flag that indicates how the data in a flag is delimited (in other words, what character is placed between the pieces of data), and

the -f flag that indicates which fields we want to cut.

Use the cut command to extract all of the origin airports and destination airports from the 1987.csv file in the flights subset directory, and store the resulting origin and destination airports into a file in your home directory.

You can save the results of your work in bash in a file in your home directory like this:

%%bash
myworkinbash >$HOME/originsanddestinations.csv
Deliverables
  • Show the head of the file originsanddestinations.csv that you created.

  • Be sure to document your work from Question 3, using some comments and insights about your work.

Question 4 (2 pts)

Use the grep command to find data in the 1987.csv file in the flights subset directory that contain the pattern IND. Save all of the lines of that 1987.csv file into a new file in your home directory called indyflights.csv.

Deliverables
  • Show the head of the file indyflights.csv that you created.

  • Be sure to document your work from Question 4, using some comments and insights about your work.

Question 5 (2 pts)

Now consider the file:

/anvil/projects/tdm/data/icecream/combined/reviews.csv

Use the grep command to extract all of the lines from this file that contain the word terrific and store these reviews in a new file called terrificreviews.csv in your home directory.

If you look at the first line of the file:

/anvil/projects/tdm/data/icecream/combined/reviews.csv

you will see that field 5 of each line has the number of stars for that product review.

Among (only) the reviews in the terrificreviews.csv file, how many of the reviews had only 1 star? How many had 4 stars? How many had 5 stars?

Deliverables
  • From the file terrificreviews.csv that you created, how many of the reviews had only 1 star? How many had 4 stars? How many had 5 stars?

  • Be sure to document your work from Question 5, using some comments and insights about your work.

Submitting your Work

With this last question completed, you’ve successfully made your first dive into the wonderful world of the command line, and can now successfully navigate just about any filesystem we throw at you! This may not seem like it was a hugely difficult project, but the skills you learned in this project are foundational tools that, when built upon, are extremely powerful skills that offer huge benefits in both research and industry.

In the next project we’ll go one step further than simply navigating the filesystem. We will learn how to create, destroy, and move files much more quickly than we can with R or Python.

Make sure to put all of your work into a Jupyter Lab notebook, and make sure that all of the desired output appears in the notebook. Once you upload your submission to Gradescope, make sure that everything appears as you would expect to ensure that you don’t lose any points.

Items to submit
  • firstname_lastname_project1.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.