TDM 20100: Project 3 - grep

Project Objectives

This project introduces you to grep, one of the most powerful and widely used tools in UNIX for searching and filtering text. Through hands-on practice, you’ll learn how to combine grep with other common command-line utilities like wc, cut, tr, uniq, and sort to extract and analyze information within your Jupyter notebook environment.

Learning Objectives
  • Use grep to search text patterns across one or multiple files,

  • Apply flags like -i, -n, -c, -E, and -v to modify search behavior,

  • Use regular expressions (regex) to match advanced search patterns,

  • Combine commands using pipes (|) to build multi-step workflows,

  • Filter and transform data using tools like cut, sort, uniq, wc, and tr,

  • Work with wildcard patterns (*) to search across files and directories,

  • Analyze data using command-line tools for quick insight.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset

  • /anvil/projects/tdm/data/amazon/

  • /anvil/projects/tdm/data/movies_and_tv/

  • /anvil/projects/tdm/data/flights/subset/ (airplane data used in video)

Questions

Question 1 (2 points)

grep is a command-line utility that prints the lines containing a given pattern. By default, matching is case-sensitive, but you can add -i to make it case-insensitive. A pattern can be as simple as a plain word or as complex as a regular expression (regex). By default, grep reads the pattern as a basic regular expression; add -E to use extended regular expression syntax (for example |, +, and ? without backslashes), or -F to treat the pattern as a literal string. It is worth noting that a literal string is a sequence of characters treated as themselves, while a regular expression is a pattern that can match a variety of strings.
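As a minimal sketch of these options, the commands below run grep against a small made-up file. The file name and its contents are assumptions for illustration only and are not part of the course data. The -F option used in the last command is grep's fixed-string (literal) mode.

```shell
# Hypothetical sample data, created in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'Garlic bread' 'roasted garlic' 'color wheel' \
  'colour wheel' '1.5 cups' '125 cups' > "$tmpdir/sample.txt"

# Case-sensitive by default: only the lowercase "garlic" line matches.
sensitive=$(grep 'garlic' "$tmpdir/sample.txt")

# -i ignores case, so both garlic lines match.
insensitive=$(grep -i 'garlic' "$tmpdir/sample.txt")

# -E enables extended regex syntax: "u?" means an optional "u".
regex=$(grep -E 'colou?r' "$tmpdir/sample.txt")

# -F treats the pattern as a literal string, so "." matches only a real dot.
literal=$(grep -F '1.5' "$tmpdir/sample.txt")

printf '%s\n' "$sensitive" "---" "$insensitive" "---" "$regex" "---" "$literal"
rm -r "$tmpdir"
```

Comparing the four results side by side is a quick way to check which option you actually need before running a search on a large file.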

  1. Go to /anvil/projects/tdm/data/amazon/

  2. Search for all lines in tracks.csv containing "Little Mix"

  3. Find who sang the song "Waiting for Blue"

  4. Count how many lines contain "Michael Jackson" using the -c option

  5. Count how many lines contain "X Marks the Pedwalk" (hint: use -i)

  6. Which line in the file contains "Rainbow in the Night"? (hint: use -n)

  7. Find all lines containing "garlic" (case insensitive)

  8. Find all lines where "garlic" appears as a whole word

  9. Count how many lines contain "garlic" as a substring, but not as a whole word
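The counting, line-number, and whole-word options can be sketched on toy data as follows. The file below is made up for illustration, and -w is the whole-word option offered by GNU and BSD grep; whether it is the option intended for the steps above is an assumption.

```shell
# Hypothetical sample data (not the course files).
tmpdir=$(mktemp -d)
printf '%s\n' 'garlic' 'garlicky sauce' 'Garlic salt' 'onion' > "$tmpdir/food.txt"

# -c prints the number of matching lines instead of the lines themselves.
count_all=$(grep -ic 'garlic' "$tmpdir/food.txt")

# -n prefixes each matching line with its line number in the file.
numbered=$(grep -n 'salt' "$tmpdir/food.txt")

# -w keeps only matches that form a whole word, so "garlicky" is excluded.
count_word=$(grep -icw 'garlic' "$tmpdir/food.txt")

echo "any match: $count_all, whole word: $count_word, numbered: $numbered"
rm -r "$tmpdir"
```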

Relevant topics: grep

Deliverables

1a. Code used to solve all the steps above
1b. Written solution to Question 1.3, 1.4, 1.5, 1.6, 1.9 in Markdown
1c. Output from 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8

Question 2 (2 points)

The grep command isn’t limited to a single file—you can search multiple files, or even entire directories, using wildcards like *.

Suppose you want to search all files ending with .txt. You can do this by entering /matcha/*.txt, which searches all .txt files in the matcha folder. If you want to search only files starting with "fli", enter /matcha/fli*, which searches all files that begin with "fli" in the matcha folder.
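Here is a small sketch of how the shell expands these wildcards before grep runs. The temporary directory stands in for the matcha folder, and the file names and contents are hypothetical.

```shell
# Hypothetical directory standing in for the "matcha" folder in the text.
matcha=$(mktemp -d)
printf '%s\n' 'apple pie' 'Apple tart' > "$matcha/flights.txt"
printf '%s\n' 'banana' > "$matcha/fruit.txt"
printf '%s\n' 'apple,1' > "$matcha/flier.csv"

# *.txt expands to every .txt file; with several files, -c prints one
# "filename:count" line per file.
txt_counts=$(cd "$matcha" && grep -ic 'apple' *.txt)

# fli* expands to every file whose name begins with "fli", whatever its extension.
fli_counts=$(cd "$matcha" && grep -ic 'apple' fli*)

printf '%s\n' "$txt_counts" "---" "$fli_counts"
rm -r "$matcha"
```

The shell, not grep, performs the expansion, so grep simply receives a list of file names and reports each one separately.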

Let’s try it for ourselves!

  1. Go to /anvil/projects/tdm/data/amazon/

  2. Using -c, find how many lines in each .csv file contain "apple" (case-insensitive)

  3. If you try to run a grep command on /anvil/projects/tdm/data/movies_and_tv/*, it will print an error for all inner directories. Try it! What -OPTION can you add to make the command work?

  4. Now run the grep command on /anvil/projects/tdm/data/movies_and_tv/* with all the appropriate options, using "pointer" (case-insensitive) as the search pattern

Relevant topics: grep

Deliverables

2a. Code used to solve all the steps above
2b. Written solution to Question 2.2, 2.3 in Markdown
2c. Output from 2.2, 2.4

Question 3 (2 points)

A pipeline (|) allows you to run multiple commands in sequence, passing the output of one command as input to the next.
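As an illustrative sketch, the pipelines below chain grep, wc, head, cut, and tr on a tiny made-up CSV. The file name, column names, and values are assumptions chosen for the example, not the course data.

```shell
# Hypothetical CSV with a header row.
tmpdir=$(mktemp -d)
printf '%s\n' 'id,name,score' '1,Ann Lee,5' '2,Bob Ray,3' '3,Ann Kim,4' \
  > "$tmpdir/people.csv"

# grep selects lines, then wc -l counts them (the same number grep -c reports).
ann_count=$(grep 'Ann' "$tmpdir/people.csv" | wc -l)

# head keeps the first 2 lines, then cut splits on commas and keeps fields 2 and 3.
name_score=$(head -n 2 "$tmpdir/people.csv" | cut -d, -f2,3)

# tr replaces every space with an underscore in whatever it reads from the pipe.
underscored=$(grep 'Bob' "$tmpdir/people.csv" | tr ' ' '_')

printf '%s\n' "$ann_count" "$name_score" "$underscored"
rm -r "$tmpdir"
```

Keep in mind that cut splits on every comma, so a CSV field that itself contains a comma inside quotes would be split as well.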

  1. Similar to Question 1.4, find how many lines contain "Michael Jackson" using wc instead of the -c option

  2. Repeat Question 1.5: find how many lines contain "X Marks the Pedwalk" using wc instead of -c

  3. Print the head of /anvil/projects/tdm/data/amazon/Reviews.csv, displaying only the ProfileName, Score, and Summary columns (Hint: use cut to split a line by a delimiter and select specific fields)

  4. Print the head of /anvil/projects/tdm/data/amazon/music.txt, replacing all whitespace characters with underscores (_) using the tr command

  5. From the first 76 lines of /anvil/projects/tdm/data/amazon/Reviews.csv, search for lines containing "chocolate" (case-insensitive) and print only the ProductId column (Hint: use three | in this pipeline)

Relevant topics: grep, pipelines, wc, head, cut, tr

Deliverables

3a. Code used to solve all the steps above
3b. Written solution to Question 3.1, 3.2 in Markdown
3c. Output from 3.1, 3.2, 3.3, 3.4, 3.5

For more practice with pipelines (|), please refer to the following video of Dr. Ward working with airplane data, which demonstrates how a pipeline works:

Before trying the code from the video, please make sure to select 3 or 4 cores when starting your kernel.
cat /anvil/projects/tdm/data/flights/subset/[12]*.csv | cut -d, -f17,18 | sort | uniq -c | sort -n | tail

Question 4 (2 points)

You can use uniq to collapse repeated lines and, with the -c option, count how many times each line occurs. Because uniq only compares adjacent lines, the input must be sorted first. You can also sort the resulting counts using the sort command.
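The effect of sorting before uniq can be sketched with a short made-up list. The IDs below are hypothetical and only illustrate the idea.

```shell
# Hypothetical list of IDs, one per line, deliberately unsorted.
tmpdir=$(mktemp -d)
printf '%s\n' 'B2' 'A1' 'B2' 'C3' 'A1' 'B2' > "$tmpdir/ids.txt"

# Without sorting, no duplicates are adjacent, so uniq removes nothing.
unsorted_lines=$(uniq "$tmpdir/ids.txt" | wc -l)

# After sorting, duplicates sit next to each other and uniq -c prefixes each
# distinct line with its count. sort -rn orders by that count, largest first,
# and head keeps the top 2.
top=$(sort "$tmpdir/ids.txt" | uniq -c | sort -rn | head -n 2)

printf '%s\n' "$unsorted_lines" "$top"
rm -r "$tmpdir"
```

The exact amount of padding in front of each count depends on the uniq implementation, so do not rely on a fixed column width when reading the output.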

  1. Stay in /anvil/projects/tdm/data/amazon/

  2. In Question 3.5, you found the ProductIds on the lines containing "chocolate" among the first 76 lines of Reviews.csv. Print the unique ProductIds only.

  3. Count how many times each unique ProductId appears in that output.

  4. Sort the output from the previous step in descending order.

  5. For the entire file, find the count of each ProductId that appears on lines containing "chocolate", sort the counts in decreasing order, and print the first 10 lines of the final output. (i.e., the 10 ProductIds most frequently associated with "chocolate")

Relevant topics: grep, cut, uniq, sort, head, pipelines

Deliverables

4a. Code used to solve all the steps above
4b. Output from 4.2, 4.3, 4.4, 4.5

Question 5 (2 points)

In the first question, regular expressions (regex) were briefly mentioned. Regex is a powerful tool that allows for flexible and complex string pattern matching. For example, instead of performing two separate searches for "grey" and "gray," a single regex search using the pattern gr(e|a)y can match both variations. The rules for regex can be challenging to memorize (and that’s okay—they’re not required).
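To make the grey/gray example concrete, here is a minimal sketch using a made-up file; the file name and its lines are assumptions for illustration.

```shell
# Hypothetical sample data.
tmpdir=$(mktemp -d)
printf '%s\n' 'a grey cat' 'a gray dog' 'a green frog' 'GRAY skies' \
  > "$tmpdir/colors.txt"

# One extended-regex search covers both spellings; -c counts the matching lines.
both=$(grep -E -c 'gr(e|a)y' "$tmpdir/colors.txt")

# Adding -i also matches the uppercase line.
both_ci=$(grep -E -i -c 'gr(e|a)y' "$tmpdir/colors.txt")

# Alternation also works between whole alternatives.
words=$(grep -E 'cat|dog' "$tmpdir/colors.txt")

printf '%s\n' "$both" "$both_ci" "$words"
rm -r "$tmpdir"
```

Note that "green" is not matched, because the pattern requires a y immediately after the e or a.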

If you’re interested in learning more about regex, The Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan is a great place to start - and it’s free with your Purdue account!

The main goal for this question is to learn how to look up specific syntax and use it effectively. Regular expressions can get quite complex, but let’s start with some simple examples.

  1. Stay in /anvil/projects/tdm/data/amazon/

  2. Use a regular expression to find the number of lines containing either "love" or "hate" (case-insensitive)

  3. Print all lines that begin with a capital letter (Hint: ^)

  4. From the head of the file, print the lines whose Summary column begins with a capital letter

  5. For the whole file, count the lines whose Summary column begins with a capital letter

  6. In the Summary column, count lines that end with "great" (case-insensitive) (Hint: $)

  7. In the Text column, find how many reviews contain at least two digits (Hint: [0-9].*[0-9] with -E; the \d shorthand is only understood by grep -P)
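The anchors and digit patterns in the steps above can be sketched on a tiny made-up CSV. The column layout and review text here are hypothetical and do not match the real file; the idea is that cut isolates one column so that ^ and $ refer to the start and end of that field.

```shell
# Hypothetical CSV: id, summary, text.
tmpdir=$(mktemp -d)
printf '%s\n' 'id,summary,text' \
  '1,Great taste,Bought 2 bags in 2019' \
  '2,not great,Arrived late' \
  '3,Tastes GREAT,Pack of 12' \
  '4,Okay,Used 1 time' > "$tmpdir/reviews.csv"

# ^ anchors the match at the start of the field; [[:upper:]] is any capital letter.
starts_upper=$(cut -d, -f2 "$tmpdir/reviews.csv" | grep -c '^[[:upper:]]')

# $ anchors the match at the end of the field; -i ignores case.
ends_great=$(cut -d, -f2 "$tmpdir/reviews.csv" | grep -ic 'great$')

# Two digits with anything (or nothing) between them, written with [0-9]
# because grep -E does not understand the \d shorthand.
two_digits=$(cut -d, -f3 "$tmpdir/reviews.csv" | grep -E -c '[0-9].*[0-9]')

echo "upper: $starts_upper, great: $ends_great, digits: $two_digits"
rm -r "$tmpdir"
```

On a real CSV, fields that contain commas inside quotes will shift the field numbers seen by cut, so it is worth checking the selected column with head before counting.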

Relevant topics: grep, -E, regular expressions (regex), pipelines

Deliverables

5a. Code used to solve all the steps above
5b. Written answer for 5.2, 5.5, 5.6, 5.7
5c. Output from 5.2, 5.3, 5.4, 5.5, 5.6, 5.7

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit
  • firstname_lastname_project3.ipynb

You must document your work with comments about each solution. All of your work must be your own, and any outside sources you used (people, internet pages, generative AI, etc.) must be cited properly in the project template.

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has rendered properly and contains your code, markdown, and code output, when it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.