TDM 20100: Project 3 - grep
Project Objectives
This project introduces you to grep, one of the most powerful and widely used tools in UNIX for searching and filtering text. Through hands-on practice, you’ll learn how to combine grep with other common command-line utilities like wc, cut, tr, uniq, and sort to extract and analyze information within your Jupyter notebook environment.
Dataset
- /anvil/projects/tdm/data/amazon/
- /anvil/projects/tdm/data/movies_and_tv/
- /anvil/projects/tdm/data/flights/subset/ (airplane data used in video)
Questions
Question 1 (2 points)
grep is a command-line utility that returns lines containing the given pattern. By default, matching is case-sensitive, but you can add -i to make it case-insensitive. A pattern can be as simple as a plain word or as complex as a regular expression (regex). By default, grep interprets the pattern as a basic regular expression; add -E to use extended regular expression syntax (for example, unescaped |, +, and parentheses for grouping), or add -F to treat the pattern as a literal string. A literal string is a sequence of characters that match only themselves, while a regular expression is a pattern that can match a variety of strings.
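As a minimal sketch of how these options change what matches, the commands below run against a small throwaway file. The file sample.txt and its contents are made up for illustration and are not part of the project data.

```shell
# Toy data for illustration only (not the project dataset).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
sample="$workdir/sample.txt"
printf '%s\n' 'Gray skies' 'grey clouds' 'a+b equals c' 'aab' > "$sample"

# Case-sensitive by default: only the lowercase "grey" line matches.
grep 'grey' "$sample"

# -i ignores case, so the pattern "gray" also matches "Gray".
grep -i 'gray' "$sample"

# -E (extended regex): "+" means "one or more of the previous character",
# so this matches "aab" but not the literal text "a+b".
grep -E 'a+b' "$sample"

# -F (fixed string): "+" is just a plus sign, so this matches "a+b equals c".
grep -F 'a+b' "$sample"
```

Comparing the last two commands shows the difference between a regular expression and a literal string: the same pattern text selects different lines.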
- Go to /anvil/projects/tdm/data/amazon/
- Search for all lines in tracks.csv containing "Little Mix"
- Find who sang the song "Waiting for Blue"
- Count how many lines contain "Michael Jackson" using the -c option
- Count how many lines contain "X Marks the Pedwalk" (hint: use -i)
- Which line in the file contains "Rainbow in the Night"? (hint: use -n)
- Find all lines containing "garlic" (case insensitive)
- Find all lines where "garlic" appears as a whole word
- Count how many lines contain "garlic" as a substring, but not as a whole word
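The options named in these steps can be tried out on a toy file first. The sketch below uses a hypothetical file, menu.txt, with an unrelated search word, so it shows the mechanics without using the project data. The -w and -v options are standard grep flags for whole-word matching and for inverting a match.

```shell
# Toy data for illustration only (not the project dataset).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
menu="$workdir/menu.txt"
printf '%s\n' 'Pepper steak' 'green pepper soup' 'peppermint tea' 'plain rice' > "$menu"

# -c prints the number of matching lines instead of the lines themselves.
grep -c -i 'pepper' "$menu"

# -n prefixes each matching line with its line number in the file.
grep -n 'soup' "$menu"

# -w matches only whole words, so "peppermint" is excluded.
grep -i -w 'pepper' "$menu"

# One possible way to keep substring-only matches: take every match,
# then use -v to drop the lines where the word appears as a whole word.
grep -i 'pepper' "$menu" | grep -i -v -w 'pepper'
```

The last command relies on a pipe (|) to pass the output of the first grep to the second.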
Relevant topics: grep
1a. Code used to solve all the steps above
1b. Written solution to Question 1.3, 1.4, 1.5, 1.6, 1.9 in Markdown
1c. Output from 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8
Question 2 (2 points)
The grep command isn’t limited to a single file: you can search multiple files, or even entire directories, using wildcards like *.
Suppose you want to search all files ending with .txt. You can do this by entering /matcha/*.txt as the file argument, which searches all .txt files in the matcha folder. If you want to search only files starting with "fli", enter /matcha/fli*, which searches all files that begin with "fli" in the matcha folder.
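A minimal sketch of the wildcard idea follows. It builds a temporary matcha folder with made-up files (flights.txt, flip.csv, notes.txt are hypothetical names) so the commands can run anywhere.

```shell
# Toy directory for illustration only; file names and contents are made up.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir "$workdir/matcha"
printf '%s\n' 'green tea' 'black tea' > "$workdir/matcha/flights.txt"
printf '%s\n' 'tea time' > "$workdir/matcha/flip.csv"
printf '%s\n' 'coffee' > "$workdir/matcha/notes.txt"

# The shell expands *.txt to every file ending in .txt before grep runs.
# With several files, grep prefixes each result with the file name.
grep -c 'tea' "$workdir"/matcha/*.txt

# fli* expands to every file whose name starts with "fli".
grep -c 'tea' "$workdir"/matcha/fli*
```

Note that the expansion is done by the shell, not by grep, so the same wildcards work with other commands such as wc or head.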
Let’s try it for ourselves!
- Go to /anvil/projects/tdm/data/amazon/
- Using -c, find how many lines in each .csv file contain "apple" (case-insensitive)
- If you try to run a grep command on /anvil/projects/tdm/data/movies_and_tv/*, it will print an error for all inner directories. Try it! What -OPTION can you add to make the command work?
- Now run the grep command on /anvil/projects/tdm/data/movies_and_tv/* with all the appropriate options, using "pointer" (case-insensitive) as the search pattern
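When a wildcard expands to a mix of files and directories, grep needs to be told what to do with the directories. The sketch below shows two possibilities on a toy directory tree, assuming GNU grep (the version typically found on Linux systems). The folder and file names are made up, and which behavior fits the question is left to you.

```shell
# Toy directory tree for illustration only; names and contents are made up.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
mkdir "$workdir/inner"
printf '%s\n' 'Null Pointer' > "$workdir/top.txt"
printf '%s\n' 'pointer math' > "$workdir/inner/deep.txt"

# -d skip: silently skip directories and search only the regular files
# that the wildcard matched directly.
grep -i -d skip 'pointer' "$workdir"/*

# -r: descend into each directory and search the files inside it as well.
grep -i -r 'pointer' "$workdir"/*
```

The first command reports only the match in top.txt, while the second also finds the match inside inner/deep.txt.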
Relevant topics: grep
2a. Code used to solve all the steps above
2b. Written solution to Question 2.2, 2.3 in Markdown
2c. Output from 2.2, 2.4
Question 3 (2 points)
A pipeline (|) allows you to run multiple commands in sequence, passing the output of one command as input to the next.
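As a small illustration, the sketch below pipes the output of grep into wc -l, which counts the lines it receives. The file fruit.txt is a hypothetical toy file, not part of the project data.

```shell
# Toy data for illustration only (not the project dataset).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
fruit="$workdir/fruit.txt"
printf '%s\n' 'red apple' 'green apple' 'ripe banana' > "$fruit"

# grep prints the matching lines; wc -l counts how many lines arrive.
grep 'apple' "$fruit" | wc -l

# The same count using grep alone, for comparison.
grep -c 'apple' "$fruit"
```

Both commands report the same number, but the pipeline form generalizes: any command that prints lines can feed wc -l.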
- Similar to Question 1.4, find how many lines contain "Michael Jackson" using wc instead of the -c option
- Repeat Question 1.5: find how many lines contain "X Marks the Pedwalk" using wc instead of -c
- Print the head of /anvil/projects/tdm/data/amazon/Reviews.csv, displaying only the ProfileName, Score, and Summary columns (Hint: use cut to split a line by a delimiter and select specific fields)
- Print the head of /anvil/projects/tdm/data/amazon/music.txt, replacing all whitespaces with underscores (_) using the tr command
- From the first 76 lines of /anvil/projects/tdm/data/amazon/Reviews.csv, search for lines containing "chocolate" (case-insensitive) and print only the ProductId column (Hint: use three | in this pipeline)
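The sketch below shows how head, cut, and tr behave inside a pipeline, using a made-up CSV file. The file name orders.csv, its column layout, and its contents are assumptions for illustration; check the header of the real file to find the right field numbers.

```shell
# Toy CSV for illustration only; the columns are hypothetical.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
orders="$workdir/orders.csv"
printf '%s\n' \
  'Id,Item,Rating,Note' \
  '1,A100,5,Fresh mango slices' \
  '2,A200,3,Stale crackers' \
  '3,A300,4,Mango was fine' > "$orders"

# cut: -d sets the delimiter and -f selects fields (here columns 3 and 4).
head -n 3 "$orders" | cut -d',' -f3,4

# tr: translate every space into an underscore.
head -n 2 "$orders" | tr ' ' '_'

# Chain several stages: take the top lines, keep case-insensitive matches,
# then print only column 2.
head -n 4 "$orders" | grep -i 'mango' | cut -d',' -f2
```

One caveat: cut splits on every comma, so a quoted field that itself contains a comma will shift the field numbers for that line.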
Relevant topics: grep, pipelines, wc, head, cut, tr
3a. Code used to solve all the steps above
3b. Written solution to Question 3.1, 3.2 in Markdown
3c. Output from 3.1, 3.2, 3.3, 3.4, 3.5
For more practise with pipeline (
|
Question 4 (2 points)
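You can use uniq to collapse consecutive duplicate lines, and uniq -c to count how many times each line occurs consecutively. Because uniq only compares adjacent lines, the input must be sorted first if you want one count per distinct line. You can also sort the output using the sort command.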
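To see why sorting matters, the sketch below runs uniq on a small made-up list of IDs, first unsorted and then sorted. The file ids.txt and its values are hypothetical.

```shell
# Toy data for illustration only (not the project dataset).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
ids="$workdir/ids.txt"
printf '%s\n' 'A200' 'A100' 'A200' 'A300' 'A200' 'A100' > "$ids"

# uniq only merges adjacent duplicates; here no two neighbors are equal,
# so all six lines are printed unchanged.
uniq "$ids"

# Sorting first places identical lines next to each other.
sort "$ids" | uniq

# -c prefixes each line with its count; sort -rn then orders the counts
# numerically, largest first, and head keeps the top two.
sort "$ids" | uniq -c | sort -rn | head -n 2
```

The sort | uniq -c | sort -rn pattern is a common way to build a frequency table from any stream of lines.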
- Stay in /anvil/projects/tdm/data/amazon/
- In the previous step, you found the ProductIds for the first 76 lines that contain "chocolate". Print unique ProductIds only.
- Count how many times each unique ProductId appears in that output.
- Sort the output from the previous step in descending order.
- For the entire file, find the count of each ProductId that appears on lines containing "chocolate", sort the counts in decreasing order, and print the first 10 lines of the final output. (i.e., the 10 ProductIds most frequently associated with "chocolate")
Relevant topics: grep, cut, uniq, sort, head, pipelines
4a. Code used to solve all the steps above
4b. Output from 4.2, 4.3, 4.4, 4.5
Question 5 (2 points)
In the first question, regular expressions (regex) were briefly mentioned. Regex is a powerful tool that allows for flexible and complex string pattern matching. For example, instead of performing two separate searches for "grey" and "gray," a single regex search using the pattern gr(e|a)y can match both variations. The rules for regex can be challenging to memorize (and that’s okay; they’re not required).
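The grey/gray pattern can be tried directly with grep -E. The sketch below uses a made-up file, colors.txt, to show which lines the alternation picks up.

```shell
# Toy data for illustration only (not the project dataset).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
colors="$workdir/colors.txt"
printf '%s\n' 'grey wolf' 'Gray whale' 'green frog' 'gravy' > "$colors"

# (e|a) is an alternation: match either "e" or "a" at that position.
# Matching is case-sensitive here, so "Gray whale" is not selected.
grep -E 'gr(e|a)y' "$colors"

# Combine with -i and -c to count matching lines regardless of case.
grep -E -i -c 'gr(e|a)y' "$colors"
```

The lines "green frog" and "gravy" are skipped because neither contains "grey" or "gray" as a run of consecutive characters.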
If you’re interested in learning more about regex, The Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan is a great place to start - and it’s free with your Purdue account!
The main goal for this question is to learn how to look up specific syntax and use it effectively. Regular expressions can get quite complex, but let’s start with some simple examples.
- Stay in /anvil/projects/tdm/data/amazon/
- Use a regular expression to find the number of lines containing either "love" or "hate" (case-insensitive)
- Print all lines that begin with a capital letter (Hint: ^)
- In the head of the file, print the lines whose Summary column begins with a capital letter
- For the whole file, count the lines whose Summary column begins with a capital letter
- In the Summary column, count lines that end with "great" (case-insensitive) (Hint: $)
- In the Text column, find how many reviews contain at least two digits (Hint: \d.*\d; note that grep -E does not recognize \d, so use [0-9] in its place or switch to -P)
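The anchors ^ and $ apply to whatever text grep receives, so to test a single column you can isolate it with cut first. The sketch below uses a hypothetical two-column file, notes.csv. The POSIX class [[:upper:]] is one way to express "capital letter", and [0-9] stands in for \d under -E.

```shell
# Toy CSV for illustration only; columns are a made-up Id and a short note.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
notes="$workdir/notes.csv"
printf '%s\n' \
  '1,Great snack' \
  '2,not so great' \
  '3,Bought 2 bags for 10 dollars' \
  '4,only one left' > "$notes"

# ^ anchors the match to the start of the text. Every full line starts with
# a digit, so cut out column 2 first and then test its first character.
cut -d',' -f2 "$notes" | grep -E '^[[:upper:]]'

# $ anchors the match to the end of the text.
cut -d',' -f2 "$notes" | grep -E -i -c 'great$'

# Two digits anywhere in the text, with any characters in between.
cut -d',' -f2 "$notes" | grep -E -c '[0-9].*[0-9]'
```

Running the first grep without the cut stage would print nothing, since no line in the file begins with a capital letter.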
Relevant topics: grep, -E, regular expressions (regex), pipelines
5a. Code used to solve all the steps above
5b. Written answer for 5.2, 5.5, 5.6, 5.7
5c. Output from 5.2, 5.3, 5.4, 5.5, 5.6, 5.7
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- firstname_lastname_project3.ipynb
It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template. Please take the time to double check your work. See here for instructions on how to double check this. You will not receive full credit if your |