TDM 20100: Project 3 - grep
Project Objectives
This project introduces you to grep, one of the most powerful and widely used tools in UNIX for searching and filtering text. Through hands-on practice, you’ll learn how to combine grep with other common command-line utilities like wc, cut, tr, uniq, and sort to extract and analyze information within your Jupyter notebook environment.
Dataset
- /anvil/projects/tdm/data/amazon/
- /anvil/projects/tdm/data/movies_and_tv/
- /anvil/projects/tdm/data/flights/subset/ (airplane data used in video)
Questions
Question 1 (2 points)
grep is a command-line utility that prints every line containing a given pattern. By default, matching is case-sensitive; add -i to make it case-insensitive. A pattern can be as simple as a plain word or as complex as a regular expression (regex). By default, grep reads the pattern as a basic regular expression, so a plain word simply matches itself. Add -E to use extended regular expression syntax, where characters such as |, +, ?, and parentheses act as operators, or add -F to treat the whole pattern as a literal string. A literal string is a sequence of characters that match only themselves, while a regular expression is a pattern that can match a variety of strings.
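The options described above can be tried on a tiny file before touching the real datasets. The following is a minimal sketch: the file contents and the variable name `sample` are made up for illustration and are not part of the course data.

```shell
# Build a small throwaway file (hypothetical data, not from the course datasets).
sample="$(mktemp)"
printf '%s\n' \
  'Green Tea,hot' \
  'green tea latte,iced' \
  'Evergreen blend,hot' \
  'Black tea (2.5 oz),hot' > "$sample"

grep 'green' "$sample"           # case-sensitive: matches lines 2 and 3 only
grep -i 'green' "$sample"        # case-insensitive: matches lines 1, 2 and 3
grep -c 'hot' "$sample"          # print the number of matching lines instead of the lines
grep -n 'latte' "$sample"        # prefix each matching line with its line number
grep -iw 'green' "$sample"       # whole-word match: "Evergreen" no longer counts
grep -F '2.5' "$sample"          # -F: the pattern is a literal string, "." is just a dot
grep -E 'latte|blend' "$sample"  # -E: extended regex, "|" means "or"
```

Comparing the output of `grep -i` with `grep -iw` on the same pattern is a quick way to see which lines contain the pattern only as part of a longer word.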
- Go to /anvil/projects/tdm/data/amazon/
- Search for all lines in tracks.csv containing "Little Mix"
- Find who sang the song "Waiting for Blue"
- Count how many lines contain "Michael Jackson" using the -c option
- Count how many lines contain "X Marks the Pedwalk" (hint: use -i)
- Which line in the file contains "Rainbow in the Night"? (hint: use -n)
- Find all lines containing "garlic" (case insensitive)
- Find all lines where "garlic" appears as a whole word
- Count how many lines contain "garlic" as a substring, but not as a whole word
Relevant topics: grep
1a. Code used to solve all the steps above
1b. Written solution to Question 1.3, 1.4, 1.5, 1.6, 1.9 in Markdown
1c. Output from 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8
Question 2 (2 points)
The grep command isn’t limited to a single file—you can search multiple files, or even entire directories, using wildcards like *.
Suppose you want to search all files ending with .txt. You can do this by entering /matcha/*.txt, which searches all .txt files in the matcha folder. If you want to search only files starting with "fli", enter /matcha/fli*, which searches all files that begin with "fli" in the matcha folder.
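To see how wildcards interact with grep, here is a small sketch that builds its own throwaway folder. The folder, the file names, and their contents are hypothetical and only stand in for the matcha folder mentioned above.

```shell
# Throwaway directory with made-up files (names and contents are assumptions).
dir="$(mktemp -d)"
printf 'matcha latte\nplain milk\n'     > "$dir/flights.txt"
printf 'Matcha cake,1\nmatcha tea,2\n'  > "$dir/flights.csv"
printf 'no tea here\nmatcha powder\n'   > "$dir/notes.txt"

# The shell expands the wildcard first, so grep receives a list of file names.
grep -c 'matcha' "$dir"/*.txt    # one "file:count" line for each .txt file
grep -i 'matcha' "$dir"/fli*     # only files whose names begin with "fli"
```

When grep is given more than one file, it prefixes each output line with the file name, which makes it easy to tell where a match came from.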
Let’s try it for ourselves!
- Go to /anvil/projects/tdm/data/amazon/
- Using -c, find how many lines in each .csv file contain "apple" (case-insensitive)
- If you try to run a grep command on /anvil/projects/tdm/data/movies_and_tv/*, it will print an error for all inner directories. Try it! What -OPTION can you add to make the command work?
- Now run the grep command on /anvil/projects/tdm/data/movies_and_tv/* with all the appropriate options, using "pointer" (case-insensitive) as the search pattern
Relevant topics: grep
2a. Code used to solve all the steps above
2b. Written solution to Question 2.2, 2.3 in Markdown
2c. Output from 2.2, 2.4
Question 3 (2 points)
A pipeline (|) allows you to run multiple commands in sequence, passing the output of one command as input to the next.
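As a rough illustration of pipelines, the sketch below chains grep, wc, head, cut, and tr on a small made-up CSV file. The column names and rows are assumptions chosen for the example and do not come from the course datasets.

```shell
# Hypothetical comma-separated data, made up for this sketch.
csv="$(mktemp)"
printf '%s\n' \
  'Id,Name,Score,Note' \
  '1,Ana,5,dark chocolate bar' \
  '2,Ben,3,green tea' \
  '3,Cy,4,Chocolate chip cookie' > "$csv"

grep -i 'chocolate' "$csv" | wc -l      # count matches by counting grep's output lines
head -n 2 "$csv" | cut -d ',' -f 2,3    # keep only fields 2 and 3 of the first two lines
head -n 3 "$csv" | tr ' ' '_'           # replace every space with an underscore
head -n 3 "$csv" | grep -i 'chocolate' | cut -d ',' -f 1   # three commands chained together
```

One caveat: cut splits on every comma it sees, including commas inside quoted text fields, so field numbers can shift on rows whose text contains commas.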
- Similar to Question 1.4, find how many lines contain "Michael Jackson" using wc instead of the -c option
- Repeat Question 1.5: find how many lines contain "X Marks the Pedwalk" using wc instead of -c
- Print the head of /anvil/projects/tdm/data/amazon/Reviews.csv, displaying only the ProfileName, Score, and Summary columns (Hint: use cut to split a line by a delimiter and select specific fields)
- Print the head of /anvil/projects/tdm/data/amazon/music.txt, replacing all whitespace characters with underscores (_) using the tr command
- From the first 76 lines of /anvil/projects/tdm/data/amazon/Reviews.csv, search for lines containing "chocolate" (case-insensitive) and print only the ProductId column (Hint: use three | in this pipeline)
Relevant topics: grep, pipelines, wc, head, cut, tr
3a. Code used to solve all the steps above
3b. Written solution to Question 3.1, 3.2 in Markdown
3c. Output from 3.1, 3.2, 3.3, 3.4, 3.5
For more practice with pipelines (|)
Question 4 (2 points)
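uniq collapses adjacent duplicate lines, and uniq -c also prints how many times each line was repeated consecutively. To count every occurrence of a line in a file, sort the input first so that identical lines sit next to each other. You can then order the resulting counts with the sort command.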
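The effect of sorting before uniq is easiest to see on a short list. This is a minimal sketch with made-up identifiers; the values and the variable name `ids` are assumptions for illustration only.

```shell
# Hypothetical identifiers, one per line, with repeats that are not adjacent.
ids="$(mktemp)"
printf '%s\n' B001 A007 B001 C042 A007 B001 > "$ids"

uniq -c "$ids"                                # unsorted input: no adjacent repeats, every count is 1
sort "$ids" | uniq                            # each distinct value once
sort "$ids" | uniq -c                         # each distinct value with its total count
sort "$ids" | uniq -c | sort -nr              # most frequent first (-n numeric, -r reversed)
sort "$ids" | uniq -c | sort -nr | head -n 2  # keep only the top two
```

The second sort needs -n because the counts are numbers; a plain text sort would place 10 before 2.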
- Stay in /anvil/projects/tdm/data/amazon/
- In Question 3.5, you found the ProductIds of the lines containing "chocolate" among the first 76 lines of Reviews.csv. Print unique ProductIds only.
- Count how many times each unique ProductId appears in that output.
- Sort the output from the previous step in descending order.
- For the entire file, find the count of each ProductId that appears on lines containing "chocolate", sort the counts in decreasing order, and print the first 10 lines of the final output (i.e., the 10 ProductIds most frequently associated with "chocolate").
Relevant topics: grep, cut, uniq, sort, head, pipelines
4a. Code used to solve all the steps above
4b. Output from 4.2, 4.3, 4.4, 4.5
Question 5 (2 points)
In the first question, regular expressions (regex) were briefly mentioned. Regex is a powerful tool that allows for flexible and complex string pattern matching. For example, instead of performing two separate searches for "grey" and "gray," a single regex search using the pattern gr(e|a)y can match both variations. The rules for regex can be challenging to memorize (and that’s okay—they’re not required).
If you’re interested in learning more about regex, The Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan is a great place to start - and it’s free with your Purdue account!
The main goal for this question is to learn how to look up specific syntax and use it effectively. Regular expressions can get quite complex, but let’s start with some simple examples.
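A few of the most common regex building blocks can be tried on a toy file first. The sketch below uses made-up lines and the hypothetical variable name `words`. It uses POSIX character classes such as `[[:upper:]]` and `[0-9]` because they work with grep -E; the `\d` shorthand belongs to Perl-style regex, which GNU grep supports only with -P.

```shell
# Hypothetical text lines, made up for this sketch.
words="$(mktemp)"
printf '%s\n' \
  'The sky is grey' \
  'a gray cat' \
  'Green is great' \
  'room 12 costs 40' \
  'only 7 left' > "$words"

grep -E 'gr(e|a)y' "$words"        # alternation: matches both "grey" and "gray"
grep -E '^[[:upper:]]' "$words"    # ^ anchors the match to the start of the line
grep -iE 'great$' "$words"         # $ anchors the match to the end of the line
grep -cE '[0-9].*[0-9]' "$words"   # count lines with at least two digits anywhere
```

To apply an anchor to a single column rather than the whole line, one approach is to extract that column with cut first and then pipe the result into grep.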
- Stay in /anvil/projects/tdm/data/amazon/
- Use a regular expression to find the number of lines containing either "love" or "hate" (case-insensitive)
- Print all lines that begin with a capital letter (Hint: ^)
- In the head, print the lines whose Summary column begins with a capital letter
- For the whole file, count the lines whose Summary column begins with a capital letter
- In the Summary column, count the lines that end with "great" (case-insensitive) (Hint: $)
- In the Text column, find how many reviews contain at least two digits (Hint: \d.*\d works with grep -P; with grep -E, write [0-9].*[0-9])
Relevant topics: grep, -E, regular expressions (regex), pipelines
5a. Code used to solve all the steps above
5b. Written answer for 5.2, 5.5, 5.6, 5.7
5c. Output from 5.2, 5.3, 5.4, 5.5, 5.6, 5.7
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- firstname_lastname_project3.ipynb
You must document your work, with comments about each solution. All of your work must be your own, and any outside sources (people, internet pages, generative AI, etc.) must be cited properly in the project template. Please take the time to double check your work. See here for instructions on how to double check this. You will not receive full credit if your