TDM 20100: Project 3 — 2022

Motivation: The need to search files and datasets based on the text they hold is common throughout the data wrangling process; after all, projects in industry will not typically hand you a path to your dataset and call it a day. grep is an extremely powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for describing patterns to search for. They can get very complicated, and even professionals make critical mistakes with them. With that being said, learning the basics gives you an incredibly useful tool that will come in handy regardless of the language you are working in.

Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, and form in all major programming languages. Even if you are less interested in UNIX tools (which you shouldn't be; they can be awesome), you should definitely take the time to learn regular expressions.

Context: We’ve just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep, and experiment with regular expressions using grep, R, and later on, Python.

Scope: grep, regular expression basics, utilizing regular expression tools in R and Python

Learning Objectives
  • Use grep to search for patterns within a dataset.

  • Use cut to section off and slice up data from the command line.

  • Use wc to count the number of lines of input.

Make sure to read about and use the template found here, and to review the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /anvil/projects/tdm/data/consumer_complaints/complaints.csv

Questions

Question 1

grep stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep, we will be using it with textual data.
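In its simplest form, grep takes a pattern and a file, and prints every line of the file that matches. The filename below is just a placeholder:

# print every line of somefile.txt that contains the literal text "pattern"
grep 'pattern' somefile.txt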

Let's assume for a second that we didn't provide you with the location of this project's dataset, and that you didn't know the name of the file either. With all of that being said, you do know that it is the only dataset with the text "That's the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently." in it.

When you search for this sentence in the file, make sure to type the apostrophe in "That's" as a regular ASCII single quote ('); otherwise, you will not find the sentence. Or, just search for a unique part of the sentence that is unlikely to appear in another file.

Write a grep command that finds the dataset. You can start in the /anvil/projects/tdm/data directory to reduce the amount of text being searched. In addition, use a wildcard to narrow the search to only the directories inside /anvil/projects/tdm/data whose names start with con. Just know that you'd eventually find the file without using the wildcard, but we don't want to waste your time.

Use man to read about some of grep's options. For example, you'll want to search recursively through the entire contents of the directories starting with con.
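As a rough sketch (not the answer itself), a recursive search over a wildcard-limited set of directories generally takes the following shape; the pattern and path below are placeholders:

# -r searches recursively through every file under the matching directories;
# the shell expands con* to every directory whose name starts with con
grep -r 'some unique text' /some/data/con*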

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

In the previous project, you learned about a command that can quickly print out the first n lines of a file. A CSV file typically has a header row explaining what data each column holds. Use the command you learned to print out the first line of the file, and only the first line.

Great, now that you know what each column holds, repeat question (1), but format the output so that it shows only the complaint_id, consumer_complaint_narrative, and state columns. Print only the first 100 lines (using head) so our notebook is not too full of text.

Now, use cat, head, tail, and cut to isolate those same 3 columns for the single line where we heard about the "fraudy fraudulent fraud".

You can find the exact line of the file where the "fraudy fraudulent fraud" occurs by using grep's -n option. That will tell you the line number, which can then be used with head and tail to isolate the single line.
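For reference, a generic version of this trick looks something like the following; the filename, pattern, and line number are placeholders:

# print only the first line (the header) of a csv file
head -n 1 somefile.csv

# -n prefixes each match with its line number, e.g. "1234:matched text..."
grep -n 'some pattern' somefile.csv

# isolate line 1234: keep the first 1234 lines, then keep only the last of those
head -n 1234 somefile.csv | tail -n 1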

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

Imagine a scenario where we are dealing with a much bigger dataset. Imagine that we live in the southeast and are really only interested in analyzing the data for Florida, Georgia, Mississippi, Alabama, and South Carolina. In addition, we are only interested in the consumer_complaint_narrative, state, tags, and complaint_id columns.

Use UNIX tools to, in one line, create a new dataset called southeast.csv that only contains the data for the five states mentioned above, and only the columns listed above.

Be careful you don't accidentally match lines containing a word like "CAPITAL" (AL is the state abbreviation for Alabama and appears inside the word "CAPITAL").

How many rows of data remain? How many megabytes is the new file? Use cut to isolate just the values we ask for: print only the number of rows, and only the size of the file (in megabytes).

this:

20M

not this:

-rw-r--r-- 1 x-kamstut x-tdm-admin 20M Dec  13 10:59 /home/x-kamstut/southeast.csv
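One possible shape for these commands follows, with placeholder column numbers and a placeholder pattern (check the header row to find the real ones). The size command shown uses du, which reports disk usage and is just one of several ways to get a size in megabytes:

# build the new dataset in one line: cut the columns, filter the rows, redirect
cut -d, -f1,2,3,4 complaints.csv | grep -E 'state pattern here' > $HOME/southeast.csv

# print only the number of rows (reading from stdin keeps wc from printing the filename)
wc -l < $HOME/southeast.csv

# du prints "size<TAB>path"; cut's default delimiter is a tab, so -f1 keeps the size
du -h $HOME/southeast.csv | cut -f1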
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 4

We want to isolate some of our southeast complaints. Return rows from our new dataset, southeast.csv, that contain one of the following words: "wow", "irritating", or "rude", followed by at least 1 exclamation mark. Do this with a single grep command. Ignore case (match the words regardless of how they are capitalized). Limit your output to only 5 rows (using head).
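As a hedged sketch, such a command could take the following shape; the words here are placeholders rather than the ones in the question:

# -i ignores case; -E enables extended regexes, where (a|b) means "a or b"
# and + means "one or more of the preceding item"
grep -iE '(foo|bar|baz)!+' $HOME/southeast.csv | head -n 5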

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

If you pay attention to the consumer_complaint_narrative column in our new dataset, southeast.csv, you’ll notice that some of the narratives contain dollar amounts in curly braces { and }. Use grep to find the narratives that contain at least one dollar amount enclosed in curly braces. Use head to limit output to only the first 5 results.

Use the option -E to use extended regular expressions. This will make your regular expressions less messy (less escaping).
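To see what that means in practice, compare the same "one or more a's" pattern written both ways; the filename is a placeholder, and note that \+ in basic mode is a GNU grep extension:

# basic regular expressions: + must be escaped to mean "one or more"
grep 'a\+' somefile.txt

# extended regular expressions: + means "one or more" without escaping
grep -E 'a+' somefile.txt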

There are instances like {>= $1000000} and { XXXX }. The first example qualifies, but the second doesn’t. Make sure the following are matched:

  • {$0.00}

  • { $1,000.00 }

  • {>= $1000000}

  • { >= $1000000 }

And that the following are not matched:

  • { XXX }

  • {XXX}

Regex is hard. Try the following logic; an analogous worked example follows the list.

  1. Match a "{"

  2. Match 0 or more of any character that isn’t a-z, A-Z, or 0-9

  3. Match 1 or more "$"

  4. Match 1 or more of any character that isn’t "}"

  5. Match "}"
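To illustrate those building blocks without giving away the answer, here is an analogous pattern that matches dollar amounts in square brackets instead of curly braces; each piece corresponds to one numbered step above:

# \[             step 1: a literal opening bracket
# [^a-zA-Z0-9]*  step 2: zero or more characters that are not letters or digits
# \$+            step 3: one or more literal dollar signs
# [^]]+          step 4: one or more characters that are not a closing bracket
# \]             step 5: a literal closing bracket
echo '[ $1,000.00 ]' | grep -E '\[[^a-zA-Z0-9]*\$+[^]]+\]'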

To verify your answer, the following command should produce the following result.

grep -E 'regexhere' $HOME/southeast.csv | head -n 5 | cut -d, -f4

result:

3185125
3184467
3183547
3183544
3182879
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.