STAT 29000: Project 4 — Fall 2020

Motivation: The need to search files and datasets based on the text held within is common during various parts of the data wrangling process. grep is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, even professionals can make critical mistakes. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.

Context: We’ve just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep, and experiment with regular expressions using grep, R, and later on, Python.

Scope: grep, regular expression basics, utilizing regular expression tools in R and Python

Learning objectives
  • Use grep to search for patterns within a dataset.

  • Use cut to section off and slice up data from the command line.

  • Use wc to count the number of lines of input.

You can find useful examples that walk you through relevant material in The Examples Book:

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

I would highly recommend using single quotes ' to surround your regular expressions. Double quotes can have unexpected behavior due to some shell’s expansion rules. In addition, pay close attention to #faq-escape-characters[escaping] certain characters in your regular expressions.

Dataset

The following questions will use the dataset the_office_dialogue.csv found in Scholar under the data directory /class/datamine/data/. A public sample of the data can be found here: the_office_dialogue.csv

Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.

grep stands for globally search for a regular expression and print matching lines. As such, to best demonstrate grep, we will be using it with textual data. You can read about and see examples of grep here.

Questions

Question 1

Login to Scholar and use grep to find the dataset we will use this project. The dataset we will use is the only dataset to have the text "Bears. Beets. Battlestar Galactica." Where is it located exactly?

Items to submit
  • The grep command used to find the dataset.

  • The name and location in Scholar of the dataset.

Question 2

grep prints the line that the text you are searching for appears in. In project 3 we learned a UNIX command to quickly print the first n lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the tv show is a row in the dataset. You can count to see which column the various bits of data live in.

Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself.

The result if you were to search for "bears. beets. battlestar galactica." should be:

"Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica."

One method to solve this problem would be to pipe the output from grep to cut.

Items to submit
  • The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.".

Question 3

This particular dataset happens to be very small. You could imagine a scenario where the file is many gigabytes and not easy to load completely into R or Python. We are interested in learning what makes Jim and Pam tick as a couple. Use a line of UNIX commands to create a new dataset called jim_and_pam.csv (remember, a good place to store data temporarily is /scratch/scholar/$USER). Include only lines that are spoken by either Jim or Pam, or reference Jim or Pam in any way. How many rows of data are in the new file? How many megabytes is the new file (to the nearest 1/10th of a megabyte)?

It is OK if you get an erroneous line where the word "jim" or "pam" appears as a part of another word.

Items to submit
  • The line of UNIX commands used to create the new file.

  • The number of rows of data in the new file, and the accompanying UNIX command used to find this out.

  • The number of megabytes (to the nearest 1/10th of a megabyte) that the new file has, and the accompanying UNIX command used to find this out.

Question 4

Find all lines where either Jim/Pam/Michael/Dwight’s name is followed by an exclamation mark. Use only 1 "!" within your regular expression. How many lines are there? Ignore case (whether or not parts of the names are capitalized or not).

Items to submit
  • The UNIX command(s) used to solve this problem.

  • The number of lines where either Jim/Pam/Michael/Dwight’s name is followed by an exclamation mark.

Question 5

Find all lines that contain the text "that’s what" followed by any amount of any text and then "said". How many lines are there?

Items to submit
  • The UNIX command used to solve this problem.

  • The number of lines that contain the text "that’s what" followed by any amount of text and then "said".

Regular expressions are really a useful semi language-agnostic tool. What this means is regardless of the programming language your are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load up the dataset you discovered in (1) using read.csv. Name the resulting data.frame dat.

Question 6

The text_w_direction column in dat contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" "]". In this two-part question, we are going to use regular expression to detect the directions.

(a) Create a new column called has_direction that is set to TRUE if the text_w_direction column has direction, and FALSE otherwise. Use the grepl function in R to accomplish this.

Make sure all opening brackets "[" have a corresponding closing bracket "]".

Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text.

(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have more than 2 directions? Modify your code again and find how many have more than 5.

We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences.

This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

Your solution to part (a) should find both lines a match. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match.

In our actual dataset, for example, dat$text_w_direction[2789] is a line with 2 directions.

Items to submit
  • The R code and regular expression used to solve the first part of this problem.

  • The R code and regular expression used to solve the second part of this problem.

  • How many lines have >= 2 directions?

  • How many lines have >= 5 directions?

OPTIONAL QUESTION

Use the str_extract_all function from the stringr package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called direction.

This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

In this question, your solution may have extracted:

[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]

(It is okay to keep the text between neighboring pairs of "[" and "]" for the second line.)

Items to submit
  • The R code used to solve this problem.