STAT 39000: Project 4 — Fall 2020

Motivation: The need to search files and datasets based on the text held within is common during various parts of the data wrangling process. grep is an extremely powerful UNIX tool that allows you to do so using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated, even professionals can make critical mistakes. With that being said, learning some of the basics is an incredible tool that will come in handy regardless of the language you are working in.

Context: We’ve just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep, and experiment with regular expressions using grep, R, and later on, Python.

Scope: grep, regular expression basics, utilizing regular expression tools in R and Python

Learning objectives
  • Use grep to search for patterns within a dataset.

  • Use cut to section off and slice up data from the command line.

  • Use wc to count the number of lines of input.

You can find useful examples that walk you through relevant material in The Examples Book:

It is highly recommended to read through, search, and explore these examples to help solve problems in this project.

I would highly recommend using single quotes ' to surround your regular expressions. Double quotes can have unexpected behavior due to some shell’s expansion rules. In addition, pay close attention to escaping certain characters in your regular expressions.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/movies_and_tv/the_office_dialogue.csv

A public sample of the data can be found here: the_office_dialogue.csv

Answers to questions should all be answered using the full dataset located on Scholar. You may use the public samples of data to experiment with your solutions prior to running them using the full dataset.

grep stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep, we will be using it with textual data. You can read about and see examples of grep here.

Question

Question 1

Login to Scholar and use grep to find the dataset we will use this project. The dataset we will use is the only dataset to have the text "Bears. Beets. Battlestar Galactica.". What is the name of the dataset and where is it located?

Items to submit
  • The grep command used to find the dataset.

  • The name and location in Scholar of the dataset.

  • Use grep and grepl within R to solve a data-driven problem.

Question 2

grep prints the line that the text you are searching for appears in. In project 3 we learned a UNIX command to quickly print the first n lines from a file. Use this command to get the headers for the dataset. As you can see, each line in the tv show is a row in the dataset. You can count to see which column the various bits of data live in.

Write a line of UNIX commands that searches for "bears. beets. battlestar galactica." and, rather than printing the entire line, prints only the character who speaks the line, as well as the line itself.

The result if you were to search for "bears. beets. battlestar galactica." should be:

"Jim","Fact. Bears eat beets. Bears. Beets. Battlestar Galactica."

One method to solve this problem would be to pipe the output from grep to cut.

Items to submit
  • The line of UNIX commands used to find the character and original dialogue line that contains "bears. beets. battlestar galactica.".

Question 3

Find all of the lines where Pam is called "Beesley" instead of "Pam" or "Pam Beesley".

A negative lookbehind would be one way to solve this, in order to use a negative lookbehind with grep make sure to add the -P option. In addition, make sure to use single quotes to make sure your regular expression is taken literally. If you use double quotes, variables are expanded.

Regular expressions are really a useful semi-language-agnostic tool. What this means is regardless of the programming language you are using, there will be some package that allows you to use regular expressions. In fact, we can use them in both R and Python! This can be particularly useful when dealing with strings. Load up the dataset you discovered in (1) using read.csv. Name the resulting data.frame dat.

Items to submit
  • The UNIX command used to solve this problem.

Question 4

The text_w_direction column in dat contains the characters' lines with inserted direction that helps characters know what to do as they are reciting the lines. Direction is shown between square brackets "[" "]". In this two-part question, we are going to use regular expression to detect the directions.

(a) Create a new column called has_direction that is set to TRUE if the text_w_direction column has direction, and FALSE otherwise. Use the grepl function in R to accomplish this.

Make sure all opening brackets "[" have a corresponding closing bracket "]".

Think of the pattern as any line that has a [, followed by any amount of any text, followed by a ], followed by any amount of any text.

(b) Modify your regular expression to find lines with 2 or more sets of direction. How many lines have more than 2 directions? Modify your code again and find how many have more than 5.

We count the sets of direction in each line by the pairs of square brackets. The following are two simple example sentences.

This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

Your solution to part (a) should find both lines a match. However, in part (b) we want the regular expression pattern to find only lines with 2+ directions, so the first line would not be a match.

In our actual dataset, for example, dat$text_w_direction[2789] is a line with 2 directions.

Items to submit
  • The R code and regular expression used to solve the first part of this problem.

  • The R code and regular expression used to solve the second part of this problem.

  • How many lines have >= 2 directions?

  • How many lines have >= 5 directions?

Question 5

Use the str_extract_all function from the stringr package to extract the direction(s) as well as the text between direction(s) from each line. Put the strings in a new column called direction.

This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

In this question, your solution may have extracted:

[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]

It is okay to keep the text between neighboring pairs of "[" and "]" for the second line.

Items to submit
  • The R code used to solve this problem.

OPTIONAL QUESTION

Repeat (5) but this time make sure you only capture the brackets and text within the brackets. Save the results in a new column called direction_correct. You can test to see if it is working by running the following code:

dat$direction_correct[747]
This is a line with [emphasize this] only 1 direction!
This is a line with [emphasize this] 2 sets of direction, do you see the difference [shrug].

In (5), your solution may have extracted:

[emphasize this]
[emphasize this] 2 sets of direction, do you see the difference [shrug]

This is ok for (5). In this question, however, we want to fix this to only extract:

[emphasize this]
[emphasize this] [shrug]

This regular expression will be hard to read.

The pattern we want is: literal opening bracket, followed by 0+ of any character other than the literal [ or literal ], followed by a literal closing bracket.

Items to submit
  • The R code used to solve this problem.