TDM 10100: Project 2 - Indexing in R

Project Objectives

Motivation: R is one of the most popular tools for data analysis. Indexing and manipulating values in R is a very useful skill.

Context: We will load a data frame into R, and will learn about and practice indexing the data in many ways.

Scope: R, Data Exploration, Indexing, Character Manipulation, Conditionals

Learning Objectives

Get comfortable with extracting data in R that will satisfy various conditions
Learn how to use indices in R
Learn base levels about plotting in R
Apply these techniques with real-world data

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset

/anvil/projects/tdm/data/movies_and_tv/rotten_tomatoes_movies.csv

This dataset contains information about the movies and tv shows that have been rated on Rotten Tomatoes. Rotten Tomatoes is a rating system very similar to IMDB that tracks movies and tv shows by their title, rating, genre, cast, and more. Unlike traditional 1-10 scores, Rotten Tomatoes has a unique rating metric:

Tomatometer: Percentage of critics who gave positive reviews.
Audience Score: Percentage of general users who noted positive experiences.

Rotten Tomatoes also classifies movies into three critic-based categories:

Rotten: When a movie or TV show has less than 60% of positive reviews, it is considered rotten
Fresh: When a movie or TV show has at least 60% of positive reviews, it is considered fresh
Certified Fresh: A special designation that marks only the best-reviewed movies and TV shows with scores of at least 75%, at least 5 reviews from top critics, and many more review requirements.

This dataset contains 23 columns with over 10,000 rows of information.
Some key columns we will be looking at include:

movie_title - Title of the movie
rating - MPAA film rating system for suggested audience of the content
genre - Category for motion pictures based on the content’s theme
directors - Person/people who directed the movie
studio_name - Studio which produced the movie
tomatometer_status - Critics' review score on a range of Rotten to Certified Fresh tomatoes

Why Rotten Tomatoes?

This dataset is an ideal starting point for learning R because:

Everyone has opinions about movies, so this topic easily allows for exploring already familiar trends
This is categorical data, which is perfect for practicing indexing
It is slightly messy, which gives experience in working with real-world data

What kinds of movies usually get this Certified Fresh rating? Are PG movies more likely to get Fresh reviews than R-rated ones? In this project, you’ll explore questions like these, while also building fluency with R tools like sort(), table(), grep(), sum(), and more.

As you work through each question, you’ll sharpen your skills in indexing, pattern matching, and plotting - all while discovering trends behind the movies we love (and hate).

Questions

Question 1 (2 points)

Import the Rotten Tomatoes Movies and TV Shows dataset from the file /anvil/projects/tdm/data/movies_and_tv/rotten_tomatoes_movies.csv into a data frame called myDF. Check out what data it contains by using the head(), tail(), and dim() functions.

The columns cast and movie_info contain a lot of words per row entry. This makes looking at the data harder, so let’s remove them from what gets displayed.

Try using myDF <- myDF[ , !names(myDF) %in% c("cast", "movie_info")] to exclude these columns.

What are some of the columns? What are the dimensions of this dataset (without the two columns)?

Now we will look at the rating column. It would be interesting to see how evenly these different rating levels are distributed across the 16638 rows of movie data. Making a table of the rating column will show us this!

Another way to become familiar with the data is by visualizing it. We will start with a basic R visualization function, plot(). Let’s use plot() to visualize the table we just created. Do the resulting lines look reasonable?

When plotting, it’s important to ensure the data is easily readable and properly labeled. The individual points should be visable with space around them; labels should be easily interpreted. Each plot should have a title, and both axes should be clearly named.

Plotting Details:

main = Title
xlab = Label along the x-axis
ylab = Label along the y-axis
col = Color
lwd = Line weight
pch = Shape of data points

And more.

Read more about customizing plots in R here.
Read about gradient color palettes in R here.

Please use read.csv() to read your data for this project as following:

myDF <- read.csv('/anvil/projects/tdm/data/movies_and_tv/rotten_tomatoes_movies.csv')

Deliverables

1.1 The count of the rows and columns in this dataset (excluding the cast and movie_info columns)
1.2 Table of the rating column
1.3 Plot of the table of the rating column with title, axis labels, and at least one other customization

Question 2 (2 points)

Take another look at the table() of the rating column. How many unique categories are there?

It is easy to see that there are 8 different film rating types. But what about when there are many more than 8 entries in a table? Let us try to see the same for another variable, directors.

It is not always easy to view all levels at a glance using the table function. In fact, printing all of them might even crash the kernel. Fortunately, there is a simpler and safer alternative: checking the length of the level vectors instead of listing them.

Try using the length() function here to get the count of the different ratings types.
Use length() on the tables of the directors and studio_name columns as well
DO NOT print out the entire tables for these columns as there are thousands of values

This is good, but we may also want to see which item from the table is the most/least populated. Wrapping our current table() statement in a sort() function, and then looking at the head() and the tail() will show us these values.

It is OK when there is a blank above a value in the tables. This just means this is a count of the occurrences of when there was an item left blank rather than entered with a name when being added to the dataset

If you prefer not to see the empty string level in the output, you can remove it using the following code:

tail(sort(table(myDF$directors[myDF$directors != ""])))

The only difference here is that we’re adding a condition inside the brackets: myDF$directors != "". This filters out any entries that are empty strings ("") before creating the table, so they won’t appear in the result.

Deliverables

2.1 Counts of the items in the tables from the rating, directors, and studio_name columns
2.2 Which person has directed the most projects?
2.3 Which studio has produced the most films?

Question 3 (2 points)

The movies (and tv shows!) in this dataset each are listed with an entry in the genre column. How would we go about finding the total number of entries in this dataset for a specific genre?

To start, let’s find all of the entries that are specifically 'Comedy'.

There are many different formats for how the movies are categorized in the dataset: 'Comedy, Drama', 'Comedy, Kids & Family', etc. We are currently looking for where it is listed as 'Comedy' exactly.

Using the '==' operator shows the results for when it is true that 'a' == 'b'. But trying myDF$genre == "Comedy", and looking at the head of this shows confusing results full of FALSE and TRUE values. Each of these are related to a binary value:

TRUE = 1
FALSE = 0

Please go through and get the sum() of where the genre being 'Comedy' is TRUE - this should give us what we are looking for.

But what about when the genre is not just listed as a singular category? The sum we just found is only a part of the movies in this dataset that have this genre. It is important that we know how to find the entire count rather than just where it is an exclusive listing. We need some specific functions for this aim in R. Please go through and read about grep and grepl here. (Feel free to read other sources as well!)

Run grepl("Comedy", myDF$genre). This still outputs in the TRUE and FALSE results for each row of the dataset (as before), but these results are different. Why? If we look back at the head() of the dataset, we can see that the first entry’s genre is 'Action & Adventure, Comedy, Drama, Science Fiction & Fantasy', so this was not counted as TRUE before when we were not using grepl.

To see the results of which rows are TRUE only, run grep("Comedy", myDF$genre).

Use caution when running outputting code and always test with the head() or tail() function before printing.

To print the total count of these results, use

sum() for grepl
length() for grep

While grepl returns logical values indicating whether a pattern is found (TRUE or FALSE), grep returns the positions (indices) of the matches.

We could find the number of rows of specifically not exclusively 'Comedy' listings, OR we could try a really cool grepl statement.
Try showing a few rows of sort(table(myDF$genre[grepl("Comedy", myDF$genre) & (myDF$genre != "Comedy")])). After looking at this table, remove the sort() and table() and get the length() for the count of these rows.

Deliverables

3.1 Count of rows where the genre is: exclusively listed as 'Comedy', contains multiple genres including 'Comedy', and finally all results that contain 'Comedy'. (If done correctly, the count of the first two instances combined should equal the total from the third instance.)
3.2 What 'Comedy' category had the most counts when excluding entries that were exclusively 'Comedy'?
3.3 Count of rows where the genre is: exclusively listed as 'Romance', contains multiple genres including 'Romance', and finally all results that contain 'Romance'. (If done correctly, the count of the first two instances combined should equal the total from the third instance.)

Question 4 (2 points)

Let’s go back to our table of the rating column.
This table shows us how many times each of the rating types corresponds to a row of the dataset.

     G   NC17     NR     PG  PG-13 PG-13)      R     R)
   652     37   4862   2062   2885      3   6136      1

This is not the most lovely table. Why? Look at the contents. There is 'G', 'NC17', 'NR', 'PG', 'PG-13', 'PG-13)', 'R', and 'R)'. The value counts of 'PG-13)' and 'R)' are not high because these are clearly not meant to be their own rating type. But this is just a case of where humans have been humans and have made a typo. This happens a lot in real-world data. Later in the semester, we will go on to show how to clean datasets by merging values like this with their true values, but in this project, please run myDF$rating[myDF$rating == "R)"] <- "R" and myDF$rating[myDF$rating == "PG-13)"] <- "PG-13" to clean up these values. Essentially, we are correcting typos by replacing them with the accurate versions.

We will also take a look at the table of the tomatometer_status column. There are just three categories here! What could we possibly find from this?

Let’s make a new table. This table should contain BOTH the rating column and the tomatometer_status column. We will save this table as the variable 'table_data'.

Extra:

Read about creating tables in R here. Run a page search for the Two-Way Tables section.

Taking our 'table_data', we will now create a plot. Here, we will use a plotting method that is not as common as a typical line or scatter plot.

For best results in this question, create the table with the rating column first and then the tomatometer_status column. This will help with how our plot ends up looking.

We are going to be using the mosaicplot() function to create our visualization. Read about mosaic plots in R here.

Later we will learn about using ggplot2 to make more complex plots, but at this point we will stick to mosaicplot() and other plotting methods in basic R.

Deliverables

4.1 One table with both the rating and tomatometer_status columns
4.2 Plot of 'table_data' with customized title, axis labels, and three chosen colors (one per tomatometer status)
4.3 Learn about other plotting methods in R. What visualization type(s) do you find interesting/insightful for future usage?

Question 5 (2 points)

We will continue to work with these two columns (rating and tomatometer_status). In Question 4, we showed the table of these columns put together. Now we will use the paste() function to merge them. This allows us to see the numerical values for the combined columns. Please display the sorted table of these two columns once they have been pasted together.

If these columns had more unique values, it would be extremely important to use the head() or tail() function when displaying this merged table. BUT in this case, one column has 3 choices, and the other had 6 (once cleaned), so it is alright to display everything here.

From the table, choose one combination you would like to continue learning about. The following examples will use the 'PG' and 'Certified Fresh' rating combination.
Let’s create a subset of our data. This subset will contain only the data where the rating column shows 'PG' and the tomatometer_status is Certified Fresh.

Try subset(myDF, rating=="PG" & tomatometer_status=="Certified Fresh"). BUT make sure to only view the first or last few rows of this selection

Now that we know a bit about subsetting, let’s do one last thing. Please subset the original myDF data again twice to find where the movie_title entry is listed as each of your two favorite movies (or shows).

Deliverables

5.1 What rating-and-tomatometer_status pairing has the most counts in the table where these columns are pasted together?
5.2 Display the head of the subsetted data with rating and tomatometer_status filters
5.3 Display the rows where the movie_title is a movie of your choice for two entries.

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit

firstname_lastname_project2.ipynb

It is necessary to document your work, with comments about each solution. All of your work needs to be your own work, with citations to any source that you used. Please make sure that your work is your own work, and that any outside sources (people, internet pages, generating AI, etc.) are cited properly in the project template.

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not.

Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.