TDM 10100: Project 2 - Indexing in R
Project Objectives
Motivation: R is one of the most popular tools for data analysis. Indexing and manipulating values in R is a very useful skill.
Context: We will load a data frame into R, and will learn about and practice indexing the data in many ways.
Scope: R, Data Exploration, Indexing, Character Manipulation, Conditionals
Dataset
-
/anvil/projects/tdm/data/movies_and_tv/rotten_tomatoes_movies.csv
This dataset contains information about the movies and tv shows that have been rated on Rotten Tomatoes. Rotten Tomatoes is a rating system very similar to IMDB that tracks movies and tv shows by their title, rating, genre, cast, and more. Unlike traditional 1-10 scores, Rotten Tomatoes has a unique rating metric:
-
Tomatometer: Percentage of critics who gave positive reviews.
-
Audience Score: Percentage of general users who noted positive experiences.
Rotten Tomatoes also classifies movies into three critic-based categories:
-
Rotten
: When a movie or TV show has less than 60% of positive reviews, it is considered rotten -
Fresh
: When a movie or TV show has at least 60% of positive reviews, it is considered fresh -
Certified Fresh
: A special designation that marks only the best-reviewed movies and TV shows with scores of at least 75%, at least 5 reviews from top critics, and many more review requirements.
This dataset contains 23 columns with over 10,000 rows of information.
Some key columns we will be looking at include:
-
movie_title
- Title of the movie -
rating
- MPAA film rating system for suggested audience of the content -
genre
- Category for motion pictures based on the content’s theme -
directors
- Person/people who directed the movie -
studio_name
- Studio which produced the movie -
tomatometer_status
- Critics' review score on a range of Rotten to Certified Fresh tomatoes
Why Rotten Tomatoes?
This dataset is an ideal starting point for learning R because:
-
Everyone has opinions about movies, so this topic easily allows for exploring already familiar trends
-
This is categorical data, which is perfect for practicing indexing
-
It is slightly messy, which gives experience in working with real-world data
What kinds of movies usually get this Certified Fresh
rating? Are PG movies more likely to get Fresh reviews than R-rated ones? In this project, you’ll explore questions like these, while also building fluency with R tools like sort()
, table()
, grep()
, sum()
, and more.
As you work through each question, you’ll sharpen your skills in indexing, pattern matching, and plotting - all while discovering trends behind the movies we love (and hate).
Questions
Question 1 (2 points)
Import the Rotten Tomatoes Movies and TV Shows dataset from the file /anvil/projects/tdm/data/movies_and_tv/rotten_tomatoes_movies.csv
into a data frame called myDF. Check out what data it contains by using the head()
, tail()
, and dim()
functions.
The columns Try using |
What are some of the columns? What are the dimensions of this dataset (without the two columns)?
Now we will look at the rating
column. It would be interesting to see how evenly these different rating levels are distributed across the 16638 rows of movie data. Making a table of the rating
column will show us this!
Another way to become familiar with the data is by visualizing it. We will start with a basic R visualization function, plot()
. Let’s use plot()
to visualize the table we just created. Do the resulting lines look reasonable?
When plotting, it’s important to ensure the data is easily readable and properly labeled. The individual points should be visable with space around them; labels should be easily interpreted. Each plot should have a title, and both axes should be clearly named. |
Plotting Details:
-
main
= Title -
xlab
= Label along the x-axis -
ylab
= Label along the y-axis -
col
= Color -
lwd
= Line weight -
pch
= Shape of data points
And more.
1.1 The count of the rows and columns in this dataset (excluding the cast
and movie_info
columns)
1.2 Table of the rating
column
1.3 Plot of the table of the rating
column with title, axis labels, and at least one other customization
Question 2 (2 points)
Take another look at the table()
of the rating
column. How many unique categories are there?
It is easy to see that there are 8 different film rating types. But what about when there are many more than 8 entries in a table? Let us try to see the same for another variable, directors
.
It is not always easy to view all levels at a glance using the table
function. In fact, printing all of them might even crash the kernel. Fortunately, there is a simpler and safer alternative: checking the length of the level vectors instead of listing them.
-
Try using the
length()
function here to get the count of the differentratings
types. -
Use
length()
on the tables of thedirectors
andstudio_name
columns as well -
DO NOT print out the entire tables for these columns as there are thousands of values
This is good, but we may also want to see which item from the table is the most/least populated. Wrapping our current table()
statement in a sort()
function, and then looking at the head()
and the tail()
will show us these values.
It is OK when there is a blank above a value in the tables. This just means this is a count of the occurrences of when there was an item left blank rather than entered with a name when being added to the dataset If you prefer not to see the empty string level in the output, you can remove it using the following code:
The only difference here is that we’re adding a condition inside the brackets: |
2.1 Counts of the items in the tables from the rating
, directors
, and studio_name
columns
2.2 Which person has directed the most projects?
2.3 Which studio has produced the most films?
Question 3 (2 points)
The movies (and tv shows!) in this dataset each are listed with an entry in the genre
column. How would we go about finding the total number of entries in this dataset for a specific genre?
To start, let’s find all of the entries that are specifically 'Comedy'
.
There are many different formats for how the movies are categorized in the dataset: |
Using the '=='
operator shows the results for when it is true that 'a' == 'b'
. But trying myDF$genre == "Comedy"
, and looking at the head of this shows confusing results full of FALSE and TRUE values. Each of these are related to a binary value:
-
TRUE = 1
-
FALSE = 0
Please go through and get the sum()
of where the genre being 'Comedy'
is TRUE
- this should give us what we are looking for.
But what about when the genre is not just listed as a singular category? The sum we just found is only a part of the movies in this dataset that have this genre. It is important that we know how to find the entire count rather than just where it is an exclusive listing. We need some specific functions for this aim in R. Please go through and read about grep
and grepl
here. (Feel free to read other sources as well!)
Run grepl("Comedy", myDF$genre)
. This still outputs in the TRUE
and FALSE
results for each row of the dataset (as before), but these results are different. Why? If we look back at the head()
of the dataset, we can see that the first entry’s genre is 'Action & Adventure, Comedy, Drama, Science Fiction & Fantasy'
, so this was not counted as TRUE
before when we were not using grepl
.
To see the results of which rows are TRUE
only, run grep("Comedy", myDF$genre)
.
Use caution when running outputting code and always test with the |
To print the total count of these results, use
-
sum()
for grepl -
length()
for grep
While grepl
returns logical values indicating whether a pattern is found (TRUE
or FALSE
), grep
returns the positions (indices) of the matches.
We could find the number of rows of specifically not exclusively 'Comedy'
listings, OR we could try a really cool grepl statement.
Try showing a few rows of sort(table(myDF$genre[grepl("Comedy", myDF$genre) & (myDF$genre != "Comedy")]))
. After looking at this table, remove the sort()
and table()
and get the length()
for the count of these rows.
3.1 Count of rows where the genre is: exclusively listed as 'Comedy'
, contains multiple genres including 'Comedy'
, and finally all results that contain 'Comedy'
. (If done correctly, the count of the first two instances combined should equal the total from the third instance.)
3.2 What 'Comedy'
category had the most counts when excluding entries that were exclusively 'Comedy'
?
3.3 Count of rows where the genre is: exclusively listed as 'Romance'
, contains multiple genres including 'Romance'
, and finally all results that contain 'Romance'
. (If done correctly, the count of the first two instances combined should equal the total from the third instance.)
Question 4 (2 points)
Let’s go back to our table of the rating
column.
This table shows us how many times each of the rating types corresponds to a row of the dataset.
G NC17 NR PG PG-13 PG-13) R R) 652 37 4862 2062 2885 3 6136 1
This is not the most lovely table. Why? Look at the contents. There is 'G'
, 'NC17'
, 'NR'
, 'PG'
, 'PG-13'
, 'PG-13)'
, 'R'
, and 'R)'
. The value counts of 'PG-13)'
and 'R)'
are not high because these are clearly not meant to be their own rating type. But this is just a case of where humans have been humans and have made a typo. This happens a lot in real-world data. Later in the semester, we will go on to show how to clean datasets by merging values like this with their true values, but in this project, please run myDF$rating[myDF$rating == "R)"] <- "R"
and myDF$rating[myDF$rating == "PG-13)"] <- "PG-13"
to clean up these values. Essentially, we are correcting typos by replacing them with the accurate versions.
We will also take a look at the table of the tomatometer_status
column. There are just three categories here! What could we possibly find from this?
Let’s make a new table. This table should contain BOTH the rating
column and the tomatometer_status
column. We will save this table as the variable 'table_data'
.
Extra:
Read about creating tables in R here. Run a page search for the Two-Way Tables section.
Taking our 'table_data'
, we will now create a plot. Here, we will use a plotting method that is not as common as a typical line or scatter plot.
For best results in this question, create the table with the |
We are going to be using the mosaicplot()
function to create our visualization. Read about mosaic plots in R here.
Later we will learn about using ggplot2 to make more complex plots, but at this point we will stick to |
4.1 One table with both the rating
and tomatometer_status
columns
4.2 Plot of 'table_data'
with customized title, axis labels, and three chosen colors (one per tomatometer status)
4.3 Learn about other plotting methods in R. What visualization type(s) do you find interesting/insightful for future usage?
Question 5 (2 points)
We will continue to work with these two columns (rating
and tomatometer_status
). In Question 4, we showed the table of these columns put together. Now we will use the paste()
function to merge them. This allows us to see the numerical values for the combined columns. Please display the sorted table of these two columns once they have been pasted together.
If these columns had more unique values, it would be extremely important to use the |
From the table, choose one combination you would like to continue learning about. The following examples will use the 'PG'
and 'Certified Fresh'
rating combination.
Let’s create a subset of our data. This subset will contain only the data where the rating
column shows 'PG'
and the tomatometer_status
is Certified Fresh
.
Try |
Now that we know a bit about subsetting, let’s do one last thing. Please subset the original myDF data again twice to find where the movie_title
entry is listed as each of your two favorite movies (or shows).
5.1 What rating
-and-tomatometer_status
pairing has the most counts in the table where these columns are pasted together?
5.2 Display the head of the subsetted data with rating
and tomatometer_status
filters
5.3 Display the rows where the movie_title
is a movie of your choice for two entries.
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
-
firstname_lastname_project2.ipynb
You must double check your You will not receive full credit if your |