TDM 10100: Project 3 - Grouping in R
Project Objectives
Motivation: Learning how to efficiently manipulate and summarize data using the `(.)apply` family of functions and the `dplyr` package from the `tidyverse` is a valuable skill. These tools help us simplify complex data operations, reduce repetition, and make analysis faster and easier.
Context: We will be working with two separate dataframes in R, learning about and comparing different methods of grouping.
Scope: R, grouping, `apply` family, `dplyr`
Datasets
- `/anvil/projects/tdm/data/grouping/vgchartz-2024.csv`
- `/anvil/projects/tdm/data/grouping/googleplaystore.csv`
- `/anvil/projects/tdm/data/election/itcont1980.txt` (used in the video)
Video Game Sales
The video game industry has evolved into one of the world’s largest entertainment sectors, generating over $180 billion annually and surpassing both the movie and music industries combined. This market creates rich opportunities for data analysis, as gaming companies are constantly seeking to understand consumer preferences, market trends, and the factors that drive commercial success.
This dataset contains both numerical and categorical variables, making it ideal for developing data science skills by providing numerous cases for us to explore data grouping, analysis, and visualization.
There are 14 columns and 64016 rows of data for us to work with. These columns include:
- `title`: official title of the video game
- `console`: video game console models by many different companies
- `genre`: category for video games based on the content’s theme
- `critic_score`: numerical rating assigned to each game on a scale of 1 to 10, reflecting the overall opinion of the game’s quality
- `total_sales`: approximate number of copies (in millions) sold of each game across all locations
Google Play Store
The Google Play Store, released in 2008 as "Android Market", is a very popular marketplace for users of Android devices. It is used by over 2.5 billion users monthly to discover, download, and update a vast number of applications, including games, movies, books, and more. There are millions of apps and games available, and over 150 billion items have been downloaded from the store in the past year.
There are 13 columns and 10841 rows of data for us to work with. The main columns we will be working with are:
- `Rating`: numerical decimal rating (1-5) showing the average score of each app
- `Reviews`: count of reviews left per app, ranging from 0 to over 10 million
- `Content.Rating`: suggested target audience based on the app’s theme
Questions
As before, please use the |
Question 1 (2 points)
Import the Video Game Sales dataset from the file `/anvil/projects/tdm/data/grouping/vgchartz-2024.csv` into a dataframe called `video_games`.
One thing we might choose to examine is the critic scores. These are important because they create a standardized way to gauge the quality of a game before buying it. We could look at the average critic scores by genre using `tapply()` to determine which genre is usually the most trustworthy for a quality game.
The `tapply()` function has the format `tapply(X, INDEX, FUN, …)`, where
- `X` is a vector to apply a function to
- `INDEX` is a vector to group by
- `FUN` is the function to apply (`mean`, `sum`, …)
- `…` is any other additional arguments beyond `X`, `INDEX`, and `FUN` to pass to the function.
So, we could do `tapply(video_games$critic_score, video_games$genre, mean)` to find the average critic score within each video game genre. It all looks very simple - all one line, not too hard to think about - but it is not the easiest to make sense of. In this case:
We didn’t have any additional arguments, though once you run this code, you will soon see why we need to add one.
The output is not good. It shows us each of the genres, but each with an `NA` value, which is not insightful beyond showing us that the dataset is imperfect. Try adding `na.rm=TRUE` as the additional argument.
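As a minimal sketch of the `na.rm` behavior (using a small made-up dataframe, not the real dataset), note how the extra argument is passed through `…` to `mean`:

```r
# Toy stand-in for the video_games dataframe (made-up values)
games <- data.frame(
  genre = c("Action", "Action", "Sports", "Sports", "Puzzle"),
  critic_score = c(8.0, NA, 6.5, 7.5, 9.0)
)

# Without na.rm, any group containing an NA averages to NA
with_na <- tapply(games$critic_score, games$genre, mean)
with_na       # Action comes back as NA

# na.rm=TRUE is passed through ... to mean(), dropping NAs first
without_na <- tapply(games$critic_score, games$genre, mean, na.rm = TRUE)
without_na    # Action 8.0, Puzzle 9.0, Sports 7.0
```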
Let’s look at some new columns. `total_sales` shows us the number of copies of each video game and console pairing sold worldwide. `console` shows us the gaming system that each game is compatible with.
In a `tapply()` statement with the `total_sales` and `console` columns, use `sum` as the `FUN` argument. This will show the total number of game sales across each console type.
It is often helpful to sort the data when wanting to learn from the output. By default, this is sorted by the console names, alphabetically. In this case, it would be useful for us to sort by the values of the total sales.
Use the `sort()` function on the output of your `tapply()` statement to order the results by their values.
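Since `tapply()` returns a named vector, base R’s `sort()` can reorder it by value. A sketch on made-up data:

```r
# Made-up sales (millions) per game/console pairing
sales <- data.frame(
  console = c("PS4", "PS4", "X360", "Wii"),
  total_sales = c(10.0, 5.0, 8.0, 12.0)
)

totals <- tapply(sales$total_sales, sales$console, sum)
totals                             # ordered alphabetically by console name

sort(totals)                       # ascending by value
sort(totals, decreasing = TRUE)    # descending: PS4 (15) comes first
```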
Sometimes we want to count the number of video games within a certain genre. In this dataset, each row is a game, so we can group by `genre` and get the count of `title` values.
We can do this by taking our two columns (`title`, `genre`) and setting `FUN = length`. Sort these results and find which genre has the most games.
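A quick sketch of counting with `FUN = length`, again on a made-up dataframe:

```r
# Each made-up row is one game
games <- data.frame(
  title = c("A", "B", "C", "D", "E"),
  genre = c("Action", "Action", "Sports", "Action", "Puzzle")
)

# length() simply counts the entries that fall into each genre group
counts <- tapply(games$title, games$genre, length)
sort(counts, decreasing = TRUE)   # Action (3) has the most games
</test>
```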
1.1 Which video game `genre` has the highest average `critic_score`?
1.2 Which `console` has the highest number of `total_sales`?
1.3 In 3-5 sentences, explain in your own words what the `tapply()` function does and why it is useful when working with datasets.
Please refer to Dr. Ward’s video below for more practice. This is the code used to read the data in the video:
Let’s take the money for the donations (in the
|
Question 2 (2 points)
In 1.1, we found the average critic score by genre by running `tapply(video_games$critic_score, video_games$genre, mean, na.rm=TRUE)`. While R’s apply family of functions is really great for grouping data, it can get hard to read very quickly.
Sometimes it is useful to have another set of tools we can use for data analysis: `dplyr` is an R package that is useful when you want to easily work with dataframes. It is a part of the `tidyverse`, a collection of packages that focus on making data analysis simpler and more readable. It provides clear, readable, and efficient tools for common data tasks like:
- Sorting data
- Summarizing groups
- Creating new columns
- Filtering rows
We start by loading the `dplyr` library:
library(dplyr)
The first step when using `dplyr` is to define what data we are working with. We will put `video_games` in a clean cell, followed by our pipe `%>%`, such as
video_games %>%
    the rest of your code...
In |
Then, group the video game data by the `genre` column.
video_games %>%
    group_by(genre)
Any lines after that first line (where the dataframe was declared) usually begin with an indentation. This helps us understand that all of these manipulating functions are working within the same base dataframe (and are going into the same variable definition, when applicable).
If we ran this code in its current state, R would try to print out every single line of the dataset, and it would look the same as it did originally. Why? The data has been grouped at this point, so R knows the data is split into groups, but no calculations have been done yet.
Now we want to get back to the `critic_score` values we were sorting into the `genre` groups in question 1.1. Our next task is to summarize the data. We want to find the average `critic_score` by `genre`, so now we calculate those average scores and display the output grouped by `genre`.
Learn about `summarize()` here. Assign the average `critic_score` (removing NAs) to a new variable called `avg_critic_score`.
video_games %>%
    group_by(genre) %>%
    summarize(YOUR CODE COMES HERE)
What will this look like when we run it all together? We will see two columns, `genre` and `avg_critic_score`. There are twenty genres, so there will be twenty rows. Each row of `avg_critic_score` will be the average critic score for the corresponding genre.
We can see that the output is sorted by the alphabetically-ordered `genre` column. What if we wanted to sort by the score values?
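The pattern from this question can be sketched on a small made-up dataframe (so the real answers are not spoiled):

```r
library(dplyr)

# Made-up stand-in for video_games
games <- data.frame(
  genre = c("Action", "Action", "Sports", "Puzzle"),
  critic_score = c(8.0, NA, 7.0, 9.0)
)

avg_scores <- games %>%
  group_by(genre) %>%
  summarize(avg_critic_score = mean(critic_score, na.rm = TRUE))

avg_scores   # one row per genre, sorted alphabetically by genre
```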
2.1 Table displaying average `critic_score` values grouped by `genre` using `dplyr`
2.2 Some key differences you have noticed so far between `tapply()` and `dplyr`. (At least 2 examples)
2.3 Which method (`tapply()` or `dplyr`) do you prefer so far? Why? (2-3 sentences)
Question 3 (2 points)
Before, we found group averages. We are now going to look at finding the sums of values when grouping data. This is useful when we want to find which video game console has the most total game sales, or how many unique titles there are per genre.
We now have just a little practice using `group_by()` and `summarize()` from the `dplyr` library. These are still very new topics to us.
To start, let’s create a copy of the code we just made in Question 2. BUT we will now group by the `console` column, and we will summarize these groups using `sum()` (in place of `mean()`) on the `total_sales` column. DO NOT run this new code just yet. We will first save it as the variable `console_sales`, and then print just the first few lines to see what the data looks like. To save your output to an object, you can use the assignment arrow as follows:
console_sales <- video_games %>%
    group_by(console) %>%
    THE REST OF THE CODE COMES HERE
This grouping shows us the total number of video game sales across each console type. But the results are not sorted. What if we wanted to easily find the console with the most sales? We can use `arrange()`:
`arrange()` is a function used to sort rows in a dataframe based on the values in one or more columns. By default, this will give results in ascending order (lowest to highest) based on the values in the column we’re sorting by.
We will want to see which consoles had high or low sales. If we want to easily access the highest values, we can sort in descending order by wrapping the column in `desc()` inside `arrange()`.
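A sketch of `arrange()` on made-up per-console totals:

```r
library(dplyr)

# Made-up per-console totals
console_sales <- data.frame(
  console = c("PS4", "X360", "Wii"),
  total_sales = c(15.0, 8.0, 12.0)
)

console_sales %>% arrange(total_sales)          # ascending (the default)

top <- console_sales %>% arrange(desc(total_sales))
top                                             # PS4 is now the first row
```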
What if we wanted to find how many video games are in each genre in this dataset? Each row in the dataset represents one video game, but there isn’t a dedicated column counting the games that we can pair with the `genre` column to do this grouping. This is OK.
There is a function in `dplyr` called `n()` that counts how many rows fall into each group defined by `group_by()`. It is very useful when we want to count the occurrences of values within a group.
Using `group_by()`, `summarize()`, `arrange()`, and `n()` from `dplyr`, we will find the count of how many games have been made in each genre.
The first instinct here may be to run something like:
genre_title_counts <- video_games %>%
    group_by(genre) %>%
    summarize(total_titles = sum(title, na.rm=TRUE)) %>%
    arrange(desc(total_titles))

head(genre_title_counts)
So we would group by genre, get the sum of the titles, and arrange these results to show which genre had the highest sum.
BUT this will not work! Why? The `title` column is made of text values, and `sum()` only works with numeric data. You can’t add the names of games together like numbers!
What actually works
We will need to replace what `total_titles` is equal to. Try using `n()`.
Read about using `n()` here.
Our end code will read like:
- Group by `genre`
- Use `n()` to count how many rows (games) are in each group
- `arrange(desc(…))` to sort the groups from most to least rows
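Those three steps, sketched on a few made-up rows:

```r
library(dplyr)

# Made-up rows: one game per row
games <- data.frame(
  title = c("A", "B", "C", "D"),
  genre = c("Action", "Action", "Sports", "Action")
)

genre_title_counts <- games %>%
  group_by(genre) %>%
  summarize(total_titles = n()) %>%   # n() counts rows per group
  arrange(desc(total_titles))         # most games first

head(genre_title_counts)
```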
3.1 Which console had the most total sales?
3.2 Which genre had the fewest titles?
3.3 Create another summarized grouping between two columns.
Question 4 (2 points)
We are going to calculate the value of each `Rating` divided by the `Reviews` count for each app. This will help normalize the scoring values by combining popularity with quality, which allows us to compare apps fairly, regardless of their number of reviews.
Read in the Google Play Store dataset `/anvil/projects/tdm/data/grouping/googleplaystore.csv` into a dataframe called `app_store`:
app_store <- read.csv('/anvil/projects/tdm/data/grouping/googleplaystore.csv')
To start, check out the `Rating` and `Reviews` columns of `app_store`. Each app in the Google Play Store has a rating (0-5). On its own, this is not the best way to judge an app’s quality, since:
- Apps with high ratings but few reviews will have a larger rating-per-review value,
- Apps with many reviews and even the same rating will have a smaller value.
Finding the rating per review will help us detect under-reviewed apps that have high ratings but a lower vote count. Ideally, we would use these two columns (`Rating` and `Reviews`) to find out how each app was rated relative to the reviews it got.
Cleaning the data
The rating-per-review calculation we want to make is numerical, but looking at the classes of these columns shows that the `Reviews` column is full of character data rather than numeric values.
We can’t go through and look at each unique value in the `Reviews` column because there are hundreds of different values, and a table output of all of them would not be helpful. In an ideal world, this column would be numeric and contain only the characters 0-9. But real-world data is always messy. It is extremely likely that this column contains values like "1,234", "Free", or NA values, where the "," and letter characters cannot be easily converted. We do not know for sure what is causing the issue.
There is a function in R called `gsub()` that replaces every match of a pattern in a character vector, which we can use to strip out the commas.
Save the result of `gsub` as a new column `Clean_Reviews`. It should be a copy of the `Reviews` column, minus any commas:
app_store$Clean_Reviews <- gsub(',', '', app_store$Reviews)
We still have yet to change the class of the review data. Use `as.numeric` to convert every entry in the `Clean_Reviews` column to numeric data, saved as a new column `Numeric_Reviews`. There is a warning about NAs being introduced by coercion. Why didn’t this cleaning work? It did: the column now contains numeric data when we check its class. But there is still something causing an issue.
In this case, removing all the commas did not remove all of the problematic values. Run `app_store[is.na(app_store$Numeric_Reviews) & !is.na(app_store$Reviews), "Reviews"]`. This will go through the `app_store` dataset and find the rows where the new `Numeric_Reviews` numeric column has a problem (is NA), AND (`&`) where the original `Reviews` character column does not (is not NA). For each such row, the `"Reviews"` at the end of this line says to select the value from the `Reviews` column.
Running this will tell us the value(s) in the column that are causing the warning message to appear.
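The whole diagnosis can be seen on a made-up vector, with `"3.0M"` standing in for a problematic value:

```r
# Made-up review counts: commas plus one stray non-numeric value
reviews <- c("1,234", "56", "3.0M", NA)

clean <- gsub(",", "", reviews)   # "1234" "56" "3.0M" NA
nums  <- as.numeric(clean)        # 1234 56 NA NA (warning: NAs introduced)

# Which original values failed to convert? NA after cleaning, not NA before
reviews[is.na(nums) & !is.na(reviews)]   # "3.0M"
```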
Now that we know a bit about reshaping datasets, we will learn how to use `mutate()` and `filter()` from the `dplyr` library to do this same task.
Go to this page to read more about the `mutate()` function.
It is OK to use AI for researching topics and asking for further explanation of something. Just be 100% certain your usage follows the guidelines here.
In this case, we will use `mutate()` to create new columns in the dataset. This will act the same as where we created the `Clean_Reviews` and `Numeric_Reviews` columns.
Build a `mutate()` statement that uses `gsub` and `as.numeric`.
If you reuse the column names `Clean_Reviews` and `Numeric_Reviews`, it will overwrite your pre-existing columns - this is OK if you want two columns instead of four.
app_store_cleaned <- app_store %>%
mutate(
Clean_Reviews = gsub(",", "", Reviews),
Numeric_Reviews = as.numeric(Clean_Reviews)
)
The lines inside the `mutate()` call are separated by commas, with each line defining one new column.
Another useful function is `filter()`. Make sure you learn about this on your own enough to build a `filter()` call that goes through the `Numeric_Reviews` and `Reviews` columns and filters as we did in basic R. This will look like:
app_store_cleaned %>%
    filter([part 1], [part 2])
Your code should read as "In `app_store_cleaned`, filter for where `Numeric_Reviews` is NA and where `Reviews` is not NA". We will still get that one `3.0M` value as the result.
To finally remove that bothersome value, add `filter(!is.na(Numeric_Reviews))` after the `mutate()` function when creating `app_store_cleaned`. There is still a warning, but re-running the check for problematic values (the `filter()` line) will show no values creating a problem.
If there are errors in your code or the value persists after you believe it should have been removed, try checking that you have all of your piping in place and correct.
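The full `mutate()` plus `filter()` chain, sketched on a made-up three-row `app_store`:

```r
library(dplyr)

# Made-up app_store with one unconvertible Reviews value
app_store <- data.frame(
  App     = c("A", "B", "C"),
  Rating  = c(4.5, 3.8, 4.9),
  Reviews = c("1,234", "56", "3.0M")
)

app_store_cleaned <- app_store %>%
  mutate(
    Clean_Reviews   = gsub(",", "", Reviews),
    Numeric_Reviews = as.numeric(Clean_Reviews)  # warning: NAs introduced
  ) %>%
  filter(!is.na(Numeric_Reviews))                # drops the "3.0M" row

nrow(app_store_cleaned)   # 2
```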
Now, with both the `Rating` and `Numeric_Reviews` columns, find the rating-to-review ratio. Use `mutate()` to create a new column called `rating_per_review` in `app_store_cleaned` that contains the `Rating` per `Numeric_Reviews` (how well an app is rated for each review it gets). Save this back into the `app_store_cleaned` dataframe.
4.1 Create `Numeric_Reviews` using basic R
4.2 Create `Numeric_Reviews` using `dplyr` and remove the `3.0M` value
4.3 Display the head of the `rating_per_review` column
Question 5 (2 points)
The `apply()` family of functions in R provides efficient ways to work through data structures and apply functions to them. We already have some practice using `tapply()` from earlier in this project. In this question, we will begin working with some of the other functions to gain more experience using this family in basic R. We will continue working with the Google Play Store dataset (either re-read in, or continued with the changes from Question 4).
In the dataset, there is a column called `Content.Rating`. This column contains 7 types of rating based on the target audience for each app - which we can find using `unique()`:
- Everyone
- Teen
- Everyone 10+
- Mature 17+
- Adults only 18+
- Unrated
- `''` (empty entries)
There are only 6 types in `app_store_cleaned`, the `''` (empty values) having been removed.
It is good to check how many unique values there are before printing them all out. In this case it was fine because there were 7, but in the future, when looking for unique values, it is useful to first check `length(unique(...))` to see how many there are.
In cases where there are a lot of unique values, it can happen that two values are actually the same but were entered differently in a case-sensitive environment (such as everyone vs Everyone vs EVERYONE). This common error can be fixed by setting all the text to capital letters. While it isn’t needed with this column, we can still apply this technique to these values as a learning case.
`lapply()` stands for "list apply". It applies a function to each element of a list, and returns a new list, where each element is the result of applying that function.
There is a function called `toupper` which will convert all of the characters in a string to uppercase. Try using `toupper` in `lapply()` to convert the `Content.Rating` column to all capital letters. Save this as a new column in your dataframe and compare the changes.
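A sketch of `lapply()` with `toupper` on a few made-up ratings; `unlist()` flattens the returned list so it can be stored as a dataframe column:

```r
ratings <- c("Everyone", "Teen", "Mature 17+")

upper_list <- lapply(ratings, toupper)   # a list of 3 uppercase strings
upper <- unlist(upper_list)              # back to a character vector
upper                                    # "EVERYONE" "TEEN" "MATURE 17+"
```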
Another thing we might want to look at is the class of each column in the dataframe. We can easily use `class()` to find one or two, but what about every column in the dataframe?
`sapply()` is very similar to `lapply()`; either will work just fine here for finding the classes. Read about and compare their differences: `lapply()` will output a list, while `sapply()` will simplify to the best possible return type (often a vector).
Returning to the `Content.Rating` column, we can make another comparison. Using `nchar`, we can get outputs telling us how many characters are in each entry of the column. Try using this in both `lapply()` and `sapply()`.
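Comparing the two return types on made-up values:

```r
ratings <- c("Everyone", "Teen")

len_list <- lapply(ratings, nchar)   # a list: 8 and 4
len_vec  <- sapply(ratings, nchar)   # simplified to a named integer vector

len_list
len_vec
```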
We will run each of these with `head()` so that the output stays manageable.
The function |
5.1 Find the class of each column of the dataframe using two methods
5.2 Display the number of characters per row in `Content.Rating` using two methods
5.3 Explain a bit about another function from the apply family (3-4 sentences)
Submitting your Work
Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.
- `firstname_lastname_project3.ipynb`
It is necessary to document your work, with comments about each solution. All of your work needs to be your own, with citations to any source that you used. Please make sure that any outside sources (people, internet pages, generative AI, etc.) are cited properly in the project template. Please take the time to double check your work; see here for instructions on how to do so. You will not receive full credit if your work is not properly documented and double checked.