TDM 10100: Project 7 - Writing Functions

Project Objectives

Motivation: Functions are an important part of programming. Sometimes it is useful to write your own functions to have an efficient tool to reuse.

Context: Functions, up to this point, have been pre-written for us. We will learn how to write them and begin writing some of our own.

Scope: R, function, tapply, data cleaning

Learning Objectives

Write functions in R
Gain experience cleaning data

Dataset

/anvil/projects/tdm/data/youtube/most_subscribed_youtube_channels.csv
/anvil/projects/tdm/data/spotify/taylor_swift_discography_updated.csv

If you use AI in any way, such as for debugging or research, we now require that you submit a link to the entire chat history. For example, if you used ChatGPT, there is a “Share” option in the conversation sidebar. Click on “Create Link” and add the shareable link as part of your citation.

The project template in the Examples Book now has a “Link to AI Chat History” section; please have this included in all your projects. If you did not use any AI tools, you may write “None”.

We allow using AI for learning purposes; however, all submitted materials (code, comments, and explanations) must be your own work and in your own words. No content or ideas should be directly applied or copied and pasted into your projects. Please refer to the-examples-book.com/projects/fall2025/syllabus#guidance-on-generative-ai. Failing to follow these guidelines is considered academic dishonesty.

YouTube Channels

The YouTube platform was created to give everyone a chance to show their unique ideas to the world. It is fairly easy to post on YouTube, and the site is user-friendly to visit. That being said, some accounts on YouTube blow up in popularity. Currently, the most-subscribed-to account is MrBeast, with 418 million subscribers. An updated list can be found here.

YouTube has more than 2.7 billion monthly users, with more than 1 billion hours of content being played per day. The top content categories go from Comedy to Music to News & Politics. This dataset of the top 1000 Youtubers contains useful information about each Youtuber’s channel and content.

This dataset was last updated 3 years ago, but the top 1000 Youtubers have not changed too much. There are 7 columns and 1000 rows of data. These columns include:

rank: rank of the channel based on subscriber count
Youtuber: official channel name
subscribers: number of subscribers to each channel
video views: collective number of videos watched per channel
video count: number of videos uploaded by channel
category: genre of the channel’s content
started: year when the channel was started

Taylor Swift Spotify

Taylor Swift has built a massive universe of storylines and lyrics since 2006, spreading across her 11 original studio albums and worldwide fanbase. With over $2 billion in revenue earned from the Eras Tour alone, there is a lot of data to be collected about Taylor Swift. Her most recent album, "The Tortured Poets Department: The Anthology", with 31 tracks, was included in the most recent update of this dataset. This is a huge album, but she has written hundreds of songs. Compiling data from Spotify about her catalog alone makes for an interesting dataset to work with.

Some of these columns may seem strange, such as danceability, energy, liveness, and speechiness. These are measures that Spotify uses to describe songs, and they allow for comparison of these traits across songs and artists.

There are 28 columns and 577 rows in this dataset. Some of these columns include:

track_name: name of the track
duration_ms: the duration of track in milliseconds
spotify_streams: number of streams on Spotify
album: name of the album track appears on
track_lyrics: lyrics of the track
energy: intensity and activity level of the track (0 to 1)

Questions

Question 1 (2 points)

Read in the YouTube Channels dataset as youtubers from /anvil/projects/tdm/data/youtube/most_subscribed_youtube_channels.csv. This dataset contains a few columns of mainly character data about the top 1000 youtubers as of three years ago.

The problem is that some columns we would expect to be numeric (such as video.count) are stored as character data. (With its default settings, read.csv() replaces the space in the column name video count with a period, giving video.count.) This is OK because we have dealt with problems like this before. Use gsub() and as.numeric() to make a new column containing the values from video.count, cleaned and converted to numbers:

youtubers$video.count2 <- as.numeric(gsub(",", "", youtubers$video.count))

Now with this new column, we can use tapply() and find the total number of videos made within each category. Save this as category_counts. Go ahead and display its contents because, while we are going to plot this, some of the labels will get cut off, and having shown category_counts here will make a nice reference.
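As a small illustration of the cleaning and tapply() steps, here is a sketch on made-up data. The data frame toy and its values are hypothetical stand-ins and are not taken from the real file.

```r
# Toy data standing in for the real file; the values are made up.
toy <- data.frame(
  category    = c("Music", "Gaming", "Music", "Gaming"),
  video.count = c("1,200", "350", "80", "2,050")
)

# Remove the thousands separators, then convert the text to numbers.
toy$video.count2 <- as.numeric(gsub(",", "", toy$video.count))

# Sum the cleaned counts within each category.
toy_counts <- tapply(toy$video.count2, toy$category, sum)
toy_counts
```

If a column contains missing values, extra arguments such as na.rm = TRUE can be passed through tapply() to the summary function, for example tapply(x, g, sum, na.rm = TRUE).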

Barplot category_counts and add a title, axis labels, las (turn the x-axis labels), and whatever other customizations you would like to see.
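The customizations mentioned above might look like the following sketch. The named vector toy_counts is made up for illustration, and the title and label text are placeholders to adapt.

```r
# Hypothetical totals per category (made-up numbers).
toy_counts <- c(Comedy = 120, Gaming = 2400, Music = 1280)

# las = 2 draws the axis labels perpendicular to the axis,
# which helps when category names are long.
# cex.names shrinks the bar labels slightly.
mids <- barplot(
  toy_counts,
  main      = "Total videos by category (toy data)",
  xlab      = "Category",
  ylab      = "Total videos",
  las       = 2,
  cex.names = 0.8
)
```

barplot() also returns the x positions of the bars (saved here as mids), which can be handy if you later want to add text above each bar.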

Make a subset of youtubers that contains the entries where category is either "Gaming" or "Music". From this subset, use tapply() again and get the total video counts for each category and started year.

When you make a grouped barplot of this, make sure to include a legend so that you can distinguish the values belonging to each category.
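The subset, two-way tapply(), and grouped barplot steps can be sketched on made-up rows as follows. The data frame toy, its column videos, and all values are hypothetical.

```r
# Made-up rows; "Comedy" is included only to show the subset step.
toy <- data.frame(
  category = c("Gaming", "Gaming", "Music", "Music", "Gaming", "Comedy"),
  started  = c(2010, 2012, 2010, 2012, 2010, 2011),
  videos   = c(100, 200, 50, 75, 25, 10)
)

# Keep only the rows whose category is one of the two of interest.
toy_sub <- toy[toy$category %in% c("Gaming", "Music"), ]

# Passing a list of two grouping vectors gives a matrix:
# one row per category, one column per started year.
by_cat_year <- tapply(toy_sub$videos,
                      list(toy_sub$category, toy_sub$started),
                      sum)
by_cat_year

# beside = TRUE places the bars side by side instead of stacking them;
# legend.text = TRUE labels them using the row names of the matrix.
barplot(by_cat_year, beside = TRUE, legend.text = TRUE,
        main = "Videos by started year (toy data)",
        xlab = "Started year", ylab = "Total videos")
```

Combinations of category and year that have no rows show up as NA in the matrix and are simply left without a bar.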

Use tapply() one more time and find the average video count for each started year. Make a plot() to show this.

Read about plot() here

Deliverables

1.1 (Barplot) Which category had the most total videos?
1.2 (Grouped barplot) Among the Gaming channels, which started year went on to produce the most videos overall?
1.3 What was the average video count for a channel made in 2008?

Question 2 (2 points)

When you want to solve a problem or automate a task, writing a function can be very useful. For example, we could create a simple function that takes a date as input and returns how many days it is from today. Then, by calling the function again and just changing the input date, we can quickly get new results without rewriting any code.

One basic function structure in R is:

functionname <- function(arg1, arg2, arg3, ...){
  do any code in here when called
  return(returnobject)
}
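As a concrete instance of this structure, the date example mentioned earlier could be sketched as follows. The name days_from_today and its argument are hypothetical choices made for this illustration.

```r
# Returns the number of days from today until target_date.
# Negative values mean the date is in the past.
days_from_today <- function(target_date) {
  target <- as.Date(target_date)                # accepts "YYYY-MM-DD" text or a Date
  n_days <- as.numeric(target - Sys.Date())     # Date subtraction is measured in days
  return(n_days)
}

# Calling the function again with a different input gives a new result
# without rewriting any of the code inside it.
days_from_today("2030-01-01")
days_from_today(Sys.Date() + 10)
```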

Read in the Taylor Swift dataset as ts_songs from /anvil/projects/tdm/data/spotify/taylor_swift_discography_updated.csv. Make sure to use read.csv2 here, because the Taylor Swift dataset is ';' (semi-colon) delimited (rather than comma). Check out the dimensions of the data.
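To see why the delimiter matters, here is a small sketch that reads the same made-up text with both functions. The text in raw_text is hypothetical and only mimics a semicolon-delimited file.

```r
# Two columns separated by semicolons (made-up content).
raw_text <- "track;album\nSong A;Album X\nSong B;Album Y"

# read.csv() splits on commas, so each line becomes a single column.
wrong <- read.csv(text = raw_text)
dim(wrong)

# read.csv2() splits on semicolons, so the two columns are recovered.
# Note that read.csv2() also treats a comma as the decimal mark by default.
right <- read.csv2(text = raw_text)
dim(right)
```

Checking dim() right after reading a file is a quick way to notice that the wrong delimiter was used.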

There is a column (track_lyrics) that shows every lyric for each song in the dataset. To remove this column, you can use: ts_songs <- ts_songs[ , !(names(ts_songs) %in% "track_lyrics")]. After cleaning, you may use the head() function to check the first six rows of the dataset and get a better idea of what the dataset looks like.
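The column-removal line above works by building a TRUE/FALSE vector over the column names. A minimal sketch on a made-up data frame (toy and its values are hypothetical):

```r
toy <- data.frame(
  track_name   = c("Song A", "Song B"),
  track_lyrics = c("la la la", "na na na"),
  energy       = c(0.4, 0.9)
)

# TRUE marks the column we want to drop.
names(toy) %in% "track_lyrics"

# Negating with ! keeps every column except that one.
toy <- toy[ , !(names(toy) %in% "track_lyrics")]
names(toy)
```

If only one column would remain, adding drop = FALSE inside the brackets keeps the result as a data frame rather than a vector.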

Use options(repr.matrix.max.cols=50, repr.matrix.max.rows=200) to change the maximum number of columns and rows you can view at once.

The energy column appears numeric but is stored as text. To work with it properly, convert energy to numeric:

ts_songs$energy <- as.numeric(ts_songs$energy)

In this case, there is no need to use gsub(), and just converting the column to numeric will do fine.

Build a function find_songs_with_energy using the basic function structure shown above. This function should take a dataframe and an energy threshold as inputs, and return all songs with energy greater than or equal to that threshold. The threshold is flexible: it can vary depending on the research question or what the user wants to explore. Here is the pseudocode:

find_songs_with_energy <- function(input_df, threshold) {
  my_output <- input_df[input_df$example_col >= threshold, ]
  return(my_output)
}

In R, a function is treated as an object, which means you can store it in a variable, call it later, and save its output into another object for further use. First, finalize your function. You will likely want to use the results it produces later, so save them to a new object.

Test out your function using the ts_songs dataset and the median value of the energy column (look at the summary statistics for this column to find the median, and store it as medianset). For example:

high_energy_df <- find_songs_with_energy(ts_songs, medianset)
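The same pattern of storing a function, calling it, and saving its output can be sketched on R's built-in mtcars data. The function rows_at_or_above and its col_name argument are hypothetical and are a variation on the pseudocode, not the function requested for this question.

```r
# A generic threshold filter: the column is chosen by name at call time.
rows_at_or_above <- function(input_df, col_name, threshold) {
  keep <- input_df[[col_name]] >= threshold   # TRUE/FALSE for each row
  my_output <- input_df[keep, ]
  return(my_output)
}

# Compute the threshold once, then reuse it.
mpg_median <- median(mtcars$mpg)

# Save the returned data frame so it can be used in later steps.
efficient_cars <- rows_at_or_above(mtcars, "mpg", mpg_median)
nrow(efficient_cars)
summary(efficient_cars$mpg)
```

If the filtered column contains NA values, the comparison yields NA for those rows and they appear as all-NA rows in the result; wrapping the condition in which() is one way to leave them out.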

Build a second function, find_songs_by_album, that takes a dataset and an album name as inputs and returns all songs from that album. Test your function using the album "The Tortured Poets Department: The Anthology". Your result should have 31 rows.

Deliverables

2.1 What was the maximum energy level possible from the songs? What was the minimum?
2.2 How many songs had high energy (greater than or equal to the median value)?
2.3 Write your function for finding songs by album and show the test on "The Tortured Poets Department: The Anthology".

Question 3 (2 points)

The spotify_streams column, which shows the number of times a song has been played on Spotify, is currently stored as character data. Try creating a new column that contains these values as numeric:

ts_songs$numeric_streams <- as.numeric(ts_songs$spotify_streams)

This conversion may produce many NA values in that column (ts_songs$numeric_streams). To identify values that were not NA before converting but became NA after converting to numeric, you can use:

ts_songs[is.na(ts_songs$numeric_streams) & !is.na(ts_songs$spotify_streams), "spotify_streams"]

This will help you spot any problematic values.
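The same check can be tried on a short made-up vector to see what it returns. The values in streams_text are hypothetical.

```r
# Made-up text values: one clean, one with several dots, one decimal, one missing.
streams_text <- c("1200", "3.400.000", "5.5", NA)

# Converting produces NA (with a warning) wherever the text is not a valid number.
streams_num <- suppressWarnings(as.numeric(streams_text))
streams_num

# Values that were present before converting but are NA afterwards.
bad <- streams_text[is.na(streams_num) & !is.na(streams_text)]
bad
```

The value that was already missing is not reported, because the second condition excludes it.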

After running it, you may have noticed that values containing more than one . cannot be converted to numeric. This is because R expects numeric values to have at most one decimal point, so any extra . makes the value invalid for numeric conversion, resulting in NA. In this case, you may want to remove all of the . characters before converting the column to numeric:

ts_songs$cleaned_streams <- gsub("\\.", "", ts_songs$spotify_streams)

When using gsub(), there are some characters that require "\\" before them. Read more here.
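The period is one of those characters, because in a regular expression an unescaped . matches any single character. A small sketch on a made-up value:

```r
x <- "3.400.000"   # made-up value

# Unescaped, "." matches every character, so everything is removed.
gsub(".", "", x)

# Escaped, "\\." matches only a literal period.
gsub("\\.", "", x)

# fixed = TRUE treats the pattern as plain text, so no escaping is needed.
gsub(".", "", x, fixed = TRUE)

# Once the periods are gone, the conversion succeeds.
as.numeric(gsub("\\.", "", x))
```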

Use your function find_songs_by_album to find all of the songs from the album "evermore (deluxe version)". Save this as a separate dataframe named evermore.

Deliverables

3.1 Why couldn’t the spotify_streams column be immediately converted to numeric?
3.2 How many songs were in "evermore (deluxe version)"?

Question 4 (2 points)

Using the find_songs_by_album function, create a new dataframe ts_1989 containing all of the songs from the "1989 (Taylor’s Version) [Deluxe]" album. There should be 22 songs.

Looking at the duration_ms column for these songs, the values look unusually large. This is because the column records the length of each song in milliseconds. Make two new columns:

duration_sec: the values from duration_ms divided by 1000
duration_min: the values from duration_sec divided by 60
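The arithmetic behind these two columns can be checked on a couple of made-up durations before applying it to the dataframe. The numbers below are hypothetical.

```r
# Two made-up track lengths in milliseconds.
duration_ms <- c(180000, 212600)

# 1000 milliseconds in a second.
duration_sec <- duration_ms / 1000

# 60 seconds in a minute.
duration_min <- duration_sec / 60

duration_sec
round(duration_min, 2)
```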

Now, imagine you might want to perform these steps on other dataframes that have a duration_ms column. In that case, it would be useful to create a reusable function that takes a dataframe as input and performs the conversion. So, build a function convert_duration that takes a dataframe as input and converts the duration_ms column into two new columns:

duration_sec (duration in seconds)
duration_min (duration in minutes)

The function should return the updated dataframe. Test your function using the ts_1989 dataframe.
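The pattern of adding columns inside a function and returning the updated dataframe can be sketched with a different unit conversion. The function add_distance_columns, the dataframe runs, and its values are all hypothetical.

```r
# Adds two derived columns and returns the modified copy of the data frame.
add_distance_columns <- function(input_df) {
  input_df$distance_km <- input_df$distance_m / 1000
  input_df$distance_mi <- input_df$distance_km / 1.609344   # kilometres per mile
  return(input_df)
}

runs <- data.frame(run = c("a", "b"), distance_m = c(5000, 10000))

# The function works on a copy, so the result must be saved to keep the new columns.
runs_converted <- add_distance_columns(runs)

names(runs)             # the original is unchanged
names(runs_converted)   # the returned copy has the two new columns
```

Because R does not modify the original dataframe inside the function, forgetting to assign the returned value is a common reason new columns seem to disappear.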

Deliverables

4.1 Show the first six lines of the ts_1989 dataframe.
4.2 Write the convert_duration function, call it with ts_1989, and show the first six lines of the result.

Question 5 (2 points)

In the youtubers dataset used in Question 1, make a second subscribers column (subscribers2) from youtubers$subscribers, cleaned and converted to numeric.

If we wanted to find the most-subscribed-to Youtuber from this dataset, it would not be challenging. But something cool that comes from building functions is that they are reusable. We can build a function that takes a dataframe and a selected genre and returns a Youtuber. When using this function, we can switch out whatever dataframe or genre is used in the input, and get completely different outputs without having to write too much.

In your function, you should take your inputted dataframe’s category column and find all entries that are the same as the inputted genre. This will be called genre_rows.

In your genre_rows, use which.max() to find the entry in your numeric subscribers column which has the highest subscribers count. Return this result.
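To see how which.max() behaves before placing it inside your function, here is a sketch on made-up rows. The dataframe toy, the channel names, and the subscriber counts are hypothetical.

```r
toy <- data.frame(
  Youtuber     = c("Chan A", "Chan B", "Chan C", "Chan D"),
  category     = c("Gaming", "Music", "Gaming", "Music"),
  subscribers2 = c(5e6, 9e6, 7e6, 2e6)
)

# Keep only the rows that match the chosen genre.
genre_rows <- toy[toy$category == "Gaming", ]

# which.max() returns the position of the largest value,
# counted within genre_rows (not within the full data frame).
top_index <- which.max(genre_rows$subscribers2)

# Use that position to pull out the matching row.
genre_rows[top_index, ]
```

If several channels tie for the largest value, which.max() returns only the first of them.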

To test this function, use the youtubers dataset and the "Gaming" category. The result should be PewDiePie, with 111,000,000 subscribers as of the time this dataset was last updated.

Look at the table of the category column from youtubers and choose another category to test this function on.

Deliverables

5.1 Function to find top-subscribed-to Youtubers
5.2 What second category did you use? Which Youtuber was it?
5.3 How does this dataset from 3 years ago relate to the current top Youtuber list?

Submitting your Work

Once you have completed the questions, save your Jupyter notebook. You can then download the notebook and submit it to Gradescope.

Items to submit

firstname_lastname_project7.ipynb

You must double check your .ipynb after submitting it in Gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, markdown, and code output even though it may not. Please take the time to double check your work. See here for instructions on how to double check this.

You will not receive full credit if your .ipynb file does not contain all of the information you expect it to, or if it does not render properly in Gradescope. Please ask a TA if you need help with this.