STAT 19000: Project 11 — Fall 2021

Motivation: The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done, takes practice. In this project we will use what you’ve learned so far this semester to solve data-driven problems. In previous projects, we’ve directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you’d like.

Context: You’ve learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of YouTube data.

Scope: R

Learning Objectives
  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Utilize apply functions in order to solve a data-driven problem.

  • Gain proficiency using split, merge, and subset.

  • Comprehend what a function is, and the components of a function in R.

  • Demonstrate the ability to use nested apply functions to solve a data-driven problem.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/youtube/*

Questions

Question 1

In project 5, we used a loop to combine the various countries youtube datasets into a single dataset called yt (for YouTube).

Now, we’ve provided you with code below to create such a dataset, with a subset of the countries we want to look at.

library(lubridate)

countries <- c('US', 'DE', 'CA', 'FR')

# Choose either the for loop or the sapply function for creating `yt`

# EITHER use a for loop to create the data frame `yt`
yt <- data.frame()
for (c in countries) {
    filename <- paste0("/depot/datamine/data/youtube/", c, "videos.csv")
    dat <- read.csv(filename)
    dat$country_code <- c
    yt <- rbind(yt, dat)
}

# OR use an sapply function to create the data frame `yt`
myDFlist <- lapply( countries, function(c) {
                    dat <- read.csv(paste0("/depot/datamine/data/youtube/", c, "videos.csv"))
                    dat$country_code <- c
                    return(dat)} )
yt <- do.call(rbind, myDFlist)

# convert columns to date formats
yt$trending_date <- ydm(yt$trending_date)
yt$publish_time <- ymd_hms(yt$publish_time)

# extract the trending_year and publish_year
yt$trending_year <- year(yt$trending_date)
yt$publish_year <- year(yt$publish_time)

Take a look at the tags column in our yt dataset. Create a function called count_tags that has an argument called tag_vector. Your count_tags function should be the count of how many unique tags the vector, tag_vector contains.

Take a look at the fixed argument in strsplit.

You can test your function with the following code.

tag_test <- yt$tags[2]
tag_test
count_tags(tag_test)
Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

Create a new column in your yt dataset called n_tags that contains the number of tags for the corresponding trending video.

Make sure to use your count_tags function. Which YouTube trending video has the highest number of unique tags for videos that are trending either in the US or Germany (DE)? How many tags does it have?

Make sure to use the USE.NAMES argument from sapply function

Begin by creating the new column n_tags. Then create a new dataset only for youtube videos trending in 'US' or 'DE'. For the subsetted dataset, get the YouTube trending video with highest number of tags.

It should be video_id with value 4AelFaljd7k.

Relevant topics: sapply, which.max, indexing, subset

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • The title of the YouTube video with the highest number of tags, and the number of tags it has.

Question 3

Is there an association between number of tags in a video and how many views it gets?

Make a scatterplot with number of views in the x-axis and number of tags (n_tags) in the y-axis. Based on your plot, write 1-2 sentences about whether you think number of tags and number of views are associated or not.

Hmmm, is a scatterplot a good choice to be able to see an association in this case? If so, explain why. If not, create a better plot for determining this, and explain why your plot is better, and try to explain if you see any association.

tapply could be useful for the follow up question.

Relevant topics: sapply, which.max, indexing, subset

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences explaining if you think number of views and number of tags a youtube video has are associated or not, and why.

Question 4

Compare the average number of views and average number of comments that the YouTube trending videos have per trending country.

Is there a different behavior between countries? Are the comparisons fair? To check if we are being fair, take a look at how many youtube trending videos we have per country.

Relevant topics: tapply, mean

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences comparing trending countries based on average number of views and comments.

  • 1-2 sentences explaining if you think we are being fair in our comparisons, and why or why not.

Question 5

How would you compare the YouTube trending videos across the different countries?

Make a comparison using plots and/or summary statistics. Explain what variables are you looking at, and why you are analyzing the data the way you are. Have fun with it!

There are no right/wrong answers here. Just dig in a little bit and see what you can find.

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences explaining your logic.

  • 1-2 sentences comparing the countries.

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.