STAT 19000: Project 5 — Fall 2021

Motivation: As briefly mentioned in project 4, R differs from other programming languages in that typically you will want to avoid using for loops, and instead use vectorized functions and the "apply" suite. In this project we will use vectorized functions to solve a variety of data-driven problems.

Context: While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things.

Scope: R, data.frames, recycling, factors, if/else, for loops, apply suite

Learning Objectives
  • Explain what "recycling" is in R and predict behavior of provided statements.

  • Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

  • Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • List the differences between lists, vectors, factors, and data.frames, and when to use each.

  • Demonstrate a working knowledge of control flow in R: for loops.

Make sure to read about and use the template found here, and read the important information about project submissions here.

Dataset(s)

The following questions will use the following dataset(s):

  • /depot/datamine/data/youtube/*.{csv,json}

Questions

Question 1

Read the dataset USvideos.csv into a data.frame called us_youtube. The dataset contains YouTube trending videos between 2017 and 2018.

The dataset has two columns that refer to time: trending_date and publish_time.

The column trending_date is organized in a [year].[day].[month] format, while the publish_time is in a different format.

When working with dates, it is important to use tools built specifically for this purpose (rather than string manipulation, for example). We’ve provided you with the code below. The provided code uses the lubridate package, an excellent package that hides away many common issues that occur when working with dates. Feel free to check out the official cheatsheet if you’d like to learn more about the package.

Run the code below to create two new columns: trending_year and publish_year.

library(lubridate)

# convert columns to date formats
us_youtube$trending_date <- ydm(us_youtube$trending_date)
us_youtube$publish_time <- ymd_hms(us_youtube$publish_time)

# extract the trending_year and publish_year
us_youtube$trending_year <- year(us_youtube$trending_date)
us_youtube$publish_year <- year(us_youtube$publish_time)

unique(us_youtube$trending_year)
unique(us_youtube$publish_year)

Take a look at our newly created columns. What type are the new columns? In the provided code, which (if any) of the 4 functions are vectorized?

Now, duplicate the functionality of the provided code using only the following functions: as.numeric, substr, and regular vectorized operations like +, -, * and /. Which was easier?
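As a rough illustration of the string-based approach, the sketch below works on two made-up vectors rather than on the dataset itself. It assumes that trending_date values look like "17.14.11" (two-digit year first), that publish_time values begin with a four-digit year, and that every two-digit year belongs to the 2000s; none of these assumptions is checked against the real files.

```r
# Hypothetical sample values (assumed formats, not read from the dataset)
trending_date <- c("17.14.11", "18.01.06")   # [year].[day].[month]
publish_time  <- c("2017-11-13T17:13:01.000Z", "2018-05-31T04:00:00.000Z")

# substr and as.numeric are vectorized: one call handles every element
trending_year <- as.numeric(substr(trending_date, 1, 2)) + 2000  # assumes years 20xx
publish_year  <- as.numeric(substr(publish_time, 1, 4))

print(trending_year)
print(publish_year)
```

Note that this approach silently depends on the exact character positions of the year, which is one reason a dedicated date package is usually the safer choice.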

Relevant topics: read.csv, typeof

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 2

While some great content certainly comes out of the United States, we have a lot of other great content from other countries. Plus, the size of the data is reasonable to combine into a single data.frame.

Look in the following directory: /depot/datamine/data/youtube. You will find files that look like this:

CAvideos.csv
DEvideos.csv
USvideos.csv
...

You will notice that each dataset follows the same naming convention. Each file name starts with the country code (US, DE, CA, etc.), followed immediately by "videos.csv".

Use a loop and the given vector to systematically combine the data into a new data.frame called yt.

countries <- c('CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU', 'US')

Loop through each of the values in countries. Use the paste0 function to create a string that is the absolute path to each of the files. For example, the following steps would be performed in the first iteration of the loop.

  • In our first loop we have the value CA.

  • We would use paste0 to create a string containing the absolute path of the corresponding dataset: /depot/datamine/data/youtube/CAvideos.csv.

  • Then, we would use that string as an argument to the read.csv function to read the data into a data.frame.

  • Then, we would add the new column country_code to the data.frame with the value CA repeated for each row.

  • Finally, we would use the rbind function to combine the new data.frame with the previous data.frame.
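The steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: a temporary directory stands in for /depot/datamine/data/youtube, and the country codes "AA" and "BB", the column names, and the name combined are all made up for the example.

```r
# Hypothetical stand-in for the real data directory: a temporary directory
data_dir <- tempdir()
codes <- c("AA", "BB")   # made-up country codes for illustration

# Write two tiny files that follow the "<code>videos.csv" naming convention
for (code in codes) {
  toy <- data.frame(video_id = paste0(code, 1:2), views = c(10, 20))
  write.csv(toy, file.path(data_dir, paste0(code, "videos.csv")), row.names = FALSE)
}

combined <- NULL   # rbind(NULL, df) returns df, so NULL is a safe starting value
for (code in codes) {
  path <- paste0(data_dir, "/", code, "videos.csv")  # build the absolute path
  df <- read.csv(path)                               # read one country's file
  df$country_code <- code                            # single value is recycled to every row
  combined <- rbind(combined, df)                    # append to the running result
}

print(dim(combined))
print(colnames(combined))
```

Growing a data.frame with rbind inside a loop is easy to read but copies the accumulated data on every iteration, so it can be slow for many large files.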

In the end, you will end up with a single data.frame called yt, that contains the data for every country in the dataset. yt will also have a column called country_code that contains the country code for each row, so we know where the data originated.

When combining data, it is important that we don’t lose any data in the process. If we slapped together all of the data from each of the datasets into a single file named yt.csv, what data would we lose?

In order to prevent this loss of data, create a new column called country_code that includes this information in the dataset rather than in the filename.

Print a list of the columns in yt, and print the dimensions of yt. Finally, create the trending_year and publish_year columns for yt.

# Dr Ward summarizes how to perform Question 2 in the video.
# Here is the analogous code for this question.
# We know that all of this is new for you.
# That is why we are guiding you through this question!

getdataframe <- function(mycountry) {
    myDF <- read.csv(paste0("/depot/datamine/data/youtube/", mycountry, "videos.csv"))
    myDF$country_code <- mycountry
    return(myDF)
}

countries <- c('CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU', 'US')

myresults <- lapply(countries, getdataframe)

yt <- do.call(rbind, myresults)

Relevant topics: read.csv, paste0, rbind, dim, colnames

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 3

From this point on, unless specified, use the yt data.frame to answer the questions.

Which YouTube video took the longest time to trend from the time it was published? How many years did it take to trend?
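One possible way to approach this kind of question is to compute a time difference per row and then use which.max to find the row with the largest value. The sketch below uses a small made-up data.frame; the titles, the dates, the column publish_date, and the choice to divide by 365.25 to approximate years are all assumptions for illustration.

```r
# Hypothetical toy data (not from the dataset)
toy <- data.frame(
  title         = c("video a", "video b", "video c"),
  publish_date  = as.Date(c("2017-11-01", "2010-03-15", "2018-01-20")),
  trending_date = as.Date(c("2017-11-14", "2018-02-01", "2018-01-25"))
)

# Subtracting Date vectors is vectorized and yields a difference in days
days_to_trend <- as.numeric(toy$trending_date - toy$publish_date)

# which.max returns the position of the (first) largest value
slowest <- which.max(days_to_trend)

print(toy$title[slowest])
print(days_to_trend[slowest] / 365.25)   # rough conversion from days to years
```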

Relevant topics: which.max, indexing

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • Name of the YouTube video, and how long it took to trend.

  • (Optional) Did you watch the video prior to the project? If so, what do you think about it?

Question 4

We are interested in seeing whether or not there is a difference in views between videos with ratings enabled vs. those with ratings disabled.

Calculate the average number of views for videos with ratings enabled and those with ratings disabled. Anecdotally, does it look like disabling the ratings helps or hurts the views?

You can use tapply to solve this problem if you are comfortable with the tapply function. Otherwise, stay tuned for a future project, where we will explore the tapply function in more detail.

You may need to take a careful look at the ratings_disabled column. What type should this column be? Make sure to convert if necessary.
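As a minimal sketch of the idea, the example below builds a made-up data.frame in which the flag is stored as the text "True"/"False", converts it to a logical vector, and then computes a grouped mean with tapply. Whether the real column arrives as text, a factor, or a logical depends on how the file was read, so the conversion step here is an assumption to verify against your own data.

```r
# Hypothetical toy data: the flag is stored as text, which is an assumption
toy <- data.frame(
  views            = c(100, 300, 50, 150),
  ratings_disabled = c("False", "False", "True", "True"),
  stringsAsFactors = FALSE
)

# Vectorized comparison turns the text column into a logical column
toy$ratings_disabled <- toy$ratings_disabled == "True"

# tapply(values, groups, function) applies the function within each group
avg_views <- tapply(toy$views, toy$ratings_disabled, mean)
print(avg_views)
```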

Relevant topics: mean, tapply, indexing

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 5

Create two new columns in yt:

  • balance: the difference between likes and dislikes for a given video.

  • positive_balance: an indicator variable that is TRUE if balance is greater than zero, and FALSE otherwise.

How many videos have a positive balance?
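The following sketch shows the general pattern on a made-up three-row data.frame: a vectorized subtraction, a vectorized comparison, and sum applied to a logical vector (where TRUE counts as 1 and FALSE as 0). The toy values are invented for illustration only.

```r
# Hypothetical toy data (not from the dataset)
toy <- data.frame(likes = c(10, 5, 7), dislikes = c(2, 5, 9))

# Element-wise subtraction: no loop needed
toy$balance <- toy$likes - toy$dislikes

# Element-wise comparison yields a logical vector
toy$positive_balance <- toy$balance > 0

# Summing a logical vector counts the TRUE values
n_positive <- sum(toy$positive_balance)
print(n_positive)
```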

Relevant topics: sum

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

Question 6

Compare videos with a positive balance (positive_balance is TRUE) to those with a non-positive balance (positive_balance is FALSE). Make this comparison based on the comment_count and the views of the videos.

To make a comparison, pick a statistic to summarize and compare comment_count and views. Examples of statistics include: mean, median, max, min, var, and sd.

You can pick more than one statistic to compare, if you want, and each column may have its own statistic(s) to summarize it.
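For instance, a different statistic can be used for each column by calling tapply once per column. The sketch below uses invented numbers; the choice of median for comment_count and mean for views is arbitrary and only meant to show the mechanics, not to suggest which statistic is appropriate.

```r
# Hypothetical toy data (not from the dataset)
toy <- data.frame(
  comment_count    = c(4, 10, 1, 3, 100),
  views            = c(200, 900, 50, 70, 80),
  positive_balance = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# One statistic per column, each computed within the TRUE and FALSE groups
median_comments <- tapply(toy$comment_count, toy$positive_balance, median)
mean_views      <- tapply(toy$views, toy$positive_balance, mean)

print(median_comments)
print(mean_views)
```

In the toy FALSE group, the single value 100 would pull the mean of comment_count well above the median, which is the kind of effect worth considering when choosing a statistic.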

Relevant topics: tapply, mean, sum, var, sd, max, min, median

Items to submit
  • Code used to solve this problem.

  • Output from running the code.

  • 1-2 sentences explaining what statistic you chose to summarize each column, and why.

  • 1-2 sentences comparing videos with positive balance and non-positive balance based on comment_count and views. Is the result surprising to you?

Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it to make sure that what you think you submitted is what you actually submitted.