# STAT 19000: Project 5 — Fall 2021

Motivation: As briefly mentioned in project 4, R differs from other programming languages in that typically you will want to avoid using for loops, and instead use vectorized functions and the "apply" suite. In this project we will use vectorized functions to solve a variety of data-driven problems.

Context: While it was important to stop and learn about looping and if/else statements, in this project, we will explore the R way of doing things.

Scope: R, data.frames, recycling, factors, if/else, for loops, apply suite

Learning Objectives
• Explain what "recycling" is in R and predict behavior of provided statements.

• Explain and demonstrate how R handles missing data: NA, NaN, NULL, etc.

• Demonstrate the ability to use the following functions to solve data-driven problem(s): mean, var, table, cut, paste, rep, seq, sort, order, length, unique, etc.

• Read and write basic (csv) data.

• Explain and demonstrate: positional, named, and logical indexing.

• List the differences between lists, vectors, factors, and data.frames, and when to use each.

• Demonstrate a working knowledge of control flow in R: for loops.

Make sure to read about and use the template found here, and read the important information about project submissions here.

## Dataset(s)

The following questions will use the following dataset(s):

• `/depot/datamine/data/youtube/*.{csv,json}`

## Questions

### Question 1

Read the dataset `USvideos.csv` into a data.frame called `us_youtube`. The dataset contains YouTube trending videos between 2017 and 2018.

 The dataset has two columns that refer to time: `trending_date` and `publish_time`. The column `trending_date` is organized in a `[year].[day].[month]` format, while the `publish_time` is in a different format.

When working with dates, it is important to use tools built specifically for this purpose (rather than string manipulation, for example). We’ve provided you with the code below. The provided code uses the `lubridate` package, an excellent package that hides away many common issues that occur when working with dates. Feel free to check out the official cheatsheet if you’d like to learn more about the package.

Run the code below to create two new columns: `trending_year` and `publish_year`.

```
library(lubridate)

# convert columns to date formats

# extract the trending_year and publish_year
```

Take a look at our newly created columns. What type are the new columns? In the provided code, which (if any) of the 4 functions are vectorized?
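The conversion and extraction lines of the provided code are not reproduced above. As a rough sketch only, assuming the code relies on lubridate's `ydm`, `ymd_hms`, and `year` functions, the two steps might look like the following. The data.frame `toy_youtube` and its values are hypothetical stand-ins, not rows from the real file.

```r
library(lubridate)

# Hypothetical toy data in the two formats described in the question.
toy_youtube <- data.frame(
  trending_date = c("17.14.11", "18.01.02"),
  publish_time  = c("2017-11-13T17:13:01.000Z", "2018-01-30T09:00:00.000Z")
)

# Convert columns to date formats (assumed functions: ydm and ymd_hms).
toy_youtube$trending_date <- ydm(toy_youtube$trending_date)
toy_youtube$publish_time  <- ymd_hms(toy_youtube$publish_time)

# Extract the year from each converted column.
toy_youtube$trending_year <- year(toy_youtube$trending_date)
toy_youtube$publish_year  <- year(toy_youtube$publish_time)

# Inspect the types of the new columns.
class(toy_youtube$trending_year)
class(toy_youtube$publish_year)
```

Each call receives a whole column and returns a whole column, which is a useful hint when deciding which functions are vectorized.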

Now, duplicate the functionality of the provided code using only the following functions: `as.numeric`, `substr`, and regular vectorized operations like `+`, `-`, `*` and `/`. Which was easier?
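As a minimal sketch of the string-based approach, the example below works on hypothetical toy strings rather than the real columns. It assumes the year in `trending_date` is stored as two digits and that `publish_time` begins with a four-digit year; check both assumptions against the actual data before reusing the idea.

```r
# Hypothetical toy strings; the real columns may be formatted differently.
trending_date <- c("17.14.11", "18.01.02")
publish_time  <- c("2017-11-13T17:13:01.000Z", "2018-01-30T09:00:00.000Z")

# substr() and as.numeric() are vectorized: one call handles every element.
# Assumption: a two-digit year that belongs to the 2000s, so we add 2000.
trending_year <- as.numeric(substr(trending_date, 1, 2)) + 2000

# Assumption: the first four characters of publish_time are the year.
publish_year <- as.numeric(substr(publish_time, 1, 4))

trending_year
publish_year
```

Note how much this version depends on the exact character positions, which is one reason dedicated date tools are usually the safer choice.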

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 2

While some great content certainly comes out of the United States, we have a lot of other great content from other countries. Plus, the size of the data is reasonable to combine into a single data.frame.

Look in the following directory: `/depot/datamine/data/youtube`. You will find files that look like this:

```
CAvideos.csv
DEvideos.csv
USvideos.csv
...
```

You will notice that each dataset follows the same naming convention. Each file name starts with the country code (`US`, `DE`, `CA`, etc.), followed immediately by "videos.csv".

Use a loop and the given vector to systematically combine the data into a new data.frame called `yt`.

``countries <- c('CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU', 'US')``

Loop through each of the values in `countries`. Use the `paste0` function to create a string that is the absolute path to each of the files. So, for example, the following would represent the steps to perform in the first iteration of the loop.

• In the first iteration, we have the value `CA`.

• We would use `paste0` to create a string containing the absolute path of the corresponding dataset: `/depot/datamine/data/youtube/CAvideos.csv`.

• Then, we would use that string as an argument to the `read.csv` function to read the data into a data.frame.

• Then, we would add the new column `country_code` to the data.frame with the value `CA` repeated for each row.

• Finally, we would use the `rbind` function to combine the new data.frame with the previous data.frame.
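The steps above can be sketched with a small, self-contained example. To keep it runnable anywhere, it writes two tiny hypothetical CSV files to a temporary directory instead of reading from `/depot/datamine/data/youtube`; the file contents, `data_dir`, and the shortened `countries` vector are illustrative assumptions.

```r
# Hypothetical stand-in for the real data directory: two tiny CSV files.
data_dir <- tempdir()
write.csv(data.frame(title = c("a", "b"), views = c(10, 20)),
          file.path(data_dir, "CAvideos.csv"), row.names = FALSE)
write.csv(data.frame(title = "c", views = 30),
          file.path(data_dir, "DEvideos.csv"), row.names = FALSE)

countries <- c("CA", "DE")  # shortened vector for the toy example
yt <- data.frame()          # start empty and grow it inside the loop

for (code in countries) {
  # Build the path, e.g. <data_dir>/CAvideos.csv
  path <- paste0(data_dir, "/", code, "videos.csv")
  dat <- read.csv(path)
  # The single value is recycled to fill every row of the new column.
  dat$country_code <- code
  # Stack the new rows under the rows collected so far.
  yt <- rbind(yt, dat)
}

colnames(yt)
dim(yt)
```

Growing a data.frame with `rbind` inside a loop is easy to read but copies the data on every iteration, so it is best suited to a small number of files like this.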

In the end, you will end up with a single data.frame called `yt`, that contains the data for every country in the dataset. `yt` will also have a column called `country_code` that contains the country code for each row, so we know where the data originated.

 When combining data, it is important that we don’t lose any data in the process. If we slapped together all of the data from each of the datasets into a single file named `yt.csv`, what data would we lose?

In order to prevent this loss of data, create a new column called `country_code` that includes this information in the dataset rather than in the filename.

Print a list of the columns in `yt` and the dimensions of `yt`. Finally, create the `trending_year` and `publish_year` columns for `yt`.

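```
# Dr Ward summarizes how to perform Question 2 in the video.
# Here is the analogous code for this question.
# We know that all of this is new for you.
# That is why we are guiding you through this question!

getdataframe <- function(mycountry) {
  # build the absolute path for this country and read the file
  myDF <- read.csv(paste0("/depot/datamine/data/youtube/", mycountry, "videos.csv"))
  # record which country each row came from
  myDF$country_code <- mycountry
  return(myDF)
}

countries <- c('CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU', 'US')

# read every country's file into a list of data.frames
myresults <- lapply(countries, getdataframe)

# stack the list into a single data.frame
yt <- do.call(rbind, myresults)
```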

Relevant topics: read.csv, paste0, rbind, dim, colnames

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 3

 From this point on, unless specified, use the `yt` data.frame to answer the questions.

Which YouTube video took the longest time to trend from the time it was published? How many years did it take to trend?
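One possible way to approach this is sketched below on a hypothetical toy data.frame. It assumes both columns have already been converted to `Date` values; the names `toy`, `publish_date`, and `days_to_trend` are illustrative only.

```r
# Hypothetical toy data with both columns already stored as Date values.
toy <- data.frame(
  title         = c("video A", "video B", "video C"),
  publish_date  = as.Date(c("2017-11-01", "2010-05-20", "2018-01-15")),
  trending_date = as.Date(c("2017-11-14", "2018-02-05", "2018-01-17"))
)

# Subtracting Date vectors is vectorized and yields a difference in days.
days_to_trend <- as.numeric(toy$trending_date - toy$publish_date)

# which.max() returns the position of the largest value.
slowest <- which.max(days_to_trend)

# Use that position to index the title and the elapsed time.
toy$title[slowest]
days_to_trend[slowest] / 365.25  # approximate number of years
```

Comparing only the year columns would give a coarser answer, since two dates in the same year can still be many months apart.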

Relevant topics: which.max, indexing

Items to submit
• Code used to solve this problem.

• Output from running the code.

• Name of the YouTube video, and how long it took to trend.

• (Optional) Did you watch the video prior to the project? If so, what do you think about it?

### Question 4

We are interested in seeing whether or not there is a difference in views between videos with ratings enabled vs. those with ratings disabled.

Calculate the average number of views for videos with ratings enabled and those with ratings disabled. Anecdotally, does it look like disabling the ratings helps or hurts the views?

You can use `tapply` to solve this problem if you are comfortable with the `tapply` function. Otherwise, stay tuned: a future project will explore the `tapply` function in more detail.

You may need to take a careful look at the `ratings_disabled` column. What type should this column be? Make sure to convert it if necessary.
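A minimal sketch of the idea, using a hypothetical toy data.frame: it assumes `ratings_disabled` arrived as the character strings "True" and "False", which may or may not match what you see in `yt`.

```r
# Hypothetical toy data; the flag is stored as text on purpose.
toy <- data.frame(
  views            = c(100, 300, 50, 150),
  ratings_disabled = c("False", "False", "True", "True")
)

# as.logical() understands "True"/"False" and converts the whole column at once.
toy$ratings_disabled <- as.logical(toy$ratings_disabled)

# tapply() splits views by the grouping column and applies mean() to each group.
avg_views <- tapply(toy$views, toy$ratings_disabled, mean)
avg_views

# The same result with logical indexing instead of tapply().
mean(toy$views[toy$ratings_disabled])
mean(toy$views[!toy$ratings_disabled])
```

Logical indexing only works as intended once the column really is logical, which is why the conversion step comes first.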

Relevant topics: mean, tapply, indexing

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 5

Create two new columns in `yt`:

• `balance`: the difference between `likes` and `dislikes` for a given video.

• `positive_balance`: an indicator variable that is `TRUE` if `balance` is greater than zero, and `FALSE` otherwise.

How many videos have a positive balance?
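The pattern can be illustrated on a hypothetical three-row data.frame named `toy`. The same two assignments should carry over to `yt`, assuming `likes` and `dislikes` are numeric there.

```r
# Hypothetical toy data.
toy <- data.frame(likes = c(10, 3, 7), dislikes = c(2, 5, 7))

# Vectorized subtraction: one balance per row, no loop needed.
toy$balance <- toy$likes - toy$dislikes

# A vectorized comparison produces a logical (TRUE/FALSE) column.
toy$positive_balance <- toy$balance > 0

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values.
sum(toy$positive_balance)
```

A balance of exactly zero is not greater than zero, so the third toy row is counted as `FALSE`.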

Relevant topics: sum

Items to submit
• Code used to solve this problem.

• Output from running the code.

### Question 6

Compare videos whose `positive_balance` is `TRUE` to those whose `positive_balance` is `FALSE`. Make this comparison based on the `comment_count` and the `views` of the videos.

To make a comparison, pick a statistic to summarize and compare `comment_count` and `views`. Examples of statistics include: `mean`, `median`, `max`, `min`, `var`, and `sd`.

You can pick more than one statistic to compare, if you want, and each column may have its own statistic(s) to summarize it.
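As one illustration, the sketch below summarizes a hypothetical toy data.frame with the median for `views` and the mean for `comment_count`. The choice of statistics here is arbitrary and is not a recommendation for your own answer.

```r
# Hypothetical toy data.
toy <- data.frame(
  views            = c(1000, 200, 5000, 300),
  comment_count    = c(50, 5, 400, 10),
  positive_balance = c(TRUE, FALSE, TRUE, FALSE)
)

# Group each column by positive_balance and apply a different statistic to each.
median_views  <- tapply(toy$views, toy$positive_balance, median)
mean_comments <- tapply(toy$comment_count, toy$positive_balance, mean)

median_views
mean_comments
```

The median is less sensitive than the mean to a handful of extremely popular videos, which is one consideration when choosing a statistic for a skewed column.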

Relevant topics: tapply, mean, sum, var, sd, max, min, median

Items to submit
• Code used to solve this problem.

• Output from running the code.

• 1-2 sentences explaining what statistic you chose to summarize each column, and why.

• 1-2 sentences comparing videos with positive balance and non-positive balance based on `comment_count` and `views`. Is the result surprising to you?

Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended that you download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted.